Barge-in is the real system test

insight/barge-in-is-the-real-system-test · topics: voice-agents, barge-in, conversation

Interruption turns demos into systems

The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_06_barge_in_is_the_real_system_test.md. A single-turn prompt/response demo does not prove that a voice agent is conversational. The user must be able to interrupt it, and the system must stop playout, cancel or truncate model/TTS streams, preserve transcript state, and listen to the new turn.

Barge-in is hard because the agent is listening while it speaks. Without echo cancellation or correct media timing, it can transcribe itself, reset its own VAD, or ignore the real user. Without semantic interruption policy, it can treat backchannels like "yeah" as stop commands or ignore actual corrections.

The test is media plus state

The note proposes testing interruption at different assistant-audio positions, with speaker playback, backchannels, true stop commands, and stale model/TTS streams. The key measurements are stop latency, stale audio leakage, transcript correctness, cancellation acknowledgement, and whether the next response addresses the user's interruption.

This insight is the practical bridge between endpointing and transport. VAD helps detect the user, WebRTC helps with echo/media timing, TTS needs cancellation, and the app needs a conversation-history contract for what the user actually heard.

Evidence Fragments

1. Realtime APIs expose interruption behavior as part of turn detection.
OpenAI Realtime turn detection includes fields such as interruption behavior alongside VAD configuration.
source trace: OpenAI Realtime API reference
2. Transport and AEC affect whether the agent hears itself.
The local transport research highlights WebRTC echo cancellation and media timing as a decisive advantage for browser voice.
source trace: Local transport deep dive, LiveKit transport docs

Sources

1. Voice agents barge-in insight
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_06_barge_in_is_the_real_system_test.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_06_barge_in_is_the_real_system_test.md
2. LiveKit transport docs
WebRTC production substrate.
url: https://docs.livekit.io/transport/
local_ref: presentations/voice-agents/research/articles/livekit-transport.html

Caveats

There are fewer open barge-in benchmarks than ASR benchmarks.
AEC does not solve semantic backchannel classification.
Provider cancellation still needs app-side state handling.

Open Threads

1. What is Jarvis interruption stop latency at speaker volume in a real room?
The talk demo likely runs through speakers, where echo and false VAD matter.

Graph Edges

depends-on · insight/voice-agent-endpointing-is-turn-taking · strength=3
Endpointing is turn-taking
Interruption is turn-taking during assistant speech.
depends-on · insight/transport-is-media-correctness · strength=3
Transport is media correctness
Echo cancellation and media timing make interruption detectable.
packs-into · presentation/voice-agents-deck · strength=2
Building Real-Time Voice Agents deck
This is the systems test that makes the demo credible.

Interruption turns demos into systems

The test is media plus state

Evidence Fragments

1. Realtime APIs expose interruption behavior as part of turn detection.

2. Transport and AEC affect whether the agent hears itself.

Sources

Caveats

Open Threads

1. What is Jarvis interruption stop latency at speaker volume in a real room?

Links

Graph Edges