Barge-in is the real system test
insight/barge-in-is-the-real-system-test · topics: voice-agents, barge-in, conversation
Interruption turns demos into systems
The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_06_barge_in_is_the_real_system_test.md. A single-turn prompt/response demo does not prove that a voice agent is conversational. The user must be able to interrupt it, and the system must stop playout, cancel or truncate model/TTS streams, preserve transcript state, and listen to the new turn.
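One way to picture that interruption sequence is a generation counter that invalidates in-flight streams. The sketch below is an illustration under assumptions: `VoiceSession`, its fields, and its method names are hypothetical and do not come from the research note or any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceSession:
    """Illustrative session state; every name here is an assumption."""
    generation: int = 0            # bumped on every interruption
    speaking: bool = False
    playout_queue: list = field(default_factory=list)
    transcript: list = field(default_factory=list)

    def start_response(self) -> int:
        self.speaking = True
        return self.generation     # the stream tags its chunks with this id

    def on_tts_chunk(self, generation: int, chunk: bytes) -> bool:
        # Drop chunks from a cancelled stream so stale audio never plays.
        if generation != self.generation or not self.speaking:
            return False
        self.playout_queue.append(chunk)
        return True

    def on_barge_in(self, heard_text: str) -> None:
        self.generation += 1       # invalidate in-flight model/TTS streams
        self.playout_queue.clear() # stop playout immediately
        self.speaking = False      # the agent is now listening to the new turn
        # Preserve transcript state: record only what was played, marked as cut off.
        self.transcript.append(("assistant", heard_text, "interrupted"))
```

The counter is a design choice for this sketch: provider-side cancellation can arrive late, so the app refuses stale chunks locally rather than trusting the stream to stop on time.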
Barge-in is hard because the agent is listening while it speaks. Without echo cancellation or correct media timing, it can transcribe itself, reset its own VAD, or ignore the real user. Without a semantic interruption policy, it can treat backchannels like "yeah" as stop commands or ignore actual corrections.
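A semantic interruption policy can start as a simple rule before it becomes a classifier. The sketch below assumes a hypothetical `BACKCHANNELS` word list and a hypothetical `classify_interruption` function; a production policy would likely need something richer than exact matching.

```python
import re

# Hypothetical backchannel list; an assumption for illustration only.
BACKCHANNELS = {"yeah", "yep", "uh huh", "mm hmm", "right", "okay", "ok", "sure"}


def classify_interruption(transcript: str) -> str:
    """Label user speech heard during assistant playout.

    Returns 'ignore' (no words, e.g. echo or noise), 'backchannel'
    (keep speaking), or 'interrupt' (stop and take the new turn).
    """
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = re.sub(r"[^a-z' ]+", " ", transcript.lower())
    text = " ".join(text.split())
    if not text:
        return "ignore"
    if text in BACKCHANNELS:
        return "backchannel"
    # Anything longer or different is treated as a real interruption.
    return "interrupt"
```

Under this rule "Yeah." lets the agent keep talking, while "No, I meant Tuesday" stops it. Exact matching errs toward interrupting, which seems the safer failure mode for corrections.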
The test is media plus state
The note proposes testing interruption at different assistant-audio positions, with speaker playback, backchannels, true stop commands, and stale model/TTS streams. The key measurements are stop latency, stale audio leakage, transcript correctness, cancellation acknowledgement, and whether the next response addresses the user's interruption.
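Several of these measurements can be derived from a timestamped event log of one interruption trial. The sketch below assumes a hypothetical `Event` record and event names, and it defines stop latency as the time from user speech onset to the last assistant audio frame played, which is one reasonable definition rather than the note's own.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t_ms: int   # timestamp in milliseconds
    kind: str   # "user_speech_start", "assistant_audio_frame", or "cancel_ack"


def barge_in_metrics(events):
    """Compute stop latency and stale-audio leakage for one interruption trial."""
    start = next(e.t_ms for e in events if e.kind == "user_speech_start")
    # Assistant frames that still reached the speaker after the user began talking.
    stale = [e.t_ms for e in events
             if e.kind == "assistant_audio_frame" and e.t_ms > start]
    # First cancellation acknowledgement at or after the interruption, if any.
    ack = next((e.t_ms for e in events
                if e.kind == "cancel_ack" and e.t_ms >= start), None)
    return {
        "stop_latency_ms": (max(stale) - start) if stale else 0,
        "stale_frames": len(stale),
        "cancel_ack_ms": None if ack is None else ack - start,
    }
```

Running the same function over trials at different assistant-audio positions and speaker volumes gives comparable numbers per condition.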
This insight is the practical bridge between endpointing and transport. VAD helps detect the user, WebRTC helps with echo/media timing, TTS needs cancellation, and the app needs a conversation-history contract for what the user actually heard.
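The conversation-history contract can be sketched as trimming the assistant's text to the words whose audio actually finished playing. The word-timing input here is a hypothetical alignment format, assumed for illustration; whether such timings are available depends on the TTS provider.

```python
def heard_text(words, played_ms):
    """Return the words whose audio finished before playout stopped.

    `words` is a list of (word, end_ms) pairs, a hypothetical TTS alignment.
    `played_ms` is how much assistant audio reached the speaker.
    """
    return " ".join(word for word, end_ms in words if end_ms <= played_ms)


words = [("Your", 200), ("flight", 550), ("leaves", 900), ("at", 1000), ("nine", 1400)]
print(heard_text(words, 950))  # the user was cut in after "leaves"
```

Storing this trimmed text instead of the full generated response keeps the model from assuming the user heard details that never played.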
Evidence Fragments
1. Realtime APIs expose interruption behavior as part of turn detection.
OpenAI Realtime turn detection includes fields such as interruption behavior alongside VAD configuration.
source trace: OpenAI Realtime API reference
2. Transport and AEC affect whether the agent hears itself.
The local transport research highlights WebRTC echo cancellation and media timing as a decisive advantage for browser voice.
source trace: Local transport deep dive, LiveKit transport docs
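For the first fragment, a turn-detection object with interruption behavior next to the VAD settings might look like the sketch below. The field names follow the OpenAI Realtime API reference as understood here and should be checked against the current reference; the values are illustrative, not recommendations.

```python
import json

# Field names as understood from the Realtime API reference; values are illustrative.
turn_detection = {
    "type": "server_vad",
    "threshold": 0.5,            # VAD activation threshold
    "prefix_padding_ms": 300,    # audio kept from before detected speech
    "silence_duration_ms": 500,  # silence that ends the user's turn
    "create_response": True,     # respond automatically when the turn ends
    "interrupt_response": True,  # interrupt an ongoing response when the user starts speaking
}

payload = json.dumps({"turn_detection": turn_detection}, indent=2)
print(payload)
```

Where this object nests inside the session configuration depends on the API version, so that wrapper is deliberately left out.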
Sources
1. Voice agents barge-in insight
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_06_barge_in_is_the_real_system_test.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_06_barge_in_is_the_real_system_test.md
2. LiveKit transport docs
WebRTC production substrate.
url: https://docs.livekit.io/transport/
local_ref: presentations/voice-agents/research/articles/livekit-transport.html
Caveats
- There are fewer open barge-in benchmarks than ASR benchmarks.
- AEC does not solve semantic backchannel classification.
- Provider cancellation still needs app-side state handling.
Open Threads
1. What is Jarvis's interruption stop latency at speaker volume in a real room?
The talk demo likely runs through speakers, where echo and false VAD matter.
Links
- Topics
- voice-agents, barge-in, conversation
- Used In
- voice-agents-article-draft
Graph Edges
depends-on · insight/voice-agent-endpointing-is-turn-taking · strength=3
Endpointing is turn-taking
Interruption is turn-taking during assistant speech.
depends-on · insight/transport-is-media-correctness · strength=3
Transport is media correctness
Echo cancellation and media timing make interruption detectable.
packs-into · presentation/voice-agents-deck · strength=2
Building Real-Time Voice Agents deck
This is the systems test that makes the demo credible.