Endpointing Is Turn-Taking

Voice activity detection is not turn-taking. VAD answers a narrow acoustic question: "does this audio frame look like speech?" A voice agent has to answer a social and product question: "is the user done enough that the assistant should talk now?"

Those are different decisions. If they are collapsed into one silence timer, the product usually fails in one of two ways. A short timer makes the assistant interrupt unfinished thoughts. A long timer makes the assistant feel slow. The better architecture separates speech detection, acoustic endpointing, semantic end-of-utterance, and interruption.

The Three-Layer Model

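The clean model is not "VAD fires, then agent responds." It is a layered turn-taking system: three layers (VAD, endpointing, semantic EOU) decide when the user is done, and a separate barge-in path handles interruption while the assistant is speaking.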

```mermaid
flowchart TD
    audio[Audio frames] --> vad[VAD: speech or not speech]
    vad --> silence[Endpointing: has speech stopped long enough?]
    silence --> eou[Semantic EOU: is the thought complete?]
    eou --> respond[Agent allowed to respond]
    vad --> barge[Barge-in during assistant speech]
    barge --> cancel[Cancel playback and generation]
    cancel --> listen[Return to listening]
```

This distinction matters because each layer has different failure modes.

| Layer | Input | Output | Failure mode |
| --- | --- | --- | --- |
| VAD | Raw audio frames | Speech probability or speech/non-speech | Noise looks like speech, soft speech is missed. |
| Endpointing | VAD state plus silence duration | "User stopped speaking" | Pauses inside a thought become false ends. |
| Semantic EOU | Recent audio/transcript/context | "User's turn is complete" | The model misreads hesitation, grammar, or intent. |
| Barge-in | User speech while assistant is speaking | Cancel assistant speech | Backchannels or echo stop the assistant incorrectly. |

The important product point: VAD quality is necessary, but it is not sufficient.
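
The layering can be sketched as a single decision function. This is a minimal sketch, not a reference implementation: the names `TurnInputs`, `decideTurn`, and `layerPolicy` are hypothetical, and every threshold is an illustrative placeholder rather than a value from any library.

```typescript
// Hypothetical snapshot of each layer's state at decision time.
type TurnInputs = {
  vadSpeechProbability: number; // layer 1: acoustic speech presence
  silenceDurationMs: number;    // layer 2: time since speech was last seen
  semanticEouScore: number;     // layer 3: 0..1 estimate that the thought is complete
  assistantSpeaking: boolean;   // barge-in only matters during assistant playback
};

type TurnAction = 'keep_listening' | 'respond' | 'cancel_assistant';

// Illustrative thresholds only (assumptions, not recommendations).
const layerPolicy = {
  speechThreshold: 0.5,
  minSilenceMs: 300,
  eouThreshold: 0.6,
  maxSilenceMs: 2000, // fallback so a low EOU score cannot stall the agent forever
};

function decideTurn(input: TurnInputs): TurnAction {
  const userSpeaking = input.vadSpeechProbability >= layerPolicy.speechThreshold;

  // Barge-in path: user speech during assistant playback cancels the assistant.
  if (userSpeaking && input.assistantSpeaking) return 'cancel_assistant';
  // Layer 1 (VAD): speech is still present, so the turn cannot be over.
  if (userSpeaking) return 'keep_listening';
  // The assistant is already talking and the user is silent: nothing to decide.
  if (input.assistantSpeaking) return 'keep_listening';
  // Layer 2 (endpointing): not enough silence yet.
  if (input.silenceDurationMs < layerPolicy.minSilenceMs) return 'keep_listening';
  // Timeout fallback: respond after a long silence regardless of the EOU score.
  if (input.silenceDurationMs >= layerPolicy.maxSilenceMs) return 'respond';
  // Layer 3 (semantic EOU): respond only if the thought looks complete.
  return input.semanticEouScore >= layerPolicy.eouThreshold ? 'respond' : 'keep_listening';
}
```

The point of the sketch is the ordering: a perfect VAD only gets the decision past the first check. Whether the agent speaks is still decided by the silence policy and the semantic score.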

VAD Measures Speech Presence

Silero's quality metrics are useful because they show why modern neural VAD is attractive. The archived Silero wiki reports ROC-AUC on 31.25 ms segments and separately reports accuracy with validation-selected thresholds. On the multi-domain validation set, WebRTC VAD is far behind Silero v5/v6.

| Model | Multi-domain ROC-AUC | Multi-domain accuracy | Source |
| --- | --- | --- | --- |
| WebRTC VAD | 0.73 | 0.74 | R-VA-005 |
| Silero v4 | 0.91 | 0.85 | R-VA-005 |
| Silero v5 | 0.96 | 0.91 | R-VA-005 |
| Silero v6 | 0.97 | 0.92 | R-VA-005 |

This supports a strong but limited conclusion. Silero is a better speech detector than WebRTC on this archived benchmark. It does not prove Silero knows when a user has finished their thought. A model can classify speech frames well and still make bad turn-taking decisions if it is wrapped in the wrong endpointing policy.

The local VAD deep dive also records the low-level knobs that often become accidental conversation policy:

| Parameter | Typical/default value | Meaning |
| --- | --- | --- |
| threshold | 0.5 | Speech probability threshold. |
| window_size_samples | 512 at 16 kHz | About 32 ms input chunks. |
| min_speech_duration_ms | 250 ms | Minimum segment before accepting speech. |
| min_silence_duration_ms | 100 ms | Minimum silence to split speech segments. |
| speech_pad_ms | 30 ms | Padding around detected speech. |

These are acoustic parameters. They should not be the only conversation parameters.
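
To see how these knobs shape segments, here is a simplified segmenter over per-frame speech probabilities. It is a sketch under stated assumptions, not Silero's actual algorithm: `segmentSpeech` and its types are hypothetical names, and padding (`speech_pad_ms`) is left out for brevity.

```typescript
type VadParams = {
  threshold: number;            // speech probability threshold
  frameMs: number;              // 512 samples at 16 kHz is 32 ms
  minSpeechDurationMs: number;  // shorter segments are dropped
  minSilenceDurationMs: number; // shorter pauses are bridged
};

type SpeechSegment = { startMs: number; endMs: number };

function segmentSpeech(probabilities: number[], params: VadParams): SpeechSegment[] {
  const segments: SpeechSegment[] = [];
  let current: SpeechSegment | null = null; // the open segment, if any
  let timeMs = 0;

  for (const p of probabilities) {
    const frameEndMs = timeMs + params.frameMs;
    if (p >= params.threshold) {
      // Speech frame: open a segment or extend the open one.
      if (current === null) current = { startMs: timeMs, endMs: frameEndMs };
      else current.endMs = frameEndMs;
    } else if (current !== null && frameEndMs - current.endMs >= params.minSilenceDurationMs) {
      // Enough trailing silence: close the segment, keeping it only if long enough.
      if (current.endMs - current.startMs >= params.minSpeechDurationMs) segments.push(current);
      current = null;
    }
    timeMs = frameEndMs;
  }
  // Flush a segment that is still open when the audio ends.
  if (current !== null && current.endMs - current.startMs >= params.minSpeechDurationMs) {
    segments.push(current);
  }
  return segments;
}
```

With 32 ms frames, a 64 ms pause between two 320 ms bursts is bridged when the minimum silence is 100 ms and splits the audio into two segments when it is 50 ms. That is a segmentation choice, and it says nothing about whether the speaker has finished.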

Silence Duration Is A Product Decision

OpenAI's Realtime API reference makes the policy surface explicit. server_vad exposes threshold, prefix_padding_ms, and silence_duration_ms. In the archived reference, threshold defaults to 0.5, prefix_padding_ms defaults to 300 ms, and silence_duration_ms defaults to 500 ms. The reference text also states the tradeoff: shorter values respond faster but may jump in on short pauses.
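
As a sketch, the three fields and their cited defaults fit in one typed object. The helper `serverVadConfig` and its range checks are hypothetical, and where this object sits in the session payload depends on the API version, so the envelope is deliberately not shown.

```typescript
// Field names and defaults as cited above from the archived reference.
type ServerVadConfig = {
  type: 'server_vad';
  threshold: number;
  prefix_padding_ms: number;
  silence_duration_ms: number;
};

const SERVER_VAD_DEFAULTS: ServerVadConfig = {
  type: 'server_vad',
  threshold: 0.5,
  prefix_padding_ms: 300,
  silence_duration_ms: 500,
};

// Hypothetical helper: apply overrides and reject obviously invalid values.
function serverVadConfig(
  overrides: Partial<Omit<ServerVadConfig, 'type'>> = {},
): ServerVadConfig {
  const config: ServerVadConfig = { ...SERVER_VAD_DEFAULTS, ...overrides };
  if (config.threshold < 0 || config.threshold > 1) {
    throw new RangeError('threshold must be in [0, 1]');
  }
  if (config.prefix_padding_ms < 0 || config.silence_duration_ms < 0) {
    throw new RangeError('durations must be non-negative');
  }
  return config;
}

// A faster but more interruption-prone policy: only the silence timer changes.
const quickCommandVad = serverVadConfig({ silence_duration_ms: 300 });
```

Keeping the override explicit makes the product decision visible in code review: someone chose 300 ms instead of inheriting 500 ms by accident.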

That one setting can dominate the latency budget. If the silence timer is 500 ms, the system waits half a second after the user stops speaking before STT finalization, LLM, and TTS even get their turn.

| System/layer | Reported/default timing | What it means |
| --- | --- | --- |
| Silero frame size | about 32 ms | Fast speech/non-speech updates. |
| OpenAI server_vad silence default | 500 ms | Silence required before speech-stop decision. |
| Local Jarvis VAD note | 700 ms | Conservative local endpointing policy. |
| Pipecat Smart Turn v3 | under 100 ms local CPU inference after pause | Semantic turn model can run inline after acoustic pause. |
| Deepgram Flux EOT claim | about 260 ms | Vendor-claimed conversational end-of-turn detection. |

This table is the reason endpointing needs separate treatment. A "fast model" behind a slow turn policy is still a slow product.
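
A rough budget calculation makes the point concrete. Only the 500 ms silence default comes from the text above; the other stage timings are hypothetical placeholders, and `summarizeBudget` is an illustrative helper.

```typescript
type BudgetStage = { stage: string; ms: number };

// Sum the stages and report each stage's share of the total response delay.
function summarizeBudget(stages: BudgetStage[]): { totalMs: number; shares: Record<string, number> } {
  const totalMs = stages.reduce((sum, s) => sum + s.ms, 0);
  const shares: Record<string, number> = {};
  for (const s of stages) {
    shares[s.stage] = totalMs === 0 ? 0 : s.ms / totalMs;
  }
  return { totalMs, shares };
}

const responseBudget = summarizeBudget([
  { stage: 'silence_timer', ms: 500 },    // server_vad default cited above
  { stage: 'stt_finalization', ms: 100 }, // hypothetical
  { stage: 'llm_first_token', ms: 250 },  // hypothetical
  { stage: 'tts_first_audio', ms: 150 },  // hypothetical
]);

// With these placeholder numbers the total is 1000 ms and the silence timer
// alone accounts for half of it.
console.log(responseBudget.totalMs, responseBudget.shares['silence_timer']);
```

Under these assumptions, halving LLM latency saves 125 ms, while cutting the silence timer from 500 ms to 300 ms saves 200 ms before any model improves.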

Semantic EOU Is A Different Class Of Decision

Semantic end-of-utterance asks whether the user's thought is complete, not whether the microphone is quiet. That is why modern voice infrastructure increasingly exposes turn detection as a separate feature.

```mermaid
sequenceDiagram
    participant User
    participant VAD
    participant Timer
    participant EOU
    participant Agent
    User->>VAD: "Can you book a flight to..."
    VAD->>Timer: speech stops briefly
    Timer->>EOU: pause detected
    EOU-->>Agent: incomplete turn, keep listening
    User->>VAD: "...Berlin next Tuesday?"
    VAD->>Timer: speech stops
    Timer->>EOU: pause detected
    EOU-->>Agent: complete turn, respond
```

Pipecat Smart Turn is a clear example. The local data file records v3 as running after VAD detects a pause, using the most recent 8 s of the user turn, with under 100 ms local CPU inference and around 65 ms on Pipecat Cloud 1x. LiveKit's docs similarly separate VAD-only, STT endpointing, realtime model detection, and turn-detector models. OpenAI exposes semantic_vad with eagerness settings and timeout behavior.
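
The gate that runs after an acoustic pause can be sketched as follows. The scorer here is a toy heuristic standing in for a trained turn model such as the ones named above; it is not how any of those products work, and `toyEouScore`, `onPauseDetected`, and `EouGateConfig` are hypothetical names.

```typescript
// Toy stand-in for a semantic EOU model. A real system would call a trained
// turn detector here; this heuristic only flags obviously dangling endings.
const DANGLING_ENDINGS = new Set(['to', 'and', 'or', 'but', 'the', 'a', 'for', 'with', 'um', 'uh']);

function toyEouScore(transcript: string): number {
  const words = transcript.toLowerCase().replace(/[^a-z\s']/g, ' ').trim().split(/\s+/);
  const lastWord = words[words.length - 1] ?? '';
  if (lastWord === '') return 0; // nothing transcribed yet
  return DANGLING_ENDINGS.has(lastWord) ? 0.1 : 0.9;
}

type EouGateConfig = {
  eouThreshold: number; // minimum score to let the agent respond
  timeoutMs: number;    // forced completion after this much silence
};

// Called when the acoustic layer reports a pause, never on every frame.
function onPauseDetected(
  transcript: string,
  silenceMs: number,
  config: EouGateConfig,
  score: (text: string) => number = toyEouScore,
): 'keep_listening' | 'respond' {
  // Safety net: a long silence ends the turn even if the score stays low.
  if (silenceMs >= config.timeoutMs) return 'respond';
  return score(transcript) >= config.eouThreshold ? 'respond' : 'keep_listening';
}
```

Fed the two pauses from the diagram, the gate keeps listening after "Can you book a flight to..." and responds after "...Berlin next Tuesday?". The timeout is what keeps a stubbornly low score from stalling the conversation.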

The stronger inference is that turn-taking is becoming a first-class subsystem. STT providers and agent frameworks are no longer only exposing transcripts. They are exposing permission-to-speak events.

Deepgram Flux Shows EOT As Product Surface

Deepgram Flux is useful because it packages conversational STT around end-of-turn events. The local turn-detection table captures these provider claims and defaults:

| Deepgram Flux field/claim | Value | Caveat |
| --- | --- | --- |
| Recommended audio chunks | 80 ms | Provider-specific streaming guidance. |
| eot_threshold default | 0.7 | Higher thresholds reduce false positives but add latency. |
| eot_timeout_ms default | 5,000 ms | Forced completion safety net. |
| EOT detection | about 260 ms | Vendor claim. |
| Final EndOfTurn p95 | within 1.5 s | Vendor claim. |
| EagerEndOfTurn | 150-250 ms earlier | Trades speed for more speculative downstream work. |
| Agent latency reduction | 200-600 ms | Vendor claim versus traditional STT+VAD. |

These claims should not be compared directly to Pipecat, OpenAI, or LiveKit without a shared test set. But they show the market direction: end-of-turn is part of the speech product, not an afterthought.
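
The "speculative downstream work" tradeoff can be sketched as a small state machine. `EagerEndOfTurn` and `EndOfTurn` are the event names used in the table above; the `TurnResumed` event and the event shapes are assumptions for illustration and should be checked against the provider's documentation.

```typescript
// Assumed event shapes, not a provider SDK type.
type FluxLikeEvent =
  | { kind: 'EagerEndOfTurn'; transcript: string }
  | { kind: 'TurnResumed' }
  | { kind: 'EndOfTurn'; transcript: string };

type SpeculationLog = { started: string[]; cancelled: string[]; committed: string[] };

// Replays a stream of events and records which reply generations were
// started, thrown away, or actually used.
function runSpeculation(events: FluxLikeEvent[]): SpeculationLog {
  const log: SpeculationLog = { started: [], cancelled: [], committed: [] };
  let draft: string | null = null; // transcript the speculative reply is based on

  for (const event of events) {
    switch (event.kind) {
      case 'EagerEndOfTurn':
        // Start generating early, accepting that the work may be wasted.
        draft = event.transcript;
        log.started.push(draft);
        break;
      case 'TurnResumed':
        // The user kept talking: discard the speculative reply.
        if (draft !== null) log.cancelled.push(draft);
        draft = null;
        break;
      case 'EndOfTurn':
        if (draft !== event.transcript) {
          // No draft, or a stale one: generate from the final transcript.
          if (draft !== null) log.cancelled.push(draft);
          log.started.push(event.transcript);
        }
        log.committed.push(event.transcript);
        draft = null;
        break;
    }
  }
  return log;
}
```

Every entry in `cancelled` is LLM work paid for and discarded. The eager path is only worth it when the latency saved on the turns that commit outweighs that cost.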

The Human Baseline Explains Why Silence-Only Feels Wrong

Stivers et al. report cross-language turn transitions with a full-dataset mean response offset around 208 ms. The important part is not the exact number. Humans can answer quickly because they anticipate turn completion from syntax, prosody, gaze, action type, and shared context.

Silence-only endpointing does the opposite. It waits for absence. That is why a pure silence timer tends to feel either sluggish or interruptive. Semantic EOU is an attempt to add a small amount of prediction back into the agent loop.

What To Measure

Endpointing needs its own metrics. Word error rate (WER) does not catch these failures.

| Metric | Question it answers |
| --- | --- |
| Start-of-speech latency | How quickly does the system know the user began speaking? |
| End-of-speech latency | How quickly does acoustic speech stop get detected? |
| End-of-turn latency | How quickly does the system decide the user is done? |
| False EOT rate | How often does the agent interrupt unfinished turns? |
| Missed EOT rate | How often does the agent wait after complete turns? |
| Backchannel false interruption rate | How often do "yeah", "mm-hm", or noise stop the assistant? |
| Barge-in success rate | Can the user interrupt assistant speech and be heard? |
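
Three of these metrics can be computed from labeled turns, as in the sketch below. The definitions are assumptions chosen for illustration: a false EOT is a decision before the annotated end of the turn, and a missed EOT is a decision later than a tolerance or no decision at all. `LabeledTurn` and `eotMetrics` are hypothetical names.

```typescript
// One turn from an evaluation set.
type LabeledTurn = {
  trueEndMs: number;           // human annotation: when the user actually finished
  decidedEndMs: number | null; // when the system declared end-of-turn, if it did
};

type EotMetrics = {
  falseEotRate: number;         // share of turns cut off early
  missedEotRate: number;        // share of turns answered too late or never
  meanLatencyMs: number | null; // mean delay over decisions that were not early
};

function eotMetrics(turns: LabeledTurn[], toleranceMs: number): EotMetrics {
  let falseEots = 0;
  let missedEots = 0;
  let latencySumMs = 0;
  let latencyCount = 0;

  for (const turn of turns) {
    if (turn.decidedEndMs === null) {
      missedEots += 1; // the system never ended the turn
      continue;
    }
    const latencyMs = turn.decidedEndMs - turn.trueEndMs;
    if (latencyMs < 0) {
      falseEots += 1; // fired while the user was still mid-thought
      continue;
    }
    if (latencyMs > toleranceMs) missedEots += 1; // fired, but too late
    latencySumMs += latencyMs;
    latencyCount += 1;
  }

  const total = turns.length;
  return {
    falseEotRate: total === 0 ? 0 : falseEots / total,
    missedEotRate: total === 0 ? 0 : missedEots / total,
    meanLatencyMs: latencyCount === 0 ? null : latencySumMs / latencyCount,
  };
}
```

Reporting the two rates together matters: shortening the silence timer lowers the missed rate and raises the false rate, so either number alone can be gamed.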

The implementation should also preserve evidence. A turn decision should be explainable: which VAD state, silence duration, transcript fragment, semantic EOU score, and timeout produced the decision?

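```typescript
// Evidence record for a single turn-taking decision.
type TurnDecision = {
  requestId: string;
  decision: 'keep_listening' | 'respond' | 'cancel_assistant';
  vadSpeechProbability: number; // VAD output at decision time
  silenceDurationMs: number;    // silence accumulated before the decision
  semanticEouScore?: number;    // present only when a semantic EOU model ran
  transcriptFragment?: string;  // partial transcript the decision was based on
  decidedAtMs: number;          // when the decision was made, in ms
};
```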

This makes turn-taking debuggable instead of mystical.
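
One way to use such a record is to render it as a single log line per decision. This is a sketch: `explainDecision` and the line format are hypothetical, and the type is repeated only so the example stands alone.

```typescript
// Same shape as the record above, repeated so this sketch is self-contained.
type TurnDecision = {
  requestId: string;
  decision: 'keep_listening' | 'respond' | 'cancel_assistant';
  vadSpeechProbability: number;
  silenceDurationMs: number;
  semanticEouScore?: number;
  transcriptFragment?: string;
  decidedAtMs: number;
};

// Render one decision as a single greppable line; optional fields are
// included only when the corresponding layer actually produced a value.
function explainDecision(d: TurnDecision): string {
  const parts = [
    `request=${d.requestId}`,
    `decision=${d.decision}`,
    `vad=${d.vadSpeechProbability.toFixed(2)}`,
    `silence=${d.silenceDurationMs}ms`,
  ];
  if (d.semanticEouScore !== undefined) parts.push(`eou=${d.semanticEouScore.toFixed(2)}`);
  if (d.transcriptFragment !== undefined) parts.push(`text=${JSON.stringify(d.transcriptFragment)}`);
  parts.push(`at=${d.decidedAtMs}ms`);
  return parts.join(' ');
}

const exampleLine = explainDecision({
  requestId: 'req-1',
  decision: 'keep_listening',
  vadSpeechProbability: 0.08,
  silenceDurationMs: 420,
  semanticEouScore: 0.12,
  transcriptFragment: 'Can you book a flight to',
  decidedAtMs: 18250,
});
// request=req-1 decision=keep_listening vad=0.08 silence=420ms eou=0.12 text="Can you book a flight to" at=18250ms
```

A line like this answers "why did the agent wait?" without replaying audio: the VAD saw silence, the timer had passed 420 ms, and the semantic score was too low to respond.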

Product Profiles

Different agents should use different turn-taking policies.

| Mode | Turn-taking preference | Likely configuration |
| --- | --- | --- |
| Quick command | Fast response, tolerates occasional false ends | Shorter silence, eager semantic EOU, strong cancellation. |
| Support agent | Avoid interrupting the user | More patient endpointing, semantic EOU, backchannel handling. |
| Noisy telephony | Avoid noise-triggered speech | Stricter VAD, noise suppression, provider EOT validation. |
| Stage demo | Reliability beats naturalness | Push-to-talk or explicit turn control may be better. |

This is the core conclusion: endpointing is not an implementation detail. It is how the agent expresses politeness, patience, and timing.
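
The profiles can be expressed as named presets so the policy is chosen deliberately per product. Every number below is a hypothetical placeholder meant to show relative ordering only, and `TurnProfile`, `AgentMode`, and `TURN_PROFILES` are illustrative names.

```typescript
type TurnProfile = {
  vadThreshold: number;        // higher means stricter speech detection
  silenceDurationMs: number;   // acoustic endpointing patience
  eouThreshold: number | null; // null means semantic EOU is not used
  pushToTalk: boolean;         // explicit turn control instead of detection
  ignoreBackchannels: boolean; // do not let "yeah" or "mm-hm" cancel playback
};

type AgentMode = 'quick_command' | 'support_agent' | 'noisy_telephony' | 'stage_demo';

// Hypothetical values: tune against measured false and missed EOT rates.
const TURN_PROFILES: Record<AgentMode, TurnProfile> = {
  quick_command: {
    vadThreshold: 0.5, silenceDurationMs: 300, eouThreshold: 0.5,
    pushToTalk: false, ignoreBackchannels: false,
  },
  support_agent: {
    vadThreshold: 0.5, silenceDurationMs: 700, eouThreshold: 0.7,
    pushToTalk: false, ignoreBackchannels: true,
  },
  noisy_telephony: {
    vadThreshold: 0.7, silenceDurationMs: 600, eouThreshold: 0.7,
    pushToTalk: false, ignoreBackchannels: true,
  },
  stage_demo: {
    vadThreshold: 0.5, silenceDurationMs: 0, eouThreshold: null,
    pushToTalk: true, ignoreBackchannels: true,
  },
};

function profileFor(mode: AgentMode): TurnProfile {
  return TURN_PROFILES[mode];
}
```

Using a `Record` keyed by `AgentMode` means a new mode does not compile until someone has written down its turn-taking policy.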

Non-Claims

  • Silero's VAD ROC-AUC does not prove good turn-taking.
  • Semantic EOU does not remove the need for VAD.
  • Lower silence duration is not automatically better.
  • Vendor EOT latency claims are not apples-to-apples.
  • Human turn-taking timing is not a universal product SLA.

References