Streaming STT Is Not Batch STT

Batch speech recognition asks a file-processing question: "what transcript can I produce for this audio?" Streaming STT in a voice agent asks a systems question: "what stable enough interpretation can I produce soon enough for the agent to decide whether to speak, call a tool, or keep listening?"

WER is still necessary: a voice agent that mishears users cannot respond or act correctly. But WER alone is not a voice-agent metric. A low-WER model can still be the wrong choice if it produces stable text too late, rewrites partials aggressively, or cannot signal end-of-turn reliably.

The Two Evaluation Questions

The difference between batch STT and agent STT is easiest to see as two loops.

mermaid
flowchart TB
  subgraph batch[Batch transcription]
    file[Complete audio file] --> model[ASR model]
    model --> transcript[Final transcript]
    transcript --> wer[WER / CER evaluation]
  end
  subgraph streaming[Voice-agent streaming STT]
    frames[Audio frames] --> partial[Interim partials]
    partial --> stability[Partial stability]
    stability --> endpoint[Endpoint / EOT]
    endpoint --> final[Stable final text]
    final --> agent[LLM or tool action]
  end

Batch evaluation can wait for the whole file. A voice agent cannot. It needs text early enough to drive turn-taking and response generation, while still avoiding premature or incorrect actions.

WER Is Necessary But Not Sufficient

WER measures edit distance between a transcript and a reference:

text
WER = (substitutions + deletions + insertions) / reference_words
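
A minimal sketch of that formula, assuming word-level Levenshtein alignment and a deliberately crude ASCII normalizer (lowercase, punctuation stripped). The function names are illustrative, and real benchmarks typically use a more careful text normalizer than this one.

```typescript
// Assumed normalization: lowercase, replace punctuation with spaces, split on whitespace.
function normalizeWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s']/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Word-level Levenshtein distance: the minimum number of substitutions,
// deletions, and insertions needed to turn `ref` into `hyp`.
function wordEditDistance(ref: string[], hyp: string[]): number {
  // prev[j] = distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length];
}

function wer(reference: string, hypothesis: string): number {
  const ref = normalizeWords(reference);
  const hyp = normalizeWords(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }
  return wordEditDistance(ref, hyp) / ref.length;
}

// One substitution (sat -> sit) plus one deletion (the) over 6 reference words.
const exampleWer = wer("The cat sat on the mat.", "the cat sit on mat"); // 2 / 6
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.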

That is a useful quality metric. But the agent needs several additional measurements.

| Metric | What it catches | Why WER misses it |
| --- | --- | --- |
| Time to first partial | When the system can start forming context. | WER is measured after transcription completes. |
| Partial churn | Whether early text is stable enough to use. | Final text can be correct even if partials were chaotic. |
| Finalization latency | When the agent can safely act. | Batch benchmarks do not model turn boundaries. |
| False EOT rate | Whether the agent interrupts unfinished users. | The final transcript may still be accurate. |
| Entity WER | Names, product terms, numbers, codes. | Average WER can hide domain-critical mistakes. |
| P95/P99 latency | Tail behavior under load. | Mean accuracy does not reveal slow calls. |
| Noise/telephony robustness | Real deployment conditions. | Clean audiobook datasets are not phone calls. |

The practical rule: use WER as the floor, not the decision.

Open ASR Leaderboard Is Useful But Not Sufficient

The Open ASR Leaderboard is valuable because it standardizes average WER and inverse real-time factor, RTFx, across many systems. The paper defines RTFx as total audio duration divided by transcription time, so higher is faster. The reported short-form English results were measured on NVIDIA A100-SXM4-80GB hardware.

Copied subset from the paper's Table 3:

| Model | Open | Avg. WER | RTFx | Encoder | Decoder | Languages |
| --- | --- | --- | --- | --- | --- | --- |
| Cohere Labs Transcribe | Yes | 5.42 | 525 | FastConformer | Transformer | 14 |
| NVIDIA Canary Qwen 2.5B | Yes | 5.63 | 418 | FastConformer | LLM | 1 |
| Qwen3 ASR 1.7B | Yes | 5.76 | 148 | Custom | LLM | 52 |
| NVIDIA Parakeet TDT 0.6B v2 | Yes | 6.05 | 3390 | FastConformer | TDT | 1 |
| NVIDIA Parakeet TDT 0.6B v3 | Yes | 6.32 | 3330 | FastConformer | TDT | 25 |
| Google Chirp v2 | No | 6.42 | not reported | not listed | not listed | 468 |
| Mistral Voxtral Small 24B | Yes | 6.62 | 54.1 | Whisper-FT | LLM | 8 |
| OpenAI Whisper Large v3 | Yes | 7.44 | 146 | Whisper | Whisper | 99 |
| OpenAI Whisper Large v3 Turbo | Yes | 7.83 | 200 | Whisper | Whisper | 99 |
| NVIDIA FastConformer CTC Large | Yes | 8.96 | 6400 | FastConformer | CTC | 1 |

The table is excellent for understanding accuracy/throughput tradeoffs. It does not tell you when a voice agent can speak. RTFx can be computed over an utterance or a batch of audio; it is a throughput metric, not an end-of-turn latency metric.
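
To make the distinction concrete, here is a small sketch that computes both quantities side by side. The `TurnTiming` type and all numbers are made up for illustration; they are not taken from the leaderboard or from any measured system.

```typescript
// RTFx as defined above: total audio duration divided by transcription time.
function rtfx(totalAudioSeconds: number, transcriptionSeconds: number): number {
  if (transcriptionSeconds <= 0) {
    throw new Error("transcriptionSeconds must be positive");
  }
  return totalAudioSeconds / transcriptionSeconds;
}

// Hypothetical per-turn timing record for a streaming session.
type TurnTiming = {
  userStoppedAtMs: number; // when the user actually stopped speaking
  finalTextAtMs: number;   // when stable final text reached the agent
};

function endOfTurnLatencyMs(timing: TurnTiming): number {
  return timing.finalTextAtMs - timing.userStoppedAtMs;
}

// Made-up offline run: one hour of audio transcribed in 10 seconds.
const offlineRtfx = rtfx(3600, 10); // 360

// Made-up live turn: final text arrives 850 ms after the user stops,
// for example because endpointing waits on trailing silence.
const liveTurn: TurnTiming = { userStoppedAtMs: 12000, finalTextAtMs: 12850 };
const liveEotLatencyMs = endOfTurnLatencyMs(liveTurn); // 850
```

The two numbers are computed from different inputs. A high RTFx says the engine clears audio quickly in aggregate; it does not constrain how long a single turn waits for finalization.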

The paper itself makes the architecture tradeoff visible: transformer/LLM decoders have the best average WER in the short-form table, while TDT/CTC systems offer much higher RTFx. For a voice agent, neither side automatically wins. The question is which system produces sufficiently correct, sufficiently stable text at the right point in the turn.

Moonshine v2 Measures The Live Question More Directly

Moonshine v2 is useful because it measures response latency in a live-transcription-like setup. The paper defines response latency as the time between VAD detecting the end of a speech segment and the returned transcript. The empirical latency table was measured on an Apple MacBook M3.

| Model | Params | Moonshine Table 3 Libri clean WER | Moonshine Table 3 average WER | Response latency | Compute load |
| --- | --- | --- | --- | --- | --- |
| Moonshine v2 Tiny | 34M | 4.49% | 12.01% | 50 ms | 8.03% |
| Moonshine v2 Small | 123M | 2.49% | 7.84% | 148 ms | 17.97% |
| Moonshine v2 Medium | 245M | 2.08% | 6.65% | 258 ms | 28.95% |
| Whisper Tiny | 39M | not reported in that table | not reported in that table | 289 ms | 8.46% |
| Whisper Base | 74M | not reported in that table | not reported in that table | 553 ms | 16.19% |
| Whisper Small | 244M | not reported in that table | not reported in that table | 1,940 ms | 56.84% |
| Whisper Large v3 | 1,550M | not reported in that table | not reported in that table | 11,286 ms | 330.65% |

This is not a cloud provider benchmark. It is not a universal ranking. But it does show why "designed for live use" matters. The authors explicitly contrast online processing applications such as live captioning with offline processing scenarios where overall throughput is the main concern.

For a local voice-agent demo, this is a much better shape of evidence than WER alone.

For Whisper accuracy context, I keep a separate table from the local STT deep dive and the Whisper paper appendix rather than merging it into the Moonshine table. The local note records Whisper Tiny at 7.54%, Base at 5.01%, Small at 3.43%, Medium at 2.90%, Large v3 around 2.7%, and Turbo around 3.0% on LibriSpeech clean. Those numbers answer "how accurate is the recognizer on this clean benchmark?" The Moonshine response-latency table answers "how long after detected speech end did this benchmark return a transcript?" They are both useful, but they are not the same measurement.

Whisper Is The Baseline, Not The Whole Category

Whisper remains the default mental baseline because it is robust, well-known, open, and widely wrapped by tooling such as faster-whisper. The Whisper paper reports training on 680,000 hours of multilingual and multitask supervision. That matters: Whisper changed the baseline expectations for open speech recognition.

But using Whisper in a live agent normally means adding VAD, chunking, buffering, endpointing, and finalization rules around a model that was not originally framed as a low-latency turn-taking component. The failure is not "Whisper is bad." The failure is treating a strong batch transcription baseline as if it automatically solves streaming interaction.
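
As one illustration of the glue this implies, here is a deliberately naive silence-based endpointer. It is a sketch under stated assumptions: the RMS energy threshold stands in for a real trained VAD, and `EndpointerConfig`, `SilenceEndpointer`, and the threshold values are hypothetical names and numbers, not part of Whisper or faster-whisper.

```typescript
// Hypothetical configuration; thresholds are illustrative, not tuned values.
type EndpointerConfig = {
  frameMs: number;         // duration of each audio frame
  energyThreshold: number; // RMS at or above this counts as speech
  silenceMs: number;       // trailing silence required to close the turn
};

// RMS energy of one frame of PCM samples in the range [-1, 1].
function frameRms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

class SilenceEndpointer {
  private config: EndpointerConfig;
  private heardSpeech = false;
  private silentMs = 0;

  constructor(config: EndpointerConfig) {
    this.config = config;
  }

  // Feed one frame; returns true when the buffered turn should be finalized.
  push(frame: Float32Array): boolean {
    if (frameRms(frame) >= this.config.energyThreshold) {
      this.heardSpeech = true;
      this.silentMs = 0;
      return false;
    }
    if (!this.heardSpeech) return false; // ignore leading silence
    this.silentMs += this.config.frameMs;
    if (this.silentMs >= this.config.silenceMs) {
      this.heardSpeech = false; // reset for the next turn
      this.silentMs = 0;
      return true;
    }
    return false;
  }
}
```

The `silenceMs` setting is where the tradeoff lives: a short window closes turns quickly but risks cutting off a user who pauses mid-sentence, while a long window adds dead air to every turn.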

The Streaming STT Timeline

The agent cares about a timeline, not only the final transcript.

mermaid
sequenceDiagram
  participant Audio
  participant STT
  participant Turn
  participant Agent
  Audio->>STT: frame 1
  STT-->>Agent: first partial
  Audio->>STT: more frames
  STT-->>Agent: revised partial
  Turn->>STT: endpoint / EOT candidate
  STT-->>Agent: stable final text
  Agent->>Agent: decide respond / tool / keep listening

A partial transcript can be useful before it is final. The LLM can prefetch context, prepare a likely answer, or start speculative work. But this is only safe if partial churn is measured. If the STT system repeatedly rewrites the user's intent, speculation becomes a source of wrong actions.
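
One simple way to put a number on churn is sketched below. It is a prefix-based proxy rather than a full edit-distance alignment: a word counts as retracted if it was emitted in one partial and is no longer part of the shared prefix in the next. The function names and the normalization by final word count are assumptions of this sketch.

```typescript
function splitWords(text: string): string[] {
  return text.trim().split(/\s+/).filter((w) => w.length > 0);
}

// Number of leading words two transcripts share.
function commonPrefixLength(a: string[], b: string[]): number {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

// Words emitted in `previous` that were changed or dropped in `next`.
// Pure appends retract nothing.
function retractedWords(previous: string, next: string): number {
  const prev = splitWords(previous);
  return prev.length - commonPrefixLength(prev, splitWords(next));
}

// Total retracted words across the sequence, divided by the final word count.
// `transcripts` is every partial in emission order, ending with the final text.
function partialChurn(transcripts: string[]): number {
  if (transcripts.length === 0) return 0;
  const finalWordCount = splitWords(transcripts[transcripts.length - 1]).length;
  if (finalWordCount === 0) return 0;
  let retracted = 0;
  for (let i = 1; i < transcripts.length; i++) {
    retracted += retractedWords(transcripts[i - 1], transcripts[i]);
  }
  return retracted / finalWordCount;
}

// Append-only partials: nothing is ever retracted.
const stableChurn = partialChurn(["cancel", "cancel my", "cancel my order"]); // 0

// Same final text, but the early words keep getting rewritten: 1 + 1 + 3 retractions.
const noisyChurn = partialChurn(["can", "cancel", "can sell my", "cancel my order"]); // 5 / 3
```

Both sequences end in the same final transcript, so they have identical WER. Only the churn figure shows that speculating on the second stream's partials would have been unsafe.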

What A Voice-Agent STT Benchmark Should Include

For a real voice agent, the evaluation table should look more like this:

| Metric | Why it matters | How to measure |
| --- | --- | --- |
| WER/CER | Base transcription quality. | Reference transcript comparison after normalization. |
| Entity WER | Names, product terms, numbers, codes. | Domain entity extraction plus transcript alignment. |
| Time to first partial | Enables early context and perceived responsiveness. | Audio start -> first interim transcript. |
| Partial churn | Determines whether early text is safe to use. | Edit distance between successive partials and final. |
| End-of-turn latency | Determines when the LLM can safely act. | Human stop time -> EOT/stable final. |
| False EOT rate | Detects interruption risk. | Incomplete turns marked complete. |
| Missed EOT rate | Detects dead air. | Complete turns not closed within target. |
| P95/P99 latency | Captures tail frustration. | Percentiles over calls, not only mean. |
| Concurrency RTF/RTFx | Predicts serving cost and capacity. | Benchmark at expected stream count. |
| Telephony/noise robustness | Matches deployment reality. | Test 8 kHz, packet loss, echo, background speech. |
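
Several of the timing rows can be computed from one per-turn log. The sketch below assumes a hypothetical `TurnRecord` with human-annotated stop times; the field names, the nearest-rank percentile method, and the way false and missed EOT are classified are all choices made for this example.

```typescript
// Hypothetical per-turn log record; field names are assumptions for this sketch.
type TurnRecord = {
  audioStartMs: number;   // first audio frame sent to STT
  firstPartialMs: number; // first interim transcript received
  humanStopMs: number;    // annotated time the speaker actually stopped
  eotMs: number | null;   // time STT declared end of turn, or null if it never did
};

// Nearest-rank percentile: smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarizeTurns(turns: TurnRecord[], eotTargetMs: number) {
  const firstPartialLatencies: number[] = [];
  const eotLatencies: number[] = [];
  let falseEot = 0;
  let missedEot = 0;
  for (const t of turns) {
    firstPartialLatencies.push(t.firstPartialMs - t.audioStartMs);
    if (t.eotMs === null) {
      missedEot++; // turn never closed
    } else if (t.eotMs < t.humanStopMs) {
      falseEot++; // closed while the user was still speaking
    } else {
      const latency = t.eotMs - t.humanStopMs;
      eotLatencies.push(latency);
      if (latency > eotTargetMs) missedEot++; // closed, but later than the target
    }
  }
  return {
    firstPartialP95Ms: percentile(firstPartialLatencies, 95),
    eotP95Ms: percentile(eotLatencies, 95),
    falseEotRate: falseEot / turns.length,
    missedEotRate: missedEot / turns.length,
  };
}
```

Note that `summarizeTurns` throws if no turn produced a valid end-of-turn latency. That is intentional here: a benchmark run with zero correctly closed turns should fail loudly instead of reporting a percentile.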

This benchmark should be domain-specific. A general model can score well on audiobook speech and fail on customer names, medication names, city names, stock tickers, or product SKUs.
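
A rough way to surface those failures is to score entities separately from the overall transcript. The sketch below is a simplified proxy, not alignment-based entity WER: it assumes each test utterance comes with a hand-annotated list of reference entities and counts how many fail to appear verbatim in the normalized hypothesis. `EntitySample` and `entityErrorRate` are hypothetical names.

```typescript
// Assumed normalization: lowercase, ASCII letters and digits only, single spaces.
function normalizeText(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical test item: the STT output plus annotated entities from the reference.
type EntitySample = {
  hypothesis: string;
  referenceEntities: string[];
};

// Fraction of reference entities that do not appear as whole words in the hypothesis.
function entityErrorRate(samples: EntitySample[]): number {
  let total = 0;
  let missed = 0;
  for (const sample of samples) {
    // Pad with spaces so matches respect word boundaries.
    const hyp = " " + normalizeText(sample.hypothesis) + " ";
    for (const entity of sample.referenceEntities) {
      total++;
      if (hyp.indexOf(" " + normalizeText(entity) + " ") === -1) missed++;
    }
  }
  if (total === 0) throw new Error("no reference entities to score");
  return missed / total;
}

const entitySamples: EntitySample[] = [
  {
    hypothesis: "please refill my metformin prescription",
    referenceEntities: ["metformin"],
  },
  {
    // The city name was misrecognized; the weekday survived.
    hypothesis: "ship it to wooster by friday",
    referenceEntities: ["Worcester", "Friday"],
  },
];
const entityErrors = entityErrorRate(entitySamples); // 1 / 3
```

In the second sample a single wrong word out of six barely moves average WER, yet it is the one word the shipping workflow depends on.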

Engineering Implications

Pick STT by conversation behavior, not by clean benchmark rank.

For local demos, Moonshine v2 is worth highlighting because it directly targets latency-critical local ASR and reports response latency in a live scenario. For broad ecosystem compatibility, Whisper/faster-whisper remains useful. For production systems, leaderboard leaders such as Canary, Parakeet, Qwen3 ASR, or provider systems should be evaluated on the product's own domain audio and turn boundaries.

The STT component should expose more than text:

typescript
type StreamingTranscriptEvent = {
  requestId: string;
  text: string;
  isFinal: boolean;
  startedAtMs: number;
  emittedAtMs: number;
  confidence?: number;
  partialRevision?: number;
  endpointReason?: 'silence' | 'semantic_eou' | 'timeout';
};

That event shape lets the rest of the agent distinguish "useful context" from "safe to act."
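
As a sketch of how a consumer might use it, the function below maps each event to one of two signals. The type is repeated so the example stands alone. The policy itself (treat non-final or low-confidence text as context only) and the `0.6` default threshold are assumptions for illustration, not a recommendation derived from any benchmark.

```typescript
type StreamingTranscriptEvent = {
  requestId: string;
  text: string;
  isFinal: boolean;
  startedAtMs: number;
  emittedAtMs: number;
  confidence?: number;
  partialRevision?: number;
  endpointReason?: 'silence' | 'semantic_eou' | 'timeout';
};

type AgentSignal = 'context_only' | 'safe_to_act';

// Hypothetical policy: only final text with acceptable confidence may trigger
// a response or tool call; everything else can at most warm up context.
function classifyEvent(
  ev: StreamingTranscriptEvent,
  minConfidence: number = 0.6
): AgentSignal {
  if (!ev.isFinal) return 'context_only';
  if (ev.confidence !== undefined && ev.confidence < minConfidence) {
    return 'context_only';
  }
  return 'safe_to_act';
}

const partialEvent: StreamingTranscriptEvent = {
  requestId: 'req-1',
  text: 'cancel my',
  isFinal: false,
  startedAtMs: 0,
  emittedAtMs: 420,
  partialRevision: 2,
};

const finalEvent: StreamingTranscriptEvent = {
  requestId: 'req-1',
  text: 'cancel my order',
  isFinal: true,
  startedAtMs: 0,
  emittedAtMs: 1310,
  confidence: 0.92,
  endpointReason: 'silence',
};

const partialSignal = classifyEvent(partialEvent); // 'context_only'
const finalSignal = classifyEvent(finalEvent);     // 'safe_to_act'
```

Keeping this decision in one function means the LLM prefetch path and the tool-calling path read the same rule, so a change in STT provider only requires retuning one threshold.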

Non-Claims

  • The Open ASR Leaderboard does not rank the best voice-agent STT.
  • Moonshine v2 latency numbers do not prove the best accuracy in every domain.
  • RTFx is not the same as end-of-turn latency.
  • WER on LibriSpeech clean does not predict noisy phone-call behavior.
  • Cloud vendor latency should be remeasured from the client environment.

References

  • R-VA-001: Local STT deep dive, presentations/voice-agents/STT-DEEP-DIVE.md
  • R-VA-003: Moonshine v2, presentations/voice-agents/research/paper-text/moonshine-v2-2602.12241.txt, https://arxiv.org/abs/2602.12241
  • R-VA-004: Open ASR Leaderboard, presentations/voice-agents/research/paper-text/open-asr-leaderboard-2510.06961.txt, https://arxiv.org/abs/2510.06961
  • R-VA-020: Deepgram Flux docs, presentations/voice-agents/research/articles/deepgram-flux-*.html, https://developers.deepgram.com/docs/flux/quickstart
  • R-VA-026: NVIDIA Parakeet model card, presentations/voice-agents/research/articles/nvidia-parakeet-tdt-06b-v3.html
  • R-VA-030: Whisper paper, presentations/voice-agents/research/paper-text/whisper-2212.04356.txt, https://arxiv.org/abs/2212.04356
  • Data: presentations/voice-agents/research/data/stt_models.csv