Voice-agent eval needs multiple metrics
insight/voice-agent-eval-needs-multiple-metrics · topics: voice-agents, evaluation, observability
No single metric describes a voice agent
The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_08_voice_agent_eval_needs_multiple_metrics.md. WER measures transcript closeness, MOS/UTMOS measure voice quality, RTF/RTFx measure speed or throughput, and demos show only a selected happy path. None of these alone shows whether the agent works as a conversation system.
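To make the gap concrete, here is a minimal sketch of two of these metrics in Python. It is not taken from the research note: the function names and the timer utterance are illustrative assumptions, and WER is computed as plain word-level edit distance with only lowercasing as normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: processing time per second of audio (lower is faster)."""
    return processing_seconds / audio_seconds


def inverse_real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTFx: seconds of audio processed per second of compute (higher is faster)."""
    return audio_seconds / processing_seconds


# Hypothetical utterance: one wrong word out of six, but the wrong word is the entity.
wer = word_error_rate("set a timer for fifteen minutes", "set a timer for fifty minutes")
```

In this made-up case a single substitution out of six words still sets the wrong timer, which is why the note lists entity WER and task success next to plain WER.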
The research now spans incompatible metrics: WER, CER, RTFx, RTF, TTFA, UTMOS, SIM, ROC-AUC, EOT latency, jitter, packet loss, interruption success, and task success. The right output is a trace-based harness rather than a single leaderboard.
The eval should follow the turn trace
The proposed trace records audio capture, first VAD speech, speech end, endpoint decision, first partial, final transcript, LLM first token, TTS first audio, playback start, interruption, and cancellation. From that trace, the system can compute p50/p95/p99 latency, false EOT, missed EOT, entity WER, TTFA, barge-in success, and cost per live minute.
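A turn trace of this kind could be sketched as a small Python record plus a percentile helper. The field names, the millisecond unit, the derived-latency definitions, and the nearest-rank percentile are all assumptions made for illustration; the note lists the events but does not fix a schema.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TurnTrace:
    """Timestamps for one turn, in milliseconds from a shared clock (assumed unit)."""
    audio_capture: int
    vad_speech_start: int
    speech_end: int
    endpoint_decision: int
    first_partial: int
    final_transcript: int
    llm_first_token: int
    tts_first_audio: int
    playback_start: int
    interruption: Optional[int] = None   # user starts talking over the agent
    cancellation: Optional[int] = None   # agent playback is actually stopped

    def endpoint_delay_ms(self) -> int:
        # Time the endpointer waited after the user stopped speaking.
        return self.endpoint_decision - self.speech_end

    def response_latency_ms(self) -> int:
        # User stops speaking -> agent audio starts playing.
        return self.playback_start - self.speech_end

    def barge_in_reaction_ms(self) -> Optional[int]:
        # Only defined for turns where an interruption was detected and handled.
        if self.interruption is None or self.cancellation is None:
            return None
        return self.cancellation - self.interruption


def percentile(values: List[int], p: int) -> int:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(traces: List[TurnTrace]) -> Dict[str, int]:
    """p50/p95/p99 of response latency across a set of turns."""
    latencies = [t.response_latency_ms() for t in traces]
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
```

With 20 to 50 turns, p99 under nearest-rank is simply the slowest turn, so the tail figures are only meaningful once the trace set is much larger.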
This is the strongest practical closing for the future article: benchmark your product conversation, not just the model demo. A small Jarvis regression set with pauses, corrections, backchannels, noisy playback, and domain terms would turn this research into a defensible engineering artifact.
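One possible shape for such a regression set is sketched below in Python. Everything in it is hypothetical: the category labels mirror the list above, the hand-labeled speech-end timestamp and the 1500 ms wait limit are illustrative, and the false/missed EOT definitions are working assumptions rather than definitions from the research note.

```python
from dataclasses import dataclass
from typing import List, Optional

# Categories taken from the list above; the identifiers themselves are made up.
CATEGORIES = ("pause", "correction", "backchannel", "noisy_playback", "domain_term")


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    category: str
    true_speech_end_ms: int  # hand-labeled moment the user really finished


def classify_endpoint(
    case: RegressionCase,
    endpoint_decision_ms: Optional[int],
    max_wait_ms: int = 1500,  # assumed tolerance, tune per product
) -> str:
    """Label one turn as 'ok', 'false_eot', or 'missed_eot'.

    Assumed definitions:
    - false EOT: the endpointer fired before the user actually finished.
    - missed EOT: it never fired, or fired later than max_wait_ms after the end.
    """
    if endpoint_decision_ms is None:
        return "missed_eot"
    if endpoint_decision_ms < case.true_speech_end_ms:
        return "false_eot"
    if endpoint_decision_ms - case.true_speech_end_ms > max_wait_ms:
        return "missed_eot"
    return "ok"


def missing_categories(cases: List[RegressionCase]) -> List[str]:
    """Categories that have no test turn yet."""
    covered = {case.category for case in cases}
    return [name for name in CATEGORIES if name not in covered]
```

A mid-sentence pause case is where false EOT tends to show up: the endpoint decision lands inside the pause, before the labeled end of speech.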
Evidence Fragments
1. STT, TTS, VAD, and transport sources use incompatible metrics.
The research spans WER, RTFx, UTMOS, SIM, RTF, TTFA, ROC-AUC, EOT latency, and transport behavior.
source trace: Open ASR Leaderboard, F5-TTS, Fish Audio S2, Silero VAD quality metrics
2. The local notes now define a trace schema for turn-level instrumentation.
The canonical insight proposes timestamps for audio, VAD, endpointing, STT, LLM, TTS, playout, and interruption.
source trace: Voice agents eval insight
Sources
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_08_voice_agent_eval_needs_multiple_metrics.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_08_voice_agent_eval_needs_multiple_metrics.md
WER/RTFx benchmark model.
url: https://arxiv.org/abs/2510.06961
local_ref: presentations/voice-agents/research/paper-text/open-asr-leaderboard-2510.06961.txt
Caveats
- The note proposes a harness; it does not yet include measured Jarvis trace data.
- Metric weights should depend on product domain.
- Human listening tests still matter for voice quality.
Open Threads
1. Which 20-50 test turns should become the Jarvis voice-agent regression set?
A small high-signal local eval set would make demo reliability and future posts much stronger.
Links
- Topics
  - voice-agents, evaluation, observability
- Used In
  - voice-agents-article-draft
Graph Edges
operationalizes · insight/voice-agent-latency-budget-is-product · strength=3
Voice-agent latency budget is the product
The latency budget becomes useful only when every stage emits trace events.
operationalizes · insight/voice-agent-endpointing-is-turn-taking · strength=3
Endpointing is turn-taking
False EOT and missed EOT turn endpointing into measurable behavior.
packs-into · blog-post/voice-agents-article-draft · strength=3
Future voice agents article
This should be the practical checklist section.
packs-into · presentation/voice-agents-deck · strength=3
Building Real-Time Voice Agents deck
This should become the closing evaluation slide.