Voice-agent eval needs multiple metrics

insight/voice-agent-eval-needs-multiple-metrics · topics: voice-agents, evaluation, observability

No single metric describes a voice agent

The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_08_voice_agent_eval_needs_multiple_metrics.md. WER measures transcript closeness, MOS/UTMOS measure voice quality, RTF/RTFx measure speed or throughput, and demos show only a selected happy path. None of these alone tells you whether the agent works as a conversation system.
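To make the gap between these metrics concrete, here is a minimal sketch of two of them, assuming the common definitions: WER as word-level edit distance divided by reference length, RTF as processing time divided by audio duration, and RTFx as its inverse. The function names and the lowercase/whitespace normalization are illustrative assumptions, not taken from the note or from any benchmark's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumption: lowercase + whitespace split is the only text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def inverse_real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTFx: seconds of audio processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    print(word_error_rate("book a table for two", "book the table for two"))
    print(real_time_factor(2.0, 10.0), inverse_real_time_factor(2.0, 10.0))
```

Neither number depends on when the words arrived or whether the user was interrupted, which is why a low WER and a high RTFx can coexist with a conversation that feels broken.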

The research now spans incompatible metrics: WER, CER, RTFx, RTF, TTFA, UTMOS, SIM, ROC-AUC, EOT latency, jitter, packet loss, interruption success, and task success. The right output is a trace-based harness rather than a single leaderboard.

The eval should follow the turn trace

The proposed trace records audio capture, first VAD speech, speech end, endpoint decision, first partial, final transcript, LLM first token, TTS first audio, playback start, interruption, and cancellation. From that trace, the system can compute p50/p95/p99 latency, false EOT, missed EOT, entity WER, TTFA, barge-in success, and cost per live minute.
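The trace-to-metrics step described above can be sketched roughly as follows. This is a hedged illustration, not the schema from the canonical note: the field names, the millisecond units, the `labeled_speech_end_ms` ground-truth field, the nearest-rank percentile method, and the 1500 ms missed-EOT threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class TurnTrace:
    """One turn; every timestamp is in milliseconds on a shared clock (assumed)."""
    audio_capture_ms: Optional[float] = None
    first_vad_speech_ms: Optional[float] = None
    speech_end_ms: Optional[float] = None
    endpoint_decision_ms: Optional[float] = None
    first_partial_ms: Optional[float] = None
    final_transcript_ms: Optional[float] = None
    llm_first_token_ms: Optional[float] = None
    tts_first_audio_ms: Optional[float] = None
    playback_start_ms: Optional[float] = None
    interruption_ms: Optional[float] = None
    cancellation_ms: Optional[float] = None
    # Hypothetical ground-truth label from a hand-annotated regression set.
    labeled_speech_end_ms: Optional[float] = None


def response_latency_ms(trace: TurnTrace) -> Optional[float]:
    """User stops speaking -> agent audio starts playing (assumed definition)."""
    if trace.speech_end_ms is None or trace.playback_start_ms is None:
        return None
    return trace.playback_start_ms - trace.speech_end_ms


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(traces: Sequence[TurnTrace]) -> Dict[str, float]:
    """p50/p95/p99 over turns that actually reached playback."""
    latencies: List[float] = []
    for trace in traces:
        latency = response_latency_ms(trace)
        if latency is not None:
            latencies.append(latency)
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


def classify_eot(trace: TurnTrace, max_delay_ms: float = 1500.0) -> str:
    """Label the end-of-turn decision against the annotated speech end."""
    if trace.labeled_speech_end_ms is None:
        return "unlabeled"
    if trace.endpoint_decision_ms is None:
        return "missed"  # the endpointer never fired
    if trace.endpoint_decision_ms < trace.labeled_speech_end_ms:
        return "false"   # fired while the user was still speaking
    if trace.endpoint_decision_ms - trace.labeled_speech_end_ms > max_delay_ms:
        return "missed"  # fired too late to count as a detection
    return "ok"
```

Keeping every stage timestamp optional lets a partially instrumented pipeline still emit traces; metrics that need a missing field are skipped for that turn rather than guessed.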

This is the strongest practical closing for the future article: benchmark your product conversation, not just the model demo. A small Jarvis regression set with pauses, corrections, backchannels, noisy playback, and domain terms would turn this research into a defensible engineering artifact.

Evidence Fragments

1. STT, TTS, VAD, and transport sources use incompatible metrics.
The research spans WER, RTFx, UTMOS, SIM, RTF, TTFA, ROC-AUC, EOT latency, and transport behavior.
source trace: Open ASR Leaderboard, F5-TTS, Fish Audio S2, Silero VAD quality metrics
2. The local notes now define a trace schema for turn-level instrumentation.
The canonical insight proposes timestamps for audio, VAD, endpointing, STT, LLM, TTS, playout, and interruption.
source trace: Voice agents eval insight

Sources

1. Voice agents eval insight
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_08_voice_agent_eval_needs_multiple_metrics.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_08_voice_agent_eval_needs_multiple_metrics.md
2. Open ASR Leaderboard
Reference benchmark reporting WER and RTFx.
url: https://arxiv.org/abs/2510.06961
local_ref: presentations/voice-agents/research/paper-text/open-asr-leaderboard-2510.06961.txt

Caveats

The note proposes a harness; it does not yet include measured Jarvis trace data.
Metric weights should depend on product domain.
Human listening tests still matter for voice quality.

Open Threads

1. Which 20-50 test turns should become the Jarvis voice-agent regression set?
A small, high-signal local eval set would make demos more reliable and future posts easier to defend.
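One way to start on this thread is to describe each candidate turn as a tagged record and check that the set covers the conversational behaviors the note calls out (pauses, corrections, backchannels, noisy playback, domain terms). The sketch below is only an assumption about what such a record could look like: the class name, the tag vocabulary, and the audio paths are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set

# Tag vocabulary taken from the behaviors listed in the note; spelling is assumed.
REQUIRED_TAGS: FrozenSet[str] = frozenset(
    {"pause", "correction", "backchannel", "noisy_playback", "domain_term"}
)


@dataclass(frozen=True)
class RegressionTurn:
    turn_id: str
    audio_path: str       # hypothetical path to the recorded user audio
    reference_text: str   # hand-checked transcript used for WER
    tags: FrozenSet[str]


def missing_tags(turns: Iterable[RegressionTurn]) -> Set[str]:
    """Required behaviors that no turn in the set exercises yet."""
    seen: Set[str] = set()
    for turn in turns:
        seen |= turn.tags
    return set(REQUIRED_TAGS) - seen


def tag_counts(turns: Iterable[RegressionTurn]) -> Counter:
    """How many turns exercise each behavior, to spot a lopsided set."""
    counts: Counter = Counter()
    for turn in turns:
        counts.update(turn.tags)
    return counts


if __name__ == "__main__":
    draft_set = [
        RegressionTurn("t01", "audio/t01.wav", "set a timer for, um, ten minutes",
                       frozenset({"pause"})),
        RegressionTurn("t02", "audio/t02.wav", "call Anna, no, call Hanna",
                       frozenset({"correction", "domain_term"})),
    ]
    print(sorted(missing_tags(draft_set)))
    print(tag_counts(draft_set))
```

A coverage check like this says nothing about quality, but it keeps a 20-50 turn set from drifting toward easy happy-path recordings.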

Graph Edges

operationalizes · insight/voice-agent-latency-budget-is-product · strength=3
Voice-agent latency budget is the product
The latency budget becomes useful only when every stage emits trace events.
operationalizes · insight/voice-agent-endpointing-is-turn-taking · strength=3
Endpointing is turn-taking
False EOT and missed EOT turn endpointing into measurable behavior.
packs-into · blog-post/voice-agents-article-draft · strength=3
Future voice agents article
This should be the practical checklist section.
packs-into · presentation/voice-agents-deck · strength=3
Building Real-Time Voice Agents deck
This should become the closing evaluation slide.
