Native speech models change the boundary
insight/native-speech-models-change-the-boundary · topics: voice-agents, native-speech, architecture
Native speech attacks the cascade itself
The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md. Cascaded systems split the pipeline into STT, LLM, and TTS stages, with text as the only interface between them. Native speech-to-speech models bring speech tokens inside the model boundary, which preserves the prosody, overlap, backchannels, and timing signals that a text interface discards.
Moshi is the strongest conceptual source: it names the cascade problems as compounded latency, a text bottleneck, and turn-based segmentation. It reports 160 ms theoretical latency, 200 ms practical latency, the Mimi codec running at 12.5 Hz, and full-duplex multi-stream modeling. Qwen2.5-Omni, Mini-Omni, and GLM-4-Voice show adjacent designs using Thinker/Talker structures, parallel decoding, or low-bitrate speech tokenizers.
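The Moshi figures can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and reading the 160 ms figure as two frame periods is an interpretation that should be checked against the paper's own breakdown.

```python
# Numbers as quoted from the Moshi paper in this note.
MIMI_FRAME_RATE_HZ = 12.5
THEORETICAL_LATENCY_MS = 160.0
PRACTICAL_LATENCY_MS = 200.0


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


frame_ms = frame_duration_ms(MIMI_FRAME_RATE_HZ)

# How many codec frames fit in the theoretical latency figure.
frames_in_theoretical = THEORETICAL_LATENCY_MS / frame_ms

# Gap between the reported practical and theoretical latency.
practical_overhead_ms = PRACTICAL_LATENCY_MS - THEORETICAL_LATENCY_MS

print(f"frame duration: {frame_ms} ms")
print(f"theoretical latency in frames: {frames_in_theoretical}")
print(f"practical overhead: {practical_overhead_ms} ms")
```

At 12.5 Hz each frame covers 80 ms, so the theoretical figure corresponds to two frame periods and the practical figure adds 40 ms on top.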
The cascade is still the practical baseline
The important caveat is that native speech does not automatically win in production. Cascades are easier to debug, moderate, and log, tool calls fit naturally at the text boundary, and each component can be swapped independently. Native systems often need custom serving and much richer audio-level evaluation.
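A minimal sketch of why the cascade is easy to inspect, assuming each stage is a plain function from string to string. The stage functions here (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stubs rather than real STT, LLM, or TTS clients; the point is only that every boundary exposes loggable output and a per-stage latency that sums into the end-to-end figure.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageTrace:
    name: str
    output: str
    latency_ms: float


def run_cascade(
    audio_in: str, stages: List[Tuple[str, Callable[[str], str]]]
) -> List[StageTrace]:
    """Run stages in order, recording each stage's output and latency."""
    traces: List[StageTrace] = []
    payload = audio_in
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        traces.append(StageTrace(name, payload, elapsed_ms))
    return traces


# Hypothetical stubs standing in for real components.
def fake_stt(audio: str) -> str:
    return "what time is it"


def fake_llm(text: str) -> str:
    return "It is noon."


def fake_tts(text: str) -> str:
    return f"<audio:{text}>"


traces = run_cascade(
    "<audio:user>", [("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)]
)

# Stage latencies add up: this is the compounded latency of a cascade.
total_ms = sum(t.latency_ms for t in traces)

for t in traces:
    print(f"{t.name}: {t.output!r} ({t.latency_ms:.3f} ms)")
print(f"total: {total_ms:.3f} ms")
```

Swapping a component means replacing one entry in the `stages` list, and moderation or tool calls can hook in at the text produced by the `stt` or `llm` stage. A native speech model has no equivalent intermediate text to inspect.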
For the article and deck, native speech should be framed as "where the boundary is moving," not as a replacement recommendation for every builder. Start with a debuggable cascade, add turn-taking and barge-in, then move to native speech only if the product really needs duplex, prosody, or lower boundary latency.
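The recommended order can be written down as a small decision helper. This is a sketch of the heuristic stated above, not a validated rubric; the `ProductNeeds` fields and the returned strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductNeeds:
    # Whether the existing cascade already handles turn-taking and barge-in.
    cascade_has_turn_taking_and_barge_in: bool = False
    # Reasons the note gives for considering native speech.
    needs_full_duplex: bool = False
    needs_prosody: bool = False
    needs_lower_boundary_latency: bool = False


def recommend_architecture(needs: ProductNeeds) -> str:
    """Apply the note's ordering: cascade first, native only when required."""
    if not needs.cascade_has_turn_taking_and_barge_in:
        return "cascade: add turn-taking and barge-in first"
    if (
        needs.needs_full_duplex
        or needs.needs_prosody
        or needs.needs_lower_boundary_latency
    ):
        return "evaluate native speech"
    return "stay on cascade"


print(recommend_architecture(ProductNeeds()))
print(
    recommend_architecture(
        ProductNeeds(cascade_has_turn_taking_and_barge_in=True, needs_prosody=True)
    )
)
```

The helper deliberately returns "evaluate" rather than "adopt" for the native path, since the serving and evaluation costs noted above still have to be weighed per product.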
Evidence Fragments
1. Moshi explicitly targets cascade latency, text bottleneck, and turn-based limitations.
The paper reports 160 ms theoretical latency, 200 ms practical latency, the Mimi codec running at 12.5 Hz, and full-duplex multi-stream modeling.
source trace: Moshi
2. Qwen2.5-Omni uses a Thinker-Talker architecture for multimodal reasoning and speech output.
The paper describes Talker producing audio tokens from Thinker representations with streaming audio decoding.
source trace: Qwen2.5-Omni
Sources
1. Voice agents native speech insight
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md
2. Moshi
Full-duplex native speech model.
url: https://arxiv.org/abs/2410.00037
local_ref: presentations/voice-agents/research/paper-text/moshi-2410.00037.txt
3. Qwen2.5-Omni
Thinker-Talker architecture.
url: https://arxiv.org/abs/2503.20215
local_ref: presentations/voice-agents/research/paper-text/qwen25-omni-2503.20215.txt
4. GLM-4-Voice
End-to-end spoken chatbot architecture.
url: https://arxiv.org/abs/2412.02612
local_ref: presentations/voice-agents/research/paper-text/glm-4-voice-2412.02612.txt
Caveats
- Native speech claims rarely include full product evaluation.
- Tool calls, moderation, and logs are simpler in cascaded text-centered systems.
- Open-source native systems often need custom serving and eval harnesses.
Open Threads
1. Which Jarvis interaction actually needs native speech rather than a cascade?
The added complexity of native speech is only worth it if the product needs duplex or prosody, not just novelty.
Links
- Topics: voice-agents, native-speech, architecture
- Used In: voice-agents-article-draft
Graph Edges
extends · insight/voice-agent-latency-budget-is-product · strength=2
Voice-agent latency budget is the product
Native speech models try to remove cascade latency boundaries.
extends · insight/barge-in-is-the-real-system-test · strength=2
Barge-in is the real system test
Full-duplex native models directly target overlap and interruption.
packs-into · presentation/voice-agents-deck · strength=2
Building Real-Time Voice Agents deck
This belongs in the future-facing section.