Native speech models change the boundary

insight/native-speech-models-change-the-boundary · topics: voice-agents, native-speech, architecture

Native speech attacks the cascade itself

The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md. Cascaded systems chain separate STT, LLM, and TTS stages, so everything between them passes through text. Native speech-to-speech models bring speech tokens inside the model boundary, which preserves the prosody, overlap, backchannels, and timing signals that a text interface discards.

Moshi is the strongest conceptual source: it names the cascade's problems as compounded latency, a text bottleneck, and turn-based segmentation. It reports 160 ms theoretical latency and 200 ms practical latency, with the Mimi codec running at 12.5 Hz and full-duplex multi-stream modeling. Qwen2.5-Omni, Mini-Omni, and GLM-4-Voice show adjacent designs built on Thinker-Talker structures, parallel decoding, or low-bitrate speech tokenizers.
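To make the Moshi numbers concrete, the arithmetic below converts the 12.5 Hz Mimi frame rate into a frame duration and expresses the reported latencies in frames. It is a minimal sketch: the function names are illustrative, and only the 12.5 Hz, 160 ms, and 200 ms figures come from the paper.

```python
# Figures reported in the Moshi paper; everything else here is illustrative.
MIMI_FRAME_RATE_HZ = 12.5
THEORETICAL_LATENCY_MS = 160.0
PRACTICAL_LATENCY_MS = 200.0


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def latency_in_frames(latency_ms: float, frame_rate_hz: float) -> float:
    """How many codec frames fit inside a given latency."""
    return latency_ms / frame_duration_ms(frame_rate_hz)


frame_ms = frame_duration_ms(MIMI_FRAME_RATE_HZ)
theoretical_frames = latency_in_frames(THEORETICAL_LATENCY_MS, MIMI_FRAME_RATE_HZ)
practical_frames = latency_in_frames(PRACTICAL_LATENCY_MS, MIMI_FRAME_RATE_HZ)

print(f"one Mimi frame: {frame_ms} ms")
print(f"theoretical latency: {theoretical_frames} frames")
print(f"practical latency: {practical_frames} frames")
```

At 12.5 Hz each frame covers 80 ms, so the 160 ms theoretical figure equals two frame periods and the 200 ms practical figure equals two and a half.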

The cascade is still the practical baseline

The important caveat is that native speech does not automatically win in production. Cascades are easier to debug, moderate, and log, they make tool calls straightforward, and their components can be swapped one at a time. Native systems often need custom serving and much richer audio-level evaluation.
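The practical advantage of the cascade can be sketched as a text boundary that every turn crosses twice. The example below is an illustration only: the `Cascade` class, the stage signatures, and the stub stages are hypothetical and do not correspond to any real STT, LLM, or TTS client. It shows where logging and moderation attach, and how a single stage is swapped without touching the others.

```python
from typing import Callable, Dict, List

# Hypothetical stage signatures; real clients would be wrapped to match them.
Stt = Callable[[bytes], str]
Llm = Callable[[str], str]
Tts = Callable[[str], bytes]
Moderator = Callable[[str], bool]


class Cascade:
    """STT -> LLM -> TTS pipeline with a loggable, moderatable text boundary."""

    def __init__(self, stt: Stt, llm: Llm, tts: Tts, is_allowed: Moderator) -> None:
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.is_allowed = is_allowed
        self.log: List[Dict[str, str]] = []

    def turn(self, audio_in: bytes) -> bytes:
        # Text boundary 1: the transcript can be logged and moderated.
        transcript = self.stt(audio_in)
        self.log.append({"stage": "stt", "text": transcript})
        if self.is_allowed(transcript):
            reply = self.llm(transcript)
        else:
            reply = "Sorry, I can't help with that."
        # Text boundary 2: the reply can be logged before it becomes audio.
        self.log.append({"stage": "llm", "text": reply})
        return self.tts(reply)


# Stub stages that stand in for real services (assumptions for illustration).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def allow_all(text: str) -> bool:
    return True


agent = Cascade(fake_stt, fake_llm, fake_tts, allow_all)
first_reply = agent.turn(b"hello")

# Swap only the LLM stage; STT, TTS, logging, and moderation are untouched.
agent.llm = lambda text: text.upper()
second_reply = agent.turn(b"hi")
```

A native speech-to-speech model has no equivalent text boundary by default, which is why moderation, logging, and evaluation have to be rebuilt at the audio level.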

For the article and deck, native speech should be framed as "where the boundary is moving," not as a replacement recommendation for every builder. Start with a debuggable cascade, add turn-taking and barge-in, then move to native speech only if the product really needs full duplex, prosody, or lower boundary latency.
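The staged recommendation above can be written down as a small decision helper. This is a sketch of the note's own rule of thumb, not a validated selection procedure; the `ProductNeeds` fields and the returned strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProductNeeds:
    """Hypothetical checklist of requirements that only native speech serves well."""
    full_duplex: bool = False
    prosody: bool = False
    lower_boundary_latency: bool = False


def recommend(needs: ProductNeeds, cascade_has_barge_in: bool) -> str:
    """Apply the staged rule: cascade first, barge-in next, native only on real need."""
    if not cascade_has_barge_in:
        return "cascade: add turn-taking and barge-in first"
    if needs.full_duplex or needs.prosody or needs.lower_boundary_latency:
        return "evaluate native speech"
    return "stay on the cascade"


print(recommend(ProductNeeds(), cascade_has_barge_in=False))
print(recommend(ProductNeeds(prosody=True), cascade_has_barge_in=True))
print(recommend(ProductNeeds(), cascade_has_barge_in=True))
```

The ordering matters: a product need for duplex or prosody does not skip the cascade stage, because the cascade with barge-in is the baseline that any native system is compared against.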

Evidence Fragments

1. Moshi explicitly targets cascade latency, text bottleneck, and turn-based limitations.
The paper reports 160 ms theoretical latency, 200 ms practical latency, a 12.5 Hz Mimi codec, and full-duplex multi-stream modeling.
source trace: Moshi
2. Qwen2.5-Omni uses a Thinker-Talker architecture for multimodal reasoning and speech output.
The paper describes the Talker producing audio tokens from Thinker representations, with streaming audio decoding.
source trace: Qwen2.5-Omni

Sources

1. Voice agents native speech insight
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md
2. Moshi
Full-duplex native speech model.
url: https://arxiv.org/abs/2410.00037
local_ref: presentations/voice-agents/research/paper-text/moshi-2410.00037.txt
3. Qwen2.5-Omni
Thinker-Talker architecture.
url: https://arxiv.org/abs/2503.20215
local_ref: presentations/voice-agents/research/paper-text/qwen25-omni-2503.20215.txt
4. GLM-4-Voice
End-to-end spoken chatbot architecture.
url: https://arxiv.org/abs/2412.02612
local_ref: presentations/voice-agents/research/paper-text/glm-4-voice-2412.02612.txt

Caveats

Native speech claims rarely include full product evaluation.
Tool calls, moderation, and logs are simpler in cascaded text-centered systems.
Open-source native systems often need custom serving and eval harnesses.

Open Threads

1. Which Jarvis interaction actually needs native speech rather than a cascade?
The complexity of native speech is only worth it if the product needs duplex or prosody, not just novelty.

Graph Edges

extends · insight/voice-agent-latency-budget-is-product · strength=2
Voice-agent latency budget is the product
Native speech models try to remove cascade latency boundaries.
extends · insight/barge-in-is-the-real-system-test · strength=2
Barge-in is the real system test
Full-duplex native models directly target overlap and interruption.
packs-into · presentation/voice-agents-deck · strength=2
Building Real-Time Voice Agents deck
This belongs in the future-facing section.
