Native speech models change the boundary

insight/native-speech-models-change-the-boundary · topics: voice-agents, native-speech, architecture

Native speech attacks the cascade itself

The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md. Cascaded systems split the pipeline into STT, LLM, and TTS stages. Native speech-to-speech models move speech tokens inside the model boundary, preserving the prosody, overlap, backchannels, and timing signals that a text interface discards.

Moshi is the strongest conceptual source: it names the cascade's problems as compounded latency, the text bottleneck, and turn-based segmentation. It reports 160 ms theoretical latency, 200 ms practical latency, the Mimi codec running at 12.5 Hz, and full-duplex multi-stream modeling. Qwen2.5-Omni, Mini-Omni, and GLM-4-Voice show adjacent designs built on a Thinker-Talker structure, parallel decoding, and a low-bitrate speech tokenizer, respectively.
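The latency figures above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the split of the 160 ms figure into one Mimi frame plus one frame of acoustic delay is a reading of the Moshi paper rather than something stated in this note, and the cascade stage budgets are made-up placeholders, not measurements.

```python
def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def cascade_latency_ms(stage_latencies_ms: dict[str, float]) -> float:
    """Serial stages compound: the total delay is the sum of every stage."""
    return sum(stage_latencies_ms.values())


# From the note: the Mimi codec runs at 12.5 Hz.
mimi_frame_ms = frame_duration_ms(12.5)

# Assumption (reading of the Moshi paper): the theoretical latency is
# one codec frame plus an acoustic delay of the same length.
acoustic_delay_ms = mimi_frame_ms
moshi_theoretical_ms = mimi_frame_ms + acoustic_delay_ms

# Hypothetical per-stage budgets for a cascade; placeholders, not measurements.
hypothetical_cascade_ms = {
    "endpointing": 300.0,
    "stt_final": 150.0,
    "llm_first_token": 250.0,
    "tts_first_audio": 150.0,
}

print(f"Mimi frame: {mimi_frame_ms:.0f} ms")
print(f"Moshi theoretical latency: {moshi_theoretical_ms:.0f} ms")
print(f"Hypothetical cascade: {cascade_latency_ms(hypothetical_cascade_ms):.0f} ms")
```

The point of the comparison is structural rather than numeric: a cascade pays for every boundary in sequence, while a frame-level model's floor is set by its codec frame rate.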

The cascade is still the practical baseline

The important caveat is that native speech does not automatically win in production. Cascades are easier to debug, moderate, and log, easier to wire to tool calls, and can be swapped component by component. Native systems often need custom serving and much richer audio-level evaluation.
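A minimal sketch of why the cascade is easier to operate, assuming each stage is just a callable with a text boundary between stages. The `Cascade` class, the stub components, and the refusal string are all hypothetical; a real system would plug in actual STT, LLM, and TTS clients.

```python
from dataclasses import dataclass, field
from typing import Callable


def allow_all(text: str) -> bool:
    """Default moderation hook: accept every reply."""
    return True


@dataclass
class Cascade:
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    moderate: Callable[[str], bool] = allow_all
    log: list[tuple[str, str]] = field(default_factory=list)

    def respond(self, audio_in: bytes) -> bytes:
        # Every boundary is plain text, so it can be logged and inspected.
        transcript = self.stt(audio_in)
        self.log.append(("stt", transcript))
        reply = self.llm(transcript)
        self.log.append(("llm", reply))
        # Moderation runs on text before any audio is synthesized.
        if not self.moderate(reply):
            reply = "Sorry, I can't help with that."
            self.log.append(("moderation", reply))
        return self.tts(reply)


# Stub components standing in for real services (hypothetical).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = Cascade(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = agent.respond(b"hello")
print(audio_out)
print(agent.log)
```

Swapping a component is a one-argument change, for example `Cascade(stt=fake_stt, llm=other_llm, tts=fake_tts)`. A native speech-to-speech model has no equivalent text seam to log or moderate at, which is the operational cost the caveat describes.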

For the article and deck, native speech should be framed as "where the boundary is moving," not as a replacement recommendation for every builder. Start with a debuggable cascade, add turn-taking and barge-in, then move to native speech only if the product really needs full-duplex interaction, prosody, or lower boundary latency.
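The middle step of that path, barge-in on top of a cascade, can be sketched as a playback loop that checks for user speech before each audio chunk. This is a simplified, synchronous illustration: `play_with_barge_in`, the `user_is_speaking` callback (standing in for a voice-activity detector), and the `play` callback are hypothetical names, and a production system would run detection concurrently with playback.

```python
from typing import Callable, Sequence


def play_with_barge_in(
    chunks: Sequence[bytes],
    user_is_speaking: Callable[[], bool],
    play: Callable[[bytes], None],
) -> int:
    """Play TTS chunks in order, stopping as soon as the user starts talking.

    Returns the number of chunks actually played.
    """
    for index, chunk in enumerate(chunks):
        # Poll the (hypothetical) voice-activity signal before each chunk.
        if user_is_speaking():
            return index
        play(chunk)
    return len(chunks)


# Simulated run: the user interrupts before the third chunk.
played: list[bytes] = []
vad_signal = iter([False, False, True])
count = play_with_barge_in(
    [b"chunk-1", b"chunk-2", b"chunk-3", b"chunk-4"],
    lambda: next(vad_signal),
    played.append,
)
print(count, played)
```

The smaller the chunks, the faster the agent yields the floor. A cascade can only react at chunk boundaries after detection, whereas full-duplex native models handle overlapping speech inside the model itself.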

Evidence Fragments

  1. Moshi explicitly targets cascade latency, text bottleneck, and turn-based limitations.

    The paper reports 160 ms theoretical latency, 200 ms practical latency, 12.5 Hz Mimi codec, and full-duplex multi-stream modeling.

    source trace: Moshi

  2. Qwen2.5-Omni uses a Thinker-Talker architecture for multimodal reasoning and speech output.

    The paper describes Talker producing audio tokens from Thinker representations with streaming audio decoding.

    source trace: Qwen2.5-Omni

Sources

  1. Voice agents native speech insight

    Canonical long-form note.

    url: presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md

    local_ref: presentations/voice-agents/research/insights/INSIGHT_07_native_speech_models_change_the_boundary.md

  2. Moshi

    Full-duplex native speech model.

    url: https://arxiv.org/abs/2410.00037

    local_ref: presentations/voice-agents/research/paper-text/moshi-2410.00037.txt

  3. Qwen2.5-Omni

    Thinker-Talker architecture.

    url: https://arxiv.org/abs/2503.20215

    local_ref: presentations/voice-agents/research/paper-text/qwen25-omni-2503.20215.txt

  4. GLM-4-Voice

    End-to-end spoken chatbot architecture.

    url: https://arxiv.org/abs/2412.02612

    local_ref: presentations/voice-agents/research/paper-text/glm-4-voice-2412.02612.txt

Caveats

  • Native speech claims rarely include full product evaluation.
  • Tool calls, moderation, and logs are simpler in cascaded text-centered systems.
  • Open-source native systems often need custom serving and eval harnesses.

Open Threads

  1. Which Jarvis interaction actually needs native speech rather than a cascade?

    Native speech complexity is only worth it if the product needs full-duplex interaction or prosody, not just novelty.

Topics

voice-agents, native-speech, architecture

Used In

voice-agents-article-draft

Graph Edges

  1. extends · insight/voice-agent-latency-budget-is-product · strength=2

    Voice-agent latency budget is the product

    Native speech models try to remove cascade latency boundaries.

  2. extends · insight/barge-in-is-the-real-system-test · strength=2

    Barge-in is the real system test

    Full-duplex native models directly target overlap and interruption.

  3. packs-into · presentation/voice-agents-deck · strength=2

    Building Real-Time Voice Agents deck

    This belongs in the future-facing section.