Transport is media correctness

insight/transport-is-media-correctness · topics: voice-agents, webrtc, transport

The transport choice is about owning media behavior

The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_05_transport_is_media_correctness.md. WebSocket can stream audio and is fine for local demos, server-to-server streams, and telephony integrations. The issue is that browser/mobile voice needs media behavior: acoustic echo cancellation (AEC), jitter buffering, packet timing, Opus encoding, NAT traversal, stats, and playout semantics. With a raw WebSocket, the application has to own each of these itself.

OpenAI's Realtime docs and Pipecat's transport docs both point toward WebRTC for browser/mobile/client voice and WebSocket for server-side or controlled integrations. The local transport deep dive adds an important correction: WebSocket frame overhead is not the primary problem. The real problems are ordered TCP delivery (one delayed segment holds back everything behind it), playout, echo, and cancellation.
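The "ordered TCP behavior" point can be made concrete with a toy model. The sketch below is not a network simulation. It assumes a hypothetical 20 ms frame cadence, a 30 ms baseline network delay, one frame delayed by 200 ms, and a 60 ms playout delay. All of these numbers, and the helper names `ordered_delivery` and `late_frames`, are invented for illustration. The model compares how many frames miss their playout deadline when delivery is forced to be in order versus when each frame is handled independently, as in datagram-based media transports.

```python
FRAME_MS = 20  # hypothetical frame duration (assumption, not from the note)


def ordered_delivery(arrivals_ms):
    """Reliable ordered stream: frame i is never delivered before frame i-1."""
    delivered, latest = [], 0
    for t in arrivals_ms:
        latest = max(latest, t)  # a late frame holds back every frame behind it
        delivered.append(latest)
    return delivered


def late_frames(delivery_ms, playout_delay_ms):
    """Indices of frames delivered after their playout deadline."""
    return [
        i
        for i, t in enumerate(delivery_ms)
        if t > i * FRAME_MS + playout_delay_ms
    ]


# Frame i is sent at i * 20 ms with 30 ms of network delay.
# Frame 2 is delayed by an extra 200 ms (for example, a retransmission).
arrivals = [30, 50, 270, 90, 110, 130]

independent_late = late_frames(arrivals, playout_delay_ms=60)
ordered_late = late_frames(ordered_delivery(arrivals), playout_delay_ms=60)

print("independent frames, late:", independent_late)
print("ordered stream, late:    ", ordered_late)
```

In this toy run, independent handling loses only the one delayed frame, which a receiver can conceal, while the ordered stream makes every subsequent frame in the window late as well. The model treats a late frame and a lost frame the same way, which is a simplification.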

WebRTC does not make the model stack fast

WebRTC is the media substrate. It does not solve endpointing, STT finalization, LLM first token, or TTS first audio. A perfect media path can still feel slow if the turn detector waits too long or the TTS stack has a high time to first audio (TTFA).

The article should therefore avoid shallow protocol tribalism. The right recommendation is situational: WebRTC by default for production browser/mobile voice, WebSocket for server streams and controlled prototypes, and a measured waterfall either way.
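A "measured waterfall" can be as simple as a list of per-stage durations summed into one turn. The sketch below is a minimal illustration. The stage names follow the note, but every duration is a made-up placeholder rather than a measurement, and the `Stage` and `waterfall` names are hypothetical. It also assumes the stages run strictly one after another, which real streaming pipelines usually do not.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    ms: float


def waterfall(stages):
    """Return (name, start_ms, end_ms) rows for sequential stages, plus the total."""
    rows, t = [], 0.0
    for s in stages:
        rows.append((s.name, t, t + s.ms))
        t += s.ms
    return rows, t


# Placeholder numbers for illustration only, not measurements.
turn = [
    Stage("transport uplink", 40),
    Stage("endpointing wait", 500),
    Stage("STT finalization", 150),
    Stage("LLM first token", 350),
    Stage("TTS first audio", 250),
    Stage("transport downlink + playout", 60),
]

rows, total = waterfall(turn)
for name, start, end in rows:
    print(f"{name:<30} {start:6.0f} -> {end:6.0f} ms")

transport_ms = sum(s.ms for s in turn if s.name.startswith("transport"))
print(f"total {total:.0f} ms, transport share {100 * transport_ms / total:.0f}%")
```

With these placeholder values, transport accounts for a small share of the turn, which is the shape of the argument above: swapping the transport changes media correctness, but the endpointing, STT, LLM, and TTS rows remain.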

Evidence Fragments

  1. OpenAI recommends WebRTC for browser/mobile and WebSocket for server-side integrations.

    The official Realtime docs separate client-side WebRTC from server-side WebSocket use cases.

    source trace: OpenAI Realtime WebRTC docs, OpenAI Realtime WebSocket docs

  2. WebSocket frame overhead is not the main transport problem.

    The local transport note estimates 2-14 bytes of header per frame, about 0.2-1.5% overhead on a 960-byte audio chunk (30 ms of 16 kHz int16 mono audio).

    source trace: Local transport deep dive
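The overhead estimate in fragment 2 can be reproduced with a few lines of arithmetic. The sketch below computes the WebSocket frame header size from the framing rules in RFC 6455: a 2-byte base header, 2 or 8 extra bytes for extended payload lengths, and a 4-byte masking key on client-to-server frames. The helper name `ws_header_bytes` is hypothetical, and the calculation ignores TCP/IP and TLS overhead.

```python
def ws_header_bytes(payload_len: int, masked: bool) -> int:
    """Header size in bytes for a single WebSocket frame (RFC 6455 framing)."""
    size = 2  # FIN/opcode byte plus mask-bit/7-bit-length byte
    if payload_len > 65535:
        size += 8  # 64-bit extended payload length
    elif payload_len > 125:
        size += 2  # 16-bit extended payload length
    if masked:
        size += 4  # masking key, required on client-to-server frames
    return size


SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 30
BYTES_PER_SAMPLE = 2  # int16, mono

# 480 samples * 2 bytes = 960 bytes per chunk
payload = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE

for masked in (False, True):
    header = ws_header_bytes(payload, masked)
    pct = 100 * header / payload
    print(f"masked={masked}: {header} header bytes, {pct:.2f}% of payload")

# Bounds quoted in the note: smallest and largest possible header sizes.
low_pct = 100 * 2 / payload
high_pct = 100 * 14 / payload
print(f"general range: {low_pct:.1f}% to {high_pct:.1f}%")
```

For a 960-byte payload specifically, the header is 4 bytes unmasked or 8 bytes masked, which sits inside the 2-14 byte range the note quotes. Either way the overhead stays at or below about 1.5% of the payload, consistent with the claim that framing is not the main problem.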

Sources

  1. Voice agents transport insight

    Canonical long-form note.

    url: presentations/voice-agents/research/insights/INSIGHT_05_transport_is_media_correctness.md

    local_ref: presentations/voice-agents/research/insights/INSIGHT_05_transport_is_media_correctness.md

  2. OpenAI Realtime WebRTC docs

    Browser/mobile transport guidance.

    url: https://developers.openai.com/api/docs/guides/realtime-webrtc

    local_ref: presentations/voice-agents/research/articles/openai-realtime-webrtc.html

  3. OpenAI Realtime WebSocket docs

    Server-side transport guidance.

    url: https://developers.openai.com/api/docs/guides/realtime-websocket

    local_ref: presentations/voice-agents/research/articles/openai-realtime-websocket.html

  4. Local transport deep dive

    Existing transport research.

    url: presentations/voice-agents/TRANSPORT-DEEP-DIVE.md

    local_ref: presentations/voice-agents/TRANSPORT-DEEP-DIVE.md

Caveats

  • WebRTC is not magic; TURN placement and app cancellation still matter.
  • WebSocket is valid for many server-side and controlled-network scenarios.
  • Transport does not remove endpointing/STT/LLM/TTS latency.

Open Threads

  1. Should Jarvis default to WebRTC for the talk demo or keep WebSocket for simplicity?

    Stage reliability may favor the path that is simplest to test end to end.

Topics

voice-agents, webrtc, transport

Used In

voice-agents-article-draft

Graph Edges

  1. supports · insight/barge-in-is-the-real-system-test · strength=3

    Barge-in is the real system test

    Barge-in depends on media timing, echo cancellation, and playout control.

  2. supports · insight/voice-agent-latency-budget-is-product · strength=2

    Voice-agent latency budget is the product

    Transport is one segment of the user-perceived latency waterfall.

  3. packs-into · presentation/voice-agents-deck · strength=3

    Building Real-Time Voice Agents deck

    This belongs in the real-time transport section.