Transport is media correctness
insight/transport-is-media-correctness · topics: voice-agents, webrtc, transport
The transport choice is about owning media behavior
The canonical research note lives at presentations/voice-agents/research/insights/INSIGHT_05_transport_is_media_correctness.md. WebSocket can stream audio and is fine for local demos, server-to-server streams, and telephony integrations. The issue is that browser and mobile voice needs media behavior, not just byte delivery: acoustic echo cancellation (AEC), jitter buffering, packet timing, Opus, NAT traversal, stats, and playout semantics.
OpenAI's Realtime docs and Pipecat's transport docs both point toward WebRTC for browser/mobile/client voice and WebSocket for server-side or controlled integrations. The local transport deep dive adds an important correction: WebSocket frame overhead is not the primary problem. The primary problems are in-order TCP delivery (one delayed segment holds back every frame behind it), playout, echo, and cancellation.
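The effect of in-order delivery can be illustrated with a toy model. The sketch below is not taken from the transport deep dive; the frame interval, one-way delay, playout buffer, and the single 200 ms stall are all made-up numbers, and the helper names are hypothetical. It compares how many frames miss their playout deadline when one frame is delayed on an ordered stream versus on a path where frames arrive independently.

```python
from itertools import accumulate

FRAME_MS = 30  # one audio frame every 30 ms (assumed, matches the chunk size in this note)


def arrival_times(n_frames, one_way_ms, extra_delay_ms):
    """Arrival time of each frame if every frame travelled independently.

    extra_delay_ms maps frame index -> additional delay, e.g. a retransmission.
    """
    return [
        i * FRAME_MS + one_way_ms + extra_delay_ms.get(i, 0)
        for i in range(n_frames)
    ]


def ordered_delivery(arrivals):
    """Ordered stream (TCP-like): a frame is never delivered before its predecessors."""
    return list(accumulate(arrivals, max))


def late_frames(delivered, one_way_ms, playout_buffer_ms):
    """Indices of frames delivered after their playout deadline."""
    return [
        i
        for i, t in enumerate(delivered)
        if t > i * FRAME_MS + one_way_ms + playout_buffer_ms
    ]


# Hypothetical scenario: frame 3 is stalled by 200 ms, everything else is on time.
arrivals = arrival_times(n_frames=10, one_way_ms=40, extra_delay_ms={3: 200})

late_ordered = late_frames(ordered_delivery(arrivals), one_way_ms=40, playout_buffer_ms=60)
late_unordered = late_frames(arrivals, one_way_ms=40, playout_buffer_ms=60)

print("late on ordered stream:", late_ordered)
print("late on independent frames:", late_unordered)
```

With these numbers the ordered stream misses five deadlines (frames 3 to 7), while the independent path misses only frame 3, which a receiver could conceal or skip. Real jitter buffers adapt their depth and real TCP stacks add congestion behavior, so treat this as intuition rather than a measurement.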
WebRTC does not make the model stack fast
WebRTC is the media substrate. It does not solve endpointing, STT finalization, LLM first token, or TTS first audio. A perfect media path can still feel slow if the turn detector waits too long or the TTS stack has a high time to first audio (TTFA).
The article should therefore avoid shallow protocol tribalism. The right recommendation is situational: WebRTC by default for production browser/mobile voice, WebSocket for server streams and controlled prototypes, and a measured waterfall either way.
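A measured waterfall can be as simple as a table of per-stage durations and their share of the total. The sketch below uses hypothetical numbers purely for illustration (none come from the research notes) and assumes the stages run strictly one after another, which a streaming pipeline may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    ms: float


def waterfall(stages):
    """Return (total_ms, [(name, ms, share_of_total), ...]) for sequential stages."""
    total = sum(s.ms for s in stages)
    return total, [(s.name, s.ms, s.ms / total) for s in stages]


# Hypothetical values; replace with measurements from your own stack.
stages = [
    Stage("transport (both directions)", 80),
    Stage("endpointing wait", 500),
    Stage("STT finalization", 150),
    Stage("LLM first token", 350),
    Stage("TTS first audio (TTFA)", 250),
]

total_ms, rows = waterfall(stages)
for name, ms, share in rows:
    print(f"{name:<30} {ms:>6.0f} ms  {share:6.1%}")
print(f"{'total':<30} {total_ms:>6.0f} ms")
```

In this made-up breakdown transport is a small share of the total, so switching protocols alone would barely move the perceived latency. The value of the exercise is in filling the table with real measurements for both transports.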
Evidence Fragments
1. OpenAI recommends WebRTC for browser/mobile and WebSocket for server-side integrations.
The official Realtime docs separate client-side WebRTC from server-side WebSocket use cases.
source trace: OpenAI Realtime WebRTC docs, OpenAI Realtime WebSocket docs
2. WebSocket frame overhead is not the main transport problem.
The local transport note estimates 2-14 bytes of header per WebSocket frame, about 0.2-1.5% of a 960-byte audio chunk (30 ms of 16 kHz int16 audio).
source trace: Local transport deep dive
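The overhead estimate in fragment 2 can be checked with a few lines of arithmetic. The sketch below computes the header size of a single unfragmented WebSocket frame from the framing rules in RFC 6455 (2-byte base header, 2 or 8 extra length bytes for larger payloads, 4-byte masking key on client-to-server frames). The function name is hypothetical, mono audio is assumed, and TCP, IP, and TLS overhead are deliberately left out.

```python
def ws_header_bytes(payload_len: int, masked: bool) -> int:
    """Header size in bytes of one unfragmented WebSocket frame (RFC 6455 framing)."""
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    header = 2  # FIN/opcode byte + mask bit/7-bit length byte
    if payload_len > 65535:
        header += 8  # 64-bit extended payload length
    elif payload_len > 125:
        header += 2  # 16-bit extended payload length
    if masked:
        header += 4  # masking key, required on client-to-server frames
    return header


SAMPLE_RATE_HZ = 16_000
FRAME_MS = 30
BYTES_PER_SAMPLE = 2  # int16, mono (assumed)

chunk_bytes = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE

for label, masked in [("server -> client", False), ("client -> server", True)]:
    header = ws_header_bytes(chunk_bytes, masked)
    print(f"{label}: {header} header bytes, {header / chunk_bytes:.2%} of {chunk_bytes}")
```

For a 960-byte chunk this gives 4 header bytes unmasked and 8 masked, which sits inside the note's 2-14 byte range (the extremes apply to very small unmasked and very large masked frames). Either way the framing cost is around 1% or less, which supports the claim that frame overhead is not the main transport problem.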
Sources
1. Voice agents transport insight
Canonical long-form note.
url: presentations/voice-agents/research/insights/INSIGHT_05_transport_is_media_correctness.md
local_ref: presentations/voice-agents/research/insights/INSIGHT_05_transport_is_media_correctness.md
2. OpenAI Realtime WebRTC docs
Browser/mobile transport guidance.
url: https://developers.openai.com/api/docs/guides/realtime-webrtc
local_ref: presentations/voice-agents/research/articles/openai-realtime-webrtc.html
3. OpenAI Realtime WebSocket docs
Server-side transport guidance.
url: https://developers.openai.com/api/docs/guides/realtime-websocket
local_ref: presentations/voice-agents/research/articles/openai-realtime-websocket.html
4. Local transport deep dive
Existing transport research.
url: presentations/voice-agents/TRANSPORT-DEEP-DIVE.md
local_ref: presentations/voice-agents/TRANSPORT-DEEP-DIVE.md
Caveats
- WebRTC is not magic; TURN placement and app cancellation still matter.
- WebSocket is valid for many server-side and controlled-network scenarios.
- Transport does not remove endpointing/STT/LLM/TTS latency.
Open Threads
1. Should Jarvis default to WebRTC for the talk demo or keep WebSocket for simplicity?
Stage reliability may favor the path that is simplest to test end to end.
Links
- Topics: voice-agents, webrtc, transport
- Used In: voice-agents-article-draft
Graph Edges
supports · insight/barge-in-is-the-real-system-test · strength=3
Barge-in is the real system test
Barge-in depends on media timing, echo cancellation, and playout control.
supports · insight/voice-agent-latency-budget-is-product · strength=2
Voice-agent latency budget is the product
Transport is one segment of the user-perceived latency waterfall.
packs-into · presentation/voice-agents-deck · strength=3
Building Real-Time Voice Agents deck
This belongs in the real-time transport section.