A realtime API is a server-to-server streaming interface: voice input is streamed to a speech-capable model in real time, and the model's audio output is played back directly, typically over WebSocket, WebRTC, or bidirectional gRPC. Compared with the classic STT → LLM → TTS pipeline, this cuts several hundred milliseconds of latency, because no separate transcription and synthesis steps sit between the caller and the model.
Architecturally, this changes the conversational feel: shorter pauses, more natural turn-taking, and much better barge-in (the caller interrupting the model mid-utterance). There are two prerequisites: a media bridge that connects telephony to the model (SIP trunk → RTP bridge → realtime API), and a tool layer that runs function calls with the same low latency as the audio pipeline itself.
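The tool-layer requirement can be sketched as a dispatcher that enforces a latency budget on every function call. This is a minimal illustration, not any vendor's API: the registry `TOOLS`, the decorator `tool`, the function `run_tool_call`, the two example tools, and the 300 ms default budget are all hypothetical names and values chosen for the example.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict

ToolFn = Callable[..., Awaitable[Any]]

# Hypothetical in-process registry: tool name -> async implementation.
TOOLS: Dict[str, ToolFn] = {}


def tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Register an async function under the name the model will call."""
    def register(fn: ToolFn) -> ToolFn:
        TOOLS[name] = fn
        return fn
    return register


@tool("lookup_order")
async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a fast backend lookup
    return {"order_id": order_id, "status": "shipped"}


@tool("slow_crm_search")
async def slow_crm_search(query: str) -> dict:
    await asyncio.sleep(2.0)  # stands in for a backend that is too slow for voice
    return {"hits": []}


async def run_tool_call(name: str, args: dict, budget_s: float = 0.3) -> dict:
    """Run one tool call, giving up once the latency budget is exceeded.

    The budget value is an assumption; pick it to match the audio pipeline.
    """
    started = time.perf_counter()
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": "unknown_tool", "elapsed_ms": 0}
    try:
        result = await asyncio.wait_for(fn(**args), timeout=budget_s)
        outcome = {"ok": True, "result": result}
    except asyncio.TimeoutError:
        # Return a structured failure so the model can apologise or retry
        # instead of leaving the caller in silence.
        outcome = {"ok": False, "error": "timeout"}
    outcome["elapsed_ms"] = round((time.perf_counter() - started) * 1000)
    return outcome


if __name__ == "__main__":
    print(asyncio.run(run_tool_call("lookup_order", {"order_id": "A1"})))
    print(asyncio.run(run_tool_call("slow_crm_search", {"query": "x"}, budget_s=0.1)))
```

The design choice here is that a slow tool yields an explicit timeout result rather than blocking the conversation; whether the model then retries, stalls with a filler phrase, or hands off is a separate policy decision.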
Operational risks: higher per-minute cost, abrupt model updates on the vendor side, and harder debugging, because there is no cleanly separated transcript step to inspect. Robust setups record audio, a transcript snapshot, and tool calls in parallel so that incidents stay reproducible.
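One way to keep the three records aligned is to write raw audio to one file and log every event against the current audio byte offset, so that a transcript snapshot or tool call can later be located in the recording. The sketch below assumes a single mixed audio file per call and a JSON Lines event log; the class `CallRecorder`, the helper `load_events`, the file naming, and the field names are illustrative assumptions, not a standard format.

```python
import json
import time
from pathlib import Path
from typing import Any, List


class CallRecorder:
    """Append-only recorder for one call: raw audio plus a JSONL event log."""

    def __init__(self, directory: str, call_id: str) -> None:
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.call_id = call_id
        self._audio = open(self.dir / f"{call_id}.pcm", "ab")
        self._events = open(self.dir / f"{call_id}.events.jsonl", "a", encoding="utf-8")
        self._audio_bytes = 0  # bytes of audio written so far

    def _event(self, kind: str, **payload: Any) -> None:
        # Every event carries a wall-clock time and the audio offset at which
        # it occurred, which ties transcript and tool calls to the recording.
        record = {
            "t": time.time(),
            "call_id": self.call_id,
            "kind": kind,
            "audio_offset": self._audio_bytes,
            **payload,
        }
        self._events.write(json.dumps(record) + "\n")
        self._events.flush()

    def audio(self, direction: str, chunk: bytes) -> None:
        """Store one audio chunk; direction is 'in' (caller) or 'out' (model)."""
        self._event("audio", direction=direction, size=len(chunk))
        self._audio.write(chunk)
        self._audio_bytes += len(chunk)

    def transcript_snapshot(self, text: str) -> None:
        self._event("transcript", text=text)

    def tool_call(self, name: str, args: dict, result: Any) -> None:
        self._event("tool_call", name=name, args=args, result=result)

    def close(self) -> None:
        self._audio.close()
        self._events.close()


def load_events(directory: str, call_id: str) -> List[dict]:
    """Read the event log back in order, e.g. to replay an incident."""
    path = Path(directory) / f"{call_id}.events.jsonl"
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Replaying an incident then amounts to reading the events in order and seeking to each `audio_offset` in the `.pcm` file. A production setup would likely split inbound and outbound audio and handle retention and privacy requirements, which this sketch leaves out.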