Glossary

Realtime API

Streaming interfaces (e.g. OpenAI Realtime, Google Live API) that process audio directly, without the intermediate STT → LLM → TTS steps. This can bring response latency below 500 ms.

A realtime API is a server-to-server connection over which voice input is streamed to a speech-capable model in real time and the model's audio output is played back directly, typically over WebSocket, WebRTC or bidirectional gRPC. Compared to the classic STT → LLM → TTS pipeline, this cuts several hundred milliseconds of latency.
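As a rough illustration of where the saving comes from, the sketch below sums the time to first audio for both architectures. All stage names and millisecond values are assumed placeholders chosen for illustration, not measurements of any vendor or deployment.

```python
# Illustrative latency budget. Every number here is an assumed placeholder,
# not a benchmark of a real STT, LLM, TTS or realtime service.
CLASSIC_PIPELINE_MS = {
    "stt_finalization": 300,   # wait for the final transcript
    "llm_first_token": 350,    # text model starts answering
    "tts_first_audio": 200,    # synthesizer emits the first audio chunk
}

REALTIME_MS = {
    "speech_model_first_audio": 400,  # one model, audio in and audio out
}


def time_to_first_audio(stages: dict[str, int]) -> int:
    """Sequential stages add up, so the budget is the sum of all stages."""
    return sum(stages.values())


classic = time_to_first_audio(CLASSIC_PIPELINE_MS)
realtime = time_to_first_audio(REALTIME_MS)
saving = classic - realtime

print(f"classic: {classic} ms, realtime: {realtime} ms, saving: {saving} ms")
```

Replacing the placeholder values with timings measured on your own calls shows whether the switch pays off for a given setup.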

This architecture changes how the conversation feels: shorter pauses, more natural turn-taking, and much better barge-in (the caller can interrupt the agent mid-sentence). Prerequisites are a media bridge between telephony and the model (SIP trunk → RTP bridge → realtime API) and a tool layer that runs function calls with the same low latency as the audio pipeline itself.
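One way to keep the tool layer inside the audio pipeline's latency budget is to run every function call under a deadline and return a fallback status when it is missed. The following is a minimal sketch using Python's asyncio; the tool functions (`lookup_slot`, `slow_crm_lookup`) and the 300 ms budget are hypothetical and only serve the example.

```python
import asyncio
import time

TOOL_BUDGET_S = 0.3  # assumed budget; tune to your own latency target


async def lookup_slot(date: str) -> dict:
    """Hypothetical fast tool, simulated with a short sleep."""
    await asyncio.sleep(0.01)
    return {"date": date, "free": True}


async def slow_crm_lookup(customer_id: str) -> dict:
    """Hypothetical slow tool that would stall the conversation."""
    await asyncio.sleep(1.0)
    return {"customer_id": customer_id}


async def run_tool(tool, *args, budget_s: float = TOOL_BUDGET_S) -> dict:
    """Run one tool call under a deadline and report how long it took."""
    started = time.perf_counter()
    try:
        result = await asyncio.wait_for(tool(*args), timeout=budget_s)
        status = "ok"
    except asyncio.TimeoutError:
        # The call is cancelled; the agent can say "one moment" instead of pausing.
        result = None
        status = "timeout"
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {"status": status, "result": result, "elapsed_ms": elapsed_ms}


fast = asyncio.run(run_tool(lookup_slot, "2025-01-15"))
slow = asyncio.run(run_tool(slow_crm_lookup, "c-42", budget_s=0.05))
print(fast["status"], slow["status"])
```

Logging `elapsed_ms` per call makes it visible which tools regularly exceed the budget and need caching or prefetching.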

Operational risks: higher per-minute cost, abrupt model updates on the vendor side, and harder debugging (no cleanly separated transcript step). Robust setups record audio, a transcript snapshot, and tool calls in parallel so that incidents stay reproducible.
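The parallel recording can be sketched as a single per-call event log in which audio chunks, transcript snapshots and tool calls share one timeline. The class and field names below are hypothetical; audio is referenced by file and byte offset rather than embedded, which is an assumption of this sketch and not a requirement.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class CallEvent:
    t_ms: int      # milliseconds since the recorder was created
    kind: str      # "audio", "transcript" or "tool_call"
    payload: dict  # event-specific details


@dataclass
class CallRecorder:
    call_id: str
    events: list = field(default_factory=list)
    started: float = field(default_factory=time.monotonic)

    def log(self, kind: str, **payload) -> CallEvent:
        """Append one event, timestamped on the shared call timeline."""
        t_ms = int((time.monotonic() - self.started) * 1000)
        event = CallEvent(t_ms=t_ms, kind=kind, payload=payload)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        """Serialize the call as one JSON object per line, in logged order."""
        return "\n".join(
            json.dumps({"call_id": self.call_id, **asdict(e)}) for e in self.events
        )


recorder = CallRecorder(call_id="demo-call-1")
recorder.log("audio", direction="inbound", file="demo-call-1.wav",
             byte_offset=0, byte_count=3200)
recorder.log("transcript", speaker="caller", text="I'd like to move my appointment.")
recorder.log("tool_call", name="lookup_slot",
             arguments={"date": "2025-01-15"}, status="ok")
print(recorder.to_jsonl())
```

Because all three streams carry the same clock, an incident can be replayed by reading the log in order and seeking to the referenced audio offsets.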

FAQ
Is a realtime API worth it for every use case?
For demanding dialogues — outbound sales, sensitive support, industries with high expectations on natural speech — yes. For plain appointment booking the classic pipeline stack with good latency tuning is often enough.