Turn-taking describes how a voice-AI system decides when to speak and when to listen. Poor turn-taking creates a "walkie-talkie" feel: either the system interrupts constantly or it waits painfully long before replying.

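Good turn-taking heuristics combine voice-activity detection (VAD), prosodic and lexical end-of-turn signals (falling intonation, closing phrases such as "…alright then."), pause length, and the LLM’s semantic endpoint prediction. A typical target is for the system to start responding 250–500 ms after the caller pauses, and to extend that window dynamically when the pause looks like a thinking pause rather than the end of a turn.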
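The combination of signals described above can be sketched as a small decision function. This is a minimal illustration, not a reference implementation: the field names, the 0.8 confidence threshold, and the 700 ms extension are assumptions chosen for the example, and only the 250–500 ms window comes from the text.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    """Per-frame inputs; all field names are hypothetical."""
    silence_ms: float         # time since VAD last detected caller speech
    prosody_final: bool       # falling intonation or closing phrase detected
    semantic_complete: float  # 0..1 score from the LLM that the utterance is finished
    thinking_pause: bool      # filler word or trailing conjunction detected


def required_silence_ms(
    s: TurnSignals,
    base_ms: float = 500.0,       # upper end of the 250-500 ms window
    floor_ms: float = 250.0,      # lower end of the window
    extension_ms: float = 700.0,  # assumed extension for thinking pauses
) -> float:
    """Return how much silence to wait for before taking the turn."""
    wait = base_ms
    # Strong end-of-turn evidence lets the system answer at the fast end.
    if s.prosody_final and s.semantic_complete >= 0.8:
        wait = floor_ms
    # A detectable thinking pause extends the window instead of cutting in.
    if s.thinking_pause:
        wait += extension_ms
    return wait


def should_respond(s: TurnSignals) -> bool:
    """True once the caller has been silent for long enough."""
    return s.silence_ms >= required_silence_ms(s)
```

With these assumed values, a caller who finishes a clearly complete sentence is answered after 250 ms of silence, while a caller who trails off on a filler word gets at least 950 ms before the system speaks.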

In production, context-dependent profiles pay off: outbound sales can react slightly faster, while support for elderly callers should react slightly slower. Useful KPIs are the share of interrupted caller utterances, the mean response latency, and the drop-off rate after latency spikes.
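As a rough sketch of how such profiles and KPIs might be represented, the following assumes a simple per-turn log. The profile names, the wait values (picked from within the 250–500 ms window), the `Turn` fields, and the 1500 ms spike threshold are all illustrative assumptions, not measured recommendations.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Profile:
    name: str
    base_wait_ms: float  # silence to wait for before responding


# Hypothetical profiles: faster for outbound sales, slower for elderly callers.
PROFILES = {
    "outbound_sales": Profile("outbound_sales", 300.0),
    "default": Profile("default", 400.0),
    "support_elderly": Profile("support_elderly", 500.0),
}


@dataclass
class Turn:
    response_latency_ms: float  # caller stopped speaking -> system started
    caller_interrupted: bool    # system spoke while the caller was still talking
    call_dropped_after: bool    # caller hung up right after this turn


def turn_taking_kpis(turns: list[Turn], spike_ms: float = 1500.0) -> dict[str, Optional[float]]:
    """Compute the three KPIs over a non-empty list of logged turns."""
    if not turns:
        raise ValueError("need at least one turn")
    spikes = [t for t in turns if t.response_latency_ms >= spike_ms]
    return {
        "interrupted_share": sum(t.caller_interrupted for t in turns) / len(turns),
        "mean_latency_ms": mean(t.response_latency_ms for t in turns),
        # None when no latency spike occurred, so the rate is undefined.
        "dropoff_after_spike": (
            sum(t.call_dropped_after for t in spikes) / len(spikes) if spikes else None
        ),
    }
```

Tracking these values per profile makes it possible to see whether a faster setting actually raises the interruption share for a given caller group.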
