Voice-activity detection (VAD) classifies audio frames as "speech" or "non-speech". It is the invisible foundation for barge-in, turn-taking and stopping STT at the end of an utterance. Poor VAD is one of the most common causes of robotic call behaviour.
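Modern neural VAD models (Silero VAD and similar encoders with millisecond-level latency) emit a speech probability per frame; the older WebRTC VAD returns a binary speech/non-speech decision per frame instead. In production the raw output is combined with hysteresis (separate on and off thresholds plus a short hold after speech ends) and energy gating to suppress coughs, doors and background noise.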
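A minimal sketch of how hysteresis and energy gating might be layered on top of per-frame probabilities. The function names, the 0.35 off-threshold, the -45 dBFS energy floor and the 10-frame hold are illustrative assumptions, not values taken from Silero, WebRTC VAD or any other library.

```python
import math
from typing import List, Sequence


def frame_rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level of one frame of float samples in [-1, 1], in dBFS."""
    if not samples:
        return -math.inf
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0.0:
        return -math.inf
    return 10.0 * math.log10(mean_square)


def gate_and_smooth(
    probs: Sequence[float],
    levels_dbfs: Sequence[float],
    on_threshold: float = 0.5,         # probability needed to enter speech
    off_threshold: float = 0.35,       # assumed lower bar to stay in speech
    energy_floor_dbfs: float = -45.0,  # assumed gate; tune per noise profile
    hold_frames: int = 10,             # assumed hold after speech ends
) -> List[bool]:
    """Turn raw per-frame probabilities into smoothed speech decisions."""
    decisions: List[bool] = []
    active = False
    hold = 0
    for p, level in zip(probs, levels_dbfs):
        loud_enough = level >= energy_floor_dbfs
        if not active:
            # Entering speech needs the higher threshold and enough energy.
            if p >= on_threshold and loud_enough:
                active = True
                hold = hold_frames
        elif p >= off_threshold and loud_enough:
            # Still speech: refresh the hold counter.
            hold = hold_frames
        elif hold > 0:
            # Below the off threshold: ride out the hold period.
            hold -= 1
        else:
            active = False
        decisions.append(active)
    return decisions
```

The two thresholds keep the decision from flickering when the probability hovers around 0.5, and the energy floor rejects frames the model scores highly but that are too quiet to be the caller.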
Operational knobs: frame size (20–30 ms), threshold (often 0.5 as a starting point), minimum speech duration (~150 ms), and minimum silence duration before endpoint (~400 ms). Pinning these values globally tends to hurt at least one use case, so calibration per industry or per noise profile is standard.
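The knobs above can be sketched as a small configuration object plus a segmenter that turns per-frame probabilities into speech segments. `VadConfig` and `find_segments` are hypothetical names for illustration, and the simplified logic (a segment closes once the silence run reaches the minimum, and is discarded if it is shorter than the minimum speech duration) is an assumption rather than the behaviour of any specific VAD library.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class VadConfig:
    frame_ms: int = 20          # frame size
    threshold: float = 0.5      # speech probability threshold
    min_speech_ms: int = 150    # shorter bursts are discarded
    min_silence_ms: int = 400   # silence needed to declare an endpoint


def find_segments(
    probs: Iterable[float], cfg: VadConfig = VadConfig()
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) speech segments from frame probabilities."""
    # Ceiling division: convert durations to whole frames.
    min_speech_frames = -(-cfg.min_speech_ms // cfg.frame_ms)
    min_silence_frames = -(-cfg.min_silence_ms // cfg.frame_ms)

    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first speech frame of the open segment
    last_speech = -1             # most recent speech frame seen

    def close() -> None:
        # Keep the segment only if it is long enough to count as speech.
        if last_speech - start + 1 >= min_speech_frames:
            segments.append(
                (start * cfg.frame_ms, (last_speech + 1) * cfg.frame_ms)
            )

    for i, p in enumerate(probs):
        if p >= cfg.threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            close()  # endpoint: enough trailing silence
            start = None
    if start is not None:
        close()  # stream ended while a segment was still open
    return segments


# Per-profile calibration: a hypothetical fast-paced profile that
# endpoints after 200 ms of silence instead of 400 ms.
fast_profile = VadConfig(min_silence_ms=200)
```

With the defaults, a 240 ms pause inside an utterance is bridged into one segment, while `fast_profile` splits the same audio into two turns; this is the kind of trade-off per-profile calibration is meant to settle.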