Glossary

VAD (Voice Activity Detection)

Detection of whether the audio currently contains speech or just silence/background noise. Prerequisite for barge-in, turn-taking and efficient STT (no processing during silence).

Voice activity detection (VAD) classifies each audio frame as "speech" or "non-speech". It is the invisible foundation for barge-in, turn-taking and stopping STT at the end of an utterance. Poor VAD is a common cause of robotic call behaviour, for example an agent that talks over the caller or waits too long before replying.

Modern neural VAD models such as Silero VAD emit a speech probability per frame with millisecond-scale latency; the older WebRTC VAD returns a binary speech/non-speech decision per frame instead. In production the raw output is combined with hysteresis (a short hold after speech ends) and energy gating to suppress coughs, door slams and background noise.
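The hold and energy gate described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the names `GateConfig` and `smooth_decisions`, the RMS floor and the hold length are all hypothetical, and the per-frame probabilities are assumed to come from whichever VAD model is in use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class GateConfig:
    # All values are illustrative assumptions, not recommendations.
    threshold: float = 0.5    # speech probability needed to count as speech
    hold_frames: int = 10     # frames to keep reporting speech after it ends
    min_energy: float = 0.01  # RMS floor; quieter frames never count as speech


def smooth_decisions(
    frames: Iterable[Tuple[float, float]],
    cfg: GateConfig = GateConfig(),
) -> List[bool]:
    """Turn (speech_probability, rms_energy) pairs into smoothed decisions."""
    decisions: List[bool] = []
    hold_left = 0
    for prob, energy in frames:
        # Energy gate: a confident model output on a near-silent frame is ignored.
        if prob >= cfg.threshold and energy >= cfg.min_energy:
            hold_left = cfg.hold_frames
            decisions.append(True)
        elif hold_left > 0:
            # Hold: stay in "speech" briefly so short dips do not cut the turn.
            hold_left -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```

With `hold_frames=2`, a single speech frame followed by quiet frames keeps the decision at "speech" for two more frames before it drops, and a high-probability frame below the energy floor is rejected outright.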

Operational knobs: frame size (20–30 ms), threshold (often 0.5 as a starting point), minimum speech duration (~150 ms), and minimum silence duration before endpoint (~400 ms). Fixing these values globally tends to hurt at least one use case, so calibrating them per industry or per noise profile is standard practice.
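To show how these knobs interact, here is a small endpointing sketch that turns a stream of per-frame speech probabilities into utterance boundaries. The names `VadParams` and `find_utterances` are hypothetical, the defaults simply mirror the starting values listed above, and a real pipeline would work on streaming audio rather than a finished list.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class VadParams:
    frame_ms: int = 20         # duration of one audio frame
    threshold: float = 0.5     # probability at or above which a frame is speech
    min_speech_ms: int = 150   # shorter bursts are discarded as noise
    min_silence_ms: int = 400  # silence needed before the utterance is closed


def find_utterances(
    probs: Iterable[float],
    p: VadParams = VadParams(),
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each detected utterance."""
    utterances: List[Tuple[int, int]] = []
    start: Optional[int] = None  # frame index where the candidate began
    speech_frames = 0            # speech frames seen in the candidate
    silence_run = 0              # consecutive non-speech frames
    total = 0
    for i, prob in enumerate(probs):
        total = i + 1
        if prob >= p.threshold:
            if start is None:
                start, speech_frames = i, 0
            speech_frames += 1
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run * p.frame_ms >= p.min_silence_ms:
                end = i - silence_run + 1  # first frame of the closing silence
                if speech_frames * p.frame_ms >= p.min_speech_ms:
                    utterances.append((start * p.frame_ms, end * p.frame_ms))
                start, silence_run = None, 0
    # Flush a candidate that is still open when the audio ends.
    if start is not None and speech_frames * p.frame_ms >= p.min_speech_ms:
        end = total - silence_run
        utterances.append((start * p.frame_ms, end * p.frame_ms))
    return utterances
```

Per-profile calibration then amounts to passing a different `VadParams`: a noisy call-centre profile might raise `threshold` and `min_speech_ms`, while a profile for slow, deliberate speakers might lengthen `min_silence_ms` to avoid cutting callers off mid-sentence.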

