Glossary

STT / ASR (Speech-to-Text)

Converts spoken language to text. Also called ASR (Automatic Speech Recognition). Quality drives understanding rate; specialized models per language are essential.

Speech-to-text (STT), also called Automatic Speech Recognition (ASR), converts spoken language into text. In an AI phone assistant, STT is the first step of the pipeline and dominates downstream quality: a single misheard word can lead intent detection to the wrong result.

Streaming STT is mandatory: the system has to emit partial hypotheses while the caller is still speaking, otherwise every downstream stage waits for the end of the utterance and the delays add up. Models specialized for the target language (for example Whisper variants fine-tuned on German, or language-specific models from providers such as Deepgram and Azure) typically outperform generic multilingual models by a wide margin.
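To make the idea of partial hypotheses concrete, here is a minimal Python sketch. It does not call any real STT service: `fake_streaming_recognizer`, `Hypothesis`, and `consume` are hypothetical names invented for illustration, and each "chunk" is simply a word standing in for a decoded audio frame. The point is only the shape of the interaction: partial results arrive while input is still coming in, and the consumer records how long it took to see the first one.

```python
import time
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Hypothesis(NamedTuple):
    text: str
    is_final: bool


def fake_streaming_recognizer(chunks: Iterable[str]) -> Iterator[Hypothesis]:
    """Hypothetical stand-in for a streaming STT client (no vendor API).

    Each chunk is one word that a real recognizer would have decoded
    from an incoming audio frame.
    """
    words: List[str] = []
    for word in chunks:
        words.append(word)
        # Emit a partial hypothesis as soon as new audio has been decoded.
        yield Hypothesis(" ".join(words), is_final=False)
    # Emit the final hypothesis once the caller has stopped speaking.
    yield Hypothesis(" ".join(words), is_final=True)


def consume(
    hypotheses: Iterable[Hypothesis],
) -> Tuple[List[str], str, Optional[float]]:
    """Collect partials, the final text, and the time to the first hypothesis."""
    start = time.monotonic()
    time_to_first: Optional[float] = None
    partials: List[str] = []
    final_text = ""
    for hyp in hypotheses:
        if time_to_first is None:
            time_to_first = time.monotonic() - start
        if hyp.is_final:
            final_text = hyp.text
        else:
            partials.append(hyp.text)
    return partials, final_text, time_to_first


partials, final_text, time_to_first = consume(
    fake_streaming_recognizer(["i", "need", "an", "appointment"])
)
```

In a real pipeline, a downstream stage such as intent detection could start working on the partials instead of waiting for `final_text`, which is where the latency saving comes from.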

In production, three numbers matter: word error rate (WER) on realistic telephony audio (8 kHz, background noise), robustness to dialects and proper names, and latency to the first hypothesis. A WER above 12% in a phone context is not production-ready.
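For readers who want to see how WER is computed, the following Python sketch uses the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the STT output, divided by the number of reference words. The function names, the lowercase and whitespace normalization, and the example sentences are assumptions made for illustration. Only the 12% threshold comes from the text above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


MAX_PRODUCTION_WER = 0.12  # threshold stated in the glossary text


def is_production_ready(wer: float) -> bool:
    return wer <= MAX_PRODUCTION_WER


# Illustrative example: one substituted word out of eight.
wer = word_error_rate(
    "please move my appointment to tuesday at nine",
    "please move my appointment to thursday at nine",
)
ready = is_production_ready(wer)
```

Here one wrong word out of eight gives a WER of 12.5%, just over the threshold, and the error happens to fall on the one word that carries the date. In practice WER is usually reported over a whole test set (total errors divided by total reference words) rather than per utterance.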
