Speech-to-text (STT), also called Automatic Speech Recognition (ASR), converts spoken language into text. In an AI phone assistant, STT is the first stage of the pipeline and dominates downstream quality: a single misheard word can corrupt intent detection and every step that builds on it.
Streaming STT is mandatory: the system has to emit partial hypotheses while the caller is still speaking. Otherwise recognition only starts once the utterance has ended, and that delay is added to every later stage. Models specialised for the target language (e.g. Whisper variants fine-tuned on German, or the language-specific models offered by Deepgram and Azure) typically outperform generic multilingual models by a wide margin.
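To make the idea of partial hypotheses concrete, here is a minimal sketch of a streaming loop. It is not any vendor's API: `ToyRecognizer`, `Hypothesis`, `chunk_audio` and `stream_transcribe` are hypothetical names, the 20 ms frame size and 16-bit PCM format are assumptions, and the toy recogniser simply reveals one scripted word per 200 ms of audio so that the control flow can be run without a real engine.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 8000   # telephony audio, as in the text
BYTES_PER_SAMPLE = 2    # assumption: 16-bit linear PCM
CHUNK_MS = 20           # assumption: 20 ms frames


@dataclass
class Hypothesis:
    text: str
    is_final: bool
    audio_ms: int  # audio consumed when this hypothesis was emitted


def chunk_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split raw PCM into fixed-size frames, as a telephony stream would deliver them."""
    step = SAMPLE_RATE_HZ * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


class ToyRecognizer:
    """Hypothetical stand-in for a streaming engine: one scripted word per ms_per_word of audio."""

    def __init__(self, script: List[str], ms_per_word: int = 200) -> None:
        self.script = script
        self.ms_per_word = ms_per_word
        self.audio_ms = 0

    def feed(self, chunk: bytes) -> str:
        # Track how much audio has arrived and return the current best guess.
        self.audio_ms += len(chunk) * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
        n_words = min(len(self.script), self.audio_ms // self.ms_per_word)
        return " ".join(self.script[:n_words])


def stream_transcribe(chunks: Iterable[bytes], recognizer: ToyRecognizer) -> Iterator[Hypothesis]:
    """Yield a partial hypothesis whenever the text changes, then one final hypothesis."""
    last = ""
    for chunk in chunks:
        text = recognizer.feed(chunk)
        if text != last:
            last = text
            yield Hypothesis(text, is_final=False, audio_ms=recognizer.audio_ms)
    yield Hypothesis(last, is_final=True, audio_ms=recognizer.audio_ms)


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    recognizer = ToyRecognizer(["i", "need", "an", "appointment"])
    for hyp in stream_transcribe(chunk_audio(one_second_of_silence), recognizer):
        print(hyp)
```

The point of the sketch is that downstream components can start working on the first partial result well before the final one arrives. With a real engine, partial hypotheses may also be revised rather than only extended.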
In production, three numbers matter: word error rate (WER) on realistic telephony audio (8 kHz, background noise), robustness to dialects and proper names, and latency to the first hypothesis. As a rule of thumb, a system with a WER above 12 % on phone audio is not production-ready.
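WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recogniser output, divided by the number of reference words. The following is a minimal sketch under simplifying assumptions: both strings are already normalised (casing, punctuation, number formats) and are tokenised on whitespace. The function name `wer` and the constant `MAX_WER` are illustrative; the 12 % threshold is the one quoted above.

```python
from typing import List

MAX_WER = 0.12  # threshold from the text; treat it as a rule of thumb


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "i would like to book an appointment for monday"
    hypothesis = "i would like to book appointment for sunday"
    score = wer(reference, hypothesis)
    print(f"WER = {score:.1%}, production-ready: {score <= MAX_WER}")
```

In this example one word is dropped and one is substituted, so two errors over nine reference words put the utterance well above the threshold. In practice the score is aggregated over a whole test set of telephony recordings, not judged per utterance.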