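Word Error Rate (WER) is the standard metric for speech-to-text quality. It counts the word-level substitutions, deletions and insertions needed to turn the system's transcript into the reference transcription, divided by the number of words in the reference. A WER of 5 % means roughly one error for every twenty reference words. Because insertions count as errors, WER can exceed 100 %.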
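The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the function name `word_error_rate` is made up for this example, and the normalization (lowercasing and whitespace splitting) is an assumption. Real evaluations usually also normalize punctuation, numbers and spelling variants before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("metformin" -> "met") plus one insertion ("forming")
# against five reference words.
print(word_error_rate(
    "please refill my metformin prescription",
    "please refill my met forming prescription",
))  # 0.4
```

Dividing by the reference length rather than the hypothesis length is what allows the result to rise above 1.0 when the system inserts many extra words.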
WER only becomes meaningful when measured on real calls from your own use case, not on the clean studio material a vendor publishes. Accents, narrowband telephone audio (8 kHz sampling), background noise and domain vocabulary (medication names, product SKUs) routinely push real-world WER 2–3× higher than vendor benchmarks.
Entity-level WER is often more important operationally than global WER: is the patient name correct? Is the phone number right? A pipeline with slot validation ("I understood ‘Meier’, is that correct?") can deliver reliable outcomes even when raw WER is higher.
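One simple way to put a number on entity-level accuracy is to compare extracted slots rather than raw words. The sketch below assumes the slots have already been extracted from both the reference and the recognized transcript. The slot names (`patient_name`, `phone`), the helper names and the normalization rule are all hypothetical, chosen only for illustration.

```python
import re


def normalize(value: str) -> str:
    """Assumed normalization: casefold and drop everything except letters and digits."""
    return re.sub(r"[\W_]+", "", value.casefold())


def slot_error_rate(reference: dict, hypothesis: dict):
    """Return the share of reference slots that were recognized wrongly, plus their names."""
    if not reference:
        raise ValueError("reference must contain at least one slot")
    wrong = [
        name
        for name, value in reference.items()
        # A slot missing from the hypothesis counts as wrong.
        if normalize(hypothesis.get(name, "")) != normalize(value)
    ]
    return len(wrong) / len(reference), wrong


reference = {"patient_name": "Meier", "phone": "+49 30 1234-567"}
hypothesis = {"patient_name": "Maier", "phone": "+49 (30) 1234567"}

rate, wrong = slot_error_rate(reference, hypothesis)
print(rate, wrong)  # 0.5 ['patient_name']
```

Here the phone number counts as correct despite different formatting, while the misspelled name is flagged. The flagged slots are the ones a confirmation prompt such as "I understood ‘Meier’, is that correct?" would need to catch.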