Latency in a telephony context is the time between the end of the caller’s utterance and the first syllable of the assistant’s response. It is additive: speech-to-text (STT) processing + LLM inference + text-to-speech (TTS) synthesis + the phone system’s audio pipeline.
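Because the total is a plain sum, a per-turn budget can be written down directly. The sketch below is illustrative only: the `TurnLatency` class, its field names, and the millisecond figures are hypothetical, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """Per-stage latency for one conversational turn, in milliseconds."""
    stt_ms: float        # end of caller speech -> usable transcript
    llm_ms: float        # transcript -> first generated text
    tts_ms: float        # first text -> first synthesised audio
    telephony_ms: float  # phone system's audio pipeline

    def total_ms(self) -> float:
        # The stages run one after another, so the totals simply add up.
        return self.stt_ms + self.llm_ms + self.tts_ms + self.telephony_ms


# Hypothetical figures, chosen only to show the arithmetic.
turn = TurnLatency(stt_ms=150, llm_ms=350, tts_ms=120, telephony_ms=80)
print(turn.total_ms())
```

Keeping each stage as its own field makes it obvious which term to attack when the total is too high.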
Field thresholds: under 700 ms feels natural; 700–1500 ms is perceptible; above 1500 ms produces "hello, are you still there?" awkwardness. Streaming STT and streaming TTS are mandatory: batch processing waits for the complete utterance or the complete response before producing any output, so it cannot meet these targets.
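The thresholds translate into a small classifier that can tag each measured turn. This is a minimal sketch: the function name and the labels are made up for illustration, and treating exactly 700 ms and exactly 1500 ms as "perceptible" is an assumption about how the boundaries are meant to be read.

```python
def classify_latency(total_ms: float) -> str:
    """Map a turn's total latency onto the three field bands."""
    if total_ms < 700:
        return "natural"
    if total_ms <= 1500:
        # Assumption: both boundary values fall in the middle band.
        return "perceptible"
    return "awkward"


for sample_ms in (450, 700, 1200, 1800):
    print(sample_ms, classify_latency(sample_ms))
```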
Optimisation starts with measurement in production: where do the milliseconds go? Model size, inference region, the codec on the phone leg, and caching of frequent responses usually offer the most leverage.
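One way to answer the measurement question is to record per-stage timings for every production turn and compare a high percentile across stages. The sketch below assumes each turn is logged as a dictionary of stage name to milliseconds. The stage names, the sample values, and the choice of the 95th percentile (nearest-rank method) are all illustrative assumptions.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slowest_stage(turns, pct=95):
    """Return (stage with the highest percentile latency, per-stage summary)."""
    by_stage = defaultdict(list)
    for turn in turns:
        for stage, ms in turn.items():
            by_stage[stage].append(ms)
    summary = {stage: percentile(ms, pct) for stage, ms in by_stage.items()}
    return max(summary, key=summary.get), summary


# Hypothetical per-turn timings in milliseconds.
turns = [
    {"stt": 140, "llm": 320, "tts": 110, "telephony": 80},
    {"stt": 160, "llm": 900, "tts": 120, "telephony": 85},
    {"stt": 150, "llm": 410, "tts": 115, "telephony": 90},
]
stage, summary = slowest_stage(turns)
print(stage, summary)
```

Comparing a high percentile rather than the mean keeps occasional slow turns visible, and those are the turns callers notice.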