Text-to-speech (TTS) converts the text produced by the language model back into spoken audio. Current neural TTS engines (ElevenLabs, OpenAI, Microsoft Azure Neural TTS, Google WaveNet) sound nearly indistinguishable from a human voice to most callers.
For telephony, four factors decide quality: low first-byte latency (which calls for streaming TTS), clean pronunciation of proper names and numbers (dates, phone numbers), language coverage for the target locale, and stable output on long responses, with no glitches at pauses.
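One common way to get clean pronunciation of phone numbers is to normalize the text before it reaches the TTS engine, so that digits are read one by one rather than as a large cardinal number. The following is a minimal sketch of that idea, not any particular engine's method. The names `PHONE_RE`, `speak_digits`, and `normalize_phone_numbers` are hypothetical, and the pattern is a deliberately simple assumption that will also match other digit groups such as ISO dates.

```python
import re

# English digit names; a real deployment would need one table per locale.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Assumed pattern (hypothetical): optional "+", then at least seven
# characters of digits and separators, starting and ending with a digit.
# re.ASCII restricts \d to 0-9 so every matched digit is in DIGIT_WORDS.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{5,}\d", re.ASCII)


def speak_digits(number: str) -> str:
    """Spell a matched number digit by digit, dropping separators."""
    words = []
    if number.startswith("+"):
        words.append("plus")
    words.extend(DIGIT_WORDS[ch] for ch in number if ch.isdigit())
    return " ".join(words)


def normalize_phone_numbers(text: str) -> str:
    """Replace every phone-like span in text with its spoken form."""
    return PHONE_RE.sub(lambda m: speak_digits(m.group(0)), text)


if __name__ == "__main__":
    print(normalize_phone_numbers("Call 555-0142 now."))
```

Engines that accept SSML can often achieve a similar effect with a `say-as` hint instead of rewriting the text. Pre-normalizing is simply the approach that does not depend on engine-specific support.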
Brand voices are produced via voice cloning: a synthetic voice profile is built from 30 seconds to 10 minutes of source audio. Deployment requires a GDPR review and explicit consent from the person whose voice is being cloned.