
TTS (Text-to-Speech)

Converts text into spoken audio. Modern neural TTS sounds nearly human. Engines differ in latency, language coverage, and voice-cloning capability.

Text-to-speech (TTS) converts text produced by the language model back into spoken audio. Current neural TTS engines (ElevenLabs, OpenAI, Microsoft Neural, Google WaveNet) sound nearly indistinguishable from a human voice to most callers.

For telephony, four factors decide quality: low first-byte latency (which requires streaming TTS), clean pronunciation of proper names and numeric content (dates, phone numbers), language coverage in the target locale, and stability on long responses, with no glitches at pauses.
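First-byte latency is the easiest of these factors to measure directly. The following is a minimal sketch, assuming the TTS engine exposes its output as an iterable of audio byte chunks. The names `measure_stream`, `StreamTiming`, and `fake_tts_stream` are hypothetical and do not belong to any vendor SDK; the fake stream only stands in for a real streaming response.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    first_chunk_s: Optional[float]  # seconds until the first non-empty chunk, None if no audio arrived
    total_s: float                  # seconds until the stream was exhausted
    total_bytes: int                # total audio bytes received


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a chunked audio stream and record when the first audio arrived."""
    start = clock()
    first_chunk_s: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        # Only a non-empty chunk counts as "first byte".
        if first_chunk_s is None and chunk:
            first_chunk_s = clock() - start
        total_bytes += len(chunk)
    return StreamTiming(first_chunk_s, clock() - start, total_bytes)


def fake_tts_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption, not a real API)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # simulated synthesis and network delay per chunk
        yield b"\x00" * 320   # placeholder audio payload of 320 bytes


if __name__ == "__main__":
    timing = measure_stream(fake_tts_stream())
    print(f"first chunk after {timing.first_chunk_s:.3f}s, "
          f"{timing.total_bytes} bytes in {timing.total_s:.3f}s")
```

With a real engine, the iterable passed to `measure_stream` would be the chunk iterator of the streaming response. Comparing `first_chunk_s` against `total_s` shows how much of the wait streaming removes before the caller hears anything.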

Brand voices are produced via voice cloning: a synthetic voice profile is built from 30 seconds to 10 minutes of source audio. Deployment requires a GDPR review and explicit consent from the person whose voice is being cloned.
