Glossary

SSML (Speech Synthesis Markup Language)

XML markup for TTS: controls pronunciation, pauses, emphasis, and how phone numbers are read out. W3C standard. Essential for clean pronunciation of domain terms and foreign proper names.

SSML (Speech Synthesis Markup Language) is an XML vocabulary that controls text-to-speech output: pauses, emphasis, speaking rate, pitch, custom pronunciations, and the spelling out of phone numbers or IBANs. For an AI phone assistant, SSML is the lever that turns merely usable TTS into a voice that sounds professional.

Practically important tags: `<break time="350ms"/>` for a short pause before important sentences, `<say-as interpret-as="telephone">` for phone numbers, `<phoneme>` (with its `ph` attribute) for proper nouns and jargon, `<prosody rate="95%">` for a slightly slower pace on sensitive topics. Which `interpret-as` values are supported varies by TTS engine, so check the provider's documentation. Maintaining a per-industry library of snippets pays off quickly.
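A snippet library like this can start as a handful of small helper functions. The following is a minimal Python sketch, not a definitive implementation: the function names (`pause`, `telephone`, `phoneme`, `slower`, `speak`) are hypothetical, and the IPA string in the usage note is only an illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(ms: int = 350) -> str:
    """Short silence, e.g. before an important sentence."""
    return f'<break time="{int(ms)}ms"/>'


def telephone(number: str) -> str:
    """Ask the engine to read the text as a phone number."""
    return f'<say-as interpret-as="telephone">{escape(number)}</say-as>'


def phoneme(text: str, ipa: str) -> str:
    """Attach an IPA pronunciation to a proper noun or jargon term."""
    # quoteattr() escapes the value and adds the surrounding quotes.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def slower(text: str, rate_percent: int = 95) -> str:
    """Slightly reduced speaking rate, e.g. for sensitive topics."""
    return f'<prosody rate="{int(rate_percent)}%">{escape(text)}</prosody>'


def speak(*parts: str) -> str:
    """Wrap already-rendered SSML fragments in the <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"
```

Usage: `speak(pause(), slower("Your appointment has been cancelled."))` yields one SSML string, and `phoneme("tomato", "təˈmɑːtəʊ")` pins a pronunciation. Because every helper escapes its text argument, industry-specific snippets can be composed without hand-writing markup.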

Caveat: too many tags make speech sound artificial and can cause older TTS engines to reject or mishandle the request. Clean SSML practice means using tags sparingly, automatically escaping LLM-generated text, and linting the outgoing markup string before sending it to the TTS API.
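Escaping and linting can both be done with the Python standard library. The sketch below is one possible approach under stated assumptions: `ALLOWED_TAGS`, `MAX_TAGS`, `escape_llm_text`, and `lint_ssml` are hypothetical names, the tag limit is an arbitrary example value, and the check assumes the `<speak>` root carries no XML namespace.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: the tags this deployment chooses to allow; adjust per TTS engine.
ALLOWED_TAGS = {"speak", "break", "say-as", "phoneme", "prosody"}
MAX_TAGS = 20  # arbitrary example limit to keep markup sparing


def escape_llm_text(text: str) -> str:
    """Escape &, < and > so LLM output cannot inject or break markup."""
    return escape(text)


def lint_ssml(ssml: str) -> list[str]:
    """Return a list of problems; an empty list means the string passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")

    elements = list(root.iter())  # includes the root itself
    for element in elements:
        if element.tag not in ALLOWED_TAGS:
            problems.append(f"unexpected tag <{element.tag}>")
    if len(elements) - 1 > MAX_TAGS:
        problems.append(f"more than {MAX_TAGS} tags inside <speak>")
    return problems
```

Usage: `lint_ssml("<speak>Smith & Sons</speak>")` reports a well-formedness problem because of the raw ampersand, while `lint_ssml("<speak>" + escape_llm_text("Smith & Sons") + "</speak>")` returns an empty list. A non-empty result is a reasonable trigger for falling back to plain text instead of sending the markup.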

FAQ
Should the LLM emit SSML directly?
Usually no. Better: the LLM produces clean plaintext plus metadata ("this is a phone number"), and a deterministic renderer layer turns that into SSML. Tags stay consistent and valid.
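A deterministic renderer of this kind could look roughly like the following Python sketch. The `Segment` type, its `kind` values, and `render_ssml` are hypothetical names chosen for illustration; the actual metadata format the LLM emits is an assumption.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Segment:
    """One piece of LLM output: plain text plus a metadata label."""
    text: str
    kind: str = "plain"  # assumed labels: "plain", "telephone", "pause"


def render_ssml(segments: list[Segment]) -> str:
    """Turn labelled plaintext segments into a single SSML string."""
    parts = []
    for segment in segments:
        if segment.kind == "plain":
            parts.append(escape(segment.text))
        elif segment.kind == "telephone":
            parts.append(
                f'<say-as interpret-as="telephone">{escape(segment.text)}</say-as>'
            )
        elif segment.kind == "pause":
            parts.append('<break time="350ms"/>')  # text is ignored for pauses
        else:
            # Unknown metadata fails loudly instead of producing odd markup.
            raise ValueError(f"unknown segment kind: {segment.kind!r}")
    return "<speak>" + "".join(parts) + "</speak>"
```

Usage: `render_ssml([Segment("You can reach us at "), Segment("+49 30 1234567", "telephone"), Segment(".")])` produces one `<speak>` document with the number wrapped in `<say-as>`. Since the LLM never writes tags itself, the same input always yields the same markup.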
