SSML (Speech Synthesis Markup Language) is an XML-based markup language that controls text-to-speech output: pauses, emphasis, speaking rate, pitch, custom pronunciations, and how phone numbers or IBANs are read out. For an AI phone assistant, SSML is the lever that turns usable TTS into a voice that sounds professional.
The tags that matter most in practice: `<break time="350ms"/>` for a short pause before an important sentence, `<say-as interpret-as="telephone">` for phone numbers, `<phoneme>` for proper nouns and jargon, and `<prosody rate="95%">` for a slightly slower pace on sensitive topics. Maintaining a per-industry library of snippets pays off quickly.
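A snippet library for the tags above could start as a handful of small helper functions. The following is a minimal sketch, not a reference implementation: the function names (`pause`, `telephone`, `pronounce`, `slower`, `speak`), the phone number, and the IPA string are all made up for illustration, and which `interpret-as` values and phoneme alphabets are accepted depends on the TTS engine in use.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(ms: int = 350) -> str:
    """Short pause, e.g. before an important sentence."""
    return f'<break time="{ms}ms"/>'


def telephone(number: str) -> str:
    """Ask the engine to read the text as a phone number."""
    return f'<say-as interpret-as="telephone">{escape(number)}</say-as>'


def pronounce(text: str, ipa: str) -> str:
    """Custom pronunciation for a proper noun or jargon term."""
    # quoteattr() adds the surrounding quotes and escapes the value.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def slower(inner_ssml: str, rate_percent: int = 95) -> str:
    """Wrap already-built SSML in a slightly slower speaking rate."""
    return f'<prosody rate="{rate_percent}%">{inner_ssml}</prosody>'


def speak(*parts: str) -> str:
    """Join already-built SSML fragments under a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


# Illustrative usage; plain text is escaped before it is mixed with tags.
ssml = speak(
    escape("Thank you for calling."),
    pause(),
    slower(escape("Your callback number is ") + telephone("+49 30 1234567") + "."),
)
print(ssml)
```

Keeping each tag behind one function means an engine-specific quirk only has to be handled in one place, and per-industry phrasing can be built on top of these helpers rather than on hand-written markup strings.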
Caveat: too many tags can make speech sound artificial, and older TTS engines may fail on markup they do not support. Clean SSML practice means using tags sparingly, automatically escaping LLM-generated text, and linting the outgoing markup string before sending it to the TTS API.
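The escaping and linting steps can be sketched with the Python standard library alone. This is one possible shape under stated assumptions: `wrap_llm_text` and `lint_ssml` are hypothetical names, the allowed-tag set and the tag budget of 8 are arbitrary placeholders to be tuned per engine, and the markup is assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List
from xml.sax.saxutils import escape

# Assumptions for this sketch: adjust both to what the target engine supports.
ALLOWED_TAGS = {"speak", "break", "say-as", "phoneme", "prosody"}
MAX_TAGS = 8


def wrap_llm_text(text: str) -> str:
    """Escape &, < and > in raw LLM output so it cannot inject markup."""
    return f"<speak>{escape(text)}</speak>"


def lint_ssml(ssml: str) -> List[str]:
    """Return a list of problems; an empty list means the string passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")

    elements = list(root.iter())  # includes the root itself
    for el in elements:
        if el.tag not in ALLOWED_TAGS:
            problems.append(f"unsupported tag <{el.tag}>")

    tag_count = len(elements) - 1  # do not count the root
    if tag_count > MAX_TAGS:
        problems.append(f"{tag_count} tags exceeds budget of {MAX_TAGS}")
    return problems


# Illustrative usage: refuse to send markup that has any lint findings.
outgoing = wrap_llm_text("Fees under 5 EUR & no hidden <extra> costs")
findings = lint_ssml(outgoing)
if findings:
    raise ValueError("; ".join(findings))
```

Running the linter on the final string, rather than on individual fragments, catches problems introduced when escaped LLM text and hand-written snippets are concatenated.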