Should the LLM emit SSML directly?

Usually no. Better: the LLM produces clean plaintext plus metadata ("this is a phone number"), and a deterministic renderer layer turns that into SSML. Tags stay consistent and valid.
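
The split between LLM output and renderer can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Segment` type, the `kind` labels, and the `SAY_AS` mapping are hypothetical names, and which `interpret-as` values a given TTS engine accepts varies by vendor.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

# Hypothetical mapping from LLM metadata labels to say-as values.
# Supported interpret-as values differ between TTS vendors; check your engine.
SAY_AS = {"phone": "telephone", "spell": "characters"}


@dataclass
class Segment:
    text: str            # clean plaintext produced by the LLM
    kind: str = "plain"  # metadata label, e.g. "phone"


def render_ssml(segments: list[Segment]) -> str:
    """Turn plaintext segments plus metadata into one SSML string."""
    parts = []
    for seg in segments:
        body = escape(seg.text)  # escapes &, < and > in LLM-generated text
        interpret_as = SAY_AS.get(seg.kind)
        if interpret_as is None:
            parts.append(body)
        else:
            parts.append(f'<say-as interpret-as="{interpret_as}">{body}</say-as>')
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(render_ssml([
        Segment("Call us at "),
        Segment("+49 30 1234567", kind="phone"),
        Segment(" for Q&A."),
    ]))
```

Because the tags only ever come from the renderer, the LLM cannot produce unbalanced or unsupported markup; at worst it mislabels a segment.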

SSML (Speech Synthesis Markup Language) — Glossary

SSML (Speech Synthesis Markup Language) is an XML vocabulary that controls text-to-speech output: pauses, emphasis, speaking rate, pitch, custom pronunciations, and how phone numbers or IBANs are spelled out. For an AI phone assistant, SSML is the lever that turns merely usable TTS into a voice that sounds professional.

The tags that matter most in practice: `<break time="350ms"/>` for a short pause before an important sentence, `<say-as interpret-as="telephone">` for phone numbers, `<phoneme>` for proper nouns and jargon, and `<prosody rate="95%">` for a slightly slower pace on sensitive topics. Maintaining a per-industry library of snippets pays off quickly.
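
One possible shape for such a snippet library is a plain lookup table of templates. The industry names, snippet names, and the `snippet` helper below are all hypothetical; only the tags themselves come from the list above.

```python
from xml.sax.saxutils import escape

# Hypothetical snippet library: industry -> snippet name -> SSML template.
SNIPPETS = {
    "healthcare": {
        "sensitive": '<prosody rate="95%">{text}</prosody>',
        "callback_number": '<say-as interpret-as="telephone">{text}</say-as>',
    },
    "default": {
        "pause_then": '<break time="350ms"/>{text}',
    },
}


def snippet(industry: str, name: str, text: str) -> str:
    """Look up a template, falling back to the shared defaults, and fill it in."""
    template = SNIPPETS.get(industry, {}).get(name) or SNIPPETS["default"][name]
    return template.format(text=escape(text))  # text is escaped before insertion


if __name__ == "__main__":
    print(snippet("healthcare", "sensitive", "Your results are in."))
    print(snippet("insurance", "pause_then", "Terms & conditions apply."))
```

Keeping the templates in data rather than scattered through prompt text makes it easier to review and tune them per industry.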

Caveat: too many tags make speech sound artificial and can break older TTS engines. Clean SSML practice means using tags sparingly, automatically escaping LLM-generated text, and linting the outgoing markup string before sending it to the TTS API.
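
A minimal sketch of the escaping and linting steps, using only the Python standard library. The allowlist, the `MAX_TAGS` budget, and the function names are assumptions chosen for illustration, and the check assumes the markup carries no XML namespace declarations.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed allowlist and tag budget; tune both to what your TTS engine supports.
ALLOWED_TAGS = {"speak", "break", "say-as", "phoneme", "prosody"}
MAX_TAGS = 10


def wrap_plaintext(text: str) -> str:
    """Escape LLM-generated text and wrap it in a <speak> root."""
    return "<speak>" + escape(text) + "</speak>"


def lint_ssml(ssml: str) -> list[str]:
    """Return a list of problems; an empty list means the markup passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    tags = [el.tag for el in root.iter()]  # includes the root element
    for tag in sorted(set(tags) - ALLOWED_TAGS):
        problems.append(f"tag <{tag}> is not on the allowlist")
    if len(tags) - 1 > MAX_TAGS:  # do not count the root itself
        problems.append(f"{len(tags) - 1} tags exceed the budget of {MAX_TAGS}")
    return problems


if __name__ == "__main__":
    print(lint_ssml(wrap_plaintext("Questions & answers <now>")))
    print(lint_ssml("<speak>Q&A</speak>"))
```

If the linter reports anything, a cautious fallback is to send the escaped plaintext without tags rather than risk a failed TTS request mid-call.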

