Voice AI is the umbrella term for AI systems that understand and produce spoken language. The stack typically has three layers: speech-to-text (STT) for input, a language model (with or without retrieval-augmented generation, RAG) for generation, and text-to-speech (TTS) for output.
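The three layers can be sketched as a simple chain of functions. This is a minimal illustration, not a reference implementation: the names `VoicePipeline`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical, the stages are stubs standing in for real engines, and a production system would most likely stream audio in chunks rather than pass whole buffers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures, chosen for illustration only.
SpeechToText = Callable[[bytes], str]   # audio in  -> transcript
LanguageModel = Callable[[str], str]    # transcript -> reply text
TextToSpeech = Callable[[str], bytes]   # reply text -> audio out


@dataclass
class VoicePipeline:
    """Chains the three layers: speech-to-text, language model, text-to-speech."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # layer 1: input
        reply_text = self.llm(transcript)  # layer 2: generation
        return self.tts(reply_text)        # layer 3: output


# Stub stages (assumptions): they treat the "audio" as UTF-8 text
# so the sketch runs without any speech engine installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.respond(b"hello")
print(audio_out)
```

Keeping each layer behind a plain callable means one engine can be swapped for another without touching the other two layers.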
AI phone assistants are the most commercially relevant application of voice AI today, but not the only one: in-app voice bots, in-car assistants, smart-home devices, and dictation tools all share the same stack, with different latency and domain requirements.
Three requirements separate voice AI from text-only conversational AI: real-time constraints, acoustic robustness, and natural prosody. Together they demand more engineering effort than pure chat, which is why many flashy "voice AI" demos fail in production.