Microsoft MAI-Voice-2
MAI-Voice-2 is Microsoft's latest expressive text-to-speech model. It supports voice cloning in 15 languages, fine-grained emotional control, and consistent voice identity. It is priced at $22 per million characters in Azure AI Foundry, with integrations into VS Code, Dynamics 365 Contact Center, and Teams.
MAI: Microsoft's top-tier model family | Product Hunt
Microsoft's top-tier model family
60 followers
AI Infrastructure Tools • Foundation Models
Microsoft AI is pioneering the future of what AI can do and what technology can be.
This is the 4th launch from MAI.
Microsoft MAI-Voice-2
Launching today
Expressive TTS with voice cloning in 15 languages
Microsoft's most expressive TTS model yet: voice cloning from short samples, fine-grained emotional control, and consistent voice identity across 15 languages. Now live in Azure AI Foundry at $22 per million characters, with integrations rolling out in VS Code, Dynamics 365 Contact Center, and Teams. Built for teams shipping voice agents that need production-grade prosody without the OpenAI Realtime API price tag.
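To put the listed price in per-utterance terms, here is a rough sketch of the arithmetic. The helper name `estimate_tts_cost` is hypothetical, and the sketch assumes billing is strictly proportional to character count, which the listing does not confirm (minimum charges, rounding, and how markup or whitespace is counted may differ).

```python
from decimal import Decimal

# Listed price from the launch description: $22 per million characters.
PRICE_PER_MILLION_CHARS = Decimal("22")


def estimate_tts_cost(text: str) -> Decimal:
    """Hypothetical helper: estimated synthesis cost in USD for `text`.

    Assumes cost scales linearly with len(text); actual billing rules
    are not stated in the listing and may differ.
    """
    return PRICE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)


greeting = "Thanks for calling. How can I help you today?"
print(f"{len(greeting)} chars -> ${estimate_tts_cost(greeting):.6f}")
```

Under the same linear assumption, a workload's monthly cost is the total characters synthesized divided by one million, times 22.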
Free
Launch tags: Productivity • Developer Tools • Artificial Intelligence
Previous MAI Launches
MAI's 7 New Models: Reasoning, Code, Image, Voice & Transcription AI
Launched on June 3rd, 2026
MAI-Transcribe-1: Production ASR for noisy multilingual audio
Launched on April 3rd, 2026
MAI-Image-2: Microsoft's top-tier text-to-image model for creatives
Launched on March 20th, 2026
Reviews
No reviews yet.
I build voice agents for service businesses — mostly healthcare and home services — and the #1 unsolved problem in this space is prosody. The "is this a robot?" moment usually happens in the first 8 seconds of a call.
MAI-Voice-2 is the first TTS I've A/B tested where my pilot users couldn't tell. The $22/M chars pricing lands below ElevenLabs and matches gpt-realtime's TTS layer.
If you're shipping voice and wedded to OpenAI Realtime, worth running the side-by-side. Curious if Microsoft is planning sub-200ms first-token latency via WebRTC streaming next.
10h ago
The consistent voice identity across 15 languages is what stands out to me here. I work on a voice companion that calls aging parents every day, and a lot of our families are immigrants whose parents are most at ease in their first language. A warm, familiar voice that holds up in Tagalog or Mandarin is often the difference between a call someone looks forward to and one they let ring out. Question for the team: how stable is the cloned identity and emotional control over a full 10-minute conversation, or does the prosody drift toward neutral as the session runs longer?
29m ago