2026-06-04 21:38 UTC | In-site rewrite | 2 min read | Updated: 2026-06-30 13:03 UTC

Microsoft MAI-Voice-2

MAI-Voice-2 is Microsoft's latest expressive text-to-speech model. It supports voice cloning, fine-grained emotional control, and consistent voice identity across 15 languages. It is priced at $22 per million characters in Azure AI Foundry, with integrations rolling out in VSCode, Dynamics 365 Contact Center, and Teams.

Source: Product Hunt AI | Author: Habib Ferdous

MAI: Microsoft's top-tier model family | Product Hunt

Microsoft's top-tier model family

60 followers

AI Infrastructure Tools • Foundation Models

Microsoft AI is pioneering the future of what AI can do and what technology can be.

This is the 4th launch from MAI.

Microsoft MAI-Voice-2

Launching today

Expressive TTS with voice cloning in 15 languages

Microsoft's most expressive TTS model yet — voice cloning from short samples, fine-grained emotional control, and consistent voice identity across 15 languages. Now live in Azure AI Foundry at $22 per million characters, with integrations rolling out in VSCode, Dynamics 365 Contact Center, and Teams. For builders shipping voice agents who need production-grade prosody without the OpenAI Realtime API price tag.
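For a rough sense of what the quoted $22 per million characters means in practice, here is a minimal Python sketch of the arithmetic. It is not an official pricing calculator: the function names are hypothetical, and it assumes every character counted by `len()` (including whitespace and punctuation) is billed, which the launch post does not specify.

```python
from decimal import Decimal

# Price quoted in the launch post: $22 per 1,000,000 characters.
PRICE_PER_MILLION_CHARS = Decimal("22")
ONE_MILLION = Decimal(1_000_000)


def estimate_tts_cost(text: str) -> Decimal:
    """Estimated USD cost of synthesizing `text` once.

    Assumption: billing counts every character, as len() does.
    """
    return PRICE_PER_MILLION_CHARS * len(text) / ONE_MILLION


def estimate_monthly_cost(chars_per_call: int, calls_per_month: int) -> Decimal:
    """Estimated monthly USD cost for a voice agent (hypothetical helper)."""
    total_chars = chars_per_call * calls_per_month
    return PRICE_PER_MILLION_CHARS * total_chars / ONE_MILLION


if __name__ == "__main__":
    # Illustrative workload: 4,000 spoken characters per call, 10,000 calls.
    print(estimate_monthly_cost(4_000, 10_000))  # 40M characters at $22/M
```

Under these assumptions, an agent that speaks 4,000 characters per call across 10,000 calls a month synthesizes 40 million characters, or about $880 at the quoted rate.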

Free

Launch tags: Productivity • Developer Tools • Artificial Intelligence

Launch Team

SocialX

Previous MAI Launches

MAI's 7 New Models: Reasoning, Code, Image, Voice & Transcription AI

Launched on June 3rd, 2026

MAI-Transcribe-1: Production ASR for noisy multilingual audio

Launched on April 3rd, 2026

MAI-Image-2: Microsoft's top-tier text-to-image model for creatives

Launched on March 20th, 2026

I build voice agents for service businesses — mostly healthcare and home services — and the #1 unsolved problem in this space is prosody. The "is this a robot?" moment usually happens in the first 8 seconds of a call.

MAI-Voice-2 is the first TTS I've A/B tested where my pilot users couldn't tell they were hearing a synthetic voice. The $22/M chars pricing lands below ElevenLabs and matches gpt-realtime's TTS layer.

If you're shipping voice and wedded to OpenAI Realtime, worth running the side-by-side. Curious if Microsoft is planning sub-200ms first-token latency via WebRTC streaming next.

10h ago

The consistent voice identity across 15 languages is what stands out to me here. I work on a voice companion that calls aging parents every day, and a lot of our families are immigrants whose parents are most at ease in their first language. A warm, familiar voice that holds up in Tagalog or Mandarin is often the difference between a call someone looks forward to and one they let ring out. Question for the team: how stable is the cloned identity and emotional control over a full 10-minute conversation, or does the prosody drift toward neutral as the session runs longer?

29m ago