5 min read · Updated June 19, 2026

MAI-Voice-2 is Microsoft's in-house expressive text-to-speech model, unveiled at Build 2026. It generates natural, emotionally expressive speech across 15 languages with fine-grained control over tone and pacing, and ships with a low-latency Flash variant for real-time use. It already powers voice experiences across Microsoft products and is available to developers in Microsoft Foundry.

Learning Objectives

  • Understand what MAI-Voice-2 is and what "expressive text-to-speech" means
  • Explain where a first-party voice model fits in Microsoft's AI stack
  • Know when to choose the standard model versus the low-latency Flash variant

📝Note

Newly launched, and the claims are Microsoft's own. Microsoft introduced MAI-Voice-2 at Build 2026. It powers voice features in Microsoft products and is available to developers in Microsoft Foundry, but the capability claims below are Microsoft-reported. Test it on your own scripts and languages before relying on it.

What Is MAI-Voice-2?

MAI-Voice-2 is Microsoft's first-party text-to-speech model — software that turns written text into natural-sounding spoken audio — part of the in-house MAI (Microsoft AI) family unveiled at Build 2026. It is the more expressive, multilingual successor to MAI-Voice-1, designed to produce speech that sounds natural and carries appropriate emotion rather than flat, robotic narration.

Owning a strong voice model lets Microsoft power read-aloud, voice assistants, and audio features across Copilot and its other products without depending on a partner — the same first-party logic behind the MAI text and image models.

💡Key Concept

Text-to-speech versus speech-to-text. Text-to-speech (this model) reads written text aloud as audio. Speech-to-text (its sibling, MAI-Transcribe-1.5) does the reverse — it turns spoken audio into written text. The two are often used together in voice assistants and call-center tools.
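
To make the pairing concrete, here is a minimal sketch of one voice-assistant turn. It assumes three hypothetical callables (transcribe, respond, synthesize) that you would back with real services. None of these names come from a Microsoft SDK; the stubs only fake the behavior so the flow can run end to end.

```python
from typing import Callable

# Hypothetical type aliases for illustration; these are not real Microsoft APIs.
Transcriber = Callable[[bytes], str]   # speech-to-text: audio in, text out
Responder = Callable[[str], str]       # your app logic or a language model
Synthesizer = Callable[[str], bytes]   # text-to-speech: text in, audio out


def voice_turn(audio_in: bytes, transcribe: Transcriber,
               respond: Responder, synthesize: Synthesizer) -> bytes:
    """Run one assistant turn: hear, decide, speak."""
    user_text = transcribe(audio_in)   # speech-to-text step
    reply_text = respond(user_text)    # decide what to say
    return synthesize(reply_text)      # text-to-speech step


# Stand-in stubs so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

In a real app, the transcriber and synthesizer would wrap whichever speech services you use, and only the two outer steps would touch audio.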

What Microsoft Reports

By Microsoft's account, MAI-Voice-2 generates expressive speech across 15 languages with fine-grained control over delivery — letting developers shape tone and pacing rather than accept a single flat voice. Microsoft positions it as a clear step up in naturalness and emotional range over MAI-Voice-1.

It also ships with safeguards: Microsoft has highlighted protections against unauthorized voice cloning, a growing concern as synthetic voices become harder to distinguish from real ones.

Attribute | MAI-Voice-2 (Microsoft-reported)
Type | Expressive text-to-speech (voice generation)
Languages | 15
Focus | Natural delivery with fine-grained tone + pacing control
Safety | Protections against unauthorized voice cloning
Variants | MAI-Voice-2 + a low-latency Flash variant
Availability | Powers Microsoft product voice features; in Microsoft Foundry

The Flash Variant

MAI-Voice-2 ships alongside a Flash variant tuned for ultra-low-latency scenarios — real-time, interactive use where the audio needs to start almost instantly, such as live voice assistants and conversational agents. The full model targets the richest expressiveness; Flash trades a little of that for the speed that real-time interaction demands.

Strengths

  • Expressive, natural delivery: Tuned to sound human and carry emotion, not flat narration
  • Multilingual: Covers 15 languages with a single model
  • Fine-grained control: Developers can shape tone and pacing rather than accept a single default voice
  • Real-time option: The Flash variant targets the low latency that live voice interaction needs
  • Built-in safety: Voice-cloning protections address a real misuse risk
  • Deep Microsoft integration: Already powering product voice features, with developer access in Microsoft Foundry

Limitations & Considerations

  • Vendor-reported claims: Expressiveness and quality are Microsoft's framing; the real test is how it sounds on your scripts and languages
  • 15 languages, not universal: Strong coverage, but confirm your specific languages and accents are well supported
  • New and evolving: Availability and behavior may shift as it rolls out across products and Foundry
  • Ecosystem-leaning: Most convenient for teams already on Microsoft and Foundry
  • Synthetic-voice responsibilities: Even with cloning protections, generating realistic speech carries consent and disclosure obligations you must handle

Best Use Cases

Scenario | Why MAI-Voice-2
Read-aloud and narration | Natural, expressive delivery for articles, lessons, and documents
Voice assistants and agents | The Flash variant's low latency suits real-time interaction
Multilingual audio | One model covers 15 languages
Accessibility features | Clear, natural speech for screen-reading and read-aloud
Microsoft-stack apps | Native in Microsoft products and Foundry

When to choose alternatives:

  • A specific language or accent not well covered → test a voice model with proven support for it
  • Studio-grade voice cloning of a consented speaker → a specialist voice platform built for that
  • Non-Microsoft pipelines → a text-to-speech model offered broadly across clouds and direct API

Getting Started

  1. Identify your use case — narration, a live assistant, or accessibility read-aloud — since that decides full model versus Flash
  2. Access MAI-Voice-2 through Microsoft Foundry and generate samples in the languages you actually need
  3. Test the expressiveness controls on real scripts, and compare the Flash variant's latency if you are building anything real-time
  4. Build voice-cloning consent and disclosure into your workflow, regardless of the model's built-in protections
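
For step 3, one way to compare latency is to time how long each variant takes to deliver its first audio chunk. The sketch below assumes your client exposes speech as a stream of byte chunks; `stream_speech` is a hypothetical callable you would wrap around the real client, and `fake_stream` is a stand-in whose delay is an arbitrary placeholder, not a measurement of either model.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(stream_speech: Callable[[str], Iterable[bytes]],
                        text: str) -> float:
    """Return seconds from the request until the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in stream_speech(text):
        if chunk:  # skip empty keep-alive chunks, if any
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def fake_stream(delay_s: float) -> Callable[[str], Iterator[bytes]]:
    """Build a stand-in streaming synthesizer that waits delay_s before its first chunk."""
    def stream(text: str) -> Iterator[bytes]:
        time.sleep(delay_s)          # placeholder for network + model time
        yield text.encode("utf-8")   # placeholder for real audio bytes
    return stream


if __name__ == "__main__":
    script = "Welcome back. Let's pick up where we left off."
    elapsed = time_to_first_chunk(fake_stream(0.05), script)
    print(f"stub time to first chunk: {elapsed * 1000:.0f} ms")
```

Run it several times per variant on your own scripts and compare medians, since a single request can be skewed by network noise.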

Tip

Latency or expressiveness: pick per use case. Use the full MAI-Voice-2 when richness matters and a slight delay is fine, like narrating a lesson. Switch to the Flash variant for live, back-and-forth voice where the audio must start almost instantly.
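
The rule in this tip can be written down as a small helper so the choice is explicit in your app's configuration. This is only a sketch: the `Variant` labels are illustrative names, not official model identifiers, and the single boolean is an assumed simplification of your use case.

```python
from enum import Enum


class Variant(Enum):
    # Illustrative labels only; look up the real model identifiers in Microsoft Foundry.
    FULL = "full"
    FLASH = "flash"


def choose_variant(live_interaction: bool) -> Variant:
    """Pick Flash for live, back-and-forth voice; otherwise the full model."""
    return Variant.FLASH if live_interaction else Variant.FULL


# Examples drawn from the lesson: narration tolerates a slight delay,
# a live assistant does not.
narration_choice = choose_variant(live_interaction=False)
assistant_choice = choose_variant(live_interaction=True)
```

Keeping the decision in one function makes it easy to revisit once you have measured both variants on your own scripts.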

Key Takeaways

  • MAI-Voice-2 is Microsoft's first-party expressive text-to-speech model, unveiled at Build 2026 as part of the MAI family
  • It generates natural, emotionally expressive speech across 15 languages with fine-grained control over tone and pacing, and includes voice-cloning protections
  • A low-latency Flash variant targets real-time voice assistants and agents; the full model targets maximum expressiveness
  • Its sibling MAI-Transcribe-1.5 handles the reverse job (speech-to-text), and the two pair naturally in voice apps
  • It powers Microsoft product voice features and is available to developers in Microsoft Foundry — treat the claims as vendor-reported until you test your own scripts
