5 min read · Updated June 19, 2026

MAI-Transcribe-1.5 is Microsoft's in-house speech-to-text model, unveiled at Build 2026. Microsoft says it covers 43 languages, holds the No. 1 spot on the FLEURS accuracy benchmark, and transcribes an hour of audio in under 15 seconds — up to five times faster than rival models — at roughly 36 cents per hour. It powers Microsoft's speech features and is available to developers in Microsoft Foundry.

Learning Objectives

  • Understand what MAI-Transcribe-1.5 is and how speech-to-text fits into AI products
  • Read its accuracy and speed claims with the right context
  • Know where it fits — and where a different transcription model might serve better

📝Note

Newly launched, and the benchmark claims are Microsoft's own. Microsoft introduced MAI-Transcribe-1.5 at Build 2026. It powers Microsoft speech features and is available to developers in Microsoft Foundry, but the accuracy, speed, and pricing figures below are Microsoft-reported, so confirm them on your own audio before standardizing on it.

What Is MAI-Transcribe-1.5?

MAI-Transcribe-1.5 is Microsoft's first-party speech-to-text model (software that turns spoken audio into written text) and part of the in-house MAI (Microsoft AI) family unveiled at Build 2026. It is built for accurate, fast, multilingual transcription: meeting notes, call-center logs, captions, and the transcription layer inside voice assistants.

A strong first-party transcription model lets Microsoft power speech features across Copilot, Teams, Dynamics 365 Contact Center, and Azure Speech without leaning on a partner — the same first-party logic behind the rest of the MAI family. It is the speech-to-text counterpart to MAI-Voice-2, which does the reverse job (text-to-speech).

💡Key Concept

Word error rate, in one line. Transcription accuracy is usually measured as word error rate (WER): the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript, expressed as a percentage. Lower is better, so a WER of about 2.4% means roughly 2 to 3 errors per hundred reference words. WER varies a lot with audio quality, accents, and jargon, so treat any single number as a starting point.
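To make the definition concrete, here is a minimal Python sketch that computes WER as the word-level edit distance divided by the reference length. The function name and the sample sentences are illustrative, not taken from any Microsoft tooling. Real scoring pipelines usually normalize punctuation and numerals as well, which this sketch only approximates by lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # cost of deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one dropped word plus one wrong word out of nine.
reference = "please send the quarterly report to the finance team"
hypothesis = "please send the quarterly report to finance teams"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.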

What Microsoft Reports

By Microsoft's account, MAI-Transcribe-1.5 is both accurate and fast:

  • 43 languages, with automatic language detection so it identifies the spoken language without being told
  • No. 1 on the FLEURS benchmark, a standard multilingual speech-recognition test, with a word error rate of about 2.4%
  • Speed: transcribes one hour of audio in under 15 seconds — up to five times faster than rivals such as Gemini 3.1, ElevenLabs Scribe v2, and GPT-4o-Transcribe
  • Roughly 36 cents per hour of audio, which Microsoft frames as strong quality per dollar
  • A mixture-of-experts architecture (only part of the model runs per request, keeping it efficient) plus content biasing to better recognize domain-specific terms and names

| Attribute | MAI-Transcribe-1.5 (Microsoft-reported) |
| --- | --- |
| Type | Speech-to-text (transcription) |
| Languages | 43, with automatic language detection |
| Accuracy | No. 1 on FLEURS; word error rate about 2.4% |
| Speed | One hour of audio in under 15 seconds (up to 5 times faster than rivals) |
| Price | About 36 cents per hour of audio |
| Availability | Powers Microsoft speech features; in Microsoft Foundry |

Why Speed and Price Matter

Transcription is a high-volume job — call centers, meeting archives, media libraries, and live captioning generate enormous amounts of audio. At that scale, cost per hour and throughput often matter as much as raw accuracy. Microsoft's pitch is that MAI-Transcribe-1.5 is competitive on accuracy while being markedly faster and cheaper per hour, which is the combination that wins large transcription workloads. Content biasing — feeding the model a list of expected terms, product names, or jargon — is the practical lever that pushes accuracy up on domain-specific audio.
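As a rough planning aid, the reported figures can be turned into a back-of-the-envelope estimate. The sketch below assumes the Microsoft-reported price (about 36 cents per audio hour) and speed (at most 15 seconds of processing per audio hour), and it assumes a single sequential stream with no queueing, retries, or volume discounts. All names are hypothetical, and the constants should be replaced with numbers measured on your own workload.

```python
from dataclasses import dataclass

# Microsoft-reported figures quoted in this lesson, used here as assumptions.
PRICE_USD_PER_AUDIO_HOUR = 0.36
PROCESSING_SECONDS_PER_AUDIO_HOUR = 15.0  # reported upper bound


@dataclass
class WorkloadEstimate:
    audio_hours: float
    cost_usd: float
    processing_hours: float
    speedup_over_real_time: float


def estimate_workload(
    audio_hours: float,
    price_per_hour: float = PRICE_USD_PER_AUDIO_HOUR,
    seconds_per_audio_hour: float = PROCESSING_SECONDS_PER_AUDIO_HOUR,
) -> WorkloadEstimate:
    """Estimate cost and sequential processing time for a batch of audio."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    return WorkloadEstimate(
        audio_hours=audio_hours,
        cost_usd=audio_hours * price_per_hour,
        processing_hours=audio_hours * seconds_per_audio_hour / 3600,
        speedup_over_real_time=3600 / seconds_per_audio_hour,
    )


# Example: a 10,000-hour call archive.
estimate = estimate_workload(10_000)
print(f"Cost: ${estimate.cost_usd:,.2f}")
print(f"Sequential processing time: {estimate.processing_hours:.1f} hours")
print(f"Speed-up over real time: {estimate.speedup_over_real_time:.0f}x")
```

Under these assumptions, 15 seconds per audio hour corresponds to 240 times real-time speed, so a 10,000-hour archive would cost about $3,600 and take roughly 42 hours to process in a single stream.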

Strengths

  • Top-tier accuracy claim: A reported No. 1 FLEURS placement and a low word error rate put it among the strongest transcription models
  • Very fast: An hour of audio in under 15 seconds suits large batch jobs and near-real-time captioning
  • Cost-efficient: Around 36 cents per hour is competitive for high-volume transcription
  • Broad language coverage: 43 languages with automatic detection
  • Domain tuning: Content biasing improves recognition of names and jargon
  • Deep Microsoft integration: Powers Teams, Contact Center, and Azure Speech, with developer access in Microsoft Foundry

Limitations & Considerations

  • Vendor-reported numbers: The FLEURS ranking, word error rate, speed, and price are Microsoft's own; your audio (accents, noise, overlap) is the real test
  • Benchmark versus reality: A leaderboard win does not guarantee the best result on noisy, multi-speaker, or heavily accented recordings
  • New and evolving: Availability and pricing may shift as it rolls out across products and Foundry
  • Ecosystem-leaning: Most convenient for teams already on Microsoft and Foundry
  • Transcription caveats remain: Speaker separation, punctuation, and sensitive-content handling still need review for production use

Best Use Cases

| Scenario | Why MAI-Transcribe-1.5 |
| --- | --- |
| Meeting + call transcription | Fast, accurate, and cheap enough for high volume |
| Captions and subtitles | Speed supports near-real-time captioning; 43-language coverage |
| Voice assistants | Provides the speech-to-text layer, pairing with MAI-Voice-2 |
| Domain-heavy audio | Content biasing lifts accuracy on names and jargon |
| Microsoft-stack apps | Native in Teams, Contact Center, Azure Speech, and Foundry |

When to choose alternatives:

  • A language or dialect not well covered → test a transcription model with proven support for it
  • Specialized needs (diarization-first, medical, legal) → a domain-specific transcription service
  • Non-Microsoft pipelines → a speech-to-text model offered broadly across clouds and direct API

Getting Started

  1. Access MAI-Transcribe-1.5 through Microsoft Foundry, or use it where it already powers Teams, Contact Center, and Azure Speech
  2. Test on your audio — your accents, noise levels, and number of speakers — not just on benchmark clips
  3. Use content biasing to feed in product names and domain terms, and measure the accuracy lift
  4. Compare cost per hour and turnaround against your current transcription stack before switching at scale
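One way to measure the accuracy lift in step 3 is domain-term recall: of the expected terms that occur in a human-verified reference transcript, how many does the model reproduce? The sketch below is a minimal version that works on transcripts you have already produced, so it makes no assumption about the Foundry API. The function names, the term list, and the sample transcripts are all hypothetical.

```python
import re
from collections import Counter


def count_terms(text: str, terms: list[str]) -> Counter:
    """Count case-insensitive, whole-word occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def term_recall(reference: str, hypothesis: str, terms: list[str]) -> float:
    """Fraction of reference term occurrences that also appear in the hypothesis."""
    ref_counts = count_terms(reference, terms)
    hyp_counts = count_terms(hypothesis, terms)
    expected = sum(ref_counts.values())
    if expected == 0:
        raise ValueError("none of the terms occur in the reference transcript")
    # Credit each term at most as many times as it occurs in the reference.
    found = sum(min(hyp_counts[t], ref_counts[t]) for t in terms)
    return found / expected


# Hypothetical example: the same clip transcribed without and with a term list.
terms = ["Contoso", "Foundry", "Dynamics 365"]
reference = "Contoso moved its Dynamics 365 workloads to Foundry after the Contoso audit"
baseline = "Kontoso moved its dynamics 365 workloads to foundry after the can toe so audit"
biased = "Contoso moved its Dynamics 365 workloads to Foundry after the Contoso audit"

print(f"Baseline term recall: {term_recall(reference, baseline, terms):.0%}")  # 50%
print(f"Biased term recall:   {term_recall(reference, biased, terms):.0%}")    # 100%
```

Term recall complements word error rate: a transcript can score well on overall WER while still missing the handful of names and product terms that matter most to your users.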

Tip

Benchmark on your hardest audio. Transcription models look great on clean speech and stumble on crosstalk, accents, and jargon. Run MAI-Transcribe-1.5 on your messiest real recordings — and try content biasing — before trusting the headline accuracy number.

Key Takeaways

  • MAI-Transcribe-1.5 is Microsoft's first-party speech-to-text model, unveiled at Build 2026 as part of the MAI family
  • Microsoft reports 43 languages, No. 1 on FLEURS with a word error rate around 2.4%, and one hour of audio transcribed in under 15 seconds — up to five times faster than rivals
  • At about 36 cents per hour, its pitch is strong quality per dollar for high-volume transcription, with content biasing to tune accuracy on domain terms
  • It is the speech-to-text counterpart to MAI-Voice-2 and pairs with it in voice apps
  • It powers Microsoft speech features and is available to developers in Microsoft Foundry — treat the benchmark figures as vendor-reported until you test your own audio
