Learning Objectives
- Understand what MAI-Transcribe-1.5 is and how speech-to-text fits into AI products
- Read its accuracy and speed claims with the right context
- Know where it fits — and where a different transcription model might serve better
📝Note
Newly launched, and the benchmark claims are Microsoft's own. Microsoft introduced MAI-Transcribe-1.5 at Build 2026. It powers Microsoft speech features and is available to developers in Microsoft Foundry, but the accuracy, speed, and pricing figures below are Microsoft-reported, so confirm them on your own audio before standardizing on it.
What Is MAI-Transcribe-1.5?
MAI-Transcribe-1.5 is Microsoft's first-party speech-to-text model — software that turns spoken audio into written text — part of the in-house MAI (Microsoft AI) family unveiled at Build 2026. It is built for accurate, fast, multilingual transcription: meeting notes, call-center logs, captions, and the transcription layer inside voice assistants.
A strong first-party transcription model lets Microsoft power speech features across Copilot, Teams, Dynamics 365 Contact Center, and Azure Speech without leaning on a partner — the same first-party logic behind the rest of the MAI family. It is the speech-to-text counterpart to MAI-Voice-2, which does the reverse job (text-to-speech).
💡Key Concept
Word error rate, in one line. Transcription accuracy is usually measured as word error rate (WER): the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. Lower is better, so a WER of about 2.4% means roughly 2 to 3 errors per hundred reference words. WER varies a lot with audio quality, accents, and jargon, so treat any single number as a starting point.
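The WER definition can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercasing and whitespace splitting are the only normalization applied; the function name `word_error_rate` is illustrative, and real scoring pipelines usually normalize punctuation and numerals before counting errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference.

    Assumption: lowercasing and whitespace splitting are the only normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("the" -> "a") over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors but do not add to the denominator, WER can exceed 100% on a badly garbled transcript, which is one reason it is not quite "the percentage of words wrong".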
What Microsoft Reports
By Microsoft's account, MAI-Transcribe-1.5 is both accurate and fast:
- 43 languages, with automatic language detection so it identifies the spoken language without being told
- No. 1 on the FLEURS benchmark, a standard multilingual speech-recognition test, with a word error rate of about 2.4%
- Speed: transcribes one hour of audio in under 15 seconds — up to five times faster than rivals such as Gemini 3.1, ElevenLabs Scribe v2, and GPT-4o-Transcribe
- Roughly 36 cents per hour of audio, which Microsoft frames as strong quality per dollar
- A mixture-of-experts architecture (only a subset of the model's expert sub-networks is active for any given input, which keeps compute per request down) plus content biasing to better recognize domain-specific terms and names
| Attribute | MAI-Transcribe-1.5 (Microsoft-reported) |
|---|---|
| Type | Speech-to-text (transcription) |
| Languages | 43, with automatic language detection |
| Accuracy | No. 1 on FLEURS; word error rate about 2.4% |
| Speed | One hour of audio in under 15 seconds (up to five times faster than rivals) |
| Price | About 36 cents per hour of audio |
| Availability | Powers Microsoft speech features; in Microsoft Foundry |
Why Speed and Price Matter
Transcription is a high-volume job — call centers, meeting archives, media libraries, and live captioning generate enormous amounts of audio. At that scale, cost per hour and throughput often matter as much as raw accuracy. Microsoft's pitch is that MAI-Transcribe-1.5 is competitive on accuracy while being markedly faster and cheaper per hour, which is the combination that wins large transcription workloads. Content biasing — feeding the model a list of expected terms, product names, or jargon — is the practical lever that pushes accuracy up on domain-specific audio.
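To make the scale argument concrete, the arithmetic behind the reported figures can be worked through as below. This is a rough sketch, not a pricing calculator: the two constants are the Microsoft-reported numbers quoted in this lesson, and `parallel_streams` is a hypothetical parameter that assumes work divides evenly and ignores upload time, queuing, and rate limits.

```python
# Microsoft-reported figures quoted in this lesson; treat them as unverified inputs.
PRICE_CENTS_PER_AUDIO_HOUR = 36
PROCESSING_SECONDS_PER_AUDIO_HOUR = 15  # upper bound, from "under 15 seconds"


def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than playback speed the transcription runs."""
    return audio_seconds / processing_seconds


def estimate_batch(audio_hours: float, parallel_streams: int = 1) -> dict:
    """Rough cost and processing-time estimate for a batch of audio.

    parallel_streams is a hypothetical knob: it assumes the work splits evenly
    and ignores upload time, queuing, and rate limits.
    """
    if parallel_streams < 1:
        raise ValueError("parallel_streams must be at least 1")
    cost_usd = audio_hours * PRICE_CENTS_PER_AUDIO_HOUR / 100
    processing_hours = (
        audio_hours * PROCESSING_SECONDS_PER_AUDIO_HOUR / 3600 / parallel_streams
    )
    return {"cost_usd": cost_usd, "processing_hours": processing_hours}


if __name__ == "__main__":
    # One hour of audio (3600 s) in 15 s of processing.
    print(speedup_over_real_time(3600, PROCESSING_SECONDS_PER_AUDIO_HOUR))
    # A hypothetical 10,000-hour call archive.
    print(estimate_batch(10_000))
```

Under these assumptions, "one hour in under 15 seconds" corresponds to at least 240 times real time, and a 10,000-hour archive would cost about $3,600 at the reported rate. Swapping in your current vendor's price and measured turnaround gives a like-for-like comparison.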
Strengths
- Top-tier accuracy claim: A reported No. 1 FLEURS placement and a low word error rate put it among the strongest transcription models
- Very fast: An hour of audio in under 15 seconds suits large batch jobs and near-real-time captioning
- Cost-efficient: Around 36 cents per hour is competitive for high-volume transcription
- Broad language coverage: 43 languages with automatic detection
- Domain tuning: Content biasing improves recognition of names and jargon
- Deep Microsoft integration: Powers Teams, Contact Center, and Azure Speech, with developer access in Microsoft Foundry
Limitations & Considerations
- Vendor-reported numbers: The FLEURS ranking, word error rate, speed, and price are Microsoft's own; your audio (accents, noise, overlap) is the real test
- Benchmark versus reality: A leaderboard win does not guarantee the best result on noisy, multi-speaker, or heavily accented recordings
- New and evolving: Availability and pricing may shift as it rolls out across products and Foundry
- Ecosystem-leaning: Most convenient for teams already on Microsoft and Foundry
- Transcription caveats remain: Speaker separation, punctuation, and sensitive-content handling still need review for production use
Best Use Cases
| Scenario | Why MAI-Transcribe-1.5 |
|---|---|
| Meeting and call transcription | Fast, accurate, and cheap enough for high volume |
| Captions and subtitles | Speed supports near-real-time captioning; 43-language coverage |
| Voice assistants | Provides the speech-to-text layer, pairing with MAI-Voice-2 |
| Domain-heavy audio | Content biasing lifts accuracy on names and jargon |
| Microsoft-stack apps | Native in Teams, Contact Center, Azure Speech, and Foundry |
When to choose alternatives:
- A language or dialect not well covered → test a transcription model with proven support for it
- Specialized needs (diarization-first, medical, legal) → a domain-specific transcription service
- Non-Microsoft pipelines → a speech-to-text model offered broadly across clouds and direct API
Getting Started
- Access MAI-Transcribe-1.5 through Microsoft Foundry, or use it where it already powers Teams, Contact Center, and Azure Speech
- Test on your audio — your accents, noise levels, and number of speakers — not just on benchmark clips
- Use content biasing to feed in product names and domain terms, and measure the accuracy lift
- Compare cost per hour and turnaround against your current transcription stack before switching at scale
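One way to measure the lift from content biasing is to check how many of your domain terms survive transcription with and without it. The sketch below assumes you already have a human reference transcript and two model outputs for the same clip; it does not call any Foundry API. The function name `term_recall` and the sample sentences (including the product names) are invented for illustration.

```python
import re
from typing import Optional, Sequence


def _count_occurrences(term: str, text: str) -> int:
    """Case-insensitive count of whole-word (or whole-phrase) matches."""
    pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
    return len(re.findall(pattern, text.lower()))


def term_recall(reference: str, hypothesis: str, terms: Sequence[str]) -> Optional[float]:
    """Fraction of domain-term occurrences in the reference that the hypothesis kept.

    Returns None when none of the terms appear in the reference.
    """
    expected = 0
    found = 0
    for term in terms:
        in_reference = _count_occurrences(term, reference)
        expected += in_reference
        # Cap at the reference count so repeated terms cannot inflate the score.
        found += min(in_reference, _count_occurrences(term, hypothesis))
    return found / expected if expected else None


if __name__ == "__main__":
    # Invented sample data, standing in for one clip from your own test set.
    terms = ["Contoso Cloud", "Zyntrix"]
    reference = "Please open a ticket in Contoso Cloud for the Zyntrix router."
    without_biasing = "please open a ticket in contoso cloud for the zen tricks router"
    with_biasing = "please open a ticket in Contoso Cloud for the Zyntrix router"

    print("without biasing:", term_recall(reference, without_biasing, terms))
    print("with biasing:   ", term_recall(reference, with_biasing, terms))
```

Term recall complements overall word error rate: a transcript can score well on WER while still mangling the handful of names that matter most to your business, so tracking both gives a clearer picture of what biasing buys you.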
✅Tip
Benchmark on your hardest audio. Transcription models look great on clean speech and stumble on crosstalk, accents, and jargon. Run MAI-Transcribe-1.5 on your messiest real recordings — and try content biasing — before trusting the headline accuracy number.
Key Takeaways
- MAI-Transcribe-1.5 is Microsoft's first-party speech-to-text model, unveiled at Build 2026 as part of the MAI family
- Microsoft reports 43 languages, No. 1 on FLEURS with a word error rate around 2.4%, and one hour of audio transcribed in under 15 seconds — up to five times faster than rivals
- At about 36 cents per hour, its pitch is strong quality per dollar for high-volume transcription, with content biasing to tune accuracy on domain terms
- It is the speech-to-text counterpart to MAI-Voice-2 and pairs with it in voice apps
- It powers Microsoft speech features and is available to developers in Microsoft Foundry — treat the benchmark figures as vendor-reported until you test your own audio
