Name: MAI-Transcribe-1.5
Author: Microsoft

Learning Objectives

Understand what MAI-Transcribe-1.5 is and how speech-to-text fits into AI products
Read its accuracy and speed claims with the right context
Know where it fits — and where a different transcription model might serve better

📝Note

Newly launched, and the benchmark claims are Microsoft's own. Microsoft introduced MAI-Transcribe-1.5 at Build 2026. It powers Microsoft speech features and is available to developers in Microsoft Foundry, but the accuracy, speed, and pricing figures below are Microsoft-reported, so confirm them on your own audio before standardizing on it.

What Is MAI-Transcribe-1.5?

MAI-Transcribe-1.5 is Microsoft's first-party speech-to-text model — software that turns spoken audio into written text — part of the in-house MAI (Microsoft AI) family unveiled at Build 2026. It is built for accurate, fast, multilingual transcription: meeting notes, call-center logs, captions, and the transcription layer inside voice assistants.

A strong first-party transcription model lets Microsoft power speech features across Copilot, Teams, Dynamics 365 Contact Center, and Azure Speech without leaning on a partner — the same first-party logic behind the rest of the MAI family. It is the speech-to-text counterpart to MAI-Voice-2, which does the reverse job (text-to-speech).

💡Key Concept

Word error rate, in one line. Transcription accuracy is usually measured as word error rate (WER): the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. Lower is better, so a WER of about 2.4% means roughly 2 to 3 errors per hundred reference words. WER varies a lot with audio quality, accents, and jargon, so treat any single number as a starting point.
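
To make the WER definition concrete, here is a minimal sketch that computes it as a word-level edit distance divided by the reference length. The function name and sample sentences are illustrative assumptions, and the normalization (lowercasing and splitting on whitespace) is deliberately simple; benchmark scoring such as FLEURS typically applies more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hand-written example sentences, not model output.
reference = "please send the quarterly report to the finance team"
hypothesis = "please send the quarterly report to finance teams"

# One deletion ("the") and one substitution ("team" -> "teams") over 9 words.
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 22.2%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis contains many extra words, which is one more reason to read a single headline figure with care.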

What Microsoft Reports

By Microsoft's account, MAI-Transcribe-1.5 is both accurate and fast:

43 languages, with automatic language detection so it identifies the spoken language without being told
No. 1 on the FLEURS benchmark, a standard multilingual speech-recognition test, with a word error rate of about 2.4%
Speed: transcribes one hour of audio in under 15 seconds — up to five times faster than rivals such as Gemini 3.1, ElevenLabs Scribe v2, and GPT-4o-Transcribe
Roughly 36 cents per hour of audio, which Microsoft frames as strong quality per dollar
A mixture-of-experts architecture (only part of the model runs per request, keeping it efficient) plus content biasing to better recognize domain-specific terms and names

Attribute	MAI-Transcribe-1.5 (Microsoft-reported)
Type	Speech-to-text (transcription)
Languages	43, with automatic language detection
Accuracy	No. 1 on FLEURS; word error rate about 2.4%
Speed	One hour of audio in under 15 seconds (up to 5 times faster than rivals)
Price	About 36 cents per hour of audio
Availability	Powers Microsoft speech features; in Microsoft Foundry

Why Speed and Price Matter

Transcription is a high-volume job — call centers, meeting archives, media libraries, and live captioning generate enormous amounts of audio. At that scale, cost per hour and throughput often matter as much as raw accuracy. Microsoft's pitch is that MAI-Transcribe-1.5 is competitive on accuracy while being markedly faster and cheaper per hour, which is the combination that wins large transcription workloads. Content biasing — feeding the model a list of expected terms, product names, or jargon — is the practical lever that pushes accuracy up on domain-specific audio.
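
A rough back-of-the-envelope sketch shows how the reported figures translate to a workload. The constants come from the Microsoft-reported numbers above (under 15 seconds per audio hour, roughly 36 cents per audio hour); the 10,000-hour monthly volume, the function names, and the assumption that the price is in US dollars are hypothetical, and real billing or throughput may differ.

```python
AUDIO_SECONDS_PER_HOUR = 3600

# Microsoft-reported figures; treat them as upper-bound estimates, not guarantees.
REPORTED_SECONDS_PER_AUDIO_HOUR = 15   # "under 15 seconds" per hour of audio
REPORTED_PRICE_PER_AUDIO_HOUR = 0.36   # "roughly 36 cents"; currency assumed to be USD


def speed_factor(processing_seconds_per_audio_hour: float) -> float:
    """How many times faster than real time the transcription runs."""
    return AUDIO_SECONDS_PER_HOUR / processing_seconds_per_audio_hour


def estimate_workload(audio_hours: float,
                      price_per_audio_hour: float,
                      processing_seconds_per_audio_hour: float) -> tuple[float, float]:
    """Return (total cost, sequential processing time in hours) for a batch."""
    cost = audio_hours * price_per_audio_hour
    processing_hours = (audio_hours * processing_seconds_per_audio_hour
                        / AUDIO_SECONDS_PER_HOUR)
    return cost, processing_hours


# Hypothetical monthly volume for a mid-sized contact center.
monthly_audio_hours = 10_000

cost, processing_hours = estimate_workload(
    monthly_audio_hours,
    REPORTED_PRICE_PER_AUDIO_HOUR,
    REPORTED_SECONDS_PER_AUDIO_HOUR,
)
print(f"Speed factor: {speed_factor(REPORTED_SECONDS_PER_AUDIO_HOUR):.0f}x real time")
print(f"Estimated cost: {cost:,.2f} per month")
print(f"Sequential processing time: {processing_hours:.1f} hours")
```

Swapping in the price and measured turnaround of an existing transcription stack gives a like-for-like comparison before any accuracy testing begins.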

Strengths

Top-tier accuracy claim: A reported No. 1 FLEURS placement and a low word error rate put it among the strongest transcription models
Very fast: An hour of audio in under 15 seconds suits large batch jobs; for near-real-time captioning, check streaming latency separately, since a batch throughput figure does not measure it
Cost-efficient: Around 36 cents per hour is competitive for high-volume transcription
Broad language coverage: 43 languages with automatic detection
Domain tuning: Content biasing improves recognition of names and jargon
Deep Microsoft integration: Powers Teams, Contact Center, and Azure Speech, with developer access in Microsoft Foundry

Limitations & Considerations

Vendor-reported numbers: The FLEURS ranking, word error rate, speed, and price are Microsoft's own; your audio (accents, noise, overlap) is the real test
Benchmark versus reality: A leaderboard win does not guarantee the best result on noisy, multi-speaker, or heavily accented recordings
New and evolving: Availability and pricing may shift as it rolls out across products and Foundry
Ecosystem-leaning: Most convenient for teams already on Microsoft and Foundry
Transcription caveats remain: Speaker separation, punctuation, and sensitive-content handling still need review for production use

Best Use Cases

Scenario	Why MAI-Transcribe-1.5
Meeting and call transcription	Fast, accurate, and cheap enough for high volume
Captions and subtitles	Speed supports near-real-time captioning; 43-language coverage
Voice assistants	Provides the speech-to-text layer, pairing with MAI-Voice-2
Domain-heavy audio	Content biasing lifts accuracy on names and jargon
Microsoft-stack apps	Native in Teams, Contact Center, Azure Speech, and Foundry

When to choose alternatives:

A language or dialect not well covered → test a transcription model with proven support for it
Specialized needs (diarization-first, medical, legal) → a domain-specific transcription service
Non-Microsoft pipelines → a speech-to-text model offered broadly across clouds and direct API

Getting Started

Access MAI-Transcribe-1.5 through Microsoft Foundry, or use it where it already powers Teams, Contact Center, and Azure Speech
Test on your audio — your accents, noise levels, and number of speakers — not just on benchmark clips
Use content biasing to feed in product names and domain terms, and measure the accuracy lift
Compare cost per hour and turnaround against your current transcription stack before switching at scale
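
One simple way to measure the lift from content biasing is to check how many expected domain terms survive into the transcript, with and without the term list. The sketch below assumes you already have both transcripts as plain text; it does not call any Microsoft API, and the term list, transcripts, and function name are hypothetical, hand-written for illustration.

```python
import re


def term_recall(transcript: str, expected_terms: list[str]) -> float:
    """Fraction of expected terms found in the transcript.

    Matching is case-insensitive and requires the whole term, so "SKU-4410"
    does not match inside a longer token such as "SKU-44100".
    """
    if not expected_terms:
        raise ValueError("expected_terms must not be empty")
    text = transcript.lower()
    found = 0
    for term in expected_terms:
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            found += 1
    return found / len(expected_terms)


# Hypothetical domain terms and hand-written transcripts (not real model output).
expected_terms = ["Contoso", "Fabrikam CRM", "SKU-4410"]
baseline_transcript = "the contoso team asked about fabric am crm and skew 4410"
biased_transcript = "the Contoso team asked about Fabrikam CRM and SKU-4410"

baseline = term_recall(baseline_transcript, expected_terms)
biased = term_recall(biased_transcript, expected_terms)
print(f"Term recall without biasing: {baseline:.0%}")
print(f"Term recall with biasing:    {biased:.0%}")
print(f"Lift: {biased - baseline:+.0%}")
```

Term recall complements overall word error rate: a transcript can have a low error rate and still miss the handful of product names that matter most to your use case.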

✅Tip

Benchmark on your hardest audio. Transcription models look great on clean speech and stumble on crosstalk, accents, and jargon. Run MAI-Transcribe-1.5 on your messiest real recordings — and try content biasing — before trusting the headline accuracy number.

Key Takeaways

MAI-Transcribe-1.5 is Microsoft's first-party speech-to-text model, unveiled at Build 2026 as part of the MAI family
Microsoft reports 43 languages, No. 1 on FLEURS with a word error rate around 2.4%, and one hour of audio transcribed in under 15 seconds — up to five times faster than rivals
At about 36 cents per hour, its pitch is strong quality per dollar for high-volume transcription, with content biasing to tune accuracy on domain terms
It is the speech-to-text counterpart to MAI-Voice-2 and pairs with it in voice apps
It powers Microsoft speech features and is available to developers in Microsoft Foundry — treat the benchmark figures as vendor-reported until you test your own audio
