7 min read·Updated April 25, 2026

OpenAI TTS & Whisper

By OpenAI

OpenAI offers two complementary audio models: Whisper, for accurate speech-to-text transcription across 99 languages (available as open-source software and through the API), and TTS, for natural-sounding text-to-speech synthesis. Both are accessible as APIs and embedded in ChatGPT.

Learning Objectives

  • Distinguish between Whisper (speech-to-text) and OpenAI TTS (text-to-speech) and understand what each does
  • Evaluate the open-source availability of Whisper and what that enables
  • Identify the right use cases for OpenAI's audio models versus dedicated voice platforms

What Are OpenAI's Audio Tools?

OpenAI offers two distinct audio AI models, often discussed together because they serve complementary functions:

  • Whisper — a speech recognition (speech-to-text / transcription) model that converts spoken audio into written text with high accuracy across 99 languages
  • OpenAI TTS — a text-to-speech model that converts written text into spoken audio using one of six built-in voice personas

Both are available via the OpenAI API. Whisper is also released as open-source software — making it uniquely powerful for developers who want to run transcription on their own infrastructure without sending audio to OpenAI's servers.

Tip

Access: Both tools are available via platform.openai.com. Whisper is also available as an open-source download at github.com/openai/whisper.

Whisper: Speech-to-Text

⚠️Warning

Not to be confused with Wispr Flow. Wispr Flow is a consumer voice-dictation app from Wispr AI (a separate San Francisco startup), not an OpenAI product. Despite the similar name, Wispr Flow does not primarily use OpenAI's Whisper model — it runs its own proprietary cloud models. If you came here looking for the dictation app, see the Wispr Flow tool page instead.

What It Does

Whisper transcribes audio files into text. It was trained on 680,000 hours of multilingual audio, an unusually diverse dataset that gives it strong accuracy across accents, languages, and audio conditions (including noisy recordings and spontaneous speech).

Whisper supports:

  • Transcription — convert audio to text in the original language
  • Translation — transcribe audio from any supported language and translate directly to English
  • Language detection — automatically identifies the language being spoken
  • Timestamps — output includes word-level or segment-level timestamps for subtitle generation
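
To illustrate how segment-level timestamps become subtitles, here is a minimal sketch that formats segments as SRT text. It assumes each segment is a dict with `start` and `end` (in seconds) and `text`; the sample segments and the helper names `format_srt_time` and `segments_to_srt` are hypothetical.

```python
from __future__ import annotations


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn segments with 'start', 'end', and 'text' keys into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical segments, shaped like segment-level timestamp output.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the interview."},
    {"start": 2.5, "end": 5.04, "text": " Thanks for having me."},
]
print(segments_to_srt(example_segments))
```

Writing the returned string to a `.srt` file is enough for most video players to pick it up as a subtitle track.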

Language Support

Whisper was trained on audio in 99 languages, but accuracy varies widely by language. OpenAI's API documentation lists a narrower set of roughly 57 languages that meet its word-error-rate threshold, and audio in a supported language can be translated directly to English. Top-quality performance is concentrated in languages well represented in the training data (English, Spanish, French, German, Chinese, Japanese, Portuguese, Italian, Russian), with usable accuracy across dozens more.

Open-Source Availability

This is what makes Whisper exceptional: OpenAI released Whisper's code and model weights as open source under the MIT license. Anyone can download the weights and run Whisper locally.

💡Key Concept

Why open-source matters for transcription: Running Whisper locally means your audio never leaves your machine — critical for transcribing sensitive conversations, medical appointments, legal proceedings, or confidential business meetings. No data goes to OpenAI's servers, no usage fees accumulate, and the model runs offline. This is a meaningful advantage over cloud-only transcription services.

Whisper Model Sizes

Five model sizes are available in the open-source release, ranging from the compact tiny (39 million parameters, light enough to run on a CPU) to the full large-v3 (1.5 billion parameters, best accuracy, practical only with a GPU). Speed figures are approximate multiples of the large model's speed, not of real time:

| Model | Parameters | Relative speed (vs. large) | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39 million | ~32x | Lowest | On-device; edge; fast prototyping |
| base | 74 million | ~16x | Moderate | Lightweight local use |
| small | 244 million | ~6x | Good | Local use with decent accuracy |
| medium | 769 million | ~2x | High | Production on modest hardware |
| large-v3 | 1.5 billion | ~1x | Highest | Maximum accuracy; requires GPU |

The hosted API's whisper-1 model is based on the open-source large-v2 release, so you pay for large-model accuracy without managing GPU hardware.
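
As a rough planning aid, the speed figures in the table can be used to scale a measured run time from one model size to another. The sketch below treats those figures as throughput multipliers relative to large-v3, which is an assumption; the function names are hypothetical, and real timings depend heavily on hardware.

```python
from __future__ import annotations

from typing import Optional

# Speed multipliers copied from the table; treated here as rough
# throughput relative to large-v3 (an assumption, not a benchmark).
RELATIVE_SPEED = {
    "tiny": 32,
    "base": 16,
    "small": 6,
    "medium": 2,
    "large-v3": 1,
}

# Ordered from most to least accurate, per the table.
ACCURACY_ORDER = ["large-v3", "medium", "small", "base", "tiny"]


def estimated_minutes(model: str, large_v3_minutes: float) -> float:
    """Scale a measured large-v3 run time to another model size."""
    return large_v3_minutes / RELATIVE_SPEED[model]


def most_accurate_within(budget_minutes: float,
                         large_v3_minutes: float) -> Optional[str]:
    """Pick the most accurate model whose estimated run time fits the budget."""
    for model in ACCURACY_ORDER:
        if estimated_minutes(model, large_v3_minutes) <= budget_minutes:
            return model
    return None  # even tiny would take too long


# If large-v3 needs 60 minutes for a batch, which model fits in 35 minutes?
print(most_accurate_within(budget_minutes=35, large_v3_minutes=60))
```

Measure large-v3 once on a representative file, then use the estimate to decide whether a smaller model is worth the accuracy trade-off.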

API Pricing

| Service | Price |
|---|---|
| Whisper API (hosted) | $0.006 per minute of audio |
| Self-hosted (open-source) | Free (compute cost only) |

At $0.006/minute, transcribing one hour of audio via the API costs $0.36 — extremely cost-effective for occasional use.
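
The arithmetic above generalizes to any recording length. This small sketch uses the per-minute price from the table; the function name `whisper_api_cost` is hypothetical, and `Decimal` is used only to avoid floating-point rounding noise in currency values.

```python
from decimal import Decimal

# USD per minute of audio, taken from the pricing table.
WHISPER_API_PRICE_PER_MINUTE = Decimal("0.006")


def whisper_api_cost(audio_minutes) -> Decimal:
    """Return the hosted transcription cost in USD for a given audio length."""
    return Decimal(str(audio_minutes)) * WHISPER_API_PRICE_PER_MINUTE


print(f"1 hour:    ${whisper_api_cost(60)}")
print(f"100 hours: ${whisper_api_cost(100 * 60)}")
```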

OpenAI TTS: Text-to-Speech

What It Does

OpenAI TTS converts written text into spoken audio using one of six built-in voices. It is designed primarily for developer integration: embedding AI voice into applications, generating audio companions for text content, or creating automated voice responses.

The Six Voices

| Voice | Character |
|---|---|
| alloy | Neutral, versatile, balanced |
| echo | Male, warm, conversational |
| fable | British accent, expressive, slightly theatrical |
| onyx | Deep, authoritative |
| nova | Female, friendly, upbeat |
| shimmer | Soft, clear, gentle |

All six voices are available in two quality tiers: tts-1 (optimized for real-time streaming, lower latency) and tts-1-hd (higher quality audio, optimized for final output rather than streaming).

TTS API Pricing

| Model | Price |
|---|---|
| tts-1 | $15 per 1 million characters |
| tts-1-hd | $30 per 1 million characters |

For reference: one minute of speech at average reading speed is approximately 700–900 characters. A 10-minute audio file requires roughly 7,000–9,000 characters, costing under $0.15 on tts-1.
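
The estimate above can be reproduced with a few lines of arithmetic. This sketch uses the prices from the table and the 700 to 900 characters-per-minute range quoted in the text; the function names are hypothetical.

```python
from decimal import Decimal

# USD per 1 million input characters, from the pricing table.
TTS_PRICE_PER_MILLION_CHARS = {
    "tts-1": Decimal("15"),
    "tts-1-hd": Decimal("30"),
}

# Approximate characters spoken per minute, as quoted in the text.
CHARS_PER_MINUTE = (700, 900)


def tts_cost(characters: int, model: str = "tts-1") -> Decimal:
    """Return the cost in USD of synthesizing a given number of characters."""
    return TTS_PRICE_PER_MILLION_CHARS[model] * characters / Decimal(1_000_000)


def tts_cost_range(minutes: float, model: str = "tts-1") -> tuple:
    """Return (low, high) cost estimates for a target audio length."""
    low_rate, high_rate = CHARS_PER_MINUTE
    return (
        tts_cost(round(minutes * low_rate), model),
        tts_cost(round(minutes * high_rate), model),
    )


low, high = tts_cost_range(10)
print(f"10 minutes on tts-1: ${low} to ${high}")
```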

Output Formats

OpenAI TTS supports MP3, Opus (for internet streaming), AAC, FLAC, WAV, and PCM output — covering the full range of audio use cases from web streaming to broadcast-quality archival.
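
One way to keep voice and format choices valid is to check them before calling the API. The sketch below builds the keyword arguments for a speech request from the voices, models, and formats named in this lesson. The helper `build_speech_request` is hypothetical; `response_format` is, to my understanding, the SDK parameter that selects the output format.

```python
# Values taken from this lesson's voice, model, and format lists.
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
MODELS = {"tts-1", "tts-1-hd"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_request(text: str, voice: str = "alloy",
                         audio_format: str = "mp3",
                         model: str = "tts-1") -> dict:
    """Validate the options and return kwargs for a speech request."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if audio_format not in FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    return {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": audio_format,
    }


request = build_speech_request("Hello!", voice="nova", audio_format="opus")
print(request)
# With a configured client, the dict can be unpacked into the call:
#   client.audio.speech.create(**request)
```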

Real-Time Audio in ChatGPT (Advanced Voice Mode)

Beyond the API, OpenAI has integrated voice capabilities directly into ChatGPT's Advanced Voice Mode — allowing real-time conversational audio with ChatGPT on mobile and web. This uses different underlying models optimized for low-latency bidirectional audio, not the standard Whisper/TTS API stack.

Advanced Voice Mode is available on ChatGPT Plus and Pro — it represents the consumer-facing experience of OpenAI's audio capabilities rather than the developer API.

Strengths

  • Whisper accuracy: Among the most accurate transcription models available, particularly for multilingual audio and accented speech
  • Open-source (Whisper): Run locally, free, private — no data leaves your machine
  • 99 language support: Broader language coverage than most competitors
  • Tight ChatGPT integration: TTS powers ChatGPT's voice features; familiar quality for ChatGPT users
  • Developer-friendly: Both tools have clean APIs with official Python and Node.js SDKs
  • Affordable API pricing: Whisper at $0.006/minute and TTS at $15/1 million characters are competitive

Limitations & Considerations

  • TTS voice variety: Six voices is limited compared to ElevenLabs' 3,000+ voice library; no voice cloning capability
  • No real-time Whisper streaming via API: The hosted Whisper API accepts audio files, not continuous live audio streams (real-time transcription requires the open-source model or third-party integration)
  • TTS emotion range: OpenAI TTS voices are high quality but somewhat neutral — ElevenLabs offers more expressive, emotionally varied output for dramatic content
  • Whisper hallucination: On very low-quality or silent audio segments, Whisper occasionally generates plausible-sounding but incorrect text — always review output for critical transcriptions
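
To make the review step for hallucinations more targeted, segments can be flagged using the confidence fields Whisper reports. This sketch assumes each segment dict carries `no_speech_prob` and `avg_logprob`, as the open-source package's transcription output does to my knowledge. The thresholds are illustrative starting points, not tuned values, and the function name is hypothetical.

```python
from __future__ import annotations


def flag_suspect_segments(segments: list[dict],
                          no_speech_threshold: float = 0.6,
                          logprob_threshold: float = -1.0) -> list[dict]:
    """Return segments that look like silence or low-confidence output."""
    suspects = []
    for seg in segments:
        # High no-speech probability: text may have been invented over silence.
        likely_silence = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        # Very low average log-probability: the model was unsure of its words.
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_threshold
        if likely_silence or low_confidence:
            suspects.append(seg)
    return suspects


# Hypothetical segments for illustration.
sample = [
    {"text": "Hello there.", "no_speech_prob": 0.02, "avg_logprob": -0.2},
    {"text": "Thanks for watching!", "no_speech_prob": 0.91, "avg_logprob": -0.4},
    {"text": "mumble", "no_speech_prob": 0.10, "avg_logprob": -1.7},
]
for seg in flag_suspect_segments(sample):
    print("Review:", seg["text"])
```

Flagged segments still need a human check against the audio; the filter only narrows down where to listen.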

Best Use Cases

| Task | Why OpenAI TTS / Whisper |
|---|---|
| Meeting and interview transcription | Whisper API at $0.006/min; high accuracy on spoken dialogue |
| Private transcription of sensitive recordings | Open-source Whisper running locally; no data leaves your machine |
| Subtitle generation for video | Timestamp output enables automatic subtitle file creation |
| Voice interface in an app | TTS API for text-to-audio in chatbots, assistants, reading apps |
| Podcast transcript generation | Batch-process audio files via Whisper API |
| Multilingual transcription | 99-language support; translation to English in one step |

When to choose alternatives:

  • Voice cloning or custom voices → ElevenLabs (TTS has only 6 fixed voices)
  • Music generation → Suno AI or Udio (TTS is voice only, not music)
  • Studio voiceover production with a team workflow → Murf AI
  • AI podcast enhancement and noise removal → Adobe Podcast Enhance

Getting Started

Using Whisper via API:

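from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Open the audio in binary mode; the with-block closes the file afterwards.
with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",  # return plain text instead of JSON
    )
print(transcript)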
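
The translation capability described earlier uses a separate endpoint. The following is a minimal sketch, assuming the SDK's `client.audio.translations.create` call and a hypothetical file name. The client is passed in as a parameter so the helper can be exercised without network access.

```python
def translate_to_english(client, audio_path: str) -> str:
    """Send an audio file to the translations endpoint and return English text."""
    # Open in binary mode; the with-block closes the file after the upload.
    with open(audio_path, "rb") as audio_file:
        return client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",  # plain text instead of JSON
        )


# Usage, assuming OPENAI_API_KEY is set and the file exists:
#   from openai import OpenAI
#   print(translate_to_english(OpenAI(), "interview_es.mp3"))
```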

Using OpenAI TTS via API:

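from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Stream the audio to disk as it is generated.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="Welcome to the AI Pro Playbook.",
) as response:
    response.stream_to_file("output.mp3")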
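
Longer texts generally need to be split before synthesis. The sketch below breaks text into chunks on sentence boundaries, assuming a per-request input limit of 4,096 characters. That limit reflects my understanding of the API documentation and should be verified; the function name `split_for_tts` is hypothetical.

```python
from __future__ import annotations

import re


def split_for_tts(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", limit=10))
```

Each chunk can then be sent as its own speech request and the resulting audio files concatenated in order.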

Running Whisper open-source locally:

pip install openai-whisper
whisper audio.mp3 --model large-v3 --language English
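
The same package can also be called from Python. This is a minimal sketch, assuming the `whisper.load_model` and `model.transcribe` calls provided by `openai-whisper`, plus a working ffmpeg install. The `load_model` parameter is a hypothetical hook added here so the function can be tried with a stand-in model.

```python
def transcribe_locally(audio_path: str, model_name: str = "base",
                       load_model=None) -> dict:
    """Transcribe a file on this machine and return language, text, and segments."""
    if load_model is None:
        # Imported lazily so the module loads even without the package installed.
        import whisper  # provided by `pip install openai-whisper`
        load_model = whisper.load_model
    model = load_model(model_name)  # downloads weights on first use
    result = model.transcribe(audio_path)
    return {
        "language": result["language"],
        "text": result["text"].strip(),
        "segments": result["segments"],
    }


# Usage, assuming audio.mp3 exists locally:
#   output = transcribe_locally("audio.mp3", model_name="small")
#   print(output["language"], output["text"])
```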

Tip

Choosing between API and local Whisper: Use the API for convenience and large-model accuracy without GPU setup. Use local Whisper when privacy is paramount, you have GPU hardware available, or you're processing audio at a volume where API costs add up.
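
The cost side of that decision can be roughed out numerically. The sketch below compares the API price per hour of audio with the compute cost of processing the same hour locally. The GPU hourly cost and the processing speed are hypothetical inputs you would measure or look up yourself.

```python
# API price per hour of audio, from $0.006 per minute.
API_PRICE_PER_AUDIO_HOUR = 0.006 * 60


def local_cost_per_audio_hour(gpu_cost_per_hour: float,
                              speed_vs_real_time: float) -> float:
    """Compute cost to transcribe one hour of audio on your own hardware."""
    if speed_vs_real_time <= 0:
        raise ValueError("speed_vs_real_time must be positive")
    # At 2x real time, one hour of audio needs half an hour of GPU time.
    return gpu_cost_per_hour / speed_vs_real_time


def cheaper_option(gpu_cost_per_hour: float, speed_vs_real_time: float) -> str:
    """Return 'local' or 'api', whichever costs less per hour of audio."""
    local = local_cost_per_audio_hour(gpu_cost_per_hour, speed_vs_real_time)
    return "local" if local < API_PRICE_PER_AUDIO_HOUR else "api"


# Hypothetical: a GPU costing $0.50/hour that runs at 2x real time.
print(cheaper_option(gpu_cost_per_hour=0.50, speed_vs_real_time=2.0))
```

This ignores setup time, idle hardware, and the privacy benefit of local processing, so treat the result as one input to the decision rather than the answer.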

Key Takeaways

  • OpenAI provides two complementary audio tools: Whisper (speech-to-text, open-source) and TTS (text-to-speech, API)
  • Whisper's open-source release is its defining feature — run it locally for private transcription with no API fees
  • OpenAI TTS offers six high-quality voices for developer integration; excellent for adding voice to applications, though limited in variety compared to ElevenLabs
  • Together, they cover the full audio pipeline for many developer use cases — transcribe with Whisper, respond with TTS — within OpenAI's existing API ecosystem
