Name: OpenAI TTS / Whisper
Author: OpenAI

Learning Objectives

Distinguish between Whisper (speech-to-text) and OpenAI TTS (text-to-speech) and understand what each does
Evaluate the open-source availability of Whisper and what that enables
Identify the right use cases for OpenAI's audio models versus dedicated voice platforms

What Are OpenAI's Audio Tools?

OpenAI offers two distinct audio AI models, often discussed together because they serve complementary functions:

Whisper — a speech recognition (speech-to-text / transcription) model that converts spoken audio into written text with high accuracy across 99 languages
OpenAI TTS — a text-to-speech model that converts written text into spoken audio using one of six built-in voice personas

Both are available via the OpenAI API. Whisper is also released as open-source software, so developers can run transcription on their own infrastructure without sending audio to OpenAI's servers.

✅ Tip

Access: Both tools are available via platform.openai.com — Whisper also available as an open-source download at github.com/openai/whisper

Whisper: Speech-to-Text

⚠️ Warning

Not to be confused with Wispr Flow. Wispr Flow is a consumer voice-dictation app from Wispr AI (a separate San Francisco startup), not an OpenAI product. Despite the similar name, Wispr Flow does not primarily use OpenAI's Whisper model — it runs its own proprietary cloud models. If you came here looking for the dictation app, see the Wispr Flow tool page instead.

What It Does

Whisper transcribes recorded audio files into text. It was trained on 680,000 hours of multilingual audio, an unusually diverse dataset that gives it strong accuracy across accents, languages, and audio conditions (including noisy recordings and spontaneous speech).

Whisper supports:

Transcription — convert audio to text in the original language
Translation — transcribe audio from any supported language and translate directly to English
Language detection — automatically identifies the language being spoken
Timestamps — output includes word-level or segment-level timestamps for subtitle generation
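To make the timestamp capability concrete, here is a minimal sketch that turns segment-level timestamps into SubRip (SRT) subtitle text. It assumes each segment is a dict with `start` and `end` in seconds plus `text`, which mirrors the segment shape the open-source package returns; the sample data and both function names are hypothetical.

```python
from typing import Iterable, Mapping


def format_srt_time(seconds: float) -> str:
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text'."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle cues.
    return "\n".join(blocks)


# Hypothetical sample data standing in for real transcription output.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 61.04, "text": " Let's get started."},
]
print(segments_to_srt(sample_segments))
```

Word-level timestamps could be handled the same way by passing one entry per word instead of per segment.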

Language Support

Whisper supports 99 languages for transcription and covers 57 languages for translation to English. Top-quality performance is concentrated in languages well-represented in its training data (English, Spanish, French, German, Chinese, Japanese, Portuguese, Italian, Russian), but it achieves usable accuracy across dozens more.

Open-Source Availability

This is what makes Whisper exceptional: OpenAI released Whisper as fully open-source under the MIT license. Anyone can download the model weights and run Whisper locally.

💡Key Concept

Why open-source matters for transcription: Running Whisper locally means your audio never leaves your machine — critical for transcribing sensitive conversations, medical appointments, legal proceedings, or confidential business meetings. No data goes to OpenAI's servers, no usage fees accumulate, and the model runs offline. This is a meaningful advantage over cloud-only transcription services.

Whisper Model Sizes

Five model sizes are available (open-source), ranging from the compact tiny (39 million parameters, runs on CPU) to the full large-v3 (1.5 billion parameters, best accuracy, requires a GPU):

Model	Parameters	Relative speed (vs. large)	Accuracy	Best For
tiny	39 million	~32x	Lowest	On-device; edge; fast prototyping
base	74 million	~16x	Moderate	Lightweight local use
small	244 million	~6x	Good	Local use with decent accuracy
medium	769 million	~2x	High	Production without GPU
large-v3	1.5 billion	1x (baseline)	Highest	Maximum accuracy; requires GPU

The hosted API exposes a single model, whisper-1, which OpenAI's documentation describes as based on large-v2. You pay per minute for large-model accuracy without managing GPU hardware.
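For local use, the choice of size usually comes down to available GPU memory. The sketch below picks the largest model that fits, using approximate VRAM figures as listed in the openai/whisper README; those figures are not from this page, so treat them as rough guidance. `pick_model` is a hypothetical helper.

```python
# Approximate VRAM needs in GB, per the openai/whisper README.
# These are assumptions for illustration; real usage varies with audio and settings.
APPROX_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget."""
    chosen = "tiny"  # fall back to the smallest model, which can also run on CPU
    for name, needed_gb in APPROX_VRAM_GB.items():  # ordered smallest to largest
        if needed_gb <= available_vram_gb:
            chosen = name
    return chosen


for vram in (0.5, 4, 8, 24):
    print(f"{vram} GB -> {pick_model(vram)}")
```

The returned name can be passed to the `--model` flag of the `whisper` command-line tool.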

API Pricing

Service	Price
Whisper API (hosted)	$0.006 per minute of audio
Self-hosted (open-source)	Free (compute cost only)

At $0.006/minute, transcribing one hour of audio via the API costs $0.36 — extremely cost-effective for occasional use.
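The arithmetic above generalizes to any audio length. This small sketch uses the $0.006 per minute rate quoted in the table; `whisper_api_cost` is a hypothetical helper, and actual billing granularity may differ.

```python
from decimal import Decimal

# Hosted Whisper price per minute of audio, as quoted in the pricing table.
WHISPER_PRICE_PER_MINUTE = Decimal("0.006")


def whisper_api_cost(audio_minutes: float) -> Decimal:
    """Estimate the hosted transcription cost in USD, rounded to cents."""
    cost = Decimal(str(audio_minutes)) * WHISPER_PRICE_PER_MINUTE
    return cost.quantize(Decimal("0.01"))


print(whisper_api_cost(60))       # one hour of audio
print(whisper_api_cost(60 * 40))  # forty hours of audio
```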

OpenAI TTS: Text-to-Speech

What It Does

OpenAI TTS converts written text into spoken audio using one of six built-in voices. It is designed primarily for developer integration: embedding AI voice into applications, generating audio companions for text content, or creating automated voice responses.

The Six Voices

Voice	Character
alloy	Neutral, versatile, balanced
echo	Male, warm, conversational
fable	British accent, expressive, slightly theatrical
onyx	Deep, authoritative
nova	Female, friendly, upbeat
shimmer	Soft, clear, gentle

All six voices are available in two quality tiers: tts-1 (optimized for real-time streaming, lower latency) and tts-1-hd (higher quality audio, optimized for final output rather than streaming).

TTS API Pricing

Model	Price
tts-1	$15 per 1 million characters
tts-1-hd	$30 per 1 million characters

For reference: one minute of speech at average reading speed is approximately 700–900 characters. A 10-minute audio file requires roughly 7,000–9,000 characters, costing under $0.15 on tts-1.
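The estimate above can be checked directly from the character count. This sketch uses the per-million-character prices from the table; `tts_cost` is a hypothetical helper that simply counts characters in the input string.

```python
from decimal import Decimal

# Prices in USD per 1 million input characters, from the pricing table.
PRICE_PER_MILLION_CHARS = {
    "tts-1": Decimal("15"),
    "tts-1-hd": Decimal("30"),
}


def tts_cost(text: str, model: str = "tts-1") -> Decimal:
    """Estimate the cost in USD of synthesizing `text` with the given model."""
    return Decimal(len(text)) * PRICE_PER_MILLION_CHARS[model] / Decimal(1_000_000)


# Stand-in for a script at the upper end of the 10-minute estimate.
script = "a" * 9000
print(tts_cost(script))              # tts-1
print(tts_cost(script, "tts-1-hd"))  # tts-1-hd
```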

Output Formats

OpenAI TTS supports MP3, Opus (for internet streaming), AAC, FLAC, WAV, and PCM output — covering the full range of audio use cases from web streaming to broadcast-quality archival.

Real-Time Audio in ChatGPT (Advanced Voice Mode)

Beyond the API, OpenAI has integrated voice capabilities directly into ChatGPT's Advanced Voice Mode — allowing real-time conversational audio with ChatGPT on mobile and web. This uses different underlying models optimized for low-latency bidirectional audio, not the standard Whisper/TTS API stack.

Advanced Voice Mode is available on ChatGPT Plus and Pro — it represents the consumer-facing experience of OpenAI's audio capabilities rather than the developer API.

Strengths

Whisper accuracy: Among the most accurate transcription models available, particularly for multilingual audio and accented speech
Open-source (Whisper): Run locally, free, private — no data leaves your machine
99 language support: Broader language coverage than most competitors
Tight ChatGPT integration: TTS powers ChatGPT's voice features; familiar quality for ChatGPT users
Developer-friendly: Both tools have clean APIs with official Python and Node.js SDKs
Affordable API pricing: Whisper at $0.006/minute and TTS at $15/1 million characters are competitive

Limitations & Considerations

TTS voice variety: Six voices is limited compared to ElevenLabs' 3,000+ voice library; no voice cloning capability
No real-time Whisper streaming via API: The hosted Whisper API accepts audio files, not continuous live audio streams (real-time transcription requires the open-source model or third-party integration)
TTS emotion range: OpenAI TTS voices are high quality but somewhat neutral — ElevenLabs offers more expressive, emotionally varied output for dramatic content
Whisper hallucination: On very low-quality or silent audio segments, Whisper occasionally generates plausible-sounding but incorrect text — always review output for critical transcriptions
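One way to make that review step cheaper is to flag segments the model itself was unsure about. The sketch below assumes segments are dicts carrying `no_speech_prob` and `avg_logprob`, as in the open-source package's segment output. The thresholds are illustrative, borrowed from the defaults of the open-source transcribe function, and `flag_suspect_segments` is a hypothetical helper, not a guaranteed hallucination detector.

```python
def flag_suspect_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return segments that look like silence or low-confidence output."""
    suspects = []
    for seg in segments:
        # High no-speech probability: the model may be inventing text over silence.
        likely_silence = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        # Low average log-probability: the decoder was unsure about this text.
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_threshold
        if likely_silence or low_confidence:
            suspects.append(seg)
    return suspects


# Hypothetical sample data for illustration.
sample = [
    {"text": "Thanks for joining.", "no_speech_prob": 0.02, "avg_logprob": -0.21},
    {"text": "Thank you for watching!", "no_speech_prob": 0.91, "avg_logprob": -0.45},
    {"text": "Uh the the quarterly", "no_speech_prob": 0.10, "avg_logprob": -1.37},
]
for seg in flag_suspect_segments(sample):
    print("review:", seg["text"])
```

Flagged segments still need a human check; an unflagged segment is not proof of correctness.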

Best Use Cases

Task	Why OpenAI TTS / Whisper
Meeting and interview transcription	Whisper API at $0.006/min; high accuracy on spoken dialogue
Private transcription of sensitive recordings	Open-source Whisper running locally — no data leaves your machine
Subtitle generation for video	Timestamp output enables automatic subtitle file creation
Voice interface in an app	TTS API for text-to-audio in chatbots, assistants, reading apps
Podcast transcript generation	Batch-process audio files via Whisper API
Multilingual transcription	99-language support; translation to English in one step

When to choose alternatives:

Voice cloning or custom voices → ElevenLabs (TTS has only 6 fixed voices)
Music generation → Suno AI or Udio (TTS is voice only, not music)
Studio voiceover production with a team workflow → Murf AI
AI podcast enhancement and noise removal → Adobe Podcast Enhance

Getting Started

Using Whisper via API:

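from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Open the audio in binary mode; the context manager closes it after upload.
with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
    )

# With response_format="text", the result is a plain string.
print(transcript)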
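Before batch-uploading recordings, it can help to check each file locally. This sketch assumes the hosted endpoint's documented 25 MB per-file cap and its list of accepted file types; the exact byte value of the cap is an assumption, and `check_upload` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Assumed limits based on OpenAI's speech-to-text documentation.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_SUFFIXES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_upload(path: str) -> List[str]:
    """Return a list of problems that would likely block an upload (empty if none)."""
    problems = []
    file_path = Path(path)
    if file_path.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {file_path.suffix or '(none)'}")
    if not file_path.exists():
        problems.append("file not found")
    elif file_path.stat().st_size > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the assumed 25 MB limit; split or compress it")
    return problems
```

A batch job could call `check_upload` on every file first and only send the ones that return an empty list.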

Using OpenAI TTS via API:

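from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Stream the audio to disk as it arrives instead of buffering it in memory.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="Welcome to the AI Pro Playbook.",
) as response:
    response.stream_to_file("output.mp3")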
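Longer scripts need to be split before synthesis, because each speech request accepts a limited amount of input text. The sketch below assumes a 4,096-character cap per request, as documented for the speech endpoint, and prefers sentence boundaries when splitting; `chunk_text` is a hypothetical helper.

```python
import re

MAX_INPUT_CHARS = 4096  # assumed per-request input cap


def chunk_text(text: str, limit: int = MAX_INPUT_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", limit=10))
```

Each chunk can then be sent as a separate request and the resulting audio files concatenated in order.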

Running Whisper open-source locally:

# Requires ffmpeg to be installed and available on your PATH.
pip install -U openai-whisper
whisper audio.mp3 --model large-v3 --language English

✅ Tip

Choosing between API and local Whisper: Use the API for convenience and large-model accuracy without GPU setup. Use local Whisper when privacy is paramount, you have GPU hardware available, or you're processing audio at a volume where API costs add up.

Key Takeaways

OpenAI provides two complementary audio tools: Whisper (speech-to-text, open-source) and TTS (text-to-speech, API)
Whisper's open-source release is its defining feature — run it locally for private transcription with no API fees
OpenAI TTS offers six high-quality voices for developer integration; excellent for adding voice to applications, though limited in variety compared to ElevenLabs
Together, they cover the full audio pipeline for many developer use cases — transcribe with Whisper, respond with TTS — within OpenAI's existing API ecosystem
