Learning Objectives
- Distinguish between Whisper (speech-to-text) and OpenAI TTS (text-to-speech) and understand what each does
- Evaluate the open-source availability of Whisper and what that enables
- Identify the right use cases for OpenAI's audio models versus dedicated voice platforms
What Are OpenAI's Audio Tools?
OpenAI offers two distinct audio AI models, often discussed together because they serve complementary functions:
- Whisper — a speech recognition (speech-to-text / transcription) model that converts spoken audio into written text with high accuracy across 99 languages
- OpenAI TTS — a text-to-speech model that converts written text into spoken audio using one of six built-in voice personas
Both are available via the OpenAI API. Whisper is also released as open-source software, so developers can run transcription on their own infrastructure without sending audio to OpenAI's servers.
✅Tip
Access: Both tools are available via platform.openai.com; Whisper is also available as an open-source download at github.com/openai/whisper
Whisper: Speech-to-Text
⚠️Warning
Not to be confused with Wispr Flow. Wispr Flow is a consumer voice-dictation app from Wispr AI (a separate San Francisco startup), not an OpenAI product. Despite the similar name, Wispr Flow does not primarily use OpenAI's Whisper model — it runs its own proprietary cloud models. If you came here looking for the dictation app, see the Wispr Flow tool page instead.
What It Does
Whisper transcribes recorded audio into text. The hosted API accepts uploaded audio files rather than continuous live streams. The model was trained on 680,000 hours of multilingual audio, a large and diverse dataset that gives it strong accuracy across accents, languages, and audio conditions (including noisy recordings and spontaneous speech).
Whisper supports:
- Transcription — convert audio to text in the original language
- Translation — transcribe audio from any supported language and translate directly to English
- Language detection — automatically identifies the language being spoken
- Timestamps — output includes word-level or segment-level timestamps for subtitle generation
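To illustrate how segment-level timestamps turn into subtitles, here is a minimal sketch that formats a list of segments as SRT text. It assumes each segment is a dict with `start` and `end` (in seconds) and `text`, which is the shape the open-source `whisper` package returns under `result["segments"]`; the helper names `format_srt_time` and `segments_to_srt` are hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text'.

    Assumption: the segment shape matches what the open-source whisper
    package produces; adapt the keys if your source differs.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)
```

If you only need a subtitle file and are running the open-source command-line tool, its `--output_format srt` option writes one directly, so a helper like this is mainly useful when you post-process segments yourself.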
Language Support
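Whisper can attempt transcription in 99 languages. OpenAI's API documentation lists a smaller set of roughly 57 languages as officially supported (those on which the model scores below a 50% word error rate), and that list applies to both transcription and translation to English. Top-quality performance is concentrated in languages well-represented in its training data (English, Spanish, French, German, Chinese, Japanese, Portuguese, Italian, Russian), but it achieves usable accuracy across dozens more.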
Open-Source Availability
This is what makes Whisper exceptional: OpenAI released Whisper as fully open-source under the MIT license. Anyone can download the model weights and run Whisper locally.
💡Key Concept
Why open-source matters for transcription: Running Whisper locally means your audio never leaves your machine — critical for transcribing sensitive conversations, medical appointments, legal proceedings, or confidential business meetings. No data goes to OpenAI's servers, no usage fees accumulate, and the model runs offline. This is a meaningful advantage over cloud-only transcription services.
Whisper Model Sizes
Five model sizes are available as open-source downloads, ranging from the compact tiny (39 million parameters, runs on CPU) to the full large-v3 (1.5 billion parameters, best accuracy, needs a GPU for practical speeds). Speed figures are approximate and relative to the large model, not to real time:
| Model | Parameters | Speed (relative to large) | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 39 million | ~32x | Lowest | On-device; edge; fast prototyping |
| base | 74 million | ~16x | Moderate | Lightweight local use |
| small | 244 million | ~6x | Good | Local use with decent accuracy |
| medium | 769 million | ~2x | High | High accuracy with less GPU memory than large |
| large-v3 | 1.5 billion | 1x | Highest | Maximum accuracy; GPU strongly recommended |
The hosted API's whisper-1 model is based on the open-source large-v2 model, so you get large-model accuracy without managing GPU hardware.
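The speed and accuracy trade-off in the table above can be sketched as a small lookup. This is only an illustration: the numbers are copied from the table's Parameters and Speed columns, and the `pick_model` helper is hypothetical rather than part of any Whisper package.

```python
# (name, parameters in millions, speed multiplier from the table),
# ordered from least to most accurate.
WHISPER_MODELS = [
    ("tiny", 39, 32),
    ("base", 74, 16),
    ("small", 244, 6),
    ("medium", 769, 2),
    ("large-v3", 1500, 1),
]


def pick_model(min_speedup: float) -> str:
    """Return the most accurate model that is at least `min_speedup`
    times as fast as large-v3, according to the table's multipliers."""
    candidates = [
        name for name, _params, speed in WHISPER_MODELS if speed >= min_speedup
    ]
    if not candidates:
        raise ValueError(f"No model is {min_speedup}x faster than large-v3")
    # The list is ordered by accuracy, so the last match is the best one.
    return candidates[-1]


print(pick_model(6))
```

Real throughput depends heavily on hardware and audio content, so treat the multipliers as rough guidance and benchmark on your own recordings before committing to a size.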
API Pricing
| Service | Price |
|---|---|
| Whisper API (hosted) | $0.006 per minute of audio |
| Self-hosted (open-source) | Free (compute cost only) |
At $0.006/minute, transcribing one hour of audio via the API costs $0.36 — extremely cost-effective for occasional use.
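For budgeting a batch of recordings, the same arithmetic can be wrapped in a few lines. This sketch assumes cost scales linearly with audio duration at the listed $0.006 per minute; the function name `estimate_batch_cost` is hypothetical, and actual invoices may round differently.

```python
WHISPER_API_PRICE_PER_MINUTE = 0.006  # USD, from the pricing table above


def estimate_batch_cost(durations_seconds) -> float:
    """Estimated hosted transcription cost in USD for a list of
    recording lengths given in seconds (assumes linear per-minute pricing)."""
    total_minutes = sum(durations_seconds) / 60
    return total_minutes * WHISPER_API_PRICE_PER_MINUTE


# Three recordings: 60, 30 and 45 minutes long.
print(round(estimate_batch_cost([3600, 1800, 2700]), 2))
```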
OpenAI TTS: Text-to-Speech
What It Does
OpenAI TTS converts written text into spoken audio using one of six voice models. It is designed primarily for developer integration — embedding AI voice into applications, generating audio companions for text content, or creating automated voice responses.
The Six Voices
| Voice | Character |
|---|---|
| alloy | Neutral, versatile, balanced |
| echo | Male, warm, conversational |
| fable | British accent, expressive, slightly theatrical |
| onyx | Deep, authoritative |
| nova | Female, friendly, upbeat |
| shimmer | Soft, clear, gentle |
All six voices are available in two quality tiers: tts-1 (optimized for real-time streaming, lower latency) and tts-1-hd (higher quality audio, optimized for final output rather than streaming).
TTS API Pricing
| Model | Price |
|---|---|
| tts-1 | $15 per 1 million characters |
| tts-1-hd | $30 per 1 million characters |
For reference: one minute of speech at average reading speed is approximately 700–900 characters. A 10-minute audio file requires roughly 7,000–9,000 characters, costing under $0.15 on tts-1.
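The character-based pricing is easy to check before sending text to the API. The sketch below assumes billing is strictly proportional to the number of input characters at the rates in the table; `tts_cost` is a hypothetical helper name.

```python
# USD per 1 million input characters, from the pricing table above.
TTS_PRICE_PER_MILLION_CHARS = {"tts-1": 15.0, "tts-1-hd": 30.0}


def tts_cost(text: str, model: str = "tts-1") -> float:
    """Estimated cost in USD to synthesize `text` with the given model."""
    return len(text) / 1_000_000 * TTS_PRICE_PER_MILLION_CHARS[model]


# A script of about 9,000 characters (roughly 10 minutes of speech).
script = "a" * 9000
print(round(tts_cost(script), 3), round(tts_cost(script, "tts-1-hd"), 3))
```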
Output Formats
OpenAI TTS supports MP3, Opus (for internet streaming), AAC, FLAC, WAV, and PCM output — covering the full range of audio use cases from web streaming to broadcast-quality archival.
Real-Time Audio in ChatGPT (Advanced Voice Mode)
Beyond the API, OpenAI has integrated voice capabilities directly into ChatGPT's Advanced Voice Mode — allowing real-time conversational audio with ChatGPT on mobile and web. This uses different underlying models optimized for low-latency bidirectional audio, not the standard Whisper/TTS API stack.
Advanced Voice Mode is available on ChatGPT Plus and Pro — it represents the consumer-facing experience of OpenAI's audio capabilities rather than the developer API.
Strengths
- Whisper accuracy: Among the most accurate transcription models available, particularly for multilingual audio and accented speech
- Open-source (Whisper): Run locally, free, private — no data leaves your machine
- 99 language support: Broader language coverage than most competitors
- Tight ChatGPT integration: TTS powers ChatGPT's voice features; familiar quality for ChatGPT users
- Developer-friendly: Both tools have clean APIs with official Python and Node.js SDKs
- Affordable API pricing: Whisper at $0.006/minute and TTS at $15/1 million characters are competitive
Limitations & Considerations
- TTS voice variety: Six voices is limited compared to ElevenLabs' 3,000+ voice library; no voice cloning capability
- No real-time Whisper streaming via API: The hosted Whisper API accepts audio files, not continuous live audio streams (real-time transcription requires the open-source model or third-party integration)
- TTS emotion range: OpenAI TTS voices are high quality but somewhat neutral — ElevenLabs offers more expressive, emotionally varied output for dramatic content
- Whisper hallucination: On very low-quality or silent audio segments, Whisper occasionally generates plausible-sounding but incorrect text — always review output for critical transcriptions
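One way to make the review step for hallucinations more targeted is to flag segments the model itself was unsure about. The sketch below assumes segments in the shape produced by the open-source `whisper` package, where each segment dict carries `no_speech_prob` and `avg_logprob`; the thresholds are illustrative starting points and `flag_suspect_segments` is a hypothetical helper, not a guaranteed hallucination detector.

```python
def flag_suspect_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return segments that look like silence or low-confidence decoding.

    Assumption: each segment is a dict with 'no_speech_prob' and
    'avg_logprob', as in the open-source whisper package's output.
    """
    suspects = []
    for seg in segments:
        likely_silence = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_threshold
        if likely_silence or low_confidence:
            suspects.append(seg)
    return suspects


example = [
    {"text": "Let's begin.", "no_speech_prob": 0.05, "avg_logprob": -0.2},
    {"text": "Thanks for watching!", "no_speech_prob": 0.92, "avg_logprob": -0.4},
]
for seg in flag_suspect_segments(example):
    print("Review:", seg["text"])
```

Flagged segments still need a human to listen to the audio; a clean score does not prove a segment is correct.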
Best Use Cases
| Task | Why OpenAI TTS / Whisper |
|---|---|
| Meeting and interview transcription | Whisper API at $0.006/min; high accuracy on spoken dialogue |
| Private transcription of sensitive recordings | Open-source Whisper running locally — no data leaves your machine |
| Subtitle generation for video | Timestamp output enables automatic subtitle file creation |
| Voice interface in an app | TTS API for text-to-audio in chatbots, assistants, reading apps |
| Podcast transcript generation | Batch-process audio files via Whisper API |
| Multilingual transcription | 99-language support; translation to English in one step |
When to choose alternatives:
- Voice cloning or custom voices → ElevenLabs (TTS has only 6 fixed voices)
- Music generation → Suno AI or Udio (TTS is voice only, not music)
- Studio voiceover production with a team workflow → Murf AI
- AI podcast enhancement and noise removal → Adobe Podcast Enhance
Getting Started
Using Whisper via API:
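from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="text",
    )

print(transcript)  # plain text, because response_format="text"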
Using OpenAI TTS via API:
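from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream the generated audio to disk instead of buffering it all in memory.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="Welcome to the AI Pro Playbook.",
) as response:
    response.stream_to_file("output.mp3")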
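Longer scripts usually need to be split before synthesis. The sketch below assumes a per-request input limit of 4,096 characters, which is my understanding of the speech endpoint's documented maximum; verify it against the current API reference. The `split_for_tts` helper is hypothetical and uses a simple punctuation-based sentence split.

```python
import re


def split_for_tts(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", limit=10))
```

Each chunk can then be sent as a separate request and the resulting audio files concatenated; splitting at sentence boundaries keeps the joins from landing mid-phrase.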
Running Whisper open-source locally:
pip install openai-whisper
whisper audio.mp3 --model large-v3 --language English
✅Tip
Choosing between API and local Whisper: Use the API for convenience and large-model accuracy without GPU setup. Use local Whisper when privacy is paramount, you have GPU hardware available, or you're processing audio at a volume where API costs add up.
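To put a number on "a volume where API costs add up", a rough break-even estimate helps. This sketch assumes a fixed monthly self-hosting cost (a hypothetical figure you supply, such as a GPU rental) and the $0.006 per minute API price; it ignores setup time, maintenance, and differences in accuracy.

```python
API_PRICE_PER_AUDIO_HOUR = 0.006 * 60  # USD, i.e. $0.36 per hour of audio


def breakeven_audio_hours(monthly_self_hosting_cost: float) -> float:
    """Hours of audio per month above which a fixed-cost self-hosted
    setup becomes cheaper than the hosted API (simplified estimate)."""
    return monthly_self_hosting_cost / API_PRICE_PER_AUDIO_HOUR


# Hypothetical example: a GPU server that costs $36 per month.
print(round(breakeven_audio_hours(36.0)))
```

Below the break-even volume the API is the cheaper option on cost alone, though privacy requirements may still justify running Whisper locally.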
Key Takeaways
- OpenAI provides two complementary audio tools: Whisper (speech-to-text, open-source) and TTS (text-to-speech, API)
- Whisper's open-source release is its defining feature — run it locally for private transcription with no API fees
- OpenAI TTS offers six high-quality voices for developer integration; excellent for adding voice to applications, though limited in variety compared to ElevenLabs
- Together, they cover the full audio pipeline for many developer use cases — transcribe with Whisper, respond with TTS — within OpenAI's existing API ecosystem