Learning Objectives
- Understand what the Realtime API is and how it differs from OpenAI's batch text-to-speech models like tts-1-hd
- Identify the three voice models in the May 2026 launch and what each is built for
- Evaluate when to choose the Realtime API over batch TTS, ElevenLabs, or Google's voice models
What Is the OpenAI Realtime API?
The OpenAI Realtime API is OpenAI's WebSocket-based interface for bidirectional, low-latency voice conversation — designed for builders who need a model to listen, reason, and speak as a conversation unfolds, rather than processing text and audio in separate batch calls. It is positioned as the successor surface to the original Whisper + GPT + TTS pipeline that builders previously chained together by hand.
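To make the "single bidirectional connection" idea concrete, here is a minimal sketch of the JSON frames a client might exchange over such a socket. The endpoint URL, the `session.update` event shape, and the lowercase model ID `gpt-realtime-2` are assumptions for illustration, not values confirmed by this page; check the official reference before relying on them. The sketch only builds and parses messages, so it runs without a network connection or API key.

```python
import json
from urllib.parse import urlencode

# Assumption: endpoint modeled on the general Realtime WebSocket pattern.
REALTIME_BASE_URL = "wss://api.openai.com/v1/realtime"


def realtime_url(model: str) -> str:
    """Build the WebSocket URL for a given model ID."""
    return f"{REALTIME_BASE_URL}?{urlencode({'model': model})}"


def session_update_event(instructions: str) -> str:
    """Serialize a session-configuration event as a JSON text frame.

    The event shape here is an assumption, not a confirmed schema.
    """
    event = {
        "type": "session.update",
        "session": {"instructions": instructions},
    }
    return json.dumps(event)


def parse_server_event(raw: str) -> tuple[str, dict]:
    """Split an incoming JSON frame into (event type, full payload)."""
    payload = json.loads(raw)
    return payload.get("type", "unknown"), payload


if __name__ == "__main__":
    # "gpt-realtime-2" is a hypothetical identifier for GPT-Realtime-2.
    print(realtime_url("gpt-realtime-2"))
    print(session_update_event("You are a concise support agent."))
```

A real client would pass these strings to a WebSocket library and stream audio frames alongside them; that part is omitted because it depends on the library and audio format you choose.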
On May 7, 2026, OpenAI launched three models on the Realtime API simultaneously, the largest single voice-product release in the API's history:
| Model | What it does | Billing |
|---|---|---|
| GPT-Realtime-2 | GPT-5-class voice reasoning + dialogue | Per-token (input + output) |
| GPT-Realtime-Translate | Real-time translation, 70+ input languages, 13 output languages | Per-minute |
| GPT-Realtime-Whisper | Live speech-to-text transcription | Per-minute |
✅Tip
Documentation: developers.openai.com/api/docs/guides/audio — the canonical Realtime API reference, with WebSocket setup, session configuration, and tool-use patterns
Each Model in Detail
GPT-Realtime-2
The flagship voice model brings "GPT-5-class" reasoning to voice agents — meaning it can plan, use tools, and check its own output mid-conversation with the same quality bar OpenAI sets for text-only GPT-5.5. This is the model to choose when the voice agent needs to make decisions, not just transcribe and reply. Common use cases:
- Customer service voice agents that look up account state, handle policy questions, and escalate when needed
- Live event automation — voice-driven control of lighting, scenes, captions during a live broadcast
- Education and tutoring — interactive voice tutors that adapt difficulty and explanation style
- Creator-platform voice features — podcast co-hosts, voice-driven editing assistants
GPT-Realtime-2 is billed per-token for input and output, similar to text-mode GPT-5.5 billing. Because cost scales with the number of tokens exchanged, long conversations cost more than short exchanges, and concise prompts and instructions directly reduce spend.
GPT-Realtime-Translate
A specialized model for real-time translation across 70+ input languages and 13 output languages, optimized for maintaining conversational pace rather than maximum translation quality on dense written text. The 70-input / 13-output asymmetry reflects realistic deployment patterns — businesses translate into a small set of operating languages (English, Spanish, Mandarin, etc.) far more often than they translate out of them.
Use cases:
- Multilingual customer support: the agent works in English, the customer speaks any of the 70+ input languages, and the customer hears replies in their own language when it is one of the 13 supported output languages
- Travel and tourism applications — phone-based tour guides, hotel concierge bots
- Live event interpreting — multilingual townhalls, conferences with mixed-language audiences
Billing is per-minute, which matches how customers think about translation use — minutes of talk time, not tokens of text.
GPT-Realtime-Whisper
A live speech-to-text transcription model — the streaming successor to the batch Whisper API. The "live" distinction matters: this model emits transcribed text as the speaker talks, rather than waiting for an audio file to be uploaded and processed. Built for:
- Live captioning for broadcasts, lectures, conference talks
- Meeting transcription in real time, not on a 30-second delay
- Dictation tools with sub-second feedback to the writer
- Voice-to-search features in consumer apps
GPT-Realtime-Whisper is also billed per-minute, matching the audio-input duration.
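The difference between live and batch transcription shows up in how a client consumes results: text arrives in fragments and has to be stitched together as it comes in. The sketch below shows one way to do that. The event names `transcript.delta` and `transcript.completed` and the `text` field are hypothetical placeholders, since this page does not specify the wire format.

```python
from dataclasses import dataclass, field


@dataclass
class LiveTranscript:
    """Accumulates partial text so a UI can render captions as they arrive."""

    finalized: list[str] = field(default_factory=list)
    pending: str = ""

    def handle(self, event: dict) -> str:
        """Apply one event and return the text to display right now."""
        kind = event.get("type")
        if kind == "transcript.delta":  # hypothetical event name
            self.pending += event.get("text", "")
        elif kind == "transcript.completed":  # hypothetical event name
            # Prefer the final text if the event carries one;
            # otherwise keep what the deltas accumulated.
            final = event.get("text") or self.pending
            self.finalized.append(final.strip())
            self.pending = ""
        return self.display()

    def display(self) -> str:
        """Join finished segments with the in-progress segment, if any."""
        parts = list(self.finalized)
        if self.pending.strip():
            parts.append(self.pending.strip())
        return " ".join(parts)
```

Keeping finalized and pending text separate lets a captioning UI restyle or correct the in-progress segment without touching text that is already settled.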
Pricing
OpenAI did not publish a unified pricing table at launch; per-token rates for GPT-Realtime-2 follow the GPT-5.5 family billing structure, and per-minute rates for the specialized models will be visible on the OpenAI pricing page. Builders should expect:
- GPT-Realtime-2 to cost meaningfully more per minute than text-only GPT-5.5 once you factor in the tokens generated by realtime audio synthesis
- GPT-Realtime-Translate / Whisper to come in at competitive per-minute rates against ElevenLabs (translate) and Deepgram / Google (transcription)
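Since the two billing shapes use different units, it can help to normalize both to cost per minute of conversation before comparing them. The following is a rough estimator sketch. It contains no real prices: every rate you pass in is your own input, taken from the pricing page or your invoices, and the function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRate:
    """Per-token pricing, expressed in USD per one million tokens."""

    usd_per_million_input: float
    usd_per_million_output: float


def token_session_cost(rate: TokenRate, input_tokens: int, output_tokens: int) -> float:
    """Cost of one session on a per-token model."""
    total = (input_tokens * rate.usd_per_million_input
             + output_tokens * rate.usd_per_million_output)
    return total / 1_000_000


def minute_session_cost(usd_per_minute: float, seconds: float) -> float:
    """Cost of one session on a per-minute model."""
    return usd_per_minute * seconds / 60.0


def effective_cost_per_minute(session_cost: float, seconds: float) -> float:
    """Normalize any session cost to USD per minute for side-by-side comparison."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return session_cost * 60.0 / seconds
```

Feeding measured token counts and session durations from a pilot into these functions gives a per-minute figure for the per-token model that can sit next to the per-minute models in the same dashboard.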
The right comparison is not "Realtime API vs. batch TTS" — those are different product categories. The right comparison is Realtime API vs. ElevenLabs Conversational AI / Deepgram Voice Agents / Google's Gemini Live — the field of low-latency voice-agent platforms.
Strengths
- GPT-5-class voice reasoning: GPT-Realtime-2 is positioned as reaching parity with frontier text reasoning, which sets it apart from the other voice agents in the field
- Three-model coverage: Reasoning, translate, transcribe — covers the majority of voice-agent use cases without leaving the OpenAI ecosystem
- Tool-use built in: Voice agents on Realtime can call functions, search the web, look up account data — same tool-use patterns as text-mode GPT
- WebSocket simplicity: Single bidirectional connection per session — no separate ASR + LLM + TTS chains to manage
- OpenAI ecosystem: Builders already on OpenAI's platform get unified billing, logs, and safety tooling
- Multilingual translate breadth: 70+ input languages is broader than most competitors at general availability
Limitations & Considerations
- WebSocket complexity: Real-time bidirectional voice is harder to debug than request/response; expect to build for retry, reconnect, and graceful degradation
- Cost at scale: Per-token GPT-Realtime-2 billing can climb fast on long sessions; design for handoff to lower-cost models when the conversation stops needing reasoning
- Voice quality vs. specialized TTS: For pure narration (long-form audio, audiobook-style voiceover), specialized batch models like tts-1-hd or ElevenLabs may still produce better single-voice quality
- Privacy and PII: Voice data carries identity signal beyond text; ensure your data-handling and retention policies match the additional risk
- Pricing not unified: Per-token + per-minute billing across the three models requires separate cost monitoring per surface
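The "retry, reconnect, and graceful degradation" point above can be sketched as a reconnect schedule with capped exponential backoff and jitter, plus a simple rule for when to stop retrying and fall back. This is a generic pattern, not something prescribed by the Realtime API; the defaults and the failure threshold are arbitrary starting values to tune.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    max_attempts: int = 6,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one wait time in seconds per reconnect attempt.

    Uses "full jitter": each delay is drawn uniformly between 0 and an
    exponentially growing ceiling, so many clients do not retry in lockstep.
    """
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap, base * factor ** attempt)
        yield rng.uniform(0.0, ceiling)


def should_degrade(consecutive_failures: int, threshold: int = 3) -> bool:
    """Signal when to fall back, for example to a text-only channel."""
    return consecutive_failures >= threshold
```

A session loop would sleep for each yielded delay before reopening the socket, reset its failure counter on a successful connect, and switch to the fallback path once `should_degrade` returns true.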
Best Use Cases
| Use Case | Best Realtime API Model |
|---|---|
| Customer service voice agent (with reasoning) | GPT-Realtime-2 |
| Live multilingual support | GPT-Realtime-Translate |
| Real-time meeting transcription | GPT-Realtime-Whisper |
| Voice tutor or education app | GPT-Realtime-2 |
| Live event captioning | GPT-Realtime-Whisper |
| Phone-based interpreter app | GPT-Realtime-Translate |
When to choose alternatives:
- Long-form narration (audiobooks, podcasts, lesson audio) → batch TTS (tts-1-hd, gpt-4o-mini-tts) or ElevenLabs
- Voice-cloning specifically → ElevenLabs
- Open-source, on-prem voice → Whisper (open weights) + open TTS like Coqui or VITS
- Google ecosystem deployment → Gemini Live or Cloud Speech-to-Text
How the Realtime API Compares to Batch TTS
| Surface | Direction | Latency | Best For |
|---|---|---|---|
| Realtime API (this page) | Bidirectional | Sub-second | Voice agents, live translation, live transcription |
| tts-1-hd / gpt-4o-mini-tts (batch) | Text-in, audio-file-out | Seconds | Long-form narration, lesson audio, voiceover |
| Whisper API (batch) | Audio-file-in, text-out | Seconds | Pre-recorded transcription, podcast indexing |
The Realtime API and batch TTS solve different problems. Most production stacks will use both — Realtime for live conversation surfaces, batch TTS for offline content production.
Getting Started
- Confirm an OpenAI API account with billing configured at platform.openai.com
- Read the Realtime API docs at developers.openai.com/api/docs/guides/audio — start with the WebSocket session example
- Pick the right model for your use case: GPT-Realtime-2 for reasoning, Translate for multilingual, Whisper for transcription
- Build a minimal voice agent with the WebSocket sample code; expect 1 to 3 hours to get a working echo agent, longer for production-grade error handling
- Wire in tool-use callbacks if the agent needs to look up account state, search the web, or take real-world actions
- Monitor per-token (Realtime-2) and per-minute (Translate, Whisper) costs separately during the first week of deployment
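For the tool-use step, one common approach is a small registry that maps tool names to local functions and turns each call into a JSON result the session can send back. The sketch below assumes the model supplies a tool name and a JSON string of arguments; that calling convention, and the `lookup_account` helper, are hypothetical and stand in for whatever the official tool-use docs and your own backend define.

```python
import json
from typing import Any, Callable

ToolFn = Callable[..., Any]
TOOLS: dict[str, ToolFn] = {}


def tool(fn: ToolFn) -> ToolFn:
    """Register a function under its own name so the agent can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def lookup_account(account_id: str) -> dict:
    # Hypothetical stand-in for a real account-service call.
    return {"account_id": account_id, "status": "active"}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run a requested tool and return a JSON string with the outcome."""
    fn = TOOLS.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
        return json.dumps({"result": fn(**args)})
    except (json.JSONDecodeError, TypeError) as exc:
        # Bad JSON or mismatched argument names: report the problem
        # instead of letting it end the voice session.
        return json.dumps({"error": str(exc)})
```

Returning errors as data rather than raising keeps the conversation alive, which matters more in a live voice session than in a request/response flow where the caller can simply retry.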
Key Takeaways
- The OpenAI Realtime API is the unified WebSocket-based voice surface for builders — three new models launched May 7, 2026 cover voice reasoning, translation, and transcription
- GPT-Realtime-2 is the first voice agent with "GPT-5-class" reasoning — choose it when the agent needs to plan, use tools, or make decisions mid-conversation
- GPT-Realtime-Translate handles 70+ input languages and 13 output languages, billed per-minute, optimized for conversational pace
- GPT-Realtime-Whisper streams live speech-to-text — use it for real-time captioning, meeting transcription, and voice-to-search features
- The right comparison is not Realtime vs. batch TTS but Realtime vs. ElevenLabs Conversational AI, Deepgram Voice Agents, and Gemini Live
- For long-form narration like audiobooks or lesson audio, batch TTS models like tts-1-hd or gpt-4o-mini-tts may still produce better single-voice quality than Realtime