Name: OpenAI Realtime API
Author: OpenAI

Learning Objectives

Understand what the Realtime API is and how it differs from OpenAI's batch text-to-speech models like tts-1-hd
Identify the three voice models in the May 2026 launch and what each is built for
Evaluate when to choose the Realtime API over batch TTS, ElevenLabs, or Google's voice models

What Is the OpenAI Realtime API?

The OpenAI Realtime API is OpenAI's WebSocket-based interface for bidirectional, low-latency voice conversation — designed for builders who need a model to listen, reason, and speak as a conversation unfolds, rather than processing text and audio in separate batch calls. It is positioned as the successor surface to the original Whisper + GPT + TTS pipeline that builders previously chained together by hand.

On May 7, 2026, OpenAI launched three models on the Realtime API simultaneously, the largest single voice-product release in the API's history:

Model	What it does	Billing
GPT-Realtime-2	GPT-5-class voice reasoning + dialogue	Per-token (input + output)
GPT-Realtime-Translate	Real-time translation, 70+ input languages, 13 output languages	Per-minute
GPT-Realtime-Whisper	Live speech-to-text transcription	Per-minute

✅Tip

Documentation: developers.openai.com/api/docs/guides/audio — the canonical Realtime API reference, with WebSocket setup, session configuration, and tool-use patterns
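
To make the "single WebSocket session" idea concrete, here is a minimal sketch of the pieces a client prepares before it opens the connection. It makes no network call. The endpoint URL, the lowercase model ID string, and the shape of the session.update event are assumptions for illustration and should be checked against the documentation linked above.

```python
import json
from urllib.parse import urlencode

# Assumption: endpoint and query parameter; verify against the official docs.
REALTIME_BASE_URL = "wss://api.openai.com/v1/realtime"


def build_session_url(model: str) -> str:
    """Return the WebSocket URL for a session with the given model ID."""
    return f"{REALTIME_BASE_URL}?{urlencode({'model': model})}"


def build_auth_headers(api_key: str) -> dict:
    """Return the HTTP headers sent with the WebSocket handshake."""
    return {"Authorization": f"Bearer {api_key}"}


def build_session_update(instructions: str) -> str:
    """Serialize a session configuration event as a JSON text frame.

    Assumption: the event is named "session.update" and carries the
    system instructions under a "session" key.
    """
    event = {"type": "session.update", "session": {"instructions": instructions}}
    return json.dumps(event)
```

A WebSocket client of your choice would connect to build_session_url("gpt-realtime-2") with the headers from build_auth_headers, then send the string from build_session_update as its first message. The model ID here is a hypothetical spelling of the model name in the table above.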

Each Model in Detail

GPT-Realtime-2

The flagship voice model brings "GPT-5-class" reasoning to voice agents — meaning it can plan, use tools, and check its own output mid-conversation with the same quality bar OpenAI sets for text-only GPT-5.5. This is the model to choose when the voice agent needs to make decisions, not just transcribe and reply. Common use cases:

Customer service voice agents that look up account state, handle policy questions, and escalate when needed
Live event automation — voice-driven control of lighting, scenes, captions during a live broadcast
Education and tutoring — interactive voice tutors that adapt difficulty and explanation style
Creator-platform voice features — podcast co-hosts, voice-driven editing assistants

GPT-Realtime-2 is billed per-token for input and output, similar to text-mode GPT-5.5 billing. The token-based pricing makes long conversations more expensive than short turns, but rewards efficient prompt design.
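
As a rough illustration of per-token billing, the sketch below estimates the cost of one session from its token counts. The function name and parameters are hypothetical, and the rates are supplied by the caller because no official per-token prices are given here.

```python
def estimate_token_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Estimate the cost of a per-token billed session.

    Rates are expressed in currency units per one million tokens and must
    come from the current pricing page; none are hard-coded here.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_rate_per_million + output_tokens * output_rate_per_million
    return total / 1_000_000
```

With placeholder rates of 4.0 and 16.0 per million tokens, a session with 1,000,000 input tokens and 500,000 output tokens comes to 12.0. Swapping in real rates shows quickly how a long session compares with a short one.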

GPT-Realtime-Translate

A specialized model for real-time translation across 70+ input languages and 13 output languages, optimized for maintaining conversational pace rather than maximum translation quality on dense written text. The 70-input / 13-output asymmetry reflects realistic deployment patterns — businesses translate into a small set of operating languages (English, Spanish, Mandarin, etc.) far more often than they translate out of them.

Use cases:

Multilingual customer support: the agent works in English, the customer speaks any of the 70+ input languages, and the customer hears replies in their own language when it is one of the 13 supported output languages
Travel and tourism applications — phone-based tour guides, hotel concierge bots
Live event interpreting — multilingual townhalls, conferences with mixed-language audiences

Billing is per-minute, which matches how customers think about translation use — minutes of talk time, not tokens of text.

GPT-Realtime-Whisper

A live speech-to-text transcription model — the streaming successor to the batch Whisper API. The "live" distinction matters: this model emits transcribed text as the speaker talks, rather than waiting for an audio file to be uploaded and processed. Built for:

Live captioning for broadcasts, lectures, conference talks
Meeting transcription in real time, rather than with a 30-second delay
Dictation tools with sub-second feedback to the writer
Voice-to-search features in consumer apps

GPT-Realtime-Whisper is also billed per-minute, matching the audio-input duration.
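
The practical difference between live and batch transcription is that the client receives partial text and has to assemble it. The sketch below shows one way to do that. The event names and fields (item_id, delta, transcript) are assumptions for illustration, not confirmed details of GPT-Realtime-Whisper, and LiveCaption is a hypothetical helper.

```python
from collections import defaultdict
from typing import Optional

# Assumed event names; verify against the official documentation.
DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class LiveCaption:
    """Accumulate streamed transcript fragments into display text."""

    def __init__(self) -> None:
        self._partials = defaultdict(str)  # item_id -> text received so far
        self.final = []                    # finished utterances, in order

    def handle(self, event: dict) -> Optional[str]:
        """Process one server event; return text to display, if any."""
        kind = event.get("type")
        item_id = event.get("item_id", "")
        if kind == DELTA:
            # Append the new fragment and show the running partial caption.
            self._partials[item_id] += event.get("delta", "")
            return self._partials[item_id]
        if kind == COMPLETED:
            # Prefer the final transcript; fall back to the accumulated partial.
            partial = self._partials.pop(item_id, "")
            text = event.get("transcript") or partial
            self.final.append(text)
            return text
        return None  # unrelated event types are ignored
```

A captioning UI would call handle for every incoming message and redraw the current line whenever it returns text, which is what gives the writer or viewer feedback while the speaker is still talking.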

Pricing

OpenAI did not publish a unified pricing table at launch; per-token rates for GPT-Realtime-2 follow the GPT-5.5 family billing structure, and per-minute rates for the specialized models will be visible on the OpenAI pricing page. Builders should expect:

GPT-Realtime-2 to cost meaningfully more per minute of conversation than an equivalent text-only GPT-5.5 exchange, once you factor in the tokens generated by realtime audio synthesis
GPT-Realtime-Translate / Whisper to come in at competitive per-minute rates against ElevenLabs (translate) and Deepgram / Google (transcription)

The right comparison is not "Realtime API vs. batch TTS" — those are different product categories. The right comparison is Realtime API vs. ElevenLabs Conversational AI / Deepgram Voice Agents / Google's Gemini Live — the field of low-latency voice-agent platforms.

Strengths

GPT-5-class voice reasoning: per OpenAI's positioning, GPT-Realtime-2 is currently the only voice model in the field that claims parity with frontier text reasoning
Three-model coverage: Reasoning, translate, transcribe — covers the majority of voice-agent use cases without leaving the OpenAI ecosystem
Tool-use built in: Voice agents on Realtime can call functions, search the web, look up account data — same tool-use patterns as text-mode GPT
WebSocket simplicity: Single bidirectional connection per session — no separate ASR + LLM + TTS chains to manage
OpenAI ecosystem: Builders already on OpenAI's platform get unified billing, logs, and safety tooling
Multilingual translate breadth: 70+ input languages is broader than most competitors at general availability
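
The tool-use point above can be sketched as a tool definition plus a local dispatcher. The lookup_account tool, its schema, and the handler are hypothetical. The flat "type": "function" layout is an assumption about how tools are declared in a session, so check it against the documentation before relying on it.

```python
import json

# Hypothetical tool declaration the client would register with the session.
LOOKUP_ACCOUNT_TOOL = {
    "type": "function",
    "name": "lookup_account",
    "description": "Return the status of a customer account.",
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}


def lookup_account(account_id: str) -> dict:
    """Stand-in for a real database or CRM lookup."""
    return {"account_id": account_id, "status": "active"}


HANDLERS = {"lookup_account": lookup_account}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments; return a JSON result.

    Errors are returned as JSON rather than raised, so the model can be
    told what went wrong and the voice session keeps running.
    """
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": str(exc)})
```

When the model requests a tool call mid-conversation, the client passes the tool name and argument string to dispatch_tool_call and sends the returned JSON back to the session as the tool result.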

Limitations & Considerations

WebSocket complexity: Real-time bidirectional voice is harder to debug than request/response; expect to build for retry, reconnect, and graceful degradation
Cost at scale: Per-token GPT-Realtime-2 billing can climb fast on long sessions; design for handoff to lower-cost models when the conversation stops needing reasoning
Voice quality vs. specialized TTS: For pure narration (long-form audio, audiobook-style voiceover), specialized batch models like tts-1-hd or ElevenLabs may still produce better single-voice quality
Privacy and PII: Voice data carries identity signal beyond text — ensure your data-handling and retention policies match the additional risk
Pricing not unified: Per-token + per-minute billing across the three models requires separate cost monitoring per surface
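
The retry and reconnect advice above can be sketched with capped exponential backoff. This is a generic pattern, not something specific to the Realtime API. The connect callable is a placeholder for whatever function opens your WebSocket session, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 8.0,
    jitter: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one delay per attempt: base, 2*base, 4*base, ... capped at cap.

    If a random.Random is supplied, each delay is drawn uniformly from
    [0, delay] so that many clients do not reconnect in lockstep.
    """
    for attempt in range(attempts):
        delay = min(cap, base * 2 ** attempt)
        if jitter is not None:
            delay = jitter.uniform(0, delay)
        yield delay


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out."""
    last_error = None
    for delay in backoff_delays(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before the next attempt
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Passing sleep as a parameter keeps the function testable without real waiting. When connect_with_retry finally raises, the application can fall back to a degraded mode such as text chat.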

Best Use Cases

Use Case	Best Realtime API Model
Customer service voice agent (with reasoning)	GPT-Realtime-2
Live multilingual support	GPT-Realtime-Translate
Real-time meeting transcription	GPT-Realtime-Whisper
Voice tutor or education app	GPT-Realtime-2
Live event captioning	GPT-Realtime-Whisper
Phone-based interpreter app	GPT-Realtime-Translate

When to choose alternatives:

Long-form narration (audiobooks, podcasts, lesson audio) → batch TTS (tts-1-hd, gpt-4o-mini-tts) or ElevenLabs
Voice-cloning specifically → ElevenLabs
Open-source, on-prem voice → Whisper (open weights) + open TTS like Coqui or VITS
Google ecosystem deployment → Gemini Live or Cloud Speech-to-Text

How the Realtime API Compares to Batch TTS

Surface	Direction	Latency	Best For
Realtime API (this page)	Bidirectional	Sub-second	Voice agents, live translation, live transcription
tts-1-hd / gpt-4o-mini-tts (batch)	Text-in, audio-file-out	Seconds	Long-form narration, lesson audio, voiceover
Whisper API (batch)	Audio-file-in, text-out	Seconds	Pre-recorded transcription, podcast indexing

The Realtime API and batch TTS solve different problems. Most production stacks will use both — Realtime for live conversation surfaces, batch TTS for offline content production.

Getting Started

Confirm an OpenAI API account with billing configured at platform.openai.com
Read the Realtime API docs at developers.openai.com/api/docs/guides/audio — start with the WebSocket session example
Pick the right model for your use case: GPT-Realtime-2 for reasoning, Translate for multilingual, Whisper for transcription
Build a minimal voice agent with the WebSocket sample code; expect 1 to 3 hours to get a working echo agent, longer for production-grade error handling
Wire in tool-use callbacks if the agent needs to look up account state, search the web, or take real-world actions
Monitor per-token (Realtime-2) and per-minute (Translate, Whisper) costs separately during the first week of deployment
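
The last step, monitoring the two billing modes separately, can be sketched as a small in-memory ledger. CostLedger is a hypothetical helper, and every rate and model ID in the usage note is a made-up placeholder rather than a published price.

```python
from collections import defaultdict


class CostLedger:
    """Track spend per model and report it split by billing mode."""

    def __init__(self, token_rates: dict, minute_rates: dict) -> None:
        # token_rates: model -> (input rate, output rate) per 1M tokens
        # minute_rates: model -> rate per minute of audio
        self.token_rates = token_rates
        self.minute_rates = minute_rates
        self.by_model = defaultdict(float)

    def record_tokens(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Add the cost of one per-token billed exchange."""
        rate_in, rate_out = self.token_rates[model]
        cost = (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
        self.by_model[model] += cost

    def record_audio(self, model: str, seconds: float) -> None:
        """Add the cost of audio handled by a per-minute billed model."""
        self.by_model[model] += seconds / 60 * self.minute_rates[model]

    def totals_by_billing_mode(self) -> dict:
        """Return total spend for per-token and per-minute models."""
        per_token = sum(c for m, c in self.by_model.items() if m in self.token_rates)
        per_minute = sum(c for m, c in self.by_model.items() if m in self.minute_rates)
        return {"per_token": per_token, "per_minute": per_minute}
```

For example, a ledger built with placeholder rates such as {"gpt-realtime-2": (4.0, 16.0)} and {"gpt-realtime-translate": 0.03} keeps the reasoning spend and the translation spend in separate totals, which makes it easier to see which surface is driving the bill.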

Key Takeaways

The OpenAI Realtime API is the unified WebSocket-based voice surface for builders — three new models launched May 7, 2026 cover voice reasoning, translation, and transcription
GPT-Realtime-2 is the first voice agent with "GPT-5-class" reasoning — choose it when the agent needs to plan, use tools, or make decisions mid-conversation
GPT-Realtime-Translate handles 70+ input languages and 13 output languages, billed per-minute, optimized for conversational pace
GPT-Realtime-Whisper streams live speech-to-text — use it for real-time captioning, meeting transcription, and voice-to-search features
The right comparison is not Realtime vs. batch TTS but Realtime vs. ElevenLabs Conversational AI, Deepgram Voice Agents, and Gemini Live
For long-form narration like audiobooks or lesson audio, batch TTS models like tts-1-hd or gpt-4o-mini-tts may still produce better single-voice quality than Realtime
