7 min read · Updated May 8, 2026

OpenAI Realtime API

By OpenAI

OpenAI's Realtime API is the unified WebSocket-based voice surface for builders — now hosting GPT-Realtime-2 (GPT-5-class voice reasoning), GPT-Realtime-Translate (real-time translation across 70+ languages), and GPT-Realtime-Whisper (live speech-to-text), all launched May 7, 2026.

Learning Objectives

  • Understand what the Realtime API is and how it differs from OpenAI's batch text-to-speech models like tts-1-hd
  • Identify the three voice models in the May 2026 launch and what each is built for
  • Evaluate when to choose the Realtime API over batch TTS, ElevenLabs, or Google's voice models

What Is the OpenAI Realtime API?

The OpenAI Realtime API is OpenAI's WebSocket-based interface for bidirectional, low-latency voice conversation — designed for builders who need a model to listen, reason, and speak as a conversation unfolds, rather than processing text and audio in separate batch calls. It is positioned as the successor surface to the original Whisper + GPT + TTS pipeline that builders previously chained together by hand.
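To make the "listen as the conversation unfolds" idea concrete, here is a minimal sketch of how a client might slice raw microphone audio into short frames and wrap each one as a JSON message for the socket. The event name `input_audio_buffer.append`, the base64 `audio` field, and the 24 kHz 16-bit mono format follow OpenAI's earlier published Realtime event format and are assumptions here; the models described in this lesson may expect something different, so treat the docs as the authority. The code only builds messages and opens no connection.

```python
import base64
import json

# Assumption: event name and payload shape follow the earlier published
# Realtime event format; verify against the current docs before use.
APPEND_EVENT_TYPE = "input_audio_buffer.append"


def pcm16_chunks(pcm: bytes, sample_rate: int = 24_000, chunk_ms: int = 100):
    """Yield fixed-duration slices of mono 16-bit PCM audio."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


def to_append_event(chunk: bytes) -> str:
    """Serialize one audio slice as a JSON text frame."""
    return json.dumps({
        "type": APPEND_EVENT_TYPE,
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


# One second of silence at 24 kHz, split into 100 ms frames.
silence = bytes(24_000 * 2)
frames = [to_append_event(chunk) for chunk in pcm16_chunks(silence)]
```

Short frames are the design choice that matters: the smaller each slice, the sooner the model can start reacting, at the price of more messages per second.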

On May 7, 2026, OpenAI launched three models on the Realtime API simultaneously, the largest single voice-product release in the API's history:

| Model | What it does | Billing |
| --- | --- | --- |
| GPT-Realtime-2 | GPT-5-class voice reasoning + dialogue | Per-token (input + output) |
| GPT-Realtime-Translate | Real-time translation, 70+ input languages, 13 output languages | Per-minute |
| GPT-Realtime-Whisper | Live speech-to-text transcription | Per-minute |

Tip

Documentation: developers.openai.com/api/docs/guides/audio — the canonical Realtime API reference, with WebSocket setup, session configuration, and tool-use patterns

Each Model in Detail

GPT-Realtime-2

The flagship voice model brings "GPT-5-class" reasoning to voice agents — meaning it can plan, use tools, and check its own output mid-conversation with the same quality bar OpenAI sets for text-only GPT-5.5. This is the model to choose when the voice agent needs to make decisions, not just transcribe and reply. Common use cases:

  • Customer service voice agents that look up account state, handle policy questions, and escalate when needed
  • Live event automation — voice-driven control of lighting, scenes, captions during a live broadcast
  • Education and tutoring — interactive voice tutors that adapt difficulty and explanation style
  • Creator-platform voice features — podcast co-hosts, voice-driven editing assistants

GPT-Realtime-2 is billed per-token for input and output, similar to text-mode GPT-5.5 billing. Because cost scales with tokens rather than with a flat per-session fee, long conversations cost more than short exchanges, and concise prompts and instructions directly lower the bill.

GPT-Realtime-Translate

A specialized model for real-time translation across 70+ input languages and 13 output languages, optimized for maintaining conversational pace rather than maximum translation quality on dense written text. The 70-input / 13-output asymmetry reflects realistic deployment patterns — businesses translate into a small set of operating languages (English, Spanish, Mandarin, etc.) far more often than they translate out of them.

Use cases:

  • Multilingual customer support: the agent works in English while the customer speaks any of the 70+ input languages and hears replies in their own language, provided it is one of the 13 output languages
  • Travel and tourism applications — phone-based tour guides, hotel concierge bots
  • Live event interpreting — multilingual townhalls, conferences with mixed-language audiences

Billing is per-minute, which matches how customers think about translation use — minutes of talk time, not tokens of text.

GPT-Realtime-Whisper

A live speech-to-text transcription model — the streaming successor to the batch Whisper API. The "live" distinction matters: this model emits transcribed text as the speaker talks, rather than waiting for an audio file to be uploaded and processed. Built for:

  • Live captioning for broadcasts, lectures, conference talks
  • Meeting transcription in real time, not 30 seconds delayed
  • Dictation tools with sub-second feedback to the writer
  • Voice-to-search features in consumer apps

GPT-Realtime-Whisper is also billed per-minute, matching the audio-input duration.
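The difference between live and batch transcription shows up in client code as incremental updates. The sketch below keeps a caption line current as partial text arrives and then locks each utterance in once it is final. The event types `transcript.delta` and `transcript.completed` and their `delta` and `transcript` fields are hypothetical placeholders, not confirmed names from the API; map them to whatever the docs specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LiveCaption:
    """Accumulates partial transcript text into finished utterances."""
    finished: List[str] = field(default_factory=list)
    partial: str = ""

    def handle(self, event: Dict[str, str]) -> None:
        kind = event.get("type", "")
        if kind.endswith(".delta"):
            # Partial text: extend the utterance currently being spoken.
            self.partial += event.get("delta", "")
        elif kind.endswith(".completed"):
            # Final text replaces whatever was accumulated from deltas.
            self.finished.append(event.get("transcript", self.partial))
            self.partial = ""

    def display(self) -> str:
        """Text to show right now: finished utterances plus the live partial."""
        tail = [self.partial] if self.partial else []
        return " ".join(self.finished + tail)


caption = LiveCaption()
for event in [
    {"type": "transcript.delta", "delta": "Hello"},          # hypothetical event
    {"type": "transcript.delta", "delta": " wor"},
    {"type": "transcript.completed", "transcript": "Hello world."},
    {"type": "transcript.delta", "delta": "Next"},
]:
    caption.handle(event)
```

Letting the final transcript overwrite the accumulated partial text means late corrections from the model win over early guesses, which is usually what a captioning display wants.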

Pricing

OpenAI did not publish a unified pricing table at launch; per-token rates for GPT-Realtime-2 follow the GPT-5.5 family billing structure, and per-minute rates for the specialized models will be visible on the OpenAI pricing page. Builders should expect:

  • GPT-Realtime-2 to cost meaningfully more per minute of conversation than a comparable text-only GPT-5.5 exchange, once the tokens generated by real-time audio synthesis are counted
  • GPT-Realtime-Translate / Whisper to come in at competitive per-minute rates against ElevenLabs (translate) and Deepgram / Google (transcription)
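Since the two billing units do not compare directly, it can help to put both on one cost sheet. The following is a minimal sketch of that arithmetic. Every rate in it is a made-up placeholder chosen for round numbers, not a published OpenAI price; substitute the figures from the pricing page before drawing any conclusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRate:
    """Per-token pricing, expressed in USD per million tokens."""
    usd_per_million_input: float
    usd_per_million_output: float


def token_cost(rate: TokenRate, input_tokens: int, output_tokens: int) -> float:
    """Cost of one session on a per-token model such as GPT-Realtime-2."""
    total = (input_tokens * rate.usd_per_million_input
             + output_tokens * rate.usd_per_million_output)
    return total / 1_000_000


def minute_cost(usd_per_minute: float, audio_seconds: float) -> float:
    """Cost of one session on a per-minute model (Translate, Whisper)."""
    return usd_per_minute * audio_seconds / 60


# PLACEHOLDER rates for illustration only; these are not real prices.
PLACEHOLDER_TOKEN_RATE = TokenRate(usd_per_million_input=10.0,
                                   usd_per_million_output=20.0)
PLACEHOLDER_USD_PER_MINUTE = 0.05

reasoning_session = token_cost(PLACEHOLDER_TOKEN_RATE, 50_000, 20_000)
captioning_session = minute_cost(PLACEHOLDER_USD_PER_MINUTE, 600)
```

Keeping the two functions separate mirrors the billing split itself: token counts come from usage reports, while per-minute cost only needs the audio duration.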

The right comparison is not "Realtime API vs. batch TTS" — those are different product categories. The right comparison is Realtime API vs. ElevenLabs Conversational AI / Deepgram Voice Agents / Google's Gemini Live — the field of low-latency voice-agent platforms.

Strengths

  • GPT-5-class voice reasoning: GPT-Realtime-2 is currently the only voice agent in the field that can claim parity with frontier text reasoning
  • Three-model coverage: Reasoning, translate, transcribe — covers the majority of voice-agent use cases without leaving the OpenAI ecosystem
  • Tool-use built in: Voice agents on Realtime can call functions, search the web, look up account data — same tool-use patterns as text-mode GPT
  • WebSocket simplicity: Single bidirectional connection per session — no separate ASR + LLM + TTS chains to manage
  • OpenAI ecosystem: Builders already on OpenAI's platform get unified billing, logs, and safety tooling
  • Multilingual translate breadth: 70+ input languages is broader than most competitors at general availability

Limitations & Considerations

  • WebSocket complexity: Real-time bidirectional voice is harder to debug than request/response; expect to build for retry, reconnect, and graceful degradation
  • Cost at scale: Per-token GPT-Realtime-2 billing can climb fast on long sessions; design for handoff to lower-cost models when the conversation stops needing reasoning
  • Voice quality vs. specialized TTS: For pure narration (long-form audio, audiobook-style voiceover), specialized batch models like tts-1-hd or ElevenLabs may still produce better single-voice quality
  • Privacy and PII: Voice data carries identity signal beyond text — ensure your data-handling and retention policies match the additional risk
  • Pricing not unified: Per-token + per-minute billing across the three models requires separate cost monitoring per surface
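The "retry, reconnect, and graceful degradation" point above can be sketched as a small reconnect helper. It uses exponential backoff with full jitter, a common general-purpose pattern rather than anything the Realtime API prescribes. The `connect` callable is a hypothetical stand-in for whatever opens your WebSocket session, and the default delays are arbitrary starting values.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one full-jitter delay per retry: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    for n in range(attempts):
        yield rng.uniform(0.0, min(cap, base * 2 ** n))


def connect_with_retry(connect: Callable[[], T], retries: int = 4,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() once, then up to `retries` more times on ConnectionError."""
    delays = backoff_delays(retries)
    while True:
        try:
            return connect()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error to the caller
            sleep(delay)
```

When the final retry fails, the caller receives the `ConnectionError` and can degrade gracefully, for example by falling back to a text channel or asking the user to call back. Passing `sleep` as a parameter keeps the helper testable without real waiting.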

Best Use Cases

| Use Case | Best Realtime API Model |
| --- | --- |
| Customer service voice agent (with reasoning) | GPT-Realtime-2 |
| Live multilingual support | GPT-Realtime-Translate |
| Real-time meeting transcription | GPT-Realtime-Whisper |
| Voice tutor or education app | GPT-Realtime-2 |
| Live event captioning | GPT-Realtime-Whisper |
| Phone-based interpreter app | GPT-Realtime-Translate |
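The mapping above reduces to two questions: does the agent need to make decisions, and does the conversation cross languages? A minimal sketch of that routing follows. The model strings are the display names used in this lesson, not confirmed API identifiers, and the priority order (reasoning first, then translation) is an assumption made for illustration.

```python
from typing import NamedTuple


class Choice(NamedTuple):
    model: str
    billing: str


# Display names and billing units come from this lesson's tables;
# the identifiers the API actually expects may differ.
ROUTES = {
    "reasoning": Choice("GPT-Realtime-2", "per-token"),
    "translation": Choice("GPT-Realtime-Translate", "per-minute"),
    "transcription": Choice("GPT-Realtime-Whisper", "per-minute"),
}


def choose_model(needs_decisions: bool, crosses_languages: bool) -> Choice:
    """Pick a model for a voice surface; reasoning takes priority (assumed)."""
    if needs_decisions:
        return ROUTES["reasoning"]
    if crosses_languages:
        return ROUTES["translation"]
    return ROUTES["transcription"]
```

Returning the billing unit alongside the model makes it easy to send each session's usage to the right cost monitor.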

When to choose alternatives:

  • Long-form narration (audiobooks, podcasts, lesson audio) → batch TTS (tts-1-hd, gpt-4o-mini-tts) or ElevenLabs
  • Voice-cloning specifically → ElevenLabs
  • Open-source, on-prem voice → Whisper (open weights) + open TTS like Coqui or VITS
  • Google ecosystem deployment → Gemini Live or Cloud Speech-to-Text

How the Realtime API Compares to Batch TTS

| Surface | Direction | Latency | Best For |
| --- | --- | --- | --- |
| Realtime API (this page) | Bidirectional | Sub-second | Voice agents, live translation, live transcription |
| tts-1-hd / gpt-4o-mini-tts (batch) | Text-in, audio-file-out | Seconds | Long-form narration, lesson audio, voiceover |
| Whisper API (batch) | Audio-file-in, text-out | Seconds | Pre-recorded transcription, podcast indexing |

The Realtime API and batch TTS solve different problems. Most production stacks will use both — Realtime for live conversation surfaces, batch TTS for offline content production.

Getting Started

  1. Confirm an OpenAI API account with billing configured at platform.openai.com
  2. Read the Realtime API docs at developers.openai.com/api/docs/guides/audio — start with the WebSocket session example
  3. Pick the right model for your use case: GPT-Realtime-2 for reasoning, Translate for multilingual, Whisper for transcription
  4. Build a minimal voice agent with the WebSocket sample code; expect 1 to 3 hours to get a working echo agent, longer for production-grade error handling
  5. Wire in tool-use callbacks if the agent needs to look up account state, search the web, or take real-world actions
  6. Monitor per-token (Realtime-2) and per-minute (Translate, Whisper) costs separately during the first week of deployment
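Step 5 is the part most likely to be unfamiliar, so here is a minimal sketch of the tool-use loop: declare a tool when configuring the session, then answer the model's call with the tool's result. The event shapes (`session.update`, `conversation.item.create`, `function_call_output`, and the `name` / `call_id` / `arguments` fields) follow OpenAI's earlier published Realtime event format and are assumptions for the models covered here. `lookup_account` is a hypothetical tool invented for this example. Nothing is sent over the network.

```python
import json
from typing import Any, Callable, Dict


def lookup_account(account_id: str) -> Dict[str, Any]:
    """Hypothetical tool: a real agent would query a backend here."""
    return {"account_id": account_id, "status": "active"}


TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"lookup_account": lookup_account}


def session_update_event(instructions: str) -> str:
    """Build the message that configures the session and declares the tool."""
    return json.dumps({
        "type": "session.update",  # assumed event name
        "session": {
            "instructions": instructions,
            "tools": [{
                "type": "function",
                "name": "lookup_account",
                "description": "Fetch the current state of a customer account.",
                "parameters": {
                    "type": "object",
                    "properties": {"account_id": {"type": "string"}},
                    "required": ["account_id"],
                },
            }],
        },
    })


def handle_tool_call(event: Dict[str, Any]) -> str:
    """Run the requested tool and build the reply carrying its output."""
    args = json.loads(event["arguments"])  # arguments arrive as a JSON string
    result = TOOLS[event["name"]](**args)
    return json.dumps({
        "type": "conversation.item.create",  # assumed event name
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(result),
        },
    })
```

A registry keyed by tool name keeps the dispatcher short, and an unknown name fails loudly with a `KeyError` instead of silently doing nothing. In production the handler would also validate arguments before calling the tool.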

Key Takeaways

  • The OpenAI Realtime API is the unified WebSocket-based voice surface for builders — three new models launched May 7, 2026 cover voice reasoning, translation, and transcription
  • GPT-Realtime-2 is the first voice agent with "GPT-5-class" reasoning — choose it when the agent needs to plan, use tools, or make decisions mid-conversation
  • GPT-Realtime-Translate handles 70+ input languages and 13 output languages, billed per-minute, optimized for conversational pace
  • GPT-Realtime-Whisper streams live speech-to-text — use it for real-time captioning, meeting transcription, and voice-to-search features
  • The right comparison is not Realtime vs. batch TTS but Realtime vs. ElevenLabs Conversational AI, Deepgram Voice Agents, and Gemini Live
  • For long-form narration like audiobooks or lesson audio, batch TTS models like tts-1-hd or gpt-4o-mini-tts may still produce better single-voice quality than Realtime
