OpenAI launches three new real‑time audio API models, including GPT‑Realtime‑2
#AI


Laptops Reporter
5 min read

OpenAI’s Realtime API leaves beta and adds three streaming audio models—GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper—each designed for continuous, low‑latency voice interactions. The flagship GPT‑Realtime‑2 brings GPT‑5‑class reasoning to live calls, a 128 K token context window, and built‑in agentic behaviors, while the translation and transcription models target multilingual support and live captioning at competitive per‑minute rates.

OpenAI expands its Realtime API with three streaming audio models


OpenAI announced that its Realtime API is now generally available, and with the launch it ships three new models that operate on a continuous audio stream rather than the traditional “record‑then‑process” workflow. The headline model, GPT‑Realtime‑2, is the first voice‑enabled system built on the same reasoning core that powers the GPT‑5 family. Two specialist variants—GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper—cover live translation and live transcription, respectively. All three are reachable through the standard OpenAI API and the developer playground.


What’s new with GPT‑Realtime‑2?

Feature               GPT‑Realtime‑1.5       GPT‑Realtime‑2
Reasoning core        GPT‑4‑class            GPT‑5‑class
Context window        32 K tokens            128 K tokens
Input pricing         $0.032 per M tokens    $0.032 per M tokens
Output pricing        $0.064 per M tokens    $0.064 per M tokens
Cached‑input pricing  $0.0004 per M tokens   $0.0004 per M tokens

The most visible change is the continuous‑stream architecture. Instead of waiting for a full transcription before generating a response, GPT‑Realtime‑2 consumes the audio waveform as it arrives, updates its internal representation, and begins speaking as soon as it has enough context. This eliminates the typical half‑second to one‑second pause that occurs when a separate speech‑to‑text step hands off to a text‑generation model.
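The difference can be sketched with a toy pipeline in Python (no real audio involved): the batch version waits for the full utterance before responding, while the streaming version starts as soon as it has accumulated enough context. The "enough context" rule here is a stand-in for whatever internal heuristic the model actually uses.

```python
def batch_pipeline(chunks, respond):
    # "Record-then-process": wait for the full utterance, then respond.
    return respond("".join(chunks))

def streaming_pipeline(chunks, respond, min_chunks=2):
    # Consume audio chunks as they arrive and start responding as soon
    # as enough context (here, a fixed chunk count) has accumulated.
    heard = []
    for chunk in chunks:
        heard.append(chunk)
        if len(heard) >= min_chunks:
            break  # enough context: begin speaking early
    return respond("".join(heard))

chunks = ["what ", "is ", "my ", "order ", "status"]
respond = lambda text: "heard: " + text.strip()
print(batch_pipeline(chunks, respond))      # heard: what is my order status
print(streaming_pipeline(chunks, respond))  # heard: what is
```

The latency win comes entirely from where the response begins, not from faster inference: the streaming path simply stops waiting.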

Agentic behavior baked in

OpenAI describes the model as “agentic” for voice calls. Three mechanisms make that possible:

  1. Preambles – The model can emit filler phrases such as “One moment, let me check that” while it fires a tool call. Users never hear dead air.
  2. Parallel tool calls – Multiple back‑end requests can be launched at once, and the model narrates which one is in flight (e.g., “I’m pulling your order history while I also check the latest shipping rates”).
  3. Recovery speech – If a tool fails, the model explains the error out loud instead of freezing, preserving the conversational flow.
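The three mechanisms above map onto a familiar concurrency pattern. The sketch below is an illustration in plain asyncio, not OpenAI's implementation: the tool names and the `say` callback are invented for the example.

```python
import asyncio

async def check_orders():
    await asyncio.sleep(0.05)  # simulated back-end latency
    return "order #1234 shipped"

async def check_rates():
    await asyncio.sleep(0.05)
    raise RuntimeError("rates service timed out")

async def agentic_turn(say):
    # 1. Preamble: speak immediately so the caller never hears dead air.
    say("One moment, let me check that.")
    # 2. Parallel tool calls: launch both back-end requests at once.
    results = await asyncio.gather(
        check_orders(), check_rates(), return_exceptions=True
    )
    for result in results:
        if isinstance(result, Exception):
            # 3. Recovery speech: explain the failure instead of freezing.
            say(f"I hit a snag fetching that: {result}")
        else:
            say(result)

spoken = []
asyncio.run(agentic_turn(spoken.append))
print(spoken)
```

In the real API the model decides when to emit a preamble and how to phrase the recovery line; the point of the sketch is only that filler speech, fan-out, and error narration are concurrent, not sequential.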

Benchmarks and real‑world impact

  • Big Bench Audio – GPT‑Realtime‑2 scores 15.2 % higher than its predecessor.
  • Audio Multichallenger – Instruction‑following improves by 13.8 %.
  • Zillow field test – After prompt tuning, the model lifts the hardest adversarial call success rate from 69 % to 95 % (a 26‑point gain).

These numbers matter because they translate directly into lower friction for voice‑first applications such as virtual assistants, support hotlines, and interactive tutoring platforms.


GPT‑Realtime‑Translate – live multilingual bridge

The translation model focuses on continuous speech translation. It accepts spoken input in more than 70 source languages and can output translations in 13 target languages. Because it never waits for sentence boundaries, it is suited for:

  • Customer‑support agents handling cross‑border calls.
  • Real‑time captioning of webinars and live events.
  • Classroom language labs where students converse in their native tongue.

BolnaAI, which builds voice AI for Indian‑language markets, reports a 12.5 % reduction in word‑error rate for Hindi, Tamil, and Telugu compared with their previous pipeline.

Pricing: $0.034 per minute of processed audio (input + output). At that rate a 30‑minute support call costs roughly $1.02, making it competitive with human interpreter fees.
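Per-minute pricing makes cost estimates trivial to compute. A minimal helper, using the two rates quoted in this article:

```python
def call_cost(minutes: float, rate_per_minute: float) -> float:
    """Cost of a call under per-minute audio pricing, in dollars."""
    return round(minutes * rate_per_minute, 2)

TRANSLATE_RATE = 0.034  # $/min of processed audio, input + output
WHISPER_RATE = 0.017    # $/min of audio

print(call_cost(30, TRANSLATE_RATE))   # 1.02 — the 30-minute support call
print(call_cost(120, WHISPER_RATE))    # 2.04 — a two-hour conference
```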


GPT‑Realtime‑Whisper – streaming transcription for accessibility

OpenAI extends its popular Whisper model into a streaming version. The original Whisper was designed for batch transcription; the new variant emits tokens as the speaker talks, enabling:

  • Live meeting captions.
  • Courtroom documentation where every word must be recorded in real time.
  • Accessibility overlays for video streams.

It retains Whisper’s strong multilingual accuracy while adding the low‑latency edge required for live use.
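The batch-versus-streaming contrast can be shown with a toy caption emitter: instead of returning one transcript at the end, partial caption lines go out as tokens arrive. The flush rule (a fixed token count) is a simplification for illustration; a real captioner would segment on timing and punctuation.

```python
def stream_captions(tokens, emit, flush_every=3):
    # Batch transcription returns the transcript once, at the end.
    # A streaming variant emits partial caption lines as tokens arrive.
    line = []
    for token in tokens:
        line.append(token)
        if len(line) >= flush_every:
            emit(" ".join(line))
            line = []
    if line:
        emit(" ".join(line))  # flush whatever remains

shown = []
stream_captions("the motion passes by unanimous vote".split(), shown.append)
print(shown)  # ['the motion passes', 'by unanimous vote']
```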

Pricing: $0.017 per minute of audio, the most affordable of the three models. A two‑hour conference would cost just $2.04.


Additional platform upgrades

The Realtime API rollout also adds:

  • MCP server support – agents can call tools exposed by remote MCP (Model Context Protocol) servers, letting developers connect existing tool servers instead of writing custom integrations.
  • Image input – a single API call can now carry both audio and a static image, opening possibilities for multimodal agents that reference visual context while speaking.
  • SIP phone integration – direct telephony endpoints can be hooked into the API, simplifying enterprise deployments that still rely on traditional PBX systems.
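To make the audio-plus-image point concrete, here is how a single conversation item carrying both content types might be assembled. The event and field names below mirror the general style of OpenAI's Realtime events but are assumptions for illustration, not documented payload shapes; check the API reference before relying on them.

```python
import base64
import json

def build_multimodal_item(audio_bytes: bytes, image_bytes: bytes) -> str:
    # Hypothetical event combining audio and a static image in one item.
    # Field names are assumed, not taken from official documentation.
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "audio": base64.b64encode(audio_bytes).decode()},
                {"type": "input_image",
                 "image": base64.b64encode(image_bytes).decode()},
            ],
        },
    }
    return json.dumps(event)

payload = build_multimodal_item(b"\x00\x01", b"\x89PNG")
print(json.loads(payload)["item"]["content"][0]["type"])  # input_audio
```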

Who should care?

  • Enterprise developers – Ship production‑grade voice agents without stitching together separate ASR, LLM, and TTS services.
  • Contact‑center operators – Higher success rates and lower latency translate into faster resolution and lower agent overhead.
  • Product teams building multilingual experiences – One model family covers translation, transcription, and reasoning, reducing stack complexity and cost.
  • Accessibility advocates – Real‑time captions at roughly a dollar per hour of audio make inclusive design financially viable.

If you are already using OpenAI’s text‑only completions, the migration path is straightforward: swap the endpoint to the Realtime API, enable streaming, and let the model handle both speech and text in a single flow.
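The endpoint swap can be sketched with stdlib code only. Realtime sessions run over WebSocket rather than HTTP POST, with the model selected via a query parameter; the model identifier `gpt-realtime-2` is taken from this announcement and should be verified against the official model list.

```python
import os
from urllib.parse import urlencode

def realtime_url(model: str) -> str:
    # Realtime sessions use a WebSocket endpoint; the model is chosen
    # with a query parameter instead of a JSON body field.
    return "wss://api.openai.com/v1/realtime?" + urlencode({"model": model})

def auth_headers() -> dict:
    # Same bearer-token auth as the text-only endpoints.
    return {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"}

print(realtime_url("gpt-realtime-2"))
# wss://api.openai.com/v1/realtime?model=gpt-realtime-2
```

A WebSocket client (e.g. the `websockets` package) would then connect with these headers and exchange streaming audio events over the open socket.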


Bottom line

The Realtime API’s exit from beta marks a turning point for voice AI. By unifying reasoning, translation, and transcription under a streaming architecture, OpenAI removes the biggest source of latency in current voice assistants. GPT‑Realtime‑2’s 128 K token window makes complex, multi‑turn dialogues feasible without external memory tricks, while the specialist Translate and Whisper models provide cost‑effective options for multilingual and accessibility use cases. For developers looking to embed live, agentic voice experiences, the new Realtime API is now a production‑ready option.
