OpenAI has new voice models that reason, translate, and transcribe as you speak

OpenAI’s latest Realtime API expansion delivers three specialized voice models that bring GPT-5-class reasoning, live 70-language translation, and low-latency streaming transcription to iOS and Android apps, with immediate availability and updated pricing for developers.

OpenAI announced three new specialized realtime voice models for its Realtime API on May 7, 2026, designed to let developers build voice-driven app experiences with built-in reasoning, live translation, and instant transcription. The update expands the Realtime API’s capabilities beyond basic voice input-output, adding GPT-5-class reasoning to voice interactions for the first time.

The three new models each target a specific use case (a minimal Swift session sketch follows the list):

  • GPT-Realtime-2: The flagship model, powered by GPT-5-class reasoning. It handles full voice conversations, including interruptions, corrections, tool calls (such as checking a user’s calendar or sending a message), and context carryover across multiple turns. Unlike previous voice models that required separate text reasoning steps, GPT-Realtime-2 processes audio input, reasons through requests, and generates audio responses in a single pipeline.
  • GPT-Realtime-Translate: A live translation model that supports 70 input languages and 13 output languages, translating speech in real time as the speaker talks. It preserves the speaker’s tone and pace, avoiding the lag common with batch translation tools.
  • GPT-Realtime-Whisper: A streaming speech-to-text model that transcribes audio live as it is captured, with latency low enough to power real-time captions, meeting notes, and accessibility features. It is a streaming variant of the original Whisper model, optimized for continuous audio input rather than batch file processing.
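
Each model is selected by name when opening a Realtime session. Here is a minimal Swift sketch of what that might look like; the WebSocket endpoint follows OpenAI’s existing Realtime API convention, and the lowercase model identifiers are assumptions derived from the announced names, not documented values.

```swift
import Foundation

// Minimal sketch: open a Realtime API session for one of the three models.
// The endpoint shape mirrors OpenAI's current Realtime WebSocket API; the
// exact identifiers for the new models are assumptions.
enum RealtimeModel: String {
    case conversation  = "gpt-realtime-2"
    case translation   = "gpt-realtime-translate"
    case transcription = "gpt-realtime-whisper"
}

func openRealtimeSession(model: RealtimeModel, apiKey: String) -> URLSessionWebSocketTask {
    var request = URLRequest(
        url: URL(string: "wss://api.openai.com/v1/realtime?model=\(model.rawValue)")!
    )
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

    let task = URLSession.shared.webSocketTask(with: request)
    task.resume()  // starts the handshake; session events then arrive as JSON messages
    return task
}
```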

All three models are available immediately in the OpenAI Realtime API. Pricing varies by model: GPT-Realtime-2 costs $32 per 1 million audio input tokens ($0.40 per 1 million cached input tokens) and $64 per 1 million audio output tokens. GPT-Realtime-Translate is priced at $0.034 per minute of audio processed. GPT-Realtime-Whisper costs $0.017 per minute of audio processed. Developers can test the models in the OpenAI Playground, and those using Codex can auto-add GPT-Realtime-2 to existing apps with a single prompt.

For iOS and Android developers, these models remove long-standing pain points in building voice features. Previously, adding voice reasoning or live translation to an app required stitching together three to four separate APIs: audio capture, speech-to-text, text reasoning or translation, and text-to-speech. Each additional step added latency, increased error rates, and raised costs. The new Realtime API models collapse this pipeline into a single audio-in, audio-out flow, reducing latency to under 500ms for most use cases, according to OpenAI’s benchmarks.
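
In practice, the collapsed pipeline means an app pushes microphone audio up one WebSocket and plays response audio from the same socket. The sketch below assumes the new models keep the current Realtime API’s event schema (`input_audio_buffer.append` for uploads, `response.audio.delta` for audio output), which the announcement does not confirm.

```swift
import Foundation

// Send one chunk of captured audio to an open Realtime session.
func streamAudio(chunk: Data, over socket: URLSessionWebSocketTask) {
    let event: [String: Any] = [
        "type": "input_audio_buffer.append",
        "audio": chunk.base64EncodedString()  // PCM16 audio, base64-encoded
    ]
    let json = try! JSONSerialization.data(withJSONObject: event)
    socket.send(.string(String(data: json, encoding: .utf8)!)) { _ in }
}

// Read events off the socket and hand audio deltas to the playback engine.
func receiveAudio(from socket: URLSessionWebSocketTask, play: @escaping (Data) -> Void) {
    socket.receive { result in
        if case .success(.string(let text)) = result,
           let data = text.data(using: .utf8),
           let event = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
           event["type"] as? String == "response.audio.delta",
           let delta = event["delta"] as? String,
           let audio = Data(base64Encoded: delta) {
            play(audio)
        }
        receiveAudio(from: socket, play: play)  // keep listening for the next event
    }
}
```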

Platform requirements for integration are consistent across iOS and Android. Developers need to target iOS 16 or higher, or Android 10 (API level 29) or higher, to access the low-latency audio capture APIs required for streaming to the Realtime API. The latest OpenAI client SDKs for iOS (version 2.1.0+) and Android (version 2.1.0+) include native support for the new voice model endpoints, with Kotlin and Swift sample code for audio streaming. Cross-platform developers using React Native, Flutter, or .NET MAUI can use the REST-based Realtime API directly, though they will need to implement native audio capture modules for optimal performance, as JavaScript or Dart threads cannot handle real-time audio streaming reliably. React Native developers can use the react-native-audio-toolkit library for native audio capture, while Flutter developers can use the flutter_sound package to interface with platform-specific audio APIs.
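
On iOS, the low-latency capture path is AVAudioEngine, which predates iOS 16. A sketch of the tap-based pattern the Swift samples presumably build on is below; the buffer size, and the resampling step to whatever PCM format the API expects, are left as assumptions.

```swift
import AVFoundation

// Low-latency microphone capture via AVAudioEngine. A real app must also
// configure AVAudioSession and request microphone permission first.
let engine = AVAudioEngine()

func startCapture(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)  // hardware sample rate and channel count

    // A small buffer keeps capture-to-network latency low.
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        onBuffer(buffer)  // resample to PCM16 here, then stream to the Realtime API
    }
    try engine.start()
}
```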

The reasoning capabilities in GPT-Realtime-2 are particularly impactful for mobile apps. Previous voice assistants built into apps often felt rigid, unable to handle mid-conversation corrections or context switches. GPT-Realtime-2 natively supports interruptions, so if a user cuts off the model’s response to add new information, the model stops generating audio, processes the new input, and responds appropriately without requiring custom app logic to manage conversation state. It also supports tool calls, letting apps trigger native functionality (such as opening a map, creating a calendar event, or sending a text message) mid-conversation, all via voice input.
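
Tool calls work by describing app functions to the model up front, then running them natively when the model asks. The sketch below follows the current Realtime API’s `session.update` function-calling shape; the `create_calendar_event` tool, its schema, and the dispatch stub are illustrative, not a shipped interface.

```swift
import Foundation

// Register a native tool with the session so the model can invoke it
// mid-conversation. Tool name and schema here are hypothetical.
let sessionUpdate: [String: Any] = [
    "type": "session.update",
    "session": [
        "tools": [[
            "type": "function",
            "name": "create_calendar_event",
            "description": "Create an event in the user's calendar",
            "parameters": [
                "type": "object",
                "properties": [
                    "title": ["type": "string"],
                    "start": ["type": "string", "description": "ISO 8601 start time"]
                ],
                "required": ["title", "start"]
            ]
        ]]
    ]
]

// When the model emits a function call, run the native action (EventKit,
// MessageUI, MapKit, and so on) and send the result back as a new event.
func handleToolCall(name: String, argumentsJSON: String) {
    guard name == "create_calendar_event" else { return }
    // ...parse argumentsJSON and create the event with EventKit...
}
```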

The translation model fills a gap for travel, education, and accessibility apps. With support for 70 input languages, apps can serve global user bases without maintaining custom translation stacks. For example, a cross-platform travel app could use GPT-Realtime-Translate to let users hold live conversations with locals in 13 output languages, with minimal lag between the speaker’s words and the translated audio. The transcription model, meanwhile, is a natural fit for meeting notes apps, live captioning tools, and accessibility features for deaf or hard-of-hearing users. Because it transcribes speech as it is spoken, captions appear in real time rather than waiting for a speaker to finish a sentence or paragraph.
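
For a travel app, wiring up the translation model could be as small as a per-session language setting. Only the 70-input/13-output figures in this sketch come from the announcement; the `input_language` and `output_language` field names are hypothetical, since the article does not document the configuration surface.

```swift
// Hypothetical session configuration for GPT-Realtime-Translate.
let translateSession: [String: Any] = [
    "type": "session.update",
    "session": [
        "input_language": "auto",  // detect any of the 70 supported input languages
        "output_language": "es"    // one of the 13 supported output languages
    ]
]
```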

Pricing for the new models is competitive for real-time use cases, though more expensive than batch alternatives. GPT-Realtime-Whisper costs $0.017 per minute, compared to $0.006 per minute for batch Whisper processing. The tradeoff is latency: batch Whisper requires waiting for a full audio clip to finish before processing, while the streaming model transcribes in real time. For apps where latency matters more than cost, the streaming model is the better choice. GPT-Realtime-Translate’s $0.034 per minute pricing is in line with other live translation APIs, and cheaper than combining separate speech-to-text and translation APIs. GPT-Realtime-2’s token-based pricing is more variable, but cached input tokens cut costs by nearly 99% ($0.40 versus $32 per 1 million input tokens) for repeated audio inputs, such as wake words or common phrases.
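
The per-minute rates make cost planning simple arithmetic, as this quick check using only the published numbers shows:

```swift
// Back-of-the-envelope costs from the published per-minute rates.
let streamingWhisperPerMin = 0.017  // USD, GPT-Realtime-Whisper
let batchWhisperPerMin     = 0.006  // USD, batch Whisper
let translatePerMin        = 0.034  // USD, GPT-Realtime-Translate

let hour = 60.0
let liveHour      = streamingWhisperPerMin * hour  // $1.02 per hour of live captions
let batchHour     = batchWhisperPerMin * hour      // $0.36 per hour, processed after the fact
let translateHour = translatePerMin * hour         // $2.04 per hour of live translation
let premium       = liveHour / batchHour           // ~2.8x premium for real-time transcription
```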

Migration to the new models is straightforward for developers already using the Realtime API. To switch from older voice models, update the model parameter in your API requests to the new model name (e.g., change gpt-4o-realtime to gpt-realtime-2). Developers using separate transcription and reasoning pipelines should audit their existing voice flows to replace multi-step API calls with a single model call. For example, a meeting notes app that previously used batch Whisper for transcription and GPT-4 for summarization can switch to GPT-Realtime-Whisper for live transcription, then use GPT-Realtime-2 to generate summaries once the meeting ends, cutting out two separate API calls.
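
For apps already on the Realtime API, the model swap really is a one-string change in the connection URL (the exact legacy identifier in your code may differ):

```swift
// Before: older voice model
let legacyURL  = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime"
// After: same endpoint, new model name
let updatedURL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
```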

Developers new to the Realtime API should start by testing the models in the Playground to understand latency and output quality for their use case. The OpenAI Realtime API documentation includes sample code for iOS, Android, and cross-platform integrations, as well as guides for handling audio streaming and error cases. The models are cloud-only, so developers building offline voice features will need to stick with on-device options like Apple’s Speech framework or Android’s SpeechRecognizer, though those lack the reasoning and translation capabilities of OpenAI’s new models.
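
On the offline side, Apple’s Speech framework can be forced to stay on-device, as in the minimal sketch below; the locale is a placeholder, and a real app also needs microphone and speech-recognition permissions.

```swift
import Speech

// On-device transcription fallback: runs locally, but offers none of the
// reasoning or translation capabilities of the cloud models.
func transcribeOffline(fileURL: URL) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.supportsOnDeviceRecognition else { return }

    let request = SFSpeechURLRecognitionRequest(url: fileURL)
    request.requiresOnDeviceRecognition = true  // never sends audio to a server

    _ = recognizer.recognitionTask(with: request) { result, _ in
        if let result {
            print(result.bestTranscription.formattedString)
        }
    }
}
```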
