OpenAI Expands Voice AI API With New Realtime Models for Reasoning, Translation and Live Transcription

OpenAI has added new voice intelligence features to its API, making it easier for developers to build apps that can talk, listen, translate, and transcribe in real time.

In its May 7 announcement, OpenAI introduced three audio models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. These models are designed for a new generation of voice apps that can reason, translate, and transcribe as people speak.

A stronger push beyond simple voice chat

The main highlight of the update is GPT-Realtime-2. OpenAI calls it its first voice model with GPT-5-level reasoning, able to handle tougher requests and keep conversations flowing naturally.

The company says it wants to move real-time audio beyond simple call-and-response, aiming for voice interfaces that can actually get things done.

TechCrunch also noted that GPT-Realtime-2 is designed to sound more realistic and handle more complex requests than the earlier GPT-Realtime-1.5 model.

OpenAI shared performance details for the new model. GPT-Realtime-2 supports a larger context window, up from 32K to 128K tokens, and lets developers choose among reasoning levels ranging from minimal to xhigh.

In its own tests, OpenAI said GPT-Realtime-2 scored 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio for audio intelligence. The xhigh version also scored 13.8% higher on Audio MultiChallenge for following instructions and managing live conversations.

Translation and transcription are getting more ambitious

The update goes beyond making voice responses sound more natural. OpenAI said GPT-Realtime-Translate can translate speech from over 70 input languages into 13 output languages, keeping up with the speaker in real time.

The tool is aimed at customer support, education, events, media, cross-border sales, and creator platforms.

For transcription, OpenAI introduced GPT-Realtime-Whisper as a streaming speech-to-text tool that transcribes speech “live as the speaker talks.”

TechCrunch described it as a new transcription capability that gives users live speech-to-text as interactions happen, rounding out a product bundle that now covers speaking, understanding, translating and transcribing in the same API family.

OpenAI is targeting real business use cases

OpenAI is focusing these features on real, revenue-generating uses instead of just demos. Customer service is an obvious target, and there is also potential in education, media, events, and creator platforms.

OpenAI’s announcement gave examples from Zillow, Deutsche Telekom, Vimeo, and BolnaAI. The company believes voice is becoming one of the most natural ways for people to use software, and that useful voice agents need to do more than respond quickly. They should keep context, adapt when requests change, and use tools as the conversation goes on.

Guardrails, data residency and pricing

OpenAI also highlighted safety and enterprise readiness in the launch. The company said its usage policies ban outputs used for spam, deception, or other harmful purposes.

OpenAI has added guardrails to prevent abuse like spam and fraud, including triggers that can stop conversations if harmful content is detected. The company also said the Realtime API fully supports EU Data Residency for EU-based apps and that the models follow its enterprise privacy commitments.

The pricing shows that OpenAI views these tools as commercial infrastructure, not just experiments. GPT-Realtime-2 costs $32 for 1 million audio input tokens and $64 for 1 million audio output tokens.

GPT-Realtime-Translate is $0.034 per minute, and GPT-Realtime-Whisper is $0.017 per minute. All three are available through OpenAI’s Realtime API, with Translate and Whisper billed by the minute and GPT-Realtime-2 billed by token use.
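
To put those rates in perspective, here is a rough cost sketch in Python. It uses only the prices quoted above; the function names and the sample usage figures are hypothetical, and it ignores any charges the announcement coverage does not mention, such as text tokens or discounts.

```python
# Prices in USD as quoted in this article; everything else is illustrative.
REALTIME2_INPUT_PER_MILLION = 32.0    # GPT-Realtime-2, audio input tokens
REALTIME2_OUTPUT_PER_MILLION = 64.0   # GPT-Realtime-2, audio output tokens
TRANSLATE_PER_MINUTE = 0.034          # GPT-Realtime-Translate
WHISPER_PER_MINUTE = 0.017            # GPT-Realtime-Whisper


def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate GPT-Realtime-2 audio cost from token counts (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * REALTIME2_INPUT_PER_MILLION
    output_cost = output_tokens / 1_000_000 * REALTIME2_OUTPUT_PER_MILLION
    return input_cost + output_cost


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimate cost for the models that are billed by the minute."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Sample usage figures are made up for illustration.
    print(f"Realtime-2, 1M in + 1M out: ${realtime2_cost(1_000_000, 1_000_000):.2f}")
    print(f"Translate, 60 minutes:      ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
    print(f"Whisper, 60 minutes:        ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
```

At these rates, an hour of live translation costs about twice as much as an hour of live transcription, while GPT-Realtime-2 spend depends on how many audio tokens a conversation consumes rather than on how long it lasts.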

For OpenAI, the launch is another sign that voice is becoming more central to the developer stack.

For developers, the real test comes next: whether these new voice tools save enough time and remove enough friction to make speaking to software feel less like a gimmick and more like a default interface.
