
In recent months, OpenAI has released several new tools focused on text-based agents, including Operator, Deep Research, Computer-Using Agents, and the Responses API. Today, the company announced new speech-to-text and text-to-speech audio models in its API, enabling developers to build more powerful, customizable, and expressive voice agents than ever before.
OpenAI's new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, offer significant improvements in word error rate, language recognition, and accuracy compared to OpenAI's existing Whisper models. These advancements were achieved through reinforcement learning and extensive mid-training using diverse and high-quality audio datasets.

OpenAI claims that these new audio models better understand speech nuances, reduce misrecognitions, and improve transcription reliability even when the input audio involves accents, noisy environments, or varying speech speeds.
gpt-4o-mini-tts is OpenAI's latest text-to-speech model, offering improved steerability: developers can now instruct the model on how to articulate the text content. For now, however, the model is limited to artificial, preset voices.
The gpt-4o-transcribe model costs $6 per million audio input tokens, $2.50 per million text input tokens, and $10 per million text output tokens. gpt-4o-mini-transcribe costs $3 per million audio input tokens, $1.25 per million text input tokens, and $5 per million text output tokens. Finally, gpt-4o-mini-tts costs $0.60 per million text input tokens and $12 per million audio output tokens. This works out to the following approximate per-minute costs:
- gpt-4o-transcribe: ~0.6 cents / minute
- gpt-4o-mini-transcribe: ~0.3 cents / minute
- gpt-4o-mini-tts: ~1.5 cents / minute
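The per-minute figures follow directly from the per-token prices once you assume a token rate for audio. The rates below (~1,000 audio tokens per minute for transcription input, ~1,250 for TTS output) are inferred from the quoted figures, not official numbers:

```python
def cents_per_minute(price_per_million_tokens: float, tokens_per_minute: float) -> float:
    """Convert a $/1M-token price into cents per minute of audio."""
    return price_per_million_tokens / 1_000_000 * tokens_per_minute * 100

print(round(cents_per_minute(6.00, 1_000), 2))   # gpt-4o-transcribe -> 0.6
print(round(cents_per_minute(3.00, 1_000), 2))   # gpt-4o-mini-transcribe -> 0.3
print(round(cents_per_minute(12.00, 1_250), 2))  # gpt-4o-mini-tts -> 1.5
```

Note that the per-minute costs for transcription ignore the much cheaper text input/output tokens, which add little for typical usage.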
The OpenAI team wrote the following regarding these new audio models:
"Looking ahead, we plan to continue to invest in improving the intelligence and accuracy of our audio models and exploring ways to allow developers to bring their own custom voices to build even more personalized experiences in ways that align with our safety standards."
These new audio models are now available to all developers via the API. OpenAI also announced an integration with the Agents SDK, allowing developers to easily build voice agents. For low-latency speech-to-speech experiences, OpenAI recommends using the Realtime API.