Amazon announces Nova Sonic audio model, claiming to outperform OpenAI and Google

Amazon today announced Nova Sonic, a state-of-the-art speech-to-speech model that enables developers to build applications featuring real-time, human-like voice conversations. Amazon claims this new audio model offers industry-leading price performance and low latency.

Typically, building a voice-enabled app means chaining several models together: a speech recognition model to convert speech to text, a large language model to understand the request and generate a response, and a text-to-speech model to convert the text back to audio. This approach is not only complex but also often fails to capture crucial acoustic context and nuances like tone, prosody, and speaking style.
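To illustrate the multi-model approach described above, here is a minimal Python sketch. It is not tied to any vendor's API: the class name `CascadedVoicePipeline` and the three stub functions are hypothetical stand-ins for real speech recognition, language, and text-to-speech services, and the "audio" is just UTF-8 bytes so the example can run on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedVoicePipeline:
    """Hypothetical three-stage pipeline: speech -> text -> text -> speech."""
    transcribe: Callable[[bytes], str]     # speech recognition stage
    generate_reply: Callable[[str], str]   # language model stage
    synthesize: Callable[[str], bytes]     # text-to-speech stage

    def respond(self, audio_in: bytes) -> bytes:
        # After this step only plain text remains, so tone, prosody,
        # and speaking style are no longer available to later stages.
        text_in = self.transcribe(audio_in)
        text_out = self.generate_reply(text_in)
        return self.synthesize(text_out)


# Stub stages (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reply(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedVoicePipeline(fake_transcribe, fake_reply, fake_synthesize)
reply_audio = pipeline.respond(b"hello")
```

Each stage only sees the previous stage's text output, which is where the acoustic context mentioned above gets dropped.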

Nova Sonic addresses this challenge by unifying understanding and audio generation capabilities into a single model. This integrated approach lets the model take the tone and style of the spoken input into account, resulting in more natural dialogue. It can also determine when to respond and better handle interruptions (barge-ins).

Nova Sonic supports both masculine- and feminine-sounding voices in various English accents, including American and British. Developers can access the model through Amazon Bedrock via a bidirectional streaming API, with support for function calling. It also includes built-in protections such as content moderation and watermarking.
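Function calling generally means the model asks the application to run a named tool and return the result. The Python sketch below shows one way an app might dispatch such requests. The request shape (a dict with `name` and `arguments`), the `tool` decorator, and `get_order_status` are all assumptions made for illustration; they are not taken from the Nova Sonic or Amazon Bedrock documentation.

```python
import json
from typing import Any, Callable, Dict

# Registry of tools the application is willing to run (hypothetical).
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so it can be looked up later."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_order_status(order_id: str) -> dict:
    # Placeholder data; a real app would query a database or service here.
    return {"order_id": order_id, "status": "shipped"}


def handle_tool_request(request: dict) -> str:
    """Run the requested tool and return its result as a JSON string.

    `request` is an assumed shape: {"name": str, "arguments": dict}.
    """
    name = request["name"]
    args = request.get("arguments", {})
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(TOOLS[name](**args))


result = handle_tool_request(
    {"name": "get_order_status", "arguments": {"order_id": "A1"}}
)
```

In a streaming setup, the returned JSON string would be sent back to the model so it can continue the spoken response.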

The model details are as follows:

Model: Amazon Nova Sonic

Model ID: amazon.nova-sonic-v1:0

Input Modalities: Speech

Output Modalities: Speech with transcription and text responses

Context Window: 300K context

Max Connection Duration: 8-minute connection timeout, with a maximum of 20 concurrent connections per customer

Supported Languages: English

Regions: US East (N. Virginia)

Bidirectional Stream API Support: Yes

Bedrock Knowledge Bases: Supported through tool use (function calling)
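An application may want to respect the connection limits listed in the model details on the client side rather than wait for the service to refuse or drop a connection. The Python sketch below is one possible guard, using only the standard library. The `SessionLimiter` class is hypothetical, and `session` stands in for whatever coroutine actually drives a voice conversation; only the two numbers come from the details above.

```python
import asyncio
from typing import Any, Awaitable, Callable

MAX_CONCURRENT_CONNECTIONS = 20   # per customer, from the model details
MAX_CONNECTION_SECONDS = 8 * 60   # 8-minute connection timeout


class SessionLimiter:
    """Hypothetical client-side guard for the published limits."""

    def __init__(
        self,
        max_concurrent: int = MAX_CONCURRENT_CONNECTIONS,
        max_seconds: float = MAX_CONNECTION_SECONDS,
    ) -> None:
        self.max_concurrent = max_concurrent
        self.max_seconds = max_seconds
        self.active = 0  # number of sessions currently running

    async def run(self, session: Callable[[], Awaitable[Any]]) -> Any:
        """Run one session, rejecting it if the concurrency cap is reached."""
        if self.active >= self.max_concurrent:
            raise RuntimeError("concurrent connection limit reached")
        self.active += 1
        try:
            # Cancel the session locally once the time limit is exceeded;
            # this raises asyncio.TimeoutError to the caller.
            return await asyncio.wait_for(session(), timeout=self.max_seconds)
        finally:
            self.active -= 1


async def demo_session() -> str:
    # Stand-in for a real voice conversation.
    await asyncio.sleep(0.01)
    return "done"


limiter = SessionLimiter()
outcome = asyncio.run(limiter.run(demo_session))
```

Rejecting extra sessions outright is the simplest policy; an app could instead queue callers until a slot frees up.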

On a related note, last month OpenAI announced next-generation speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, offering significant improvements in word error rate, language recognition, and accuracy compared to its existing Whisper models.
