Amazon announces Nova Sonic audio model, claiming to outperform OpenAI and Google

Amazon today announced Nova Sonic, a state-of-the-art speech-to-speech model that enables developers to build applications featuring real-time, human-like voice conversations. Amazon claims this new audio model offers industry-leading price performance and low latency.

Typically, building a voice-enabled app requires developers to chain multiple models: a speech recognition model to convert speech to text, a large language model to understand the input and generate a response, and a text-to-speech model to convert the text back to audio. This approach is not only complex but also often loses crucial acoustic context and nuances such as tone, prosody, and speaking style.
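To illustrate why the multi-model approach loses acoustic context, here is a minimal Python sketch of such a pipeline. Every name in it (`SpokenInput`, `speech_to_text`, `generate_reply`, `text_to_speech`) is a hypothetical stub rather than a real model or API; the point is only that once speech is reduced to a transcript, later stages cannot see tone or pace.

```python
from dataclasses import dataclass


@dataclass
class SpokenInput:
    """Hypothetical stand-in for an audio clip: the words plus acoustic cues."""
    words: str
    tone: str           # e.g. "calm" or "frustrated"
    speaking_rate: str  # e.g. "slow" or "fast"


def speech_to_text(clip: SpokenInput) -> str:
    # Stage 1 (speech recognition stub): only the transcript crosses this boundary.
    return clip.words


def generate_reply(transcript: str) -> str:
    # Stage 2 (language model stub): sees plain text, with no access to tone or pace.
    return f"You said: {transcript}"


def text_to_speech(reply: str) -> dict:
    # Stage 3 (text-to-speech stub): must pick a delivery without the caller's cues.
    return {"text": reply, "delivery": "neutral"}


def cascaded_pipeline(clip: SpokenInput) -> dict:
    return text_to_speech(generate_reply(speech_to_text(clip)))


calm = SpokenInput("my order is late", tone="calm", speaking_rate="slow")
upset = SpokenInput("my order is late", tone="frustrated", speaking_rate="fast")

# Both inputs collapse to the same transcript, so the pipeline cannot tell them apart.
assert cascaded_pipeline(calm) == cascaded_pipeline(upset)
```

In this toy version, a calm caller and a frustrated one receive the same response, which is the limitation a unified speech-to-speech model is meant to avoid.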

Nova Sonic addresses this challenge by unifying understanding and audio generation capabilities into a single model. This integrated approach allows the model to comprehend the tone and style of spoken input as well as its content, resulting in more natural dialogue. It can also determine the appropriate time to respond and better handle interruptions (barge-ins).

Nova Sonic supports both masculine- and feminine-sounding voices in various English accents, including American and British. Developers can access the model through Amazon Bedrock via a bidirectional streaming API, with support for function calling. It also includes built-in protections such as content moderation and watermarking.
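As a rough illustration of what function calling means on the application side, the sketch below shows a generic tool dispatcher in Python. It does not use the Bedrock API: the event fields (`request_id`, `tool_name`, `arguments`) and the `get_order_status` tool are assumptions made up for this example, and the actual event schema of the streaming API may differ.

```python
import json
from typing import Callable, Dict

# Hypothetical tool registry; tool names and behaviour are invented for illustration.
TOOLS: Dict[str, Callable[..., dict]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("get_order_status")
def get_order_status(order_id: str) -> dict:
    # A real application would query a database or service here.
    return {"order_id": order_id, "status": "shipped"}


def handle_tool_request(event: dict) -> dict:
    """Run the tool the model asked for and build a result message to send back."""
    name = event["tool_name"]
    args = json.loads(event["arguments"])  # arguments assumed to arrive as a JSON string
    if name not in TOOLS:
        result = {"error": f"unknown tool: {name}"}
    else:
        result = TOOLS[name](**args)
    return {"request_id": event["request_id"], "result": json.dumps(result)}
```

The design choice here is to keep tool lookup in a plain dictionary, so that an unknown tool name produces an error result the model can react to instead of an exception that ends the conversation.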

The model details are listed below:

	Amazon Nova Sonic
Model ID	amazon.nova-sonic-v1:0
Input Modalities	Speech
Output Modalities	Speech with transcription and text responses
Context Window	300K context
Max Connection Duration	8-minute connection timeout; maximum of 20 concurrent connections per customer
Supported Languages	English
Regions	US East (N. Virginia)
Bidirectional Stream API Support	Yes
Bedrock Knowledge Bases	Supported through tool use (function calling)
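
The connection limits in the table (an 8-minute timeout and 20 concurrent connections per customer) are something a client application would likely need to plan around. The Python sketch below shows one possible way to do that with `asyncio`; the `ConnectionBudget` class and the `session` callable are hypothetical, and no actual Bedrock call is made.

```python
import asyncio

MAX_CONCURRENT = 20           # from the table: max concurrent connections per customer
MAX_SESSION_SECONDS = 8 * 60  # from the table: 8-minute connection timeout


class ConnectionBudget:
    """Hypothetical helper that caps concurrent sessions and session length."""

    def __init__(self, max_concurrent=MAX_CONCURRENT, max_seconds=MAX_SESSION_SECONDS):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_seconds = max_seconds

    async def run(self, session) -> bool:
        """Run session() in a free slot; return False if it hit the time limit."""
        async with self._slots:  # wait here if every slot is already in use
            try:
                await asyncio.wait_for(session(), timeout=self._max_seconds)
                return True
            except asyncio.TimeoutError:
                # The session was cancelled at the limit; the caller may reconnect.
                return False
```

In a real client, `session` would be a coroutine function that opens the streaming connection and exchanges audio. A caller could treat a `False` return as a signal to start a fresh connection and carry the conversation over.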

On a related note, last month OpenAI announced next-generation speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, offering significant improvements in word error rate, language recognition, and accuracy compared to its existing Whisper models.