Microsoft announces new HD voices with improved expressiveness in Azure AI Speech

Last year, Microsoft announced super-realistic AI voices optimized for conversational scenarios such as chatbots, voice assistants, and gaming. Developers can use these neural text-to-speech (TTS) voices in their applications through the Azure Speech SDK or REST API. Over the past several months, Microsoft has continued to add neural TTS voices, and it now offers over 500 of them in more than 140 languages and locales.
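For readers who have not used the REST route, the sketch below shows roughly what a synthesis request looks like. It only builds the URL, headers, and SSML body and does not send anything. The endpoint pattern and header names reflect our understanding of Azure's text-to-speech REST API and should be checked against Microsoft's documentation; the helper name `build_tts_request`, the region `eastus`, the voice `en-US-JennyNeural`, and the output format are illustrative assumptions, not details from Microsoft's announcement.

```python
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(region, subscription_key, voice_name, text, lang="en-US"):
    """Build (url, headers, body) for a text-to-speech REST call.

    Hypothetical helper for illustration only: it prepares the request
    but never sends it.
    """
    # Assumed endpoint pattern: one TTS host per Azure region.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        # Assumed output format; other formats are available.
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "tts-request-sketch",
    }
    # SSML wraps the text in a <voice> element naming the voice to use.
    # escape/quoteattr keep characters like & and < from breaking the XML.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )
    return url, headers, ssml.encode("utf-8")


if __name__ == "__main__":
    url, headers, body = build_tts_request(
        "eastus", "YOUR_KEY", "en-US-JennyNeural", "Hello from Azure AI Speech."
    )
    print(url)
    print(body.decode("utf-8"))
```

The resulting tuple could be passed to any HTTP client as a POST request; a real subscription key is needed before the service will return audio.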

Today, Microsoft announced a new and improved HD version of its neural text-to-speech service for select voices. The new HD voices improve overall expressiveness by detecting emotion from the context of the input text. Microsoft says the latest HD voices are based on auto-regressive transformer language models and speak in the timbre of the selected platform voice. They offer the following advantages:

Human-like speech generation: The new model interprets the input text and its underlying sentiment, then adjusts the speaking tone in real time to match the emotion conveyed.
Conversational: The new model can produce spontaneous pauses and emphasis. Microsoft says that, given conversational text, it can also reproduce features of natural speech such as pauses and filler words.
Prosody variations: The new HD voice system improves realism by introducing slight variations in each output, making the speech sound more natural. In practice, no two renderings of a sentence will sound exactly alike.

Garfield He, Cognitive Services Speech program manager at Microsoft, said the following regarding the HD voice launch:

"With innovative technology that uses acoustic and linguistic features to generate speech filled with rich, natural variations, it can adeptly detect emotional cues in the text and autonomously adjust the voice's tone and style. With this upgrade, you can expect a more human-like speech pattern characterized by improved intonation, rhythm, and emotion."

You can check out sample audio content generated using this HD voice model in the video below.

The new HD voices are available in preview for developers in three regions: East US, West Europe, and Southeast Asia. The cost for HD voices will be $30 per 1 million characters.
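To put the pricing in perspective, here is a minimal cost estimate based on the $30 per 1 million characters figure above. It assumes strictly linear billing with no free tier, minimum charge, or rounding rules, none of which are described in the announcement; the function name is hypothetical.

```python
# Preview price quoted in the article: $30 per 1 million characters.
HD_VOICE_USD_PER_MILLION_CHARS = 30.0


def estimate_hd_voice_cost(char_count: int) -> float:
    """Estimate the USD cost of synthesizing char_count characters.

    Assumes simple linear billing; actual invoices may differ.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count / 1_000_000 * HD_VOICE_USD_PER_MILLION_CHARS


if __name__ == "__main__":
    # A 250,000-character script under the assumptions above.
    print(f"${estimate_hd_voice_cost(250_000):.2f}")  # prints $7.50
```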

Source: Microsoft