Microsoft announces general availability of Text to Speech Avatar in Azure AI Speech Service

Azure AI Speech service allows developers to build voice-enabled, multilingual, generative AI apps with support for natural-sounding voices. The new Text to Speech Avatar feature in Azure AI Speech service can convert simple text into a video of a photorealistic human speaking with a natural-sounding voice. Developers can use any of the prebuilt avatars available as part of this service or create their own custom avatars.

Today, Microsoft announced the general availability of Text to Speech Avatar. This new capability enables developers to create personalized and engaging content for their users. The service outputs video at 1920 x 1080 resolution and 25 frames per second (FPS).

Text to Speech Avatar comes with the following capabilities:

Converts text into a digital video of a photorealistic human speaking with natural-sounding voices powered by Azure AI text to speech.

Provides a collection of prebuilt avatars.

Generates the avatar's voice with Azure AI text to speech.

Synthesizes text to speech avatar video either asynchronously with the batch synthesis API or in real time.

Provides a content creation tool in Speech Studio for creating video content without coding.

Enables real-time avatar conversations through the live chat avatar tool in Speech Studio.

Pricing for the Text to Speech Avatar service is based on the length of the output video and is billed per second. Text-to-speech, speech-to-text, Azure OpenAI, and any other Azure services used as part of a Text to Speech Avatar solution are charged separately.

The service is now available in the following Azure regions: Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, and West US 2.
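To make the per-second billing model concrete, here is a minimal Python sketch that estimates the avatar portion of a bill from the video length. The rate used is a hypothetical placeholder, not an actual Azure price, and the function names are made up for illustration. The sketch also assumes whole seconds, since the announcement does not say how fractional seconds are rounded. Charges for text-to-speech, Azure OpenAI, or other services are not included.

```python
from decimal import Decimal

# Hypothetical rate for illustration only; check the Azure pricing page
# for the real per-second price in your region.
ASSUMED_RATE_PER_SECOND = Decimal("0.01")

# Output frame rate stated in the announcement.
FRAMES_PER_SECOND = 25


def estimate_avatar_cost(video_seconds: int,
                         rate_per_second: Decimal = ASSUMED_RATE_PER_SECOND) -> Decimal:
    """Estimate the avatar video charge: output length times per-second rate."""
    if video_seconds < 0:
        raise ValueError("video_seconds must be non-negative")
    return rate_per_second * video_seconds


def frame_count(video_seconds: int) -> int:
    """Number of frames rendered for a video of the given length at 25 FPS."""
    if video_seconds < 0:
        raise ValueError("video_seconds must be non-negative")
    return video_seconds * FRAMES_PER_SECOND


if __name__ == "__main__":
    seconds = 90  # a 1.5-minute avatar video
    print(f"Frames rendered: {frame_count(seconds)}")
    print(f"Estimated avatar charge (assumed rate): {estimate_avatar_cost(seconds)}")
```

Using Decimal rather than float keeps the currency arithmetic exact. To get a real estimate, pass the published rate as rate_per_second and add the separate charges for any other Azure services in the solution.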

You can learn more about the Text to Speech Avatar service here.