OpenAI can clone your voice with little to no input, and it wants to talk about it with you

OpenAI is facing a big dilemma. It has created a powerful AI model for voice cloning, but the model performs so well that the company fears its enormous potential for misuse.

That’s why OpenAI is hesitant to release the model to the public. Instead, the company has presented only a preview that demonstrates the abilities of the model, called Voice Engine. And it is indeed impressive.

The essentials of AI-based voice cloning technology are very simple. The model needs just two things: an audio sample of the original voice and a text that the synthetic voice is supposed to read. Feed the tool enough samples and the result will, hopefully, sound realistic enough.

This is where things get interesting, and a bit scary. Unlike other models that are already publicly available, Voice Engine needs just a single 15-second audio sample of the original speaker. Despite the very limited input, the resulting speech is incredibly realistic.

This is a cloned voice. OpenAI's #VoiceEngine can create these convincingly realistic speech expressions using as little as 15 seconds of audio sample: pic.twitter.com/9LIGhBA3sL
— Martin Hodás (@Hody_MH11) March 31, 2024

That is also exactly why OpenAI is taking its time to decide what to do next, citing its commitment to developing safe and broadly beneficial AI. Such a powerful tool could become a mighty weapon in the hands of malicious actors, especially as part of disinformation campaigns.

Voice Engine was first developed in late 2022. Since then, it has been used to power the preset voices available in the text-to-speech API, as well as ChatGPT Voice and Read Aloud. Late last year, OpenAI started privately testing its voice-cloning abilities with a small group of trusted partners. The company says it has been impressed by the applications this group has developed.

One purpose of these tests is to figure out how people and various industries can benefit from the technology. The other is to identify its potential for misuse and decide what steps to take:

“At the same time, we are taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse. We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities. Based on these conversations and the results of these small scale tests, we will make a more informed decision about whether and how to deploy this technology at scale.”

OpenAI thinks that a wider release of this technology should go hand in hand with policies and countermeasures to prevent its misuse. For instance, original speakers should add their voices to the service knowingly, and the service should be able to verify that they did. Services should also maintain “a no-go list” of celebrities, politicians, and other prominent figures whose voices may not be re-created.
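To make these two safeguards concrete, here is a minimal Python sketch of how a voice-cloning service might gate a request. It is an illustration only, not OpenAI's implementation: the `CloneRequest` type, the `may_clone` function, the `consent_verified` flag, and the placeholder names on the list are all hypothetical, and the sketch assumes that consent is verified by some separate step.

```python
from dataclasses import dataclass

# Hypothetical no-go list. The entries are placeholders, not a real policy.
NO_GO_LIST = frozenset({"example politician", "example celebrity"})


@dataclass
class CloneRequest:
    speaker_name: str
    # Assumed to be set by a separate consent-verification step.
    consent_verified: bool


def may_clone(request: CloneRequest, no_go_list=NO_GO_LIST) -> tuple[bool, str]:
    """Return (allowed, reason) for a voice-cloning request."""
    # Normalize the name so the lookup ignores case and stray whitespace.
    name = request.speaker_name.strip().casefold()
    if name in no_go_list:
        return False, "speaker is on the no-go list"
    if not request.consent_verified:
        return False, "speaker consent has not been verified"
    return True, "ok"
```

In this sketch the no-go check runs first, so a listed public figure is refused even if a consent flag is somehow set. A real service would also need a far more robust way to match identities than comparing names.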

OpenAI hopes the Voice Engine demonstration will spark a public conversation. The company encourages the following steps to mitigate possible problems:

Phasing out voice-based authentication as a security measure for accessing bank accounts and other sensitive information

Exploring policies to protect the use of individuals’ voices in AI

Educating the public in understanding the capabilities and limitations of AI technologies, including the possibility of deceptive AI content

Accelerating the development and adoption of techniques for tracking the origin of audiovisual content, so it’s always clear when you’re interacting with a real person or with an AI

It is worth mentioning that OpenAI’s model wouldn’t be the only publicly available voice cloning tool. Currently, the most popular one is ElevenLabs. However, even with enough audio samples, its results are not always convincing.

It seems that Voice Engine would be a big step forward in both ease of use and the resulting quality of the cloned voice.