Microsoft Research Asia has released a new paper introducing VASA, a framework for generating lifelike talking faces. The researchers presented a model, dubbed VASA-1, which can generate realistic videos from just a single static image and a speech audio clip. The full paper is available on arXiv.
The results are impressive, surpassing previous generative-AI tools for producing realistic deepfakes.
What is particularly interesting about VASA-1 is its ability to emulate natural facial expressions and a wide range of emotions, and to lip-sync with very few artifacts.
The researchers admit that the model – like all comparable models – still struggles with non-rigid elements such as hair. Even in this area, however, it performs above average, mitigating one of the telltale signs used to spot a deepfake video.
New VASA-1 model by Microsoft Research Asia. Impressive lip-sync and natural face expression.
— Martin Hodás (@Hody_MH11) April 18, 2024
There are still visible artifacts, however, to the point where many regular ppl with little awareness about the state of AI technology could no longer tell if it is fake... pic.twitter.com/Qxi8qdHNXd
The technical cornerstone, Microsoft says, is an innovative holistic facial dynamics and head movement generation model that works in an expressive and disentangled face latent space. VASA-1 also offers real-time efficiency:
“Our method generates video frames of 512 × 512 size at 45fps in the offline batch processing mode, and can support up to 40fps in the online streaming mode with a preceding latency of only 170ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.”
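To make those figures and the disentangled-latent idea concrete, here is a minimal sketch of the overall flow as the paper describes it: encode the photo once into a disentangled latent, then generate facial dynamics and head motion per audio chunk within the real-time frame budget. VASA-1's code is unreleased, so every function and name below is a hypothetical stand-in, not Microsoft's API:

```python
import time

# Purely illustrative sketch of the pipeline the paper describes; all
# functions below are invented stand-ins, since no code has been released.

def encode_face(image):
    # One-time step: map the single photo into a disentangled latent, so
    # identity/appearance stay fixed while only motion varies per frame.
    return {"identity": f"id({image})", "appearance": f"app({image})"}

def generate_dynamics(audio_chunk, latent):
    # Holistic step: facial dynamics and head movement are generated jointly
    # in the latent space, conditioned on the audio.
    return f"motion({audio_chunk})"

def decode_frame(latent, motion):
    # Render a 512x512 frame from the fixed latent plus the generated motion.
    return f"frame<{latent['identity']}+{motion}>"

def stream(image, audio_chunks, fps=40):
    budget = 1.0 / fps           # ~25 ms per frame to sustain 40 fps online
    latent = encode_face(image)  # encoded once, reused for every frame
    for chunk in audio_chunks:
        start = time.perf_counter()
        frame = decode_frame(latent, generate_dynamics(chunk, latent))
        assert time.perf_counter() - start <= budget  # real-time constraint
        yield frame

for f in stream("portrait.jpg", ["audio_0", "audio_1"]):
    print(f)
```

The point of the sketch is the split the paper emphasizes: the expensive identity encoding happens once, and only the lightweight motion generation and decoding must fit inside the per-frame budget.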
The tool built on the new model is very easy to use and even allows users to control “optional signals as condition,” meaning they can set the main eye gaze direction, head distance, and emotion offsets (a hypothetical illustration follows the clip below):
One of the pros of VASA-1 is the ease of use. Watch this real-time demonstration: pic.twitter.com/QvHnpHVx8e
— Martin Hodás (@Hody_MH11) April 18, 2024
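For illustration, a conditioned generation call could look something like the sketch below. Microsoft has published no API, so the function, parameter names, and value ranges here are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class Conditioning:
    # Hypothetical stand-ins for the paper's optional control signals.
    gaze_direction: tuple = (0.0, 0.0)  # assumed (yaw, pitch) of main eye gaze
    head_distance: float = 1.0          # assumed relative distance to camera
    emotion_offset: str = "neutral"     # e.g. "happy", "surprised"

def generate_talking_face(image_path, audio_path, cond=Conditioning()):
    # Placeholder only: this shows how the optional signals would plug into
    # a single-image + audio pipeline, not how VASA-1 is actually invoked.
    print(f"Animating {image_path} with {audio_path}: "
          f"gaze={cond.gaze_direction}, distance={cond.head_distance}, "
          f"emotion={cond.emotion_offset}")

generate_talking_face("portrait.jpg", "speech.wav",
                      Conditioning(gaze_direction=(0.2, -0.1),
                                   emotion_offset="happy"))
```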
VASA-1 also handles non-realistic inputs such as artwork, meaning it can essentially bring paintings to life.
The model can also make photos sing, rap, or speak in languages other than English. As one example, Microsoft presented a hilarious clip of the Mona Lisa rapping:
Rapping Mona Lisa. Not sure I wanted to see this... pic.twitter.com/1B8sgm5qQ9
— Martin Hodás (@Hody_MH11) April 18, 2024
It is important to emphasize the potential harm that such technology could cause when used to generate content imitating actual people – not just politicians and celebrities, but also regular citizens. The good news is that Microsoft’s researchers are aware of the risk:
“We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”
Microsoft acknowledges the possibility of misuse. However, it also highlights the potential benefits of the technology, ranging from enhancing educational equity and improving accessibility for individuals with communication challenges to offering companionship or therapeutic support to those in need.
It is worth mentioning that Microsoft’s competitor OpenAI faces a similar dilemma. Just recently, it presented a powerful voice-cloning AI model but opted not to release it publicly. The company says that any wider release of the technology should go hand in hand with policies and countermeasures to prevent misuse.