Meta today announced MovieGen, a family of media foundation AI models that can generate realistic videos with sound from text prompts. The family includes two primary models: MovieGen Video and MovieGen Audio.
MovieGen Video is a 30-billion-parameter transformer model that can generate high-quality, high-definition images and videos from a single text prompt. The generated videos can be up to 16 seconds long at 16 frames per second.
MovieGen Audio is a 13-billion-parameter transformer model that takes a video input, along with optional text prompts, and generates high-fidelity audio up to 45 seconds long that syncs with the input video. The model can produce ambient sound, instrumental background music, and Foley sound effects. Meta claims it delivers state-of-the-art results in audio quality, video-to-audio alignment, and text-to-audio alignment.
These models are not just for creating brand-new videos; they can also edit existing videos using simple text prompts. MovieGen supports localized edits, such as adding, removing, or replacing elements, as well as global changes like altering the background or style. For example, with a simple text prompt, you can change a video of someone throwing a ball so that the person throws a watermelon instead, while preserving the rest of the original content.
MovieGen models will also allow users to create personalized videos: given an image of a person and a text prompt, they can generate videos that preserve that person's identity and motion. Meta claims these models deliver state-of-the-art results in character preservation and natural movement.
Meta claims these models create better videos than competing video generation models, including OpenAI's Sora and Runway's Gen-3. Meta is now working with creative professionals to further improve the models ahead of a public release.
Source: Meta