
Tencent releases HunyuanVideo, a state-of-the-art open-source video generation model

Tencent HunyuanVideo

Earlier this year, OpenAI unveiled Sora, a video generation AI model that can create realistic and imaginative scenes from text prompts. While OpenAI has delayed Sora's public launch, several AI startups, including Runway and Luma, have released their own video generation models over the past few months.

Today, Chinese tech giant Tencent announced HunyuanVideo, a state-of-the-art video generation model that is also open source. It is the first major open-source video generation model to make both the inference code and the model weights freely available to everyone.

Happy to share that our team at Tencent open-sources a 13B parameter video generation model

Web Page: https://t.co/v6qQprYFUJ
GitHub: https://t.co/fSaO8gMT4W pic.twitter.com/ZHjzwnz9fw

— chenyangqi (@chenyangqi1) December 3, 2024

Tencent claims that HunyuanVideo can generate videos that are comparable to leading closed-source models with high visual quality, motion diversity, text-video alignment, and generation stability. With over 13 billion parameters, it is the largest among all open-source video generation models. HunyuanVideo includes a framework that integrates data curation, image-video joint model training, and an efficient infrastructure to support large-scale model training and inference.

Tencent also tested the model using professional human evaluation. According to those results, HunyuanVideo outperforms leading closed-source state-of-the-art models, including Runway Gen-3 and Luma 1.6.

HunyuanVideo

Instead of using separate models for text, image, and video generation, Tencent took a different approach to achieve better video quality than existing models:

HunyuanVideo introduces the Transformer design and employs a Full Attention mechanism for unified image and video generation. Specifically, we use a "Dual-stream to Single-stream" hybrid model design for video generation. In the dual-stream phase, video and text tokens are processed independently through multiple Transformer blocks, enabling each modality to learn its own appropriate modulation mechanisms without interference. In the single-stream phase, we concatenate the video and text tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.
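To make the quoted design a little more concrete, here is a minimal PyTorch sketch of the idea: separate Transformer blocks per modality in the dual-stream phase, then concatenation and shared blocks with full attention in the single-stream phase. The class name, layer counts, and dimensions are invented for illustration and omit most of the real model's machinery (the 3D VAE latent space, timestep modulation, and the 13B-parameter scale), so treat it as a reading of the description above rather than the actual HunyuanVideo implementation.

```python
import torch
import torch.nn as nn

class DualToSingleStreamSketch(nn.Module):
    """Illustrative 'dual-stream to single-stream' Transformer sketch.

    Hypothetical simplification of the design described in Tencent's report;
    it is not the HunyuanVideo code.
    """

    def __init__(self, dim=512, heads=8, dual_layers=2, single_layers=2):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Dual-stream phase: each modality gets its own Transformer blocks,
        # so video and text learn their own modulation without interference.
        self.video_stream = nn.ModuleList(block() for _ in range(dual_layers))
        self.text_stream = nn.ModuleList(block() for _ in range(dual_layers))
        # Single-stream phase: shared blocks apply full attention over the
        # concatenated video + text token sequence for multimodal fusion.
        self.fused_stream = nn.ModuleList(block() for _ in range(single_layers))

    def forward(self, video_tokens, text_tokens):
        # Process each modality independently first.
        for blk in self.video_stream:
            video_tokens = blk(video_tokens)
        for blk in self.text_stream:
            text_tokens = blk(text_tokens)
        # Concatenate along the sequence axis and fuse with full attention,
        # letting visual and semantic tokens interact directly.
        fused = torch.cat([video_tokens, text_tokens], dim=1)
        for blk in self.fused_stream:
            fused = blk(fused)
        # Return only the video positions (the part that would be decoded to frames).
        return fused[:, :video_tokens.shape[1]]

# Example: a batch of 2 with 64 video tokens and 16 text tokens, dim 512.
model = DualToSingleStreamSketch()
out = model(torch.randn(2, 64, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 64, 512])
```

The key design choice this sketch tries to capture is that cross-modal interaction is deferred: each stream builds its own representation before the concatenated sequence is exposed to full attention, which is what the report credits for better text-video alignment.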

HunyuanVideo's release marks a significant step towards democratizing AI video generation. With its code and weights openly available, HunyuanVideo could reshape the AI video generation ecosystem. You can learn more about the model here.
