Alibaba's Qwen2-VL model achieves state-of-the-art performance in several AI benchmarks

Alibaba has announced the release of the Qwen2-VL family of vision language models built upon Qwen-2. The Qwen2-VL family includes three models: Qwen2-VL-72B, Qwen2-VL-2B, and Qwen2-VL-7B. Alibaba is releasing the Qwen2-VL-2B and Qwen2-VL-7B models under the Apache 2.0 license. The most powerful Qwen2-VL-72B model is accessible via the official API.

Alibaba claims that Qwen2-VL-72B achieves state-of-the-art performance on several visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA. As you can observe from the table below, Qwen2-VL-72B beats OpenAI's GPT-4o-0513 and Claude 3.5 Sonnet in most benchmarks, and it also achieves state-of-the-art performance on many. This is the first time an open-source model is achieving such benchmark numbers that even beat the closed-source ones.

Alibaba claims that Qwen2-VL can understand videos over 20 minutes and deliver high-quality video-based question answering. Since Qwen2-VL supports complex reasoning and decision-making, it can be integrated into a wide variety of AI applications. Apart from English and Chinese, Qwen2-VL now supports most European languages, Japanese, Korean, Arabic, and Vietnamese, making it suitable for multilingual scenarios.

The smaller Qwen2-VL-7B model beats OpenAI GPT-4o mini in most benchmarks. Even this 7B parameter model supports image, multi-image, and video inputs. According to benchmarks, the Qwen2-VL-7B model performs better in document understanding tasks such as DocVQA and MTVQA. The smallest Qwen2-VL-2B model is targeted towards smartphone deployment, and it delivers strong performance in image, video, and multilingual comprehension.

The open-source Qwen2-VL-7B and Qwen2-VL-2B models are integrated with Hugging Face Transformers, vLLM, and other third-party frameworks.

The Qwen team mentioned the following in the Qwen2-VL announcement about their future plans:

We look forward to your feedback and the innovative applications you will build with Qwen2-VL. In the near future, we are going to build stronger vision language models upon our next-version language models and endeavor to integrate more modalities towards an omni model!

With its impressive performance and open-source availability, the Qwen2-VL family has the potential to significantly advance research and development in the field of vision language models, enabling new and innovative AI applications across various domains.