Alibaba's Qwen2-VL model achieves state-of-the-art performance in several AI benchmarks

Alibaba has announced the release of the Qwen2-VL family of vision language models, built upon the Qwen2 language models. The family includes three models: Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2-VL-72B. Alibaba is releasing the Qwen2-VL-2B and Qwen2-VL-7B models under the Apache 2.0 license, while the most powerful model, Qwen2-VL-72B, is accessible only through the official API.
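For developers who want to try the hosted 72B model, access goes through Alibaba Cloud's API. As a rough illustration, the sketch below assumes an OpenAI-compatible endpoint and a placeholder model name; the actual base URL, model identifier, and authentication details should be taken from Alibaba's documentation.

```python
# A minimal sketch of querying the hosted Qwen2-VL-72B model through an
# OpenAI-compatible endpoint. The base URL and model name below are
# assumptions for illustration, not confirmed values from Alibaba.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="qwen-vl-max",  # assumed identifier for the Qwen2-VL-72B tier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},  # placeholder image
            {"type": "text", "text": "Summarize the trend shown in this chart."},
        ],
    }],
)
print(response.choices[0].message.content)
```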

Alibaba claims that Qwen2-VL-72B achieves state-of-the-art performance on several visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA. According to the benchmark table Alibaba published, Qwen2-VL-72B beats OpenAI's GPT-4o-0513 and Anthropic's Claude 3.5 Sonnet on most of these benchmarks. This marks the first time an open-source model family has posted benchmark numbers that beat even the closed-source ones.

Alibaba claims that Qwen2-VL can understand videos longer than 20 minutes and deliver high-quality video-based question answering. Since Qwen2-VL supports complex reasoning and decision-making, it can be integrated into a wide variety of AI applications. Beyond English and Chinese, Qwen2-VL now supports most European languages as well as Japanese, Korean, Arabic, and Vietnamese, making it suitable for multilingual scenarios.

The smaller Qwen2-VL-7B model beats OpenAI's GPT-4o mini in most benchmarks. Despite its size, the 7B model supports image, multi-image, and video inputs. According to the benchmarks, Qwen2-VL-7B is particularly strong in document understanding tasks such as DocVQA and in multilingual text understanding as measured by MTVQA. The smallest model, Qwen2-VL-2B, is targeted at smartphone deployment and delivers strong performance in image, video, and multilingual comprehension.

The open-source Qwen2-VL-7B and Qwen2-VL-2B models are integrated with Hugging Face Transformers, vLLM, and other third-party frameworks.
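For reference, here is a minimal sketch of running the 7B model for image question answering with Hugging Face Transformers. It assumes a transformers release that ships the Qwen2-VL classes; the image URL and question are placeholders.

```python
# A minimal sketch of image question answering with the open Qwen2-VL-7B
# model via Hugging Face Transformers; assumes a transformers version that
# includes Qwen2VLForConditionalGeneration. The image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

image = Image.open(
    requests.get("https://example.com/document.png", stream=True).raw  # placeholder
)

# Build a chat-style prompt with one image and one text question
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the title of this document?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```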

In the Qwen2-VL announcement, the Qwen team said the following about its future plans:

We look forward to your feedback and the innovative applications you will build with Qwen2-VL. In the near future, we are going to build stronger vision language models upon our next-version language models and endeavor to integrate more modalities towards an omni model!

With its impressive performance and open-source availability, the Qwen2-VL family has the potential to significantly advance research and development in the field of vision language models, enabling new and innovative AI applications across various domains.
