Today, OpenAI introduced GPT-4o mini, its most cost-efficient small model. Despite being more than 60% cheaper than GPT-3.5 Turbo, GPT-4o mini scored 82% on the MMLU benchmark and currently outperforms GPT-4 on chat preferences on the LMSYS leaderboard. Additionally, GPT-4o mini surpasses both Gemini 1.5 Flash and Claude 3 Haiku on several benchmarks of textual intelligence and multimodal reasoning.
Let's take a closer look at the benchmark scores of the new GPT-4o mini.
On reasoning tasks involving both text and vision, GPT-4o mini surpasses other small models, scoring 82.0% on MMLU. In mathematical reasoning, GPT-4o mini scored 87.0% on the MGSM benchmark, compared to 75.5% for Gemini Flash and 71.7% for Claude Haiku.
In coding performance, GPT-4o mini scored 87.2% on HumanEval, compared to 71.5% for Gemini Flash and 75.9% for Claude Haiku. In multimodal reasoning, GPT-4o mini scored 59.4% on MMMU, compared to 56.1% for Gemini Flash and 50.2% for Claude Haiku. On the MathVista benchmark alone, Gemini 1.5 Flash outperforms GPT-4o mini by about 3%.
Beyond these benchmarks, GPT-4o mini delivers strong performance in function calling, which lets developers build applications that fetch data from, or take actions in, external systems. It also offers improved long-context performance compared to GPT-3.5 Turbo.
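To illustrate how function calling works in practice, here is a minimal sketch using the OpenAI Python SDK. The get_current_weather tool and its parameter schema are hypothetical examples for illustration, not part of OpenAI's announcement:

```python
# Minimal function-calling sketch with the OpenAI Python SDK.
# The tool name and schema below are hypothetical examples.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe the external function the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"}
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, its arguments arrive as JSON
# for the application to parse and execute against the real system.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, json.loads(tool_call.function.arguments))
```

In a complete application, the parsed arguments would be passed to the actual external service, and its result appended to the conversation so the model can compose a final answer.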
The OpenAI team wrote the following regarding the GPT-4o mini launch:
We envision a future where models become seamlessly integrated in every app and on every website. GPT-4o mini is paving the way for developers to build and scale powerful AI applications more efficiently and affordably. The future of AI is becoming more accessible, reliable, and embedded in our daily digital experiences, and we’re excited to continue to lead the way.
GPT-4o mini's strong performance across multiple benchmarks demonstrates OpenAI's commitment to pushing the boundaries of AI capabilities while making them accessible to a wider audience.
Source: OpenAI