Chatbot Arena is an open platform for crowd-sourced AI benchmarking. Over the past two years, OpenAI's models have remained at the top of most AI benchmarks. In some categories, Google's Gemini models and Anthropic's Claude models have posted better results than OpenAI's models, but overall, OpenAI's lead remained unchallenged.
Today, Chatbot Arena revealed a new experimental model from Google called Gemini-Exp-1114. This new Gemini (Exp 1114) model was tested with over 6,000 community votes over the past week, and it now ranks joint No. 1 alongside OpenAI's ChatGPT-4o-latest (2024-09-03). Compared to the previous Gemini model, the overall Arena score increased from 1301 to 1344. Notably, the new model's score even beats OpenAI's o1-preview model.
According to Chatbot Arena, Gemini-Exp-1114 now ranks No. 1 on the Vision leaderboard. It also ranks No. 1 in the following categories:
- Math
- Creative Writing
- Longer Query
- Instruction Following
- Multi-turn
- Hard Prompts
The new model ranks No. 3 in the Coding and Hard Prompts with Style Control categories, both of which OpenAI's o1-preview model leads. In the overall win-rate heatmap against comparable models, Gemini wins 50% of head-to-head votes vs. GPT-4o-latest, 56% vs. o1-preview, and 62% vs. Claude-3.5-Sonnet.
In September, Google released the updated Gemini 1.5 series models, which delivered a ~7% increase on MMLU-Pro, a ~20% improvement on the MATH and HiddenMath benchmarks, and ~2-7% improvements in vision and code use cases. The overall helpfulness of the model's responses was also improved. Google claims the updated models respond in a more concise style, with default output lengths ~5-20% shorter than those of previous models.
You can find the full results of the new Gemini experimental (Gemini-Exp-1114) model here. Developers can try out this model in Google AI Studio right now, and it will soon be available through the API as well.
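For developers who want to experiment ahead of the API rollout, the sketch below shows how an experimental Gemini model is typically called with Google's `google-generativeai` Python SDK. The model identifier `gemini-exp-1114` is an assumption based on the Arena listing; confirm the exact name in Google AI Studio once API access is live.

```python
# Minimal sketch: calling an experimental Gemini model via the
# google-generativeai Python SDK. The model name "gemini-exp-1114"
# is assumed from the Arena listing and may differ in the API.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # API key from Google AI Studio

model = genai.GenerativeModel("gemini-exp-1114")
response = model.generate_content("Summarize the rules of chess in three sentences.")
print(response.text)
```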