Microsoft reveals Windows Agent Arena to benchmark generative AI agents

The use of generative AI and large language models to automate and simplify tasks for people who work with PCs continues to grow. However, there's also a need to measure how well AI agents actually accomplish those tasks. This week, Microsoft Research announced it has developed a benchmark specifically to test AI agents on Windows PCs.

The benchmark, as revealed on Microsoft's GitHub page, is called Windows Agent Arena. This framework is designed to test how well and how quickly AI agents can interact with the Windows applications that people commonly use. The apps tested with AI agents in Windows Agent Arena included web browsers like Microsoft Edge and Google Chrome, OS functions like File Explorer and Settings, coding apps like Visual Studio Code, simple preinstalled Windows apps like Notepad, Clock, and Paint, and even the VLC media player for watching videos.

Microsoft stated:

We adapt the OSWorld framework to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is also scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes.

Microsoft Research also created its own multi-modal agent, called Navi, and ran it through the Windows Agent Arena benchmark. Navi was asked to perform tasks based on text prompts such as, "Can you turn the website I am looking at into a PDF file and put it on my main screen, you know, the Desktop?" The researchers found that Navi had an average success rate of 19.5 percent, which is still quite low compared to the human success rate of 74.5 percent.
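To make those percentages concrete, here is a minimal Python sketch of how a success rate of this kind could be computed. It assumes each task is scored as a simple pass or fail, which is an assumption for illustration and not a description of Microsoft's actual scoring code. The function name `success_rate` and the toy outcomes are hypothetical; only the 19.5 and 74.5 figures come from Microsoft's reported results.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of tasks marked successful.

    Assumes each task is scored pass/fail (True/False). This is an
    illustrative assumption, not the benchmark's own scoring code.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes given")
    return 100.0 * sum(results) / len(results)


# Hypothetical toy outcomes for five tasks; not real benchmark data.
toy_outcomes = [True, False, False, False, True]
toy_rate = success_rate(toy_outcomes)

# Figures reported by Microsoft Research, in percent.
NAVI_RATE = 19.5
HUMAN_RATE = 74.5

# Gap between human and agent performance, in percentage points.
gap_points = HUMAN_RATE - NAVI_RATE

print(f"Toy success rate: {toy_rate:.1f}%")
print(f"Human vs. Navi gap: {gap_points:.1f} percentage points")
```

Going by the reported figures, Navi trails human performance by 55 percentage points.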

A benchmark like Windows Agent Arena could be a significant help in developing AI agents, giving researchers a consistent way to measure improvements as agents move closer to human-level performance.

Microsoft's team also worked with researchers at Carnegie Mellon University and Columbia University on the project. You can check out the full paper on GitHub, along with the benchmark's code.