How an 8B Model Beat an Industry Giant

Jim Griffin November 7, 2024

This video describes how a system called ‘AgentStore’ took the top spot on a benchmark for AI agents, beating a much larger model with a small one.

AgentStore is a platform and method for aggregating specialized agents that perform real-world tasks on devices running macOS, Windows, and Ubuntu. In that system, a meta agent selects the best resource (or combination of resources) for each user request. The top benchmark score was achieved using a small 8B model, outperforming industry heavyweight Claude 3.5 Sonnet.

The testing was done on OSWorld, an environment for benchmarking agents on 369 different computer tasks. The tasks involve popular web and desktop workflows that span multiple applications, ranging from Google Chrome and Microsoft Office to Thunderbird and PDF. The video describes some of the tasks that make up this difficult benchmark. Testing was also done on APPAgent, a similar benchmark for mobile applications.

The video reviews the test results and the capabilities of the agents, as well as the overall system design. That design includes a special class of token that identifies what each agent can do. The meta agent uses the information in those tokens to pick the most suitable resource for each task.
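The routing idea can be illustrated with a toy sketch. This is not AgentStore's actual mechanism, which the video covers. It only shows the general shape of a meta agent choosing among agents from a compact description of what each can do. The agent names, the keyword sets standing in for capability tokens, and the `score` and `route` helpers are all hypothetical, and plain keyword overlap is swapped in for whatever the real tokens encode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    # Hypothetical stand-in for a capability token: a set of keywords.
    capabilities: frozenset


# Hypothetical agent pool; names and keywords are invented for illustration.
AGENTS = [
    Agent("sheet_agent", frozenset({"spreadsheet", "cell", "formula", "chart"})),
    Agent("browser_agent", frozenset({"chrome", "tab", "bookmark", "website"})),
    Agent("mail_agent", frozenset({"thunderbird", "email", "inbox", "attachment"})),
]


def score(task: str, agent: Agent) -> int:
    """Count how many of the agent's capability keywords appear in the task."""
    words = set(task.lower().split())
    return len(words & agent.capabilities)


def route(task: str, agents=AGENTS, top_k: int = 1) -> list:
    """Return the names of up to top_k agents with a non-zero score, best first."""
    ranked = sorted(agents, key=lambda a: score(task, a), reverse=True)
    return [a.name for a in ranked[:top_k] if score(task, a) > 0]


if __name__ == "__main__":
    print(route("Add a bookmark for this website in Chrome"))
    print(route("Save the email attachment and put the total in a spreadsheet cell", top_k=2))
```

Setting `top_k` above 1 mirrors the case where a request needs a combination of resources rather than a single agent.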
