The rise of AI agents has sparked a flurry of innovation—tools that can reason, retrieve, plan, and act across tasks semi-autonomously are everywhere now. According to McKinsey's 2024 survey, AI adoption has jumped to 72 percent across organizations, with many of these implementations now incorporating agent-like capabilities. But with all that development comes a pressing question: how do we know if these agents are actually good at what they do?
Consider Microsoft's Copilot, which users report regularly struggles with basic tasks. One user documented multiple failures: inability to access OneDrive files in Word despite explicit sharing permissions, failure to create PowerPoint presentations from shared files, and consistent errors when analyzing Excel data across multiple tabs. These everyday tasks—simple for human assistants—apparently remain major challenges for today's AI systems. Stories like these highlight why effective benchmarking matters.
In the world of machine learning, benchmarking is how we get our bearings. It helps researchers and developers make meaningful comparisons, track progress, and spot bottlenecks. So naturally, as the agent ecosystem grows, benchmarks have started popping up. But after exploring several of these recent efforts, I've found myself wondering—are we benchmarking too early, or perhaps benchmarking the wrong things?
A few standout agent benchmarks have emerged recently:
As Dr. Melanie Mitchell, AI researcher and author, noted recently in a tweet: "AI surpassing humans on a benchmark is not the same (at all!) as AI surpassing humans on a general ability. E.g., just because a benchmark has 'language understanding' in its name doesn't mean it tests general lang. understanding." This insight applies equally to agent benchmarks, which may not reflect real-world capabilities.
Each benchmark tackles a different piece of the agent puzzle. And that's useful—there's no universal definition of an agent yet, so it makes sense that benchmarks reflect that diversity. But that also means comparisons between them (and to real-world applications) aren't always straightforward.
Many benchmarks today are still rooted in evaluating the language model, not the agent system as a whole.
For example, some simulate tasks by prompting an LLM to “pretend” it’s an agent rather than running tests on actual orchestrated agents that integrate memory, retrieval, tools, and error handling. That makes sense from a simplicity standpoint, but it leaves out huge chunks of what makes agents powerful (or unreliable).
Take the case of a customer service agent benchmark that tests response quality but doesn't evaluate whether the agent can correctly access a knowledge base when a customer asks about a specific product detail. The agent might generate a perfectly coherent response that's factually wrong because the retrieval component wasn't tested.
So a gap emerges: we're often testing whether an LLM can act like an agent, rather than measuring how well real agents perform in production-like settings. In my honest opinion, this means we don't yet have true AI agent benchmarks; what we have are LLM-as-an-agent benchmarks.
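To make that gap concrete, here's a minimal sketch of what a grounding-aware check could look like for the customer service example above: scoring not just whether a reply reads well, but whether the claimed product detail actually matches the knowledge base the agent was supposed to consult. The `agent.answer` and `kb.lookup` interfaces are hypothetical stand-ins, not any particular framework.

```python
def grounded_accuracy(agent, kb, cases) -> float:
    """Fraction of answers whose key fact is supported by the knowledge base.

    `cases` is assumed to look like:
    {"question": ..., "product_id": ..., "field": ...}
    """
    hits = 0
    for case in cases:
        reply = agent.answer(case["question"])            # hypothetical agent API
        truth = kb.lookup(case["product_id"], case["field"])  # ground truth from the KB
        # A fluent answer that contradicts the KB counts as a failure.
        if truth and str(truth).lower() in reply.lower():
            hits += 1
    return hits / len(cases)
```

A fluency-only benchmark would happily give full marks to answers this check would flag.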
From my perspective, there are a few consistent blind spots:
Benchmarks often isolate individual capabilities—like planning or tool use—but rarely test how these parts function together in a full production pipeline. Real agents juggle multiple steps, systems, and APIs in messy ways.
Real-world example: An agent might excel at planning a data analysis workflow in isolation but fail completely when it needs to authenticate with a database, handle rate limits, and format results for visualization—all in sequence.
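One way to test this is to score each stage of the workflow in sequence rather than the plan in isolation. Below is a minimal sketch under that assumption; the `agent.run_step` method and the step names are hypothetical placeholders for whatever orchestration framework is under test.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    name: str
    ok: bool
    error: str | None = None

def evaluate_pipeline(agent, task: str) -> list[StepResult]:
    """Run the workflow end to end and record where it breaks."""
    results = []
    for step in ("authenticate_db", "run_query", "format_for_viz"):
        try:
            output = agent.run_step(step, task)      # hypothetical orchestration API
            results.append(StepResult(step, ok=output is not None))
        except Exception as exc:                     # auth failures, rate limits, bad formats
            results.append(StepResult(step, ok=False, error=str(exc)))
            break                                    # downstream steps depend on this one
    return results

def passed(results: list[StepResult]) -> bool:
    """A run passes only if every dependent step succeeded in order."""
    return all(r.ok for r in results)
```

An agent that plans beautifully but fails at `authenticate_db` scores zero here, which is exactly the signal isolated capability tests miss.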
Benchmark prompts tend to be clean and well-scoped. In production, users don't write prompts like researchers. Agents face vague instructions, misspelled words, ambiguous language, and context shifts.
Real-world example: A benchmark might test "Analyze this dataset for trends," while a real user asks, "can u tell me whats going on with our sales numbers from last month? they seem weird lol." The latter requires much more robust understanding.
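A simple robustness check is to run the same underlying task phrased both ways and measure the gap. This is a sketch only; `run_agent` and `meets_goal` are hypothetical hooks into whatever agent and success criterion you use.

```python
CLEAN_PROMPT = "Analyze this dataset for trends."
MESSY_VARIANTS = [
    "can u tell me whats going on with our sales numbers from last month? they seem weird lol",
    "sales look off lately?? whats up with that",
    "quick q - anything strange in last months numbers",
]

def robustness_gap(run_agent, meets_goal) -> float:
    """Drop in success rate when moving from the clean prompt to messy phrasings."""
    clean_score = 1.0 if meets_goal(run_agent(CLEAN_PROMPT)) else 0.0
    messy_scores = [1.0 if meets_goal(run_agent(p)) else 0.0 for p in MESSY_VARIANTS]
    return clean_score - sum(messy_scores) / len(messy_scores)
```

A large gap tells you the benchmark score was measuring prompt hygiene as much as agent capability.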
Can agents retrieve relevant context? Use real tools reliably? Recover when tools fail or time out? These mechanics often get skipped in favor of single-shot responses.
Real-world example: When asked to "book the cheapest flight to Toronto next Friday," an agent might need to: check multiple travel APIs (which could timeout), compare prices, verify availability, and confirm booking details—all while handling potential failures at each step.
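Fault injection is one way to pressure-test that failure handling: wrap a tool so it times out some fraction of the time and check whether the agent retries or falls back instead of confidently reporting an unbooked flight. Everything below is a hypothetical sketch; `agent.run`, the tool signature, and the result shape are assumptions, not a real travel API.

```python
import random

class FlakyTool:
    """Wraps a tool function and injects intermittent timeouts."""

    def __init__(self, tool_fn, failure_rate: float = 0.3, seed: int = 0):
        self._tool_fn = tool_fn
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)   # seeded so runs are reproducible
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self._rng.random() < self._failure_rate:
            raise TimeoutError("simulated flight API timeout")
        return self._tool_fn(*args, **kwargs)

def recovered(agent, search_flights) -> bool:
    """Did the agent complete the booking despite intermittent tool failures?"""
    flaky = FlakyTool(search_flights)
    result = agent.run("Book the cheapest flight to Toronto next Friday",
                       tools=[flaky])                       # hypothetical agent API
    # Recovery means retrying (more than one tool call) and still confirming the booking.
    return flaky.calls > 1 and result.get("booking_confirmed", False)
```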
Persistent memory and strategic planning are essential for longer tasks. Yet most current benchmarks don't pressure-test agents across time or require them to manage goals beyond the immediate prompt.
Real-world example: An agent helping manage a project needs to remember previous conversations, track deadlines, and adjust plans when new information emerges—capabilities rarely tested comprehensively in current benchmarks.
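A cross-session probe is a minimal way to test this: state a fact in one conversation, then ask about it in a fresh one. The `new_session` factory below is a hypothetical stand-in for an agent plus whatever persistence layer it uses; the test only passes if information survives beyond the immediate context window.

```python
def memory_probe(new_session) -> bool:
    """Check that a detail from an earlier conversation is recalled later."""
    first = new_session(user_id="pm-42")
    first.send("The design review moved to March 14th, please remember that.")
    first.close()

    later = new_session(user_id="pm-42")    # fresh context, same user
    reply = later.send("When is the design review again?")
    return "march 14" in reply.lower()
```

Most current benchmarks never leave the first session, so this kind of failure never shows up in their scores.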
Despite these limitations, there are legitimate reasons why the field has embraced early benchmarking:
The challenge remains ensuring these benchmarks evolve to address the full complexity of agent systems rather than just their language model components.
It's still early, but I'm excited about where this could go. I'd love to see benchmarks that test:
This could look like structured environments with increasing complexity, sandbox simulations, or even integrations with real production-style systems.
To move toward more meaningful agent benchmarks, I recommend:
There's also a pressing need for more enterprise-focused benchmarks, given that businesses represent the primary target market for most agent systems. Enterprise environments introduce unique challenges—complex legacy systems, strict security requirements, domain-specific workflows, and high-stakes decision contexts—that aren't adequately represented in current benchmarks. While isolated capability testing certainly has its place and value in the development lifecycle, we need complementary benchmarks that evaluate how these capabilities function together in enterprise settings with their distinctive constraints and requirements.
The trick will be balancing realism with reproducibility—which has always been the tension in benchmarking.
It's tempting to rush toward benchmarks as a sign of maturity. But with agents, we're still very much in the wild experimentation phase. That doesn't mean benchmarking is wrong—it means we need to be thoughtful about what we're testing and why.
Salesforce's CRM LLM benchmark offers a compelling example of what more meaningful evaluation could look like. Unlike many benchmarks that focus on academic or consumer use cases, Salesforce specifically addresses business relevance by using real-world CRM data. What makes this benchmark particularly valuable is its use of expert human evaluations by actual CRM practitioners, along with comprehensive assessment of factors that matter in production: accuracy, speed, cost, and trust considerations. While this benchmark is specifically for LLMs rather than complete agent systems, it represents the level of real-world relevance and practical evaluation that AI agent benchmarks should aspire to achieve.
More than anything, I hope this is a phase of exploration. The benchmarks we have are a great start—but the most useful ones might still be ahead of us, shaped by deeper understanding, messier realities, and a broader view of what real-world agent performance actually entails.
Until then, I'll be keeping my eye on the landscape—and probably breaking a few of my own agents along the way.