Over 8,000 Model Context Protocol servers were registered in 2025. Very few come with compelling evidence that they make agents more useful at completing real tasks.
The benchmarks that do exist are measuring the wrong thing. They test whether models can use MCP tools correctly. They don't test whether adding an MCP server to your agent actually improves outcomes.
What the existing benchmarks measure
- MCP-Bench: Measures how well LLMs discover, select, and use tools from 28 open-source servers
- MCP-Atlas: Measures how well LLMs orchestrate complex multi-tool workflows
- MCP-Universe: Measures how LLMs perform on hard tasks with tools they haven't seen
What they don't measure
None of these benchmarks answers the question that matters to a developer building an MCP server: "Does adding MY server to MY agent improve task success rates?" What a developer needs is the marginal effect of one server on one agent across a diverse set of real-world tasks.
It is entirely possible to build an MCP server with a tool that is adopted in 100% of scenarios, fails 0% of the time, and still reduces the agent's overall task performance. I know this because I've measured the performance of several MCP servers for coding agents across tens of runs on 500 SWE-bench Verified tasks.
A/B testing for MCP servers
The answer is paired comparison. Run the same agent on the same tasks twice: once with your MCP server, once without. Hundreds of real-world task datasets already exist for this (SWE-bench, TerminalBench, and more). The infrastructure for controlled experiments is there. What was missing was a tool to run them.
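To make the design concrete, here is a minimal sketch of a paired run. The `run_agent` and `task_passes` hooks are hypothetical placeholders for your own harness, not mcpbr's API; the point is only the shape of the experiment: every task runs twice under identical conditions except for the MCP server.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class PairedResult:
    task_id: str
    baseline_pass: bool  # agent without the MCP server
    mcp_pass: bool       # same agent, same task, with the MCP server


def paired_run(
    tasks: Iterable[str],
    run_agent: Callable[[str, bool], str],    # (task_id, use_mcp) -> agent output
    task_passes: Callable[[str, str], bool],  # (task_id, output) -> resolved?
) -> list[PairedResult]:
    """Run each task twice, once per arm, and record the paired outcomes."""
    results = []
    for task_id in tasks:
        baseline_out = run_agent(task_id, False)  # control arm
        mcp_out = run_agent(task_id, True)        # treatment arm
        results.append(PairedResult(
            task_id=task_id,
            baseline_pass=task_passes(task_id, baseline_out),
            mcp_pass=task_passes(task_id, mcp_out),
        ))
    return results
```

Because each task serves as its own control, the analysis can focus on the tasks where the two arms disagree rather than comparing two noisy aggregate scores.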
I built mcpbr to automate exactly this. It runs paired experiments across any task dataset, computes resolution deltas, and reports statistical significance. It's open source on GitHub and available on PyPI and npm. The full methodology is in the pre-print.
What we found when we measured
We tested an experimental MCP server with Claude Sonnet 4 across 500 SWE-bench Verified tasks. The result: resolution rate dropped from 49.8% to 42.4%, even as overall cost fell ~15%. The effect varied wildly by repository — neutral on some, devastating on others. None of this was visible without a controlled experiment. The full results and analysis are in the first post.
| Metric | Baseline | With MCP | Change |
|---|---|---|---|
| Tasks resolved | 249/500 (49.8%) | 212/500 (42.4%) | -14.9% relative |
| Only MCP passes | — | 18 | |
| Only baseline passes | 55 | — | |
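The discordant rows are what drive significance: 18 tasks were resolved only with the MCP server, 55 only by the baseline. One reasonable way to test this, not necessarily the exact test mcpbr reports, is an exact McNemar test, which reduces to a binomial test on the discordant pairs. A sketch using the numbers from the table:

```python
from scipy.stats import binomtest

baseline_resolved = 249  # out of 500, from the table above
mcp_resolved = 212
only_mcp = 18        # tasks resolved only with the MCP server
only_baseline = 55   # tasks resolved only by the baseline

# Relative change in resolution rate: (212 - 249) / 249 ~ -14.9%
relative_change = (mcp_resolved - baseline_resolved) / baseline_resolved

# Exact McNemar test: under the null, a discordant task is equally likely to
# favor either arm, so the 18 MCP-only wins among 73 discordant tasks should
# look like a draw from Binomial(73, 0.5).
result = binomtest(only_mcp, only_mcp + only_baseline, p=0.5)

print(f"relative change in resolution rate: {relative_change:+.1%}")
print(f"exact McNemar p-value: {result.pvalue:.2e}")
```

With the discordant tasks split 18 to 55, the drop is far outside what run-to-run noise would explain, which is exactly the kind of signal an aggregate leaderboard score hides.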
The logs and traces from that run let us redesign the tool interface and test each iteration against the last. Instead of shipping on vibes, we ship on data.
Test your MCP server
If you're shipping an MCP server, benchmark it before your users do. Install mcpbr, point it at a task dataset, and get real numbers on what your server does to agent performance. One run will tell you more than any leaderboard.
We're using it to iterate on the server we tested. Every change gets measured against the last. That's the bar. Ship with evidence, not assumptions.
Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology studying LLM agent evaluation and tool augmentation. The preprint is available at doi.org/10.5281/zenodo.18627369.