The Problem
My company ran into a problem trying to test a Model Context Protocol (MCP) server we created to boost the performance of AI tools like coding agents. While there are plenty of good coding benchmarks available online, nearly every one exists to measure the performance of the underlying LLM.
No available tool let users easily measure the performance improvement from adding their MCP server to an agent. Translation: no one is rigorously testing their MCPs. At least, not to the standard we hold models to.
Many MCP developers are shipping blind (like we were), hoping their tools actually improve agent performance instead of simply wasting context. Users have no way to compare MCPs or verify marketing claims.
By 4AM on January 17th, I'd been trying to run an evaluation of our MCP server for three weeks. We tried running on the raw SWE-bench dataset with SWE-agent. We tried adding our MCP tool to an agent framework with built-in benchmark support (OpenHands, in this case).
We were at the end of our rope, and our runway was draining away like sand in an hourglass.
What I Built
I found that I needed a command-line tool that could do the following:
- Run SWE-bench Verified to compare a baseline agent against one using my MCP tool
- Calculate basic statistics about success rates, token usage, and turn efficiency (the kind of summary sketched below)
- Collect the logs and traces from each run so I could debug failed tool calls
As an added bonus, I threw in an Azure provider that would create and destroy Virtual Machines so that I didn't have to keep doing it manually.
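To make the statistics bullet concrete, here is a minimal sketch of the comparison summary I had in mind. The `TaskResult` record and its field names are hypothetical, purely for illustration, not mcpbr's actual output schema.

```python
from dataclasses import dataclass

# Hypothetical per-task result record; the fields are illustrative,
# not mcpbr's real output format.
@dataclass
class TaskResult:
    task_id: str
    resolved: bool    # did the agent's patch pass the SWE-bench tests?
    tokens_used: int  # total tokens the agent consumed on this task
    turns: int        # number of agent turns before it finished

def summarize(label: str, results: list[TaskResult]) -> None:
    n = len(results)
    solved = sum(r.resolved for r in results)
    avg_tokens = sum(r.tokens_used for r in results) / n
    avg_turns = sum(r.turns for r in results) / n
    print(f"{label}: {solved}/{n} resolved ({100 * solved / n:.1f}%), "
          f"avg {avg_tokens:,.0f} tokens, avg {avg_turns:.1f} turns")

# Usage, given two lists of TaskResult parsed from run logs:
#   summarize("baseline", baseline_results)
#   summarize("with MCP", mcp_results)
```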
mcpbr is open-source software. It works by spinning up the prebuilt Docker container for each SWE-bench task (each ships with a Python environment preinstalled), injecting a headless Claude Code instance into the container with a special configuration, and measuring performance with and without your MCP.
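Under the hood, the core loop is roughly shaped like the sketch below, which uses the `docker` Python SDK. The image name, the prompt, and the Claude Code flags are illustrative placeholders and assumptions, not mcpbr's actual internals.

```python
import docker  # pip install docker

client = docker.from_env()

def run_task(image: str, mcp_config: str | None) -> str:
    """Run one SWE-bench task in its prebuilt container and return the transcript.

    Assumes the image has the `claude` CLI installed and, when evaluating an
    MCP server, an MCP config file available inside the container.
    """
    # `claude -p` runs Claude Code headlessly (non-interactive, print-and-exit).
    cmd = "claude -p 'Fix the failing tests in this repository.'"
    if mcp_config is not None:
        cmd += f" --mcp-config {mcp_config}"  # flag name assumed; check `claude --help`

    # Run to completion and capture the agent's output.
    logs = client.containers.run(
        image=image,                  # e.g. the prebuilt image for this task
        command=["bash", "-lc", cmd],
        remove=True,                  # throw the container away afterwards
        stdout=True,
        stderr=True,
    )
    return logs.decode()

# One pass without the MCP server (baseline) and one with it, per task:
#   baseline_log = run_task(task_image, mcp_config=None)
#   mcp_log = run_task(task_image, mcp_config="/workspace/mcp.json")
```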
It was designed with a flexible architecture to support other agent harnesses, API providers, and benchmarks in the future. Development of the repository is community-driven, with more than nine contributors so far, and many of them use AI agents to push our roadmap along.
What I Learned
Building mcpbr taught me something I didn't expect. MCP servers should be tested like APIs, not like plugins.
APIs have contracts they are expected to honor, while plugins mostly need to avoid crashing. An MCP server not only needs to return data of the right size and shape, it also needs to fulfill the implicit promise made in its tool description, or agents won't keep reaching for it.
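As a concrete illustration of that contract mindset, here is the kind of check I mean, written against the official MCP Python SDK's stdio client. The server command, tool name, and thresholds are assumptions made up for this example, and the SDK calls reflect my reading of its current API rather than anything mcpbr ships.

```python
import asyncio

# Official MCP Python SDK (pip install mcp); names per the current API,
# which may drift between versions.
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def check_contract() -> None:
    # Hypothetical launch command for the MCP server under test.
    server = StdioServerParameters(command="python", args=["my_server.py"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Contract check 1: the tool we advertise actually exists.
            tools = await session.list_tools()
            assert "search_code" in [t.name for t in tools.tools], "advertised tool is missing"

            # Contract check 2: a valid call returns usable, reasonably sized
            # content instead of an error or a context-flooding wall of text.
            result = await session.call_tool("search_code", {"query": "parse_args"})
            assert not result.isError, "tool call failed on a valid input"
            assert result.content, "tool returned no content"
            text = "".join(c.text for c in result.content if hasattr(c, "text"))
            assert len(text) < 20_000, "response is big enough to flood the context"

asyncio.run(check_contract())
```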
MCP servers also exhibit unique failure modes, because agents respond to failures non-deterministically. If a tool errors or responds too slowly, Claude Code might loop, hallucinate, or ignore the tool entirely.
Performance must also be quantified. The headline metric is benchmark task success rate, where you'd hope to see an improvement, but other key metrics include tool adoption rate (how often Claude calls your tool), tool failure rate, and token efficiency.
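Here is a rough sketch of how those could be computed from a single run's transcript, assuming a simplified event format; the event shape and the exact metric definitions are my own choices for illustration, not the actual Claude Code or mcpbr log schema.

```python
# Hypothetical, simplified transcript events for one run, e.g.:
#   {"type": "tool_call", "tool": "search_code", "ok": True, "tokens": 1200}
#   {"type": "message", "tokens": 800}
def tool_metrics(events: list[dict], tool_name: str) -> dict:
    calls = [e for e in events if e["type"] == "tool_call" and e["tool"] == tool_name]
    turns = [e for e in events if e["type"] in ("tool_call", "message")]
    total_tokens = sum(e.get("tokens", 0) for e in events)
    return {
        # How often the agent reached for this tool, per turn.
        "adoption_rate": len(calls) / max(len(turns), 1),
        # How often the tool call itself failed.
        "failure_rate": sum(not c["ok"] for c in calls) / max(len(calls), 1),
        # Share of the run's tokens spent on this tool's calls and results.
        "token_share": sum(c.get("tokens", 0) for c in calls) / max(total_tokens, 1),
    }
```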
No tool existed that treated MCP servers as proper first-class candidates for this type of evaluation, so I had to create one myself.
What's Next
What's next for mcpbr? As we speak, a team of human and agent contributors is humming away on our next milestones.
We've run thousands of SWE-bench task instances, and we'll keep testing new benchmarks and infrastructure providers as we add experimental support for them. Is there a particular benchmark you'd like to see added?
Beyond the v1.0 milestone, mcpbr has an exciting roadmap of research-oriented features. The logs and traces we collect could be extremely interesting source data for reinforcement learning environments. Think of it like an MCP Gym: a place where you can train agents to use tools more effectively by replaying real benchmark sessions. Every tool call, every failure, every recovery strategy gets captured. That's a dataset.
We're also tracking some ambitious features on the roadmap:
- Multi-agent evaluation support for testing how MCPs perform when multiple agents collaborate
- Adversarial testing mode to stress-test your MCP with malformed inputs and edge cases
- Kubernetes support for teams who want to run benchmarks at scale on their own infra
Thank You
I want to take a moment to thank the incredible people who have contributed to mcpbr. This project wouldn't be where it is today without their time, talent, and belief in what we're building.
A heartfelt thank you to codegraphtheory, robotson, sakshikirmathe, Simon Ohara, and Jonathan Popham. Your contributions have helped shape mcpbr into something genuinely useful for the MCP community.
A special thanks to the teams at GitHub behind GitHub Actions and Dependabot for keeping our CI humming and our dependencies secure. And to the team at Anthropic for building Claude, the AI that powers so much of our development workflow.
If you're reading this and want to join us in contributing to mcpbr, we'd love to have you.