Existing coding benchmarks like SWE-bench are designed to measure the performance of the underlying LLM, not the tools it uses. mcpbr is built specifically to treat MCP servers as first-class evaluation candidates. It runs the same SWE-bench Verified tasks with and without your MCP server to isolate its actual impact on agent performance.
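To make that comparison concrete, here is a minimal sketch of the kind of paired analysis a with/without run enables. The helper and the result values are illustrative, not mcpbr's actual output format; the task IDs just follow SWE-bench's instance-ID naming.

```python
# Illustrative only: the shape of a with/without comparison.
# The "resolved" flags below are made-up placeholders, not real results.

def resolve_rate(results: list[dict]) -> float:
    """Fraction of tasks whose generated patch passed the evaluation tests."""
    return sum(r["resolved"] for r in results) / len(results)

baseline = [{"task": "astropy__astropy-12907", "resolved": True},
            {"task": "django__django-11019", "resolved": False}]   # agent alone
with_mcp = [{"task": "astropy__astropy-12907", "resolved": True},
            {"task": "django__django-11019", "resolved": True}]    # agent + MCP server

lift = resolve_rate(with_mcp) - resolve_rate(baseline)
print(f"Resolve-rate lift from the MCP server: {lift:+.1%}")
```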
mcpbr works by spinning up a prebuilt Docker container for each task, launching a headless Claude Code instance inside it with your MCP server injected into its configuration, and capturing detailed logs and traces from every tool call. That gives you the data to debug failed calls, spot patterns in agent behavior, and quantify whether your MCP server is helping or just burning tokens. The architecture is deliberately flexible, so it can support other agent harnesses, API providers, and benchmarks in the future.
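To illustrate the flow, here is a rough sketch of what a single task run could look like under some assumptions: SWE-bench-style prebuilt images, Claude Code available inside the container with its headless `-p` mode, and a project-scoped `.mcp.json` used to expose the server under test. The image naming, the `/testbed` path, and the server name are placeholders, and mcpbr's real internals may differ.

```python
# A sketch of one with/without task run, NOT mcpbr's actual implementation.
# Assumes: docker on the host, a prebuilt image per instance, and Claude Code
# installed inside the container.
import json
import subprocess

def evaluate_task(instance_id: str, problem: str, mcp_server: dict | None) -> dict:
    """Run one task in its prebuilt container, optionally with an MCP server attached."""
    image = f"swebench/sweb.eval.x86_64.{instance_id}"  # assumed image-naming convention
    container = subprocess.run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        if mcp_server is not None:
            # Inject a project-scoped MCP config so the agent can discover the server's tools.
            config = json.dumps({"mcpServers": {"server-under-test": mcp_server}})
            subprocess.run(
                ["docker", "exec", container, "sh", "-c",
                 f"cat > /testbed/.mcp.json << 'EOF'\n{config}\nEOF"],
                check=True,
            )
        # Headless Claude Code run; the JSON output carries the transcript and tool calls.
        result = subprocess.run(
            ["docker", "exec", "-w", "/testbed", container,
             "claude", "-p", problem, "--output-format", "json"],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout)
    finally:
        subprocess.run(["docker", "rm", "-f", container],
                       capture_output=True, check=False)
```

Running the same instance once with `mcp_server=None` and once with the server config, then diffing the captured transcripts, is what lets you trace individual tool calls and attribute any change in outcome to the server itself.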