Evaluation

3 items

Blog posts

Everyone Is Benchmarking MCP Servers Wrong

Existing MCP benchmarks rank models, not servers. Here's how to A/B test whether your MCP server actually improves agent performance.

AI, MCP, Research, Evaluation

Why I Built mcpbr

MCP developers are shipping tools without evidence that they work. I built mcpbr to find out. Here are results from a 500-task controlled SWE-bench experiment that surprised me.

AI, MCP, Open Source, Developer Tools, Research, Evaluation

Projects

mcpbr

Benchmark runner for Model Context Protocol servers.

20 Python
Python, AI, MCP, Evaluation, Developer Tools
