Evaluation

5 items

Blog posts

SWE-bench Verified Is Broken: 5 Things I Found in the Source Code

March 6, 2026

After building 1,798 SWE-bench containers, I dug into the source. The tests reject correct solutions, and every frontier model has memorized the answers.

AI • Open Source • Evaluation

SWE-bench Tests Run 6x Faster on ARM64 with Native Containers

March 5, 2026

SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.

AI • Open Source • Evaluation • Go

Projects

matchspec

Eval framework. Define what correct means, test against it, get results.

22 • Go

Go • AI • Evaluation • Mist-stack

evaldriven.org

Ship evals before you ship features.

18 • Markdown

AI • Evaluation • Methodology

mcpbr supermodeltools

Benchmark runner for Model Context Protocol servers. Paired comparison experiments on SWE-bench.

6 • Python

Python • AI • Evaluation • MCP • Methodology