SWE-bench Verified Is Broken: 5 Things I Found in the Source Code
After building 1,798 SWE-bench containers, I dug into the source. The tests reject correct solutions and every frontier model has memorized the answers.
SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.
Eval framework. Define correct, test against it, get results.
Ship evals before you ship features.
Benchmark runner for Model Context Protocol servers. Paired comparison experiments on SWE-bench.