Open Source

2 items

Blog posts

SWE-bench Verified: How fail_to_pass Tests and Task Instances Work (And Why It's Broken)

March 6, 2026

How SWE-bench Verified's fail_to_pass and pass_to_pass tests and task instances actually work, and why every frontier model score is contaminated. A source code analysis.

Tags: AI, Open Source, Evaluation

SWE-bench Tests Run 6x Faster on ARM64 with Native Containers

March 5, 2026

SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.

Tags: AI, Open Source, Evaluation, Go
