How SWE-bench Verified's task instances and their fail_to_pass and pass_to_pass tests actually work, and why every frontier model score is contaminated. A source code analysis.
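As a rough illustration of the grading rule, here is a minimal sketch: a task instance counts as resolved only when, after the candidate patch is applied, every fail_to_pass test passes (the issue is fixed) and every pass_to_pass test still passes (nothing regressed). The `TaskInstance` class, the `is_resolved` function, and the sample instance and test IDs are hypothetical names chosen for this example, not the actual SWE-bench harness code.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical, simplified view of one benchmark task instance."""
    instance_id: str
    fail_to_pass: list[str] = field(default_factory=list)  # tests the fix must turn green
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must stay green


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """Return True when the patched repo passes both test groups.

    `passed_tests` is the set of test IDs that passed after the
    candidate patch was applied (an assumption of this sketch).
    """
    fixes_issue = all(t in passed_tests for t in task.fail_to_pass)
    no_regressions = all(t in passed_tests for t in task.pass_to_pass)
    return fixes_issue and no_regressions


# Made-up example data, for illustration only.
task = TaskInstance(
    instance_id="example__repo-1",
    fail_to_pass=["tests/test_bug.py::test_fixed"],
    pass_to_pass=["tests/test_core.py::test_a", "tests/test_core.py::test_b"],
)

good_run = {
    "tests/test_bug.py::test_fixed",
    "tests/test_core.py::test_a",
    "tests/test_core.py::test_b",
}
# The target test passes, but the patch broke an existing test.
regressed_run = {"tests/test_bug.py::test_fixed", "tests/test_core.py::test_a"}

resolved_good = is_resolved(task, good_run)
resolved_regressed = is_resolved(task, regressed_run)
```

In this sketch a patch that fixes the target test but breaks one existing test is not counted as resolved, which is why both test groups matter.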
SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.
AI coding agents lose critical structural understanding of codebases when context compaction occurs. Code graphs provide persistent external memory—representing functions, classes, and dependencies as queryable relationships—so agents can recover context without re-reading files from scratch.
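To make "queryable relationships" concrete, here is a minimal in-memory sketch of a code graph. It stores directed, labeled edges between symbols and answers two questions an agent might ask after compaction: what does this symbol point to, and what points to it. The `CodeGraph` class, the relation names (`calls`, `defines`), and the symbol names are all hypothetical; a real tool would populate the graph from a parser and persist it outside the model's context.

```python
from collections import defaultdict


class CodeGraph:
    """Hypothetical minimal code graph: labeled edges between symbol names."""

    def __init__(self) -> None:
        # (source, relation) -> set of targets, and the reverse index.
        self._out: dict[tuple[str, str], set[str]] = defaultdict(set)
        self._in: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        """Record that `src` has `relation` to `dst` (e.g. a call)."""
        self._out[(src, relation)].add(dst)
        self._in[(dst, relation)].add(src)

    def targets(self, src: str, relation: str) -> list[str]:
        """Symbols that `src` points to via `relation`, sorted."""
        return sorted(self._out.get((src, relation), ()))

    def sources(self, dst: str, relation: str) -> list[str]:
        """Symbols that point to `dst` via `relation`, sorted."""
        return sorted(self._in.get((dst, relation), ()))


# Made-up symbols, for illustration only.
graph = CodeGraph()
graph.add_edge("billing.charge", "calls", "payments.Gateway.submit")
graph.add_edge("billing.refund", "calls", "payments.Gateway.submit")
graph.add_edge("payments.Gateway", "defines", "payments.Gateway.submit")

# "Who calls submit?" answered without re-reading any source file.
callers = graph.sources("payments.Gateway.submit", "calls")
callees = graph.targets("billing.charge", "calls")
```

Keeping a reverse index alongside the forward one is a design choice in this sketch: it makes "who depends on this?" as cheap as "what does this depend on?", which is the lookup an agent typically needs before editing a function.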