SWE-bench Verified: How fail_to_pass Tests and Task Instances Work (And Why It's Broken)
How SWE-bench Verified's fail_to_pass and pass_to_pass tests and task instances actually work, and why every frontier model score is contaminated. A source code analysis.