Open Source

2 items

Blog posts

SWE-bench Verified: How fail_to_pass Tests and Task Instances Work (And Why It's Broken)

How SWE-bench Verified's fail_to_pass and pass_to_pass tests and task instances actually work, and why every frontier model score is contaminated. A source code analysis.

AI, Open Source, Evaluation

SWE-bench Tests Run 6x Faster on ARM64 with Native Containers

SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.

AI, Open Source, Evaluation, Go

All tags

AI (15), Architecture (2), AWS (4), Cloud Computing (1), Code-graphs (3), Compiler (1), Developer Tools (3), Evaluation (5), Event-driven (1), Go (7), Inference (1), MCP (2), Methodology (2), Mist-stack (5), Observability (3), Open Source (2), Python (2), Serverless (1), TypeScript (4)