Evaluation

4 items

Blog posts

SWE-bench Tests Run 6x Faster on ARM64 with Native Containers

SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.

AI Open Source Evaluation Go

Projects

matchspec

Eval framework. Define correct, test against it, get results.

21 Go
Go AI Evaluation Mist-stack

evaldriven.org

Ship evals before you ship features.

7 Markdown
AI Evaluation Methodology

mcpbr supermodeltools

Benchmark runner for Model Context Protocol servers. Paired comparison experiments on SWE-bench.

4 Python
Python AI Evaluation MCP Methodology

All tags

AI (11) AWS (1) Cloud Computing (1) Compiler (1) Evaluation (4) Go (7) Inference (1) MCP (2) Methodology (2) Mist-stack (5) Observability (3) Open Source (1) Python (2) TypeScript (4)