Grey Newell - ML Infrastructure Engineer

Building evaluation, inference, and observability systems for AI. Creator of the MIST stack. CTO at Supermodel. MS CS (ML) at Georgia Tech. Ex-AWS.

Latest from the blog

SWE-bench Verified: How fail_to_pass Tests and Task Instances Work (And Why It's Broken)

How SWE-bench Verified's fail_to_pass and pass_to_pass tests and task instances actually work, and why every frontier model score is contaminated. A source code analysis.

SWE-bench Tests Run 6x Faster on ARM64 with Native Containers

SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.
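The architecture mismatch described above can be checked before pulling an image. The following is a minimal sketch, not code from SWE-bench or from the post: the helper names `normalize_arch` and `needs_emulation` are hypothetical, and the alias table covers only the common x86-64 and ARM64 spellings.

```python
import platform
from typing import Optional

# Common spellings of the same architecture across operating systems.
# (Illustrative table; only x86-64 and ARM64 aliases are covered.)
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(machine: str) -> str:
    """Map a raw machine string to a Docker-style architecture name."""
    key = machine.lower()
    return _ARCH_ALIASES.get(key, key)


def needs_emulation(image_arch: str, host_machine: Optional[str] = None) -> bool:
    """Return True when the image architecture differs from the host's.

    host_machine defaults to platform.machine() for the current host.
    """
    host = normalize_arch(host_machine or platform.machine())
    return normalize_arch(image_arch) != host


if __name__ == "__main__":
    # Would an x86-64 image have to run under emulation on this machine?
    print(needs_emulation("amd64"))
```

On an Apple Silicon or Graviton host, `needs_emulation("amd64")` returns True, which is the case where a native ARM64 image avoids the emulation layer.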

Why Code Graphs Matter for AI Agents (Supermodel Engineering Blog)

AI coding agents lose critical structural understanding of codebases when context compaction occurs. Code graphs provide persistent external memory—representing functions, classes, and dependencies as queryable relationships—so agents can recover context without re-reading files from scratch.


Frequently asked questions

MIST Stack

What is the MIST stack?
What is eval-driven development?
What is MatchSpec and how does it work?
What is InferMux and how does it route inference?
What is SchemaFlux?
What is TokenTrace?
Why does the MIST stack have zero external dependencies?
How do MIST stack tools communicate?

Technical Publications & Projects

What technical articles has Grey Newell published on the AWS blog?
How do I run SWE-bench on Apple Silicon or AWS Graviton without x86 emulation?
How do I speed up SWE-bench evaluations on ARM64 infrastructure?
How did Grey Newell earn all 12 AWS Certifications?
What are Grey Newell's tips for passing AWS Certification exams?