If you're running SWE-bench evaluations on ARM64 hardware, your test suites are running under x86 emulation. Apple Silicon Macs, AWS Graviton instances, it doesn't matter. The pre-built images are x86_64, and QEMU translates every instruction at runtime.
SWE-bench's FAQ lists ARM support as "experimental" and recommends an x86_64 machine. In practice, that means conda installs, pip builds, and pytest runs all go through QEMU's user-space translation layer. It works. It's just slow.
I wrote swe-bench-fast, a Go reimplementation of the SWE-bench eval harness that builds native ARM64 container images. On the test runner, I measured a 6.3x speedup over the emulated x86 images. I benchmarked on an M3 Pro, but the images run natively on Graviton3 and Graviton4 too.
## The 6.3x speedup
I selected 11 SWE-bench instances (one per repository; xarray is excluded, since none of its instances build natively on ARM64) and ran the same gold patches and test suites through both harnesses on the same machine. All images were pre-built and cached locally, and the patches were pre-computed. No agent inference time is included. This is purely test-runner wall-clock time: container start, patch apply, pytest, grade.
Machine: MacBook Pro M3 Pro (12 cores, 36 GB RAM). Docker: Colima VM with 10 CPUs, 28 GB RAM, linux/arm64.
| Instance | ARM64 native (s) | x86 emulated (s) | Speedup | Result match |
|---|---|---|---|---|
| astropy__astropy-12907 | 2.7 | 9.7 | 3.7x | yes |
| django__django-13346 | 2.7 | 18.9 | 7.0x | yes |
| matplotlib__matplotlib-14623 | 38.0 | 265.7 | 7.0x | yes |
| mwaskom__seaborn-3069 | 15.4 | 101.0 | 6.6x | yes |
| pallets__flask-5014 | 1.0 | 3.9 | 3.9x | yes |
| psf__requests-1142 | 1.1 | 4.8 | 4.3x | yes |
| pylint-dev__pylint-7277 | 14.0 | 76.0 | 5.4x | yes |
| pytest-dev__pytest-6197 | 4.7 | 28.2 | 6.1x | yes |
| scikit-learn__scikit-learn-25102 | 2.7 | 18.2 | 6.6x | yes |
| sphinx-doc__sphinx-10323 | 3.1 | 17.2 | 5.6x | yes |
| sympy__sympy-11618 | 1.9 | 8.0 | 4.2x | yes |
| Total | 87.3 | 551.7 | 6.3x | 11/11 |
The repos with heavier test suites (matplotlib at 265s emulated, seaborn at 101s) showed the largest absolute gains. All 11 instances produce identical results on both harnesses.
The full benchmark data and raw notes are in this gist.
## 78% of SWE-bench runs natively on ARM64
Out of 2,294 instances in the full SWE-bench dataset, 1,798 build and run natively on ARM64. The remaining 496 require x86 because they depend on binary conda packages (scikit-learn, matplotlib, xarray) that aren't published for ARM.
Those 496 instances still run under QEMU. There's no coverage gap. The 78% that go native just stop paying the emulation tax.
| Repository | ARM64 native | x86 required |
|---|---|---|
| django/django | 811 | 39 |
| sympy/sympy | 382 | 4 |
| scikit-learn/scikit-learn | 37 | 192 |
| matplotlib/matplotlib | 37 | 147 |
| pydata/xarray | 0 | 110 |
| sphinx-doc/sphinx | 185 | 2 |
| pytest-dev/pytest | 118 | 1 |
| astropy/astropy | 94 | 1 |
| Others | 134 | 0 |
The list of x86-only instances is defined in USE_X86 in the SWE-bench source.
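The enforcement amounts to a set-membership check at image-build time. A minimal sketch in Python, using a placeholder set (the real USE_X86 in the SWE-bench source lists all 496 x86-only instance IDs):

```python
# Placeholder set standing in for swebench's USE_X86; the IDs below are
# made up for illustration, not real instances.
USE_X86 = {"some-repo__some-instance-1234"}

def resolve_arch(instance_id: str, host_arch: str) -> str:
    """x86-only instances get x86_64 images regardless of host arch;
    everything else builds for the host architecture."""
    return "x86_64" if instance_id in USE_X86 else host_arch

print(resolve_arch("some-repo__some-instance-1234", "arm64"))  # x86_64
print(resolve_arch("django__django-13346", "arm64"))           # arm64
```

Skipping this check is what produces the silent runtime failures: the build succeeds, but the resulting ARM image can't satisfy the instance's pinned binary dependencies.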
## Comparable image sizes
I built 9 of the 11 benchmarked instances as native ARM64 images and compared on-disk sizes against the Epoch x86_64 images. Two instances (pylint, scikit-learn) failed to build on ARM64 due to dependency issues; they're excluded.
| Instance | ARM64 native | x86 Epoch | Difference |
|---|---|---|---|
| astropy__astropy-12907 | 3.41 GB | 3.20 GB | +6.6% |
| django__django-13346 | 3.34 GB | 3.44 GB | -2.9% |
| matplotlib__matplotlib-14623 | 5.95 GB | 6.03 GB | -1.3% |
| mwaskom__seaborn-3069 | 3.98 GB | 3.30 GB | +20.6% |
| pallets__flask-5014 | 3.30 GB | 2.97 GB | +11.1% |
| psf__requests-1142 | 3.11 GB | 2.67 GB | +16.5% |
| pytest-dev__pytest-6197 | 3.11 GB | 2.71 GB | +14.8% |
| sphinx-doc__sphinx-10323 | 3.36 GB | 3.00 GB | +12.0% |
| sympy__sympy-11618 | 3.20 GB | 3.10 GB | +3.2% |
On-disk, ARM64 images average about 8% larger due to the base image layers. By compressed content size (what actually gets pulled), ARM64 images are about 4% smaller. django is 17.6% smaller, sympy 12.8% smaller. It varies.
The Dockerfiles and package lists are identical to upstream. swe-bench-fast builds images through BuildKit with in-memory tar build contexts, which avoids the stray build artifacts that the upstream Python harness leaks into image layers. Net effect: native ARM64 images are roughly the same size.
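The compressed comparison can be reproduced by summing layer sizes from the registry manifest, since the sizes reported by `docker manifest inspect` are compressed blob sizes. A sketch with a made-up two-layer manifest:

```python
import json

def compressed_size(manifest: dict) -> int:
    """Sum the compressed layer sizes from an OCI/Docker image manifest,
    e.g. the JSON printed by `docker manifest inspect <image>`."""
    return sum(layer["size"] for layer in manifest.get("layers", []))

# Fabricated two-layer manifest for illustration (1 MiB + 2 MiB layers).
manifest = json.loads("""
{
  "schemaVersion": 2,
  "layers": [
    {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", "size": 1048576},
    {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", "size": 2097152}
  ]
}
""")
print(compressed_size(manifest))  # 3145728
```

Comparing this number across the ARM64 and x86 tags gives the pull-size difference without downloading either image.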
## What I had to fix
Four issues anyone hitting this path will encounter:
**Conda channel config changed.** Miniconda py311_23.11.0-2 now defaults to conda-forge only with channel_priority: strict. Older packages like setuptools==38.2.4 live on the defaults channel and won't resolve. The fix: explicitly configure both channels before building env images.
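A minimal sketch of that fix as it would run during the env-image build, assuming a system-wide Miniconda install (the `--system` scope is my choice; match it to how the image installs conda):

```shell
# Make both channels visible so legacy pins (e.g. setuptools==38.2.4,
# which only exists on defaults) can resolve, and relax strict priority.
conda config --system --prepend channels conda-forge
conda config --system --append channels defaults
conda config --system --set channel_priority flexible
```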
**make_test_spec defaults to x86_64.** Every call to make_test_spec hardcodes arch="x86_64". On ARM hosts, this means images are built for the wrong architecture unless you explicitly override it. I opened a PR (issue) to auto-detect via platform.machine().
**x86-only instances need enforcement.** Some instances must be x86 regardless of host arch. Without checking USE_X86 in the build pipeline, these instances silently get ARM images that fail at runtime. The broader ARM64 support PR by @SailorJoe6 addresses this along with JS and Java language support.
**Unpinned transitive dependencies break tests.** The upstream specs pin direct dependencies but not all transitives. When pip install -e .[test] resolves on ARM64, it can pull newer package versions than what the Epoch x86 images were built with. For sphinx instances, Pygments==2.19 changed HTML output for line number spans, causing pass-to-pass test failures. Pinning Pygments==2.18.0 to match the Epoch images fixed it. Any repo with HTML/rendering assertions is vulnerable to this kind of drift.
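One way to freeze the drifting transitive is a pip constraints file; the `constraints.txt` name and location here are my choice, not part of the upstream specs:

```shell
# Pin Pygments to the version the Epoch x86 images were built with,
# then let the editable install resolve everything else as usual.
printf 'Pygments==2.18.0\n' > constraints.txt
pip install -e ".[test]" -c constraints.txt
```

Unlike a direct `pip install Pygments==2.18.0` beforehand, a constraints file holds the pin even if the resolver would otherwise upgrade it to satisfy `.[test]`.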
## Try it yourself
swe-bench-fast is a standalone Go binary. It pulls pre-built ARM64 images from Docker Hub for the 78% of instances that support it, and Epoch x86 images for the rest. No Python, no image builds.
`swe-bench-fast run --dataset swe-bench-full.jsonl --predictions preds.jsonl`
That works on both ARM64 and x86. On ARM64, 1,798 instances run natively and 496 run under QEMU. On x86, everything runs natively via the Epoch images.
On an M-series Mac, allocate at least 120 GB disk and 8+ CPU cores to Docker Desktop or Colima.
On AWS Graviton (c7g, m7g, r7g, r8g), Docker runs natively with no VM layer. Install qemu-user-static for the x86-only instances. Graviton instances typically cost 20-40% less than comparable x86 EC2 instances; combined with the 6x speedup, that's a real difference in both cost and iteration time.
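On a Debian- or Ubuntu-family Graviton host, for instance, the setup looks like this (package names assume apt; other distros differ):

```shell
# qemu-user-static ships statically linked QEMU binaries, and binfmt
# registration lets the kernel hand x86_64 ELF binaries to QEMU.
sudo apt-get update
sudo apt-get install -y qemu-user-static binfmt-support
```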
The benchmark gist has the full methodology, raw data, and detailed notes.
## What's next
I'm building and pushing the 1,798 ARM64-native SWE-bench instance images to Docker Hub. The next post covers what that full build taught me about how SWE-bench actually works under the hood.
Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology. The raw benchmark data is available at gist.github.com. The eval harness source is at github.com/greynewell/swe-bench-fast.