I've built 1,798 custom SWE-bench containers that run natively on ARM processors. I've also run SWE-bench Lite, Verified, and Pro more than 100 times evaluating prototype products at Supermodel. This post covers some of the confusing, broken, or just plain odd things I've learned by working with SWE-bench and reading the source code directly.

1. Every problem predates October 2023

While checking logs from an agent run, I noticed something very odd. The problem the agent was given by SWE-bench to evaluate was a GitHub issue from 2017. That's really old!

Most frontier models' training data cuts off between 2023 and 2024. If most of the problems are older than that, then the repository, GitHub issue, and solution have almost certainly leaked into training data and contaminated the models. Each SWE-bench instance is taken from a popular open source repository, exactly the type of data all LLMs are trained on.

I decided to keep digging: are all of the problems this old? The SWE-bench paper (Appendix Table 21) reports the temporal distribution of all task instances:

Year     Task instances   % of total
< 2018         89             4.2%
2018          165             7.7%
2019          437            20.4%
2020          427            20.0%
2021          383            17.9%
2022          395            18.5%
2023          244            11.4%
Total       2,140

The collection pipeline scraped the top 100 PyPI repos as of August 2023 (paper Appendix A.1). The paper was published October 10, 2023. SWE-bench Verified (500 curated problems) was released in August 2024. Frozen data, no new problems.

The pipeline itself (get_tasks_pipeline.py) has no default cutoff:

parser.add_argument(
    "--cutoff_date",
    type=str,
    help="Cutoff date for PRs to consider in format YYYYMMDD",
    default=None,
)

Because the test set is frozen in time, any model trained after October 2023 has likely seen most or all of the problems and their solutions. That contaminates any accuracy measurement and makes the results unreliable.
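A few lines of arithmetic on the paper's year counts show how much of the benchmark predates a given training cutoff (counts copied from Table 21; the helper function is mine):

```python
# Task-instance counts by year, from SWE-bench paper Appendix Table 21.
COUNTS = {"<2018": 89, "2018": 165, "2019": 437, "2020": 427,
          "2021": 383, "2022": 395, "2023": 244}

def share_before_cutoff(cutoff_year: int) -> float:
    """Fraction of the 2,140 instances created strictly before cutoff_year."""
    total = sum(COUNTS.values())
    seen = sum(n for year, n in COUNTS.items()
               if year == "<2018" or int(year) < cutoff_year)
    return seen / total

# Every instance predates a 2024 training cutoff.
print(f"{share_before_cutoff(2024):.1%}")  # → 100.0%
# Even a model frozen at the start of 2022 has seen ~70% of the set.
print(f"{share_before_cutoff(2022):.1%}")  # → 70.1%
```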

2. The harness is x86-first

SWE-bench was designed to run on x86 hardware, and the prebuilt images from Epoch AI only support x86. This design decision excludes native execution on any recent-generation Apple hardware as well as cost-effective cloud runners like AWS Graviton. Instead, these architectures must emulate x86 via QEMU or Rosetta, and emulation is slow.

I was able to show a 6.3x speedup measured on my M3 MacBook Pro by compiling SWE-bench containers specifically for ARM, although 496 containers specifically require x86 emulation due to missing ARM binaries. A newer set of test instances could support ARM by default, and there are also a few small changes that would improve ARM support throughout the existing benchmark.

make_test_spec() defaults to x86:

def make_test_spec(
    ...
    arch: str = "x86_64",

No caller in the codebase passes a different value. The platform mapping supports ARM64, but nobody invokes it:

@property
def platform(self):
    if self.arch == "x86_64":
        return "linux/x86_64"
    elif self.arch == "arm64":
        return "linux/arm64/v8"
    else:
        raise ValueError(f"Invalid architecture: {self.arch}")

Some language Dockerfiles hardcode x86 binaries, while others are already architecture-aware:

Language    File                                What's hardcoded
JavaScript  dockerfiles/javascript.py line 27   deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
JavaScript  dockerfiles/javascript.py line 108  pnpm-linux-x64 binary download
Java        dockerfiles/java.py lines 15-19     maven-mvnd-1.0.2-linux-amd64.zip
Go          dockerfiles/go.py lines 16-46       Architecture-aware (uses dpkg --print-architecture)
Python      dockerfiles/python.py line 24       Architecture-aware (uses conda_arch variable)

USE_X86 defines the 496 instance IDs that require x86. It's exported in __init__.py but never referenced in build or evaluation logic. There's an unmerged force_x86 branch suggesting it was intended to be used but never was.

The README recommends an x86_64 machine and calls ARM64 support "experimental."
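One small improvement would be detecting the host architecture instead of hardcoding the default. A minimal sketch, assuming the two arch strings the platform property accepts and a USE_X86-style fallback set (detect_arch is my name, not existing harness code):

```python
import platform

def detect_arch(instance_id: str, x86_only_ids: frozenset) -> str:
    """Pick the arch string for make_test_spec(), falling back to x86_64
    emulation for instances known to ship only x86 binaries."""
    if instance_id in x86_only_ids:
        return "x86_64"
    machine = platform.machine().lower()
    # Map common host identifiers onto the two values the harness accepts.
    if machine in ("arm64", "aarch64"):
        return "arm64"
    return "x86_64"

# On an M-series Mac this returns "arm64" unless the instance is in the
# x86-only list (e.g. the 496 IDs in USE_X86).
```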

While not strictly "broken," the under-implemented ARM support prevents users from running the benchmark efficiently on popular local machines or cost-effective modern cloud hardware. On top of that, the benchmark problems don't measure what you might assume.

3. Problems test the last mile, not exploration

Counter to popular intuition, SWE-bench problems are mostly well-scoped. This is by design. If you look at logs of agents working on the problems, you don't see the agent navigating an unfamiliar codebase, finding key files, and reasoning about the architecture. The agent is being tested on writing a small, targeted fix once the general solution is known.

I argue that this is a feature of the benchmark (a controlled measurement), but that we should all calibrate our expectations regarding what an SWE-bench score means.

SWE-bench Lite explicitly filters for small, single-file patches (make_lite.py):

def filter_patch(instance):
    patch_text = instance["patch"]
    if (
        contains_non_modified_files(patch_text)
        or not leq_n_files(patch_text, 1)
        or not leq_n_hunks(patch_text, 3)
    ):
        return False
    return True

The scope constraints from criteria.py:

Constraint                Function                        Threshold
Max files in gold patch   leq_n_files()                   1
Max hunks                 leq_n_hunks()                   3
Max lines changed         leq_n_code_lines()              25
No added/removed files    contains_non_modified_files()   0
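These thresholds are easy to check directly against a unified diff. The real helpers parse the diff with a patch library; this plain-string version is my own simplified sketch of the same counting:

```python
def count_files_and_hunks(patch_text: str) -> tuple:
    """Count modified files ('diff --git' headers) and hunks ('@@' markers)
    in a unified diff."""
    lines = patch_text.splitlines()
    files = sum(1 for line in lines if line.startswith("diff --git"))
    hunks = sum(1 for line in lines if line.startswith("@@"))
    return files, hunks

patch = """diff --git a/foo.py b/foo.py
--- a/foo.py
+++ b/foo.py
@@ -1,3 +1,3 @@
-x = 1
+x = 2
"""
files, hunks = count_files_and_hunks(patch)
assert (files, hunks) == (1, 1)  # within Lite's 1-file / 3-hunk limits
```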

Even in full SWE-bench, each problem maps to a single PR. Test vs. fix is split by path matching (utils.py):

def extract_patches(pull: dict, repo: Repo) -> tuple[str, str]:
    patch = requests.get(pull["diff_url"]).text
    patch_test = ""
    patch_fix = ""
    for hunk in PatchSet(patch):
        if any(
            test_word in hunk.path for test_word in ["test", "tests", "e2e", "testing"]
        ):
            patch_test += str(hunk)
        else:
            patch_fix += str(hunk)
    return patch_fix, patch_test
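The substring match is crude: any path containing one of the keywords goes to the test patch. A small demo of the same heuristic (classify_path is my illustrative name) shows how easily it misfires:

```python
# Keyword list copied from extract_patches() above.
TEST_WORDS = ["test", "tests", "e2e", "testing"]

def classify_path(path: str) -> str:
    """Mirror the harness's substring heuristic for splitting a PR diff."""
    return "test" if any(w in path for w in TEST_WORDS) else "fix"

assert classify_path("tests/test_core.py") == "test"
assert classify_path("src/solver.py") == "fix"
# False positive: "test" appears as a substring of an unrelated name.
assert classify_path("src/contest_utils.py") == "test"
```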

The model receives the issue text and the full repo state at the commit before the fix. No ambiguity about which project, which branch, or which codebase. The job is to produce a diff.

Similar to the cultural debate amongst technologists about the diverging roles of "coders" vs "software engineers," the benchmark is an efficient measure of a model's ability to generate a narrowly targeted fix. It doesn't test codebase navigation or architectural reasoning in its current form.

4. Tests reject correct solutions

SWE-bench evaluation works by running "fail to pass" tests that must flip from failing to passing and "pass to pass" tests that must keep passing; a solution is marked correct only if every test behaves as expected. The tests are brittle enough that correct fixes can still break the suite. In February 2026, OpenAI published an audit of the 138 SWE-bench Verified problems (27.6% of the 500-problem set) that o3 did not consistently solve across 64 independent runs. They found that 59.4% of those problems had test design flaws that reject functionally correct submissions. I've seen the same pattern across hundreds of SWE-bench instances: test suites sometimes reject working code that solves the original issue.

Issue type      % of audited problems   Description
Narrow tests    35.5%                   Enforce specific implementation details, rejecting correct alternatives
Wide tests      18.8%                   Check functionality not specified in the problem description
Miscellaneous    5.1%                   Other test design issues
No issue found  40.6%                   Tests are fine

Narrow tests

Some tests are too "narrow": they check for specific implementation details that are not hard requirements of the problem at hand.

For example, in pylint-dev__pylint-4551, the problem description asks for Python type hints in UML generation. The PR introduces a function called get_annotation. The test file imports it by name:

from pylint.pyreverse.utils import get_annotation, get_visibility, infer_node

The problem description never mentions get_annotation. A correct solution using any other function name fails with:

ImportError: cannot import name 'get_annotation' from 'pylint.pyreverse.utils'

That results in a solution erroneously being marked as incorrect.

Wide tests

Other tests are too "wide" by contrast: they check behavior for issues never mentioned in the problem statement. Models almost never fix issues they were not told about.

In sympy__sympy-18199, the PR fixed three distinct issues: #17373, #17377, and #18212. The SWE-bench task description only describes #18212 (nthroot_mod function misses one root of x = 0 mod p). The tests cover all three. Models that correctly fix #18212 fail tests for the other two issues they were never told about.

The codebase acknowledges this

The Lite filter explicitly removes tests that check exact error messages (criteria.py):

def contains_pytest_match_arg(patch_test_text: str) -> bool:
    if any(
        [
            x in patch_test_text
            for x in [
                "pytest.raises",
                "pytest.warns",
                "pytest.deprecated_call",
            ]
        ]
    ):
        return "match" in patch_test_text
    if any(
        [
            x in patch_test_text
            for x in [
                "assertOutput",
                "assertRaises",
                "checks.Error",
            ]
        ]
    ):
        return True
    return False

These patterns are excluded from Lite because a correct fix with different error message wording fails them.
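pytest documents the match argument as a regex applied with re.search against the string of the raised exception, which is exactly why pinned wording is brittle. A sketch of those semantics (message_matches is my illustrative helper):

```python
import re

def message_matches(pattern: str, exc_message: str) -> bool:
    """Approximate pytest.raises(..., match=...) semantics:
    re.search of the pattern against str(exception)."""
    return re.search(pattern, exc_message) is not None

# The gold test pins the original wording...
assert message_matches("invalid input format", "invalid input format: 'x'")
# ...so an equally correct fix that rewords the message is rejected.
assert not message_matches("invalid input format", "input format 'x' is invalid")
```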

The grading logic treats any test missing from the log parser output as a failure, not as unknown (grading.py):

def test_passed(case: str, sm: dict[str, str]) -> bool:
    return case in sm and sm[case] in [TestStatus.PASSED.value, TestStatus.XFAIL.value]

def test_failed(case: str, sm: dict[str, str]) -> bool:
    return case not in sm or sm[case] in [
        TestStatus.FAILED.value,
        TestStatus.ERROR.value,
    ]

Resolution requires 100% on both fail-to-pass and pass-to-pass:

if f2p == 1 and p2p == 1:
    return ResolvedStatus.FULL.value
elif f2p < 1 and f2p > 0 and p2p == 1:
    return ResolvedStatus.PARTIAL.value
else:
    return ResolvedStatus.NO.value

The log parsers themselves are fragile. From the Django parser:

# TODO: This is very brittle, we should do better
# There's a bug in the django logger, such that sometimes a test output near the end gets
# interrupted by a particular long multiline print statement.

And a one-off workaround for a single instance:

# TODO: Temporary, exclusive fix for django__django-7188
if line.strip().startswith(
    "Applying sites.0002_alter_domain_unique...test_no_migrations"
):
    line = line.split("...", 1)[-1].strip()

The JavaScript Karma parser carries a similar warning:

def parse_log_karma(log: str, test_spec: TestSpec) -> dict[str, str]:
    """
    Different immutable.js instances use different test runners and log formats.
    Logic is brittle.
    """

In summary, the combination of narrow tests, wide tests, brittle log parsers, and all-or-nothing grading means the harness regularly rejects functionally correct solutions.

5. Models have memorized the answers

Circling back to issue #1: the coding issues provided by SWE-bench are old and public, and there's evidence that large models have memorized these specific problems and solutions in their weights.

Connecting with issue #4, models pass narrow tests specifically because they memorized the implementation details the test is checking for. Uncontaminated models trying correct-but-different solutions get rejected entirely.

What to make of all of this? OpenAI's conclusion (February 2026):

"improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time."

GPT-5.2 -- django__django-11451

Problem statement: ModelBackend.authenticate() shouldn't make a database query when username is None

When prompted with just the task ID and a hint, GPT-5.2 reproduced the exact gold patch:

 class ModelBackend(BaseBackend):
     def authenticate(self, request, username=None, password=None, **kwargs):
+        if username is None or password is None:
+            return
         UserModel = get_user_model()
         if username is None:
             username = kwargs.get(UserModel.USERNAME_FIELD)

It also referenced Django release history in its chain of thought:

"There is also edit_only parameter maybe added around 4.1 or 4.2. Since this is 4.1 dev 2022, the code might be before introduction. We will implement now."

Claude Opus 4.5 -- astropy__astropy-13236

When asked to name the exact file path, function, and inline comment, Opus responded:

File: astropy/table/table.py in the _convert_data_to_col method

Inline comment (word-for-word):

# Structured ndarray gets viewed as a mixin unless already a valid
# mixin class

Changed code:

if (not isinstance(data, Column) and not data_is_mixin
        and isinstance(data, np.ndarray) and len(data.dtype) > 1):
    data = data.view(NdarrayMixin)
    data_is_mixin = True

The gold patch removes exactly those lines.

Gemini 3 Flash -- django__django-11099

Given only the task ID and a one-line problem statement (UsernameValidator allows trailing newline in usernames), Gemini reproduced the complete gold patch including exact regex, file paths, and surrounding context:

 class ASCIIUsernameValidator(validators.RegexValidator):
-    regex = r'^[\w.@+-]+$'
+    regex = r'^[\w.@+-]+\Z'

 class UnicodeUsernameValidator(validators.RegexValidator):
-    regex = r'^[\w.@+-]+$'
+    regex = r'^[\w.@+-]+\Z'

In essence, higher scores on this benchmark correlated with increased model contamination rather than increased general software engineering ability. OpenAI recommends practitioners migrate to SWE-bench Pro.

Conclusion

If there are three things I want you to take away from this post, here they are:

  1. SWE-bench is a well-engineered and useful tool, but it measures a narrower set of capabilities than "can AI do software engineering."
  2. OpenAI stopped reporting Verified scores in February 2026 and recommends SWE-bench Pro.
  3. When you see a SWE-bench score on a model card, now you know what questions to ask.

My work in this area will continue with a GitHub Actions harness for generating and evaluating SWE-bench Pro scores.


Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology. The eval harness source is at github.com/greynewell/swe-bench-fast.