<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Grey Newell Blog</title>
    <subtitle>Thoughts on AI agent evaluation, benchmark methodology, and the tools I build along the way.</subtitle>
    <link href="https://greynewell.com/blog/feed.xml" rel="self" type="application/atom+xml"/>
    <link href="https://greynewell.com/blog/" rel="alternate" type="text/html"/>
    <id>https://greynewell.com/blog/</id>
    
    
    <updated>2026-03-06T00:00:00.000Z</updated>
    <author>
        <name>Grey Newell</name>
        <uri>https://greynewell.com</uri>
    </author>
    
    
    <entry>
        <title>5 Tips for AWS Certification Exams from AWS Solutions Architects</title>
        <link href="https://greynewell.com/blog/5-tips-aws-certification-exams-solutions-architects/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/5-tips-aws-certification-exams-solutions-architects/</id>
        <published>2023-02-20T00:00:00.000Z</published>
        <updated>2023-02-20T00:00:00.000Z</updated>
        <summary>We&#39;re both solutions architects at AWS, and between us, we hold 10 active AWS Certifications. Here are five tips AWS Solutions Architects swear by to prepare for and pass AWS Certification exams.</summary>
        <content type="html"><![CDATA[<p>Are you in the process of studying for your first <a href="https://aws.amazon.com/certification">AWS Certification</a>—or additional AWS Certification(s)? Regardless of where you are in your certification or preparation journey, we believe this blog can help you focus your efforts. We're both solutions architects at AWS, and between us, we hold 10 active AWS Certifications. Our tips have helped learners attain AWS Certifications, including the notoriously difficult AWS Certified Solutions Architect – Professional. AWS Certifications are available for any level of learner, whether in a technical role or not, to build cloud skills for a particular role or domain. If you're not sure where to start, use the <a href="https://d1.awsstatic.com/training-and-certification/docs/AWS_certification_paths.pdf">AWS Certification pathways guide</a> to choose!</p>
<p>In this blog we'll break down five tips AWS Solutions Architects swear by to prepare for and pass AWS Certification exams—and you can borrow these techniques! You'll learn how to use AWS Official Practice Question Sets, free digital courses, and other resources on <a href="https://explore.skillbuilder.aws/learn">AWS Skill Builder</a>, the official AWS online learning center, to accelerate your progress toward your learning objectives. You'll also learn how to use your time effectively during the test and maximize your comprehension of the questions and exam objectives, so you reap the full benefit of your certification after you earn it.</p>
<h2>Prepare with AWS Training and Certification resources</h2>
<p>AWS Certifications are industry-recognized credentials, and as such, the exams are thorough, testing your knowledge and expertise. The more you prepare and practice, the more confident you will be, both in passing the exam and in demonstrating your knowledge through practical application. Learn more on the <a href="https://aws.amazon.com/certification/certification-prep/">exam preparation page</a>. AWS does not require you to take AWS-provided training to prepare for the exams. However, there are recommended steps that can help you get started.</p>
<ul>
<li>Get to know the exam—review the exam guide available on each certification's <a href="https://aws.amazon.com/certification/certification-prep/">exam preparation page</a></li>
<li>Sign up for an <a href="https://explore.skillbuilder.aws/">AWS Skill Builder</a> account and get to know exam-style questions by taking AWS Certification Official Practice Question Sets</li>
<li>Learn about exam topics by:
<ul>
<li>Enrolling in courses on <a href="https://explore.skillbuilder.aws/learn">AWS Skill Builder</a> to fill gaps in your knowledge of the exam topics</li>
<li>Reviewing white papers and AWS service-related FAQs available on the <a href="https://aws.amazon.com/certification/certification-prep/">exam pages</a></li>
<li><a href="https://aws.amazon.com/training/digital/">Subscribing to AWS Skill Builder</a> to get hands-on and build in the AWS Console with AWS Builder Labs and AWS Cloud Quest</li>
</ul>
</li>
<li>Prepare for your exam by:
<ul>
<li>Taking an AWS Skill Builder exam prep course</li>
<li>Using your AWS Skill Builder subscription to gauge your preparedness with a full-length AWS Certification Official Practice Exam</li>
</ul>
</li>
</ul>
<p>In addition to the above, here are five tips AWS Solutions Architects swear by to prepare for an AWS Certification exam.</p>
<h2>1. Break it down</h2>
<p>If you're training for a marathon, do you start by running a marathon on your first day? No. So take the same approach here: break it down. Take the AWS Certification Official Practice Question Sets, which you can find for free in <a href="https://explore.skillbuilder.aws/learn">AWS Skill Builder</a>. Start by doing 10 questions at a time and build up from there.</p>
<p>When you take the AWS Certification Official Practice Question Sets, turn on the &quot;Review Answer&quot; option. This gives you immediate feedback on the answers so you don't have to wait for the end of your study session to find out how you are doing. By reviewing the incorrect and correct answers to each question, you'll be on your way to understanding the concepts more quickly.</p>
<p>Break up your study time into 30-minute to one-hour chunks and be sure to take a break after you finish each portion of the Official Practice Question Set. This pacing helps the study sessions feel (more) enjoyable. After a week of answering 10 to 20 questions at a time, take a full-length, scored AWS Certification Official Practice Exam. Aim to take at least one full-length practice exam before you take the official, proctored exam. This prepares you for what it takes to last through the entire exam.</p>
<h2>2. Use the process of elimination</h2>
<p>The process of elimination helps you weed out incorrect answers and identify the correct answer quickly. Some answer choices exist only to distract you and throw you off, so avoid wasting precious time on them. Scan the answers, eliminate the ones that are clearly wrong, and focus on the remaining valid choices.</p>
<h2>3. Learn key concepts from the Official Practice Question Set</h2>
<p>When working towards a challenging certification, avoid leaving points on the table. Start by focusing on valuable concepts. How do you know which concepts are valuable? Review the exam guide, which outlines all the exam domains and tasks that will be covered, as well as how each domain is weighted. Then use the Official Practice Question Sets, which cover all the domains. You won't likely see the questions from the Official Practice Question Sets <em>verbatim</em> on the actual test, but you will likely see the concepts they cover in some form. These concepts you can expect to see on the test are like free points. Take them!</p>
<h2>4. Build your practical knowledge</h2>
<p>Nothing beats practical experience when it comes to tackling an AWS Certification exam. While studying is essential to your preparation, building projects inside your AWS account builds expertise and proficiency. We (and our fellow AWS Solutions Architects) recommend a time distribution of 80% building to 20% studying. Everyone is a little different, so do what works for your unique learning style! Facts, figures, and concepts can be difficult to understand and retain by reading or watching videos alone. You will develop a deeper understanding when you put your new knowledge into practice. You can get started by enrolling in free digital trainings, and upgrade to an AWS Skill Builder subscription to unlock hands-on learning in a live AWS environment through AWS Builder Labs and AWS Cloud Quest.</p>
<h2>5. Work backwards</h2>
<p>Whether you're employed at Amazon or not, you may have heard of our <a href="https://www.amazon.jobs/content/en/our-workplace/leadership-principles">Leadership Principles</a>. These ideas, values, and axioms represent 25 years of experience and wisdom and can help you to pass your exam. When faced with any type of opportunity or issue, Amazon Leadership Principles help Amazonians decide how to move forward. The scenario-based questions presented in an AWS Certification exam are challenging. A test taker can leverage two of our leadership principles to discern the path forward in any given question: 1/ weed out answer choices that don't live up to Amazon's relentlessly high standard of excellence; and 2/ work backwards.</p>
<p>Do you know the saying, &quot;Save the best for last&quot;? While that isn't something test writers strive to do, we suggest you read each question starting at the end. Why? Each exam is a test of skill, endurance, and discernment. Each question includes several pieces of information, but only some are useful to homing in on the right answer. Start by reviewing the last line of the question. Armed with this information, read the beginning of the question and then each answer choice. You will quickly discern relevant information from extraneous information. See for yourself: re-read this post starting from the bottom.</p>
<h2>Conclusion</h2>
<p>Now you have a set of proven methods to approach exam day with confidence, so log into <a href="https://explore.skillbuilder.aws/learn">AWS Skill Builder</a> and start preparing. By putting these tips into practice, you'll be in an optimal position to retain and apply what you've learned. Good luck on your AWS Certification journey! It's all about learning and building experience that you'll use for the rest of your career.</p>
<p>For some bonus tips, check out the following blogs that share valuable pointers:</p>
<ul>
<li><a href="https://aws.amazon.com/blogs/training-and-certification/steps-to-start-your-aws-certification-journey/">Steps to start your AWS Certification journey</a></li>
<li><a href="https://aws.amazon.com/blogs/training-and-certification/slay-imposter-syndrome-while-prepping-for-aws-certification-exams/">Slay imposter syndrome while prepping for AWS Certification exams</a></li>
</ul>
<hr>
<p><em>Grey Newell and Joshua Kurz are Solutions Architects at Amazon Web Services.</em></p>
]]></content>
    </entry>
    
    <entry>
        <title>Zero to Hero: Your Guide to Career Growth Through AWS Certifications</title>
        <link href="https://greynewell.com/blog/zero-to-hero-aws-certifications-career-growth/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/zero-to-hero-aws-certifications-career-growth/</id>
        <published>2025-03-20T00:00:00.000Z</published>
        <updated>2025-03-20T00:00:00.000Z</updated>
        <summary>Learn practical strategies that helped me transform from a struggling new graduate to an AWS Solutions Architect, eventually earning the coveted golden jacket awarded to those who achieve all twelve AWS Certifications.</summary>
        <content type="html"><![CDATA[<p>For years, I lived a double life: engineering student by day, musician by night. I earned two degrees while playing more than 100 shows annually, convinced I could keep both dreams alive indefinitely. But in 2019, everything unraveled. Suddenly, those hard-earned degrees weren't enough to keep a roof over my head. I found myself on my dad's couch, scraping by with coding gigs. It was during one of these jobs that a client asked a question that would change everything: &quot;Are you AWS Certified?&quot; That simple inquiry became my lifeline.</p>
<p>Within a month, I had my first AWS Certification. Six years and many certifications later, I've climbed from struggling graduate to Senior Solutions Architect at AWS, complete with the golden jacket awarded to those who earn all AWS Certifications.</p>
<p>This is the story of how I found a path that united my technical skills and creative drive, and how you can, too.</p>
<h2>Your zero to hero roadmap</h2>
<p><img src="/img/grey-newell-aws-golden-jacket.png" alt="Grey Newell wearing the AWS golden jacket awarded to those who earn all twelve AWS Certifications"></p>
<p>Like me, you might be one <a href="https://aws.amazon.com/certification/">AWS Certification</a> away from changing your entire career path. Here's a roadmap to success:</p>
<p>First, choose your starting point based on your experience:</p>
<ul>
<li><strong>Beginners:</strong> Start with <a href="https://aws.amazon.com/certification/certified-cloud-practitioner/">AWS Certified Cloud Practitioner</a>.</li>
<li><strong>IT professionals:</strong> Begin with Associate level certifications.</li>
<li><strong>Cloud experts:</strong> Jump to Professional or Specialty certifications.</li>
</ul>
<p>Then, <a href="https://d1.awsstatic.com/training-and-certification/docs/AWS_certification_paths.pdf">use this journey map of role-based AWS Certification paths</a> to find the right one for you.</p>
<h2>5 key strategies that made the difference</h2>
<p>Looking back at my journey, these strategies had the most impact on my success:</p>
<h3>1. Strategic use of AWS Training resources</h3>
<p><a href="https://skillbuilder.aws/">AWS Skill Builder</a> became my home base. I took a targeted approach, selecting resources to match my learning style and curating a mix of foundational courses for conceptual skill building, labs for hands-on practice, and official practice exams for test preparation. I especially enjoyed the <em>Exam Prep Enhanced Courses</em> for <a href="https://explore.skillbuilder.aws/learn/courses/14954/exam-prep-enhanced-course-aws-certified-solutions-architect-professional-sap-c02-amazon">AWS Certified Solutions Architect – Professional</a> and <a href="https://explore.skillbuilder.aws/learn/courses/16520/exam-prep-enhanced-course-aws-certified-devops-engineer-professional-dop-c02-english-amazon">AWS Certified DevOps Engineer – Professional</a> because of the depth and breadth of material they cover.</p>
<p><strong>Tip: Avoid exam day surprises.</strong> Practice with sample questions and time constraints. Understanding the exam structure is just as important as knowing the content.</p>
<h3>2. From certification knowledge to practical skills</h3>
<p>For each concept, I created a mini project: an <a href="https://aws.amazon.com/s3/">Amazon S3</a> bucket for <a href="https://aws.amazon.com/certification/certified-cloud-practitioner/">AWS Certified Cloud Practitioner</a>, a three-tier web app for <a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/">AWS Certified Solutions Architect – Associate</a>, and CI/CD pipelines for <a href="https://aws.amazon.com/certification/certified-devops-engineer-professional/">AWS Certified DevOps Engineer – Professional</a>. These practical exercises cemented my understanding and provided compelling examples for interviews and client discussions.</p>
<p><strong>Tip: Avoid certification collecting.</strong> Don't just chase certificates. Focus on applying what you learn through hands-on projects. This builds deep understanding and professional credibility.</p>
<h3>3. The 30-day sprint method</h3>
<p>I prepared for each exam using a structured 30-day plan. Each day started with 2–3 hours of learning new material through online courses, documentation, and hands-on labs. I then practiced these concepts in evening study sessions through exercises and coding projects.</p>
<p>I used <a href="https://www.bcu.ac.uk/exams-and-revision/best-ways-to-revise/spaced-repetition">the 2357 method</a>, a spaced repetition technique, to structure my exam preparation. Working backwards from the exam date, I scheduled strategic review sessions at 2, 3, 5, and 7 days before the test. At each checkpoint, I took a practice exam to measure my progress and identify knowledge gaps. For example, if I scored low on networking concepts, I'd dedicate more time to that topic in my daily studies. By combining systematic learning with strategic knowledge checks, I maintained steady progress while ensuring I didn't miss critical topics.</p>
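<p>As a small illustration (the function name, offsets, and dates below are mine, not part of any AWS tooling), the backwards scheduling of the 2357 method can be sketched in a few lines of Python:</p>

```python
from datetime import date, timedelta

def review_checkpoints(exam_date, offsets=(2, 3, 5, 7)):
    """Work backwards from the exam date: one review session at each
    offset (days before the exam), returned in chronological order."""
    return sorted(exam_date - timedelta(days=d) for d in offsets)

# An exam on 2025-03-20 yields review sessions on March 13, 15, 17, and 18.
checkpoints = review_checkpoints(date(2025, 3, 20))
```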
<p><strong>Tip: Avoid perfectionism.</strong> Don't wait until you feel 100% ready to start your certification journey—you might never feel ready. Schedule your exam and use that as motivation. Each attempt teaches valuable lessons.</p>
<h3>4. Finding work-study balance</h3>
<p>I learned to leverage small pockets of time throughout the day—like canceled meetings and lunch breaks—to make every minute count. Being selective about commitments was crucial. I declined nonessential work and clearly communicated my priorities.</p>
<p>Regular breaks prevented burnout and kept me refreshed for focused study sessions. Based on my experience, plan for 120–160 hours of study per certification. Break this down into manageable chunks using the study strategies shared in this post.</p>
<p><strong>Tip: Avoid overwhelm.</strong> Instead of trying to master every AWS service at once, focus on core patterns and principles. Understanding fundamental concepts helps you learn new services more quickly.</p>
<h3>5. Building your cloud community</h3>
<p>The certification journey doesn't need to be tackled alone. I reached out to peers on social media to ask questions about their experiences studying for AWS Certifications and found that most responded positively, even directing me to helpful resources to support my journey. I shared certification milestones on social media and tagged helpful content creators, which led to lasting professional relationships that continue to benefit my career today.</p>
<p><strong>Tip: Avoid only studying alone.</strong> Engage with the AWS community. Share experiences, ask questions, and practice with others. Different perspectives and collaborative learning accelerate your growth. And remember: every expert was once a beginner. The path from zero to hero is about consistency, strategy, and practical application. Your journey starts now.</p>
<h3>Essential resources</h3>
<ul>
<li><a href="https://d1.awsstatic.com/training-and-certification/docs/AWS_certification_paths.pdf">AWS Certification Journey Map</a></li>
<li><a href="https://skillbuilder.aws/">AWS Skill Builder</a></li>
<li><a href="https://aws.amazon.com/training/">AWS Training and Certification</a></li>
<li><a href="https://aws.amazon.com/training/ramp-up-guides/">AWS Ramp-Up Guides</a></li>
</ul>
<h3>Related posts</h3>
<ul>
<li><a href="https://aws.amazon.com/blogs/training-and-certification/5-tips-for-aws-certification-exams-from-aws-solutions-architects/">5 tips for AWS Certification exams from AWS Solutions Architects</a></li>
<li><a href="https://aws.amazon.com/blogs/training-and-certification/enhance-your-real-world-skills-with-aws-cloud-quest-and-aws-jam/">Enhance your real-world skills with AWS Cloud Quest and AWS Jam</a></li>
</ul>
<h2>Let's connect</h2>
<p>The journey to earning all AWS Certifications isn't just about passing exams—it's about building a foundation for continuous growth in cloud computing. When I started this journey from my dad's couch, I couldn't imagine where this path would lead. Whether you're at the beginning of your AWS Certification journey or somewhere along the path, I'd love to support you and be a part of your cloud community. Feel free to reach out to me on <a href="https://www.linkedin.com/in/greynewell/">LinkedIn</a>, <a href="https://www.x.com/greynewell">X</a>, or <a href="https://github.com/greynewell">GitHub</a>.</p>
<hr>
<p><em>Grey Newell is a Senior Solutions Architect at Amazon Web Services and holder of all twelve AWS Certifications.</em></p>
]]></content>
    </entry>
    
    <entry>
        <title>Implement Event-Driven Invoice Processing for Resilient Financial Monitoring at Scale</title>
        <link href="https://greynewell.com/blog/event-driven-invoice-processing-resilient-financial-monitoring/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/event-driven-invoice-processing-resilient-financial-monitoring/</id>
        <published>2025-05-12T00:00:00.000Z</published>
        <updated>2025-05-12T00:00:00.000Z</updated>
        <summary>How to build a Business Event Monitoring System (BEMS) on AWS that handles over 86 million daily events with near real-time visibility, cross-Region controls, and automated alerts for stuck events.</summary>
        <content type="html"><![CDATA[<p>Processing high volumes of invoices efficiently while maintaining low latency, high availability, and business visibility is a challenge for many organizations. A customer recently consulted us on how they could implement a monitoring system to help them process and visualize large volumes of invoice status events.</p>
<p>This post demonstrates how to build a Business Event Monitoring System (BEMS) on AWS that handles over 86 million daily events with near real-time visibility, cross-Region controls, and automated alerts for stuck events. You might deploy this system to gain business-level insights into how events flow through your organization or to visualize the flow of transactions in real time. Downstream services can also choose whether to process and respond to events originating within the system.</p>
<h2>Business challenge</h2>
<p>For our use case, a global enterprise wants to deploy a monitoring system for their invoice event pipeline. The pipeline processes millions of events, with volume projected to surge 40% within 18 months. Each invoice must navigate a four-stage journey, and every event must be visible within 2 minutes. End-of-month invoice surges reach 60,000 events per minute, or up to 86 million per day. With payment terms spanning from standard 30-day windows to year-long arrangements, the architecture demands zero tolerance for missing events. Finance executives require near real-time visibility through dashboards, and auditors demand comprehensive historical retrieval.</p>
<h2>Solution overview</h2>
<p>The architecture implements a serverless event-driven system broken into independently deployable Regional cells, as illustrated in the following diagram.</p>
<p><img src="/img/invoice-processing-architecture-overview.png" alt="Architecture overview showing the serverless event-driven system with independently deployable Regional cells"></p>
<p>The solution uses the following key services:</p>
<ul>
<li><strong><a href="https://aws.amazon.com/api-gateway">Amazon API Gateway</a></strong> – Clients send events into the solution using HTTPS calls to a REST API. API Gateway was selected for its REST support, its event-based integrations with other AWS services, and its throttling, which prevents individual callers from overloading the system.</li>
<li><strong><a href="https://aws.amazon.com/eventbridge/">Amazon EventBridge</a></strong> – Events created by API Gateway need to be routed to downstream consumers and archived so they can be replayed later. EventBridge provides a custom event bus with rules that intelligently route events based on their contents.</li>
<li><strong><a href="http://aws.amazon.com/sns">Amazon Simple Notification Service (Amazon SNS)</a></strong> – To keep EventBridge rules simple, events are routed by type to one or more destinations for fanout. SNS topics are used as routing targets to activate fanout to a variety of downstream consumers with optional subscription filters to control which events are received by consumers.</li>
<li><strong><a href="https://aws.amazon.com/sqs/">Amazon Simple Queue Service (Amazon SQS)</a></strong> – Each SNS topic fans out by sending a copy of each message to each consumer subscribed to the topic. Consumers receive messages through Amazon SQS, which decouples event processing compute and provides dead-letter queues (DLQs) for storing messages that fail to process. EventBridge custom event buses and SNS FIFO (First-In-First-Out) topics can also use DLQs powered by Amazon SQS.</li>
<li><strong><a href="http://aws.amazon.com/lambda">AWS Lambda</a></strong> – Lambda is well suited to short-lived processing tasks, spinning up when needed and disappearing afterward without incurring idle resource costs. The integration between Lambda and Amazon SQS delivers an economical processing system that automatically scales with demand, allowing developers to focus on business logic rather than infrastructure orchestration, while the pay-per-execution model keeps costs proportional to use.</li>
<li><strong><a href="https://aws.amazon.com/timestream/">Amazon Timestream</a></strong> – Timestream offers a purpose-built architecture that addresses the unique challenges of time series data, auto scaling to ingest millions of events while maintaining fast query performance for responsive dashboard visualizations. Its intelligent tiered storage system automatically transitions data between memory and cost-effective long-term storage without sacrificing analytics capabilities, enabling organizations to maintain both real-time operational visibility and historical trending insights through a single, unified platform that integrates with QuickSight.</li>
<li><strong><a href="https://aws.amazon.com/quicksight">Amazon QuickSight</a></strong> – QuickSight transforms event streams into visual narratives through its intuitive interface, empowering business users to discover actionable insights without specialized data science expertise. Its serverless architecture scales to accommodate millions of users while offering machine learning (ML)-powered anomaly detection and forecasting capabilities, all within a pay-per-session pricing model that activates sophisticated analytics that would otherwise require significant resources. QuickSight dashboards can either directly query from a Timestream table or cache records in-memory with SPICE periodically.</li>
</ul>
<p>Events flow through the layers of this architecture in four stages:</p>
<ul>
<li><strong>Event producers</strong> – API Gateway for receiving client events through a REST API</li>
<li><strong>Event routing</strong> – EventBridge routes events to SNS topics for fanout</li>
<li><strong>Event consumers</strong> – SQS queues with Lambda or Fargate consumers</li>
<li><strong>Business intelligence</strong> – Timestream and QuickSight for dashboards</li>
</ul>
<h2>Design tenets</h2>
<p>The solution adheres to three key architectural principles:</p>
<ul>
<li><strong>Cellular architecture</strong> – In a <a href="https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html">cellular architecture</a>, your workload scales through independent deployment units like the one depicted in the previous section. Each unit operates as a self-contained cell, and more cells can be deployed to different AWS Regions or AWS accounts to further increase throughput. Cellular design activates independent scaling of resources based on local load and limits the scope of impact of failures.</li>
<li><strong>Serverless architecture</strong> – In a serverless architecture, operational overhead of scaling is minimized by using managed services. We use Lambda for compute-intensive tasks like fanning out messages to thousands of micro-consumers or employing container-based services (<a href="https://aws.amazon.com/fargate">AWS Fargate</a>) for longer-running processes.</li>
<li><strong>Highly available design</strong> – We maintain the availability of our overall financial system through Multi-AZ resilience at every layer. Automatic failover and disaster recovery procedures can be implemented without altering the architecture. We also use replication, archival, and backup strategies to prevent data loss in the event of cell failure.</li>
</ul>
<h2>Scaling constraints</h2>
<p>Our solution will experience the following scaling bottlenecks with quotas sampled from the <code>us-east-1</code> Region:</p>
<ul>
<li><a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html">API Gateway quota</a>: Throttling at 10,000 requests per second (RPS); can be increased</li>
<li><a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-quota.html">EventBridge service quotas</a>:
<ul>
<li><code>PutEvents</code> throttle limit at 10,000 transactions per second (TPS); can be increased</li>
<li>Invocations throttle limit at 18,750 TPS; can be increased</li>
</ul>
</li>
<li><a href="https://docs.aws.amazon.com/general/latest/gr/sns.html">Amazon SNS service quotas</a>: Publish API throttling at 30,000 messages per second (MPS); can be increased</li>
<li><a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-queues.html">Amazon SQS service quotas</a>: Messages per queue (in flight) throttled at 120,000; can be increased</li>
<li><a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html">Lambda service quotas</a>: 1,000 concurrent executions or up to 10,000 RPS; can be increased</li>
</ul>
<p>We can safely scale a single account to 10,000 requests per second (600,000 per minute, 864 million per day) without increasing service quotas in the <code>us-east-1</code> Region. Default quotas will vary per Region and the values can be increased by raising a support ticket. The architecture scales even further by deploying independent cells into multiple Regions or AWS accounts.</p>
<p>Scaling of QuickSight and Timestream depends on the computational complexity of analysis, the window of time being analyzed, and the number of users concurrently analyzing the data, which was not a scaling bottleneck in our use case.</p>
<h2>Prerequisites</h2>
<p>Before implementing this solution, make sure you have the following:</p>
<ul>
<li>An AWS account with administrator access</li>
<li>The <a href="http://aws.amazon.com/cli">AWS Command Line Interface</a> (AWS CLI) version 2.0 or later installed and configured</li>
<li>Appropriate AWS service quotas confirmed for high-volume processing</li>
</ul>
<p>In the following sections, we walk through the steps for our implementation strategy.</p>
<h2>Decide on partitioning strategies</h2>
<p>First, you must decide how your solution will partition requests between cells. In our use case, dividing cells by Region allows us to offer low-latency local processing for events while keeping each cell fully independent of the others.</p>
<p>Inside each cell, traffic flow is roughly evenly divided between the four stages of invoice processing. Our solution breaks each cell into four logical partitions, or flows, by invoice status (authorization, reconciliation, and so on). Partitioning offers the ability to fan out and scale resources independently based on traffic patterns specific to each partition.</p>
<p>To partition your cellular architecture, consider the volume, distribution, and access pattern of the events that will flow through each cell. You must allow independent scaling within your cells without encountering global service limits. Choose a strategy that allows each cell to be broken into 1–99 roughly equivalent partitions based on predictable attributes.</p>
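<p>As an illustration (the attribute names and partition counts here are assumptions, not part of the customer's design), a stable hash of a predictable attribute such as the invoice ID can spread events evenly across a flow's sub-partitions:</p>

```python
import hashlib

def assign_partition(invoice_status: str, invoice_id: str,
                     partitions_per_status: int = 4) -> str:
    """Route an event to a cell-local partition.

    The invoice status selects the logical flow (authorization,
    reconciliation, and so on); a stable hash of the invoice ID spreads
    load evenly across that flow's sub-partitions.
    """
    digest = hashlib.sha256(invoice_id.encode("utf-8")).hexdigest()
    sub_partition = int(digest, 16) % partitions_per_status
    return f"{invoice_status}-{sub_partition}"

partition = assign_partition("reconciliation", "INV-0001")
```

Because the hash is deterministic, every event for a given invoice lands in the same partition, which keeps per-invoice ordering concerns local to one consumer.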
<h2>Implement the event routing layer</h2>
<p>The event routing layer combines EventBridge for intelligent routing with Amazon SNS for efficient fanout.</p>
<h3>EventBridge custom event bus configuration</h3>
<p>Create a custom event bus with rules to route events based on your partitioning strategy:</p>
<ul>
<li>Use content-based filtering to direct events to appropriate SNS topics</li>
<li>Implement an archive to replay events from history if processing fails</li>
</ul>
<p>Define a standard event schema for common metadata, including:</p>
<ul>
<li>Invoice ID, amount, currency, status, timestamp</li>
<li>Vendor information and payment terms</li>
<li>Processing metadata (Region, account ID, and so on)</li>
</ul>
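<p>As a sketch of the routing configuration, the event pattern below shows how a rule on the custom bus might match authorization-stage events against this schema. The bus name, field names, and values are illustrative assumptions, not the exact production configuration:</p>

```python
import json

# Hypothetical event pattern for an EventBridge rule that routes
# authorization-stage invoice events to the matching SNS topic.
authorization_rule_pattern = {
    "source": ["invoice.processing"],
    "detail": {
        "status": ["authorization"],
        "currency": ["USD", "EUR", "GBP"],
    },
}

# A sample event carrying the standard metadata listed above.
sample_event = {
    "source": "invoice.processing",
    "detail-type": "InvoiceStatusChange",
    "detail": {
        "invoiceId": "inv-0042",
        "amount": 1250.00,
        "currency": "USD",
        "status": "authorization",
        "timestamp": "2024-01-15T09:30:00Z",
        "vendor": {"name": "Acme Corp", "paymentTerms": "NET30"},
        "region": "us-east-1",
        "accountId": "123456789012",
    },
}

print(json.dumps(authorization_rule_pattern, indent=2))
```

<p>Attaching a pattern like this to a rule whose target is the corresponding SNS topic completes the content-based routing described above.</p>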
<h3>SNS topic structure</h3>
<p>Create SNS topics for each logical partition:</p>
<ul>
<li><code>invoice-ingestion</code></li>
<li><code>invoice-reconciliation</code></li>
<li><code>invoice-authorization</code></li>
<li><code>invoice-posting</code></li>
</ul>
<p>Implement message filtering at the subscription level for granular control over which messages each subscribing consumer sees. Each topic can fan out to many downstream consumers waiting for events that match the EventBridge custom event bus rules. Delivery failures are retried automatically up to a configurable limit.</p>
<p><img src="/img/invoice-processing-event-routing.png" alt="Event routing layer diagram showing EventBridge routing to SNS topics for fanout across partitions"></p>
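<p>To make the subscription-level filtering concrete, here is a sketch of a filter policy along with a minimal local approximation of how SNS evaluates it against message attributes. The attribute names and thresholds are assumptions for illustration:</p>

```python
# Hypothetical subscription filter policy: this consumer only sees
# high-value USD invoices on its topic.
filter_policy = {
    "currency": ["USD"],
    "amount": [{"numeric": [">=", 10000]}],
}

def matches(policy: dict, attributes: dict) -> bool:
    """Minimal local approximation of SNS filter-policy matching,
    covering exact string values and '>=' numeric rules only."""
    for key, allowed in policy.items():
        value = attributes.get(key)
        ok = False
        for rule in allowed:
            if isinstance(rule, dict) and "numeric" in rule:
                op, bound = rule["numeric"][0], rule["numeric"][1]
                if op == ">=" and isinstance(value, (int, float)) and value >= bound:
                    ok = True
            elif rule == value:
                ok = True
        if not ok:
            return False
    return True

print(matches(filter_policy, {"currency": "USD", "amount": 25000}))  # True
print(matches(filter_policy, {"currency": "EUR", "amount": 25000}))  # False
```

<p>With a policy like this on each subscription, a single topic can serve multiple consumers with different views of the same event stream.</p>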
<h2>Implement event producers</h2>
<p>Configure API Gateway to receive events from existing systems with built-in throttling and error handling.</p>
<h3>API design</h3>
<p>Create a RESTful API with resources and a path for each logical partition inside your cell:</p>
<ul>
<li><code>/invoices/ingestion</code> (POST)</li>
<li><code>/invoices/reconciliation</code> (POST)</li>
<li><code>/invoices/authorization</code> (POST)</li>
<li><code>/invoices/posting</code> (POST)</li>
</ul>
<p>Implement request validation using a JSON schema for each endpoint. Use API Gateway request transformations to standardize incoming data and provide well-formatted error messages and response codes to clients in the event of failures.</p>
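<p>A request-validation model for one endpoint might look like the following sketch. API Gateway accepts JSON Schema draft-04 models; the field names here mirror the event metadata described earlier, and the tiny required-fields check is an illustrative stand-in for the full validator:</p>

```python
# Hypothetical JSON schema for the POST /invoices/authorization endpoint.
invoice_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["invoiceId", "amount", "currency", "status"],
    "properties": {
        "invoiceId": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "status": {"type": "string"},
    },
}

def missing_required(schema: dict, payload: dict) -> list:
    """Tiny local check of the 'required' clause (not a full validator)."""
    return [f for f in schema["required"] if f not in payload]

print(missing_required(invoice_schema, {"invoiceId": "inv-1", "amount": 10}))
```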
<h3>Security and throttling</h3>
<p>Implement API keys and usage plans for client authentication and rate limiting to prevent a talkative upstream from bringing down the system. Configure <a href="https://aws.amazon.com/waf/">AWS WAF</a> rules to protect against common attacks against API endpoints. Set up throttling to handle burst traffic (60,000 events/minute) at the account level and the method level.</p>
<h3>Monitoring and logging</h3>
<p>Our partitioned event producer strategy allows your solution to independently monitor each event type by:</p>
<ul>
<li>Enabling <a href="http://aws.amazon.com/cloudwatch">Amazon CloudWatch Logs</a> for API Gateway with log retention policies</li>
<li>Setting up <a href="https://aws.amazon.com/xray/">AWS X-Ray</a> tracing for end-to-end request analysis</li>
<li>Implementing custom metrics for monitoring API performance and usage patterns</li>
</ul>
<p><img src="/img/invoice-processing-event-producers.png" alt="Event producers diagram showing API Gateway configuration with throttling and monitoring"></p>
<h2>Implement event consumers</h2>
<p>Implement durable processing using SQS queues with DLQs attached and serverless Lambda consumers.</p>
<h3>SQS queue structure</h3>
<p>Create SQS queues in front of each consumer to decouple message delivery from processing; in our case, that means one queue per partition:</p>
<ul>
<li><code>invoice-ingestion.fifo</code></li>
<li><code>invoice-reconciliation.fifo</code></li>
<li><code>invoice-authorization.fifo</code></li>
<li><code>invoice-posting.fifo</code></li>
</ul>
<p>Set up DLQs for each main queue:</p>
<ul>
<li>Configure maximum receives before moving to the DLQ</li>
<li>Implement alerting for stuck messages in the DLQ</li>
</ul>
<p><img src="/img/invoice-processing-event-consumers.png" alt="Event consumers diagram showing SQS FIFO queues with dead-letter queues for each invoice processing partition"></p>
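<p>As a sketch, the queue attributes below wire a FIFO queue to its DLQ through a redrive policy. The ARN, visibility timeout, and receive count are placeholder assumptions to tune for your workload:</p>

```python
import json

# Hypothetical attributes for the invoice-authorization FIFO queue.
# After maxReceiveCount failed receives, SQS moves the message to the DLQ.
queue_attributes = {
    "FifoQueue": "true",
    # Longer than the consumer's timeout so in-flight messages aren't redelivered.
    "VisibilityTimeout": "120",
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:invoice-authorization-dlq.fifo",
        "maxReceiveCount": 3,
    }),
}

print(queue_attributes["RedrivePolicy"])
```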
<h3>Lambda consumers</h3>
<p>Attach Lambda functions to each queue for custom processing of events:</p>
<ul>
<li><code>InvoiceIngestionProcessor</code></li>
<li><code>InvoiceReconciliationProcessor</code></li>
<li><code>InvoiceAuthorizationProcessor</code></li>
<li><code>InvoicePostingProcessor</code></li>
</ul>
<p>Functions handle necessary transformations, call downstream services, and load events into Timestream. Double-check concurrency limits and provisioned concurrency to cover peak and sustained load, respectively.</p>
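<p>A minimal consumer sketch, assuming the SQS event source mapping has partial batch responses (<code>ReportBatchItemFailures</code>) enabled so only failed messages are retried; <code>process_invoice</code> is a hypothetical stand-in for the real transformation, downstream calls, and Timestream write:</p>

```python
import json

def handler(event, context):
    """SQS-triggered consumer reporting partial batch failures, so only
    failed messages are retried (and eventually land in the DLQ)."""
    failures = []
    for record in event["Records"]:
        try:
            invoice = json.loads(record["body"])
            process_invoice(invoice)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

def process_invoice(invoice):
    # Placeholder for the real transformation and downstream calls.
    if "invoiceId" not in invoice:
        raise ValueError("malformed invoice event")

# Local smoke test with one good and one malformed message.
event = {"Records": [
    {"messageId": "m1", "body": json.dumps({"invoiceId": "inv-1"})},
    {"messageId": "m2", "body": json.dumps({"amount": 5})},
]}
print(handler(event, None))  # → {'batchItemFailures': [{'itemIdentifier': 'm2'}]}
```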
<h3>Error handling and retry logic</h3>
<p>Develop a custom retry mechanism for business logic failures and exponential backoff for transient errors. Create an operations dashboard with alerts and metrics for monitoring and redriving stuck events.</p>
<h2>Build the business intelligence dashboard</h2>
<p>Use Timestream and QuickSight to create real-time financial event dashboards.</p>
<h3>Timestream data model</h3>
<p>When modeling real-time invoice events in Timestream, using multi-measure records provides optimal efficiency by designating invoice ID as a dimension while storing processing timestamps, amounts, and status as measures within single records. This approach creates a cohesive time series view of each invoice's lifecycle while minimizing data fragmentation.</p>
<p>Multi-measure modeling is preferable because it significantly reduces storage requirements and query complexity, enabling more efficient time-based analytics. The resulting performance improvements are particularly valuable for dashboards that need to visualize invoice processing metrics in real time, because they can retrieve complete invoice histories with fewer operations and lower latency, ultimately delivering a more responsive monitoring solution.</p>
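<p>A multi-measure record for one lifecycle event might look like the following sketch, with invoice ID as a dimension and the amount and status stored together as measures in a single record. The names and values are illustrative assumptions:</p>

```python
# Hypothetical multi-measure record in the shape the Timestream write API
# expects: one dimension, several measures, one row per lifecycle event.
record = {
    "Dimensions": [{"Name": "invoice_id", "Value": "inv-0042"}],
    "MeasureName": "invoice_lifecycle",
    "MeasureValueType": "MULTI",
    "MeasureValues": [
        {"Name": "amount", "Value": "1250.00", "Type": "DOUBLE"},
        {"Name": "status", "Value": "authorization", "Type": "VARCHAR"},
    ],
    "Time": "1705311000000",  # event timestamp in milliseconds
}

print(record["MeasureValueType"])
```

<p>Storing all measures for an event in one record is what keeps each invoice's history retrievable with few operations, as described above.</p>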
<h3>Real-time data ingestion</h3>
<p>Create a Lambda function to push metrics to Timestream:</p>
<ul>
<li>Trigger on every status change in the invoice lifecycle</li>
<li>Batch writes for improved performance during high-volume periods</li>
</ul>
<h3>QuickSight dashboard design</h3>
<p>Develop interactive QuickSight dashboards for different user personas:</p>
<ul>
<li><strong>Executive overview</strong> – High-level KPIs and trends</li>
<li><strong>Operations dashboard</strong> – Detailed processing metrics and bottlenecks</li>
<li><strong>Finance dashboard</strong> – Cash flow projections and payment analytics</li>
</ul>
<p><img src="/img/invoice-processing-quicksight-dashboard.png" alt="QuickSight dashboard showing real-time financial event monitoring with executive, operations, and finance views"></p>
<p>Don't forget to implement ML-powered anomaly detection for identifying unusual patterns in your events.</p>
<h2>Monitoring and alerting</h2>
<p>Set up CloudWatch alarms for key metrics:</p>
<ul>
<li>Processing latency exceeding service-level agreements (SLAs)</li>
<li>Error rates above expected percentage for any processing stage</li>
<li>Queue depth exceeding predefined thresholds</li>
</ul>
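<p>As one illustrative sketch, the parameters below define a queue-depth alarm of the kind listed above; the queue name, thresholds, and topic ARN are placeholder assumptions:</p>

```python
# Hypothetical parameters for a CloudWatch queue-depth alarm that notifies
# the operations SNS topic when backlog stays high for five minutes.
queue_depth_alarm = {
    "AlarmName": "invoice-authorization-queue-depth",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": [{"Name": "QueueName", "Value": "invoice-authorization.fifo"}],
    "Statistic": "Maximum",
    "Period": 60,            # evaluate every minute
    "EvaluationPeriods": 5,  # sustained for five minutes before alarming
    "Threshold": 10000,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-critical-alerts"],
}

print(queue_depth_alarm["AlarmName"])
```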
<p>Configure SNS topics for alerting finance teams and operations:</p>
<ul>
<li>Use different topics for varying alert severities</li>
<li>Implement automated escalation for critical issues</li>
</ul>
<p>Develop custom CloudWatch dashboards for system-wide monitoring:</p>
<ul>
<li>End-to-end processing visibility</li>
<li>Regional performance comparisons</li>
</ul>
<h2>Security</h2>
<p>Grant permissions following the principle of least privilege for each required service in the architecture:</p>
<ul>
<li>Create separate execution roles for each Lambda function</li>
<li>Implement role assumption for cross-account operations</li>
</ul>
<p>Encrypt data at rest and in transit:</p>
<ul>
<li>Use <a href="http://aws.amazon.com/kms">AWS Key Management Service</a> (AWS KMS) for managing encryption keys</li>
<li>Implement field-level encryption for sensitive data</li>
</ul>
<p>Set up <a href="https://aws.amazon.com/config/">AWS Config</a> rules to maintain compliance with internal policies:</p>
<ul>
<li>Monitor for unapproved resource configurations</li>
<li>Automate remediation for common violations</li>
</ul>
<p>Use <a href="http://aws.amazon.com/cloudtrail">AWS CloudTrail</a> for comprehensive auditing:</p>
<ul>
<li>Enable organization-wide trails</li>
<li>Implement log analysis for detecting suspicious activities</li>
</ul>
<h2>Conclusion</h2>
<p>The serverless event-driven architecture presented in this post enables processing of over 86 million daily invoices while maintaining near real-time visibility, strict compliance with internal policies, cellular scaling capabilities, and minimal operational overhead. This solution provides a robust foundation for modernizing financial operations, enabling organizations to handle the complexities of high-volume invoice processing with confidence and agility.</p>
<p>For further enhancements, consider exploring:</p>
<ul>
<li>Machine learning for predictive analytics on event patterns</li>
<li>Implementing <a href="https://aws.amazon.com/step-functions/">AWS Step Functions</a> for complex, multi-stage workflows</li>
<li>Integrating with <a href="https://aws.amazon.com/lake-formation/">AWS Lake Formation</a> for centralized data governance and analytics</li>
</ul>
<hr>
<p><em>Grey Newell holds an M.S.E. in Distributed Systems and worked as a Senior Solutions Architect at Amazon Web Services.</em></p>
]]></content>
    </entry>
    
    <entry>
        <title>The Architecture of Supermodel&#39;s Code Graph API</title>
        <link href="https://greynewell.com/blog/supermodel-code-graph-api-architecture/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/supermodel-code-graph-api-architecture/</id>
        <published>2026-02-25T00:00:00.000Z</published>
        <updated>2026-02-25T00:00:00.000Z</updated>
        <summary>A look inside Supermodel&#39;s real-time code analysis API: the five-stage processing pipeline, multi-language abstraction via a unified node schema, incremental graph updates, and the sub-100ms response time requirement that shaped every design decision.</summary>
        <content type="html"><![CDATA[<p>Supermodel's engineering team built a real-time code analysis API designed to handle millions of lines across multiple programming languages. The core requirement was speed—the system needed to respond quickly enough that AI agents could query it during conversations without noticeable delays.</p>
<h2>The Processing Pipeline</h2>
<p>The system operates through five sequential stages:</p>
<ol>
<li><strong>File ingestion</strong> — Monitoring and processing only new or modified files</li>
<li><strong>Language-specific parsing</strong> — AST parsers extract structural elements from supported languages</li>
<li><strong>Graph construction</strong> — Parsed elements become nodes and edges in a directed graph</li>
<li><strong>Storage and indexing</strong> — Graph storage enables fast traversal queries</li>
<li><strong>API serving</strong> — RESTful endpoints deliver sub-100ms response times</li>
</ol>
<h2>Technical Approach</h2>
<p><strong>Multi-Language Abstraction:</strong> Rather than building separate systems per language, the team created a unified node schema capturing essential code properties—name, kind (function, class, module), location, and relationships—regardless of syntax differences. This lets the rest of the pipeline treat all languages identically once parsing is complete.</p>
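<p>A minimal sketch of what such a unified node might look like, assuming the properties described here; the actual field names in Supermodel's schema may differ:</p>

```python
from dataclasses import dataclass, field

@dataclass
class CodeNode:
    """Language-agnostic node in the code graph: the same shape
    represents a Python function or a TypeScript class alike."""
    name: str
    kind: str                 # e.g. "function", "class", "module"
    file: str
    line: int
    relationships: list = field(default_factory=list)  # edges like ("calls", target)

node = CodeNode(name="processPayment", kind="function",
                file="src/PaymentService.ts", line=42,
                relationships=[("calls", "validateCard")])
print(node.kind)
```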
<p><strong>Incremental Updates:</strong> When files change, the system invalidates only affected nodes, re-parses modified files, and merges updates back into the graph while preserving cross-file relationships. This keeps the graph current without the latency of full rebuilds.</p>
<h2>Future Direction</h2>
<p>The roadmap includes semantic analysis extending beyond structural relationships to understand data flow, shared invariants, and code patterns between elements—moving from <em>where</em> code lives to <em>what</em> it does.</p>
]]></content>
    </entry>
    
    <entry>
        <title>Building Uncompact: Lessons from Production</title>
        <link href="https://greynewell.com/blog/building-uncompact-lessons-from-production/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/building-uncompact-lessons-from-production/</id>
        <published>2026-02-28T00:00:00.000Z</published>
        <updated>2026-02-28T00:00:00.000Z</updated>
        <summary>How Supermodel built Uncompact—a tool that maintains a persistent code graph across Claude Code&#39;s context compaction events—and the key lessons learned shipping it to production: simplicity over detail, invisibility enables adoption, and layered verification over blind trust.</summary>
        <content type="html"><![CDATA[<p>The fundamental issue isn't compaction itself—it's necessary given finite context windows. Rather, agents lacked mechanisms to store structural understanding outside conversations. Unlike human developers who leverage IDEs and documentation, AI agents had no external reference system. Uncompact was built to fill that gap.</p>
<h2>The Solution</h2>
<p>Uncompact maintains a persistent code graph that survives compaction events. When agents need codebase structure, they query this graph rather than re-reading files. The critical design principle: the graph must remain current. Stale information undermines agent confidence, so incremental updates trigger on every file save rather than complete rebuilds.</p>
<h2>Installation</h2>
<p>Setup requires running <code>npm install -g uncompact --foreground-scripts</code> followed by <code>uncompact auth login</code> with a Supermodel API key. The tool auto-registers as a Claude Code hook during initialization, requiring no additional configuration.</p>
<h2>Technical Architecture</h2>
<p>Instead of rebuilding entire graphs on changes, Uncompact processes only modified files and their immediate graph neighbors. Editing <code>PaymentService.ts</code> triggers re-analysis of that file and connected dependencies—the remaining graph stays unchanged. This approach mirrors incremental compilation principles.</p>
<h2>User Experience Impact</h2>
<p>Post-compaction, agents can query the graph for structural information (&quot;What calls <code>processPayment</code>?&quot;) rather than searching retained context. The graph provides accurate, current answers independent of compaction frequency, enabling seamless context recovery.</p>
<h2>Key Lessons</h2>
<p><strong>Simplicity matters.</strong> Early versions captured excessive detail. Effective versions focus on crucial relationships—the graph should answer structural questions, not replicate the source.</p>
<p><strong>Invisibility enables adoption.</strong> Background processes requiring no maintenance drive usage. If developers have to think about the tool, they'll stop using it.</p>
<p><strong>Layered verification works.</strong> Graphs indicate <em>where</em> to look; agents still examine actual code for specifics. The graph is a map, not a replacement for reading the territory.</p>
]]></content>
    </entry>
    
    <entry>
        <title>Why Code Graphs Matter for AI Agents</title>
        <link href="https://greynewell.com/blog/why-code-graphs-matter/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/why-code-graphs-matter/</id>
        <published>2026-03-02T00:00:00.000Z</published>
        <updated>2026-03-02T00:00:00.000Z</updated>
        <summary>AI coding agents lose critical structural understanding of codebases when context compaction occurs. Code graphs provide persistent external memory—representing functions, classes, and dependencies as queryable relationships—so agents can recover context without re-reading files from scratch.</summary>
        <content type="html"><![CDATA[<p>AI coding agents face a significant challenge: context loss during conversation compaction. As sessions progress and conversation history grows, agents must compress older messages to stay within finite context windows. This process often discards critical structural information about codebases—function signatures, dependency chains, and architectural decisions disappear.</p>
<h2>The Compaction Problem</h2>
<p>Every AI agent grapples with the tension between finite context windows and infinite codebases. When compaction occurs without a persistent structural model, the agent loses track of previously analyzed code relationships. This leads to inefficient behavior: agents re-read files, repeat analysis, and lose important architectural understanding they've already developed.</p>
<h2>What Goes Wrong in Practice</h2>
<p>A concrete example illustrates this issue: during a 45-minute refactoring session, an agent traces a complete call chain from API layer through service classes to database. It understands entry points, internal utilities, and shared features. Then compaction hits. The agent discards this architectural work and must re-read files from scratch on the next request, asking &quot;questions it already answered&quot; and potentially making conflicting changes.</p>
<h2>Code Graphs as Solution</h2>
<p>Code graphs provide persistent external memory by representing codebases as structured relationships between functions, classes, modules, and their connections. Through tools like Supermodel's MCP server, agents can query for:</p>
<ul>
<li>Functions within modules</li>
<li>File dependencies</li>
<li>Call chains for features</li>
<li>Type definitions and usage patterns</li>
</ul>
<p>The key advantage: graph queries give you structure and relationships, not just text matches.</p>
<h2>Beyond Compaction: Broader Applications</h2>
<p>Code graphs enable several advanced capabilities:</p>
<p><strong>Dead Code Detection:</strong> Identify unused functions and classes without reading entire codebases.</p>
<p><strong>Impact Analysis:</strong> Determine which modules depend on utilities before modifications to prevent unintended ripple effects.</p>
<p><strong>Test Coverage Analysis:</strong> Trace which functions each test exercises directly from call graphs.</p>
<p><strong>Codebase Evaluation:</strong> Assess domain structure, dependency health, and module coupling quickly.</p>
<p><strong>Documentation Generation:</strong> Ground documentation in actual code structure rather than potentially outdated comments.</p>
<p><strong>Developer Onboarding:</strong> Provide new team members and agents with structural maps for faster orientation.</p>
<h2>Why This Matters Now</h2>
<p>As agents tackle increasingly complex multi-file tasks, the compaction problem intensifies. While simple bug fixes may survive context compression, large refactors across many files expose the limitations of purely conversation-based context. Code graphs represent essential infrastructure for serious AI-assisted development.</p>
]]></content>
    </entry>
    
    <entry>
        <title>SWE-bench Tests Run 6x Faster on ARM64 with Native Containers</title>
        <link href="https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/</id>
        <published>2026-03-05T00:00:00.000Z</published>
        <updated>2026-03-05T00:00:00.000Z</updated>
        <summary>SWE-bench&#39;s pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.</summary>
        <content type="html"><![CDATA[<p>If you're running <a href="https://www.swebench.com/SWE-bench/">SWE-bench</a> evaluations on ARM64 hardware, your test suites are running under x86 emulation. Apple Silicon Macs, AWS Graviton instances, it doesn't matter. The pre-built images are x86_64, and QEMU translates every instruction at runtime.</p>
<p>SWE-bench's <a href="https://www.swebench.com/SWE-bench/faq/">FAQ</a> lists ARM support as &quot;experimental&quot; and recommends an x86_64 machine. In practice, that means conda installs, pip builds, and pytest runs all go through QEMU's user-space translation layer. It works. It's just slow.</p>
<p>I wrote <a href="https://github.com/greynewell/swe-bench-fast">swe-bench-fast</a>, a Go reimplementation of the SWE-bench eval harness that builds native ARM64 container images. On the test runner, I measured a <strong>6.3x speedup</strong> over the emulated x86 images. I benchmarked on an M3 Pro, but the images run natively on Graviton3 and Graviton4 too.</p>
<h2>The 6.3x speedup</h2>
<p>I selected 11 SWE-bench instances (one per repository) and ran the same gold patches and test suites through both harnesses on the same machine. All images were pre-built and cached locally, and the patches were pre-computed. No agent inference time is included. This is purely test runner wall-clock time: container start, patch apply, <code>pytest</code>, grade.</p>
<p><strong>Machine:</strong> MacBook Pro M3 Pro (12 cores, 36 GB RAM). <strong>Docker:</strong> Colima VM with 10 CPUs, 28 GB RAM, linux/arm64.</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>ARM64 native (s)</th>
<th>x86 emulated (s)</th>
<th>Speedup</th>
<th>Result match</th>
</tr>
</thead>
<tbody>
<tr>
<td>astropy__astropy-12907</td>
<td>2.7</td>
<td>9.7</td>
<td>3.7x</td>
<td>yes</td>
</tr>
<tr>
<td>django__django-13346</td>
<td>2.7</td>
<td>18.9</td>
<td>7.0x</td>
<td>yes</td>
</tr>
<tr>
<td>matplotlib__matplotlib-14623</td>
<td>38.0</td>
<td>265.7</td>
<td>7.0x</td>
<td>yes</td>
</tr>
<tr>
<td>mwaskom__seaborn-3069</td>
<td>15.4</td>
<td>101.0</td>
<td>6.6x</td>
<td>yes</td>
</tr>
<tr>
<td>pallets__flask-5014</td>
<td>1.0</td>
<td>3.9</td>
<td>3.9x</td>
<td>yes</td>
</tr>
<tr>
<td>psf__requests-1142</td>
<td>1.1</td>
<td>4.8</td>
<td>4.3x</td>
<td>yes</td>
</tr>
<tr>
<td>pylint-dev__pylint-7277</td>
<td>14.0</td>
<td>76.0</td>
<td>5.4x</td>
<td>yes</td>
</tr>
<tr>
<td>pytest-dev__pytest-6197</td>
<td>4.7</td>
<td>28.2</td>
<td>6.1x</td>
<td>yes</td>
</tr>
<tr>
<td>scikit-learn__scikit-learn-25102</td>
<td>2.7</td>
<td>18.2</td>
<td>6.6x</td>
<td>yes</td>
</tr>
<tr>
<td>sphinx-doc__sphinx-10323</td>
<td>3.1</td>
<td>17.2</td>
<td>5.6x</td>
<td>yes</td>
</tr>
<tr>
<td>sympy__sympy-11618</td>
<td>1.9</td>
<td>8.0</td>
<td>4.2x</td>
<td>yes</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>87.3</strong></td>
<td><strong>551.7</strong></td>
<td><strong>6.3x</strong></td>
<td><strong>11/11</strong></td>
</tr>
</tbody>
</table>
<p>The repos with heavier test suites (matplotlib at 265s emulated, seaborn at 101s) showed the largest absolute gains. All 11 instances produce identical results on both harnesses.</p>
<p>The full benchmark data and raw notes are in <a href="https://gist.github.com/greynewell/497005bb33641503f1a5874f16578088">this gist</a>.</p>
<h2>78% of SWE-bench runs natively on ARM64</h2>
<p>Out of 2,294 instances in the full SWE-bench dataset, <strong>1,798 build and run natively on ARM64</strong>. The remaining 496 require x86 because they depend on binary conda packages (scikit-learn, matplotlib, xarray) that aren't published for ARM.</p>
<p>Those 496 instances still run under QEMU. There's no coverage gap. The 78% that go native just stop paying the emulation tax.</p>
<table>
<thead>
<tr>
<th>Repository</th>
<th>ARM64 native</th>
<th>x86 required</th>
</tr>
</thead>
<tbody>
<tr>
<td>django/django</td>
<td>811</td>
<td>39</td>
</tr>
<tr>
<td>sympy/sympy</td>
<td>382</td>
<td>4</td>
</tr>
<tr>
<td>scikit-learn/scikit-learn</td>
<td>37</td>
<td>192</td>
</tr>
<tr>
<td>matplotlib/matplotlib</td>
<td>37</td>
<td>147</td>
</tr>
<tr>
<td>pydata/xarray</td>
<td>0</td>
<td>110</td>
</tr>
<tr>
<td>sphinx-doc/sphinx</td>
<td>185</td>
<td>2</td>
</tr>
<tr>
<td>pytest-dev/pytest</td>
<td>118</td>
<td>1</td>
</tr>
<tr>
<td>astropy/astropy</td>
<td>94</td>
<td>1</td>
</tr>
<tr>
<td>Others</td>
<td>134</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>The list of x86-only instances is defined in <a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/constants/python.py"><code>USE_X86</code></a> in the SWE-bench source.</p>
<h2>Comparable image sizes</h2>
<p>I built all 11 benchmarked instances as native ARM64 images and compared on-disk sizes against the <a href="https://epoch.ai/">Epoch</a> x86_64 images.</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>ARM64 native</th>
<th>x86 Epoch</th>
<th>Difference</th>
</tr>
</thead>
<tbody>
<tr>
<td>astropy__astropy-12907</td>
<td>3.41 GB</td>
<td>3.20 GB</td>
<td>+6.6%</td>
</tr>
<tr>
<td>django__django-13346</td>
<td>3.34 GB</td>
<td>3.44 GB</td>
<td>-2.9%</td>
</tr>
<tr>
<td>matplotlib__matplotlib-14623</td>
<td>5.95 GB</td>
<td>6.03 GB</td>
<td>-1.3%</td>
</tr>
<tr>
<td>mwaskom__seaborn-3069</td>
<td>3.98 GB</td>
<td>3.30 GB</td>
<td>+20.6%</td>
</tr>
<tr>
<td>pallets__flask-5014</td>
<td>3.30 GB</td>
<td>2.97 GB</td>
<td>+11.1%</td>
</tr>
<tr>
<td>psf__requests-1142</td>
<td>3.11 GB</td>
<td>2.67 GB</td>
<td>+16.5%</td>
</tr>
<tr>
<td>pylint-dev__pylint-7277</td>
<td>3.28 GB</td>
<td>2.89 GB</td>
<td>+13.5%</td>
</tr>
<tr>
<td>pytest-dev__pytest-6197</td>
<td>3.11 GB</td>
<td>2.71 GB</td>
<td>+14.8%</td>
</tr>
<tr>
<td>scikit-learn__scikit-learn-25102</td>
<td>4.20 GB</td>
<td>5.96 GB</td>
<td>-29.5%</td>
</tr>
<tr>
<td>sphinx-doc__sphinx-10323</td>
<td>3.36 GB</td>
<td>3.00 GB</td>
<td>+12.0%</td>
</tr>
<tr>
<td>sympy__sympy-11618</td>
<td>3.20 GB</td>
<td>3.10 GB</td>
<td>+3.2%</td>
</tr>
</tbody>
</table>
<p>On-disk sizes are mixed. scikit-learn is 29.5% smaller on ARM64, django 2.9% smaller. Most others are 3-20% larger due to differences in base image layers. By compressed content size (what actually gets pulled), ARM64 images average about 4% smaller.</p>
<p>The Dockerfiles and package lists are identical to upstream. <a href="https://github.com/greynewell/swe-bench-fast">swe-bench-fast</a> builds images through <a href="https://docs.docker.com/build/buildkit/">BuildKit</a> with in-memory tar build contexts, which avoids the stray build artifacts that the upstream Python harness leaks into image layers. Net effect: native ARM64 images are roughly the same size.</p>
<h2>What I had to fix</h2>
<p>Four issues anyone hitting this path will encounter:</p>
<p><strong>Conda channel config changed.</strong> Miniconda <code>py311_23.11.0-2</code> now defaults to <code>conda-forge</code> only with <code>channel_priority: strict</code>. Older packages like <code>setuptools==38.2.4</code> live on the <code>defaults</code> channel and won't resolve. The fix: explicitly configure both channels before building env images.</p>
<p><strong><code>make_test_spec</code> defaults to x86_64.</strong> Every call to <code>make_test_spec</code> hardcodes <code>arch=&quot;x86_64&quot;</code>. On ARM hosts, this means images are built for the wrong architecture unless you explicitly override it. I <a href="https://github.com/SWE-bench/SWE-bench/pull/524">opened a PR</a> (<a href="https://github.com/SWE-bench/SWE-bench/issues/523">issue</a>) to auto-detect via <code>platform.machine()</code>.</p>
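<p>The auto-detection itself is small; a sketch of the approach, mapping the host machine string reported by <code>platform.machine()</code> to the arch value the harness expects:</p>

```python
import platform

def detect_arch() -> str:
    """Map the host machine string to a harness arch, defaulting to x86_64."""
    machine = platform.machine().lower()
    if machine in ("arm64", "aarch64"):  # macOS and Linux report ARM differently
        return "arm64"
    return "x86_64"

print(detect_arch())
```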
<p><strong>x86-only instances need enforcement.</strong> Some instances must be x86 regardless of host arch. Without checking <code>USE_X86</code> in the build pipeline, these instances silently get ARM images that fail at runtime. The <a href="https://github.com/SWE-bench/SWE-bench/pull/521">broader ARM64 support PR</a> by <a href="https://github.com/SailorJoe6">@SailorJoe6</a> addresses this along with JS and Java language support.</p>
<p><strong>Unpinned transitive dependencies break tests.</strong> The upstream specs pin direct dependencies but not all transitives. When <code>pip install -e .[test]</code> resolves on ARM64, it can pull newer package versions than what the Epoch x86 images were built with. For sphinx instances, <code>Pygments==2.19</code> changed HTML output for line number spans, causing pass-to-pass test failures. Pinning <code>Pygments==2.18.0</code> to match the Epoch images fixed it. Any repo with HTML/rendering assertions is vulnerable to this kind of drift.</p>
<h2>Try it yourself</h2>
<p><a href="https://github.com/greynewell/swe-bench-fast">swe-bench-fast</a> is a standalone Go binary. It pulls pre-built ARM64 images from <a href="https://hub.docker.com/repository/docker/greynewell/swe-bench-fast/general">Docker Hub</a> for the 78% of instances that support it, and Epoch x86 images for the rest. No Python, no image builds.</p>
<pre><code>swe-bench-fast run --dataset swe-bench-full.jsonl --predictions preds.jsonl
</code></pre>
<p>That works on both ARM64 and x86. On ARM64, 1,798 instances run natively and 496 run under QEMU. On x86, everything runs natively via the Epoch images.</p>
<p><strong>On an M-series Mac</strong>, allocate at least 120 GB disk and 8+ CPU cores to Docker Desktop or Colima.</p>
<p><strong>On AWS Graviton</strong> (c7g, m7g, r7g, r8g), Docker runs natively with no VM layer. Install <code>qemu-user-static</code> for the x86-only instances. Graviton instances typically cost 20-40% less than comparable x86 EC2. That cost difference plus the 6x speedup makes a real difference in iteration time.</p>
<p>The <a href="https://gist.github.com/greynewell/497005bb33641503f1a5874f16578088">benchmark gist</a> has the full methodology, raw data, and detailed notes.</p>
<h2>What's next</h2>
<p>I'm <a href="https://github.com/greynewell/swe-bench-fast/actions">building and pushing</a> the 1,798 ARM64-native SWE-bench instance images to <a href="https://hub.docker.com/repository/docker/greynewell/swe-bench-fast/general">Docker Hub</a>. The next post covers what that full build taught me about how SWE-bench actually works under the hood.</p>
<hr>
<p><em>Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology. The raw benchmark data is available at <a href="https://gist.github.com/greynewell/497005bb33641503f1a5874f16578088">gist.github.com</a>. The eval harness source is at <a href="https://github.com/greynewell/swe-bench-fast">github.com/greynewell/swe-bench-fast</a>.</em></p>
]]></content>
    </entry>
    
    <entry>
        <title>SWE-bench Verified Is Broken: 5 Things I Found in the Source Code</title>
        <link href="https://greynewell.com/blog/swe-bench-verified-broken-5-things-source-code/" rel="alternate" type="text/html"/>
        <id>https://greynewell.com/blog/swe-bench-verified-broken-5-things-source-code/</id>
        <published>2026-03-06T00:00:00.000Z</published>
        <updated>2026-03-06T00:00:00.000Z</updated>
        <summary>After building 1,798 SWE-bench containers, I dug into the source. The tests reject correct solutions and every frontier model has memorized the answers.</summary>
        <content type="html"><![CDATA[<p>I've built 1,798 custom SWE-bench containers that <a href="https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/">run natively on ARM processors</a>. I've also run SWE-bench Lite, Verified, and Pro more than 100 times evaluating prototype products at <a href="https://supermodeltools.com/">Supermodel</a>. This post covers some of the confusing, broken, or just plain odd things I've learned by working with SWE-bench and reading the source code directly.</p>
<h2>1. Every problem predates October 2023</h2>
<p>While checking logs from an agent run, I noticed something very odd. The problem the agent was given by SWE-bench to evaluate was a GitHub issue from 2017. That's really old!</p>
<p>Most frontier models' training data cuts off between 2023 and 2024. If most of the problems are older than that, then the repository, GitHub issue, and solution have almost certainly leaked into the models' training data. Each SWE-bench instance is taken from a popular open source repository, exactly the kind of data all LLMs are trained on.</p>
<p>I decided to keep digging: are <em>all</em> of the problems this old? The <a href="https://arxiv.org/abs/2310.06770">SWE-bench paper</a> (Appendix Table 21) reports the temporal distribution of all task instances:</p>
<table>
<thead>
<tr>
<th>Year</th>
<th>Task instances</th>
<th>% of total</th>
</tr>
</thead>
<tbody>
<tr>
<td>&lt; 2018</td>
<td>89</td>
<td>4.2%</td>
</tr>
<tr>
<td>2018</td>
<td>165</td>
<td>7.7%</td>
</tr>
<tr>
<td>2019</td>
<td>437</td>
<td>20.4%</td>
</tr>
<tr>
<td>2020</td>
<td>427</td>
<td>20.0%</td>
</tr>
<tr>
<td>2021</td>
<td>383</td>
<td>17.9%</td>
</tr>
<tr>
<td>2022</td>
<td>395</td>
<td>18.5%</td>
</tr>
<tr>
<td>2023</td>
<td>244</td>
<td>11.4%</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>2,140</strong></td>
<td></td>
</tr>
</tbody>
</table>
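<p>The table alone makes the contamination math stark. A quick sketch using only the numbers above:</p>

```python
# Year distribution from the SWE-bench paper (Appendix Table 21)
counts = {'pre2018': 89, '2018': 165, '2019': 437, '2020': 427,
          '2021': 383, '2022': 395, '2023': 244}
total = sum(counts.values())
assert total == 2140
# A model with a 2024 training cutoff has seen every single instance;
# one with an early-2023 cutoff has still seen the overwhelming majority.
share_pre_2023 = (total - counts['2023']) / total
print(f'{share_pre_2023:.1%} of instances predate 2023')  # 88.6%
```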
<p>The collection pipeline scraped the top 100 PyPI repos as of August 2023 (paper Appendix A.1). The paper was published October 10, 2023. <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-bench Verified</a> (500 curated problems) was released in August 2024. Frozen data, no new problems.</p>
<p>The pipeline itself (<a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/collect/get_tasks_pipeline.py"><code>get_tasks_pipeline.py</code></a>) has no default cutoff:</p>
<pre><code class="language-python">parser.add_argument(
    &quot;--cutoff_date&quot;,
    type=str,
    help=&quot;Cutoff date for PRs to consider in format YYYYMMDD&quot;,
    default=None,
)
</code></pre>
<p>Because the test set is frozen in time, any model trained after October 2023 has likely seen most or all of the problems and their solutions. That contaminates any accuracy measurement and makes the resulting scores unreliable.</p>
<h2>2. The harness is x86-first</h2>
<p>SWE-bench was designed to run on x86 hardware, and the prebuilt images from Epoch AI only support x86. This design decision rules out native execution on any recent Apple hardware as well as cost-effective cloud runners like AWS Graviton. Those machines must instead emulate x86 through QEMU or Rosetta, which runs very slowly.</p>
<p>I was able to show a <a href="https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/">6.3x speedup</a> measured on my M3 MacBook Pro by compiling SWE-bench containers specifically for ARM, although 496 containers still require x86 emulation due to missing ARM binaries. A newer set of test instances could support ARM by default, and a few small changes would improve ARM support throughout the existing benchmark.</p>
<p><a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/test_spec/test_spec.py"><code>make_test_spec()</code></a> defaults to x86:</p>
<pre><code class="language-python">def make_test_spec(
    ...
    arch: str = &quot;x86_64&quot;,
</code></pre>
<p>No caller in the codebase passes a different value. The platform mapping supports ARM64, but the ARM branch is never exercised:</p>
<pre><code class="language-python">@property
def platform(self):
    if self.arch == &quot;x86_64&quot;:
        return &quot;linux/x86_64&quot;
    elif self.arch == &quot;arm64&quot;:
        return &quot;linux/arm64/v8&quot;
    else:
        raise ValueError(f&quot;Invalid architecture: {self.arch}&quot;)
</code></pre>
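<p>The fix would be small. A minimal sketch of a host-aware default (the helper name is mine, not part of the harness):</p>

```python
import platform

def host_arch() -> str:
    # Map Python's machine names onto the two values make_test_spec() accepts.
    machine = platform.machine().lower()
    return 'arm64' if machine in ('arm64', 'aarch64') else 'x86_64'

# e.g. make_test_spec(..., arch=host_arch()) would then select
# linux/arm64/v8 automatically on Apple silicon or Graviton.
print(host_arch())
```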
<p>Support varies across the per-language Dockerfiles; JavaScript and Java hardcode x86 binaries, while Go and Python are architecture-aware:</p>
<table>
<thead>
<tr>
<th>Language</th>
<th>File</th>
<th>What's hardcoded</th>
</tr>
</thead>
<tbody>
<tr>
<td>JavaScript</td>
<td><code>dockerfiles/javascript.py</code> line 27</td>
<td><code>deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main</code></td>
</tr>
<tr>
<td>JavaScript</td>
<td><code>dockerfiles/javascript.py</code> line 108</td>
<td><code>pnpm-linux-x64</code> binary download</td>
</tr>
<tr>
<td>Java</td>
<td><code>dockerfiles/java.py</code> lines 15-19</td>
<td><code>maven-mvnd-1.0.2-linux-amd64.zip</code></td>
</tr>
<tr>
<td>Go</td>
<td><code>dockerfiles/go.py</code> lines 16-46</td>
<td>Architecture-aware (uses <code>dpkg --print-architecture</code>)</td>
</tr>
<tr>
<td>Python</td>
<td><code>dockerfiles/python.py</code> line 24</td>
<td>Architecture-aware (uses <code>conda_arch</code> variable)</td>
</tr>
</tbody>
</table>
<p><a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/constants/python.py"><code>USE_X86</code></a> defines the 496 instance IDs that require x86. It's exported in <code>__init__.py</code> but never referenced in build or evaluation logic. There's an unmerged <code>force_x86</code> branch suggesting it was intended to be used but never was.</p>
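<p>Wiring <code>USE_X86</code> into arch selection would be a one-liner. A hypothetical sketch (the instance ID below is illustrative, not necessarily in the real set):</p>

```python
# Illustrative stand-in for swebench.harness.constants.USE_X86 (496 IDs in reality)
USE_X86 = {'matplotlib__matplotlib-13989'}

def pick_arch(instance_id: str, host: str = 'arm64') -> str:
    # Fall back to x86 emulation only for instances with x86-only binaries.
    return 'x86_64' if instance_id in USE_X86 else host

print(pick_arch('matplotlib__matplotlib-13989'))  # x86_64
print(pick_arch('django__django-11099'))          # arm64
```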
<p>The <a href="https://github.com/SWE-bench/SWE-bench/blob/main/README.md">README</a> recommends an x86_64 machine and calls ARM64 support &quot;experimental.&quot;</p>
<p>While not strictly &quot;broken,&quot; the half-implemented ARM support prevents users from running the benchmark efficiently on popular local machines or cost-effective modern cloud hardware. On top of that, the benchmark problems don't measure what you might assume.</p>
<h2>3. Problems test the last mile, not exploration</h2>
<p>Counter to popular intuition, SWE-bench problems are mostly well-scoped. This is by design. If you look at logs of agents working the problems, you don't see an agent navigating an unfamiliar codebase, finding key files, and reasoning about the architecture. The agent is being tested on writing a small, targeted fix once the general solution is known.</p>
<p>I argue that this is a feature of the benchmark (a controlled measurement), but that we should all calibrate our expectations regarding what an SWE-bench score means.</p>
<p>SWE-bench Lite explicitly filters for small, single-file patches (<a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/collect/make_lite/make_lite.py"><code>make_lite.py</code></a>):</p>
<pre><code class="language-python">def filter_patch(instance):
    patch_text = instance[&quot;patch&quot;]
    if (
        contains_non_modified_files(patch_text)
        or not leq_n_files(patch_text, 1)
        or not leq_n_hunks(patch_text, 3)
    ):
        return False
    return True
</code></pre>
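<p>The real checks parse the diff with the <code>unidiff</code> library; a rough pure-Python stand-in shows what they count:</p>

```python
def count_files_and_hunks(patch_text: str):
    # Approximate leq_n_files / leq_n_hunks: files are 'diff --git' headers,
    # hunks are '@@' markers. (The real code uses unidiff's PatchSet.)
    lines = patch_text.splitlines()
    files = sum(1 for l in lines if l.startswith('diff --git'))
    hunks = sum(1 for l in lines if l.startswith('@@'))
    return files, hunks

patch = '''diff --git a/foo.py b/foo.py
@@ -1,2 +1,2 @@
-old
+new
'''
print(count_files_and_hunks(patch))  # (1, 1) -- small enough for Lite
```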
<p>The scope constraints from <a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/collect/make_lite/criteria.py"><code>criteria.py</code></a>:</p>
<table>
<thead>
<tr>
<th>Constraint</th>
<th>Function</th>
<th>Threshold</th>
</tr>
</thead>
<tbody>
<tr>
<td>Max files in gold patch</td>
<td><code>leq_n_files()</code></td>
<td>1</td>
</tr>
<tr>
<td>Max hunks</td>
<td><code>leq_n_hunks()</code></td>
<td>3</td>
</tr>
<tr>
<td>Max lines changed</td>
<td><code>leq_n_code_lines()</code></td>
<td>25</td>
</tr>
<tr>
<td>No added/removed files</td>
<td><code>contains_non_modified_files()</code></td>
<td>0</td>
</tr>
</tbody>
</table>
<p>Even in full SWE-bench, each problem maps to a single PR. Test vs. fix is split by path matching (<a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/collect/utils.py"><code>utils.py</code></a>):</p>
<pre><code class="language-python">def extract_patches(pull: dict, repo: Repo) -&gt; tuple[str, str]:
    patch = requests.get(pull[&quot;diff_url&quot;]).text
    patch_test = &quot;&quot;
    patch_fix = &quot;&quot;
    for hunk in PatchSet(patch):
        if any(
            test_word in hunk.path for test_word in [&quot;test&quot;, &quot;tests&quot;, &quot;e2e&quot;, &quot;testing&quot;]
        ):
            patch_test += str(hunk)
        else:
            patch_fix += str(hunk)
    return patch_fix, patch_test
</code></pre>
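<p>Note the substring match on paths. It is loose enough to misroute hunks (example paths are mine):</p>

```python
# Any path containing 'test' is routed to the test patch, including
# false positives such as 'latest'.
paths = ['src/utils.py', 'tests/test_core.py', 'src/latest_news.py']
flags = [any(w in p for w in ['test', 'tests', 'e2e', 'testing']) for p in paths]
print(flags)  # [False, True, True] -- 'latest_news.py' is misclassified
```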
<p>The model receives the issue text and the full repo state at the commit before the fix. No ambiguity about which project, which branch, or which codebase. The job is to produce a diff.</p>
<p>Similar to the cultural debate amongst technologists about the diverging roles of &quot;coders&quot; vs &quot;software engineers,&quot; the benchmark is an efficient measure of a model's ability to generate a narrowly targeted fix. It doesn't test codebase navigation or architectural reasoning in its current form.</p>
<h2>4. Tests reject correct solutions</h2>
<p>In February 2026, <a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/">OpenAI published an audit</a> of 138 SWE-bench Verified problems (27.6% of the 500-problem set) that o3 did not consistently solve over 64 independent runs. They found that 59.4% had test design flaws that reject functionally correct submissions. I've seen the same pattern replicated over hundreds of SWE-bench instances: test suites sometimes reject working code that solves the original issue. The evaluation works by providing &quot;fail to pass&quot; tests that must fail and &quot;pass to pass&quot; tests that must succeed for a solution to be marked correct. The tests are brittle to the point that correct fixes can still break the suite.</p>
<table>
<thead>
<tr>
<th>Issue type</th>
<th>% of audited problems</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Narrow tests</td>
<td>35.5%</td>
<td>Enforce specific implementation details, rejecting correct alternatives</td>
</tr>
<tr>
<td>Wide tests</td>
<td>18.8%</td>
<td>Check functionality not specified in the problem description</td>
</tr>
<tr>
<td>Miscellaneous</td>
<td>5.1%</td>
<td>Other test design issues</td>
</tr>
<tr>
<td>No issue found</td>
<td>40.6%</td>
<td>Tests are fine</td>
</tr>
</tbody>
</table>
<h3>Narrow tests</h3>
<p>Some tests are too &quot;narrow&quot;: they check for specific implementation details that are not hard requirements for solving the problem at hand.</p>
<p>For example, in <a href="https://github.com/pylint-dev/pylint/pull/4551"><code>pylint-dev__pylint-4551</code></a>, the problem description asks for Python type hints in UML generation. The PR introduces a function called <code>get_annotation</code>. The test file imports it by name:</p>
<pre><code class="language-python">from pylint.pyreverse.utils import get_annotation, get_visibility, infer_node
</code></pre>
<p>The problem description never mentions <code>get_annotation</code>. A correct solution using any other function name fails with:</p>
<pre><code>ImportError: cannot import name 'get_annotation' from 'pylint.pyreverse.utils'
</code></pre>
<p>So a correct solution is erroneously marked incorrect.</p>
<h3>Wide tests</h3>
<p>Other tests are, by contrast, too &quot;wide&quot;: they include checks for issues not mentioned in the problem statement. Models almost always fail to fix issues they were never told about.</p>
<p>In <a href="https://github.com/sympy/sympy/pull/18199"><code>sympy__sympy-18199</code></a>, the PR fixed three distinct issues: <a href="https://github.com/sympy/sympy/issues/17373">#17373</a>, <a href="https://github.com/sympy/sympy/issues/17377">#17377</a>, and <a href="https://github.com/sympy/sympy/issues/18212">#18212</a>. The SWE-bench task description only describes #18212 (<code>nthroot_mod function misses one root of x = 0 mod p</code>). The tests cover all three. Models that correctly fix #18212 fail tests for the other two issues they were never told about.</p>
<h3>The codebase acknowledges this</h3>
<p>The Lite filter explicitly removes tests that check exact error messages (<a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/collect/make_lite/criteria.py"><code>criteria.py</code></a>):</p>
<pre><code class="language-python">def contains_pytest_match_arg(patch_test_text: str) -&gt; bool:
    if any(
        [
            x in patch_test_text
            for x in [
                &quot;pytest.raises&quot;,
                &quot;pytest.warns&quot;,
                &quot;pytest.deprecated_call&quot;,
            ]
        ]
    ):
        return &quot;match&quot; in patch_test_text
    if any(
        [
            x in patch_test_text
            for x in [
                &quot;assertOutput&quot;,
                &quot;assertRaises&quot;,
                &quot;checks.Error&quot;,
            ]
        ]
    ):
        return True
    return False
</code></pre>
<p>These patterns are excluded from Lite because a correct fix with different error message wording fails them.</p>
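<p>To see what that filter catches, here is a simplified copy run on two sample test snippets (the snippets are mine):</p>

```python
def contains_pytest_match_arg(patch_test_text: str) -> bool:
    # Simplified version of the Lite filter from criteria.py.
    if any(x in patch_test_text
           for x in ['pytest.raises', 'pytest.warns', 'pytest.deprecated_call']):
        return 'match' in patch_test_text
    return any(x in patch_test_text
               for x in ['assertOutput', 'assertRaises', 'checks.Error'])

brittle = "with pytest.raises(ValueError, match='exact wording'):"
robust = 'with pytest.raises(ValueError):'
print(contains_pytest_match_arg(brittle), contains_pytest_match_arg(robust))
# True False -- only the message-matching test is filtered out of Lite
```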
<p>The grading logic treats any test missing from the log parser output as a failure, not as unknown (<a href="https://github.com/SWE-bench/SWE-bench/blob/main/swebench/harness/grading.py"><code>grading.py</code></a>):</p>
<pre><code class="language-python">def test_passed(case: str, sm: dict[str, str]) -&gt; bool:
    return case in sm and sm[case] in [TestStatus.PASSED.value, TestStatus.XFAIL.value]

def test_failed(case: str, sm: dict[str, str]) -&gt; bool:
    return case not in sm or sm[case] in [
        TestStatus.FAILED.value,
        TestStatus.ERROR.value,
    ]
</code></pre>
<p>Resolution requires 100% on both fail-to-pass and pass-to-pass:</p>
<pre><code class="language-python">if f2p == 1 and p2p == 1:
    return ResolvedStatus.FULL.value
elif f2p &lt; 1 and f2p &gt; 0 and p2p == 1:
    return ResolvedStatus.PARTIAL.value
else:
    return ResolvedStatus.NO.value
</code></pre>
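<p>And because resolution is all-or-nothing, one flaky pass-to-pass test sinks the whole instance. A sketch with the same thresholds:</p>

```python
def resolve(f2p: float, p2p: float) -> str:
    # Same thresholds as the harness: both pass rates must be exactly 1.0.
    if f2p == 1 and p2p == 1:
        return 'FULL'
    if 0 < f2p < 1 and p2p == 1:
        return 'PARTIAL'
    return 'NO'

# Fixing the bug but tripping 1% of regression tests scores
# the same as doing nothing at all.
print(resolve(1.0, 0.99))  # NO
```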
<p>The log parsers themselves are fragile. From the Django parser:</p>
<pre><code class="language-python"># TODO: This is very brittle, we should do better
# There's a bug in the django logger, such that sometimes a test output near the end gets
# interrupted by a particular long multiline print statement.
</code></pre>
<p>And a one-off workaround for a single instance:</p>
<pre><code class="language-python"># TODO: Temporary, exclusive fix for django__django-7188
if line.strip().startswith(
    &quot;Applying sites.0002_alter_domain_unique...test_no_migrations&quot;
):
    line = line.split(&quot;...&quot;, 1)[-1].strip()
</code></pre>
<p>The JavaScript Karma parser carries a similar warning:</p>
<pre><code class="language-python">def parse_log_karma(log: str, test_spec: TestSpec) -&gt; dict[str, str]:
    &quot;&quot;&quot;
    Different immutable.js instances use different test runners and log formats.
    Logic is brittle.
    &quot;&quot;&quot;
</code></pre>
<p>In summary, the combination of:</p>
<ul>
<li>Narrow tests</li>
<li>Wide tests</li>
<li>All-or-nothing grading</li>
<li>Brittle parsing</li>
</ul>
<p>...causes SWE-bench to reject an unknown number of correct solutions, biasing the scores.</p>
<h2>5. Models have memorized the answers</h2>
<p>Circling back to issue #1: the coding problems SWE-bench serves are old and public, and there is direct evidence that large models have stored these specific problems and solutions in their weights.</p>
<p>Connecting with issue #4, models pass narrow tests <em>specifically because</em> they memorized the implementation details the test is checking for. Uncontaminated models trying correct-but-different solutions get rejected entirely.</p>
<p>What to make of all of this? <a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/">OpenAI's conclusion</a> (February 2026):</p>
<blockquote>
<p>&quot;improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.&quot;</p>
</blockquote>
<h3>GPT-5.2 -- <code>django__django-11451</code></h3>
<p>Problem statement: <code>ModelBackend.authenticate() shouldn't make a database query when username is None</code></p>
<p>When prompted with just the task ID and a hint, GPT-5.2 reproduced the exact gold patch:</p>
<pre><code class="language-diff"> class ModelBackend(BaseBackend):
     def authenticate(self, request, username=None, password=None, **kwargs):
+        if username is None or password is None:
+            return
         UserModel = get_user_model()
         if username is None:
             username = kwargs.get(UserModel.USERNAME_FIELD)
</code></pre>
<p>It also referenced Django release history in its chain of thought:</p>
<blockquote>
<p>&quot;There is also <code>edit_only</code> parameter maybe added around 4.1 or 4.2. Since this is 4.1 dev 2022, the code might be before introduction. We will implement now.&quot;</p>
</blockquote>
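<p>Reproductions like this are easy to quantify. A toy similarity check against the gold patch (strings abbreviated by me):</p>

```python
import difflib

gold = 'if username is None or password is None:\n    return'
model_out = 'if username is None or password is None:\n    return'
ratio = difflib.SequenceMatcher(None, gold, model_out).ratio()
print(ratio)  # 1.0 -- verbatim reproduction, strong evidence of memorization
```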
<h3>Claude Opus 4.5 -- <code>astropy__astropy-13236</code></h3>
<p>When asked to name the exact file path, function, and inline comment, Opus responded:</p>
<p><strong>File</strong>: <code>astropy/table/table.py</code> in the <code>_convert_data_to_col</code> method</p>
<p><strong>Inline comment</strong> (word-for-word):</p>
<pre><code class="language-python"># Structured ndarray gets viewed as a mixin unless already a valid
# mixin class
</code></pre>
<p><strong>Changed code</strong>:</p>
<pre><code class="language-python">if (not isinstance(data, Column) and not data_is_mixin
        and isinstance(data, np.ndarray) and len(data.dtype) &gt; 1):
    data = data.view(NdarrayMixin)
    data_is_mixin = True
</code></pre>
<p>The gold patch removes exactly those lines.</p>
<h3>Gemini 3 Flash -- <code>django__django-11099</code></h3>
<p>Given only the task ID and a one-line problem statement (<code>UsernameValidator allows trailing newline in usernames</code>), Gemini reproduced the complete gold patch including exact regex, file paths, and surrounding context:</p>
<pre><code class="language-diff"> class ASCIIUsernameValidator(validators.RegexValidator):
-    regex = r'^[\w.@+-]+$'
+    regex = r'^[\w.@+-]+\Z'

 class UnicodeUsernameValidator(validators.RegexValidator):
-    regex = r'^[\w.@+-]+$'
+    regex = r'^[\w.@+-]+\Z'
</code></pre>
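<p>The regex detail is itself instructive: in Python, <code>$</code> also matches just before a trailing newline, which is the whole bug. A quick check:</p>

```python
import re

old = re.compile(r'^[\w.@+-]+$')   # pre-fix validator
new = re.compile(r'^[\w.@+-]+\Z')  # gold-patch validator
# '$' matches before a trailing newline; '\Z' only matches at the true end.
print(bool(old.match('alice\n')), bool(new.match('alice\n')))  # True False
```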
<p>In essence, higher scores on this benchmark correlate with increased contamination rather than increased general software engineering ability. OpenAI recommends that practitioners migrate to SWE-bench Pro.</p>
<h2>Conclusion</h2>
<p>If there are three things I want you to take away from this post, here they are:</p>
<ol>
<li>SWE-bench is a well-engineered and useful tool, but it measures a narrower set of capabilities than &quot;can AI do software engineering.&quot;</li>
<li>OpenAI stopped reporting Verified scores in February 2026 and recommends <a href="https://www.swebench.com/">SWE-bench Pro</a>.</li>
<li>When you see a SWE-bench score on a model card, now you know what questions to ask.</li>
</ol>
<p>My work in this area will continue with a GitHub Actions harness for generating and evaluating SWE-bench Pro scores.</p>
<hr>
<p><em>Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology. The eval harness source is at <a href="https://github.com/greynewell/swe-bench-fast">github.com/greynewell/swe-bench-fast</a>.</em></p>
]]></content>
    </entry>
    
</feed>
