SWE-Atlas Grades the Harness as Much as the Model

Scale AI's benchmark puts the native-versus-generic scaffold gap on the leaderboard and sorts by best score anyway. A leaderboard footnote admits only frontier closed models get the native scaffold treatment.

An open steel caliper measuring a stack of printed technical output pages on a wooden desk, one page set apart to the side, photographed in low angled morning light.

By Signal Desk. Agent-drafted, reviewed by Signal Desk.

Published 5/21/2026 · 3 min read

Scale AI released SWE-Atlas on March 4, expanding coding agent evaluation from single-issue patches into comprehension, test generation, and refactoring across 18 open-source repositories.

The benchmark runs 284 expert-authored tasks in Python, Go, TypeScript, C, and C++. Scale chose Anthropic's Claude Opus 4.5 for the rubric half of its hybrid grader.

The Codebase QnA leaderboard puts the scaffold gap in plain sight: GPT-5.4 scores 40.80% on Codex CLI and 36.30% on Mini-SWE-Agent. Opus 4.6 shows the same split: 33.30% on Claude Code, 30.00% on Mini-SWE-Agent, both rows published. The leaderboard sorts by best score.
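The split is simple arithmetic on the published rows. A minimal sketch, using only the four scores quoted above; the names `SCORES`, `GENERIC`, and `scaffold_gap` are illustrative and are not part of Scale's tooling:

```python
# Scores as quoted in this article (Codebase QnA leaderboard, percent).
SCORES = {
    "GPT-5.4": {"Codex CLI": 40.80, "Mini-SWE-Agent": 36.30},
    "Opus 4.6": {"Claude Code": 33.30, "Mini-SWE-Agent": 30.00},
}
GENERIC = "Mini-SWE-Agent"  # the generic scaffold every model was run on


def scaffold_gap(rows, generic=GENERIC):
    """Best native-scaffold score minus the generic score, in points."""
    native = [score for name, score in rows.items() if name != generic]
    return round(max(native) - rows[generic], 2)


gaps = {model: scaffold_gap(rows) for model, rows in SCORES.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:+.2f} points on its native scaffold")
```

On these figures the native scaffold is worth 4.50 points to GPT-5.4 and 3.30 points to Opus 4.6.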

A footnote explains the asymmetry: native scaffolds were used for "frontier closed models" only. The companion paper shows why this matters: native scaffolds generate 1.5x to 2x more tool calls per task, and Claude models invoke sub-agents in 96 to 99% of trials on Claude Code, with GPT models on Codex CLI doing so at comparable rates.

The scaffold question has an open-source answer taking shape. On April 2, Claw Code launched as a Python-Rust rewrite of the Claude Code harness, after Anthropic accidentally shipped its TypeScript source in an npm packaging error. It ships 19 native tools and multi-agent orchestration, the same scaffold primitives Scale's paper credits for the performance split.

Third-party entries have one row each, all on Mini-SWE-Agent: GLM-5 at 20.50%, Gemini 3.1 Pro at 13.50%, Kimi K2.5 at 13.10%. Gemini's situation is specific: Scale ran it on Gemini CLI natively and documented the results in the companion paper, but Gemini invoked sub-agents in just 1.1% of trials against 96% for Claude models. The native Gemini CLI scores never made the leaderboard.

OpenAI's GPT-5.5 tops the board under Anthropic's grader. A cross-vendor arrangement like this usually raises the question of whether the judge favors its own maker's models; here the judge had every structural reason to do so and didn't. A win under a rival's rubric is the result that is hardest to contest.

Ranking by best score when only frontier closed models get two scaffold choices means the leaderboard measures model-harness pairs, not models in isolation. GLM-5 at 20.50% and Gemini 3.1 Pro at 13.50% ran without the native-scaffold tooling that lifted GPT-5.4 by 4.5 points, from 36.30% to 40.80%. Their native ceiling was never measured.
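To see what the sort rule does to the margins, the quoted rows can be ranked two ways: by each model's best score, and on Mini-SWE-Agent alone. This is a rough sketch under stated assumptions: it uses only the scores given in this article (GPT-5.5 is left out because no score for it is quoted here), and `ROWS`, `rank`, and `lead` are hypothetical names, not Scale's method.

```python
# Leaderboard rows as quoted in this article (percent).
ROWS = [
    ("GPT-5.4", "Codex CLI", 40.80),
    ("GPT-5.4", "Mini-SWE-Agent", 36.30),
    ("Opus 4.6", "Claude Code", 33.30),
    ("Opus 4.6", "Mini-SWE-Agent", 30.00),
    ("GLM-5", "Mini-SWE-Agent", 20.50),
    ("Gemini 3.1 Pro", "Mini-SWE-Agent", 13.50),
    ("Kimi K2.5", "Mini-SWE-Agent", 13.10),
]


def rank(rows, scaffold=None):
    """Rank models by best score, optionally restricted to one scaffold."""
    best = {}
    for model, harness, score in rows:
        if scaffold is None or harness == scaffold:
            best[model] = max(score, best.get(model, score))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


def lead(ranking, a, b):
    """Point margin of model a over model b within one ranking."""
    scores = dict(ranking)
    return round(scores[a] - scores[b], 2)


by_best = rank(ROWS)                               # the leaderboard's rule
by_common = rank(ROWS, scaffold="Mini-SWE-Agent")  # same harness for all

print("best-score lead, GPT-5.4 over GLM-5:", lead(by_best, "GPT-5.4", "GLM-5"))
print("common-scaffold lead:", lead(by_common, "GPT-5.4", "GLM-5"))
```

On these rows the order of the five models is the same under both rules, but GPT-5.4's margin over GLM-5 shrinks from 20.3 points to 15.8 once both are read off the same scaffold.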

The SWE-Atlas leaderboard accepts live submissions. A Claw Code entry paired with a frontier model would be the first native open-source scaffold test on this dataset. If that entry doesn't appear before Q3, the 4.5-point performance split stays credited to the model.

