Tech
SWE-Atlas Grades the Harness as Much as the Model
Scale AI's benchmark puts the native-versus-generic scaffold gap on the leaderboard and sorts by best score anyway. A leaderboard footnote admits only frontier closed models get the native scaffold treatment.

Scale AI released SWE-Atlas on March 4, expanding coding-agent evaluation from single-issue patches to comprehension, test generation, and refactoring across 18 open-source repositories.
The benchmark runs 284 expert-authored tasks in Python, Go, TypeScript, C, and C++. Scale chose Anthropic's Claude Opus 4.5 for the rubric half of its hybrid grader.
The Codebase QnA leaderboard puts the scaffold gap in plain sight: GPT-5.4 scores 40.80% on Codex CLI and 36.30% on Mini-SWE-Agent. Opus 4.6 shows the same split: 33.30% on Claude Code, 30.00% on Mini-SWE-Agent, both rows published. The leaderboard sorts by best score.
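The size of that gap is simple arithmetic on the published rows. A minimal sketch, using only the four scores quoted above; the `scaffold_gap` helper and the data layout are illustrative and not part of Scale's tooling:

```python
# Scores (percent) as quoted above for the Codebase QnA leaderboard.
scores = {
    "GPT-5.4": {"Codex CLI": 40.80, "Mini-SWE-Agent": 36.30},
    "Opus 4.6": {"Claude Code": 33.30, "Mini-SWE-Agent": 30.00},
}

GENERIC = "Mini-SWE-Agent"  # the generic scaffold every model was run on


def scaffold_gap(rows):
    """Native-minus-generic gap in percentage points (hypothetical helper)."""
    native = max(score for harness, score in rows.items() if harness != GENERIC)
    return round(native - rows[GENERIC], 2)


gaps = {model: scaffold_gap(rows) for model, rows in scores.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:.2f} points")
```

On these numbers the native scaffold is worth 4.5 points for GPT-5.4 and 3.3 points for Opus 4.6.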
A footnote accounts for the asymmetry: native scaffolds were used for "frontier closed models" only. The companion paper shows why this matters: native scaffolds generate 1.5x to 2x more tool calls per task, and Claude models invoke sub-agents in 96 to 99% of trials on Claude Code, with GPT models showing comparable rates on Codex CLI.
The scaffold question has an open-source answer taking shape. On April 2, Claw Code launched as a Python-Rust rewrite of the Claude Code harness, after Anthropic accidentally shipped its TypeScript source in an npm packaging error. It ships 19 native tools and multi-agent orchestration, the same scaffold primitives Scale's paper credits for the performance split.
Third-party entries have one row each, all on Mini-SWE-Agent: GLM-5 at 20.50%, Gemini 3.1 Pro at 13.50%, Kimi K2.5 at 13.10%. Gemini's case is distinct: Scale ran it natively on Gemini CLI and documented the results in the companion paper, where Gemini invoked sub-agents in just 1.1% of trials against at least 96% for Claude models. The native Gemini CLI scores never made the leaderboard.
OpenAI's GPT-5.5 tops the board under Anthropic's grader. A cross-vendor arrangement like this usually raises fairness questions in the opposite direction; here the judge had every structural reason to favor its own maker and didn't. A win under a rival's rubric is the hardest result to contest.
Ranking by best score when only frontier closed models get two scaffold choices means the leaderboard measures model-harness pairs, not models in isolation. GLM-5 at 20.50% and Gemini 3.1 Pro at 13.50% ran without the native scaffold, and the extra tool calls that come with it, that lifted GPT-5.4 by 4.5 points. No native-scaffold score for either model appears on the leaderboard.
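The effect of the sorting rule can be checked against the rows quoted in this article. A rough sketch, assuming those seven rows are the whole comparison (GPT-5.5 is left out because no score for it is given here); the `rank` function is a hypothetical stand-in for the leaderboard's sort, not Scale's code:

```python
# (model, scaffold, score in percent) for every row quoted in this article.
rows = [
    ("GPT-5.4", "Codex CLI", 40.80),
    ("GPT-5.4", "Mini-SWE-Agent", 36.30),
    ("Opus 4.6", "Claude Code", 33.30),
    ("Opus 4.6", "Mini-SWE-Agent", 30.00),
    ("GLM-5", "Mini-SWE-Agent", 20.50),
    ("Gemini 3.1 Pro", "Mini-SWE-Agent", 13.50),
    ("Kimi K2.5", "Mini-SWE-Agent", 13.10),
]


def rank(rows, scaffold=None):
    """Keep each model's best score, optionally restricted to one scaffold."""
    best = {}
    for model, harness, score in rows:
        if scaffold is not None and harness != scaffold:
            continue
        best[model] = max(score, best.get(model, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


by_best = rank(rows)                               # best score per model
by_common = rank(rows, scaffold="Mini-SWE-Agent")  # same scaffold for all

# Margin between Opus 4.6 and GLM-5 under each rule, in percentage points.
margin_best = round(dict(by_best)["Opus 4.6"] - dict(by_best)["GLM-5"], 2)
margin_common = round(dict(by_common)["Opus 4.6"] - dict(by_common)["GLM-5"], 2)
print(margin_best, margin_common)
```

For these rows the order of models is the same under both rules. What changes is the margin: Opus 4.6 leads GLM-5 by 12.8 points when ranked by best score and by 9.5 points when both are held to Mini-SWE-Agent.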
The SWE-Atlas leaderboard accepts live submissions. A Claw Code entry paired with a frontier model would be the first native open-source scaffold test on this dataset. If that entry doesn't appear before Q3, the 4.5-point performance split stays credited to the model.