The 7-Point Harness Gain That Doesn't Show Up on SWE-bench

On April 28, Fudan University's Jiahang Lin and colleagues published a framework that raises a coding agent's Terminal-Bench 2 score by 7.3 percentage points without retraining the model.

The paper, co-authored with researchers from Peking University and Shanghai Qiji Zhifeng Co., introduces Agentic Harness Engineering (AHE) atop a runtime called NexAU. Corresponding author Hang Yan works at Qiji, the industrial partner. The thesis: fix the weights, evolve everything around them.

What the Architecture Ships

NexAU decomposes the agent harness into seven file-level components: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agents, and long-term memory. Each lives in its own tracked file.
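One way to picture the seven-way split is as a record with one tracked file per component. The sketch below is illustrative only: the class name, the `harness/` directory, and every file name and extension are assumptions, not NexAU's actual layout.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class HarnessLayout:
    """Hypothetical layout: one file per harness component.

    The seven field names follow the article; the paths are invented
    for illustration and are not taken from NexAU.
    """
    system_prompt: Path = Path("harness/system_prompt.md")
    tool_descriptions: Path = Path("harness/tool_descriptions.yaml")
    tool_implementations: Path = Path("harness/tools.py")
    middleware: Path = Path("harness/middleware.py")
    skills: Path = Path("harness/skills.md")
    sub_agents: Path = Path("harness/sub_agents.yaml")
    long_term_memory: Path = Path("harness/memory.md")


def component_names(layout: HarnessLayout) -> list[str]:
    # Field order mirrors the order the components are listed in the text.
    return [f.name for f in fields(layout)]


layout = HarnessLayout()
print(component_names(layout))
```

Keeping each component in its own file means any rewrite shows up as an ordinary per-file diff, which is what makes the changes trackable.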

A meta-agent reads execution traces, finds failure patterns, and rewrites the components that produced them. Ten iterations, then stop. The evolved harness is then frozen and reused with other model families. The paper reports gains of +5.1 to +10.1 percentage points on four alternate models from a single harness developed on GPT-5.4.
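The loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: `evolve`, `run_tasks`, and `propose_edits` are hypothetical names, the harness is modeled as a plain dict of file contents, and the sketch accepts every proposed rewrite unconditionally, which the paper may or may not do.

```python
from typing import Callable

# Hypothetical representation: component name -> file contents.
Harness = dict[str, str]


def evolve(
    harness: Harness,
    run_tasks: Callable[[Harness], list[str]],
    propose_edits: Callable[[Harness, list[str]], Harness],
    iterations: int = 10,
) -> Harness:
    """Run a fixed number of trace -> diagnose -> rewrite rounds.

    The model weights are never touched; only the harness changes.
    Assumption: every proposal is accepted, with no rollback step.
    """
    current = dict(harness)  # copy so the seed harness stays intact
    for _ in range(iterations):
        traces = run_tasks(current)               # execute tasks, collect traces
        current = propose_edits(current, traces)  # meta-agent rewrites components
    return current  # frozen result, ready to reuse with another model
```

A toy run makes the control flow visible: pass a `run_tasks` stub that returns a single trace and a `propose_edits` stub that appends one character to the system prompt, and the returned harness carries exactly ten appended characters.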

The Benchmark They Chose

The team evaluated on Terminal-Bench 2, an 89-task suite of sandboxed terminal operations. They ran the evaluation internally (k=2 rollouts per task, one-hour timeout), outside the public leaderboard's submission pipeline. NexAU0, the bare harness, scored 69.7% with GPT-5.4; after ten AHE iterations, the same model reached 77.0%.

The paper's comparison table names three systems: Codex-CLI at 71.9%, TF-GRPO at 72.3%, and ACE at 68.9%. The paper's abstract calls TF-GRPO and ACE "self-evolving baselines." All three are research codebases; none appears on the public leaderboard.

Missing from the table: Claude Opus 4.7 Adaptive and Cursor Composer 2.5, the board's second and third entries at 69.4% and 69.3%. Both sit within 0.4 points of the NexAU0 seed score. "Adaptive" is the leaderboard's label for the Opus 4.7 submission; what optimization Anthropic applies under that label is not publicly documented. Both commercial entries were measured through the public submission pipeline, not the paper's internal setup. The 77.0%-to-69.4% gap is not a controlled comparison; it is the one a commercial buyer would check first.

The board's #1 entry is GPT-5.5 at 82.0%, logged by OpenAI, 5 points above AHE's best reported score of 77.0%.

What SWE-bench Returns

On SWE-bench Verified (500 software-engineering patch tasks), AHE scored 75.6% with GPT-5.4 against a baseline of 75.2%, a 0.4-point gain. Token use per task fell from 526,000 to 461,000, a 12% cut that reduces API cost per run while leaving accuracy essentially flat.
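The arithmetic behind those two figures is easy to check. The sketch below uses only the numbers quoted above; the helper name is hypothetical, and converting percentages to task counts assumes each of the 500 tasks is scored once, which the article does not state.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a cut)."""
    return (after - before) / before


# Token use per task, as quoted in the article.
token_cut = relative_change(526_000, 461_000)

# Accuracy on SWE-bench Verified, in percent.
baseline_pct, ahe_pct = 75.2, 75.6
total_tasks = 500

# Assumption: one scored attempt per task, so percent maps to a task count.
tasks_diff = round(ahe_pct / 100 * total_tasks) - round(baseline_pct / 100 * total_tasks)

print(f"token change: {token_cut:.1%}")        # token change: -12.4%
print(f"extra tasks resolved: {tasks_diff}")   # extra tasks resolved: 2
```

Under that assumption, the 0.4-point difference corresponds to two tasks out of 500.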

The public Terminal-Bench 2 leaderboard carries 19 entries as of May 19. Neither AHE nor NexAU appears on it.

Where This Points

A 7.3-point gain on Terminal-Bench 2 and a 0.4-point gain on SWE-bench Verified are not equivalent evidence. One benchmark has 89 sandboxed terminal tasks; the other has 500 software-engineering patches. The harness calibrated well to the first; on the second, it saved tokens, but the evidence that it improves accuracy is thin.

The choice of comparison set is more than a matter of benchmarking convention. Codex-CLI, ACE, and TF-GRPO are all research-stage systems; none appears on the Terminal-Bench 2 leaderboard. Evaluating AHE against unlisted frameworks through internal infrastructure, while omitting the board's commercial entries, produces a ranking the public leaderboard would not show.

Watch whether the team submits AHE to the public Terminal-Bench 2 board. A 77.0% entry with GPT-5.4 would clear Cursor Composer 2.5 and Claude Opus 4.7 Adaptive and land at #2. GPT-5.5, already logged at 82.0%, sits 5 points ahead.
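The placement claim can be checked against the three leaderboard scores quoted in this article. The sketch assumes the internal 77.0% would reproduce unchanged through the public pipeline, which is exactly what remains untested; the function name is hypothetical.

```python
# Top three public entries as quoted in the article (name, score in percent).
top_three = [
    ("GPT-5.5", 82.0),
    ("Claude Opus 4.7 Adaptive", 69.4),
    ("Cursor Composer 2.5", 69.3),
]


def rank_if_submitted(score: float, board: list[tuple[str, float]]) -> int:
    """1-based rank a new entry would take: one plus the entries above it."""
    return 1 + sum(1 for _, existing in board if existing > score)


# Assumption: the paper's internal 77.0% carries over to the public board.
ahe_score = 77.0
rank = rank_if_submitted(ahe_score, top_three)
gap_to_first = top_three[0][1] - ahe_score

print(f"hypothetical rank: #{rank}")                 # hypothetical rank: #2
print(f"gap to #1: {gap_to_first:.1f} points")       # gap to #1: 5.0 points
```

Only the top three entries are needed here, because every lower-ranked entry scores below 69.3% and therefore cannot outrank a 77.0% submission.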
