Fudan NLP Ships Self-Evolving Harness, Withholds Debugger

Tao Gui's NLP-LI lab at Fudan University and Shanghai Qiji Zhifeng posted AHE on April 28, showing coding-agent scaffolding can self-improve over ten iterations without retraining the model.

Qiji Zhifeng is a Shanghai AI company and one of four co-founders of the Nex-AGI platform, alongside the Shanghai Innovation Institute, Mosi Intelligence, and KuafuAI. NexAU, the general-purpose agent substrate the paper builds on, is the alliance's open-source framework. Gui and Qiji Zhifeng engineer Zhenhua Han share co-corresponding authorship on the paper.

Pass@1 on Terminal-Bench 2's 89-task suite climbs from 69.7% to 77.0%, clearing OpenAI's human-designed Codex CLI at 71.9%. On SWE-bench Verified, AHE's evolved harness reaches 75.6% with 32% fewer tokens than a prompt-only baseline.

The architecture runs on NexAU. It decomposes the harness into seven editable file types: system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configurations, and long-term memory. An Evolve Agent reads distilled failure traces, proposes one edit per round, predicts the outcome, and checks that prediction against the next evaluation cycle.
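The propose, predict, and check loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's or NexAU's code: the names `Edit`, `RoundLog`, `evolve`, `propose`, and `evaluate` are hypothetical, the harness is modeled as a plain dictionary keyed by the seven file types, and the keep-or-revert rule (keep an edit only when the score rises) is an assumption about how rollback might behave.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The seven editable file types named in the article.
HARNESS_COMPONENTS = (
    "system_prompt",
    "tool_descriptions",
    "tool_implementations",
    "middleware",
    "skills",
    "sub_agent_configs",
    "long_term_memory",
)


@dataclass
class Edit:
    """One proposed change (hypothetical structure)."""
    component: str          # which of the seven file types is touched
    new_content: str        # replacement content for that file
    predicted_delta: float  # change in pass rate the proposer expects, in points


@dataclass
class RoundLog:
    """What happened in one round, so later proposals can see it."""
    component: str
    predicted_delta: float
    observed_delta: float
    kept: bool


Harness = Dict[str, str]


def evolve(
    seed: Harness,
    propose: Callable[[Harness, List[RoundLog]], Edit],
    evaluate: Callable[[Harness], float],
    rounds: int = 10,
) -> Tuple[Harness, float, List[RoundLog]]:
    """Apply one edit per round and check each prediction against the next evaluation."""
    harness = dict(seed)          # copy so the seed harness is never mutated
    score = evaluate(harness)     # baseline score before any edits
    history: List[RoundLog] = []

    for _ in range(rounds):
        edit = propose(harness, history)
        if edit.component not in HARNESS_COMPONENTS:
            raise ValueError(f"unknown harness component: {edit.component}")

        candidate = {**harness, edit.component: edit.new_content}
        new_score = evaluate(candidate)
        observed = new_score - score

        # Assumed rollback rule: keep the edit only if the score improved.
        kept = observed > 0
        history.append(RoundLog(edit.component, edit.predicted_delta, observed, kept))
        if kept:
            harness, score = candidate, new_score

    return harness, score, history
```

In this sketch the gap between `predicted_delta` and `observed_delta` in each `RoundLog` is what a proposer could inspect before choosing its next edit; how AHE actually uses that signal is not specified here.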

Compressing millions of trajectory tokens into those failure traces is the job of the Agent Debugger, hosted under Qiji Zhifeng's GitHub organization. The README states it is "partially open-sourced; due to company strategy, it cannot be fully open-sourced at this time." The seed harness, evolve agent, experiment configs, and rollback structures ship publicly; the Debugger does not.

What the Comparison Table Omits

Terminal-Bench 2, released November 2025 by Stanford University and the Laude Institute with Snorkel AI contributing, is not the benchmark the field is converging on for coding-agent comparison.

SWE-bench Pro, which OpenAI now recommends over SWE-bench Verified, was built to address contamination. Claude Opus 4.5 scores 80.9% on Verified and 45.9% on Pro, a 35-point drop for the same model on the same type of task. AHE's 75.6% on Verified is exposed to the same discount until a Pro score says otherwise.
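The arithmetic behind that comparison can be worked through in a few lines. The sketch below uses only the two Claude Opus 4.5 figures and AHE's Verified score quoted above; the function names are hypothetical, and the "implied" range is a back-of-the-envelope extrapolation under the assumption that AHE's harness would lose either the same number of points or the same fraction of its score. It is not a measured result.

```python
from typing import Tuple


def verified_to_pro_gap(verified: float, pro: float) -> Tuple[float, float]:
    """Return the absolute drop in points and the fraction of the score retained."""
    drop_points = round(verified - pro, 1)
    retention = pro / verified
    return drop_points, retention


def implied_pro_range(verified: float, drop_points: float, retention: float) -> Tuple[float, float]:
    """Hypothetical Pro range if another system saw the same absolute or relative drop."""
    by_points = round(verified - drop_points, 1)  # same number of points lost
    by_ratio = round(verified * retention, 1)     # same share of the score kept
    return min(by_points, by_ratio), max(by_points, by_ratio)


# Figures quoted in the article for Claude Opus 4.5.
drop, retention = verified_to_pro_gap(verified=80.9, pro=45.9)

# AHE's reported Verified score; the resulting range is an illustration only.
low, high = implied_pro_range(verified=75.6, drop_points=drop, retention=retention)

print(f"Drop: {drop} points, retention: {retention:.1%}")
print(f"Illustrative Pro range for a 75.6% Verified score: {low}% to {high}%")
```

Under those assumptions a 75.6% Verified score would land somewhere in the low 40s on Pro, which is why the missing Pro number matters.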

The paper cites MLE-bench, LiveCodeBench, BigCodeBench, and SWE-lancer in its related-work section; none appears in the results table. The baselines AHE beats, ACE at 68.9% and TF-GRPO at 72.3%, both start from the same NexAU seed harness. SWE-Agent and OpenHands, the field's two most-cited open-source coding-agent frameworks, appear in related work and nowhere else.

The Cost Asymmetry

The evolved harness transfers to four model families without re-evolution: Qwen-3.6-plus gains 6.3 points, DeepSeek-v4-flash gains 10.1, Gemini-3.1-flash-lite-preview gains 5.1, and GPT-5.4 at medium reasoning gains 2.3.
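A quick tally of those four gains shows how uneven the transfer is. The snippet below is a minimal sketch using only the numbers quoted above; the `summarize` helper is hypothetical, and the unweighted mean is an assumption made for illustration, not a figure reported by the paper.

```python
from statistics import mean
from typing import Dict

# Point gains quoted in the article for the evolved harness, without re-evolution.
TRANSFER_GAINS: Dict[str, float] = {
    "Qwen-3.6-plus": 6.3,
    "DeepSeek-v4-flash": 10.1,
    "Gemini-3.1-flash-lite-preview": 5.1,
    "GPT-5.4 (medium reasoning)": 2.3,
}


def summarize(gains: Dict[str, float]) -> Dict[str, object]:
    """Unweighted mean plus the models with the largest and smallest gains."""
    return {
        "mean_gain": round(mean(gains.values()), 2),
        "largest": max(gains, key=gains.get),
        "smallest": min(gains, key=gains.get),
        "spread": round(max(gains.values()) - min(gains.values()), 1),
    }


summary = summarize(TRANSFER_GAINS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The spread between the largest and smallest gain is wider than the average gain itself, so the headline "transfers to four model families" covers quite different outcomes per model.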

Those cross-family gains shift the competition from model training to the scaffolding around the model. A ten-iteration AHE run costs benchmark compute rather than the $100 million-plus a frontier training run requires. Qiji Zhifeng ships the scaffold recipe while holding back the Debugger, whose diagnostic inputs are what make those ten iterations worthwhile.

The SWE-bench Pro leaderboard has no AHE or NexAU entry as of mid-May 2026. A 77.0% on Terminal-Bench with no corresponding Pro score leaves the contamination question open. If a Pro submission appears before Q3 2026, the transfer claim can be checked against it; if it doesn't, 77.0% is where the story stops.
