Tech
Seven Files, Ten Iterations, One Missing Comparison
A Chinese research consortium's self-evolving harness claims 77% on Terminal-Bench 2. Its nearest rival hits 76.4% and is absent from the comparison table.

Fudan and Peking University researchers published a framework April 28 that automatically rewrites the wrapper code around a fixed coding model. The framework, called AHE (Agentic Harness Engineering), comes from a collaboration with Shanghai Qiji Zhifeng Co., the industry partner that built NexAU, the agent infrastructure AHE evolves. Its GitHub repository exposes the harness as seven git-tracked files: system prompt, tool description, tool implementation, middleware, skill, sub-agent configuration, and long-term memory.
Shanghai Qiji Zhifeng is a commercial member of Nex AGI, a public-private alliance initiated by the Shanghai Innovation Institute, a research body partnered with 31 Chinese universities.
A meta-agent rewrites those files in rounds, pairing each edit with a prediction that the next round's test results either verify or refute. The loop runs on a model the paper calls GPT-5.4, citing a March 2026 OpenAI announcement.
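The round structure described above can be sketched in a few lines of Python. This is an illustrative toy, not AHE's code: the file identifiers, the `Edit` and `Record` types, the `evolve` function, and the rule of keeping an edit only when the score does not drop are all assumptions made for the example, and the stand-in `toy_propose` and `toy_evaluate` functions replace the meta-agent and the benchmark run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical identifiers for the seven harness components named in the article.
HARNESS_FILES = [
    "system_prompt", "tool_description", "tool_implementation",
    "middleware", "skill", "sub_agent_config", "long_term_memory",
]

Harness = Dict[str, str]


@dataclass
class Edit:
    file: str
    new_content: str
    predicts_improvement: bool  # the prediction paired with the edit


@dataclass
class Record:
    round_no: int
    edit: Edit
    score_before: float
    score_after: float

    @property
    def verified(self) -> bool:
        # The next round's result either verifies or refutes the prediction.
        improved = self.score_after > self.score_before
        return improved == self.edit.predicts_improvement


def evolve(
    harness: Harness,
    propose: Callable[[Harness, List[Record]], Edit],
    evaluate: Callable[[Harness], float],
    rounds: int = 10,
) -> Tuple[Harness, List[Record]]:
    history: List[Record] = []
    score = evaluate(harness)
    for round_no in range(1, rounds + 1):
        edit = propose(harness, history)
        candidate = {**harness, edit.file: edit.new_content}
        new_score = evaluate(candidate)
        history.append(Record(round_no, edit, score, new_score))
        # Assumed acceptance rule: keep the edit unless the score drops.
        if new_score >= score:
            harness, score = candidate, new_score
    return harness, history


# Toy stand-ins for the meta-agent and the benchmark run.
def toy_evaluate(harness: Harness) -> float:
    return sum("tuned" in text for text in harness.values()) / len(harness)


def toy_propose(harness: Harness, history: List[Record]) -> Edit:
    for name in HARNESS_FILES:
        if "tuned" not in harness[name]:
            return Edit(name, harness[name] + " tuned", predicts_improvement=True)
    first = HARNESS_FILES[0]
    return Edit(first, harness[first], predicts_improvement=False)


seed = {name: "seed" for name in HARNESS_FILES}
final, history = evolve(seed, toy_propose, toy_evaluate, rounds=10)
print(toy_evaluate(final), sum(r.verified for r in history))
```

The history of verified and refuted predictions is passed back to `propose` each round, which is the hook a real meta-agent would use to learn from earlier edits.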
The paper targets Terminal-Bench 2, an 89-task evaluation of agent performance in terminal environments. The unevolved seed harness opens at 69.7% pass@1; ten iterations later, AHE reaches 77.0%. Codex-CLI, OpenAI's open-source terminal coding agent and the reference harness for GPT-5.4, is the human-designed baseline at 71.9%.
Meta-Harness, posted March 30, is labeled "concurrent work" in AHE's GitHub description but is absent from its comparison table. The system comes from Yoonho Lee and Chelsea Finn, who leads the IRIS lab at Stanford, with Omar Khattab at MIT. Excluding concurrent work from formal comparisons is standard practice in fast-moving ML research; papers rarely run cross-system experiments against work that appeared weeks before submission.
On Claude Opus 4.6, Meta-Harness reaches 76.4% on the same 89 tasks, 0.6 points below AHE's figure but measured on a different base model. With 89 tasks, a single task is worth about 1.1 points, so the gap is smaller than one task; Terminal-Bench 2's sample is too small to separate scores that close.
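The arithmetic behind that sample-size caveat can be sketched as follows. This is a rough back-of-envelope check, not either paper's analysis: it assumes each score is a single pass over 89 independent tasks and treats the two results as independent binomial proportions. If the reported figures average several trials per task, the real uncertainty would be somewhat smaller, but the conclusion is unlikely to change for a gap this narrow.

```python
import math

N_TASKS = 89   # Terminal-Bench 2 task count, from the article
AHE = 0.770    # AHE pass@1 on GPT-5.4
META = 0.764   # Meta-Harness on Claude Opus 4.6


def binomial_se(p: float, n: int) -> float:
    """Standard error of a pass rate p measured over n independent tasks."""
    return math.sqrt(p * (1 - p) / n)


# How many percentage points one task is worth on this benchmark.
points_per_task = 100 / N_TASKS

# Standard error of the difference between two independent proportions.
se_diff = math.sqrt(binomial_se(AHE, N_TASKS) ** 2 + binomial_se(META, N_TASKS) ** 2)

gap = AHE - META
z = gap / se_diff  # gap expressed in standard errors

print(f"one task = {points_per_task:.2f} points")
print(f"gap = {gap * 100:.1f} points, SE of difference = {se_diff * 100:.1f} points")
print(f"z = {z:.2f}")
```

Under these assumptions the gap is a small fraction of one standard error, far short of the roughly two standard errors conventionally needed to call a difference significant.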
Omitting Meta-Harness closes off the one comparison that would test either team's core claim: that the harness, rather than the base model, accounts for the top-of-table position.
AHE's evolved harness has not been run on Claude Opus 4.6, the base model Meta-Harness used. Until it is, the 0.6-point lead cannot be separated from differences between the two base models.