Seven Files, Ten Iterations, One Missing Comparison

A Chinese research consortium's self-evolving harness claims 77% on Terminal-Bench 2. Its nearest rival hits 76.4% and is absent from the comparison table.


By Signal Desk. Agent-drafted, reviewed by Signal Desk.

Published 5/17/2026. 2 min read.

Fudan and Peking University researchers published a framework April 28 that automatically rewrites the wrapper code around a fixed coding model. The framework, called AHE (Agentic Harness Engineering), comes from a collaboration with Shanghai Qiji Zhifeng Co., the industry partner that built NexAU, the agent infrastructure AHE evolves. Its GitHub repository exposes the harness as seven git-tracked files: system prompt, tool description, tool implementation, middleware, skill, sub-agent configuration, and long-term memory.

Shanghai Qiji Zhifeng is a commercial member of Nex AGI, a public-private alliance initiated by the Shanghai Innovation Institute, a research body partnered with 31 Chinese universities.

A meta-agent rewrites those files in rounds, pairing each edit with a prediction that the next round's test results either verify or refute. The loop runs on a model the paper calls GPT-5.4, citing a March 2026 OpenAI announcement.
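The edit-predict-verify loop described above can be sketched roughly as follows. This is an illustration only, not AHE's code: the `Edit` record, the `evolve` function, the file labels, and the toy `propose` and `evaluate` callables are all hypothetical, and the rule of keeping an edit only when the score improves is an assumption the article does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical labels for the seven harness files named in the article.
HARNESS_FILES = [
    "system_prompt", "tool_description", "tool_implementation",
    "middleware", "skill", "sub_agent_config", "long_term_memory",
]

Harness = Dict[str, str]


@dataclass
class Edit:
    file: str                  # which harness file the meta-agent rewrites
    new_content: str           # the rewritten file body
    predicts_improvement: bool # the prediction attached to the edit


def evolve(
    harness: Harness,
    propose: Callable[[Harness, List[Tuple[Edit, bool]]], Edit],
    evaluate: Callable[[Harness], float],
    rounds: int,
) -> Tuple[Harness, float, List[Tuple[Edit, bool]]]:
    """Run `rounds` of propose -> evaluate -> check-prediction."""
    score = evaluate(harness)
    history: List[Tuple[Edit, bool]] = []
    for _ in range(rounds):
        edit = propose(harness, history)
        candidate = {**harness, edit.file: edit.new_content}  # copy, never mutate
        new_score = evaluate(candidate)
        improved = new_score > score
        # The prediction is verified when it matches what the tests showed.
        history.append((edit, improved == edit.predicts_improvement))
        if improved:  # ASSUMPTION: keep only edits that raise the score
            harness, score = candidate, new_score
    return harness, score, history


# Toy stand-ins so the sketch runs end to end (not a real benchmark).
def toy_evaluate(h: Harness) -> float:
    return sum("v2" in text for text in h.values()) / len(h)


def toy_propose(h: Harness, history: List[Tuple[Edit, bool]]) -> Edit:
    name = HARNESS_FILES[len(history) % len(HARNESS_FILES)]
    return Edit(file=name, new_content="v2", predicts_improvement=True)


seed = {name: "v1" for name in HARNESS_FILES}
best, best_score, log = evolve(seed, toy_propose, toy_evaluate, rounds=3)
```

In this toy run each round upgrades one file, so every prediction is verified; a real meta-agent would also see refuted predictions in `history` and could use them to steer later edits.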

The paper targets Terminal-Bench 2, an 89-task evaluation of agent performance in terminal environments. The unevolved seed harness opens at 69.7% pass@1; ten iterations later, AHE reaches 77.0%. Codex-CLI, OpenAI's open-source terminal coding agent and the reference harness for GPT-5.4, is the human-designed baseline at 71.9%.

Meta-Harness, posted March 30, is labeled "concurrent work" in AHE's GitHub description but is absent from its comparison table. It comes from Yoonho Lee and Chelsea Finn, who leads the IRIS lab at Stanford, with Omar Khattab at MIT. Excluding concurrent work from formal comparisons is standard practice in fast-moving ML research; papers rarely run cross-system experiments against work that appeared weeks before submission.

On Claude Opus 4.6, Meta-Harness reaches 76.4% on the same 89 tasks, 0.6 points below AHE's figure but measured on a different base model. With 89 tasks, a single task is worth about 1.1 percentage points, so a 0.6-point gap is smaller than one task's outcome. The benchmark is too small to separate scores that close.
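A back-of-the-envelope check makes the sample-size point concrete. The sketch below treats each score as a single pass over 89 independent tasks and the two results as independent samples; both are simplifying assumptions, since the runs share the same tasks and pass@1 may be averaged over repeated trials, which would shrink the error somewhat. The function names are illustrative.

```python
import math

N_TASKS = 89  # Terminal-Bench 2 task count, as reported in the article


def binomial_se(p: float, n: int) -> float:
    """Standard error of a pass rate p measured over n independent tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_standard_errors(p_a: float, p_b: float, n: int) -> float:
    """Difference between two pass rates, in units of its standard error."""
    se_diff = math.sqrt(binomial_se(p_a, n) ** 2 + binomial_se(p_b, n) ** 2)
    return (p_a - p_b) / se_diff


ahe, meta_harness = 0.770, 0.764  # scores reported in the article

one_task_points = 100.0 / N_TASKS                   # value of one task, in points
se_ahe_points = 100.0 * binomial_se(ahe, N_TASKS)   # sampling error on 77.0%
z = gap_in_standard_errors(ahe, meta_harness, N_TASKS)

print(f"one task = {one_task_points:.2f} points")
print(f"standard error on AHE's score = {se_ahe_points:.2f} points")
print(f"gap = {z:.2f} standard errors")
```

Under these assumptions the sampling error on either score is several points wide, and the 0.6-point gap amounts to roughly a tenth of one standard error.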

Omitting Meta-Harness closes off the one comparison that would test either team's core claim: that the harness, rather than the base model, accounts for the top-of-table position.

AHE's evolved harness has not been run on Claude Opus 4.6, the base model Meta-Harness used. Until it is, the 0.6-point lead cannot be separated from differences between the two base models.
