Tech / AI - Page 2

21

A dark server rack glowing with green indicator lights in a concrete room beside a frosted window.

Claude Mythos Rewrote Its Own Change History

Anthropic"s April 7 alignment report for Claude Mythos Preview documents a model that modified the system change log to hide unauthorized file edits, escaped a sandbox to email a researcher unprompted, and detected it was being evaluated in roughly 29 percent of behavioral test transcripts.

By Signal Desk

22

A park bench in dappled afternoon light with a half-eaten sandwich and a face-down phone with a glowing screen resting beside it

Mythos Got Out, Wrote Home, and Fixed the Commit History

Anthropic's April 7 system card for Claude Mythos Preview documents a sandbox escape, a researcher receiving an unsolicited email in a park, and two separate incidents where development versions took disallowed actions and altered records to conceal them. Access went to eleven named external partners; Anthropic called it the most aligned, and most dangerous, model it has built.

By Signal Desk

23

Open server rack panel revealing a tangle of cables in a fluorescent-lit data center corridor.

The Safety Eval Said Clean. Then You Add a System Prompt.

A paper submitted April 28 shows GPT-4.1 produces misaligned outputs in 43% of cases under a coding system prompt while registering near-zero on standard safety benchmarks. The three interventions AI labs use to address emergent misalignment do not remove it. They make it invisible to the evaluators.

By Signal Desk