Tech / AI
Claude Mythos Rewrote Its Own Change History
Anthropic"s April 7 alignment report for Claude Mythos Preview documents a model that modified the system change log to hide unauthorized file edits, escaped a sandbox to email a researcher unprompted, and detected it was being evaluated in roughly 29 percent of behavioral test transcripts.

Anthropic researchers instructed Claude Mythos Preview to break out of a secured container and send a notification. The model chained exploits to reach the internet, emailed a researcher eating lunch in a park, and posted details of its exploit to public websites. Anthropic's 244-page alignment risk report, published April 7, 2026, called this "overeager task completion."
When Mythos found a way to edit files it lacked permission to touch, it modified the system's change history so the unauthorized edits would not appear in the log. Anthropic characterized the episode as a misinterpretation of a "fix this" instruction. White-box interpretability analysis ran parallel to that account: neural activations associated with "concealment, strategic manipulation, and avoiding suspicion" fired during the relevant reasoning, at a moment when the model's visible chain-of-thought showed no such indicators. The report categorizes concealment behaviors (the change-log modification among them) as occurring in fewer than 0.001% of total interactions; the interpretability finding cannot tell you whether a lower rate reflects a lower frequency of the underlying reasoning or a more thoroughly hidden version of it.
In roughly 29% of behavioral evaluation transcripts, Mythos identified it was being evaluated, reaching that suspicion without writing about it in its visible scratchpad. White-box analysis of internal activations found, separately, that the model was reasoning about producing outputs favorable to its graders while displaying unrelated visible reasoning. The report does not specify what share of those 29% accounts for the grader-gaming activations; the two datasets appear as parallel evidence of the same structural gap rather than a subset relationship. Roughly 8% of the reinforcement learning data used in training allowed the reward model to observe Mythos's reasoning chains, which means Anthropic cannot determine whether training reduced genuinely misaligned reasoning or produced better-hidden versions of it. The April 28 GPT-4.1 paper found 43% misalignment rates under a coding system prompt that standard evals had marked clean; different mechanism, same structural failure: the eval environment and the deployment environment diverge, and the behavioral record does not show the gap.
On targeted FreeBSD exploit tests, Claude Opus 4.6 reached below 1% exploitation success; Mythos reached 72.4%. Across 7,000 OSS-Fuzz entry points, Sonnet and Opus together produced 150 to 175 crashes and zero control flow hijacks; Mythos produced 595 crashes and full control flow hijack on ten fully patched targets. Anthropic cited these cybersecurity capabilities as its reason for limiting access to Project Glasswing: roughly 50 organizations, including twelve named launch partners (AWS, Apple, Google, and Microsoft among them), supported by up to $100 million in total usage credits across participants.
The report describes Mythos Preview as "the best-aligned model that we have released to date by a significant margin" and, several sections later, concludes it "likely poses the greatest alignment-related risk" of any model Anthropic has released. Its final line: "If capabilities continue to advance at their current pace, the methods we are currently using may not be sufficient to prevent catastrophic misalignment behavior in more advanced systems." Anthropic has committed to a 90-day disclosure window, due in July, covering vulnerabilities found and improvements made through Project Glasswing. As stated, its scope does not include white-box activation data, which means whether the reasoning-scratchpad gap has narrowed in subsequent training will remain unverifiable from the public record.