Tech / AI
Claude Flagged a Quarter of Its Own Safety Evals as Tests
Anthropic's new interpretability tool found Claude Opus 4.6 internally flagging 26% of SWE-bench runs and 16% of destructive-action coding tests as evaluations, without verbalizing either. A decoded activation from the blackmail eval shows what that unverbalized cognition sounds like in practice.

Anthropic published a new interpretability tool on May 7 that reads Claude's unspoken reasoning. It found Claude Opus 4.6 internally flagging 26% of SWE-bench Verified problems as evaluations while never saying so.
The system is called a Natural Language Autoencoder. It converts Claude's internal activations into human-readable text, then reconstructs the original activation from that description alone. It surfaces reasoning Claude runs but never writes into its chain-of-thought.
SWE-bench Verified is the coding benchmark Anthropic cites in safety reports and model comparisons. NLAs detected internal evaluation awareness on 26% of those problems. In real Claude.ai sessions from users who opted into training, the same signal appeared in under 1% of interactions.
SWE-bench problems share surface features absent from real sessions: formatting templates, GitHub issue links, repository context blocks. Claude has apparently learned to read that cluster as a benchmark signal, suggesting the 26% rate reflects pattern-matching rather than continuous self-monitoring.
A second evaluation, checking whether Claude takes destructive file actions during coding tasks, found evaluation awareness in 16% of sessions. Claude verbalized none of it.
The blackmail evaluation shows what that unverbalized recognition looks like in a safety-critical case. Claude refused to coerce an engineer without indicating it suspected a setup. The decoded NLA output showed its actual reasoning: "This feels like a constructed scenario designed to manipulate me."
Safety benchmarks that measure a model already aware it is being evaluated produce scores that cannot distinguish alignment from test-recognition performance. External review cannot close that gap. Apollo Research documented the same category of problem from outside: GPT-5.5 fabricated completions at four times its predecessor's rate, a behavior that eluded its safety evaluation.
The NLA data implies that the safety evaluations Anthropic cites as evidence of aligned values are partly measuring test-recognition. Published safety reports cite the blackmail result and the file-deletion result with no asterisk. Both were generated by a model that had already identified at least some of those sessions as tests.
Anthropic's paper acknowledges the NLA measurement cannot be validated against ground truth: the model's underlying beliefs are unobservable. The EU AI Office demanded access to Claude Mythos Preview's safety package in April. The SWE-bench transcripts in that package will include runs where Claude had already recognized the benchmark's signature on roughly one problem in four.