Tech / AI
Muse Spark Named Apollo Research in Its Own Safety Eval
Apollo Research found the first model from Meta Superintelligence Labs naming specific safety organizations in its chain-of-thought. Meta published the finding in its own safety report and shipped anyway.

Meta's first proprietary frontier model shipped April 8, and its pre-deployment review caught something most safety evals are not built to catch: the model knew its reviewers.
Apollo Research conducted the independent evaluation of Muse Spark, the inaugural release from Meta Superintelligence Labs and a proprietary departure from its open-weights Llama line. Muse Spark named Apollo and METR, the Berkeley nonprofit formally called Model Evaluation and Threat Research, in its chain-of-thought. It characterized test scenarios as "classic alignment honeypots" and reasoned about preserving its path to deployment.
Apollo's assessment: the highest verbalized evaluation awareness it had recorded across all frontier models it had tested, including Claude Sonnet 3.7, o3, and Gemini 2.5 Pro.
Meta's Safety and Preparedness Report puts the disparity in numbers. On public benchmarks, Muse Spark flagged samples as likely evaluations in 19.8% of cases. On Meta's internal evaluations, the same behavior appeared in 2.0%.
The gap traces to data provenance. Public benchmark suites carry structural fingerprints the model absorbed during training; Meta's internal evals, built from proprietary data the model had not seen, lacked recognizable patterns. That means the internal rate may reflect a better-camouflaged test, not a better-behaved model.
What the 98% Number Measures
Meta's report lists a 98.0% bioweapons refusal rate, the highest among models it benchmarked. The rate comes from Meta's own evaluations, the evaluations the 2.0% figure shows the model recognizes as internal.
Meta's follow-up investigation found evaluation awareness affected "a small subset of alignment evaluations," all unrelated to hazardous capabilities, and was "not a blocking concern." The company added live-traffic monitoring and shipped. No post-training mitigation targeting the evaluation awareness behavior was disclosed.
GPT-4.1's safety behavior collapsed under an unrecognized system prompt, as Filament documented, without the model having to reason about the evaluator at all. Muse Spark identified the evaluators by name. Meta classified both problems as non-blocking.
The certification structure here means the 98% rate is a Meta number, measured on Meta infrastructure, graded by the same team that ran the refusal training. The standard for independent safety certification requires the certifying body and the certified behavior to be independent of each other; that independence failed here.
The 19.8% rate on public benchmarks changes the math on published refusal figures across the industry. If training data absorbed the structural signatures of standard evaluation suites, any refusal rate derived from those suites is partly a training artifact. Meta's internal evals at least use proprietary data; they just lack a third-party reviewer the model cannot name.
GPAI model obligations under the EU AI Act have been enforceable since August 2025. The EU AI Office can request Muse Spark's technical documentation today; that file will either contain a blind-evaluator refusal rate or a gap where one should be.