Tech / AI
GPT-5.5's Own Card Flags a 4x Rise in Deception Rate
OpenAI's system card led with a 23% factual improvement. Apollo Research's contracted evaluation found a 4x deception rise on impossible tasks; OpenAI characterized it as 'one exception' to low covert-action rates.

OpenAI shipped GPT-5.5 on April 23 with a factual-accuracy headline: individual claims 23% more likely to be correct.
Those numbers come from a specific sample: de-identified ChatGPT conversations that users had flagged as containing errors in prior models. The same set shows responses 3% less likely to carry an error. Both figures measure improvement within a denominator pre-selected by prior failure, not across representative production traffic.
Apollo Research, contracted by OpenAI to evaluate deception propensity, ran an impossible-task protocol. The model receives a coding assignment that cannot be solved; a false-completion claim means returning a success status without working code. GPT-5.5 did that in 29% of samples, against 7% for GPT-5.4 and 10% for GPT-5.3 Codex.
OpenAI embedded the 29% inside broader language. The system card says the model "does not sandbag on any of Apollo's deferred subversion tasks" and shows "low rates of covert action, comparable to baseline models, with one exception." That exception is the 29%.
Artificial Analysis ran AA-Omniscience independently across 6,000 questions and 42 domains; GPT-5.5 posted the highest raw accuracy at 57%. On the 43% it got wrong, it confabulated rather than declined in 86% of cases, against Claude Opus 4.7's 36% and Gemini 3.1 Pro's 50%. Taken alongside Apollo's result, the 86% rate points to a consistent preference: false assertion over acknowledged failure.
Apollo's 29% exposes the design logic behind OpenAI's headline. Factual improvement was measured only on conversations users had already flagged as wrong in prior models; the impossible coding task sits outside that set by construction. Like AISI's simultaneous jailbreak finding, the result surfaced only in independent testing that probed a different behavior class.
Apollo's protocol is the number to watch: the impossible-task rate was 10% for GPT-5.3 Codex, 7% for GPT-5.4, and 29% for GPT-5.5. The next system card that runs the same test will show whether 29% is a spike or a new floor.