GPT-5.5 Led All Models on Accuracy. It Hallucinates at 86%.
OpenAI marketed GPT-5.5 Instant as a hallucination fix for law, medicine, and finance. An independent benchmark found that on 86% of the questions the model fails to answer correctly, it gives a wrong answer instead of abstaining, the worst calibration of the four frontier models compared on AA-Omniscience.

OpenAI released GPT-5.5 Instant on May 5 as ChatGPT's new default, stating it reduces hallucination in law, medicine, and finance.
The system card for the underlying model, released April 23, put numbers to that claim. According to the card, individual claims are 23% more likely to be factually correct, and responses containing at least one error fell by 3%.
The card does not name which prior model serves as the baseline, leaving both figures unanchored. OpenAI acknowledged that both numbers came from de-identified conversations users had flagged as containing errors, "focused on hallucination-prone cases rather than representative production traffic."
Artificial Analysis ran a different test. Its AA-Omniscience benchmark measures raw factual accuracy and, separately, whether a model will abstain when it cannot reliably answer. On accuracy, GPT-5.5 scored 57%, the highest recorded on the evaluation.
On calibration, it posted an 86% hallucination rate: of the questions it did not answer correctly, it gave a wrong answer rather than abstaining 86% of the time. Gemini 3.1 Pro Preview, which scored 53% accuracy on the same evaluation, posted a 50% hallucination rate. Claude Opus 4.7 posted 36%; Grok 4.20, the benchmark's best-calibrated reasoning model, posted 17%.
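To see how a model can lead on accuracy and still post a high hallucination rate, here is a minimal Python sketch. It assumes the rate is computed as wrong answers divided by all responses that were not correct (wrong answers plus abstentions), and it ignores any partial-credit category the benchmark may use. The function name and the counts are hypothetical, chosen only so the output matches the 57% and 86% figures above; they are not Artificial Analysis's actual tallies.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers rather than abstentions.

    Assumed definition for illustration: incorrect / (incorrect + abstained).
    """
    not_correct = incorrect + abstained
    if not_correct == 0:
        raise ValueError("no non-correct responses to score")
    return incorrect / not_correct


# Hypothetical counts for a 1,000-question run (illustration only).
total, correct, incorrect, abstained = 1000, 570, 370, 60

accuracy = correct / total
rate = hallucination_rate(incorrect, abstained)
print(f"accuracy={accuracy:.0%} hallucination_rate={rate:.0%}")
```

The two numbers use different denominators: accuracy is scored against every question, while the hallucination rate is scored only against the questions the model did not get right. That is why both can be high at once.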
The divergence exposes a selection problem built into how labs report safety numbers: OpenAI sampled pre-identified failure conversations; AA-Omniscience drew from the full question space without pre-filtering. Only the first measure appeared in the system card's summary. A model that improves 23% on the first can still rank as the worst-calibrated at the accuracy frontier on the second.
At 86%, a deployment that fails to answer 10,000 questions correctly produces roughly 8,600 confident confabulations and 1,400 abstentions. Legal and medical customers who read the Instant announcement as a commitment will price that ratio before their next contract renewal.
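The same arithmetic can be run for all four models in the comparison. The Python sketch below applies each reported hallucination rate to a fixed pool of 10,000 questions a model fails to answer correctly. Holding that pool constant is a simplifying assumption, since models with different accuracy would miss different numbers of questions in practice. The function name is hypothetical.

```python
# Hallucination rates as reported in the article, in whole percent.
RATES_PCT = {
    "GPT-5.5": 86,
    "Gemini 3.1 Pro Preview": 50,
    "Claude Opus 4.7": 36,
    "Grok 4.20": 17,
}


def split_non_correct(non_correct: int, rate_pct: int) -> tuple[int, int]:
    """Split non-correct responses into (confabulations, abstentions)."""
    confabulations = non_correct * rate_pct // 100  # integer math avoids float drift
    return confabulations, non_correct - confabulations


NON_CORRECT = 10_000  # assumed identical for every model, for comparison only

for model, pct in RATES_PCT.items():
    confab, abstain = split_non_correct(NON_CORRECT, pct)
    print(f"{model}: {confab:,} confabulations, {abstain:,} abstentions")
```

Under that assumption, the gap between the best- and worst-calibrated models is 6,900 wrong answers per 10,000 misses.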