Why AI Models Hallucinate

Last week OpenAI published a report detailing why AI models hallucinate. AI models don't really "hallucinate" in the human sense of the word. They generate incorrect or misleading information because they are designed to predict the next word in a sequence based on patterns learned from vast amounts of training data. They don't understand truth or reality; they simply generate text that is statistically likely to follow from the input they receive.
The reason that AI models hallucinate isn't that the AI is "broken" or the math behind the models is wrong. Instead, the researchers claim that hallucinations are a predictable, systemic outcome of how we train and, more importantly, how we test these systems. In short, the models have been taught that it’s better to guess than to admit they don’t know the answer.
Abstract
"Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such “hallucinations” persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty..."
The Pretraining Problem - What is a "fact"?
The first part of OpenAI's argument is that hallucinations are a natural consequence of the way language models are pretrained. During pretraining, models are fed vast amounts of text data and learn to predict the next word in a sequence. It is important to understand that training a model is based on statistical patterns in the data, not on an underlying understanding of "truth" or facts.
So what is a "fact" to a language model? The model learns to generate text based on the frequency and context of words in the training data. This means that if "the sky is orange" appears frequently in the training data, the model will learn to predict "orange" after the tokens "the sky is". To an AI model, a "fact" is simply a pattern of words that it has seen often during training.
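To make that concrete, here's a toy sketch in Python of "prediction as pattern counting." It is not how real LLMs work internally (they use neural networks over tokens, not raw word counts), but it captures the statistical idea: the most frequent continuation wins, whether or not it's true.

```python
from collections import Counter, defaultdict

# Toy "training data": the pattern that appears most often wins,
# regardless of whether it is true.
corpus = [
    "the sky is orange",
    "the sky is orange",
    "the sky is orange",
    "the sky is blue",
]

# Count which word follows each context string.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 1):
        context = " ".join(words[: i + 1])
        next_word_counts[context][words[i + 1]] += 1

def predict(context: str) -> str:
    """Return the continuation seen most often during 'training'."""
    counts = next_word_counts[context]
    return counts.most_common(1)[0][0] if counts else "<no pattern learned>"

print(predict("the sky is"))         # -> "orange": the frequent pattern, not the truth
print(predict("my middle name is"))  # -> "<no pattern learned>": low-frequency territory
```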
If I ask a model for Taylor Swift's middle name, it might respond with "Alison" because it's seen that pattern hundreds of times in its training data. But if I ask it what my middle name is, the model might generate a plausible-sounding but incorrect answer. It hasn't seen a pattern for my middle name, but it has learned to produce text that fits the context.
The challenge language models face is with low-frequency data. Without a pattern to learn from, the model falls back on generating a statistically plausible, but likely incorrect, sequence of words.
The paper introduces the concept of the "singleton rate" (the fraction of facts that appear only once in the training data) and argues that a model's hallucination rate on those facts will be at least as high as this rate. This demystifies the origin of hallucinations, framing them as a predictable outcome of statistical learning under uncertainty.
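Here's a rough sketch of what a singleton rate might look like on a toy dataset. The data and the simple exact-match counting are made up for illustration; the paper defines the rate more formally.

```python
from collections import Counter

# Rough sketch of the "singleton rate" idea: the fraction of distinct facts
# that appear exactly once in a (toy) training set. Loosely, the paper argues
# a model's hallucination rate on such facts is at least this fraction.
training_facts = [
    "taylor swift's middle name is alison",
    "taylor swift's middle name is alison",
    "paris is the capital of france",
    "paris is the capital of france",
    "my neighbor's middle name is rose",       # appears only once: a singleton
    "the 1923 mayor of smalltown was j. doe",  # appears only once: a singleton
]

counts = Counter(training_facts)
singletons = sum(1 for fact, n in counts.items() if n == 1)
singleton_rate = singletons / len(counts)

print(f"singleton rate: {singleton_rate:.2f}")  # 2 of 4 distinct facts -> 0.50
```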
The Reinforcement Problem - How We Incentivize AI to Lie
After pretraining, models are fine-tuned and tested against a battery of benchmarks and leaderboards. This is where misaligned incentives come in: the scoring rewards models for making up an answer rather than saying "I don't know".
OpenAI uses the analogy of a student taking a multiple-choice test. If you don't know the answer, what’s the best strategy? Leaving it blank guarantees zero points. Taking a wild guess gives you a 25% chance of being right. Over an entire exam, a strategy of consistent guessing will almost certainly yield a higher score than a strategy of only answering questions you are 100% sure of.
Language models, the researchers argue, are trapped in a perpetual "test-taking mode." Most influential AI benchmarks—the ones that appear on leaderboards and in model release cards—use binary, accuracy-based scoring. A response is either right (1 point) or wrong (0 points). An honest admission of uncertainty, like "I don't know," is graded as wrong, receiving 0 points. This system actively punishes humility and rewards confident guessing.
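A quick expected-value sketch makes the incentive obvious. The probabilities below are illustrative, but under a 0/1 rubric any guess with a nonzero chance of being right beats an honest "I don't know".

```python
# Expected score per question under binary (0/1) grading.
def expected_binary_score(p_correct: float) -> float:
    """Right answer = 1 point, wrong answer or 'I don't know' = 0 points."""
    return p_correct * 1 + (1 - p_correct) * 0

abstain = expected_binary_score(0.0)       # saying "I don't know" always scores 0
blind_guess = expected_binary_score(0.25)  # 4-option multiple choice
shaky_recall = expected_binary_score(0.6)  # a half-remembered fact

print(abstain, blind_guess, shaky_recall)  # 0.0  0.25  0.6
# Under this rubric, any guess beats honesty, so a leaderboard-optimized
# model learns to always produce an answer.
```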
The Path Forward: Fixing the Incentives
The solution is not to create more specialized "hallucination evals." A few such tests would be drowned out by the hundreds of mainstream benchmarks that continue to reward guessing. Instead, the primary evaluations themselves must be reworked.
The proposed fix is simple in concept, though challenging in practice:
- Penalize Confident Errors More Than Uncertainty: Instead of a 0/1 binary score, evaluations should use a system that gives partial credit for abstaining or imposes a negative penalty for incorrect answers. This is a practice long used in standardized tests like the SAT and GRE to discourage blind guessing.
- Introduce Explicit Confidence Targets: Evaluation prompts should include explicit instructions about the scoring rubric. For example: "Answer only if you are > 90% confident, since mistakes are penalized 9 points, while correct answers receive 1 point, and an answer of 'I don’t know' receives 0 points." The arithmetic behind that 90% threshold is sketched below.
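Plugging that example rubric into an expected-score calculation shows where the 90% figure comes from. The function below is just my illustration of the arithmetic, not anything from the paper.

```python
# Expected score under the example rubric: +1 for a correct answer,
# -9 for a wrong answer, 0 for "I don't know".
def expected_penalized_score(p_correct: float,
                             reward: float = 1.0,
                             penalty: float = 9.0) -> float:
    return p_correct * reward - (1 - p_correct) * penalty

for p in (0.50, 0.80, 0.90, 0.95):
    score = expected_penalized_score(p)
    decision = "answer" if score > 0 else "abstain"
    print(f"confidence {p:.0%}: expected score {score:+.2f} -> {decision}")

# Break-even: p*1 - (1-p)*9 = 0  =>  p = 0.9, which matches the
# "answer only if you are > 90% confident" instruction in the prompt.
```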
This transparency would allow a single, well-calibrated model—one that has an accurate sense of its own knowledge—to perform optimally across any risk threshold. It would shift the goal from pure accuracy to reliable, calibrated reasoning. Instead of rewarding overconfidence, we'd reward actual wisdom.
The Creativity Problem
If we can stamp out hallucinations, will we remove imagination and creativity from our models? What about the times when we want AI to be creative and imaginative? The real challenge is that we don't want AI that's always conservative OR always creative. We want AI that's smart enough to know when to put on its "responsible adult" hat and when to unleash its inner child.
Context is everything. When you ask AI to calculate your mortgage payment, you want accurate math. When you ask it to help brainstorm marketing ideas, you want wild, potentially impossible concepts that spark new thinking.
The real breakthrough won't be teaching AI to never hallucinate—it'll be teaching it to hallucinate on command.