OpenAI’s newest reasoning models are struggling with hallucinations more than ever, making up false information at higher rates than earlier versions. OpenAI’s own tests revealed this surprising trend, leaving many wondering why it’s happening.
The New York Times reports that OpenAI’s o3 and o4-mini models hallucinate far more than the older o1. For example, on the PersonQA benchmark, which tests answering questions about public figures, o3 hallucinated 33% of the time, more than double o1’s 15% rate. The o4-mini model did even worse, hitting a 48% hallucination rate.
When tested on SimpleQA, which covers more general knowledge questions, hallucination rates jumped to 51% for o3 and a staggering 79% for o4-mini, compared to 44% for o1. These numbers are pretty wild, especially since newer models are supposed to be more capable and accurate.
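To put those percentages in plain terms: benchmarks like PersonQA and SimpleQA grade each model answer against known facts, and the hallucination rate is roughly the share of answers containing fabricated claims. Here is a minimal sketch of that bookkeeping in Python; the grading labels are hypothetical and only illustrate the arithmetic, not OpenAI’s exact scoring scheme.

```python
# Minimal sketch: computing a hallucination rate from graded benchmark answers.
# The labels below ("correct", "hallucinated", "not_attempted") are hypothetical
# and only illustrate the arithmetic, not PersonQA's or SimpleQA's actual grading.
from collections import Counter

graded_answers = [
    "correct", "hallucinated", "correct", "not_attempted",
    "hallucinated", "correct", "correct", "hallucinated",
]

counts = Counter(graded_answers)
hallucination_rate = counts["hallucinated"] / len(graded_answers)

print(f"Hallucination rate: {hallucination_rate:.0%}")  # 38% for this toy sample
```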
OpenAI admits it doesn’t fully understand why these newer models hallucinate more. Some experts think it might be linked to the so-called “reasoning” models, which try to break down problems step-by-step, mimicking human thought processes. These models are designed to handle complex tasks better than simple text prediction.
OpenAI’s first reasoning model, o1, was praised for matching or beating PhD students in physics, chemistry, biology, math, and coding. It uses a “chain of thought” approach, thinking through problems carefully before answering.
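OpenAI doesn’t publish the internals of that chain-of-thought process, but the general idea can be approximated from the developer side by asking a model to reason through its steps before answering. A rough sketch using the official OpenAI Python SDK follows; the model name and prompt wording are illustrative assumptions, not OpenAI’s actual implementation.

```python
# Rough sketch: chain-of-thought style prompting via the OpenAI Python SDK
# (openai>=1.0). The model name and prompt are illustrative only; reasoning
# models like o1/o3 perform this kind of step-by-step thinking internally.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="o3-mini",  # assumption: substitute whichever model you have access to
    messages=[
        {
            "role": "user",
            "content": (
                "A train leaves at 3:15 pm and arrives at 5:40 pm. "
                "How long is the trip? Work through the steps before answering."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```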
Despite the numbers, OpenAI’s Gaby Raila told the Times that hallucinations aren’t inherently more common in reasoning models, though the company is working to bring down the high rates seen in o3 and o4-mini.
AI models need to cut down on the nonsense and falsehoods if they’re going to be truly useful. Right now, it’s tough to trust their answers without double-checking everything, which defeats the point of saving time and effort, the main reason people turn to AI in the first place.
We’ll have to wait and see if OpenAI and other AI developers can get these hallucinations under control. Until then, it’s a wild ride with AI that sometimes just makes stuff up.
What do you think about these rising hallucination rates? Have you noticed weird or false answers from AI lately? Drop your thoughts in the comments below.