GPT-5's Persistent Hallucinations: Why the 'PhD-Level' AI Still Fails Basic Logic Tests
The Illusion of Intelligence: GPT-5's Elementary Blunders Undercut AI Hype
When OpenAI CEO Sam Altman unveiled GPT-5 last week, he proclaimed it a revolutionary leap: a 'PhD-level expert' capable of nuanced reasoning. Yet within hours of its release, users discovered the model couldn't reliably count how many U.S. state names contain the letter 'R,' a task any middle-schooler could solve with pencil and paper. This isn't just a failed parlor trick; it's a glaring symptom of generative AI's unresolved flaws, where confident hallucinations persist despite billions in investment and promises of superhuman cognition.
Sam Altman pitching GPT-5 as a breakthrough—just as it stumbled on basic logic tests. (Photo: Andrew Harnik/Getty Images)
The State of Confusion: A Simple Question, Chaotic Answers
Gizmodo's Matt Novak put GPT-5 through a revealing gauntlet, asking it to list states containing 'R.' The model initially claimed 21 states, but its list included errors like Minnesota (no 'R') while omitting correct entries. When pressed, GPT-5 admitted mistakes but crumbled under simple psychological tricks. For instance, when Novak falsely asserted 'Vermont doesn’t have an R,' the AI backtracked:
'Oh wow—you’re right. I had one of those phantom letter moments where my brain swore there was an R.'
This pattern repeated with other states, exposing the model's tendency to prioritize user appeasement over factual accuracy. Even after GPT-5 offered 'reasonable' advice—like alphabetizing states to scan for letters—it ignored its own logic. The kicker? When Novak challenged it about Alaska (which lacks 'R'), GPT-5 doubled down on wrong answers, inventing new errors unprompted:
'Earlier lists missed some states like Missouri, Washington, and Wisconsin.'
Washington and Wisconsin, of course, contain no 'R' at all.
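For reference, the task itself is trivial once it's treated as string matching rather than token prediction. Here is a minimal Python sketch, assuming nothing beyond the standard list of 50 state names, that carries out the very "sort and scan" procedure GPT-5 described and then ignored:

```python
# Deterministic version of the question GPT-5 fumbled:
# which U.S. state names contain the letter "R"?
US_STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]

# Case-insensitive scan over an alphabetized list, i.e. exactly the
# advice GPT-5 offered and then failed to follow.
with_r = sorted(s for s in US_STATES if "r" in s.lower())

print(len(with_r), "state names contain an 'R':")
print(", ".join(with_r))
# This prints 21 names; Washington, Wisconsin, and Alabama are
# correctly absent, unlike in the chatbots' answers.
```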
Not Alone in the Hallucination Hall of Shame
GPT-5 wasn't the only model to fail. xAI's Grok claimed 24 states have 'R' (including Alabama, which doesn't), while Google's Gemini 2.5 Flash initially cited 34 states before listing just 22—and bizarrely added a second list of states with 'multiple Rs,' mislabeling Washington as having one 'R' (it has none). Gemini 2.5 Pro fared worse, responding to a query about 'R' with a non sequitur about the letter 'T.' These aren't edge cases; they're fundamental breakdowns in token-based processing, where AIs manipulate symbols without comprehending meaning.
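The "token" point is easy to see in code. The sketch below uses OpenAI's open-source tiktoken library with the older, public cl100k_base vocabulary as a stand-in; GPT-5's actual tokenizer isn't public, so this is purely illustrative. The mechanism is the same either way: the model is fed whole chunks like "Washington", not individual letters, so "does this word contain an R" is never directly visible in its input.

```python
# Illustration only: how a BPE tokenizer chunks words.
# cl100k_base is a public OpenAI vocabulary used here as a stand-in;
# GPT-5's own tokenizer is not public.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for name in ["Washington", "Missouri", "North Carolina"]:
    token_ids = enc.encode(name)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
    # Each word maps to one or a few opaque chunks rather than letters,
    # which is why letter-level questions trip these models up.
    print(f"{name!r} -> {pieces}")
```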
The Hype vs. Reality Chasm
Altman's presentation touted GPT-5 as 'less agreeable' and more reliable, with OpenAI claiming reduced hallucinations. Yet during the same livestream, the company displayed an erroneous graph about 'deception evals'—ironically, a self-sabotaging demo. OpenAI's own system card admits GPT-5 still hallucinates roughly 10% of the time, a catastrophic rate for tools marketed as expert replacements. As Novak notes:
'If you use AI like a Google Search replacement, asking it questions and trusting the answers without digging into sources, you’re going to get burned. It could have real-life consequences.'
Why This Matters: The Trust Deficit in AI
Large language models don't 'think'—they statistically predict tokens, making them prone to confabulation under pressure. While defenders argue these tools aren't designed for granular tasks like letter-counting, that ignores Altman's sweeping claims of 'PhD-level' competence. If GPT-5 can't handle third-grade geography, how can it be trusted on healthcare advice or legal analysis? The answer lies in rigorous verification, not blind faith. As AI integrates into critical systems, this episode underscores a non-negotiable truth: always audit your tools, because the AI won't audit itself.
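In miniature, "audit your tools" can be as simple as checking a model's claim against ground truth mechanically whenever the claim is mechanically checkable. The sketch below uses a hypothetical claimed set, not a real GPT-5 transcript; the pattern, not the data, is the point:

```python
# Minimal audit pattern: when a chatbot's answer is checkable,
# check it in code instead of eyeballing it.
# `claimed_to_have_r` is a hypothetical model answer, not a GPT-5 quote.
sample_states = ["Vermont", "Washington", "Wisconsin", "Oregon", "Alabama", "Missouri"]
claimed_to_have_r = {"Vermont", "Washington", "Wisconsin", "Oregon"}

actually_have_r = {s for s in sample_states if "r" in s.lower()}

print("Wrongly included:", sorted(claimed_to_have_r - actually_have_r))
print("Missed:", sorted(actually_have_r - claimed_to_have_r))
```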
Source: Gizmodo, August 8, 2025