A KPMG Report on AI's Benefits Was Itself Full of AI Hallucinations

AI & ML Reporter

Gary Marcus rounds up three June 2026 examples of generative AI failing in exactly the places it was supposed to shine. The standout: a KPMG report praising business uses of AI that turned out to cite case studies the model invented.

Gary Marcus collected three incidents from a single day in June 2026 and used them to make a familiar point. The systems being sold as ready for serious work still fabricate, and they fabricate most embarrassingly when the stakes involve their own credibility. The examples are small individually. Together they sketch the gap between what large language models are marketed to do and what they reliably do.

What was claimed

The headline case, surfaced by journalist Anne Applebaum and originally reported in the Financial Times, involves a KPMG report on how businesses are successfully using AI. The report's job was to demonstrate value: here are real companies, here are real deployments, here is the payoff. That is the standard consulting deliverable, the kind that justifies budgets and steers procurement.

The problem is that some of the case studies described in the report did not exist. They were hallucinations, plausible-sounding examples generated by a model and passed through without verification. A document arguing that AI delivers reliable business outcomes was undermined by the exact failure mode that makes AI hard to deploy in business.

Marcus pairs this with two shorter items, one from 404 Media and another flagged by researcher Valerio Capraro, and presents the three as a set. The framing in his post is blunt: you cannot get more 2026 than that.

What is actually new here

Nothing about the underlying mechanism is new, and that is the point worth sitting with. Language models predict likely text. When asked for case studies, citations, or quotes, they produce strings that match the statistical shape of real ones. A fabricated client engagement reads exactly like a genuine one because the model learned the genre, not the facts. There is no internal flag that separates a retrieved truth from a confident invention.

What has changed since the early hallucination stories of 2023 and 2024 is the context. These are not students caught using ChatGPT or lawyers filing briefs with invented cases. This is a major professional services firm, the kind of organization whose entire product is diligence, shipping a document with fabricated evidence inside it. The failure has moved up the institutional ladder without the safeguards moving with it.

The deeper pattern is that adoption has outpaced verification. Two years of warnings about hallucination have produced better models on benchmarks but not a culture of checking outputs. The models improved at sounding right faster than organizations improved at confirming they are right.

The limitations these stories expose

Three technical realities sit underneath the KPMG embarrassment.

First, fluency is not grounding. A model can write a detailed, internally consistent account of a company adopting AI for supply chain optimization, complete with percentages and timelines, none of which trace back to anything real. The detail that makes the output persuasive is the same detail that makes the fabrication hard to spot.

Second, retrieval augmentation helps but does not solve this. Systems that pull from real documents still synthesize across them, and the synthesis step reintroduces invention. A model handed real sources can still attribute a quote to the wrong company or merge two cases into a third that never happened.

Third, the human review layer is the weakest link precisely when it matters most. Reviewers skim outputs that read smoothly. The smoother the prose, the less scrutiny it draws. A clumsy hallucination gets caught. A polished one ships.
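The second point, that a model handed real sources can still pin a quote on the wrong company, is the kind of error a mechanical check can catch before a human ever reads the draft. What follows is a minimal sketch, not a description of any real pipeline: `SourceDoc`, `normalize`, and `quote_supported` are hypothetical names, the company names and quotes are made-up placeholders, and the matching is plain substring comparison, which is an assumption that only works for verbatim quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    company: str  # entity the retrieved document is actually about
    text: str     # full text of the retrieved document


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(s.lower().split())


def quote_supported(quote: str, attributed_to: str, sources: list[SourceDoc]) -> bool:
    """True only if the quote appears verbatim in a source about the named company."""
    needle = normalize(quote)
    return any(
        normalize(doc.company) == normalize(attributed_to)
        and needle in normalize(doc.text)
        for doc in sources
    )


# Placeholder sources, invented for illustration only.
sources = [
    SourceDoc("Acme Logistics", "We cut routing errors after the pilot."),
    SourceDoc("Borealis Foods", "Forecasting accuracy improved in the second quarter."),
]

# Correct attribution: the quote and the company come from the same source.
ok = quote_supported("cut routing errors after the pilot", "Acme Logistics", sources)
# Misattribution: the quote is real but belongs to a different company.
swapped = quote_supported("cut routing errors after the pilot", "Borealis Foods", sources)
# Invention: the quote appears in no retrieved source at all.
invented = quote_supported("saved 40% on warehousing", "Acme Logistics", sources)

print(ok, swapped, invented)
```

A check like this only covers verbatim quotes. Paraphrased claims, merged cases, and invented percentages pass straight through it, which is why the human review step still has to exist.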

Why this keeps happening to serious organizations

The incentive structure rewards speed. A consultant who produces a report in an afternoon with AI assistance looks more productive than one who spends a week verifying every example. The cost of a fabrication only appears later, when someone like Applebaum reads it carefully and posts the result.

This is the recurring shape of generative AI deployment. The tool lowers the cost of producing plausible content to nearly zero, while the cost of verifying that content stays high and human. Organizations adopt the cheap half and skip the expensive half, then act surprised when the expensive half turns out to have mattered.

For anyone building or buying these systems, the practical lesson is unglamorous. Treat model output about specific facts, named entities, citations, and case studies as unverified by default. Build the checking step into the workflow rather than bolting it on after publication. Tools like retrieval pipelines and structured grounding reduce the rate of invention but do not eliminate the need for a human to confirm that the named company actually did the thing the report says it did.
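One way to make "unverified by default" concrete is to have the workflow itself refuse to publish until every factual claim has a named reviewer and a pointer to evidence. The sketch below assumes claims have already been extracted from the draft by some earlier step; `Claim`, `Draft`, `Status`, and the reviewer and evidence strings are all hypothetical, and the company names are placeholders rather than real examples.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    UNVERIFIED = "unverified"
    CONFIRMED = "confirmed"


@dataclass
class Claim:
    """One checkable statement pulled from a model-written draft."""
    text: str
    status: Status = Status.UNVERIFIED  # nothing is trusted until a person signs off
    evidence: Optional[str] = None      # where the reviewer confirmed the claim
    reviewer: Optional[str] = None

    def confirm(self, reviewer: str, evidence: str) -> None:
        # Refuse a sign-off that points at no evidence.
        if not evidence.strip():
            raise ValueError("confirmation requires an evidence reference")
        self.status = Status.CONFIRMED
        self.reviewer = reviewer
        self.evidence = evidence


@dataclass
class Draft:
    claims: list[Claim] = field(default_factory=list)

    def blocking(self) -> list[Claim]:
        """Claims that still prevent publication."""
        return [c for c in self.claims if c.status is not Status.CONFIRMED]

    def can_publish(self) -> bool:
        return not self.blocking()


# Placeholder claims; the company names are invented for illustration.
draft = Draft([
    Claim("Example Corp deployed a forecasting model in 2025"),
    Claim("Sample Ltd reduced invoice processing time"),
])
before = draft.can_publish()

draft.claims[0].confirm(reviewer="editor-1", evidence="signed client reference on file")
after_one = draft.can_publish()
still_open = [c.text for c in draft.blocking()]

draft.claims[1].confirm(reviewer="editor-1", evidence="interview notes on file")
after_all = draft.can_publish()

print(before, after_one, after_all)
```

The design choice that matters is the default. A claim starts as `UNVERIFIED` and only a deliberate human action changes that, so skipping the check produces a blocked draft rather than a published fabrication.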

Marcus has been making versions of this argument for years, and the steady stream of examples is what keeps the argument alive. The models keep getting better at the benchmarks. The deployments keep tripping over the same gap between sounding authoritative and being correct. A report about AI's reliability, hollowed out by AI's unreliability, is just the cleanest illustration of that gap to date.
