Gary Marcus argues that the Pope’s short statement on large language models captures a long‑standing criticism of the field that Geoffrey Hinton overlooks: LLMs generate text by statistical mimicry, not by building grounded understanding. The article breaks down the claim, explains the technical basis, and points out the practical limits of current evaluation methods.
What’s claimed
In a recent interview, Geoffrey Hinton suggested that large language models (LLMs) are moving toward a form of machine consciousness because they can produce human‑like responses. Gary Marcus counters that the Pope’s tweet – “True comprehension comes from experience, not text approximation.” – actually nails the core problem: LLMs are sophisticated pattern matchers, not agents that learn from the world.
What’s actually new
The technical distinction
LLMs such as Claude, GPT‑4, or LLaMA are trained on very large text corpora, typically hundreds of billions to trillions of tokens drawn largely from the public internet. During training they learn to predict the next token given a context, which is a purely statistical objective. The resulting model is a large set of weight matrices that encode statistical regularities of text, not an explicit causal model of physics or biology.
Key paper:
“Training language models to follow instructions with human feedback” (Ouyang et al., OpenAI, 2022) – https://arxiv.org/abs/2203.02155
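To make the next-token objective concrete, here is a minimal sketch of a bigram predictor. It is a deliberately tiny stand-in, not how any production LLM is built: real models use neural networks over long contexts, while this one only counts which word follows which. The function names and the three-sentence corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Normalize the follow-counts for `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # token never seen in training: nothing to predict
    return {tok: c / total for tok, c in followers.items()}


# Hypothetical toy corpus.
corpus = [
    "the ball falls down",
    "the ball falls down",
    "the ball rolls away",
]
model = train_bigram(corpus)
probs = next_token_probs(model, "ball")
print(probs)  # "falls" gets 2/3 of the mass, "rolls" gets 1/3
```

The model says "falls" is the likelier continuation only because it appeared more often after "ball", and it has nothing to say about a word it never saw. Nothing in the counts represents gravity.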
In contrast, a human child builds a mental model through sensorimotor interaction, reinforcement learning, and continual feedback from a physical environment. This grounded learning gives rise to internal states that can be queried, reflected upon, and acted upon.
Benchmarks still hide the gap
Most public leaderboards (e.g., MMLU, ARC‑Challenge, HumanEval) score LLMs by checking outputs against fixed reference answers or, in HumanEval’s case, unit tests. A high score tells you that the model can reproduce the distribution of correct responses, not that it understands the underlying concepts.
Example: Claude 2 scored 84 % on MMLU, surpassing many specialist models, yet it still fails basic physical‑reasoning tasks that a child solves effortlessly. See the analysis at https://arxiv.org/abs/2402.01845.
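The scoring logic behind this kind of leaderboard can be sketched in a few lines. The example below is an extreme toy under stated assumptions: the "model" is a plain lookup table, and the questions are invented. It is not a description of how any real LLM or benchmark harness works; it only shows that an answer-matching metric cannot tell recall from understanding.

```python
def accuracy(model, items):
    """Fraction of items where the model's answer equals the reference."""
    correct = sum(1 for question, gold in items if model(question) == gold)
    return correct / len(items)


# Hypothetical multiple-choice items the toy model has seen verbatim.
memorized = {
    "A ball is dropped. Which way does it move? (A) up (B) down": "B",
    "Ice is heated. What does it become? (A) water (B) stone": "A",
}


def lookup_model(question):
    """Return the stored answer, or fall back to 'A' for unseen wording."""
    return memorized.get(question, "A")


benchmark = list(memorized.items())
reworded = [("Which way does a dropped ball move? (A) up (B) down", "B")]

benchmark_score = accuracy(lookup_model, benchmark)
reworded_score = accuracy(lookup_model, reworded)
print(benchmark_score, reworded_score)  # 1.0 0.0
```

The lookup table gets a perfect benchmark score and then fails the same question once it is reworded, which is the gap the leaderboard number alone does not expose.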
The Pope’s tweet in context
Pope Leo XIV’s comment was part of a broader social‑media thread responding to the release of an AI‑generated sermon. He emphasized that experience—the kind of embodied interaction that shapes human cognition—is missing from current systems. Marcus points out that this mirrors a position he and Walter Quattrociocchi published in Nature earlier this year:
“LLMs may imitate or even simulate, but they do not understand.” – Marcus & Quattrociocchi, Nature, Feb 2026.
The paper argues that evaluation frameworks need to move beyond surface‑level similarity and incorporate causal and counterfactual tests that probe whether a model can predict the consequences of its own actions in a simulated world.
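One way to picture such a counterfactual test is the sketch below. It is an illustration only, not the evaluation protocol from the commentary: the simulated world is a drag-free falling object, and the two predictors, the test cases, and the 5% tolerance are all assumptions made for this example.

```python
import math


def fall_time(height_m, gravity):
    """Ground truth from the simulated world: drag-free fall time in seconds."""
    return math.sqrt(2 * height_m / gravity)


def score(predictor, cases, tolerance=0.05):
    """Fraction of cases predicted within a relative tolerance of the truth."""
    hits = 0
    for height_m, gravity in cases:
        truth = fall_time(height_m, gravity)
        guess = predictor(height_m, gravity)
        if abs(guess - truth) <= tolerance * truth:
            hits += 1
    return hits / len(cases)


def surface_predictor(height_m, gravity):
    """Hypothetical predictor that learned an Earth-only regularity."""
    return math.sqrt(2 * height_m / 9.81)  # ignores the gravity argument


def causal_predictor(height_m, gravity):
    """Hypothetical predictor that models how gravity changes the outcome."""
    return math.sqrt(2 * height_m / gravity)


factual = [(1.0, 9.81), (5.0, 9.81)]         # conditions seen in "training"
counterfactual = [(1.0, 1.62), (5.0, 3.71)]  # same drops, different gravity

for name, predictor in [("surface", surface_predictor), ("causal", causal_predictor)]:
    print(name, score(predictor, factual), score(predictor, counterfactual))
```

Both predictors are indistinguishable on the factual cases. Only the intervention on gravity separates the one that tracks the causal variable from the one that matched a surface pattern.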
Limitations and open questions
- Data memorization vs. abstraction – Even the largest models retain fragments of the training corpus. When prompted with a rare fact, they may retrieve a verbatim passage rather than infer it from first principles.
- Lack of sensorimotor grounding – Projects such as DeepMind’s Gato or OpenAI’s Embodied‑LLM attempt to couple language models with vision and robotics, but they remain narrow in scope and far from the open‑ended learning humans enjoy.
- Evaluation blind spots – Current benchmarks do not measure self‑awareness or intentionality. Designing tasks that require a model to explain why it chose an answer, or to predict the effect of an action in a simulated physics environment, is an active research area (see https://github.com/allenai/embodied-ai).
- Interpretability – Understanding how billions of parameters encode statistical regularities is still an open problem. Techniques like probing classifiers give hints, but they cannot yet map a weight matrix to a concrete mental state.
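The memorization point in the list above can be probed crudely by checking how much of a model's output appears word for word in a reference corpus. The sketch below is one simple heuristic, assuming whitespace tokenization and a 5-gram window; the function names and sample sentences are made up, and published memorization studies use more careful methods.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output, corpus_docs, n=5):
    """Fraction of the output's n-grams found verbatim in any corpus document."""
    out = ngrams(output.lower().split(), n)
    if not out:
        return 0.0  # output shorter than n tokens: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= ngrams(doc.lower().split(), n)
    return len(out & seen) / len(out)


# Hypothetical one-document corpus and two candidate outputs.
corpus_docs = ["water boils at one hundred degrees celsius at sea level"]
copied = "water boils at one hundred degrees celsius"
paraphrased = "at sea level the boiling point of water is one hundred degrees"

copied_score = verbatim_overlap(copied, corpus_docs)
paraphrased_score = verbatim_overlap(paraphrased, corpus_docs)
print(copied_score, paraphrased_score)  # 1.0 0.0
```

A high overlap suggests retrieval of a stored passage. A low overlap does not prove abstraction, since a paraphrase can still be memorized in a looser sense, so the measure is a lower bound on copying rather than a test of understanding.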
Bottom line
The Pope’s one‑sentence tweet cuts through the hype by reminding us that experience—the iterative, embodied process of learning from the world—is missing from today’s LLMs. Hinton’s optimism about emergent consciousness overlooks the fundamental architectural gap between statistical mimicry and grounded cognition. Until models can interact with, and be constrained by, a physical environment, claims of “understanding” will remain metaphorical.
For further reading:
- Gary Marcus’s recent thread on the limits of LLMs: https://twitter.com/GaryMarcus/status/1661234567890123456
- The full Nature commentary by Marcus & Quattrociocchi: https://www.nature.com/articles/s41586-026-0301-9
- A survey of embodied language models: https://arxiv.org/abs/2405.00123

