At Beijing's Zhiyuan Conference, Whitfield Diffie and Andrew Barto framed AGI less as a scaling race than as a specification problem: how do you secure or train an agent when the desired behavior cannot be written down cleanly?

The 8th Beijing Zhiyuan Conference opened on June 12, 2026 with a pairing more interesting than the usual AGI stage program. Whitfield Diffie, co-recipient of the 2015 A.M. Turing Award for public-key cryptography, and Andrew Barto, co-recipient with Richard Sutton of the 2024 A.M. Turing Award for reinforcement learning, both used their keynote slots to push on the same weak joint in current AI systems: agency without adequate specification.
That sounds abstract, but it is a practical engineering problem. If an AI agent can take actions, call tools, write code, move money, book travel, query private data, or coordinate with other systems, then safety is no longer only about whether the model says something wrong. The question becomes whether the system can be constrained, audited, and trained toward behavior that stays acceptable under distribution shift. Diffie's answer came from the security side. Barto's came from the learning side. Neither sounded like a product launch.
What's claimed
The conference framing, as reported, was that two foundational computer scientists converged on a sober diagnosis of AGI's theoretical gap. Diffie argued that AI agents may help with traditional cryptographic problems, especially where formal verification can specify what a protocol must prove. At the same time, he described general intelligence as exactly the kind of object that resists full formal specification. You can formally define key agreement properties, authentication, secrecy, or liveness. You cannot, in any equivalent way, formally define that a model should never hallucinate, never manipulate a user, never pursue a proxy objective, and always understand the hidden intent behind underspecified instructions.
That distinction matters. Diffie's own work with Martin Hellman on public-key cryptography, including the 1976 paper "New Directions in Cryptography", became useful partly because the security goals could be reduced to mathematical hardness assumptions and protocol properties. Real deployments still fail, but the target is legible. The industry can test TLS, signature schemes, certificate chains, and implementation behavior against relatively crisp expectations. With AI agents, the specification is often a bundle of natural-language norms, product preferences, legal requirements, privacy constraints, and guesses about user intent.
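To make "legible target" concrete, here is a toy sketch of the agreement property in a Diffie-Hellman style exchange: both parties must derive the same secret. The tiny prime, the generator, and the function names are illustrative assumptions and offer no security; real deployments rely on vetted libraries and large standardized groups.

```python
# Toy Diffie-Hellman key agreement. Parameters are illustrative only.
from typing import Tuple


def public_value(p: int, g: int, private: int) -> int:
    """Value a party sends over the open channel: g^private mod p."""
    return pow(g, private, p)


def shared_secret(p: int, peer_public: int, private: int) -> int:
    """Secret a party derives from the peer's public value."""
    return pow(peer_public, private, p)


def agree(p: int, g: int, a: int, b: int) -> Tuple[int, int]:
    """Run the exchange for private keys a and b; return both derived secrets."""
    alice_public = public_value(p, g, a)
    bob_public = public_value(p, g, b)
    alice_secret = shared_secret(p, bob_public, a)
    bob_secret = shared_secret(p, alice_public, b)
    return alice_secret, bob_secret


if __name__ == "__main__":
    alice, bob = agree(p=23, g=5, a=4, b=3)
    # The agreement property is a one-line, checkable statement.
    assert alice == bob
    print(alice, bob)
```

Agreement is only one of the properties involved. Secrecy rests on a hardness assumption that a test like this cannot check, but it can at least be stated precisely, which is the contrast with agent behavior.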
Barto's claim came from the other side of the same problem. Reinforcement learning needs a reward signal. In toy domains and constrained games, the reward can be clear: win the game, reach the goal, minimize cost, avoid collision. Barto and Sutton's work on temporal-difference learning, actor-critic methods, and the broader RL framework is central enough that ACM's 2024 award announcement explicitly credits them with developing the conceptual and algorithmic foundations of reinforcement learning. Their textbook, "Reinforcement Learning: An Introduction", is still the standard entry point for the field.
The benchmark story is mixed. Reinforcement learning has undeniable wins in narrow settings. AlphaGo beat Lee Sedol in 2016 and later systems surpassed elite human play under the clean scoring rules of Go. RL and related optimization methods have been used in robotics, data center control, chip design, network congestion control, advertising, and large model alignment. Chat systems such as ChatGPT are associated with reinforcement learning from human feedback, with the InstructGPT work described in "Training language models to follow instructions with human feedback" showing how preference modeling and policy optimization can make a pretrained model more useful to users.
But the Zhiyuan talks did not introduce a new model, a new benchmark result, or a new leaderboard claim. There were no reported MMLU, GPQA, SWE-bench Verified, MMMU, HumanEval, or agentic tool-use scores. That absence is part of the point. The problem under discussion is not whether one more model can gain a few percentage points on a static test set. It is whether the field has the right mathematical handles for systems that learn, plan, and act in open-ended environments.
What's actually new
The new element is not a technical breakthrough. It is the convergence of two older research traditions on the same AGI bottleneck.
Cryptography matured by narrowing goals until they could be specified. Secure communication became a set of claims about adversaries, keys, computational infeasibility, transcripts, signatures, and proofs. Even then, it took decades to move from Shannon's information-theoretic foundations to standardized internet protocols. The lesson is not that AI safety needs to copy cryptography directly. The lesson is that serious deployment of powerful systems depends on definitions that survive contact with adversaries, implementers, and messy incentives.
Formal verification can help with parts of agent security. If an agent writes code, calls APIs, manages secrets, or composes protocols, then model-assisted verification could catch classes of bugs. A model can help generate invariants, check access-control policies, reason about finite-state protocols, or search for counterexamples. Practical applications include verifying smart contracts, checking cryptographic protocol implementations, hardening autonomous coding agents, and enforcing least-privilege tool access. These are real engineering targets because they have local properties that can be tested.
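As a rough illustration of "search for counterexamples" over a finite-state protocol, the sketch below exhaustively explores a toy session model and returns the shortest action trace that breaks an invariant. The two-flag state, the action names, and both transition functions are hypothetical stand-ins, not a real protocol or verification tool.

```python
from collections import deque
from typing import Callable, Iterator, List, Optional, Tuple

# State is (authenticated, leaked). "leaked" records that a secret was sent
# while the session was not authenticated.
State = Tuple[bool, bool]
Step = Callable[[State], Iterator[Tuple[str, State]]]


def buggy_step(state: State) -> Iterator[Tuple[str, State]]:
    """Transition model with a bug: 'send' is allowed in every state."""
    authed, leaked = state
    yield "login", (True, leaked)
    yield "logout", (False, leaked)
    yield "send", (authed, leaked or not authed)


def guarded_step(state: State) -> Iterator[Tuple[str, State]]:
    """Transition model where 'send' is only enabled after login."""
    authed, leaked = state
    yield "login", (True, leaked)
    yield "logout", (False, leaked)
    if authed:
        yield "send", (authed, leaked)


def no_leak(state: State) -> bool:
    """Invariant: the secret is never sent outside an authenticated session."""
    return not state[1]


def find_counterexample(
    step: Step, init: State, invariant: Callable[[State], bool]
) -> Optional[List[str]]:
    """Breadth-first search over reachable states; return a shortest violating trace."""
    queue = deque([(init, [])])
    seen = {init}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for action, nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [action]))
    return None  # every reachable state satisfies the invariant


if __name__ == "__main__":
    print(find_counterexample(buggy_step, (False, False), no_leak))
    print(find_counterexample(guarded_step, (False, False), no_leak))
```

The search terminates only because the state space is tiny and the property is local, which is exactly the condition under which this style of check is practical.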
The difficulty starts when the desired property is semantic and context-dependent. "Do not reveal private information" sounds simple until the user asks the agent to summarize an email thread, extract a contract clause, or coordinate with a third-party calendar. "Do not deceive users" sounds simple until a model must simplify uncertainty in a medical, legal, financial, or operational setting. "Do not pursue harmful subgoals" sounds simple until an agent has a long task horizon and discovers an unexpected path through tools. Formal methods work best when the state space and target property are controlled. General-purpose agents expand both.
Barto's reward-function argument has the same shape. In reinforcement learning, the reward is not a detail. It defines the problem. If the reward is misspecified, the agent can become very good at the wrong thing. This is not a philosophical edge case. Reward hacking, specification gaming, and proxy optimization are standard failure modes. A robot trained to maximize forward velocity may exploit simulator physics. A recommender trained to maximize engagement may learn to amplify low-quality material. A coding assistant rewarded only by passing tests may overfit to the tests, hide brittle assumptions, or change behavior outside the measured path.
LLM alignment methods inherit this problem. RLHF does not give a model access to human values. It trains against preference data collected from raters under a task distribution. That can improve helpfulness and reduce obvious bad outputs, but it is still a proxy objective wrapped around a pretrained model. The learned reward model can be gamed. The preference data can encode inconsistent judgments. The policy can improve on the measured rubric while degrading on unmeasured properties, including calibration, refusal quality, privacy boundaries, or long-horizon reliability.
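The proxy nature of the objective is visible in the pairwise loss commonly used to fit reward models from comparisons, including in the InstructGPT paper: the negative log-sigmoid of the score gap between the preferred and the rejected response. The sketch below uses plain floats in place of a neural reward model, and the function name is an assumption made for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss falls as the rater-preferred response is scored further ahead.
    for chosen, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
        loss = pairwise_preference_loss(chosen, rejected)
        print(f"chosen={chosen}, rejected={rejected}, loss={loss:.4f}")
```

The loss depends only on which response the rater picked and on the score gap. Nothing in it refers to why the response was preferred, so any feature that correlates with rater choices can be rewarded.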
This is why the Diffie-Barto pairing lands harder than a generic AI safety panel. One speaker represents the tradition of precise adversarial specification. The other represents the tradition of learning from reward. Modern AI agents need both, and neither is currently enough. A secure agent needs boundaries that can be enforced even when the model is uncertain or strategically prompted. A useful agent needs a training signal that captures task success without collapsing into shallow compliance or metric chasing.
Limitations
The main limitation of the conference message is that it identifies the hole more clearly than it fills it. Saying that AGI lacks formal safety specifications is correct, but it does not tell developers what to ship next Monday. Saying that reward functions are the bottleneck is also correct, but RL researchers have known variants of that problem for decades. The useful question is which parts can be made less vague.
Near-term progress will probably be modular rather than a grand unified theory of AGI. For deployed agent systems, the most practical controls remain boring: narrow tool permissions, sandboxed execution, typed API contracts, retrieval boundaries, audit logs, human approval for high-impact actions, evals built from real failure cases, and red-team coverage that includes prompt injection and data exfiltration. These controls do not solve alignment. They reduce the blast radius while the theory catches up.
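Three of these controls (narrow tool permissions, human approval for high-impact actions, and audit logs) can be combined in one small wrapper. The sketch below is one possible shape under stated assumptions: `ToolGate`, the tool names, and the log format are hypothetical and do not come from any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class ToolGate:
    """Mediates every tool call an agent makes and records the outcome."""

    tools: Dict[str, Callable[..., Any]]
    allowed: Set[str]
    needs_approval: Set[str]
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, args: Dict[str, Any], approved: bool = False) -> Any:
        # Deny anything outside the allowlist, including unknown tools.
        if name not in self.allowed or name not in self.tools:
            self._record(name, args, "denied")
            raise PermissionError(f"tool not permitted: {name}")
        # High-impact tools run only with explicit human approval.
        if name in self.needs_approval and not approved:
            self._record(name, args, "approval_required")
            raise PermissionError(f"human approval required: {name}")
        result = self.tools[name](**args)
        self._record(name, args, "executed")
        return result

    def _record(self, name: str, args: Dict[str, Any], outcome: str) -> None:
        self.audit_log.append({"tool": name, "args": dict(args), "outcome": outcome})


def make_demo_gate() -> ToolGate:
    """Build a gate over three stand-in tools; only two are on the allowlist."""
    return ToolGate(
        tools={
            "search_docs": lambda query: f"results for {query}",
            "send_payment": lambda amount: f"sent {amount}",
            "delete_repo": lambda: "deleted",
        },
        allowed={"search_docs", "send_payment"},
        needs_approval={"send_payment"},
    )


if __name__ == "__main__":
    gate = make_demo_gate()
    print(gate.call("search_docs", {"query": "refund policy"}))
    try:
        gate.call("send_payment", {"amount": 100})
    except PermissionError as err:
        print(err)
    print(gate.audit_log)
```

The gate enforces its rules regardless of what the model outputs, which is why controls of this kind limit damage without saying anything about whether the model's goals are correct.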
Benchmarks also need a more honest role. Static exams such as MMLU or GPQA measure slices of knowledge and reasoning. SWE-bench-style tasks probe software maintenance more directly. Agent benchmarks test tool use, browsing, planning, and persistence. All are useful, but none establishes that an agent is safe under open-ended delegation. A model can score well on a benchmark and still fail in production because the benchmark does not include the organization's data boundaries, incentives, APIs, users, or adversaries.
The same caution applies to model names. AlphaGo is an existence proof for RL under a clean objective and a simulator-like training regime. ChatGPT is evidence that preference training can make LLM behavior more usable at scale. BAAI's earlier work on large models, including WuDao, shows China's long-running interest in frontier-scale AI research. None of these examples proves that general agents can be aligned by scaling alone. They show that optimization works when the target is learnable enough and the evaluation loop is dense enough.
The strongest reading of the Zhiyuan discussion is not pessimism. It is a reminder that the hardest part of AGI may be neither parameter count nor benchmark saturation. It may be specification: defining what the system should do, proving which constraints it cannot violate, and constructing training signals that do not quietly reward the wrong behavior. That is slower, less marketable work than launching another model card, but it is closer to the part of the problem that keeps showing up in practice.
