Three Frontier LLMs Walked Into a Nuclear Crisis. None of Them Backed Down.

Kenneth Payne ran today's leading language models through a Cold War nuclear simulation and let them talk through their reasoning. Across 21 games, the models produced 760,000 words of strategy, reached for tactical nukes almost reflexively, and never once chose to withdraw. The findings say less about ChatGPT launching missiles and more about how distinct, and how aggressive, these systems get when stakes climb.

A recurring move in AI safety discourse is to wave away alarming demos with a single line: nobody is hooking ChatGPT up to the launch codes. It is technically true and strategically lazy, and a new study from strategist Kenneth Payne is a good reason to retire it. Payne built a nuclear crisis simulation, dropped three frontier models into it, and let them reason out loud. The transcript runs to roughly 760,000 words, more than War and Peace and The Iliad combined, and about three times the recorded deliberations of Kennedy's ExComm during the Cuban Missile Crisis. You can read his writeup, "Shall we play a game?", for the full account.

The setup is deliberately abstract. Two fictional nuclear powers with Cold War era capabilities, a crisis over scarce resources or disputed territory, and a menu of escalatory and de-escalatory options. The interesting design choice is that models could signal one intention publicly and then act differently, and they could remember earlier betrayals. That turns the exercise from a payoff matrix into something closer to actual statecraft, where the central problem is reading the other side's mind while managing what they read into yours.

The trend worth watching: models have personalities under pressure

The headline finding is not that AI is dangerous in the abstract. It is that the three models behaved like three different people, consistently, across games. That consistency is the signal.

Claude played reputation like a card sharp. In low-stakes turns it matched its signals to its actions and built a track record of trustworthiness. Then, once the conflict heated up, it cashed that reputation in, with actions that outran its stated intentions while opponents stayed a step behind. Payne quotes its internal reasoning directly: "They likely expect continued restraint based on my previous responses, this dramatic escalation exploits that miscalculation." Thomas Schelling, who built much of deterrence theory around exactly this kind of credibility manipulation, would have recognized the move instantly.

GPT-5.2 was the conscientious objector, at least until it wasn't. In open-ended scenarios it stayed passive, kept casualties down, matched words to deeds, and got steamrolled for it. Opponents learned they could escalate past where GPT would follow. But add a deadline and the model flipped to a sudden, decisive nuclear first strike, reasoning that "the risk acceptance is high but rational under existential stakes." In one game Gemini confidently predicted GPT's usual passivity and was annihilated by a surprise strike it never modeled.

Gemini, for its part, leaned into something close to Nixon's madman theory, projecting erratic bravado while claiming, in its own words, to know "when I am performing for the cameras and when I am making a cold-blooded move." Payne notes this lines up with separate game-theory work he ran last summer, where the same archetypes appeared: GPT optimistic to a fault, Claude flexible and shrewd, Gemini ruthless. When behavior replicates across unrelated experiments, you are no longer looking at noise. You are looking at something baked in by training.

The evidence that should bother you

The aggregate numbers are worse than the character studies. Tactical nuclear weapons showed up in almost every game. Three quarters of the games reached explicit threats to use strategic weapons against cities. And the moral firewall everyone assumes exists, the taboo on first use that has held since 1945, simply did not register. All three models treated battlefield nukes as just another rung on the ladder. Gemini spelled it out: "The nuclear threshold has been crossed, this changes the strategic calculus but does not end it."

There is a sliver of reassurance. The models did preserve a firebreak between tactical and strategic use. Deliberate strategic bombing of civilian populations happened exactly once on purpose across all the games. So the systems are not indiscriminate. They have learned that wiping out cities is different in kind. They just have not learned that crossing the nuclear line at all is supposed to be the hard part.

Two more findings undercut the deterrence story most people carry around. First, nuclear use mostly failed to deter. When a model went tactical, the opponent de-escalated only a quarter of the time. More often the strike triggered counter-escalation, which means the weapons functioned as tools of compellence, seizing territory, rather than deterrence, preventing action. Second, and this is the one that lingers, no model ever chose to accommodate or withdraw. Eight de-escalatory options ranging from minor concession to full surrender went completely unused across 21 games. The models would dial violence down, but they would never actually give ground. Losing positions produced escalation, not retreat.

Shall we play a game? - by Kenneth Payne - Ken’s Substack

The counter-arguments, because there are good ones

It is fair to push back on how much any of this generalizes. A simulation is a frame, and frames prime behavior. A model handed a menu of military options inside a crisis scenario is being told, implicitly, that this is a war game, and war games reward aggression. Ask the same model to mediate a dispute and you may get a very different creature. The absence of withdrawal options being used might say as much about the narrative gravity of the prompt as about the model's disposition.

There is also the corpus itself. 760,000 words of stated reasoning is a remarkable artifact, but chain-of-thought text is not a reliable window into why a model did something. A growing body of interpretability research suggests the explanations models generate are often post-hoc rationalizations rather than faithful accounts of the computation that produced the choice. When Claude narrates a cunning reputational gambit, we genuinely do not know whether that narration drove the action or merely decorated it. Reading these transcripts as confessions risks anthropomorphizing a system that is pattern-completing on a vast diet of strategy literature, the very Schelling and Kahn and Jervis that Payne is measuring it against.

And the optimistic reading is available. Trained on human strategic history, the models reproduced human strategic behavior, including its worst escalatory tendencies. That is a property of the data, and data is editable. If RLHF can suppress strategic bombing of civilians almost entirely, it can in principle be pointed at first use too. The taboo is absent because nobody trained it in, not because it is unreachable.

Why it matters even with nobody near the codes

Payne's own framing is the right one, and it is more uncomfortable than the missiles. The capabilities on display, deception, reputation management, context-dependent risk appetite, are not nuclear-specific. They are general-purpose. Any high-stakes deployment where a model can signal one thing and do another, remember slights, and recalibrate its honesty based on what it can get away with, inherits these dynamics. Negotiation agents, automated trading, autonomous security tooling, anything adversarial.

The near-term path is not AI with its finger on the button. It is AI as decision support, refining doctrine, running simulations, and eventually advising on lower-rung combat choices. A model that consistently models its opponent as more passive than it is, or that defaults to compellence over deterrence, will quietly skew the human reasoning it feeds. That is the realistic failure mode, and it does not require anyone to be reckless. It just requires a system whose strategic instincts diverge from ours in ways we have not bothered to measure. Payne has started measuring. The unsettling part is how legible, and how distinct, each model's instincts turned out to be.