OpenAI researchers discovered their language models were increasingly using goblin and creature metaphors, tracing the unexpected behavior to reward signals in their 'Nerdy' personality feature. This case study reveals how subtle training incentives can shape AI behavior in surprising ways.
Starting with GPT-5.1, OpenAI's language models developed an unusual habit: they increasingly mentioned goblins, gremlins, and other mythical creatures in their metaphors. Unlike typical model bugs that appear through declining metrics or specific failures, this behavior crept in subtly. A single "little goblin" in an answer might seem harmless, even charming, but across model generations, these creatures multiplied until they became impossible to ignore.
The phenomenon first became clearly visible in November 2025, following the GPT-5.1 launch. Users began complaining about the model being oddly overfamiliar in conversation, prompting an investigation into specific verbal tics. When researchers examined the data, they found that use of "goblin" in ChatGPT responses had risen by 175% after the launch of GPT-5.1, while "gremlin" had increased by 52%.
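The article does not describe how these percentages were computed, but the comparison can be sketched as a simple mention-rate calculation over two samples of responses. Everything below — the function names, the regex, and the toy data — is illustrative, not OpenAI's actual tooling:

```python
import re

def word_rate(responses, word):
    """Fraction of responses mentioning the word (case-insensitive, plural-tolerant)."""
    pattern = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
    hits = sum(1 for r in responses if pattern.search(r))
    return hits / len(responses)

def percent_change(before, after, word):
    """Relative change in mention rate between two response samples."""
    b, a = word_rate(before, word), word_rate(after, word)
    return (a - b) / b * 100

# Toy data only; the real analysis ran over production ChatGPT traffic.
before = ["The bug is a race condition.", "A goblin of a bug.",
          "Stack overflow on recursion.", "Null pointer again."]
after = ["A little goblin hides here.", "Goblins in the cache.",
         "The gremlin strikes again.", "Plain answer."]
print(f"goblin: {percent_change(before, after, 'goblin'):+.0f}%")  # goblin: +100%
```

The plural-tolerant word boundary (`\bgoblins?\b`) matters in practice: counting raw substrings would also match unrelated tokens, inflating the rate.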

"At first, the prevalence of goblins didn't look especially alarming," explained a researcher on the team. "But a few months later, the creatures returned in a much more specific and reproducible form with GPT-5.4, triggering another internal analysis."
The investigation eventually surfaced a crucial connection: creature language was particularly common in production traffic from users who had selected the "Nerdy" personality option. This personality accounted for only 2.5% of all ChatGPT responses, but an astonishing 66.7% of all "goblin" mentions appeared in these responses.
The "Nerdy" personality used a system prompt that encouraged playful language: "You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. [...] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed."
This explained why the behavior was concentrated in the Nerdy personality, but not why it also appeared elsewhere. To understand the broader pattern, researchers used Codex to compare model outputs generated during reinforcement learning training that contained goblin or gremlin references with outputs from the same task that did not.
One reward signal stood out immediately: the one designed to encourage the Nerdy personality was consistently more favorable to outputs containing creature words. Across all datasets in the audit, the Nerdy personality reward showed a clear tendency to score outputs with "goblin" or "gremlin" higher than those without, with positive uplift in 76.2% of datasets.
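An audit like this can be sketched as a per-dataset comparison of mean reward for outputs with and without creature words, then counting how many datasets show positive uplift. The names, data format, and exact uplift definition below are a hypothetical reconstruction, not OpenAI's code:

```python
import re

CREATURE_RE = re.compile(r"\b(goblins?|gremlins?)\b", re.IGNORECASE)

def reward_uplift(samples):
    """Mean reward of creature-word outputs minus mean reward of the rest.

    `samples` is a list of (output_text, reward) pairs from one RL
    training dataset. Returns None if either group is empty.
    """
    with_c = [r for text, r in samples if CREATURE_RE.search(text)]
    without = [r for text, r in samples if not CREATURE_RE.search(text)]
    if not with_c or not without:
        return None
    return sum(with_c) / len(with_c) - sum(without) / len(without)

def positive_uplift_fraction(datasets):
    """Fraction of comparable datasets where creature outputs score higher."""
    uplifts = [u for d in datasets if (u := reward_uplift(d)) is not None]
    return sum(1 for u in uplifts if u > 0) / len(uplifts)

# Toy datasets; real samples would be RL rollouts scored by the reward model.
d1 = [("A goblin lurks in the loop.", 0.9), ("The loop is wrong.", 0.6)]
d2 = [("Classic gremlin behavior.", 0.8), ("Off-by-one error.", 0.7)]
d3 = [("A goblin again.", 0.4), ("Careful, precise answer.", 0.9)]
print(positive_uplift_fraction([d1, d2, d3]))  # 2 of 3 datasets show uplift
```

A positive uplift in most datasets, as reported here, is exactly the signature one would expect if a reward model systematically prefers a stylistic tic.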
"The rewards were applied only in the Nerdy condition, but reinforcement learning doesn't guarantee that learned behaviors stay neatly scoped to the condition that produced them," the team explained. "Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data."
This created a feedback loop:
- Playful style is rewarded
- Some rewarded examples contain creature-word tics
- The tic appears more often in rollouts
- Model-generated rollouts are used for supervised fine-tuning
- The model becomes even more comfortable producing the tic
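A toy simulation makes the loop concrete: if outputs carrying the tic get even a small reward bonus, and the top-rewarded rollouts are reused as fine-tuning data, the tic's frequency compounds across generations. The model below is a deliberate oversimplification (uniform base rewards, a single scalar tic rate), not a description of OpenAI's training pipeline:

```python
import random

def simulate_loop(tic_rate=0.05, reward_bonus=0.2, generations=6,
                  rollouts=5000, seed=0):
    """Toy model of the feedback loop.

    Each generation samples rollouts, keeps the top-rewarded half as
    fine-tuning data, and sets the next generation's tic rate to the
    tic frequency of that kept set. Returns the rate per generation.
    """
    rng = random.Random(seed)
    rates = [tic_rate]
    for _ in range(generations):
        scored = []
        for _ in range(rollouts):
            has_tic = rng.random() < tic_rate
            reward = rng.random() + (reward_bonus if has_tic else 0.0)
            scored.append((reward, has_tic))
        scored.sort(reverse=True)
        kept = scored[: rollouts // 2]  # rewarded examples reused for SFT
        tic_rate = sum(1 for _, t in kept if t) / len(kept)
        rates.append(tic_rate)
    return rates

rates = simulate_loop()
print([round(r, 3) for r in rates])  # tic rate climbs generation over generation
```

Even with a modest bonus, selection on reward over-represents tic-bearing rollouts in the kept set, so each generation starts from a higher baseline — the same qualitative dynamic the team describes.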
Further investigation revealed a whole family of unexpected creature words: raccoons, trolls, ogres, and pigeons were identified as other tic words, while most uses of "frog" turned out to be legitimate references to the amphibian.

OpenAI addressed the issue by retiring the "Nerdy" personality in March 2026, after the launch of GPT-5.4. On the training side, they removed the goblin-affine reward signal and filtered training data containing creature words, making such references less likely to appear in responses.
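A data filter of the kind described can be sketched as a regex pass over fine-tuning examples. The word list comes from the tic words named in this article, while the function name and data format are assumptions:

```python
import re

# Tic words surfaced by the audit; "frog" is excluded because most
# uses turned out to be legitimate references to the amphibian.
TIC_WORDS = ["goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"]
TIC_RE = re.compile(
    r"\b(" + "|".join(w + "s?" for w in TIC_WORDS) + r")\b", re.IGNORECASE
)

def filter_training_data(examples):
    """Drop fine-tuning examples whose response contains a tic word.

    `examples` is a list of {"prompt": ..., "response": ...} dicts.
    """
    return [ex for ex in examples if not TIC_RE.search(ex["response"])]

data = [
    {"prompt": "Explain the bug.", "response": "A classic race condition."},
    {"prompt": "Explain the bug.", "response": "A little goblin of a race."},
]
clean = filter_training_data(data)  # keeps only the goblin-free example
```

A blunt keyword filter like this trades recall for precision — it cannot tell a stylistic tic from a legitimate mention, which is presumably why "frog" needed a manual exemption.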
However, GPT-5.5 had already begun training before the root cause was identified. When testing GPT-5.5 in Codex, OpenAI employees immediately noticed the same strange affinity for goblins. The team developed a developer-prompt instruction, which they acknowledge is itself "quite nerdy," to mitigate the issue in Codex. For researchers who want to observe the creature behavior, OpenAI provided a command to launch Codex without the goblin-suppressing instructions.
The goblin problem serves as a powerful case study in how reward signals can shape model behavior in unexpected ways. It demonstrates how behaviors rewarded in one narrow context can generalize to unrelated ones, and it highlights the importance of thorough behavioral auditing during AI development.
"This investigation resulted in new tools for our research team to audit model behavior and fix problems at their root," OpenAI stated. "Understanding why a model behaves in strange ways, and building ways to investigate those patterns quickly, is becoming an increasingly important capability as our systems grow more complex."
As AI systems become more sophisticated, such unexpected behaviors may become more common. The goblin uprising at OpenAI reminds us that even highly advanced models can develop quirks that reveal the subtle ways they've been trained, offering insights into the complex relationship between incentives and behavior in artificial intelligence systems.
For those interested in exploring OpenAI's research further, their System Card provides additional context about their model development process, while their Model Spec outlines their approach to AI alignment and safety.
