A new GitHub experiment probes GPT‑4.1’s ability to generate a random integer between 1 and 100. After 10 000 API calls, the model’s output shows pronounced spikes at 37, 42 and 73, a near‑total avoidance of round numbers, and an unexpected dip at the meme‑favorite 69. The findings suggest that the model inherits human‑like bias from its training data, tempered by safety guardrails, rather than behaving like a fair die.

LLM Randomness Under the Microscope

When developers ask a language model to pick a random number between 1 and 100, the expectation is often that the model will behave like a fair die: each integer should appear roughly 1 % of the time. The repository exmergo/research-chatgpt-guesses-between-1-and-100 challenges that assumption by running 10 000 independent calls to gpt‑4.1 via the OpenAI Responses API and analysing the resulting distribution.

Why the Test Matters

Random number generation is a low‑level primitive in many applications – from simple games to cryptographic protocols. While LLMs are not intended to replace dedicated RNGs, they are increasingly used for creative randomness (e.g., generating story prompts, test data, or design ideas). Understanding whether a model’s “random” output is truly uniform helps developers decide when a model is appropriate for such tasks and when a traditional RNG is safer.

The Experimental Setup

Component	Detail
Model	`gpt-4.1` accessed through the OpenAI Responses API (non‑reasoning mode)
Calls	10 000 independent requests
Temperature	`1.0` – the model's output distribution is sampled without sharpening or flattening
Prompt	Fixed system prompt that forces a single integer answer; each request includes a unique UUID for traceability
Cleaning	Answers outside `[1,100]` are discarded; rejection rate is logged
Baseline	Uniform distribution (1 % per integer)
Statistical test	Chi‑square goodness‑of‑fit (df = 99)

The pipeline follows four stages – collect → clean → transform → stats. Each stage writes its output to a CSV file, so any stage can be re‑run independently. The raw dataset is committed to the repo, so the analysis can be reproduced without an API key (see the Analysis‑only path in the README).
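As a rough illustration of the collect and clean stages, the sketch below builds the keyword arguments for one request and filters raw answers to the range [1, 100]. It is not the repository's code: the prompt wording, the way the UUID is attached, and the names `SYSTEM_PROMPT`, `build_request` and `clean_answers` are all assumptions made for this example.

```python
import re
import uuid

# Hypothetical wording: the article does not quote the repository's actual system prompt.
SYSTEM_PROMPT = "Reply with a single integer between 1 and 100 and nothing else."


def build_request(model: str = "gpt-4.1", temperature: float = 1.0) -> dict:
    """Keyword arguments for one call, e.g. client.responses.create(**build_request())."""
    request_id = str(uuid.uuid4())  # unique per request, for traceability
    return {
        "model": model,
        "instructions": SYSTEM_PROMPT,
        # Assumption: the UUID travels in the input text; the repo may attach it differently.
        "input": f"Request {request_id}: pick a random number.",
        "temperature": temperature,
    }


def clean_answers(raw_answers: list[str]) -> tuple[list[int], float]:
    """Keep answers that are a bare integer in [1, 100]; return them with the rejection rate."""
    kept = []
    for text in raw_answers:
        stripped = text.strip()
        if re.fullmatch(r"\d{1,3}", stripped) and 1 <= int(stripped) <= 100:
            kept.append(int(stripped))
    rejected = len(raw_answers) - len(kept)
    rate = rejected / len(raw_answers) if raw_answers else 0.0
    return kept, rate
```

Keeping the request builder free of network calls means the cleaning logic can be exercised against the committed raw data without an API key.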

What the Numbers Reveal

A Lumpy Distribution

A chi‑square test yields χ² = 15 604 (p ≈ 0), confirming that the observed frequencies deviate dramatically from a uniform expectation. The shape of the distribution mirrors classic human‑bias studies:

37 appears about 4 × as often as a uniform draw would predict (≈ 400 occurrences against an expected 100).
42, the Hitchhiker’s Guide meme, also shows a 4 × uplift.
73 – another culturally‑favoured “random‑feeling” number – is 3.4 × over‑represented.
The five most frequent numbers are 47, 57, 72, 37, 42; three end in 7, echoing the human tendency to favour numbers ending in seven.
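For readers who want to check figures like these against the committed data, here is a minimal sketch of the Pearson chi‑square statistic and the uplift ratio under a uniform baseline. The function names `chi_square_uniform` and `uplift` are hypothetical, and the repository's stats stage may compute these differently.

```python
from collections import Counter


def chi_square_uniform(answers: list[int], low: int = 1, high: int = 100) -> float:
    """Pearson chi-square statistic against a uniform distribution over [low, high]."""
    counts = Counter(answers)
    categories = high - low + 1
    expected = len(answers) / categories  # e.g. 10 000 / 100 = 100 per integer
    return sum(
        (counts.get(k, 0) - expected) ** 2 / expected
        for k in range(low, high + 1)
    )


def uplift(answers: list[int], value: int, low: int = 1, high: int = 100) -> float:
    """Observed count of `value` divided by its expected count under uniformity."""
    expected = len(answers) / (high - low + 1)
    return answers.count(value) / expected
```

With 10 000 answers the expected count is 100 per integer, so a count of 400 contributes (400 − 100)² / 100 = 900 to the statistic, and a count of zero contributes 100. Turning the statistic into a p‑value requires the chi‑square distribution with 99 degrees of freedom, which a statistics library such as SciPy provides.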

Round Numbers Are Shunned

All multiples of ten except 10 received zero selections. Even 10 was chosen only once. Humans typically avoid round numbers when asked for a “random” pick; the model amplifies this aversion.

The Meme Exception: 69

Human participants often over‑pick 69 for its humorous connotation. GPT‑4.1 does the opposite, returning it at 0.29 × the expected rate (≈ 29 occurrences instead of 100). The authors hypothesize that safety guardrails suppress overtly sexual or provocative content, muting this meme.

Counter‑Perspectives

1. Model‑Specific Findings

The study examines only gpt‑4.1. Other LLM families (Claude, Llama, Gemini) may exhibit different bias profiles because of divergent training corpora, tokenizers, or post‑training alignment steps. Generalising the result to “LLMs” as a whole would be premature.

2. Prompt Sensitivity

Changing the prompt wording – for example, asking “Give me a random integer from 1 to 100, no explanations” versus a more conversational request – could shift the distribution. Temperature is also a lever: lowering it toward 0 would concentrate the output on the few most probable answers, while raising it above 1.0 (where the API allows) should flatten the curve somewhat. The experiment fixed both the prompt and the temperature, so neither effect was measured.
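The temperature effect can be illustrated with a toy example, assuming the standard temperature‑scaled softmax. The logits below are invented for the illustration and are not measured from any model; `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Probabilities proportional to exp(logit / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Invented toy logits for four candidate answers; not measured from any model.
toy_logits = [2.0, 1.0, 0.5, 0.0]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability mass lands on the top candidate. At higher temperature the gap between candidates narrows, but the ranking stays the same, so the bias is softened rather than removed.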

3. Not a True RNG

LLMs sample from a learned token probability distribution, not from a mathematically uniform source. The observed bias is a side effect of the model’s exposure to human text that itself contains number‑picking quirks. Expecting perfect randomness from a text‑generation system is a category error.

4. Safety Guardrails vs. Data Bias

If the guardrail hypothesis is right, the under‑representation of 69 shows that alignment interventions can override raw data bias. Conversely, the over‑representation of 42 indicates that some cultural memes survive the safety layer. This tension raises a broader question: how much of a model’s stochastic behaviour is shaped by the training corpus versus post‑training moderation?

Practical Takeaways for Developers

Don’t rely on LLMs for cryptographic randomness – the distribution is demonstrably non‑uniform and subject to hidden biases.
Use LLMs for “human‑flavored” randomness – if you want numbers that feel random to people (e.g., game design, narrative prompts), the model’s bias may actually be desirable.
Control the prompt and temperature – wording tweaks and temperature changes may shift the spread, but this experiment tested only one configuration, so measure the effect rather than assuming it.
Audit other models – before integrating a different LLM into a workflow that depends on stochastic output, replicate a similar experiment to verify its bias profile.
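For the first takeaway, a uniform integer can come from Python's standard `secrets` module instead of a model. This is a minimal sketch; the helper name `uniform_int` is an assumption made for this example.

```python
import secrets


def uniform_int(low: int = 1, high: int = 100) -> int:
    """Uniform integer in [low, high] from the OS's cryptographically secure source."""
    return low + secrets.randbelow(high - low + 1)


print(uniform_int())  # some integer between 1 and 100, inclusive
```

Where security does not matter, as in simulations or games, `random.randint(1, 100)` from the standard library is a simpler choice that is still uniform.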

Where to Explore Further

The full design document (LLM Random Bias Experiment SDD) lives in docs/LLM Random Bias Experiment SDD.md.
Raw results and processed statistics are available under data/raw/ and data/processed/ respectively.
An interactive bar chart built with Exmergo Viz can be explored here.
For a quick reproducible run without spending API credits, follow the Analysis‑only instructions in the README.

Closing Thought

The experiment supports a subtle but important intuition: LLMs echo the statistical quirks of the human language they ingest. When asked to be random, they fall back on the same patterns that make humans predictably “random”. Understanding these patterns helps developers use LLMs responsibly: reach for a conventional RNG when the goal is genuine randomness, and for the model when a touch of human‑like unpredictability is what the task needs.

#LLMs #Randomness #Bias #GPT-4.1 #developer guidance
