A benchmark of 200 images shows that the elaborate GeoGuessr prompt used with OpenAI’s o3 model does not improve geolocation accuracy over a simple prompt, highlighting the need for systematic testing of claimed prompt tricks.

Why the “GeoGuessr” Prompt for OpenAI’s o3 Model Wasn't a Magic Bullet

In April 2025 Kelsey Piper posted a handful of screenshots showing OpenAI’s o3 model pinpointing the location of a beach photo with uncanny precision. The tweet sparked a wave of excitement: developers started sharing their own successful guesses, and many concluded that a carefully crafted prompt was unlocking a hidden capability.

What actually happened?

Piper’s prompt was a 10‑line instruction that framed the task as a one‑round game of GeoGuessr. It warned the model that the image might come from a private backyard, an off‑road trail, or a place not covered by Google Street View, and it asked the model to be aware of its own strengths and weaknesses. The prompt opened like this (roughly the first 10% of the text):

You are playing a one‑round game of GeoGuessr. Your task: from a single still image, infer the most likely real‑world location…

People tried the same prompt on a variety of images and reported impressive results, leading to a belief that the prompt itself was responsible for the performance boost.

Why developers should care

If a prompt can dramatically improve a model’s ability to solve a task, that would be a powerful, low‑cost lever for many projects—especially when the underlying model is a black‑box API. However, without a proper benchmark, it’s easy to attribute success to the prompt when the model was already good at the task.

The benchmark

To separate prompt effect from model capability, I assembled a test set of 200 images from Wikimedia Commons, Geograph Britain & Ireland, and iNaturalist. The set includes typical outdoor shots you’d see in GeoGuessr and a dozen indoor photos for extra challenge.

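Two runs were performed with o3, one per prompt. The distance columns count how many of the 200 images landed within that distance of the true location: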

Prompt	Median error (km)	Mean error (km)	≤25 km	≤100 km	≤500 km	≤1000 km
Default (simple)	83.2	440.7	58	109	176	182
GeoGuessr prompt	102.3	481.9	59	99	172	180

The “simple” prompt was essentially “think carefully about where this picture was taken.” The elaborate GeoGuessr prompt performed worse on both median and mean error and landed fewer images within 100, 500, and 1000 km; its only edge was a single extra image in the ≤25 km bucket (59 vs. 58).
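For readers who want to produce the same kind of table, here is a minimal sketch of the scoring step. It assumes each guess and each ground truth is a latitude/longitude pair in degrees and that error is measured as great-circle (haversine) distance; the article does not say which distance formula was used, so treat that as an assumption. The names `haversine_km` and `summarize` are illustrative, not taken from the original benchmark.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def summarize(errors_km, thresholds=(25, 100, 500, 1000)):
    """Median, mean, and cumulative counts of errors within each threshold."""
    summary = {"median_km": median(errors_km), "mean_km": mean(errors_km)}
    for t in thresholds:
        summary[f"within_{t}_km"] = sum(1 for e in errors_km if e <= t)
    return summary
```

The buckets are cumulative, which matches the table: an image within 25 km also counts toward the 100, 500, and 1000 km columns. Reporting the median next to the mean matters because a handful of guesses on the wrong continent can dominate the mean.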

What the numbers tell us

The model was already competent. Even the basic prompt gave sub‑100 km median error, which is respectable for a single image without any map data.
Prompt length didn’t help. Adding ten lines of instruction increased the model’s compute time only slightly (average latency rose by about one second) but did not translate into better location guesses.
Self‑assessment can be misleading. When asked “did the prompt improve performance?” o3 tended to answer “yes” and even suggested further tweaks, illustrating how easy it is to be fooled by the model’s own confidence.

Lessons for the community

Benchmarks win over anecdotes. A small, reproducible test set can quickly expose whether a prompt truly adds value.
Iterative prompt engineering can create an illusion of progress. If you ask the model to critique its own mistakes and incorporate that feedback, it will often produce plausible‑sounding justifications, regardless of actual impact.
Newer models may lose quirks. When I ran the same test on GPT‑5.4 and GPT‑5.5, their median errors jumped to 163 km and 156 km respectively, showing that whatever gave o3 its edge did not carry over.

What to try next

If you’re curious about hidden capabilities in the models you use, consider the following workflow:

Collect a representative sample of the task inputs (a few hundred items is enough to see trends).
Define a baseline prompt that is short and direct.
Create a “fancy” prompt that adds context, constraints, or role‑playing language.
Run both prompts on the same set, record latency and error metrics, and compare.
Publish the results – community scrutiny often uncovers subtle bugs or biases.
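The workflow above can be sketched as a small harness. This is an illustration under stated assumptions, not the code behind the numbers in this post: `guess_fn` is a hypothetical wrapper you would write around whatever model API you use, `distance_fn` is any function returning an error in km, and the sample format (dicts with `image`, `lat`, `lon`) is invented for the example.

```python
import time
from statistics import median


def run_prompt(prompt, samples, guess_fn, distance_fn):
    """Run one prompt over every sample, recording error and latency.

    samples: list of dicts with "image", "lat", "lon" (ground truth).
    guess_fn(prompt, image) -> (lat, lon); hypothetical model wrapper.
    distance_fn(lat1, lon1, lat2, lon2) -> error in km.
    """
    records = []
    for s in samples:
        start = time.perf_counter()
        lat, lon = guess_fn(prompt, s["image"])
        latency_s = time.perf_counter() - start
        records.append({
            "image": s["image"],
            "error_km": distance_fn(s["lat"], s["lon"], lat, lon),
            "latency_s": latency_s,
        })
    return records


def compare(baseline, fancy):
    """Paired comparison of two runs over the same images in the same order."""
    if [r["image"] for r in baseline] != [r["image"] for r in fancy]:
        raise ValueError("runs must cover the same images in the same order")
    # Negative difference means the fancy prompt was closer on that image.
    diffs = [f["error_km"] - b["error_km"] for b, f in zip(baseline, fancy)]
    return {
        "baseline_median_km": median(r["error_km"] for r in baseline),
        "fancy_median_km": median(r["error_km"] for r in fancy),
        "fancy_better": sum(1 for d in diffs if d < 0),
        "baseline_better": sum(1 for d in diffs if d > 0),
        "ties": sum(1 for d in diffs if d == 0),
    }
```

Pairing the results per image, rather than comparing only the two medians, shows whether one prompt wins consistently or whether the aggregate gap comes from a few images.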

Closing thoughts

The GeoGuessr episode is a reminder that the hype around “prompt engineering” can sometimes mask the underlying strength (or weakness) of the model itself. A well‑crafted prompt is valuable, but its effect should be measured, not assumed.
