OpenAI Claims Gold Medal-Level IMO Performance with General-Purpose AI Model, Sparks Controversy
OpenAI has declared a significant breakthrough in AI reasoning: an experimental language model achieving gold-medal-level performance on the notoriously difficult International Mathematical Olympiad (IMO). According to the company, the model solved five of the six proof-based problems under strict competition conditions – 4.5 hours per session, with no internet access or calculators – a score high enough for gold, a distinction earned by fewer than 9% of human contestants each year.
A Departure from Specialized Systems
OpenAI emphasizes that this model differs fundamentally from prior AI approaches to Olympiad problems, which relied on specialized theorem-proving systems often exceeding human time constraints. Their model, they claim, processed problems as plain text and generated natural-language proofs, functioning like a standard large language model rather than a purpose-built mathematical tool.
"Math is a proving ground for reasoning—structured, rigorous, and hard to fake," OpenAI stated. "This shows that scalable, general-purpose methods can now outperform hand-tuned systems in tasks long seen as out of reach."
This announcement positions OpenAI's results above Google DeepMind's claim last year of a silver-medal-equivalent performance with its AlphaProof and AlphaGeometry 2 models. Notably, Google's systems reportedly required up to three days per problem and relied on human experts to first translate the problems into a formal language.
Controversy Over Verification and Timing
The legitimacy of OpenAI's claim faced immediate scrutiny. The results were not graded by official IMO coordinators; instead, OpenAI convened a panel of three former IMO medalists who had to agree unanimously that each proof was correct. Furthermore, Saturday's announcement appears to violate an explicit embargo request from the IMO Board, which had asked participating AI companies, including Harmonic and Google DeepMind, to withhold results until July 28th to avoid overshadowing the Olympiad's human contestants.
An IMO coordinator reportedly called OpenAI's actions "rude and inappropriate," noting the company "wasn't one of the AI companies that cooperated with the IMO on testing their models." OpenAI researcher Noam Brown countered that they informed an organizer and waited until after the closing ceremony, a claim disputed by the coordinator. The breach prompted Google DeepMind to accelerate its own IMO results announcement.
The Significance of the IMO Challenge
The IMO, running since 1959, is one of the world's most demanding tests of mathematical insight and creative reasoning, with problems that require deep understanding rather than brute computation. For instance, the 2025 Problem 1 asks contestants to determine how many "sunny" lines (lines not parallel to either axis or to the 45º diagonal) can appear among n straight lines that together cover every point of a triangular grid, and to prove that the only possible counts are 0, 1, and 3, regardless of the grid's size.
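For readers who want the precise statement, a paraphrase of the problem in LaTeX notation follows; it is reconstructed from the official wording and should be read as an approximation rather than a verbatim quote.

    % IMO 2025, Problem 1 (paraphrased, not the official text)
    A line in the plane is \emph{sunny} if it is not parallel to the $x$-axis,
    the $y$-axis, or the line $x + y = 0$.
    Let $n \ge 3$ be an integer. Determine all nonnegative integers $k$ such that
    there exist $n$ distinct lines in the plane satisfying both of the following:
      (i)  every point $(a, b)$, where $a$ and $b$ are positive integers with
           $a + b \le n + 1$, lies on at least one of the lines;
      (ii) exactly $k$ of the $n$ lines are sunny.
    The answer contestants must prove is $k \in \{0, 1, 3\}$.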
Implications and Caveats
While OpenAI confirmed its next major model, GPT-5, is "coming soon," it clarified that this experimental model's techniques "will carry forward, but nothing with this level of capability will be released for a while." This strongly suggests the achievement required immense computational resources impractical for current consumer applications.
If the results hold up to independent verification once OpenAI publishes its proofs and grading rubrics, they would represent a potential paradigm shift. Demonstrating that a general-purpose LLM can tackle elite mathematical reasoning under human-like constraints challenges assumptions about the limitations of these models. However, the controversy surrounding the announcement and self-grading underscores the critical need for transparent, independent validation of such high-stakes claims in the rapidly evolving AI landscape. The true impact on the field hinges not just on the claimed capability, but on the robustness of the process that led to it.
Source: Based on reporting by Benj Edwards for Ars Technica.