OpenAI's o3 Model Nears Perfection on Olympiad Math Problems, But Open-Source Alternatives Close the Gap
In the high-stakes arena of AI-driven mathematical reasoning, a new benchmark has been set. The Artificial Intelligence Mathematical Olympiad (AIMO), launched in 2023 to foster open-source AI models capable of solving complex math problems, recently pitted OpenAI's cutting-edge o3-preview model against top performers from its second Progress Prize (AIMO2). The results, detailed in a report from the AIMO Prize team, reveal both astonishing capabilities and narrowing divides in the AI landscape.
The Experiment: A Contamination-Free Showdown
AIMO2, hosted on Kaggle, concluded in April 2025 with over 2,000 teams competing on 50 unreleased problems pitched at national-olympiad difficulty, comparable to the USAMO or BMO. Because the problems were never published, no model could have seen them during training, keeping the evaluation contamination-free. After the competition, AIMO collaborated with OpenAI to evaluate o3-preview, a generalist model not specifically fine-tuned for math, against the top two open-source entrants:
- NemoSkills (Nvidia researchers), which scored 33/50 on Kaggle.
- imagination-research (Tsinghua University and Microsoft Research), scoring 34/50.
Additionally, the "AIMO2-combined" score, representing the best solution from any of the 2,000+ Kaggle submissions per problem, was included for comparison. Evaluations ran under strict isolation to prevent data leaks, with each model restricted to a single final answer per problem.
Performance Breakdown: o3 Shines, But Not Alone
OpenAI's o3-preview was tested across three compute levels:
- Low-compute: Solved 40/50 problems.
- Medium-compute: Solved 43/50.
- High-compute: Achieved 47/50 using a sample-and-rank mechanism, rising to 50/50 when second-ranked answers are also counted (a rough sketch of the approach follows this list).
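The report does not disclose o3-preview's sampling internals, but the general sample-and-rank pattern is straightforward: draw many candidate solutions per problem, score each with a ranking function, and submit the best-ranked answer. The Python sketch below illustrates only that pattern; `generate_candidate` and `rank_score` are hypothetical placeholders, not OpenAI's actual components.

```python
import random

# Hypothetical stand-ins for a model's sampler and answer ranker; the AIMO
# report does not describe o3-preview's internals, so these are placeholders.
def generate_candidate(problem: str) -> dict:
    # One sampled solution attempt; AIMO answers on Kaggle are small integers.
    return {"answer": random.randint(0, 999), "trace": f"reasoning for {problem}"}

def rank_score(candidate: dict) -> float:
    # A real ranker would score the reasoning trace; random stands in here.
    return random.random()

def sample_and_rank(problem: str, n_samples: int = 64, top_k: int = 2) -> list[int]:
    """Draw many candidate solutions, rank them, return the top-k answers."""
    candidates = [generate_candidate(problem) for _ in range(n_samples)]
    ranked = sorted(candidates, key=rank_score, reverse=True)
    return [c["answer"] for c in ranked[:top_k]]

# Grading on answers[0] alone mirrors the 47/50 figure; also accepting
# answers[1] mirrors the 50/50 "including second-ranked answers" figure.
answers = sample_and_rank("example-problem")
```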
Meanwhile, the open-source models were re-evaluated on enhanced hardware (8x H100 GPUs) without Kaggle's constraints:
- NemoSkills improved to 35/50.
- imagination-research reached 35/50.
The AIMO2-combined effort solved 47/50, matching o3-preview's high-compute score. This "pass@2k+" metric counts a problem as solved if at least one of the thousands of submissions answered it correctly, so it upper-bounds the field's collective performance rather than reflecting any single model's capability.
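To make the combined metric concrete, here is a minimal sketch of how a best-over-all-submissions score can be computed: a problem counts as solved if any entry's answer matches the reference. The data layout and names are illustrative, not AIMO's actual grading code.

```python
def combined_score(submissions: dict[str, list[int]],
                   answers: dict[str, int]) -> int:
    """Count a problem as solved if ANY submitted answer matches the reference.

    submissions: problem id -> final answers from every competition entry
    answers:     problem id -> reference answer
    With thousands of entries, this "pass@2k+" number upper-bounds the field's
    collective performance; no single entry actually achieves it.
    """
    return sum(
        any(a == answers[pid] for a in attempts)
        for pid, attempts in submissions.items()
    )

# Toy illustration with made-up problem ids and answers:
refs = {"P1": 42, "P2": 7, "P3": 613}
subs = {"P1": [41, 99], "P2": [7, 3], "P3": [613, 0, 613]}
print(combined_score(subs, refs))  # 2: P1 was missed by every entry
```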
Key Insights and Implications
- Reasoning Gaps: o3-preview excelled on problems like "TRIPAR" and "POLYDI," unsolved by any Kaggle team, but struggled with "RUNNER," which NemoSkills cracked. This suggests domain-specific weaknesses even in state-of-the-art models.
- Compute vs. Cost: Running o3-preview at low compute cost roughly $1 per problem, comparable to renting an 8x H100 node for the open-source models. Yet the top five AIMO2 models combined scored only 38/50, trailing o3's low-compute 40/50 by just 2 points, which suggests the gap in raw reasoning ability narrows considerably once compute budgets are matched.
- Open-Source Progress: Models like NemoSkills and imagination-research, built on openly shared frameworks, now achieve scores within striking distance of proprietary giants. Their technical write-ups, available publicly, accelerate community innovation.
The Bigger Picture: AI Reasoning Evolves
This evaluation, one of the largest contamination-free tests of math AI, demonstrates that generalist models like o3 can near-saturate Olympiad-level benchmarks. However, the rise of collaborative open-source efforts—where diverse approaches from thousands of participants collectively rival elite systems—signals a democratization of advanced AI. As AIMO Prize Manager Simon Frieder notes, the forthcoming AIMO3 in late 2025 will escalate challenges to International Olympiad difficulty, pushing boundaries further. For developers, this underscores a dual trajectory: proprietary models lead in raw performance, but optimized open-source alternatives are closing in, making high-level reasoning more accessible and reproducible.
Source: AIMO Prize