MIT researchers have created MathNet, the largest high-quality dataset of proof-based math problems ever assembled, containing over 30,000 expert-authored problems from 47 countries, 17 languages, and 143 competitions. This unprecedented resource serves both as a training ground for math competition students worldwide and a rigorous benchmark for AI reasoning systems, revealing significant gaps in current AI capabilities even for advanced models like GPT-5.
MathNet represents a monumental achievement in mathematical data collection and organization, addressing a critical gap in both educational resources and AI research. The dataset, developed by researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and HUMAIN, contains more than 30,000 proof-based math problems and solutions—five times larger than any previous dataset of its kind.
The creation of MathNet was driven by a simple observation: every year, countries participating in the International Mathematical Olympiad (IMO) share their most creative and challenging problems, but these collections had never been systematically gathered, cleaned, and made publicly available. As Shaden Alshammari, an MIT PhD student and lead author on the paper, explains, "Every country brings a booklet of its most novel and most creative problems. They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online."

Building this comprehensive dataset was no small feat. The research team tracked down 1,595 PDF volumes totaling over 25,000 pages, spanning digital documents and decades-old scans in more than a dozen languages. A significant portion of this archive came from Navid Safaei, a longtime IMO community figure who had been collecting and scanning competition booklets by hand since 2006. His personal archive formed much of the backbone of MathNet.
What distinguishes MathNet from previous mathematical datasets is not only its scale but its remarkable breadth. While earlier Olympiad-level datasets focused primarily on competitions in the United States and China, MathNet spans dozens of countries across six continents, covers 17 languages, and includes both text- and image-based problems and solutions spanning four decades of competition mathematics. The goal was to capture the full range of mathematical perspectives and problem-solving traditions globally, rather than just the most visible ones.
The sourcing of MathNet is particularly important. Unlike most existing math datasets that pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in these booklets are expert-written and peer-reviewed, often running to multiple pages with authors exploring several approaches to the same problem. This depth provides AI models with a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community-sourced datasets.
For students preparing for mathematical competitions, MathNet represents an invaluable resource. "I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition," says Alshammari, who competed in the IMO herself. The dataset provides a centralized, searchable collection of high-quality problems and worked solutions from diverse mathematical traditions, leveling the playing field for students worldwide.
The researchers have deep connections to the IMO community, with co-author Sultan Albarakati serving on the IMO board. They are working to share the dataset directly with the IMO foundation. To validate the quality and accuracy of MathNet, the team assembled a grading group of more than 30 human evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who coordinated to verify thousands of solutions.
Tanish Patil, deputy leader of Switzerland's IMO team, emphasizes the value of MathNet: "The MathNet database has the potential to be an excellent resource for both students and leaders seeking new problems to work on or looking for the solution to a difficult question. Whilst other archives of Olympiad problems do exist (notably, the Contest Collections forums on AoPS), these resources lack a standardized formatting system, verified solutions, and important problem metadata such as topics and theory."

Beyond its educational value, MathNet serves as a rigorous benchmark for AI performance in mathematical reasoning. The results reveal a more nuanced picture than recent headlines about AI mathematical prowess might suggest. While frontier models have made extraordinary progress—some reportedly achieving gold-medal performance at the IMO—MathNet demonstrates that this progress is uneven. Even GPT-5, the top-performing model tested, averaged only about 69.3 percent on MathNet's main benchmark of 6,400 problems, failing nearly one in three Olympiad-level problems.
The benchmark results expose several consistent weaknesses in current AI systems. When problems include figures, performance drops significantly across all models, highlighting visual reasoning as a persistent challenge. Several open-source models scored 0 percent on Mongolian-language problems, revealing limitations in multilingual capabilities despite overall strength in English and Chinese.
The diversity of MathNet is designed to address a deeper limitation in how AI models learn mathematics. When training data skews toward English and Chinese problems, models absorb a narrow slice of mathematical culture. Problems from different countries may approach the same underlying concept from completely different angles. Exposure to this range, the researchers argue, makes both humans and AI systems better mathematical thinkers.
MathNet also introduces novel benchmarks beyond simple problem-solving. The retrieval benchmark tests whether models can recognize when two problems share the same underlying mathematical structure—a capability crucial for both AI development and the math community itself. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only about 5 percent of the time on the first try, frequently ranking structurally unrelated problems as more similar than equivalent ones.
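The retrieval evaluation described above amounts to a recall-at-1 measurement: embed every problem, find each query's nearest neighbor by cosine similarity, and check whether it is the annotated structural match. The sketch below, with hypothetical toy embeddings (the actual MathNet pipeline and embedding models are not shown here), illustrates the metric:

```python
import numpy as np

def recall_at_1(query_embs, corpus_embs, gold_idx):
    """For each query, check whether its nearest corpus neighbor
    (by cosine similarity) is the annotated structural match."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = q @ c.T                      # cosine-similarity matrix
    top1 = sims.argmax(axis=1)          # best-matching corpus index per query
    return float((top1 == np.asarray(gold_idx)).mean())

# Toy example: 3 query problems, 4 corpus problems, 2-d embeddings.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
corpus  = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-1.0, 0.0]])
gold    = [0, 1, 2]                     # hand-labeled structural matches
print(recall_at_1(queries, corpus, gold))  # → 1.0 on this toy data
```

A score of roughly 0.05 on this metric, as the researchers report for the strongest embedding models, means the true structural match is almost never the top-ranked neighbor.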
The dataset includes a retrieval-augmented generation benchmark, testing whether providing a model with a structurally related problem before asking it to solve a new one improves performance. Results were mixed: DeepSeek-V3.2-Speciale gained up to 12 percentage points with well-matched retrieval, while irrelevant retrieval degraded performance in roughly 22 percent of cases.
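In outline, that benchmark prepends a retrieved problem and its worked solution to the prompt before the target problem. A minimal sketch of such prompt assembly, assuming a hypothetical `build_rag_prompt` helper (not the paper's actual harness), looks like this:

```python
def build_rag_prompt(problem, retrieved=None):
    """Assemble a prompt that optionally prepends a structurally
    related problem and its solution before the target problem."""
    parts = []
    if retrieved is not None:
        parts.append("Here is a related problem and its solution:\n")
        parts.append(f"Problem: {retrieved['problem']}\n")
        parts.append(f"Solution: {retrieved['solution']}\n\n")
    parts.append("Now solve the following problem with a full proof:\n")
    parts.append(f"Problem: {problem}\n")
    return "".join(parts)

# With a well-matched retrieval the model sees a worked analogue;
# with an irrelevant one, the extra context can actively mislead it.
prompt = build_rag_prompt(
    "Prove the target inequality.",
    retrieved={"problem": "Prove a related inequality.",
               "solution": "Apply AM-GM to both sides..."},
)
```

The mixed results reported above reflect exactly this sensitivity: the prompt format is identical either way, so performance hinges entirely on whether the retrieved problem is genuinely analogous.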

The implications of this research extend beyond mathematics education and AI benchmarking. Mathematical reasoning is fundamental to many advanced autonomous systems and robotic applications. The ability to solve complex problems, recognize underlying structures, and apply knowledge across different contexts represents capabilities that would significantly enhance the performance of AI systems in real-world scenarios.
The MathNet project demonstrates the importance of diverse, high-quality datasets in advancing AI capabilities. As autonomous systems become more sophisticated, they will require not just larger datasets, but more diverse and representative ones that capture the breadth of human knowledge and reasoning across different cultures and languages.
The research team includes Alshammari, Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy Director Sultan Albarakati, and MIT CSAIL colleagues Kevin Wen, Mark Hamilton, and professors William Freeman and Antonio Torralba. Their work was funded by the Schwarzman College of Computing Fellowship and the National Science Foundation.
MathNet is now publicly available at mathnet.csail.mit.edu, providing researchers, educators, and students worldwide with unprecedented access to this valuable resource. The dataset promises to accelerate progress in both mathematical education and AI reasoning capabilities, with potential applications extending to fields that require sophisticated problem-solving and analytical thinking.
