A 49‑author collaboration released a 100‑question dataset from a three‑day workshop in Leipzig, showing that state‑of‑the‑art language models can solve most problems after multiple attempts, but two questions still resist even the most capable systems.
What the paper claims
The authors of Benchmarks in Leipzig (arXiv:2606.05818) assembled a curated set of 100 research‑level mathematics questions spanning algebraic geometry, combinatorics, representation theory and related fields. The dataset was created during a three‑day workshop at the Max Planck Institute for Mathematics in the Sciences, with 35 participants contributing questions and checking answers. The authors then evaluated large language models (LLMs) in three stages:
- Stage 1 – a single attempt by five publicly available models: GPT‑4‑Turbo, Claude‑3‑Opus, Gemini‑1.5‑Pro, LLaMA‑2‑70B and Mistral‑Large.
- Stage 2 – a focused 20‑run Monte‑Carlo style evaluation of the three best performers from Stage 1.
- Stage 3 – a “heavy‑thinking” run (three attempts each) with two models that support chain‑of‑thought prompting and external tool use: GPT‑4‑Turbo‑16k with a code interpreter, and Claude‑3‑Sonnet with a built‑in theorem prover.
The headline numbers are:
- After Stage 1, 41 % of the questions were unsolved.
- After Stage 2, that dropped to 16 %.
- After Stage 3, only 2 % (two questions) remained unsolved.
The authors present these results as evidence that “mathematical reasoning capabilities of LLMs are becoming impressive.”
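The headline percentages can be turned into solved counts and stage-to-stage gains with a few lines of arithmetic. The sketch below uses only the three figures quoted above and assumes the percentages are cumulative, meaning a question solved at an earlier stage still counts as solved later; the function and variable names are illustrative.

```python
# Unsolved percentages after each stage, as quoted above (100 questions).
TOTAL_QUESTIONS = 100
unsolved_pct = {"Stage 1": 41, "Stage 2": 16, "Stage 3": 2}


def stage_summary(unsolved, total):
    """Return (stage, solved, newly_solved) tuples in stage order.

    Assumes the unsolved percentages are cumulative across stages.
    """
    rows = []
    previous_solved = 0
    for stage, pct in unsolved.items():
        solved = total - round(total * pct / 100)
        rows.append((stage, solved, solved - previous_solved))
        previous_solved = solved
    return rows


summary = stage_summary(unsolved_pct, TOTAL_QUESTIONS)
for stage, solved, gained in summary:
    print(f"{stage}: {solved} solved cumulatively, {gained} newly solved")
```

Under that assumption, Stage 2 accounts for 25 additional questions and Stage 3 for 14 more, leaving the two that remain open.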
What’s actually new
1. A community‑driven benchmark that targets research‑level math
Most existing math benchmarks (e.g., MATH, GSM8K) focus on undergraduate‑style problems. The Leipzig set deliberately pushes into graduate territory, with questions that require knowledge of recent literature, non‑trivial constructions, and sometimes a proof sketch rather than a single numeric answer. This makes it a useful stress test for models that claim to understand advanced mathematics.
2. Multi‑stage evaluation methodology
Rather than reporting a single pass‑rate, the authors measure repeatability by running each model many times with different random seeds and prompting variations. The 20‑run stage gives a rough estimate of a model’s variance, something rarely reported in other papers. The final “heavy‑thinking” stage also incorporates tool‑use (code interpreter, built‑in theorem provers), showing how external computation can close the gap.
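As a rough illustration of what repeated runs make measurable, the sketch below summarizes per-question outcomes over 20 runs. The data is made up, the question ids are hypothetical, and the rule that a question counts as "solved" if at least one run succeeds is an assumption; this summary does not say which aggregation rule the paper uses.

```python
def summarize_runs(outcomes):
    """Summarize repeated attempts.

    outcomes maps a question id to a list of booleans, one per run
    (True means the run produced a correct answer).
    """
    # Fraction of runs that succeeded, per question.
    rates = {q: sum(runs) / len(runs) for q, runs in outcomes.items()}
    # Questions solved in at least one run (assumed "solved" criterion).
    solved_any = [q for q, runs in outcomes.items() if any(runs)]
    # Questions solved in every run, i.e. reliably solved.
    solved_all = [q for q, runs in outcomes.items() if all(runs)]
    return rates, solved_any, solved_all


# Hypothetical results for three questions, 20 runs each.
example = {
    "q1": [True] * 20,
    "q2": [True] * 5 + [False] * 15,
    "q3": [False] * 20,
}
rates, solved_any, solved_all = summarize_runs(example)
print(rates)
print("solved at least once:", solved_any)
print("solved every time:", solved_all)
```

The gap between the "at least once" and "every time" lists is exactly what a single-pass score hides: a question like the hypothetical `q2` counts as solved under the lenient rule even though most runs fail.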
3. Concrete numbers for specific models
| Model | Solved in Stage 1 (of 100) | Unsolved overall after Stage 3 |
|---|---|---|
| GPT‑4‑Turbo‑16k | 59 | 2 |
| Claude‑3‑Opus | 55 | 2 |
| Gemini‑1.5‑Pro | 53 | 2 |
| LLaMA‑2‑70B | 41 | 2 |
| Mistral‑Large | 38 | 2 |
The two remaining unsolved items were a representation‑theoretic classification problem and a high‑dimensional algebraic‑geometry construction that required a novel insight not present in the models’ training data.
Limitations and open questions
1. Small sample size and selection bias
The 100 questions were chosen by the workshop participants, many of whom are co‑authors. While the authors attempted to avoid obvious “trick” questions, the selection inevitably reflects the community’s current interests and may over‑represent topics that are already well‑covered in public datasets.
2. Dependence on prompt engineering
The reported improvements from Stage 1 to Stage 3 are largely driven by prompt refinements (chain‑of‑thought, self‑consistency, tool invocation). A model that performs poorly with a naïve prompt can appear much stronger after extensive prompt tuning, which raises concerns about reproducibility for users without deep prompting expertise.
3. Lack of ablation on external tools
The heavy‑thinking runs allowed the models to execute Python code and call a symbolic algebra system (SymPy). The paper does not isolate how much of the performance gain comes from the model’s internal reasoning versus the external computation. Future work should report a “model‑only” baseline for a fair comparison.
4. No analysis of failure modes
The two unsolved questions are listed, but the authors provide only brief commentary. A deeper error analysis—e.g., whether the models hallucinate intermediate lemmas, mis‑interpret notation, or simply lack the required background—would be valuable for guiding next‑generation architectures.
Practical takeaways for practitioners
- Prompt engineering matters more than model size – The jump from 59 % solved after Stage 1 to 98 % after Stage 3 was achieved with relatively modest model upgrades combined with repeated attempts and systematic prompting.
- Tool integration is becoming a necessity – The code‑interpreter and theorem‑prover extensions were decisive for the last few hard problems.
- Benchmarks should include variance reporting – The 20‑run approach gives a realistic picture of how often a model will succeed on a given problem, which is crucial for downstream applications like automated proof assistants.
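To make the variance point concrete: with only 20 runs per question, an observed success rate carries wide uncertainty. The sketch below computes a Wilson score interval for a binomial proportion. This is a standard statistical tool chosen here for illustration, not something the paper is stated to use, and the function name and example counts are hypothetical.

```python
import math


def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= successes <= runs:
        raise ValueError("successes must lie between 0 and runs")
    p = successes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical question solved in 10 of 20 runs.
low, high = wilson_interval(10, 20)
print(f"10/20 successes: plausible success rate between {low:.2f} and {high:.2f}")

# Hypothetical question never solved in 20 runs.
zero_low, zero_high = wilson_interval(0, 20)
print(f"0/20 successes: upper bound on success rate about {zero_high:.2f}")
```

Even a question that fails all 20 runs is compatible with a non-trivial true success rate, which is one reason an "unsolved" label from a finite number of attempts should be read cautiously.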
Where to find the data
The full list of 100 questions, along with answer keys and the evaluation scripts, is available in the supplemental material of the arXiv submission: https://arxiv.org/abs/2606.05818. The authors also host a GitHub repository with the benchmark and a simple evaluation harness: https://github.com/LeipzigMathBench/benchmark.
Outlook
The Leipzig Benchmark shows that we are no longer at the stage where LLMs fail on most graduate‑level math problems. However, the fact that two carefully crafted questions still resist even the most capable models indicates that genuine mathematical insight—especially when it requires novel constructions—remains out of reach. Future work will need to combine larger pre‑training corpora, better symbolic reasoning modules, and perhaps a more formal proof‑search backbone to close this final gap.
