MaxProof Claims Gold-Medal Math Olympiad Performance Through Population-Level Test-Time Scaling

A new paper from the MiniMax-M3 team reports a system that scores above the human gold-medal cutoff on both IMO 2025 and USAMO 2026 by combining a generative verifier with tournament-style search over many candidate proofs. The numbers are striking, but the method leans heavily on test-time compute and a verifier whose reliability is doing most of the work.

A paper posted to arXiv on June 11, 2026 (arXiv:2606.13473) describes MaxProof, a framework for generating competition-level mathematical proofs built on the MiniMax-M3 model series. The headline results are the kind that get screenshotted: 35 out of 42 on IMO 2025 and 36 out of 42 on USAMO 2026, both above the threshold that would earn a human contestant a gold medal.

That claim deserves to be read carefully, because proof generation is exactly the domain where benchmark numbers and actual capability tend to drift apart. A model can produce text that looks like a proof, contains the right keywords, reaches the correct final answer, and is still logically broken in the middle. The interesting part of this paper is less the score and more the machinery the authors built specifically to stop that from happening.

What's claimed

MaxProof is described as a "population-level test-time scaling framework." Stripped of the phrasing, that means the system does not try to produce one good proof in one shot. Instead it generates a large pool of candidate proofs and then spends substantial compute at inference time filtering, repairing, and ranking them until one survives.

The M3 model is trained for three distinct skills rather than one. The first is ordinary proof generation. The second is proof verification, judging whether a given proof is actually correct. The third is critique-conditioned proof repair, taking a flawed proof plus a critique of what is wrong with it and producing a fixed version. The authors then merge all three capabilities back into a single released model, so the same weights act as generator, verifier, refiner, and ranker depending on how they are prompted at test time.

At inference, MaxProof runs these roles in a loop. It generates candidates, verifies them, feeds failing ones back through repair, and finally runs a tournament selection to return a single proof. The reported IMO and USAMO scores are the output of this whole pipeline, not of a single forward pass.

What's actually new

The core technical bet is on the verifier. The paper calls it a "defense-in-depth generative verifier engineered for low false-positive rate." In proof grading, a false positive is the dangerous error: the verifier accepts a proof that is actually wrong. If your verifier has a high false-positive rate, then search makes things worse rather than better, because generating thousands of candidates and selecting the ones the verifier likes will reliably surface confident-looking garbage. Search amplifies whatever bias the selector has.

So the engineering emphasis on driving down false positives, even at the cost of rejecting some valid proofs, is the right priority for this setup. It is the difference between a search process that converges on correctness and one that converges on whatever fools the judge. The "defense-in-depth" framing suggests multiple verification passes or layered checks rather than a single judgment, which is consistent with treating the verifier as the component that has to be trustworthy for everything else to work.

The second genuinely useful idea is critique-conditioned repair as a trained capability rather than a prompting trick. Many systems ask a model to "reflect" and fix its own work, with mixed results. Training a dedicated repair behavior that consumes a structured critique is a more disciplined version of that loop, and it pairs naturally with a verifier that can produce those critiques. Generate, find the flaw, describe the flaw, patch the flaw, re-verify. That cycle is the actual product here.

Merging the three capabilities into one released model is also a practical choice worth noting. It means the test-time system is not orchestrating three separate checkpoints, which simplifies deployment even if it complicates training.

Limitations

The results are real numbers on real competitions, but several things temper how much they say about model capability in the usual sense.

First, the score is a property of the full pipeline plus its compute budget, not of the model alone. "Population-level" search implies many candidate proofs per problem and repeated verification and repair passes. The paper does not, in the abstract, quantify how much inference compute produces 35/42, and that number matters enormously for interpreting the result. A system that reaches gold-medal level with a fixed modest budget is a different claim than one that gets there by throwing very large amounts of compute at each problem. Test-time scaling is a legitimate axis of progress, but it shifts cost from training to every single query.

Second, the entire approach rests on the verifier being right. The authors clearly know this, which is why so much of the design targets false-positive rate. The open question is what the residual false-positive rate actually is on hard, novel problems, the exact regime where a verifier is most likely to be fooled. A verifier that is excellent on training-distribution proofs and weaker on genuinely unfamiliar olympiad arguments would still produce good aggregate scores while masking the failure mode that matters most.

Third, IMO 2025 and USAMO 2026 are recent, but contamination and near-contamination are persistent concerns for any model trained near the date of a competition. Olympiad problems generate large volumes of online discussion and solution writeups quickly. The paper would need to address how it isolates genuine reasoning from absorbed solution patterns, and the abstract alone does not settle that.

Fourth, "gold-medal threshold" compares a search system with unlimited time and many attempts against humans working under strict time limits and producing one written proof each. The comparison is rhetorically clean and genuinely impressive on its own terms, but it is not measuring the same thing the medal measures.

The broader pattern

MaxProof fits a direction that has become clear over the past two years: for reasoning tasks with checkable structure, a strong verifier plus search at inference time can extract performance well beyond what the base generator produces unaided. Proof is close to an ideal domain for this because correctness is, in principle, a formal property, and because a flawed proof can often be repaired rather than discarded. The generator does not need to be reliable if the verifier and the search loop are.

That is also the catch. The achievement is increasingly located in the verifier and the orchestration, and verification of natural-language mathematical proofs is itself an unsolved problem. Formal proof assistants like Lean sidestep this by making correctness machine-checkable, at the cost of requiring everything be formalized. A generative verifier operating on natural-language proofs trades that hard guarantee for flexibility, and the quality of the whole system is bounded by how good that trade turns out to be.

The MiniMax-M3 work is a substantial engineering result and the reported scores are notable. The number to watch, once the full paper and any released code are available, is not 35/42. It is the verifier's false-positive rate on problems it has never seen, and the compute cost of reaching that score. Those two numbers determine whether this is a durable advance in machine reasoning or a well-tuned demonstration of how much search can do when you can afford it.

Twitter image