John D. Cook’s experiment shows how far an AI assistant can push a theorem in Lean 4 before the hard parts still demand a human proof engineer.
John D. Cook tested Claude on a Lean 4 formalization task: prove a partial fraction theorem for elements of the fraction field of a principal ideal domain. Claude produced a structured proof after 11 rounds of error messages, but the result still contained five sorry placeholders.
{{IMAGE:1}}
Cook gave Claude a theorem, not a worked proof. The prompt asked for a Lean proof that each element c of the fraction field over a principal ideal domain R can take the form of a finite sum a_i / p_i^{r_i}, with nonassociate irreducible p_i and coprime numerators.
Claude chose a standard algebra plan. It wrote c = a/b, factored the denominator, then used Bezout identities and a Chinese remainder style decomposition to split the fraction into prime-power pieces. That plan helps a human mathematician because it names the right ingredients: lowest terms, factorization, coprimality, and denominator splitting.
The generated Lean code shows a different problem. A mathematical proof sketch can survive gaps that Lean cannot. Claude had to match Mathlib names, typeclass assumptions, imported modules, coercions into FractionRing R, and library lemmas about unique factorization. Cook reports two failure sources: Mathlib refactors that moved or renamed facts, and hallucinated API calls.
The final file defines a PFDSummand structure with a prime, exponent, and numerator. It also defines IsPFD, which records irreducibility, nonassociation, coprimality, and the equality between the list sum and the target fraction. Claude then builds helper lemmas for numerator-denominator representation, Bezout splitting, numerator reduction, and denominator induction.
The strongest part of the result sits in the architecture. Claude identified useful intermediate claims and arranged the proof around UniqueFactorizationMonoid.induction_on_coprime. That choice matches the theorem’s shape: split denominators into coprime factors until only prime powers remain.
The weak parts sit in the exact obligations Lean asks for. Claude left holes in the coprime numerator reduction, the unit-denominator case, the prime-power summand equality, and the cross-list nonassociation checks after appending decompositions. Those gaps do not look cosmetic. They contain the places where a proof assistant forces the author to justify arithmetic equivalence, residue choices, and independence between factors.
The experiment says something useful about AI-assisted formalization. Claude can propose the proof plan, create data structures, search through compile errors, and get close enough that a Lean user can inspect a coherent file. It cannot yet replace the human who knows which lemma closes a coercion problem, which invariant the induction must preserve, and which theorem statement needs stronger hypotheses.
Cook’s session limit also matters. Interactive theorem proving rewards short feedback loops, but a model that burns turns on broken imports or invented names can spend the budget before it reaches the mathematical core. A human collaborator would change tactics: inspect Mathlib, narrow a lemma, prove a smaller claim, or restate the theorem to match existing library machinery.
A stronger workflow would pair Claude with local search over Mathlib and require each generated lemma to compile before it enters the main proof. The model could still suggest the algebra, but the editor would force smaller claims: first prove bezout_split, then prove a prime-power numerator lemma, then prove append preserves nonassociation under the coprime-factor induction.
Cook’s result lands between failure and full automation. Claude did enough work to show mathematical direction and Lean structure. The five sorry holes show where theorem proving still turns on exact library knowledge, careful statement design, and the discipline to make each algebraic step check.
Comments
Please log in or register to join the discussion