When the Proof Almost Builds: Claude, Lean, and the Honest Limits of Machine-Generated Mathematics

An experiment asking Claude to translate six lines of calculus into a formal Lean proof produced something more interesting than success or failure. The result built after eight rounds of error-message ping-pong, but it leaned on axioms and four placeholder 'sorry' markers, raising a real question about what it means for a machine to prove something.

John D. Cook ran a small experiment that deserves more attention than its modest framing suggests. He took six lines of calculus, the derivation showing that a particular Fourier coefficient of Kepler's equation equals a Bessel function, and asked Claude to render that argument as a formal proof in Lean, the interactive theorem prover whose mathematical library Mathlib now encodes a substantial fraction of undergraduate and graduate mathematics. The original calculation lives at the bottom of his post, where the LaTeX source sits, somewhat cleverly, inside the alt tag of an SVG image. Claude read the URL, parsed the mathematics out of that hidden text, and generated a complete Lean file on the first attempt.

The file did not build. What happened next is the part worth thinking about carefully, because it reveals something about the current relationship between large language models and formal verification that neither the optimists nor the skeptics quite capture.

The thesis hiding inside an experiment

The naive reading of this episode is a binary one: either Claude proved the theorem or it did not. Cook deliberately withheld all assistance beyond pasting error messages back into the conversation, and after eight iterations the file compiled. By the strict standard of "does Lean accept it," the answer is yes. By the standard a mathematician actually cares about, the answer is more layered, and the structure of that layering is the real content of the experiment.

The finished proof contains five axioms and four instances of sorry, Lean's keyword for an admitted gap, a hole in the logical fabric that the kernel tolerates but flags. A proof with a sorry in it is, in the formal sense, not a proof of the stated theorem. It is a proof of the theorem conditional on the admitted gaps. So what Claude produced was not a verified result so much as a verified scaffold, a precise accounting of exactly which pieces remain to be filled and which Mathlib lemmas would fill them.

This distinction matters, and Claude itself named it correctly. The model described the output as "the accepted style for roadmap Lean proofs ahead of library coverage." That phrase is doing a lot of work. A roadmap proof is a recognized genre among Lean practitioners: you lay out the skeleton, you state every intermediate lemma with its correct type signature, you connect them with the right tactics, and you leave sorry where a standard but not-yet-formalized result belongs. The value of such an artifact is that it converts an open-ended translation problem into a finite checklist of citable lemmas.

How the proof is actually built

To see why this is more impressive and more honest than a simple pass-or-fail framing allows, it helps to walk through the architecture, because the architecture is where the genuine reasoning lives.

The mathematical target is the identity that the n-th Fourier sine coefficient of the eccentric-anomaly displacement, the difference between the eccentric anomaly E and the mean anomaly M in Kepler's equation, equals (2/n) times the Bessel function J_n evaluated at n·e. This is a classical result, the historical origin of Bessel functions in fact, since Bessel introduced them precisely to study this orbital problem.

Claude structured the Lean file in a way a working mathematician would recognize as sound. It defined the Kepler map M = E − e·sin E directly, then proved, with no gaps at all, that this map has derivative 1 − e·cos E at every point. That lemma compiles cleanly from hasDerivAt_id, hasDerivAt_sin, and the sub and const_mul combinators. It then proved that the derivative stays positive whenever e < 1, using nlinarith fed with the fact that cosine is bounded by one, and from positivity it concluded strict monotonicity via strictMono_of_hasDerivAt_pos. These are real, complete, machine-checked lemmas. The model chose the version of the monotonicity theorem that needs only pointwise differentiability and positivity, sparing itself a separate continuity argument, which is exactly the kind of small economy an experienced Lean user develops a feel for.

Where the proof becomes a roadmap is at the inverse function. The eccentric anomaly is the inverse of the Kepler map, and proving that inverse exists, is differentiable, and has the right derivative would require invoking the inverse function theorem and threading it through Mathlib's machinery for local inverses. Rather than attempt that, Claude axiomatized the inverse: it declared eccAnom as an axiom along with five properties, that it satisfies Kepler's equation, that it has the expected derivative, and its boundary values at 0 and π. Every one of these is a true consequence of the inverse function theorem applied to the monotone smooth map already established. Axiomatizing them is intellectually honest in a way that matters: the model isolated precisely the analytic fact it was choosing not to formalize, stated it correctly, and made it visible rather than smuggling it into a tactic block.

The four sorry markers then cover the integration steps: the integration by parts that moves the derivative off the displacement and onto the cosine, the vanishing of the leftover integral of cos(nM) over [0, π] for integer n, the integrability side condition that the by-parts step needs, and the change of variables M ↦ E that turns the integral into the Bessel integral representation. For each one, the file names the specific Mathlib lemma that closes it: integral_deriv_mul_eq_sub_of_hasDerivAt for the by-parts, integral_cos together with Real.sin_nat_mul_pi for the vanishing integral, integral_image_eq_integral_abs_deriv_smul for the substitution. The final two steps, recognizing the Bessel integral and combining the constants, are fully proved; the constant-juggling at the end closes with a plain ring.

The implication: verification as conversation

What this experiment surfaces is that the useful unit of work is not the finished proof but the iterative loop. Cook fed errors back eight times and offered nothing else. The model's ability to read a Lean error, locate the offending term, and revise without human mathematical insight is the actual capability on display. Lean's type checker is an unusually unforgiving oracle. It does not hallucinate agreement. When it accepts a term, that term is correct relative to whatever axioms and sorrys remain, and there is no rhetorical wiggle room. This makes the error-message loop a tight, grounded feedback channel of exactly the kind that language models otherwise lack.

That grounding is the deeper point connecting this to a broader pattern in machine-assisted mathematics. The recurring weakness of language models on mathematics is that they produce fluent arguments that sound right and are wrong, with no internal mechanism to tell the difference. Pairing a model with a proof assistant replaces the model's self-assessment with an external, infallible judge. The model proposes; the kernel disposes. Projects like DeepMind's AlphaProof and the various Lean-based autoformalization efforts all exploit this same structural arrangement, and Cook's hobby-scale experiment is a miniature of the same idea. The trustworthiness comes not from the model but from the verifier the model is forced to satisfy.

The remaining honest gap is autoformalization of the library-coverage frontier. Three of the four sorrys exist not because the mathematics is hard but because the relevant lemmas, in exactly the form needed, may not yet sit in Mathlib in a directly applicable shape, or because connecting the abstract measure-theoretic substitution lemma to a concrete interval integral takes more glue than a few iterations allow. The model knew which lemmas to cite, which is itself a nontrivial retrieval feat across a library of tens of thousands of results. Bridging from "I know the lemma name" to "the types unify and the side conditions discharge" is where the genuine difficulty concentrates, and it is precisely the part the roadmap defers.

Counter-perspectives worth holding

A skeptic could fairly say that a proof with five axioms and four holes is not a proof, and the skeptic would be technically right. The axiomatized inverse function is not a minor convenience; in a different problem an axiomatized object could quietly encode a falsehood, and the whole edifice would prove nothing. The discipline that protects against this is exactly the discipline of reading what was axiomatized and checking that each axiom is true, which returns the burden of mathematical judgment to a human. The machine did not remove the need for a mathematician; it relocated that need to the auditing of five clearly stated axioms, which is a far smaller and more tractable task than verifying six lines of analysis by hand, but a task nonetheless.

An optimist could counter that this relocation is the entire value proposition. A roadmap proof that correctly names every missing lemma is a work order. A graduate student or a future model can pick it up and close the gaps mechanically, and the hard creative act, deciding to axiomatize the inverse rather than fight it, choosing the pointwise-derivative form of the monotonicity lemma, recognizing the Bessel integral at the end, has already been done. The eight iterations compressed hours of a human's translation labor into a short conversation.

Both readings are correct, which is why the experiment is more instructive than a clean success would have been. It shows a system that is genuinely good at the architecture of a formal proof and still dependent on a human or a verifier for the closing of its admitted gaps. The interesting frontier is not whether models can write Lean, since they plainly can, but how much of the gap-closing can be pushed into the same grounded feedback loop that produced the scaffold. For anyone wanting to try the same thing, the building blocks are public: Lean and Mathlib are open source, the community documentation is extensive, and the #min_imports command Claude thoughtfully suggested at the bottom of the file will report the minimal import set once every sorry is gone. The proof that almost builds is, in the end, a fairly precise map of the distance still to travel.