A deep dive into Lean 4's theorem-proving capabilities through implementing and verifying Fizzbuzz, comparing it with Dafny's approach to program verification.
This is part of an ongoing introduction to Lean 4 series:
--> Part one - theorem-proving basics
Part two - static bounds checks and dependent types
Part three - completing the spec with tactic combinators
Intermezzo - equality proofs between different Fizzbuzzes
Part four - proof-carrying code
All previous Proving The Coding Interview posts can be found here.
Back in grad school, I ran a directed reading group for undergrads interested in the intersection of systems and PL research. Most weeks we just discussed papers, but the final session of the semester was special: I'd tell them, "Okay, all term you've seen what others have built; today, let's write and verify some programs of our own!" So, we'd sit in a conference room with a text editor and a box of donuts, and bang out a few verified insertion sorts in the various languages used in the semester's papers. (Looking back, it's a bummer that this was as close as I ever got to actually teaching in the program, and that this was one of the few genuinely-enjoyable parts of the PhD.)
Repeatedly writing the same program in different languages was always illuminating, because we'd get tripped up on wildly-different parts of the implementation, depending on what the given language's strengths and sharp corners were.
In an earlier installment, we implemented everyone's least-favourite interview question, Fizzbuzz, in the Dafny language. We saw how natural it was to program in Dafny: we programmed with loops, common datatypes and objects, and mutable variables, just like in a Real Language™, and annotated our code with logical propositions (our program specification). Dafny's automated prover would do the rest: it would attempt to verify our program and either give us the Big Okay, or error out with a counterexample indicating how our implementation diverged from the spec.
It felt like magic, but things are never as simple as they seem. We had to be careful to:
Explicitly state loop invariants so that the compiler would understand what changed (and what stayed the same!) between iterations;
Manually write common routines (like toString functions, notably) that the standard library didn't provide for us;
Avoid writing specs that might cause the underlying automated solver to time out.
What if it didn't have to be this way?
The Lean programming language
Lean is a fundamentally different kind of programming language from Dafny: it's both a programming language and an interactive proof assistant. (If you've heard of languages like Rocq (formerly Coq), Idris, or Agda, it's definitely cut from the same cloth.) Like Dafny, it's actively used in industrial settings like the AWS Automated Reasoning group, and has also found its way into highly-theoretical settings like mathematics research.
In theory, Lean's design avoids all of Dafny's sharp corners:
Don't fall into the trap of thinking that purity is a moral judgement. All "pure" means is that functions don't have side-effects like mutating global variables, raising exceptions, or I/O. Don't be like my ex-coworker who compared non-pure programming to the antivaxxer movement.
Because it's a pure functional language, with only recursion and no mutable state, we don't have to worry about writing tricky loop invariants (we can't even syntactically write a loop!);
There's enough of a third-party ecosystem in Lean that we shouldn't have to reinvent the wheel for simple helpers, like we had to do with our Dafny toString() function;
As a proof assistant, Lean doesn't attempt to automatically prove your programs correct as Dafny does, but relies on the programmer to write and prove theorems about them. This means there's no SMT solver with unstable performance characteristics to worry about, as you may have noticed if you played with our Dafny fizzbuzz program yourself. Once your theorems and proofs are written, verifying your program boils down to just typechecking it.
Of course, it also stands to reason that what was easy to do in Dafny might be problematic in Lean. Shall we see if that's in fact true…?
In this post, we'll introduce the basics of Lean, both as a programming language and as a theorem prover. We'll implement fizzbuzz in Lean-the-programming-language and specify parts of its behaviour in Lean-the-theorem-prover, learning some simple proof tactics along the way. By the end of this series, you'll have enough experience with dependent types and theorem provers to verify simple programs, probably equivalent to the first several chapters of the excellent Rocq textbook Software Foundations Vol. 1. No prior experience with theorem proving is assumed, though you should probably read the earlier Proving The Coding Interview Dafny series if you're unfamiliar with Dafny, since I'll be contrasting the two languages fairly extensively. (Plus, Dafny is very different but nonetheless very neat, anyway.)
If you're familiar with similar languages like Rocq, you may want to skip straight to the next part in the series since most of what this post discusses will already be familiar to you.
Our problem statement, revisited
Fizzbuzz as a problem has not become any less familiar nor any less irritating since my Dafny post got published all those years ago. But, just to remind us what we're trying to do here:
First, construct a list of strings of length n, such that the following hold for all values i from 1 through to n:
If i is a multiple of 3 (but not of 5) then the i-1th element is the string "Fizz";
If i is a multiple of 5 (but not of 3) then the i-1th element is the string "Buzz";
If i is a multiple of 15 then the i-1th element is the string "Fizzbuzz";
Otherwise, the i-1th element equals the string representation of i.
Then, print out, separated by a newline, every element in the list.
Enough of you got on my case about the Dafny specification being zero-indexed that I've adjusted it here - now I have to think hard about avoiding off-by-one errors, since the value i resides at index i-1. The good news, though, is that verifying our implementation against this specification can help me get this right!
A note to the Lean experts: I'm new to the language and most of my proof assistant experience has been in Coq, so I'll be avoiding opaque crunch-like automated tactics and tacticals in favour of clear and explicit steps. Please don't get too upset if my solution here is not super idiomatic or character-count optimised!
A starting non-solution solution
Another note to the Lean experts: I'm conflating = (propositional equality) with == (boolean equality) in this implementation because I don't want to get too in the weeds with beq_iff_eq bridges.
If you're comfortable with more, ahem, "popular" functional languages like Scala or OCaml, you might write Fizzbuzz something like this:
def fb_one (i : Nat) : String :=
  if i % 15 = 0 then "Fizzbuzz"
  else if i % 5 = 0 then "Buzz"
  else if i % 3 = 0 then "Fizz"
  else Nat.repr i

def fizzbuzz (n : Nat) : List String :=
  List.range' 1 n |> List.map fb_one
As with the Dafny example, I don't expect you to pick up every piece of syntax, but: this defines the function fizzbuzz with one argument n of type Nat, and produces a List of Strings (this polymorphic type is written List String; imagine that String was in angle-brackets like it'd be in a more conventional syntax). Following the := is the body of the function, which calls a helper function fb_one. In case it wasn't clear by now, when you call a function you don't wrap the arguments in parens, so if in C you'd write f(x, y, g(z)), in Lean you'd simply write f x y (g z). This function first creates the List Nat expression [1, 2, 3, 4, ..., n] and then "pipes" each number into the string-creation helper function. So, fizzbuzz 5 would produce ["1", "2", "Fizz", "4", "Buzz"], as we would hope it would. (You can type #eval fizzbuzz 5 in your Lean file and the IDE will show the result of that computation.)
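If you'd like to check this for yourself, here are a couple of quick evaluations (the results appear in the editor's infoview; the extra check on fb_one is just mine, for illustration):

#eval fizzbuzz 5   -- ["1", "2", "Fizz", "4", "Buzz"]
#eval fb_one 15    -- "Fizzbuzz"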
As a child of the How To Design Programs curriculum, I like explicitly separating out and giving a name to the helper that we pass to List.map, as opposed to just writing it as an anonymous function in the body. In HtDP, this was so we could more easily write unit tests, but here, of course, the hope is that a more decomposed implementation will also allow a more decomposed specification.
If you're interested in typing along with this post, great! You'll want to be sure to include mathlib4 as a dependency for your Lean project, since it includes theorems about modular arithmetic that we'll want to make use of.
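For reference, here's a minimal lakefile sketch based on the standard Lake template as I understand it (the project and library names are placeholders of my own; the mathlib4 README documents the currently recommended setup, which newer toolchains express in lakefile.toml instead):

import Lake
open Lake DSL

package «fizzbuzz» where
  -- package configuration options go here

require mathlib from git
  "https://github.com/leanprover-community/mathlib4"

@[default_target]
lean_lib «Fizzbuzz» where
  -- library configuration options go here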
Our first theorem
In fact, why don't we codify a fact like "we expect fizzbuzz 3 to be some particular list of elements" by stating the above #eval more formally! In a lot of languages we might write an assert() or a unit test or something; here, we'll do it in Lean's proof system by stating it as a theorem:
theorem fizzbuzz_thm : fizzbuzz 3 = ["1", "2", "Fizz"] := rfl
In VS Code, I see a little blue check next to the theorem, indicating that Lean has validated it. (If I changed the final element of the list, or broke something in fizzbuzz, I'd, as you'd expect, see an immediate and reasonable-sounding error in my IDE: the left-hand side fizzbuzz 3 is not definitionally equal to the right-hand side ["1", "2", "Buzz"].)
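For instance, here's what that deliberately broken variant would look like (the name is mine):

-- This does NOT typecheck: fizzbuzz 3 evaluates to ["1", "2", "Fizz"], so the
-- two sides aren't definitionally equal and Lean rejects the rfl.
theorem fizzbuzz_thm_broken : fizzbuzz 3 = ["1", "2", "Buzz"] := rfl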
Before going any further, you should take a moment and think about how you would convince yourself that this equality always holds. (It probably involves mentally evaluating the left-hand side and confirming the final result happens to be the right-hand side.) While it's pretty easy to see that it's the case here, and would still be easy to see even if fizzbuzz was written in Python or Java or something, in more complicated functions you never know if some non-obvious piece of global mutable state will ultimately change the returned value between different function calls. I once worked at a company whose large C codebase had a 2000-line-long function void poke(void) - clearly the only thing that function could ever do is read and modify global variables!
Lean's functional purity means such mutable state fundamentally can't exist anywhere in the program, so the function's output can only vary based on its inputs. For formal proofs about our programs, this is an incredibly nice property to have.
So what is a theorem, really?
Calling it a "theorem" certainly sounds fancier than a test, but how is it actually different from one? Here's the kicker: no part of this definition has any runtime semantics; you can think of it as invisible "ghost code" (which should sound familiar to you from Dafny). The proposition that this theorem states, fizzbuzz 3 = ["1", "2", "Fizz"], might at first glance look like a boolean expression, but just like with Dafny's invariants, it's not! Look where it resides in the theorem definition, between : and :=. In the function definition earlier, this was where the return type went. And with good reason: the proposition fizzbuzz 3 = ["1", "2", "Fizz"] is, in Lean, the type of the theorem!!
That a type can look like this might feel ridiculous if you've never seen a type system like Lean's before. In most languages that, well, we can make money programming in, a type doesn't tell you much more than the set of possible values that a variable can have. If you're used to low-level programming, maybe you think of a type more in terms of its bitwise representation in memory. Here, though, we're thinking about a type as being the same as a logical proposition. Edwin Brady (and probably others, but I learned it from his book) draws a nice comparison between types and tests: a test can demonstrate the presence of a bug, but a type can demonstrate the absence of a bug. (Of course, this is contingent on having a type system expressive enough to state interesting facts about your program. int and String might technically be "logical propositions" in Java, but they're not logically proposing anything to us.)
You might be wondering why, if we prove theorems through its type system, Lean doesn't seem to have much in the way of type inference. It turns out that its type system is too expressive to support it - it's in fact proven to be formally undecidable! This is why we, the programmer, need to help Lean along by writing proofs ourselves. Because Dafny's type system (and underlying prover) is not as expressive as a fully-dependent type system like Lean's, we give up the ability to articulate certain propositions but we gain the ability for Dafny to automatically deduce more aspects of their correctness. I find this tradeoff really fascinating and there is lots of interesting research, like liquid type systems, that plays in this space.
Back to figuring out the syntax of our theorem: in the function definition, what followed after := was the function body. Here, it's our explanation – our proof – of why this theorem is correct. If you want to think of this theorem as "a function of zero arguments, that evaluates to rfl", that's not a terrible way to do it for now. It's certainly not far off from how Lean's typechecker thinks about it, at least.
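In fact, nothing stops you from writing it exactly that way (fizzbuzz_thm' is a name I've just made up; Lean is happy with a def whose "return type" is a proposition):

-- Morally the same thing as the theorem above, written as an ordinary definition.
def fizzbuzz_thm' : fizzbuzz 3 = ["1", "2", "Fizz"] := rfl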
The good news is that this is an easier process than writing, say, first-year discrete math proofs, because even though you have to write the proof yourself, Lean will tell you if you made a mistake in the proof, like assuming something that isn't true or missing an edge case in a case analysis or something.
So what is a proof, really?
We can look at the definition of rfl in Lean's frankly excellent documentation to understand what the body of the theorem – the theorem's proof – is doing:
rfl : a = a is the unique constructor of the equality type. This is a more powerful theorem than it may appear at first, because although the statement of the theorem is a = a, Lean will allow anything that is definitionally equal to that type. So, for instance, 2 + 2 = 4 is proven in Lean by rfl, because both sides are the same up to definitional equality.
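To see the docs' claim in action (the theorem names here are mine):

-- Both goals close by pure computation: each side reduces to the same value.
theorem two_plus_two : 2 + 2 = 4 := rfl
theorem nine_mod_three : 9 % 3 = 0 := rfl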
Huh, so rfl is a constructor for a value of type a = a! Again, a = a doesn't look like the sort of type we're used to, so if you want to mentally substitute the traditional polymorphic IsTheSame<T, T> type until you're more familiar, that's fine. So, in Java, the body of an equivalent "theorem" might be return new IsTheSame<>() or something like that. This is a theme – as well as a fundamental result in theoretical computer science – that we'll keep coming back to: a type states a logical proposition, and the proof of that logical proposition takes the form of some computation, like the body of a function that returns a value of that type.
I always like getting my bearings by seeing what we could express in a not-fancy type system. It's pretty easy to imagine, in say Java, getting an IsTheSame<Employee, Employee> to typecheck - it makes sense that any class (irrespective of the equality of any objects of that class) is identical to itself. If we implemented IsTheSame with variance, it'd also be reasonable to expect IsTheSame<Employee, Person> or IsTheSame<Employee, Student>, since they presumably share a common superclass. What we certainly can't do in Java is to write what we just wrote in Lean: whatever goes between the angle brackets in Java needs to be a type, not an arbitrary program expression (which PL people typically call "terms"), so IsTheSame<2 + 2, 4> would for sure be totally nonsensical. Because in most languages types and terms occupy different syntactic domains, this wouldn't even parse, much less typecheck. Clearly what Lean's type system is doing is blurring the lines between types and terms somehow. We'll have a lot more to say about this down the road.
So, if we're writing a theorem whose goal it is to prove that two things are equal to each other, that will probably mean we'll be using rfl at some point in its proof. (In general: two things might be equal but not definitionally so, so we might need to first transform the left and right-hand sides ourselves to get them into a state that rfl can handle.)
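A classic instance of "equal, but not definitionally so": core Lean defines addition on Nat by recursion on the second argument, so n + 0 reduces to n on its own, but 0 + n does not:

example (n : Nat) : n + 0 = n := rfl   -- reduces definitionally
example (n : Nat) : 0 + n = n := by
  rw [Nat.zero_add]                    -- rfl alone won't do; we lean on an existing lemma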
When rfl isn't enough
Here's another theorem that we might want to write: let's suppose we want to validate that the list that fizzbuzz 3 produces isn't empty. We might start the theorem like this:
theorem fb_of_3_len_nonzero : 0 < (fizzbuzz 3).length := -- TODO: finish me
rfl is all about proving equalities, but what we have here is an inequality, so it's not going to help us. In order to complete this proof, we're not going to directly return a type constructor like rfl, but instead use Lean's tactic system. Tactics are like "commands" in Lean that allow us to transform types into (hopefully) simpler ones. So, a tactics-oriented proof looks like a sequence of rewrites. To enter tactics mode, we use the by keyword in the theorem body:
theorem fb_of_3_len_nonzero : 0 < (fizzbuzz 3).length := by
And the IDE now shows us, in its "tactics state" panel:
1 goal
⊢ 0 < (fizzbuzz 3).length
This is sometimes called the proof context. Right now we just have one goal (the expression that follows the sideways-T, the turnstile ⊢), but in more complicated cases we might have several goals to handle individually, or the context might list hypotheses that have been introduced to us. This is the first example of a common theme here: Dafny figured out automatically what we have to explicitly state in Lean. For this cage-match to not be utterly one-sided, we'd better make sure that we get something in return for this tradeoff.
To prove a goal means to simplify it down, step by step, to something that Lean axiomatically knows to be True. (We did this earlier with rfl; having an equality where both sides are "the same" is enough to satisfy the proof assistant.)
A note: true is a value of type Bool, the usual boolean type. True is, by contrast, a logical proposition (that is, a type whose type is Prop).
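You can ask Lean directly (the outputs in the comments are what the infoview shows me; formatting may differ slightly in your setup):

#check true      -- true : Bool
#check True      -- True : Prop
#check (0 < 3)   -- 0 < 3 : Prop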
Tackling the proof with some basic tactics
So now we're kind of on our own! It's now up to our intuition to transform this goal into something that's "obviously" true. This is a problem if you are new to proof assistants and haven't yet developed your intuition! Here's how I proved this theorem: I started with the following proof and proof context:
theorem fb_of_3_len_nonzero : 0 < (fizzbuzz 3).length := by

1 goal
⊢ 0 < (fizzbuzz 3).length
That is, the goal starts out looking exactly like the theorem we're trying to prove. Where can we go from here?

rw rewrites the goal based on a known equality
What can we say about what fizzbuzz 3 evaluates to? Well, by the theorem we wrote earlier, we have a proof of equality about its value. So, we can use the rw tactic with fizzbuzz_thm to transform the goal into:
theorem fb_of_3_len_nonzero : 0 < (fizzbuzz 3).length := by
  rw [fizzbuzz_thm] -- NEW

1 goal
⊢ 0 < ["1", "2", "Fizz"].length
As a rough heuristic, usually, but not always, if your goal is getting smaller or simpler, you're on the right track. A theorem that helps us prove another one is sometimes called a lemma. Rewriting with fizzbuzz_thm has banished the function call away, leaving us with the raw list expression (that we then take the length of). This feels like pretty good progress so far!
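From here, the remaining goal is a closed, decidable statement, so one way to finish it off (one option among several; simp would also work) is the decide tactic, which asks Lean to simply compute the answer:

theorem fb_of_3_len_nonzero : 0 < (fizzbuzz 3).length := by
  rw [fizzbuzz_thm]
  decide   -- computes ["1", "2", "Fizz"].length and checks that 0 < 3 holds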
A single concrete length fact is nice, but the more useful statement is the general one: fb_length_n, which says that (fizzbuzz n).length = n for every n. Starting that proof with unfold fizzbuzz exposes the List.map call, and rewriting with List.length_map (the length of a mapped list equals the length of the underlying list) leaves us with another plausible goal: (List.range' 1 n).length = n. And good news: with a bit more proof search we've found a nice property about List.range'. In essence, it states (range' s n).length = n. A match! Once you fill in the arguments for s and n, this is exactly what's left in our goal, so this completes the proof. We can group multiple adjacent rewrites onto one line like this:
rw [List.length_map, List.length_range']

Either way, here's the full theorem along with the final tactic state:

theorem fb_length_n (n : Nat) : (fizzbuzz n).length = n := by
  unfold fizzbuzz
  rw [List.length_map]
  rw [List.length_range'] -- NEW

0 goals
Goals accomplished!
Usually, when students start learning about automated theorem provers, they define inductive datatypes like List, write recursive functions that operate on values of that type, and then write proofs by induction to state things about those functions (So much so that "uh, how do we start this proof? INDUCTION!" becomes a running joke). I'm intentionally taking a different strategy here: By being a consumer of the standard library's List, not only did we not need to write a recursive function to build up the list, we could prove properties about our functions by composing already-made theorems together. This division of labour is super powerful and pays dividends when it comes time to write our theorems. Functional programming often gets unfairly maligned as "just being a bunch of recursion", but clearly "real-world" FP doesn't have to feel this way. (If you know what the State monad is, you also know that the "FP is annoying because every value is immutable" canard doesn't have to be entirely true either. We'll get there soon enough.)
I don't know how long this series will go, but it wouldn't surprise me if we finish the final installment without having written a single recursive function or proof by induction.
Simplifying proof-writing with automation tactics
We had a slightly annoying proof-writing loop just now: our path to the goal was clear but we had to hit a search engine to figure out the exact name of the theorem we wanted to use. Lean is able to take over this process for us in simple cases with the simp tactic: it tries to simplify the goal down by exploring some possible theorems to apply or rewrite with. Trying it out is a good first strategy when you're not sure how to proceed. Interestingly, there's also a tactic called aesop which runs a much more aggressive automated proof search, giving something closer to Dafny's push-button experience (though, unlike Dafny, it's a proof-search procedure rather than an SMT solver). I've never used it but maybe it's useful in some cases!
Using automation tactics can make your proof a bit more opaque if you're not sure exactly how your goal changed. The variation simp? will report which theorems it ended up using, as a suggestion you can paste into your proof. (There's no guarantee that simp will find an efficient or intuitive path, though!)
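For example, here's a hedged sketch of the length theorem again, this time leaning on simp (whether one line suffices depends on which lemmas your Mathlib version tags with @[simp]):

theorem fb_length_n' (n : Nat) : (fizzbuzz n).length = n := by
  simp [fizzbuzz]   -- if this stalls, try simp [fizzbuzz, List.length_range']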
Thinking in Lean means thinking in types, which we should all do anyway
We could proceed with trying to write some theorems about, say, every third, fifth, and fifteenth element of fizzbuzz n, but we're leaving some complexity on the table by doing so. Think about what each of those theorems would have to state and how we'd have to try to prove them: "Every 5th index in the produced list contains a string that ends, case-insensitively, in 'buzz'" would involve some annoying list operations and substring slicing. This sounds brittle to get right.
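To make that concrete, here's roughly what such a statement might look like (my own sketch, simplified to exact string equality rather than a case-insensitive suffix check, and left unproven; the pile of side conditions is exactly the brittleness I mean):

-- If 1 ≤ i ≤ n and i is a multiple of 5 but not of 3, then slot i-1 holds "Buzz".
theorem every_fifth_buzzes (n i : Nat)
    (h5 : i % 5 = 0) (h3 : i % 3 ≠ 0) (hlo : 1 ≤ i) (hhi : i ≤ n) :
    (fizzbuzz n)[i - 1]? = some "Buzz" := by
  sorry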
Let's put aside all this fancy theorem proving stuff and just look a bit at the type signature of our fizzbuzz function: it consumes a Nat and produces a List String (we write this function type with an arrow between each argument and the return type, so: Nat -> List String).
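You can confirm the signature in the editor with #check:

#check fizzbuzz   -- fizzbuzz : Nat → List String
#check fb_one     -- fb_one : Nat → String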
Here's a thought experiment: in what ways could a broken Fizzbuzz implementation still satisfy this signature? I can think of at least a few different ways:
It could return a list of length unequal to the input value. (In other words, it would violate the theorem we just wrote.) In the worst case, it could just unconditionally return the empty list, as sketched after this list! The type checker might be satisfied but we certainly wouldn't be.
The strings in the list might not relate to the Fizzbuzz problem. I could produce the first n words of Jabberwocky and, in the absence of any theorems stating otherwise, Lean would be perfectly happy with it.
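For instance, this deliberately useless definition (my own) typechecks without complaint:

-- Well-typed, completely wrong: the signature Nat → List String doesn't rule this out.
def fizzbuzz_broken (_n : Nat) : List String := []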
By contrast, one way this function can't be broken: we can't pass a negative number to it, because the input argument's type is not Int but rather Nat. And, the good news is that we were at least able to write a theorem guaranteeing that the first situation couldn't happen. So, we are somewhat taking advantage of Lean's cool features. Nonetheless, it kind of feels like we're being a bit sloppy with our type signatures and are hoping that our theorems will pick up the slack for us. This might not be the best way to program in Lean, and it might not even be the best way to program in Scala or OCaml (or Java (or Python!)!)
Perhaps the more "type-friendly" way to write this function would be to:
Produce not just an ordinary List of values, but a special collection type that knows its own length at the type level, so, just like with add_one_pos, the return type can depend on the value of the function argument;
Encode the elements returned not as strings but as a finite type that forces a "fizz", "buzz", "fizzbuzz", or number value.
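As a rough preview of what those two changes might look like (all names here are my own invention, and the next post may well set this up differently):

-- A finite type that can only ever hold the four legal kinds of entry.
inductive FizzBuzzEntry where
  | fizz
  | buzz
  | fizzbuzz
  | num (n : Nat)
deriving Repr

-- One way to make the length part of the type: bundle the list with a proof of
-- its length. (Mathlib's Vector type would be another natural choice.)
def fizzbuzz' (n : Nat) : { l : List FizzBuzzEntry // l.length = n } :=
  sorry -- to be filled in next time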
So, we'll make a modification to our English-language specification:
First, construct a list of values, of length n, such that the following hold for all values i from 1 through to n:
If i is a multiple of 3 (but not of 5) then the i-1th element is an internal representation of "fizz";
If i is a multiple of 5 (but not of 3) then the i-1th element is an internal representation of "buzz";
If i is a multiple of 15 then the i-1th element is an internal representation of "fizzbuzz";
Otherwise, the i-1th element equals an internal representation of i.
Then, print out, separated by a newline, every element in the list, after converting them from their private representation to their expected string form.
Next time, we'll do this simple refactoring and then see where that leads us.
Conclusion: how's the cage match going?
Frankly, so far, it doesn't seem like things are looking good for Lean. By the end of the first Dafny post, we had an almost-verified Fizzbuzz implementation, whereas here it feels like we've just been building type-theoretic mechanisms to get off the ground. There's a saying about programming in types: it goes fast but feels slow. In the next section, we'll design a more "type-oriented" function, get it verified, and confirm whether that saying's actually true!