Database branching and the art of gaslighting Postgres into cheap checkpoints

Databricks Lakebase leans on a trick that storage engineers have used for years: convince a database that its data files are immutable and the whole problem of fast checkpoints turns into a pointer copy. The interesting part is what that costs you on the consistency and operations side.

Most databases treat a checkpoint as a heavy, physical event. You flush dirty pages, fsync, write a record into the write-ahead log, and accept that the storage cost of keeping an old snapshot around is roughly the size of the data you snapshotted. That model is fine when checkpoints are rare. It falls apart the moment you want them to be cheap enough to create on every CI run, every schema migration, or every developer who wants a throwaway copy of production.

Databricks Lakebase, a Postgres-compatible operational database built around fast branching with separated compute and storage, is the latest system to take the opposite position. The product page frames branching as a first-class operation. The mechanism underneath is older than the marketing, and it is worth understanding because it shows up in Neon, Aurora, and every copy-on-write filesystem you have ever used. The short version: you stop letting Postgres believe it owns its storage, and you start lying to it about what is durable and what is shared.

The problem: checkpoints are expensive because storage is mutable

Postgres, like most relational engines, writes pages in place. A row update finds the 8KB page that holds the tuple, modifies it in the buffer cache, and eventually flushes it back to the same location on disk. The write-ahead log exists precisely because in-place mutation is dangerous. If you crash before a dirty page reaches disk, or halfway through writing one, the WAL lets you replay logged changes forward to a consistent state. Postgres WAL is redo-only; there is no separate undo log.
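The in-place model can be sketched in a few lines. This is a toy under loud assumptions: `MutableStore` and its methods are hypothetical names, the log stores whole page contents rather than real WAL records, and recovery replays the entire log instead of starting from the last checkpoint as Postgres does.

```python
class MutableStore:
    """Toy in-place store: disk pages are overwritten; the WAL is appended first."""

    def __init__(self):
        self.disk = {}    # page number -> contents that made it to disk
        self.cache = {}   # dirty pages in the buffer cache, lost on crash
        self.wal = []     # durable, append-only list of (lsn, page_no, contents)

    def update(self, page_no, contents):
        lsn = len(self.wal) + 1
        self.wal.append((lsn, page_no, contents))  # log first
        self.cache[page_no] = contents             # then modify in memory
        return lsn

    def flush(self, page_no):
        # Overwrite the page at its existing location.
        self.disk[page_no] = self.cache.pop(page_no)

    def crash_and_recover(self):
        self.cache.clear()  # everything in memory is gone
        for _, page_no, contents in self.wal:
            self.disk[page_no] = contents  # redo each logged change in order


s = MutableStore()
s.update(1, "a")
s.flush(1)
s.update(1, "b")   # never flushed
s.update(2, "c")   # never flushed
s.crash_and_recover()
# s.disk now holds the logged state for both pages
```

The old contents of page 1 are gone once recovery finishes, which is the property the rest of this article is about: nothing in this design keeps history unless you copy it first.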

This design makes a snapshot inherently a copy. If the bytes on disk are going to change in place, the only way to preserve a point in time is to duplicate the bytes before they change. pg_basebackup does exactly this, and the cost scales with database size. A 500GB database produces a 500GB snapshot, and creating ten of them costs you five terabytes. Nobody branches a 500GB database ten times a day under that model.

The consistency story is also rigid. A physical backup captures a single timeline. You can do point-in-time recovery by replaying WAL forward from a base backup, but you cannot cheaply fork the timeline, run a destructive migration on one fork, and keep the other fork live. Postgres assumes there is one authoritative copy of each page.

The solution: separate compute from storage and make pages immutable

The trick that Lakebase, Neon, and Aurora share is to pull the storage layer out from under Postgres and replace it with a service that never overwrites anything. Instead of mutating page 4096 in place, the storage layer treats each version of a page as a new, immutable record keyed by page number and the log sequence number (LSN) that produced it. The WAL stops being a recovery aid bolted onto a mutable heap and becomes the actual source of truth. Pages are just materialized views of the log.
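A minimal sketch of that addressing scheme, assuming an in-memory store. `PageStore`, `put`, and `get` are hypothetical names for illustration, not the API of Lakebase, Neon, or Aurora, and real systems store WAL records and layered files rather than whole page images per version.

```python
import bisect
from collections import defaultdict


class PageStore:
    """Append-only store: every write adds a new (LSN, image) version of a page."""

    def __init__(self):
        # page number -> parallel lists of LSNs (ascending) and page images
        self._lsns = defaultdict(list)
        self._images = defaultdict(list)

    def put(self, page_no, lsn, image):
        lsns = self._lsns[page_no]
        if lsns and lsn <= lsns[-1]:
            raise ValueError("LSNs must increase; existing versions are never overwritten")
        lsns.append(lsn)
        self._images[page_no].append(image)

    def get(self, page_no, as_of_lsn):
        """Return the newest version written at or before as_of_lsn, or None."""
        lsns = self._lsns.get(page_no, [])
        idx = bisect.bisect_right(lsns, as_of_lsn)
        return self._images[page_no][idx - 1] if idx else None


store = PageStore()
store.put(4096, lsn=100, image=b"v1")
store.put(4096, lsn=250, image=b"v2")

old = store.get(4096, as_of_lsn=200)   # the version written at LSN 100
new = store.get(4096, as_of_lsn=250)   # the version written at LSN 250
```

Both versions stay readable forever, which is what makes "the page as of LSN X" a well-defined question.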

Once pages are immutable and addressed by LSN, a branch is almost free. A branch is a new pointer that says "start from LSN 9000 on the parent, and from here forward, my writes go to a private segment of the log." Reads on the branch resolve a page by walking backward: is there a version of this page written on the branch since the branch point? If yes, use it. If no, fall through to the parent's history as of the branch point. This is copy-on-write applied to database pages instead of filesystem blocks.

This is where the gaslighting happens, and it is a precise kind of deception. The Postgres process running on a branch genuinely believes it is talking to a normal, exclusive storage volume. It issues the same read and write calls it always has. The storage layer underneath presents a coherent view assembled from shared immutable parent pages plus the branch's private deltas, and Postgres never learns that most of what it reads is bytes it does not own and may be sharing with any number of sibling branches. The engine's mental model of "my files, my pages, my durable state" is intact and entirely false. That gap between what Postgres believes about its storage and what is physically true is the whole feature.

Checkpoints get the same treatment. A checkpoint becomes a named LSN rather than a flushed copy of the heap. Creating one is recording a number. Restoring to one is pointing a new compute instance at that LSN and letting the storage layer serve the right page versions.
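As a rough illustration of how little a checkpoint has to do in this model, here is a sketch in which it is just a dictionary entry. `LogPosition`, `checkpoint`, and `restore_lsn` are hypothetical names; a real service would also have to retain the history the checkpoint refers to.

```python
class LogPosition:
    """Hypothetical tracker for a timeline's log position and its named checkpoints."""

    def __init__(self):
        self.current_lsn = 0
        self.checkpoints = {}  # checkpoint name -> LSN

    def append(self, n_records=1):
        # Stand-in for writing WAL: only the position advances in this sketch.
        self.current_lsn += n_records
        return self.current_lsn

    def checkpoint(self, name):
        # Creating a checkpoint records a number; no page is flushed or copied.
        self.checkpoints[name] = self.current_lsn
        return self.current_lsn

    def restore_lsn(self, name):
        # A new compute instance would be pointed at this LSN.
        return self.checkpoints[name]


log = LogPosition()
log.append(9000)
log.checkpoint("before-migration")
log.append(500)                      # the timeline keeps moving
target = log.restore_lsn("before-migration")
```

The cost of `checkpoint` is independent of database size, in contrast to a physical copy.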

How a read actually resolves

It helps to trace a single read. Suppose you branch prod at LSN 9000 to create migration-test, then run an ALTER TABLE that rewrites a table on the branch. A query on migration-test asks for page 50000.

The storage layer checks the branch's private write set for a version of page 50000 written at or before the current branch LSN. The migration touched that page, so a branch-local version exists and is returned. Now ask for page 12, which the migration never touched. No branch-local version exists, so the lookup falls through to prod's history and returns the version that was current at LSN 9000, the branch point. The branch sees a consistent snapshot: its own changes layered over a frozen view of the parent.

The parent, meanwhile, keeps writing. prod advances to LSN 15000 with its own new page versions. Because those live in prod's forward history past LSN 9000, the branch never sees them. Two timelines, one shared base, no copying. The cost of the branch is proportional to what you changed on it, not to the size of the database.
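The trace above can be reproduced with a small model. This is a sketch under assumptions: `Timeline` is a hypothetical class, page images are plain strings, the specific LSNs other than 9000 and 15000 are made up for the example, and branch LSNs are assumed to continue upward from the branch point.

```python
import bisect


class Timeline:
    """A timeline owns its private page versions and may fork from a parent at an LSN."""

    def __init__(self, name, parent=None, branch_lsn=None):
        self.name = name
        self.parent = parent
        self.branch_lsn = branch_lsn
        self.versions = {}  # page number -> list of (lsn, image), ascending by LSN

    def write(self, page_no, lsn, image):
        self.versions.setdefault(page_no, []).append((lsn, image))

    def read(self, page_no, as_of_lsn):
        history = self.versions.get(page_no, [])
        idx = bisect.bisect_right([lsn for lsn, _ in history], as_of_lsn)
        if idx:
            return history[idx - 1][1]  # newest private version at or before as_of_lsn
        if self.parent is not None:
            # Fall through to the parent, frozen at the branch point.
            return self.parent.read(page_no, self.branch_lsn)
        return None

    def private_version_count(self):
        return sum(len(h) for h in self.versions.values())


prod = Timeline("prod")
prod.write(12, 4000, "page 12 @ prod 4000")
prod.write(50000, 7000, "page 50000 @ prod 7000")

branch = Timeline("migration-test", parent=prod, branch_lsn=9000)
branch.write(50000, 9100, "page 50000 rewritten on branch")

prod.write(12, 15000, "page 12 @ prod 15000")  # the parent keeps writing

touched = branch.read(50000, as_of_lsn=9500)    # branch-local version
untouched = branch.read(12, as_of_lsn=9500)     # parent's version as of LSN 9000
parent_view = prod.read(12, as_of_lsn=15000)    # prod sees its own newer version
```

`branch.private_version_count()` is 1 here no matter how many pages prod holds, which is the sense in which branch cost tracks what changed rather than database size.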

Trade-offs, because there are always trade-offs

The first cost is read amplification on long history chains. If a page has not been rewritten in a long time, resolving its current version may mean reconstructing it from a base image plus a run of WAL records. Storage layers fight this by periodically materializing fresh page images, the equivalent of compaction. Get the compaction cadence wrong and read latency on cold pages becomes unpredictable, which is exactly the kind of tail-latency problem that does not show up in a demo and does show up at 2 a.m.
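A toy model makes the amplification visible. Everything here is an assumption for illustration: a page is a dict, a WAL record is a `(lsn, key, value)` tuple, and `reconstruct` is a hypothetical helper, not how any particular storage service represents pages.

```python
def reconstruct(base_image, wal_records, as_of_lsn):
    """Apply WAL records (lsn, key, value) in LSN order onto a copy of the base image.

    Returns (page, records_applied); records_applied stands in for read cost.
    """
    page = dict(base_image)
    applied = 0
    for lsn, key, value in wal_records:
        if lsn > as_of_lsn:
            break
        page[key] = value
        applied += 1
    return page, applied


base = {"a": 0}
wal = [(lsn, "a", lsn) for lsn in range(1, 1001)]  # 1000 small updates to one page

# Cold read: replay the whole chain from the old base image.
page_cold, cost_cold = reconstruct(base, wal, as_of_lsn=1000)

# Materialize a fresh image at LSN 900, then replay only the tail.
image_900, _ = reconstruct(base, wal, as_of_lsn=900)
tail = [record for record in wal if record[0] > 900]
page_warm, cost_warm = reconstruct(image_900, tail, as_of_lsn=1000)
```

Both reads return the same page, but the second applies a tenth of the records. How often to pay for materialization is the compaction cadence decision described above.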

The second cost is that you have moved durability into a distributed system, and distributed systems fail in more interesting ways than a local disk. When the WAL is the source of truth and it is replicated across nodes for durability, a commit is not acknowledged until the log record is safely stored on enough replicas. That introduces network round-trips into the commit path. Architectures hide this with same-zone replicas and fast interconnects, but the physics do not disappear. You traded the fsync latency of a local SSD for the consensus latency of a replicated log, and under a partition the system has to choose between blocking writes and risking a durability gap. Read the fine print on what the storage layer promises during a partition.
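One way to reason about the commit path is to treat acknowledgement latency as the latency of the slowest replica in the fastest quorum. The sketch below assumes parallel sends and a simple count-based quorum; the round-trip figures are invented for the example and `commit_latency_ms` is a hypothetical helper.

```python
def commit_latency_ms(replica_rtts_ms, quorum):
    """Time until `quorum` replicas have acknowledged, assuming parallel sends."""
    if not 1 <= quorum <= len(replica_rtts_ms):
        raise ValueError("quorum must be between 1 and the number of replicas")
    # The commit is acknowledged when the quorum-th fastest replica responds.
    return sorted(replica_rtts_ms)[quorum - 1]


rtts = [0.4, 0.6, 2.5]  # hypothetical round-trip times in milliseconds
healthy = commit_latency_ms(rtts, quorum=2)

# Under a partition, model unreachable replicas as infinite latency.
partitioned = [0.4, float("inf"), float("inf")]
blocked = commit_latency_ms(partitioned, quorum=2)
```

With two of three replicas unreachable, the quorum latency is infinite: the write blocks. Lowering `quorum` to 1 would unblock it at the price of the durability gap mentioned above.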

The third cost is the consistency model around branches themselves. A branch is a snapshot at an LSN, so it is internally consistent. But branches do not stay in sync with their parent, and there is usually no merge operation. You cannot reconcile diverged page histories the way Git reconciles text, because two branches that both rewrote page 50000 have no semantic way to combine those bytes. Branching is cheap; merging is undefined. If your mental model is Git, that model breaks the moment you want to integrate changes back.
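Detecting the conflict is the easy half, as the following sketch suggests. It assumes each branch can report its private writes as a mapping from page number to bytes; `pages_rewritten_on_both` is a hypothetical helper and the page contents are placeholders.

```python
def pages_rewritten_on_both(branch_a_writes, branch_b_writes):
    """Return page numbers that both branches rewrote after diverging."""
    return sorted(set(branch_a_writes) & set(branch_b_writes))


branch_a = {50000: b"bytes written on branch A", 12: b"row added on A"}
branch_b = {50000: b"bytes written on branch B", 77: b"row added on B"}

conflicts = pages_rewritten_on_both(branch_a, branch_b)
```

The function can say that page 50000 diverged, but nothing at the page level says which bytes should win, so integrating changes has to happen above the storage layer, for example by re-running a migration against the parent.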

The fourth cost is operational coupling. Garbage collection now has to reason about which page versions are still reachable from some branch or retained checkpoint. A page version written at LSN 9000 cannot be reclaimed while any branch or checkpoint still reads from a point at or after LSN 9000 and before the next version of that page. Long-lived branches pin history and quietly inflate storage, the same way a forgotten Git branch keeps old objects alive. Teams that create branches liberally and never delete them rediscover that immutable storage is cheap to write and expensive to forget.
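The reachability rule can be sketched for a single page. The assumption here is that a version is pinned by any read point at or after its LSN and before the next version of the same page; `reclaimable_versions`, `pinned_lsns`, and `gc_horizon` are hypothetical names, and real collectors work on files or layers rather than individual versions.

```python
import bisect


def reclaimable_versions(version_lsns, pinned_lsns, gc_horizon):
    """Return version LSNs of one page that no reader can still reach.

    version_lsns: ascending LSNs at which the page was rewritten.
    pinned_lsns: branch points and retained checkpoints.
    gc_horizon: oldest LSN the live timeline may still read at.
    """
    needed = set()
    for read_lsn in list(pinned_lsns) + [gc_horizon]:
        idx = bisect.bisect_right(version_lsns, read_lsn)
        if idx:
            # The newest version at or before this read point must be kept.
            needed.add(version_lsns[idx - 1])
    # Versions at or past the horizon are live history and are never reclaimed here.
    return [v for v in version_lsns if v not in needed and v < gc_horizon]


versions = [3000, 9000, 12000, 14000]

with_branch = reclaimable_versions(versions, pinned_lsns=[9500], gc_horizon=15000)
older_branch = reclaimable_versions(versions, pinned_lsns=[8000], gc_horizon=15000)
no_branches = reclaimable_versions(versions, pinned_lsns=[], gc_horizon=15000)
```

A branch at LSN 9500 keeps the LSN 9000 version alive; a branch at LSN 8000 pins the LSN 3000 version instead. Deleting the branch frees whatever it was pinning.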

Where this fits

The pattern generalizes well beyond one product. Copy-on-write storage is how you get cheap snapshots in ZFS and instant clones in modern SAN arrays; combined with log-structured durability, it is also how you get database branching in Neon and fast clones in Aurora. The Postgres-compatibility angle matters because it means your application code, drivers, and SQL do not change. You get the branching semantics from the storage layer while the engine keeps speaking the protocol your ORM already knows.

The useful framing for an engineer evaluating this is to stop thinking of a checkpoint as a copy and start thinking of it as a name for a position in an append-only log. Once data is immutable and addressed by version, snapshots, branches, and point-in-time recovery collapse into the same primitive. The hard problems move out of the engine, where Postgres has been happily fooled, and into the storage service, where read amplification, replicated durability, and garbage collection are now your real operational surface. That is not a worse place for the problems to live. It is just a different one, and you should know it is where the bodies are buried before you bet a production system on it.