Bab: Bringing Verifiable Partial Transfers to Peer-to-Peer Storage

Bab is a family of hash functions built for the messy reality of peer-to-peer networks, where transfers stall, sources multiply, and not every peer is honest. By treating hashing as a storage problem rather than a cryptographic afterthought, it lets a system verify pieces of data before the whole thing has arrived.

Most cryptographic hash functions answer a single question: does this exact blob of bytes match this digest? That question assumes you already hold the entire blob. In peer-to-peer systems that assumption rarely holds. Connections drop halfway through a multi-gigabyte transfer, a file arrives in fragments from a dozen different sources, and some of those sources may be actively hostile, feeding you corrupted chunks to waste your bandwidth or poison your store. Bab is a family of hashing functions that reframes the problem from the ground up, asking instead: can I trust this piece of data before the rest of it shows up?

The project positions itself as a relative of Blake3 and Bao, and the lineage matters. Blake3 organizes its hashing into a binary tree of chunks rather than a linear chain, which is what makes parallel and incremental verification possible in the first place. Bao took that tree structure and built a verified streaming format on top of it, so a receiver could authenticate each chunk as it arrived rather than waiting for a final checksum over the whole file. Bab inherits this tree-based foundation but tunes it for a different set of priorities, and the tuning is where the interesting design work lives.

A storage perspective, not a hashing one

The central argument behind Bab is that APIs designed by cryptographers tend to expose the wrong surface for people building storage systems. A hashing library wants to take bytes and return a digest. A storage system wants to ask which ranges of a file it already has, which it still needs, and whether a freshly received range is genuine without re-reading everything around it. These are different shapes of question, and an API built for the first shape forces awkward contortions to answer the second.

By designing from a storage rather than a hashing perspective, Bab makes partial data a first-class concept. You can request a specific slice of a file, receive a proof alongside it, and verify that the slice belongs to the file identified by a known root hash, all without holding the neighboring bytes. This is what lets a client make progress through interrupted connections. If a download dies at sixty percent, the sixty percent already on disk is verifiable and reusable, and the client can resume from a precise offset rather than starting over or trusting unverified leftovers.

Constant-size length proofs

One of the more pointed technical choices is constant-size length proofs. In a tree-hashing scheme, proving how long a file is can be surprisingly subtle, because a naive structure lets an adversary lie about the total length or pad the end of a file with chunks you cannot distinguish from legitimate data. Bao addressed length extension concerns through its tree encoding, and Bab carries the concern further by keeping the proof of a file's length to a fixed size regardless of how large the file actually is.

The practical payoff is predictability. A peer can commit to the length of a dataset, and any other peer can verify that claim with a small, bounded amount of work and bandwidth. In a network where you might be talking to many sources about many files, bounded costs are what keep the bookkeeping from exploding. A proof whose size grows with the data it describes would undermine the whole point of fetching small pieces verifiably.

Parameterisable by design

Blake3 makes a deliberate set of fixed choices, a single chunk size and a single output length, and that rigidity is part of why it is fast and simple to implement correctly. Bab instead exposes a parameterisable specification. Chunk sizes and related structural decisions become knobs rather than constants. A system storing many tiny records has different needs from one streaming large media files, and a fixed chunk size that suits one will be wasteful for the other. Too small a chunk size means a proportionally larger proof tree and more per-chunk overhead; too large a chunk size means coarse granularity, so a single corrupted byte invalidates a big span of data and forces a larger re-fetch.

Parameterisation lets an implementer pick the trade-off that fits the workload instead of accepting someone else's compromise. The cost is that interoperating systems must agree on parameters, which pushes complexity into specification and negotiation, but for a protocol aiming to serve varied storage patterns that flexibility is the point.

Optimised for repeating substrings

The optimisation for repeating substrings is the detail that most clearly signals storage thinking. Real datasets are full of redundancy: duplicated files, shared headers, blocks of zeros, near-identical versions of the same document. A hash function indifferent to this redundancy recomputes everything from scratch. Bab is built to recognize and exploit repeated content so that work done on one occurrence of a substring can inform another. In a content-addressed store, where deduplication is often the entire economic argument, computation that collapses repeated structure is not a minor efficiency, it shapes how cheaply the system scales.

Designed for Willow

Bab does not exist in a vacuum; it was designed for Willow, a protocol for peer-to-peer data stores that organizes data into namespaces and addresses entries by paths and timestamps. Willow needs to sync partial views of a store between peers who each hold different subsets, who may come and go unpredictably, and who cannot assume good faith from everyone they talk to. Those requirements explain almost every choice in Bab: verifiable partial transfers for peers holding fragments, malicious-transfer detection for an open network, concurrent multi-source downloads for resilience, and bounded proofs for predictable cost.

The broader pattern here is worth sitting with. The first generation of content addressing treated a hash as an identity, a name you could use to ask for a thing. What systems like Willow and tools like Bab are reaching toward is content addressing as a transport discipline, where the same cryptographic structure that names data also governs how it moves, in what order, from whom, and with what guarantees at every intermediate step. The hash stops being a label checked at the door and becomes the grammar of the conversation itself.

There are reasons for caution. A parameterisable, storage-oriented hash family is inherently more complex than a fixed function, and complexity is where cryptographic mistakes hide. The value of any verification scheme rests entirely on the correctness of its proof construction, and a subtle flaw in how partial proofs compose could let an adversary smuggle through exactly the corrupted data the system claims to reject. The tight coupling to Willow is also double-edged: it gives Bab a coherent purpose and a real consumer, but a hash function that travels well beyond its origin protocol earns scrutiny that a single-application design may not yet have received. Those are the questions worth watching as the specification matures and independent implementations appear, because a verification primitive is only as trustworthy as the breadth of attention it has survived.

#hash-functions #Peer-to-Peer #Cryptography #distributed storage #content-addressing