An introduction to XET, Hugging Face's storage system (part 1)
#Infrastructure

Tech Essays Reporter
4 min read

XET is a content-addressable storage protocol that solves the inefficiency of storing large, versioned files by deduplicating at the chunk level rather than the file level, offering significant storage savings for datasets, models, and other binary artifacts.

{{IMAGE:1}}

Version control systems like Git are brilliant at tracking the evolution of source code. They allow us to clone, update, and share the complete history of a project with remarkable efficiency. The core mechanism is elegant: Git stores objects indexed by a hash of their content. Files with identical content, regardless of their name or version, map to the same object. This content-addressable approach, combined with delta compression for related blobs, works beautifully for text-based source code where changes are often incremental and localized.
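
To make the content-addressed mechanism concrete, here is a minimal Python sketch of a store keyed by content hash. It is not Git's actual object format (Git also prepends a type-and-size header and compresses the blob); the point is only that identical content collapses to a single stored entry.

    import hashlib

    def store_object(objects: dict, content: bytes) -> str:
        # Key each object by the hash of its bytes; identical content maps to one key.
        key = hashlib.sha256(content).hexdigest()
        objects[key] = content
        return key

    objects = {}
    k1 = store_object(objects, b"print('hello')\n")  # one version of a file
    k2 = store_object(objects, b"print('hello')\n")  # same content, different name or commit
    assert k1 == k2 and len(objects) == 1             # stored exactly once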

However, this model breaks down for large binary files. Even minor modifications to a model checkpoint or a dataset can result in a delta that is nearly as large as the entire file. The consequence is that preserving history for binaries often requires storing and transferring a new large blob for each change. This is the storage problem that Git LFS attempts to mitigate by replacing large files with tiny pointer files, storing the actual content in a separate object store. While Git LFS solves the client-side storage issue by allowing selective checkout of specific versions, it does not address the server-side redundancy. If a file changes, an entire new file must still be uploaded and stored, even when the vast majority of its bytes are identical to a previous version. This inefficiency becomes particularly acute in environments like the Hugging Face Hub, where models, datasets, and artifacts are frequently forked and versioned. A small edit to a large file can trigger the upload and storage of a completely new object, wasting bandwidth and storage space.
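
For reference, the pointer file that Git LFS commits in place of the real content is just a few lines of text, roughly of the form below (the hash and size here are placeholders); the full binary lives in the separate object store and is uploaded again in its entirety whenever it changes.

    version https://git-lfs.github.com/spec/v1
    oid sha256:<64-hex-digit hash of the file's contents>
    size <file size in bytes>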

To solve this problem, XET introduces a more granular approach: chunk-level deduplication. Instead of treating files as monolithic blobs, XET intelligently splits them into smaller, variable-sized chunks. When a file is modified, only the chunks that have actually changed need to be stored anew. Consider two versions of a file: v1: ABCDEXGHIJKLMNOPQRSTUVWXYZ and v2: ABCDEXGH42IJKLMNOPQRSTUVWXYZ. At the file level, these are entirely different objects. But by splitting them into chunks—say, ABCDEXGH and IJKLMNOPQRSTUVWXYZ for the first, and ABCDEXGH, 42, and IJKLMNOPQRSTUVWXYZ for the second—we see that two chunks are identical. The system can store these shared chunks only once, leading to substantial server-side storage savings.
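
A toy sketch of that bookkeeping in Python (the chunk boundaries are chosen by hand here; the real protocol derives them automatically with content-defined chunking, discussed later):

    import hashlib

    def add_file(chunk_store: dict, chunks: list) -> list:
        # Store each chunk under its content hash; a chunk already present costs nothing.
        refs = []
        for chunk in chunks:
            key = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(key, chunk)
            refs.append(key)
        return refs

    chunk_store = {}
    v1 = add_file(chunk_store, [b"ABCDEXGH", b"IJKLMNOPQRSTUVWXYZ"])
    v2 = add_file(chunk_store, [b"ABCDEXGH", b"42", b"IJKLMNOPQRSTUVWXYZ"])
    # Five chunk references across the two versions, but only three unique chunks stored.
    assert len(v1) + len(v2) == 5 and len(chunk_store) == 3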

Reconstructing a file from these chunks presents a new challenge. The server must send the client a plan: an ordered set of chunk ranges that, when reassembled, produce the final file. To avoid the inefficiency of thousands of HTTP requests for a single file (a 1 GiB file split into 64 KiB chunks could require more than 16,000 requests), XET groups chunks into containers called xorbs. These are large objects, capped at 64 MiB, that aggregate many compressed chunks. A typical xorb might contain around 1,024 chunks. Within a xorb, chunks are stored in a single recorded order, which is often the ingest order but is not strictly required to match the original file order.
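
As a rough illustration of the grouping step, the greedy packing below caps each xorb at 64 MiB; the real format also writes a small header per chunk and compresses each one before packing, which this sketch omits.

    XORB_CAP = 64 * 1024 * 1024  # 64 MiB cap per xorb, as described above

    def pack_into_xorbs(chunks: list) -> list:
        # Greedily append chunks to the current xorb; start a new one when the cap is hit.
        xorbs, current, size = [], [], 0
        for chunk in chunks:
            if current and size + len(chunk) > XORB_CAP:
                xorbs.append(current)
                current, size = [], 0
            current.append(chunk)
            size += len(chunk)
        if current:
            xorbs.append(current)
        return xorbs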

The reconstruction plan is therefore not a simple list of chunks. It is a list of terms, each specifying a xorb hash and a chunk-index range, along with per-xorb download instructions that map those ranges to exact byte ranges. The client can then use the HTTP Range header to download the specific byte ranges from the different xorbs it needs, parse the chunk headers, decompress the data, and concatenate everything to reconstruct the original file. Integrity is verified by recomputing a Merkle-style hash over the chunk hashes and comparing it to the expected file hash. Since chunks, xorbs, and files are all content-addressed by hash, each level of the system can be independently verified.
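
On the client, the reconstruction loop might look roughly like the sketch below. The plan structure and field names are hypothetical, and a flat SHA-256 of the reassembled bytes stands in for the Merkle-style file hash described above; what matters is the one Range request per term, followed by decoding, concatenation, and verification.

    import hashlib
    import urllib.request

    def fetch_byte_range(url: str, start: int, end: int) -> bytes:
        # One HTTP Range request for the slice of the xorb this term needs (inclusive bounds).
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def reconstruct(terms: list, expected_hex: str) -> bytes:
        # Hypothetical term: {"xorb_url": ..., "start": ..., "end": ..., "decode": fn},
        # where decode parses the chunk headers and decompresses the payload.
        parts = []
        for term in terms:
            raw = fetch_byte_range(term["xorb_url"], term["start"], term["end"])
            parts.append(term["decode"](raw))
        data = b"".join(parts)
        if hashlib.sha256(data).hexdigest() != expected_hex:
            raise ValueError("reconstructed file failed its integrity check")
        return data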

The implications of this design are significant. XET is not merely a storage optimization for the Hugging Face Hub; it is a general-purpose content-addressable storage protocol with broad applicability. Any system that stores large, versioned objects with high redundancy can benefit. OCI image registries, for instance, store a vast amount of redundant data between image layers and could see dramatic storage and transfer savings. The protocol's use of content-defined chunking creates variable-sized blocks, which allows for deduplication even when bytes are inserted or deleted—a limitation of systems using fixed-size blocks. Its clean design, public specification, and open-source implementations make it a compelling candidate for standardization.
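
To see why content-defined boundaries matter, consider the toy chunker below: it cuts wherever a byte value matches a bit mask, so boundaries depend only on nearby content and re-synchronise shortly after an edit. Real implementations decide boundaries with a rolling hash over a sliding window; this single-byte mask rule is only a stand-in.

    def cdc_chunks(data: bytes, mask: int = 0x3F) -> list:
        # Cut after any byte whose masked bits are all set (toy boundary rule).
        chunks, start = [], 0
        for i, b in enumerate(data):
            if (b & mask) == mask:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    original = bytes(range(256)) * 8
    edited = original[:300] + b"PATCH" + original[300:]
    # Only the chunk containing the insertion differs; with fixed-size blocks,
    # every block after byte 300 would shift and have to be stored again.
    new_chunks = [c for c in cdc_chunks(edited) if c not in set(cdc_chunks(original))]
    assert len(new_chunks) == 1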

This high-level overview only scratches the surface. The true sophistication of XET lies in its details: the algorithms for content-defined chunking, the compression strategies, the mechanisms for maximizing bandwidth usage, and the design choices that ensure integrity and prevent unauthorized access. These topics will be explored in the second part of this introduction. For those interested in the technical specifics, Hugging Face's official documentation provides a comprehensive resource, and a preliminary XET draft document outlines the protocol in detail. A simple Python implementation is also available for hands-on exploration.

In essence, XET represents a thoughtful evolution of storage systems, moving beyond file-level deduplication to address the inherent redundancy in large, versioned binary data. By leveraging chunking, content-addressing, and efficient HTTP range requests, it offers a scalable solution that could reshape how we store and transfer the massive artifacts that power modern machine learning and software development.
