Inside TernFS: How XTX Markets Built a Petabyte-Scale Filesystem for Machine Learning at the Edge
In the high-stakes world of algorithmic trading, where microseconds dictate profits, XTX Markets faced a storage crisis. As their machine learning models expanded—driving forecasts for 50,000+ financial instruments—their infrastructure ballooned from "a couple of desktops and an NFS server" to tens of thousands of GPUs and CPUs. By 2022, commercial and open-source filesystems buckled under hundreds of petabytes of data. The solution? TernFS: a bespoke distributed filesystem now open-sourced on GitHub, handling 500PB across three data centers with zero data loss since deployment.
Why Existing Filesystems Failed
XTX's journey mirrors that of tech giants like Google and Meta: at sufficient scale, storage demands custom solutions. NFS faltered first, followed by off-the-shelf alternatives, as random-access I/O from GPU clusters overwhelmed metadata servers. TernFS emerged to unify "cold" market data storage with hot, ephemeral compute workloads. As noted in their technical deep dive, "Every major tech company builds its own filesystem because disruption here is catastrophic."
Architecture: Four Pillars of Scale
TernFS decomposes into stateless, independently scalable services:
Metadata Shards (256 logical groups)
- Directories are assigned to shards round-robin; each shard runs 5 replicas (1 leader, 4 followers) using a Raft-like consensus engine (LogsDB) with RocksDB for storage.
- Handles 100K+ compute nodes with just 10 metadata servers per datacenter. Reads and writes currently go to leaders, but serving reads from followers could multiply read throughput up to 100× (see the routing sketch below).
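The upshot of a fixed shard count is cheap routing. A minimal Go sketch (the ID layout is our assumption for illustration, not TernFS's actual format): if each directory ID encodes its shard at creation, any client can route requests without a lookup table.

const numShards = 256

type DirID uint64

// shardOf derives the owning metadata shard from the directory ID.
// Hypothetical scheme: the shard index is fixed at creation
// (round-robin) and recoverable from the ID's low byte, so routing
// is a pure function with no directory-to-shard lookup on the hot path.
func shardOf(id DirID) uint8 {
	return uint8(id % numShards)
}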
Cross-Directory Coordinator (CDC)
- Manages distributed transactions (e.g., directory moves) across shards via a privileged API. It can bottleneck write-heavy cross-directory operations, but it guarantees atomicity (see the two-phase sketch below).
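Why a coordinator at all? A move between directories on two different shards can't be applied by either shard alone. A hedged two-phase sketch (all names here are hypothetical; the source only says the CDC drives these transactions through a privileged API):

package main

import (
	"errors"
	"fmt"
)

// prepareFn stands in for a shard's privileged "prepare" call: it
// validates and reserves a change without making it visible.
type prepareFn func() error

// atomically applies a cross-shard transaction CDC-style: every
// shard must prepare before anything commits, so readers never see
// a half-applied directory move.
func atomically(prepares []prepareFn, commit func()) error {
	for _, prepare := range prepares {
		if err := prepare(); err != nil {
			return err // nothing is visible yet, so aborting is safe
		}
	}
	commit()
	return nil
}

func main() {
	err := atomically([]prepareFn{
		func() error { return nil }, // source shard accepts
		func() error { return errors.New("destination shard busy") },
	}, func() { fmt.Println("move applied on both shards") })
	fmt.Println(err) // destination refused, so neither shard changed
}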
Block Services
- Each drive (HDD or SSD) acts as an independent service. Files are split into "spans" of at most 100MB, and each span into D data + P parity blocks (configurable per directory). A Go-based TCP API writes directly to the local filesystem, proving that "idiomatic Go is performant enough" (see the Reed-Solomon sketch below).
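To make the D+P scheme concrete, here is a sketch using the third-party github.com/klauspost/reedsolomon library (TernFS ships its own encoder; this only illustrates the math): with D=10 and P=4, any 10 of the 14 blocks reconstruct the span.

package main

import (
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(10, 4) // D=10 data blocks, P=4 parity blocks
	if err != nil {
		panic(err)
	}
	span := make([]byte, 100<<20) // one span of file data (≤100MB)
	shards, err := enc.Split(span) // 10 data shards + 4 empty parity shards
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil { // fill in the parity shards
		panic(err)
	}
	// Simulate losing any 4 drives, then rebuild from the survivors.
	shards[0], shards[3], shards[11], shards[13] = nil, nil, nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	fmt.Println("span recovered from 10 of 14 blocks")
}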
Registry
- Service discovery hub storing service locations (IPv4-only, to keep the kernel module simple) and drive states. Clients need only this endpoint to mount TernFS.
// Simplified block write flow in TernFS. The client API names below
// are illustrative, not the real interface: resolve the metadata
// shard that owns the directory, have it pick D+P block services,
// then stream checksummed blocks to them.
shard := registry.GetShardForDir("/ml/data")
blockServices := shard.AssignBlockServices(D + P) // D=10, P=4 typical
client.WriteBlocks(blockServices, checksummedData)
Breaking POSIX for Performance
TernFS’s kernel module—not FUSE—delivers near-native speed but defies POSIX: files are immutable. Once written, contents can’t be modified, only replaced. This simplifies consistency but breaks apps that edit files in-place. XTX’s workaround? Temp-file staging. "Programs writing left-to-right just work," they note, while others adapt. For broader access, an S3 gateway (partially open-sourced) bridges multi-tenant needs.
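The staging pattern itself is plain POSIX, nothing TernFS-specific. A minimal sketch: write the new contents beside the original, then atomically swap via rename.

import (
	"os"
	"path/filepath"
)

// replaceFile emulates an in-place edit on an immutable filesystem:
// stage the new contents in a temp file in the same directory, then
// atomically replace the original in one rename.
func replaceFile(path string, newContents []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".staging-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds
	if _, err := tmp.Write(newContents); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // the "replace, never modify" step
}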
Resilience: No Byte Left Behind
- Checksumming: 4KiB data pages interleaved with CRC32-C checksums, enabling partial reads and scrubbing (see the sketch after this list).
- Reed-Solomon Coding: Default D=10/P=4 encoding allows any 4 drive failures per file without data loss.
- Failure Domains: Blocks distributed across servers to mitigate correlated outages.
- Asynchronous Multi-Region Replication: One "primary" datacenter handles metadata writes; file contents replicate proactively or on-demand. "Losing a whole DC is rare, but we tolerate replication lag," admits XTX.
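A sketch of the page-level layout (the exact on-disk format is ours for illustration, not TernFS's): each 4KiB page is followed by its CRC32-C, so a reader can fetch and verify any page range without touching the rest of the block.

import (
	"encoding/binary"
	"hash/crc32"
)

const pageSize = 4096 // 4KiB data pages

var castagnoli = crc32.MakeTable(crc32.Castagnoli) // the CRC32-C polynomial

// interleave appends a 4-byte CRC32-C after every page. Partial reads
// verify only the pages they touch; the scrubber verifies them all.
func interleave(data []byte) []byte {
	out := make([]byte, 0, len(data)+4*(len(data)/pageSize+1))
	for off := 0; off < len(data); off += pageSize {
		end := off + pageSize
		if end > len(data) {
			end = len(data) // final page may be short
		}
		page := data[off:end]
		out = append(out, page...)
		out = binary.LittleEndian.AppendUint32(out, crc32.Checksum(page, castagnoli))
	}
	return out
}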
"Buggy clients are the real threat. Block proofs force idempotency—each write/delete gets a cryptographic signature. We’ve never lost data, but paranoia pays off." — XTX Engineering
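The article doesn't spell out the proof construction, but a keyed MAC captures the idea (an assumption, sketched below): the metadata shard authorizes exactly one operation on one block, and the block service verifies the proof before acting, so a confused client can't improvise destructive calls on its own.

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
)

// blockProof is a hypothetical construction: a MAC over the block ID
// and the one operation being authorized, under a key shared between
// the metadata shard and the block service.
func blockProof(key []byte, blockID uint64, op string) []byte {
	mac := hmac.New(sha256.New, key)
	binary.Write(mac, binary.LittleEndian, blockID) // writes to a hash never fail
	mac.Write([]byte(op))
	return mac.Sum(nil)
}

// The block service recomputes and compares before touching the disk;
// a client without a fresh proof from the shard gets rejected.
func verifyProof(key []byte, blockID uint64, op string, proof []byte) bool {
	return hmac.Equal(blockProof(key, blockID, op), proof)
}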
Why No Copysets? A Calculated Risk
While copysets reduce data-loss probability, TernFS avoids them. Random block placement speeds evacuation: a 20TB drive migrates in "minutes." With triple-DC replication, "thousands of drives failing at once" would be needed for permanent loss. The trade-off? Simplicity and agility.
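The back-of-the-envelope version (all numbers below are our assumptions, not XTX's fleet stats): random placement turns one drive's rebuild into a cluster-wide parallel job.

package main

import "fmt"

func main() {
	const (
		driveTB       = 20.0    // capacity of the failed drive
		fleetDrives   = 10000.0 // drives sharing its blocks (assumed)
		mbPerSecDrive = 100.0   // sustained per-drive write rate (assumed)
	)
	// Random placement spreads the dead drive's blocks roughly evenly,
	// so each surviving drive rebuilds only its own small slice.
	perDriveMB := driveTB * 1e6 / fleetDrives // 2,000 MB per drive
	seconds := perDriveMB / mbPerSecDrive     // ~20s of parallel I/O
	fmt.Printf("each drive rebuilds %.0f MB in ~%.0f s\n", perDriveMB, seconds)
}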
The Human Factor: Snapshots and Scrubbing
- Lightweight Snapshots: rm doesn't delete data; it creates "weak references" recoverable via API. A garbage collector purges expired snapshots.
- Scrubbing: Continuously reads all blocks to catch bitrot, crucial for "cold" but critical data like raw market feeds (see the scrub sketch below).
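And the scrub side of the same sketch (reusing pageSize and castagnoli from the checksumming example above; the layout is still our assumption): walk every stored block, recompute each page's CRC32-C, and flag mismatches for rebuild from parity.

import (
	"encoding/binary"
	"hash/crc32"
)

// scrubBlock verifies a block stored in the page+CRC layout sketched
// earlier. It returns the index of the first corrupted page, or ok
// when every page's CRC32-C matches.
func scrubBlock(stored []byte) (badPage int, ok bool) {
	const stride = pageSize + 4 // a 4KiB page plus its 4-byte checksum
	for i, off := 0, 0; off < len(stored); i, off = i+1, off+stride {
		end := off + stride
		if end > len(stored) {
			end = len(stored) // final page may be short
		}
		rec := stored[off:end]
		page, sum := rec[:len(rec)-4], rec[len(rec)-4:]
		if crc32.Checksum(page, castagnoli) != binary.LittleEndian.Uint32(sum) {
			return i, false // bitrot: rebuild this page from parity blocks
		}
	}
	return -1, true
}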
Open Source, Open Questions
TernFS reflects XTX’s ethos: build when commodity tools can’t scale. While optimized for finance-grade ML, its patterns—stateless UDP APIs, policy-driven storage tiers (flash for random I/O, HDDs for sequential), and meticulous checksumming—offer blueprints for AI/cloud engineers. As datasets explode, TernFS challenges the status quo: sometimes, the best filesystem is the one you craft yourself.
Source: XTX Markets, "TernFS: A Distributed Filesystem for Large-Scale Machine Learning", September 2025.