How S3 is built - by Gergely Orosz - The Pragmatic Engineer
#Infrastructure


A behind-the-scenes look at how the world's largest storage system is designed for durability and correctness at massive scale - including the use of techniques like formal methods.


Amazon S3 is one of the largest distributed systems ever built, storing and serving data for a significant portion of the internet. Behind its simple interfaces hides an enormous amount of engineering work, careful tradeoffs, and long-term thinking. In this episode, I sit down with Mai-Lan Tomsen Bukovec, VP of Data and Analytics at AWS, who has been running Amazon S3 for more than a decade. Mai-Lan shares how S3 operates at extreme scale, what it takes to design for durability and availability across millions of servers, and why building for failure is a core principle. We also go deep into how AWS approaches correctness using formal methods, how storage tiers and limits shape system design, and why simplicity remains one of the hardest and most important goals at S3's scale.


Ten things I learned from talking with Mai-Lan

1. The mind-boggling scale of S3

S3 handles hundreds of millions of transactions per second (!!) worldwide and stores over 500 trillion objects across hundreds of exabytes of data (one exabyte is 1,000 petabytes or 1 million terabytes). To visualize the hardware: if you stacked all of S3's tens of millions of hard drives on top of each other, they would reach the International Space Station and almost back.
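To put those numbers side by side, here is a rough back-of-envelope calculation. The midpoint figures (300 EB of data, 50 million drives) are my own guesses within the ranges quoted above, not AWS numbers:

```python
# Back-of-envelope from the figures above. 300 EB and 50M drives are my own
# guesses inside the quoted "hundreds of exabytes" / "tens of millions" ranges.
objects = 500e12                    # ~500 trillion objects
total_bytes = 300 * 1e18            # assume 300 EB (1 EB = 10^18 bytes)
drives = 50e6                       # assume 50 million hard drives

print(f"average object size ≈ {total_bytes / objects / 1e6:.1f} MB")     # ~0.6 MB
print(f"average data per drive ≈ {total_bytes / drives / 1e12:.1f} TB")  # ~6 TB
```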

2. The ridiculous engineering feat of rolling out strong consistency practically overnight

When S3 launched in 2006, it was designed for eventual consistency (built on optimistic replication), trading relaxed data consistency guarantees for high availability. The engineering team later achieved strong consistency without compromising availability or increasing costs to customers: a very difficult feat! A key innovation was a replicated journal, a distributed data structure where nodes are chained together so writes flow sequentially through every node. This was combined with a new cache coherency protocol that provides a "failure allowance": multiple servers can receive requests while the system tolerates some of them failing.
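To make the chained-write idea concrete, here is a minimal sketch of how such a journal could behave. It is my simplification of the general chain-replication technique, not S3's actual implementation:

```python
# Minimal chained-write sketch (my simplification, not S3's implementation):
# a write is acknowledged only after it has flowed through every node in the
# chain, so the tail always holds all committed entries.

class Node:
    def __init__(self, name):
        self.name = name
        self.log = []          # this node's copy of the journal

    def append(self, entry):
        self.log.append(entry)

class Chain:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, entry):
        # Writes flow head -> ... -> tail; commit only once the tail has it.
        for node in self.nodes:
            node.append(entry)
        return "committed"

    def read(self):
        # Reading from the tail sees every committed write (strong consistency).
        tail = self.nodes[-1]
        return tail.log[-1] if tail.log else None

chain = Chain([Node("head"), Node("mid"), Node("tail")])
chain.write({"key": "photo.jpg", "version": 1})
print(chain.read())   # {'key': 'photo.jpg', 'version': 1}
```

The key property: because a write is only acknowledged once it has traversed the whole chain, a read from the tail can never return stale data.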

3. The quiet Rust takeover

S3 was not written in Rust in 2006, but the team has since rewritten almost every performance-critical piece of code on the request path in Rust. The motivation: achieving the highest performance and lowest latency possible. S3 now employs a lot of Rust engineers focused on this ongoing optimization work.

4. The 11 nines of durability are measured, not just promised

S3 claims 11 nines of durability (99.999999999%): a target so high I had to double-check it. The system has auditor microservices that continuously inspect every single byte across the entire fleet. When they detect something needing repair, separate repair systems automatically kick in. At any given moment, servers are failing, and the system is designed with the assumption that failure is constant. As Mai-Lan confirmed to me, the AWS team can answer this question at any time: "What is our actual durability this week/month/year?"
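As an illustration of that audit-then-repair loop (my own toy sketch, not S3's internals), the core idea is: record a checksum at write time, continuously re-verify the bytes against it, and hand any mismatch to a separate repair path:

```python
# Toy audit-then-repair loop (my illustration, not S3's internals).
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# replica -> (bytes on disk, checksum recorded at write time)
fleet = {
    "replica-1": (b"hello", checksum(b"hello")),
    "replica-2": (b"hellx", checksum(b"hello")),   # simulated bit rot
}

def audit(fleet):
    """Yield replicas whose bytes no longer match their recorded checksum."""
    for replica, (data, recorded) in fleet.items():
        if checksum(data) != recorded:
            yield replica

def repair(replica):
    # A real system re-replicates from a healthy copy; here we just log it.
    print(f"repairing {replica} from a healthy replica")

for bad in audit(fleet):
    repair(bad)   # prints: repairing replica-2 from a healthy replica
```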

5. Formal methods are practice, not theory at S3

S3 uses formal methods (automated reasoning) extensively in production. When engineers check in code to the index subsystem that handles consistency, the system automatically runs formal proofs to verify that the consistency model hasn't regressed. As Mai-Lan put it: "At a certain scale, math has to save you. Because at a certain scale, you can’t do all the combinatorics of every single edge case, but math can." Formal methods are also used to prove correctness in cross-region authorization.
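S3's tooling does real automated reasoning; as a much weaker stand-in, here is a toy version of the underlying idea: enumerate every interleaving of events in a tiny model and assert a consistency invariant on each one. The model and all names are mine, purely for illustration:

```python
# Toy exhaustive check (not S3's tooling): enumerate every interleaving of
# writes and reads against a tiny journal model and assert "monotonic reads",
# i.e. a reader never observes versions going backwards.
from itertools import permutations

def check_monotonic_reads(write_versions, num_reads=3):
    events = [("write", v) for v in write_versions] + [("read", None)] * num_reads
    for ordering in set(permutations(events)):
        latest, last_seen = 0, 0
        for kind, version in ordering:
            if kind == "write":
                latest = max(latest, version)   # journal applies writes in order
            else:
                assert latest >= last_seen, f"non-monotonic read in {ordering}"
                last_seen = latest              # the reader observes current state
    return "invariant holds across all interleavings"

print(check_monotonic_reads([1, 2, 3]))   # checks every ordering of 6 events
```

Even this toy has hundreds of orderings for six events, and real systems have astronomically more, which is exactly the combinatorics Mai-Lan says math has to take over.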

6. Correlated failures cause most problems these days at S3

Individual failures are expected and handled. The real threat is correlated failures: when multiple components fail together because they share a fault domain (same rack, same availability zone, same power source). As Mai-Lan explained, S3's architecture is designed around preventing correlated failures, with approaches like the following (a toy placement sketch follows the list):

  • Data is replicated across multiple availability zones
  • Quorum-based algorithms tolerate individual node failures
  • Physical and logical infrastructure are designed to avoid correlation
  • Every object is stored multiple times across different fault domains
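Here is a hedged sketch of the placement idea in that list: no two copies of an object may share a fault domain. The zone and rack names are invented, and real placement logic also accounts for load, capacity, and repair state:

```python
# Toy replica placement: no two copies of an object share an availability zone.
# Zone/rack names are invented; real placement is far more sophisticated.
ZONES = {
    "az-1": ["az1-rack-a", "az1-rack-b"],
    "az-2": ["az2-rack-a", "az2-rack-b"],
    "az-3": ["az3-rack-a", "az3-rack-b"],
}

def place_replicas(obj_key: str, copies: int = 3):
    """Spread `copies` replicas across distinct zones (distinct fault domains)."""
    placements = []
    for zone, racks in list(ZONES.items())[:copies]:
        rack = racks[hash(obj_key) % len(racks)]  # pick a rack inside the zone
        placements.append((zone, rack))
    return placements

print(place_replicas("photos/cat.jpg"))
# e.g. [('az-1', 'az1-rack-a'), ('az-2', 'az2-rack-a'), ('az-3', 'az3-rack-a')]
```

Losing any one zone, rack, or power source then takes out at most one replica of any given object.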

7. S3 operates on 200+ services

It's not too surprising to learn that many (micro)services sit behind each regional S3 endpoint, but it was interesting to learn that the number is around 200. This is much smaller than Uber's 4,500+ microservices, and to me it's yet another confirmation that there's no one "right" way to slice and dice your services. If anything, it proves that there's not much correlation between the number of services and the scale a system handles! Mai-Lan explained that a significant portion of these microservices are dedicated solely to durability: health checks, repair systems, and auditors. The philosophy is that complexity must be managed through simplification: each microservice must remain focused, or the system becomes unmaintainable.

8. S3 Vectors: the challenges of building a completely new data primitive

Unlike S3 Tables (which builds on objects via Parquet files), S3 Vectors is a completely new data structure. The challenge the team faced when building it: searching for nearest neighbors in high-dimensional vector space is expensive. S3's solution (a toy sketch follows the list):

  • Precompute "vector neighborhoods" — clusters of similar vectors computed asynchronously offline
  • When a new vector is inserted, it's added to one or more neighborhoods based on similarity
  • Queries first find the nearest neighborhoods, then load only those vectors into fast memory for the nearest neighbor algorithm
  • Result: sub-100ms warm query times with support for 2 billion vectors per index (!) and 20 trillion vectors per bucket (!!)
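As I understand it, these mechanics resemble a classic clustered (IVF-style) vector index. Here is a toy sketch with made-up 2D data; the real S3 Vectors index is far more involved:

```python
# Toy clustered nearest-neighbor search (my sketch, not S3 Vectors' index):
# cluster vectors offline into "neighborhoods", then at query time search
# only the closest clusters instead of the whole dataset.
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Offline: assign each vector to its nearest precomputed centroid.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
neighborhoods = {i: [] for i in range(len(centroids))}
for _ in range(1000):
    v = (random.uniform(0, 10), random.uniform(0, 10))
    nearest = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    neighborhoods[nearest].append(v)

def query(q, probe=1, k=3):
    """Load only the `probe` closest neighborhoods, then brute-force within them."""
    ranked = sorted(range(len(centroids)), key=lambda i: dist(q, centroids[i]))
    candidates = [v for i in ranked[:probe] for v in neighborhoods[i]]
    return sorted(candidates, key=lambda v: dist(q, v))[:k]

print(query((1.0, 1.0)))   # 3 nearest vectors, scanning only ~1/3 of the data
```

The win is the same as described above: the expensive clustering happens asynchronously offline, so queries only touch a small, relevant slice of the data.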

9. Crash consistency as a design philosophy

Mai-Lan told me how S3 engineers think deeply about crash consistency — the property that a system should always return to a consistent state after a fail-stop failure. Their approach: reason about all possible states a system can reach in the presence of failure, then design microservices to work together, assuming failure is always present.
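A classic single-file example of this mindset (a general technique, not something S3 told me they use) is the write-then-atomic-rename pattern: enumerate the states a crash can leave behind, and make every one of them consistent:

```python
# Classic crash-consistency pattern (general technique, not S3-specific):
# a fail-stop crash leaves either the old file or the new one on disk,
# never a half-written file.
import os, tempfile

def atomic_write(path: str, data: bytes):
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # make the bytes durable before the swap
        os.replace(tmp, path)       # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write("manifest.json", b'{"version": 2}')
```

A crash before the rename leaves the old file intact; a crash after leaves the new one; no reachable state exposes a torn write.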

10. The "Scale Is to Your Advantage" design principle

One of S3's core philosophies that I really liked is this: scale must be to your advantage. Engineers at S3 cannot build something whose performance degrades as the service grows; instead, systems must be designed so that growing larger makes them better. For example: the larger S3 gets, the more de-correlated its workloads become, which ends up improving reliability for all users.

A reading recommendation from Mai-Lan: research papers!

Mai-Lan reads a lot of research papers. For curious folks, she recommends following research on multimodal embedding models: the next frontier for understanding data across different formats (text, images, audio) through semantic vectors. As she put it: "If you think about the future of data lakes, I think it's actually going to be about metadata and the semantic understanding of our data. We will need to understand it through vectors, and search will need to be done across multiple modalities."


Watch the episode

If you're interested in how one of the largest systems in the world is built, and keeps evolving: this episode is for you.


Timestamps

  • (00:00) Intro
  • (01:03) S3's scale
  • (03:58) How S3 started
  • (07:25) Parquet, Iceberg, and S3 tables
  • (09:46) S3 for developers
  • (13:37) Why AWS keeps S3 prices low
  • (17:10) AWS pricing tiers
  • (19:38) Availability and durability
  • (26:21) The cost of S3's consistency
  • (31:22) Automated reasoning and proof of correctness
  • (35:14) Durability at AWS scale
  • (39:58) Correlated failure and crash consistency
  • (43:22) Failure allowances
  • (46:04) Two opposing principles in S3 design
  • (49:09) S3's evolution
  • (52:21) S3 Vectors
  • (1:01:16) The 50 TB limit on AWS
  • (1:07:54) The simplicity principle
  • (1:10:10) Types of engineers working on S3
  • (1:14:15) Closing recommendations

