Building a Database on S3: Lessons from 2008 That Still Shape Modern Cloud Architecture
#Cloud

Tech Essays Reporter

A 2008 paper on building databases over S3 reveals how separating storage from compute became the foundation for modern cloud-native databases, with surprising parallels to today's architectures like Aurora and the lakehouse paradigm.

In 2008, when Amazon S3 was painfully slow and reads of around 100 ms were common, a groundbreaking paper proposed a database architecture that would anticipate the serverless revolution by nearly a decade. The core insight was deceptively simple yet revolutionary: separate storage from compute by using S3 as a pagestore and SQS as a write-ahead log. This design philosophy, born of necessity in an era of crude cloud services, has become the foundation of modern cloud-native databases.

The Architecture: S3 as Pagestore, SQS as WAL

The fundamental challenge was S3's latency. Direct writes of full data pages were prohibitively slow, so the authors devised an ingenious workaround. Instead of writing to S3 directly, clients would write small, idempotent redo log records to SQS queues. These logs would then be asynchronously applied to B-tree pages on S3 by client-side checkpointers. This separation of "commit" from "apply" was brilliant—it hid S3's latency while maintaining consistency.
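The commit/apply split can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the `Queue` class and `pagestore` dict are stand-ins for SQS and S3, and all names (`commit`, `checkpoint`, `pu_queue`) are invented for the sketch.

```python
import json

class Queue:
    """Stand-in for a 2008-era SQS queue (no ordering guarantees)."""
    def __init__(self):
        self.messages = []
    def send(self, msg):
        self.messages.append(json.dumps(msg))
    def receive(self, max_n=10):
        batch, self.messages = self.messages[:max_n], self.messages[max_n:]
        return [json.loads(m) for m in batch]

pagestore = {}      # stand-in for S3: page_id -> page contents
pu_queue = Queue()  # Pending Update queue acting as the WAL

def commit(records):
    """Client 'commit': enqueue small redo log records. S3 is not touched,
    so the commit path never pays S3's write latency."""
    for rec in records:
        pu_queue.send(rec)

def checkpoint():
    """Checkpointer: later, asynchronously drain the queue and apply the
    redo records to pages in the (slow) pagestore."""
    for rec in pu_queue.receive(max_n=100):
        page = pagestore.setdefault(rec["page"], {})
        page[rec["key"]] = rec["value"]

commit([{"page": "p1", "key": "alice", "value": 100}])
checkpoint()
```

The client observes only the fast enqueue; the expensive page write happens off the commit path.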

This architecture bears striking similarities to modern systems. Amazon Aurora, for instance, follows essentially the same logic: the log is replicated, and storage materializes pages independently. While Aurora's primary write acknowledgment is synchronous after storage quorum replication (unlike the manual log pulling in the 2008 system), the philosophical foundation is identical. Both systems recognize that in the cloud era, you need to treat storage as a dumb, scalable service and push intelligence to the compute layer.

Surviving SQS's Quirks

The 2008 SQS was even more primitive than today's version. The paper describes a system where requesting 100 messages might randomly return only 20, because SQS would do a best-effort poll of a subset of its distributed servers and immediately return whatever it found. This was a deliberate trade-off for low latency, but it meant FIFO ordering wasn't guaranteed.

The database handled this chaos through idempotency. Log records were designed to be processed multiple times without corruption. If messages arrived out of order or duplicates slipped through, the system would simply ignore the duplicates and process the out-of-order messages correctly. This was engineering at its finest—building correctness on top of unreliable primitives.
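One common way to get that idempotency, shown here as a hypothetical sketch rather than the paper's exact scheme, is to version each record and have the apply step ignore anything the page has already absorbed:

```python
def apply_record(page, rec):
    """Apply one redo log record to an in-memory page, but only if the
    record is newer than what the page already reflects. Replaying a
    duplicate or out-of-order stale record is a harmless no-op."""
    versions = page.setdefault("_versions", {})
    if rec["version"] <= versions.get(rec["key"], 0):
        return False                      # duplicate or stale: ignore
    page[rec["key"]] = rec["value"]
    versions[rec["key"]] = rec["version"]
    return True

page = {}
apply_record(page, {"key": "x", "value": 1, "version": 1})
apply_record(page, {"key": "x", "value": 1, "version": 1})  # duplicate
```

With this property, SQS can redeliver or reorder messages freely; the page converges to the same state regardless.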

The Atomicity Protocol: Consistency at a Cost

The initial commit protocol was naive: clients would send log records straight to Pending Update (PU) queues. But this broke atomicity—if a client crashed mid-commit, only some records might make it to the queue. The solution was an Atomicity protocol that first dumped all logs plus a final "commit" token into a private ATOMIC queue, then pushed everything to the public PU queues. This guaranteed all-or-nothing transactions.
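The two-step shape of the protocol can be sketched as follows, with plain Python lists standing in for the private ATOMIC queue and the public PU queues (function names like `drain` are invented for the sketch):

```python
def atomic_commit(atomic_queue, pu_queues, records):
    """Step 1: stage every log record, ending with a commit token, in the
    client's private ATOMIC queue. Step 2: forward to public PU queues."""
    for rec in records:
        atomic_queue.append(rec)
    atomic_queue.append({"type": "COMMIT"})   # the all-or-nothing marker
    drain(atomic_queue, pu_queues)

def drain(atomic_queue, pu_queues):
    """Also run on recovery. If no COMMIT token is present, the crash
    happened mid-staging, so the partial batch is discarded rather than
    half-applied; if the token is there, forwarding simply resumes."""
    if not any(m.get("type") == "COMMIT" for m in atomic_queue):
        atomic_queue.clear()
        return
    for rec in atomic_queue:
        if rec.get("type") != "COMMIT":
            pu_queues[rec["page"]].append(rec)
    atomic_queue.clear()
```

A crash before the token is written loses the whole transaction; a crash after it is repaired by rerunning `drain`, which is safe because the records are idempotent.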

The cost was staggering: the extra SQS messages made the Atomicity protocol nearly twenty times as expensive as the naive approach, roughly $2.90 per 1,000 transactions versus $0.15. This illustrates a profound truth about distributed systems: consistency carries a literal monetary cost. The system was paying, in dollars, for correctness.

The Engineering Complexity

Building a real database on dumb cloud primitives required massive engineering effort. The system had to implement Record Managers, Page Managers, and buffer pools entirely on the client side, just to cluster tiny records into pages. For distributed coordination, they hacked SQS into a locking system with dedicated LOCK queues and carefully timed tokens.
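The LOCK-queue trick works because receiving an SQS message hides it from other consumers for a visibility timeout. A toy emulation, with an injectable clock and all names invented here, might look like this:

```python
import time

class LockQueue:
    """Emulates the paper's LOCK-queue trick: a queue holding a single
    token. Receiving the token (with a visibility timeout) grants a
    time-limited lock; if the holder dies without releasing it, the
    token simply reappears when the timeout lapses."""
    def __init__(self, timeout, now=time.monotonic):
        self.timeout = timeout
        self.now = now            # injectable clock for testing
        self.hidden_until = 0.0   # token is invisible while someone holds it

    def try_acquire(self):
        t = self.now()
        if t >= self.hidden_until:            # token visible: take it
            self.hidden_until = t + self.timeout
            return True
        return False                          # held by another client
```

The appeal is that no lock server is needed; the downside, as the paper found, is that whoever holds the token serializes all checkpointing for that queue.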

The B-link tree implementation was particularly clever. Since S3 pages could be arbitrarily out-of-date compared to real-time logs in SQS, they needed a lock-free approach. B-link trees allowed readers to keep moving while background checkpoints split or reorganized index pages. Each node pointed to its right sibling, so if a checkpoint split a page, readers would simply follow the pointer without blocking.
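The reader-side behavior is the essence of the B-link idea: each node records a high key and a right-sibling link, so a reader holding a stale reference just walks right instead of blocking. A minimal sketch over a dict of leaf nodes (the representation is invented for illustration):

```python
def blink_lookup(nodes, start, key):
    """Look up a key starting from a possibly stale leaf reference. If a
    concurrent split has moved the key past this node's high key, follow
    the right-sibling link, as a B-link tree reader would."""
    node = nodes[start]
    while node["high_key"] is not None and key >= node["high_key"]:
        node = nodes[node["right"]]       # chase the sibling pointer
    return node["entries"].get(key)

# Two leaves after a split; a stale reader still lands on "A" first.
nodes = {
    "A": {"entries": {1: "a", 3: "c"}, "high_key": 5, "right": "B"},
    "B": {"entries": {7: "g"}, "high_key": None, "right": None},
}
```

A lookup for key 7 arrives at the stale leaf "A", notices 7 exceeds its high key, and hops to sibling "B" without ever taking a lock.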

However, this came with limitations. The LOCK queue token ensured only one thread checkpointed a specific PU queue at a time, creating a serious bottleneck. Hot-spot objects updated thousands of times per second simply couldn't scale under this design. The paper was honest about these trade-offs, acknowledging that extreme availability required sacrificing traditional isolation guarantees.

Isolation in the Age of Scale

The system threw traditional ANSI SQL-style isolation out the window. The authors argued that strict consistency couldn't survive at scale in this architecture. The atomicity protocol prevented dirty reads by ensuring only fully committed logs left a client's private queue, but commit-time read-write and write-write conflicts were ignored entirely. Last-writer wins became the default, making lost updates common.

To make this usable, consistency was pushed to the client layer. Each client tracked the highest commit timestamp it had seen, rejecting any older versions from S3 and rereading. Version counters on log records and page headers ensured monotonic writes. Checkpoints sorted logs and deferred out-of-order SQS messages so each client's writes stayed in order.
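The high-water-mark idea translates into a small client-side guard. This is an illustrative sketch, not the paper's implementation; `fetch` stands in for an eventually consistent S3 GET that may hit a stale replica:

```python
class MonotonicReader:
    """Client-side monotonic reads: remember the newest commit timestamp
    seen for each page and reject (then retry past) any older version an
    eventually consistent store hands back."""
    def __init__(self, fetch):
        self.fetch = fetch      # fetch(page_id) -> {"ts": ..., "data": ...}
        self.high_water = {}    # page_id -> newest timestamp seen

    def read(self, page_id, max_retries=5):
        for _ in range(max_retries):
            page = self.fetch(page_id)    # may return a stale replica
            if page["ts"] >= self.high_water.get(page_id, 0):
                self.high_water[page_id] = page["ts"]
                return page["data"]
        raise RuntimeError("store never returned a version as new as one already seen")
```

Each client thus sees time move forward for itself, even though the underlying store makes no such promise.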

Interestingly, the paper claimed snapshot isolation hadn't been implemented in distributed systems yet because it strictly required a centralized global counter to serialize transactions. This was flagged as a fatal bottleneck and single point of failure. Looking back, this claim seems outdated. Modern distributed database systems implement snapshot isolation using loosely synchronized physical clocks and hybrid logical clocks to give global ordering with no centralized counter at all.
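To see why no central counter is needed, here is a minimal hybrid logical clock, written from the standard HLC construction rather than any particular database's code (the clock is injectable so the sketch is deterministic):

```python
class HLC:
    """Minimal hybrid logical clock: a physical timestamp plus a logical
    counter, so causally related events get strictly increasing
    timestamps without any centralized sequencer."""
    def __init__(self, now):
        self.now = now              # callable returning physical time
        self.wall, self.logical = 0, 0

    def tick(self):
        """Local event or message send."""
        pt = self.now()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, remote):
        """Message receive: merge the sender's timestamp."""
        pt = self.now()
        w = max(self.wall, remote[0], pt)
        if w == self.wall == remote[0]:
            self.logical = max(self.logical, remote[1]) + 1
        elif w == self.wall:
            self.logical += 1
        elif w == remote[0]:
            self.logical = remote[1] + 1
        else:
            self.logical = 0
        self.wall = w
        return (self.wall, self.logical)
```

Every node runs its own clock, yet timestamps across nodes respect causality, which is the ingredient the 2008 authors assumed required a global counter.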

From Primitive Cloud to Modern Data Lakes

The paper's context is crucial to understanding its significance. In 2008, cloud services were brutally primitive. Building a database meant implementing everything from scratch on top of unreliable, slow building blocks. The engineering effort was massive, and the performance was far from ideal.

Yet the core insight—separating storage from compute—has proven timeless. In recent years, S3 has become dramatically faster, and in 2020 it gained strong read-after-write consistency for all PUTs and DELETEs. This made it much easier to build databases, especially for analytical workloads, directly over S3. The modern data lake and lakehouse paradigms owe much to this architectural insight.

Systems like Databricks (Delta Lake), Apache Iceberg, and Snowflake all embody the philosophy first articulated in this 2008 paper. They treat object storage as the source of truth and push computation to stateless compute layers. The lakehouse architecture, which combines the scalability of data lakes with the ACID guarantees of data warehouses, is essentially the realization of what this paper envisioned.

The Enduring Lesson

The most profound lesson from this paper isn't about specific technologies or protocols—it's about architectural philosophy. When building systems in the cloud, you need to embrace the primitives you're given, even when they're messy and unreliable. You need to push intelligence to the edges, keep the storage layer dumb and scalable, and handle consistency at the application level.

This philosophy has only become more relevant as cloud architectures have evolved. Serverless computing, microservices, and edge computing all embody the same principle: separate concerns, push complexity to where it can be managed, and build on top of simple, scalable primitives.

The 2008 paper was building serverless before the term existed. It was disaggregating storage and compute before that became the default pattern. It was accepting eventual consistency and pushing isolation guarantees to the client before that became the pragmatic approach to building scalable systems.

Looking at modern cloud-native databases, from Aurora to Snowflake to the various lakehouse implementations, you can trace a direct line back to this paper's core insight. The specific technologies have evolved—S3 is faster, SQS is more reliable, we have better synchronization primitives—but the fundamental architectural pattern remains the same.

Sometimes the most revolutionary ideas are the ones that seem obvious in retrospect. Separating storage from compute, treating object storage as the source of truth, and pushing computation to stateless layers—these patterns are now standard practice. But in 2008, they were radical departures from traditional database architecture.

The paper's legacy isn't in its specific implementations or protocols, which have been superseded by better technologies. It's in the architectural philosophy it articulated—a philosophy that continues to shape how we build distributed systems in the cloud era. When we talk about serverless, disaggregation, or the lakehouse paradigm, we're really talking about the ideas first explored in this paper, adapted to a world where cloud primitives are finally fast and reliable enough to make them practical.

In the end, the paper teaches us that great architecture isn't about finding perfect solutions—it's about finding workable solutions within the constraints you're given, and having the vision to see how those constraints might evolve. The authors of this 2008 paper didn't just build a database on S3; they laid the groundwork for how we think about building distributed systems in the cloud.
