Building a Simple Distributed Database Taught Me Why Consistency Models Matter
#Regulation

Building a Simple Distributed Database Taught Me Why Consistency Models Matter

Backend Reporter
9 min read

When I set out to build a distributed database from scratch, I thought consistency would be straightforward. I quickly discovered that the trade-offs between consistency, availability, and partition tolerance create fascinating challenges that shape modern database systems.

I've been working on a distributed database project with an unusual constraint: I must implement everything from scratch. No existing database engines, no distributed consensus libraries, no built-in consistency mechanisms. At first, I thought: "Consistency is simple. Data should be the same everywhere." Then I started thinking about what actually happens when multiple nodes need to agree on data. How does a system ensure that all copies of data remain synchronized? What happens when network partitions occur? That question pushed me into learning something I had seen before but never really understood: CAP theorem consistency models distributed consensus

This post isn't about implementation code. I'm still working on the project. Instead, this is the theoretical understanding I gained while building it.

The illusion of "perfect consistency"

Humans often think of data consistency as an absolute state. Either data is consistent, or it isn't. We see a bank account balance and expect it to be the same everywhere, all the time. To the machine, this initially looks closer to a complex dance of messages, acknowledgments, and timing. There is no single source of truth. There is no global clock. There is only a network of nodes with their own views of reality.

The computer needs protocols that gradually transform local perspectives into global agreement.

Stage 1: Understanding the CAP theorem

Before diving into implementation, I had to revisit the CAP theorem. I knew it existed but never deeply considered its implications. The theorem states that a distributed system can only guarantee two out of three properties:

Consistency: All nodes see the same data at the same time Availability: Every request receives a response (without guarantee that it contains the most recent version) Partition tolerance: The system continues to operate despite network failures

What surprised me was how these properties interact. In a real-world system with network partitions (which are inevitable), you must choose between consistency and availability.

I initially thought: "Why not have both?" Then I encountered scenarios where this proved impossible. Consider a system with two nodes, A and B, that can't communicate due to a network partition:

  • Node A receives an update to a record
  • Node B receives a read request for that same record

If the system guarantees availability, Node B must respond with some data. But since it can't communicate with Node A, it has two choices:

  1. Return stale data (violating consistency)
  2. Return an error (violating availability)

This fundamental trade-off reshaped how I approached system design.

Stage 2: Exploring consistency models

With the CAP theorem in mind, I began exploring different consistency models. I discovered that "consistency" isn't binary—it exists on a spectrum.

Strong consistency: This is what most people think of as "true" consistency. After an update, all subsequent reads will return the updated value. It's like having a single global state, but achieving it requires coordination that can impact availability.

Linearizability: A slightly relaxed form of strong consistency where operations appear to occur instantaneously at some point between their invocation and response. It's the model used in systems like ZooKeeper and etcd for critical operations.

Sequential consistency: Operations are executed in an order consistent with some sequential ordering of the operations, but not necessarily the real-time order. This model allows for more parallelism than linearizability.

Eventual consistency: The system guarantees that if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. This is the model used in many NoSQL databases like Cassandra and DynamoDB.

Causal consistency: Operations that are causally related are seen by all processes in the same order. Operations that are not causally related may be seen in a different order by different processes.

Each model makes different trade-offs between consistency, latency, and availability. What I found most interesting is that different parts of a system can use different consistency models based on their specific requirements.

Stage 3: Implementing consensus algorithms

To achieve strong consistency, I needed to implement consensus algorithms. I started with Paxos, which proved to be more complex than I anticipated. The algorithm works through a series of phases:

  1. Prepare: Proposers send proposals to acceptors
  2. Promise: Acceptors promise not to accept any earlier proposals
  3. Accept: Proposers send values to acceptors
  4. Accepted: Acceptors accept values if they haven't promised to accept a higher proposal number

The complexity comes from handling failure scenarios:

  • What if a proposer fails after sending prepare but before accept?
  • What if acceptors fail at different points?
  • How do we ensure progress even with failures?

I then moved to Raft, which proved more intuitive for implementation. Raft achieves consensus through leader election and log replication:

  1. Leader election: Nodes vote for a leader
  2. Log replication: Leaders append client commands to their logs and replicate them to followers
  3. Safety: The system ensures that committed entries are stored on a majority of nodes

What I found valuable was understanding how these algorithms ensure safety (the system makes correct progress) and liveness (the system makes progress eventually).

Stage 4: Handling network partitions

Network partitions are inevitable in distributed systems. I learned that how a system handles partitions defines its consistency guarantees. In a system that prioritizes consistency during a partition:

  • Nodes might stop accepting writes to avoid inconsistency
  • Reads might return errors instead of potentially stale data
  • The system might become unavailable until the partition heals

In a system that prioritizes availability during a partition:

  • Nodes might continue accepting writes locally
  • Reads might return stale or inconsistent data
  • The system remains operational but with potentially inconsistent state

This trade-off becomes especially challenging in geo-distributed systems where network latency between regions can create partitions that last for extended periods.

Stage 5: Designing for partial failures

As I built my system, I encountered a fundamental truth: partial failures are more common than total failures. In a distributed system with 100 nodes, the probability that exactly 50 nodes fail simultaneously is different from the probability that any subset of nodes fails.

This led me to design patterns for handling partial failures:

Quorums: Instead of requiring all nodes to agree, require only a majority. This allows the system to continue operating even if some nodes fail. For example, in a system with 5 nodes, a write might need to be replicated to 3 nodes to be considered successful.

Vector clocks: To track causality between operations across nodes, I implemented vector clocks. Each node maintains a vector of version numbers, with one entry per node. When a node updates data, it increments its own entry in the vector. When data is replicated, vectors are merged to maintain causality.

Gossip protocols: For disseminating state information across the cluster, I implemented epidemic broadcast (gossip) protocols. Nodes periodically exchange state with random peers, causing information to spread throughout the cluster like an epidemic.

These patterns taught me that consistency isn't just about making all nodes agree—it's about making them agree efficiently and reliably in the face of failures.

Stage 6: Understanding the impact of consistency on application design

What surprised me most was how consistency models impact application design. In a strongly consistent system:

  • Application logic can be simpler because data is always up-to-date
  • Concurrency control becomes easier
  • However, latency increases due to coordination overhead

In an eventually consistent system:

  • Application logic becomes more complex because it must handle stale data
  • Conflict resolution mechanisms are needed
  • However, latency decreases and availability increases

I found myself designing different APIs for different consistency levels. For strongly consistent operations, I used synchronous APIs that block until consensus is reached. For eventually consistent operations, I used asynchronous APIs that return immediately and notify clients when operations complete.

Trees of dependencies started appearing everywhere

While reading more, I kept finding the same idea: Everything becomes a dependency graph.

The consistency requirements for an application can be imagined as:

Application | +-- User Profile | | +-- Strong consistency (user authentication) +-- Shopping Cart | | +-- Eventual consistency (cart contents) +-- Product Catalog | +-- Eventual consistency (product availability)

I found this interesting because suddenly many things I had heard before started making more sense:

  • Database replication strategies
  • Distributed transactions
  • Conflict-free replicated data types (CRDTs)
  • Saga pattern for distributed transactions

They all repeatedly address the same fundamental challenge: maintaining consistency across distributed components with different trade-offs.

I finally understood what eventual consistency really means

I had heard "eventual consistency" many times and thought it sounded like a poor compromise. The concept is more nuanced than I expected. Eventual consistency doesn't mean "inconsistent for a while." It means:

  1. If no new updates are made, all replicas will eventually become consistent
  2. All replicas will converge to the same value given enough time and no updates
  3. The system guarantees that no updates are lost (they may be delayed, but not lost)

The "eventual" part refers to the fact that there's no guaranteed time bound on when consistency will be achieved. In practice, well-designed eventually consistent systems achieve consistency within milliseconds or seconds, not hours or days.

Error handling became less mysterious

Before learning this, consistency errors felt like edge cases. For example, concurrent updates to the same record. Humans intuitively understand that updates should be ordered. The system notices because its consistency model expected a sequence of operations but instead received concurrent operations.

So consistency errors are not random. They're essentially: "These operations cannot be applied in this order due to dependencies."

What surprised me the most

I started this project trying to build a database. I expected to learn about storage engines, indexing, and query optimization. Instead, one of the biggest lessons became understanding how consistency models shape system behavior.

Now whenever I use:

  • A strongly consistent database like PostgreSQL
  • An eventually consistent database like Cassandra
  • A consensus system like etcd

I no longer see magic. I imagine the underlying protocols: Application request ↓ Local processing ↓ Message passing ↓ Consensus protocol ↓ Replication ↓ Global state ↓ Application response

Before this project, consistency felt like a configuration option. Now it feels like a series of carefully designed trade-offs that permeate every aspect of system design.

I'm still working on this distributed database project, and I'm still learning, but understanding the theory behind consistency models changed how I look at distributed systems entirely.

Have you ever started a project for one reason and ended up understanding an entirely different layer of computing?

Comments

Loading comments...