Why You Can’t Have Consistency, Availability, and Partition Tolerance All at Once
#Infrastructure

Why You Can’t Have Consistency, Availability, and Partition Tolerance All at Once

Backend Reporter
4 min read

A seasoned distributed‑systems engineer explains why the CAP theorem still holds despite tunable quorum tricks. By walking through the W + R > N rule, a bank‑branch thought experiment, and the pitfalls of sloppy quorums, the article shows that once a network partition occurs, you must choose between consistency (CP) or availability (AP); you cannot truly achieve all three.

Why You Can’t Have Consistency, Availability, and Partition Tolerance All at Once

Featured image

In every systems interview the CAP theorem comes up like a reflex: pick two. After a decade of building Kubernetes operators, multi‑region control planes, and hybrid clouds, I stopped treating CAP as a quiz question and started treating it as a lived reality. The math behind quorum‑based stores looks solid, but the moment a partition hits, the guarantees evaporate.


The seductive math: W + R > N

  • N – total replicas of a datum.
  • W – number of replicas that must acknowledge a write.
  • R – number of replicas consulted for a read.

If W + R > N, the pigeonhole principle guarantees that the write set and the read set overlap on at least one node, so a read should see the latest write. This is the foundation of Dynamo‑style quorum systems and the reason many engineers feel they can have both consistency and availability.

The hidden assumption

The guarantee only holds if every write actually lands on the intended replicas. The overlap must be on real quorum members, not on placeholder nodes that happen to be reachable during a partition.


The bank‑branch thought experiment

Imagine a bank with five branches. A transaction is considered final once 3 branches have recorded it (W = 3, N = 5). A storm partitions the city:

  • Group A (minority) – Branches 1 & 2 (only 2 nodes).
  • Group B (majority) – Branches 3, 4 & 5 (3 nodes).

A customer at Branch 1 wants to withdraw money. The branch cannot reach three replicas, so it faces two choices:

  1. Reject the request – consistency is preserved, but the branch is unavailable.
  2. Process the request locally – availability is preserved, but the balance diverges from the majority view.

There is no third option. Adding more branches only delays the moment a partition isolates a minority; it does not eliminate the binary choice.


Sloppy quorum – why the math still looks right

To keep the minority side “available,” many systems use hinted handoff or sloppy quorums:

  • The write is acknowledged by reachable nodes that are not part of the original quorum.
  • The client sees W acknowledgments, so W + R > N still holds on paper.
  • A later read that contacts the real replicas finds no record of the write, returning stale data.

The overlap exists, but it is a ghost node that never stored the write. The underlying assumption of the quorum proof is broken, and consistency is already compromised.


CP vs. AP – the real decision point

Network partitions are inevitable; they are not a hypothetical edge case. Once a partition occurs, the system must decide:

  • CP (Consistency + Partition tolerance) – refuse writes when a quorum cannot be formed. Clients see errors or timeouts, but the state remains globally consistent.
  • AP (Availability + Partition tolerance) – accept writes on any reachable node, rely on background repair (read‑repair, anti‑entropy) to converge later.

Systems like Cassandra and DynamoDB embrace AP, exposing eventual consistency and repair mechanisms. etcd, ZooKeeper, and Consul choose CP, returning “unavailable” rather than serving possibly incorrect data.


What this means for design interviews

A good answer goes beyond reciting the theorem. Interviewers want to see that you understand:

  1. The quorum formula is conditional – it assumes writes reach the intended replicas.
  2. Partitions are a given – the system must have an explicit policy for the minority side.
  3. Trade‑offs are concrete – e.g., using hinted handoff trades immediate availability for eventual consistency, and read‑repair adds latency to convergence.
  4. Operational realities – real incidents (e.g., a split‑brain in a Kubernetes control plane) surface when the “ghost overlap” assumption fails.

Takeaways

  • You cannot have all three guarantees simultaneously; the theorem still stands when you examine the hidden assumptions.
  • Tunable consistency knobs (W/R) shift latency, not the fundamental trade‑off.
  • Design choices must be explicit – decide early whether your service should favor CP or AP and document the failure mode.
  • Observability matters – monitor for hinted writes, repair lag, and quorum failures to catch the moment the math stops matching reality.

Further reading

Understanding where the math breaks is more valuable than memorizing the theorem. It turns a textbook fact into a practical design compass.

Comments

Loading comments...