A Redis rate limiter failed during a Black Friday traffic spike after a network partition blocked access to the centralized counter, causing a cascade of false denials. The fix: a local token bucket with gossip-based eventual consistency that keeps services responsive when nodes can't agree on state.
A centralized Redis rate limiter failed during a Black Friday traffic spike after a network partition blocked access to the counter, causing a cascade of false denials across the site. The limiter was strict: when Redis became unreachable, every request got denied. Users saw a wall of 429 responses. The service wasn't down because the inventory backend crashed. It was down because the rate limiter itself couldn't tolerate a partition.
The problem was a design choice that prioritized consistency over availability. The limiter used a single Redis instance with atomic INCR and EXPIRE calls. When that node became unreachable during a brief network hiccup, the INCR call timed out, the limiter treated every timeout as a denial, and the entire service throttled itself into an outage.
The CAP constraint
CAP theorem forces a choice: consistency, availability, partition tolerance. Pick two. For a rate limiter, the question is which pair matters most.
A CP rate limiter (consistency + partition tolerance) guarantees accurate counters but blocks when nodes can't communicate. That works for internal systems where correctness matters more than uptime. For public-facing APIs, a brief partition that blocks all traffic costs more than a few extra requests slipping through.
An AP rate limiter (availability + partition tolerance) keeps responding even when nodes diverge. Counters may disagree for a short period. Some extra requests get through during the split. Once the partition heals, the counters converge. For most user-facing services, a few extra hits are cheaper than a total outage.
The local token bucket
Each service instance runs its own token bucket. No remote call. No shared state. A request hits the local bucket, the bucket checks its token count, and returns immediately.
The bucket refills at a fixed rate based on elapsed time. Each token allows one request. When tokens run out, the bucket denies requests until the next refill cycle. This gives you throttling logic without any network dependency.
The trade-off: each instance has its own view of the world. A user hitting two different instances could get twice the allowed rate during a partition.
Gossip for eventual consistency
The gossip layer keeps buckets roughly in sync. Each instance periodically sends its current token count to peers and reads theirs. The convergence strategy matters:
Take the maximum of local and remote token counts. This prevents under-limiting. If a peer has more tokens than you, you adopt the higher count. If you have more, you keep yours. After a partition heals, the counts drift back toward each other.
A push-pull cycle every one to two seconds works for most systems. Faster gossip means faster convergence but adds network overhead. Slower gossip means larger divergence windows.
What breaks
Clock drift. The refill math depends on elapsed time. If two nodes have wildly different clocks, their refill rates diverge. Use NTP or monotonic timers. The comparison across nodes doesn't need exact synchronization, but the drift should stay within a few seconds.
Over-gossiping. Sending full bucket state every 10 milliseconds floods the network and adds latency. Delta-based approaches (sending only the change since last gossip) reduce overhead.
Under-refilling. A refill rate that's too slow empties the bucket fast, causing false positives. Tune the refill interval to match your expected average request rate per instance.
The trade-off dial
You can adjust the system's behavior by changing gossip parameters:
Tighter limits: increase gossip frequency or move toward a CP model with a quorum store like Redis Cluster with strict consistency.
Maximum uptime: reduce gossip frequency and let buckets diverge more during partitions.
The choice depends on what costs more: occasional over-limiting or occasional under-limiting. Most public APIs accept a small margin of over-limiting in exchange for uptime.
Multi-region implications
Each region runs its own set of limiter instances. Gossip stays local within a region. Cross-region gossip happens less frequently, accepting larger divergence windows in exchange for reduced cross-region traffic.
The approach scales horizontally. Adding instances doesn't change the gossip topology because each instance only talks to its nearest peers. The limiter doesn't become a bottleneck at any scale.
The lesson from the Black Friday outage: the centralized limiter was a single point of failure disguised as a correctness guarantee. The distributed token bucket with gossip trades a small accuracy margin for the ability to survive partitions. For most systems, that's the right trade-off.

Comments
Please log in or register to join the discussion