Redis at Scale: Lessons from When the System's Heart Stopped
#Infrastructure


Backend Reporter
3 min read

A deep dive into real-world Redis scaling challenges, focusing on the Hot Key incident that brought down an e-commerce platform and the architectural solutions that prevented future failures.

It was a typical Monday until, within minutes, the latency of our main e-commerce service spiked from 100 ms to 15 seconds. Our Datadog dashboard was on fire: 100% CPU usage and an ever-growing queue of pending connections. The culprit? A Hot Key. We had a 'campaign settings' key that was queried on every request. When traffic tripled due to a promotion, the cluster node holding that key simply couldn't handle the volume of IOPS (input/output operations per second). In high-performance systems, the IOPS limit defines how fast the hardware can process requests; once that ceiling is reached, requests queue up and the system freezes.

The Immediate Solution and Permanent Fix

At the height of the crisis, our immediate solution was to vertically scale the affected node, but the definitive fix came with implementing Client-side Caching. We started invalidating the local cache in applications via Redis Pub/Sub only when the configuration changed. This drastically reduced network calls and 'cooled down' the key, restoring stability to the cluster.

If you use Redis merely as an add-on to store sessions, this article isn't for you. But if your system depends on Redis for large-scale performance, here are the lessons we learned 'through pain' about architecture and resilience.

The Myth of 'Infinite Memory' and Eviction Policy

At scale, Redis isn't a database; it's a finite resource. The most common mistake is shipping with the default memory configuration. The lesson: always define a maxmemory and, more importantly, a maxmemory-policy. For pure caches, allkeys-lru is your best friend: it discards the least recently used keys to make room for new ones. Without this, Redis falls back to the default noeviction policy and returns errors on writes once RAM is full, breaking your application.
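As a starting point, this is roughly what a pure-cache node's redis.conf can look like (the 4gb budget and sample size are illustrative; size them for your own hardware):

```
# redis.conf sketch for a pure-cache node
maxmemory 4gb                 # hard memory budget; never rely on "infinite RAM"
maxmemory-policy allkeys-lru  # evict least recently used keys when full
maxmemory-samples 10          # wider LRU sampling = more accurate eviction (default 5)
```

If some keys must never be evicted, keep them out of this instance or use a volatile-* policy with explicit TTLs instead.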

Redis Cluster vs. Sentinel: When to Shift Gears?

Many teams start with Redis Sentinel. It's great for high availability (automatic failover), but it doesn't shard your data: every write still goes through a single primary. When we reached millions of operations per second, the answer was Redis Cluster.

Automatic Sharding

Redis Cluster divides your keyspace into 16,384 hash slots distributed across the nodes. That is the basis of horizontal scalability: need more performance? Add nodes and rebalance slots without downtime.
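The slot mapping itself is simple enough to sketch: per the Redis Cluster specification, a key's slot is CRC16(key) mod 16384 (the CCITT/XMODEM CRC16 variant), with a hash-tag rule so related keys can be forced onto the same node. The function names below are our own illustration, not a client library API:

```python
# Sketch of Redis Cluster's key -> hash-slot mapping: CRC16(key) mod 16384.
# CRC16 variant: CCITT/XMODEM (poly 0x1021, init 0x0000, MSB-first).

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # Hash-tag rule: if the key contains "{...}" with a non-empty body,
    # only that body is hashed, so related keys land in the same slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1 : end]
    return crc16_xmodem(key.encode()) % 16384

print(hash_slot("campaign:settings"))          # some slot in 0..16383
print(hash_slot("{user:1000}.following") ==
      hash_slot("{user:1000}.followers"))      # True: same hash tag, same slot
```

The hash-tag trick matters at scale: multi-key operations (MGET, transactions, Lua) only work when all keys live in one slot.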

The Dangers of 'Hot Keys' and 'Big Keys'

In the incident I mentioned at the beginning, the problem wasn't Redis—it was our access pattern.

Hot Keys

A hot key is a single key requested by all clients simultaneously. The solution? Read replicas or, better yet, client-side caching (introduced in Redis 6), where the application keeps a local copy and Redis notifies it only when the value changes.
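The pattern that saved us is easy to show in miniature. The sketch below simulates it entirely in-process: a local dict serves the hot key, and an invalidation callback (which in production would be driven by a Redis Pub/Sub listener or RESP3 CLIENT TRACKING messages) drops the stale copy. All names here are illustrative, not a real client library:

```python
# In-process sketch of the client-side caching pattern: each app instance
# keeps a local copy of hot keys and drops it when an invalidation message
# arrives (in production, via Redis Pub/Sub or Redis 6 CLIENT TRACKING).

class LocalCache:
    def __init__(self, fetch):
        self._fetch = fetch        # fallback loader (e.g. a real GET against Redis)
        self._local = {}           # hot keys served from process memory
        self.remote_calls = 0      # how often we actually hit the backing store

    def get(self, key):
        if key not in self._local:     # miss: go to the backing store once
            self.remote_calls += 1
            self._local[key] = self._fetch(key)
        return self._local[key]

    def on_invalidate(self, key):
        # Called by the Pub/Sub / tracking listener when the key changes upstream.
        self._local.pop(key, None)

# Simulated backing store standing in for Redis.
store = {"campaign:settings": {"discount": 10}}
cache = LocalCache(fetch=lambda k: store[k])

for _ in range(1000):               # a traffic burst on the hot key
    cache.get("campaign:settings")
print(cache.remote_calls)           # 1: only the first read left the process

store["campaign:settings"] = {"discount": 25}
cache.on_invalidate("campaign:settings")    # invalidation message arrives
print(cache.get("campaign:settings"))       # {'discount': 25}: fresh value refetched
```

This is exactly why it 'cools down' the key: a thousand requests become one network call, and correctness is preserved because writes trigger invalidation rather than relying on short TTLs.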

Big Keys

A big key is a single Hash or List weighing hundreds of megabytes. Operating on it causes processing pauses, because Redis executes data commands on a single thread. Use MEMORY USAGE (or redis-cli --bigkeys) to find these villains. O(N) commands on them are forbidden in production.

If I could give only one piece of advice: delete KEYS * from your vocabulary. In a database with millions of keys, KEYS blocks the entire server while it scans memory. Use SCAN instead: it iterates over keys incrementally without blocking. The same applies to HGETALL on giant hashes; prefer HSCAN.
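The reason SCAN is safe is its cursor contract: each call does a small, bounded amount of work and returns a batch plus a cursor, where cursor 0 means "done". Real SCAN walks the hash table with a reversed-binary cursor and survives rehashing; the toy below ignores all that and just pages over a stable in-memory list, purely to show the loop shape your client code follows (the function name is ours):

```python
# Toy illustration of the cursor contract behind SCAN: bounded work per call,
# (next_cursor, batch) returned, cursor 0 signals completion. This is NOT the
# real SCAN algorithm, just the client-side iteration shape.

def toy_scan(keys, cursor, count=100):
    batch = keys[cursor : cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(keys):
        next_cursor = 0              # iteration complete, like SCAN's 0 cursor
    return next_cursor, batch

keys = [f"session:{i}" for i in range(1000)]

seen, cursor = [], None
while cursor != 0:                   # the standard SCAN loop shape
    cursor, batch = toy_scan(keys, cursor or 0, count=128)
    seen.extend(batch)               # each round trip is cheap and non-blocking

print(len(seen))                     # 1000: every key visited, in small batches
```

With a real client the loop is the same, only `toy_scan` becomes the server round trip; COUNT is a hint for batch size, not a guarantee.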

Persistence: RDB or AOF?

Scaling Redis requires careful attention to disk. RDB (snapshots) is performant, but you might lose a few minutes of data. AOF (an append-only log of operations) is safer, but it can create a terrible disk I/O bottleneck in high-throughput systems. Pro tip: in cache clusters, it's often better to disable persistence on the primaries and keep it active only on a replica dedicated to disaster recovery.
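In redis.conf terms, that split looks roughly like this (directives are standard; the exact save thresholds are illustrative):

```
# Primary (pure cache role): no persistence, no fork/fsync stalls
save ""            # disable RDB snapshots
appendonly no      # disable AOF

# Dedicated replica (disaster-recovery role): persistence on
appendonly yes
appendfsync everysec   # fsync once per second: bounded loss, modest I/O cost
save 900 1             # optional RDB snapshot: every 15 min if >=1 change
```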

Monitoring and Observability

At scale, observability is everything. Monitor your Slow Log, understand your key distribution, and never underestimate the power of proper configuration. The Redis documentation provides excellent guidance on monitoring and optimization.
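A few redis-cli invocations we reach for first (these are standard commands and flags; substitute your own host, port, and key names):

```
redis-cli SLOWLOG GET 10          # last 10 commands above slowlog-log-slower-than
redis-cli --bigkeys               # sample the keyspace for oversized keys (uses SCAN)
redis-cli --hotkeys               # find hot keys (requires an LFU maxmemory-policy)
redis-cli MEMORY USAGE mykey      # bytes used by one suspect key
```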

Conclusion

Redis is a Ferrari, but no one drives a Ferrari at 300km/h without constant maintenance. The incident taught us that understanding access patterns and implementing proper caching strategies are as important as the infrastructure itself. Client-side caching, proper memory configuration, and avoiding O(N) commands transformed our Redis cluster from a bottleneck into a resilient performance enhancer.

Have you experienced any 'tight spots' with Redis that seemed inexplicable? Share your stories in the comments!
