Rebuilding a Three-Layer Cache System: Lessons in Distributed Data Consistency
#Infrastructure

Backend Reporter
6 min read

A deep dive into fixing critical flaws in a Redis-L1-MongoDB cache hierarchy, covering hierarchy inversion, silent data loss, race conditions, deadlock risks, and scalability issues. Each problem is analyzed with root cause, fix, and trade-off discussion, emphasizing why edge-case handling determines production reliability in distributed systems.

Cache systems often appear functional until edge cases expose fundamental flaws. In a recent infrastructure project managing player data for a game backend, a three-layer cache—Redis as master, L1 as in-memory mirror, MongoDB as persistent backup—revealed eight interconnected bugs during stress testing. What seemed like a working synchronization layer harbored silent data loss paths, race conditions, and a latent deadlock. This isn’t just about fixing code; it’s about understanding how distributed systems fail when assumptions about consistency, ordering, and fault tolerance collide with reality. The fixes required rethinking not just implementation but the core guarantees each layer provides.

The architecture followed a strict hierarchy: Redis holds the canonical state, L1 mirrors Redis for low-latency reads, and MongoDB serves as durable storage. Three scheduled tasks maintained sync: L1 pulled from Redis every 10s, Redis flushed dirty keys to MongoDB every 15s, and a reconciliation task every 3min resolved divergences by treating Redis as truth. On paper, this prevented drift. In practice, subtle violations of these rules caused cascading failures under load or during transient failures.
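The three maintenance tasks and their cadences can be sketched with a standard scheduled executor. This is a minimal illustration, not the project's actual code; the class and parameter names are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the three sync tasks described above.
public class CacheSyncScheduler {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    public void start(Runnable l1Sync, Runnable redisFlush, Runnable reconcile) {
        scheduler.scheduleAtFixedRate(l1Sync, 10, 10, TimeUnit.SECONDS);     // L1 pulls from Redis
        scheduler.scheduleAtFixedRate(redisFlush, 15, 15, TimeUnit.SECONDS); // dirty keys -> MongoDB
        scheduler.scheduleAtFixedRate(reconcile, 3, 3, TimeUnit.MINUTES);    // Redis-as-truth reconciliation
    }

    public boolean isRunning() { return !scheduler.isShutdown(); }

    public void stop() { scheduler.shutdownNow(); }
}
```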

Problem 1: Hierarchy Inversion

The reconciliation task violated the master-slave contract by letting MongoDB override Redis. When MongoDB held a stale value (e.g., due to a failed flush), the task would read it, compare against L1, and write MongoDB’s value back to Redis and L1. This inverted the priority chain—MongoDB became the de facto master. The fix inverted the logic: reconciliation now reads Redis first, treating its value as ground truth. If MongoDB differs, only MongoDB and L1 are updated. This seems trivial but is foundational; without enforcing Redis as the single source of truth, no layer can be trusted during partitions or writer conflicts. The trade-off is minimal—reconciliation now does extra Redis reads—but eliminates a silent corruption vector where correct data in Redis could be overwritten by outdated MongoDB entries.
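The corrected direction of data flow can be sketched as below. The three maps stand in for the real Redis, L1, and MongoDB clients, and key/value types are simplified to String; names are illustrative, not the project's API.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: reconciliation reads Redis first and only pushes downward.
public class Reconciler {
    final Map<String, String> redis = new ConcurrentHashMap<>();
    final Map<String, String> l1 = new ConcurrentHashMap<>();
    final Map<String, String> mongo = new ConcurrentHashMap<>();

    void reconcileKey(String key) {
        String truth = redis.get(key);   // Redis is ground truth
        if (truth == null) return;       // missing key handled by the eviction path (Problem 6)
        if (!Objects.equals(mongo.get(key), truth)) mongo.put(key, truth); // update MongoDB, never the reverse
        if (!Objects.equals(l1.get(key), truth)) l1.put(key, truth);       // keep L1 mirrored
    }
}
```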

Problem 2: Silent Data Loss on Flush Failure

The auto-flush task cleared the dirty flag before confirming MongoDB write success. If the MongoDB call timed out or threw an exception, the key was permanently removed from dirtyKeys, guaranteeing the update would never retry. Data was lost silently, with no alert or recovery path. The fix introduced a two-phase approach: snapshot dirtyKeys, then only remove keys after the MongoDB write future completes successfully (using .get() to block until confirmation). Failures leave the key dirty for retry. This adds 15s of potential delay during MongoDB outages but guarantees durability. The trade-off mirrors classic distributed systems dilemmas: acknowledging writes only after persistence increases latency but prevents data loss—a necessary choice when the cache backs critical state.
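A minimal sketch of the two-phase flush, with a `mongoDown` flag simulating an outage. Everything here is illustrative scaffolding around the pattern the article describes: clear the dirty flag only after the write future confirms.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: dirty flags survive a failed MongoDB write.
public class DirtyFlusher {
    final Set<String> dirtyKeys = ConcurrentHashMap.newKeySet();
    final Map<String, String> redis = new ConcurrentHashMap<>();
    final Map<String, String> mongo = new ConcurrentHashMap<>();
    volatile boolean mongoDown = false;   // simulates a timeout/outage

    CompletableFuture<Void> writeToMongo(String key, String value) {
        return CompletableFuture.runAsync(() -> {
            if (mongoDown) throw new IllegalStateException("mongo timeout");
            mongo.put(key, value);
        });
    }

    void flush() {
        for (String key : Set.copyOf(dirtyKeys)) {           // phase 1: snapshot
            try {
                writeToMongo(key, redis.get(key)).get();     // block until confirmed
                dirtyKeys.remove(key);                       // phase 2: clear only on success
            } catch (Exception e) {
                // leave the key dirty; the next 15s cycle retries
            }
        }
    }
}
```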

Problem 3: TOCTOU Race in removeModel()

The original removeModel() checked key existence via containsKey() before removal, creating a window where another thread could delete the key between check and remove. In high-concurrency scenarios (e.g., thousands of player disconnects/sec), this caused NullPointerExceptions or double-deletion attempts. The fix replaced the two-step process with ConcurrentHashMap.remove(key), which atomically checks and removes, returning the deleted value or null. This eliminates the race with zero performance cost—ConcurrentHashMap’s remove is designed for this exact pattern. The lesson here is that in concurrent systems, "check-then-act" sequences are almost always unsafe; prefer atomic operations provided by the platform.
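The two variants side by side, in a simplified sketch (the store's real value type and method names will differ):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch contrasting the racy check-then-act with the atomic form.
public class ModelStore {
    final ConcurrentHashMap<String, Object> models = new ConcurrentHashMap<>();

    // Unsafe: another thread can remove the key between containsKey() and remove().
    Object removeModelRacy(String key) {
        if (models.containsKey(key)) {
            return models.remove(key);
        }
        return null;
    }

    // Safe: remove() atomically checks and deletes, returning the value or null.
    Object removeModel(String key) {
        return models.remove(key);
    }
}
```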

Problem 4: Deadlock Risk in Reconciliation

The reconciliation task dispatched new processTask calls from within an existing processTask block. Assuming Redis client uses a single-threaded executor (common for command ordering), the inner task could never start—the outer task blocked the only thread, waiting for the inner task to finish, which waited for the outer thread to release the executor. This caused a permanent deadlock under load. The fix flattened the task structure: all MongoDB reads and writes happen within the single processTask context, using CompletableFuture for async I/O without nesting task dispatches. This highlights a subtle but critical point: executors with limited threads (like single-threaded Redis clients) forbid recursive task submission. The trade-off is slightly less modular code but guaranteed liveness.
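The flattened shape looks roughly like this sketch. The single-threaded executor models the Redis client; the separate I/O pool and the "mongo:" stub are assumptions for illustration. The key property: the task running on the single-threaded executor waits only on the I/O pool, never on another task submitted to its own executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one processTask context, async MongoDB I/O, no nested dispatch.
public class FlatReconcile {
    final ExecutorService redisExecutor = Executors.newSingleThreadExecutor(); // models the Redis client
    final ExecutorService ioPool = Executors.newFixedThreadPool(4);            // async MongoDB I/O

    String reconcile(String key) {
        try {
            return redisExecutor.submit(() -> {
                // Never submit back to redisExecutor from here -- that deadlocks.
                CompletableFuture<String> mongoRead =
                    CompletableFuture.supplyAsync(() -> "mongo:" + key, ioPool); // stand-in read
                return mongoRead.join();  // waits on ioPool, not on our own executor
            }).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() { redisExecutor.shutdownNow(); ioPool.shutdownNow(); }
}
```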

Problem 5: MongoDB Request Storm

Reconciliation fired MongoDB queries for every key simultaneously—no batching or throttling. With 10k keys, this meant 10k concurrent reads followed by 10k writes, exhausting connection pools and spiking latency. The fix introduced batch processing: at most RECONCILE_BATCH_SIZE (default 50) MongoDB operations in flight at once, waiting for each batch to complete before starting the next. This transforms an O(n) connection spike into O(batch_size), making load predictable. The trade-off is increased reconciliation time (serialized batches vs. parallel), but tuning batch size lets operators balance throughput against MongoDB capacity—a classic throughput-latency trade-off in backend systems.
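A sketch of the batching loop, with a counter standing in for the real MongoDB read + write per key (the pool size and counter are illustrative assumptions):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: drain each batch fully before starting the next.
public class BatchedReconcile {
    static final int RECONCILE_BATCH_SIZE = 50;
    final ExecutorService ioPool = Executors.newFixedThreadPool(8);
    final AtomicInteger processed = new AtomicInteger(); // instrumentation for the sketch

    void reconcileAll(List<String> keys) {
        for (int i = 0; i < keys.size(); i += RECONCILE_BATCH_SIZE) {
            List<String> batch = keys.subList(i, Math.min(i + RECONCILE_BATCH_SIZE, keys.size()));
            CompletableFuture<?>[] inFlight = batch.stream()
                .map(k -> CompletableFuture.runAsync(() -> {
                    // stand-in for the MongoDB read + write for key k
                    processed.incrementAndGet();
                }, ioPool))
                .toArray(CompletableFuture[]::new);
            try {
                CompletableFuture.allOf(inFlight).get(); // wait before the next batch
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
    }

    void shutdown() { ioPool.shutdownNow(); }
}
```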

Problem 6: Ignoring Redis Eviction/TTL

L1 sync skipped keys when Redis.getData() returned empty (due to eviction or TTL expiry), leaving L1 with stale values. On next flush, this stale L1 data would be read as "current" and written to MongoDB, potentially overwriting a newer value that existed in Redis before eviction. The fix now treats missing Redis keys as signals to restore Redis from L1 and mark the key dirty for re-flush. This turns cache misses into active healing events. The trade-off is extra Redis writes during eviction storms, but it prevents silent data corruption—a necessary cost for correctness when using ephemeral caches like Redis as a primary store.
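The healing path can be sketched as follows, again with plain maps standing in for the real clients:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a missing Redis key during L1 sync triggers restore + re-dirty,
// instead of silently keeping (and later flushing) stale L1 data.
public class L1Sync {
    final Map<String, String> redis = new ConcurrentHashMap<>();
    final Map<String, String> l1 = new ConcurrentHashMap<>();
    final Set<String> dirtyKeys = ConcurrentHashMap.newKeySet();

    void syncKey(String key) {
        String fromRedis = redis.get(key);
        if (fromRedis == null) {             // evicted or TTL-expired
            String restore = l1.get(key);
            if (restore != null) {
                redis.put(key, restore);     // heal Redis from L1
                dirtyKeys.add(key);          // and schedule a re-flush to MongoDB
            }
        } else {
            l1.put(key, fromRedis);          // normal mirror path
        }
    }
}
```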

Problem 7: Incomplete Write Paths

The addModelFix() method wrote only to L1, relying on the next L1 sync to detect the missing Redis key and trigger a restore. This created unnecessary sync cycles and temporary inconsistency. The fix unified all writes through a single internal method that updates L1 and Redis atomically. This eliminates redundant cycles and ensures write paths always respect the hierarchy. The insight here is that cache systems need explicit, consistent write contracts—optimizing for rare code paths at the expense of correctness invites subtle bugs.
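A sketch of the unified write contract, with hypothetical method names; the point is that every public write funnels through one internal path that touches L1 and Redis together:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: all write paths go through writeThrough().
public class UnifiedWrites {
    final Map<String, String> redis = new ConcurrentHashMap<>();
    final Map<String, String> l1 = new ConcurrentHashMap<>();
    final Set<String> dirtyKeys = ConcurrentHashMap.newKeySet();

    public void addModel(String key, String value) { writeThrough(key, value); }
    public void updateModel(String key, String value) { writeThrough(key, value); }

    private void writeThrough(String key, String value) {
        l1.put(key, value);
        redis.put(key, value);  // no waiting for the next L1 sync to notice the gap
        dirtyKeys.add(key);     // MongoDB catches up on the next flush
    }
}
```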

Problem 8: O(n) ID Lookups

getDataModelFromId() scanned the entire idToDataList map per lookup, becoming a CPU hotspot at scale. With 1k entries, each lookup did up to 1k comparisons; called per player action, this added measurable latency. The fix added a reverse index (idToKey ConcurrentHashMap) maintained alongside the primary map. Lookups now require two O(1) hash table lookups instead of a linear scan. The memory cost is negligible (another map of strings), but the lookup speedup is transformative for high-frequency access patterns. This exemplifies how distributed systems often trade minimal memory for significant latency reductions in access paths.
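The reverse-index pattern, sketched with simplified types (the real value type and exact map names may differ):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: idToKey turns an O(n) scan into two O(1) hash reads.
public class IndexedStore {
    final Map<String, List<String>> idToDataList = new ConcurrentHashMap<>(); // primary map
    final Map<String, String> idToKey = new ConcurrentHashMap<>();            // reverse index

    void put(String key, String id, List<String> data) {
        idToDataList.put(key, data);
        idToKey.put(id, key);   // maintained alongside every primary-map write
    }

    List<String> getDataModelFromId(String id) {
        String key = idToKey.get(id);                       // O(1)
        return key == null ? null : idToDataList.get(key);  // O(1)
    }

    void remove(String key, String id) {
        idToDataList.remove(key);
        idToKey.remove(id);     // keep both maps consistent on delete
    }
}
```

The maintenance burden is that every mutation of the primary map must also update the index; funneling writes through one method (as in Problem 7) keeps the two maps from drifting.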

The final system isn’t faster or simpler—it’s more correct. Each fix addressed a specific failure mode that only manifested under pressure: network blips, GC pauses, thread interleavings, or memory pressure. What’s striking is how the original code passed basic tests and low-load scenarios; the bugs hid in the gaps between idealized assumptions and real-world behavior. Distributed cache management demands explicit handling of partial failures, ordering guarantees, and state reconciliation—not as afterthoughts, but as core design criteria. When Redis evicts a key under load, or MongoDB stalls for 200ms, or two threads race to delete the same entry, the system must continue operating correctly. These aren’t edge cases; they’re the main event in production systems. The full implementation is open source, offering a concrete study in how layered cache systems can be made resilient through precise, assumption-challenging fixes.
