Design a Distributed Lock Service: Fencing Tokens and the Failure Modes - DEV Community | LavX News

Explains why a simple lock with TTL is insufficient, describes lease expiration, split‑brain scenarios, and the role of fencing tokens to guarantee correctness, plus guidance on backing stores and interview framing.

A lock that appears held can still allow two writers to corrupt data silently

When a client believes it still owns a lock but the lock service has already granted the lock to another client, the system can end up with two concurrent writers. The failure mode is not a crash; it is a pause that outlasts the lease. During such a pause the holder does not know that time has moved on. It simply resumes and writes as if it still held the lock. The result is silent data corruption that passes all happy‑path tests.

The first question to ask when designing a lock is whether the use case demands efficiency or correctness. An efficiency lock stops duplicate work; a correctness lock protects shared state from concurrent writes. For the latter, a simple Redis key with a TTL is not sufficient, because lease expiration does not protect against a paused holder.

A lease with a TTL only guarantees that a dead holder will eventually release the lock. It does not guarantee that a holder that is merely paused cannot resume and write again. The pause can be caused by a stop‑the‑world garbage collection, a VM migration, or a network partition. When the lease expires, the lock service may hand the lock to a second client. That client may finish its work and write to storage, and then the first client wakes up and writes as well, still believing it holds the lock. Both writes succeed and the data becomes inconsistent.
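The sequence above can be reproduced in a few lines. This is a minimal in-memory sketch, not a real lock service: `FakeClock`, `TTLLock`, and the `storage` list are hypothetical names introduced only for illustration, and the clock is advanced by hand so the pause is deterministic.

```python
class FakeClock:
    """Manually advanced clock so the pause is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class TTLLock:
    """A single lock with a lease and no fencing."""

    def __init__(self, clock, ttl):
        self.clock = clock
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client):
        # Grant if the lock is free or the previous lease has expired.
        if self.holder is None or self.clock.now >= self.expires_at:
            self.holder = client
            self.expires_at = self.clock.now + self.ttl
            return True
        return False


clock = FakeClock()
lock = TTLLock(clock, ttl=10)
storage = []  # stands in for the shared resource; it accepts every write

got_a = lock.acquire("A")        # A takes the lock at t=0
clock.advance(15)                # A is paused past its 10 s lease
got_b = lock.acquire("B")        # the service hands the lock to B
storage.append(("B", "new"))     # B writes
storage.append(("A", "stale"))   # A resumes, unaware, and writes too

print(storage)  # [('B', 'new'), ('A', 'stale')]
```

Nothing in this run raises an error, which is the point: the lock behaved as specified and the storage still took a write from a client that no longer held it.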

The solution that separates a correct lock from a merely available one is a fencing token. Every grant from the lock service includes a monotonically increasing number. The client attaches that number to every write, and the protected resource rejects any write whose token is older than the highest token it has already accepted. The lock service does not need to be perfect; it only needs to be monotonic. The token is generated atomically, for example with an INCR operation in Redis or with a sequence node in ZooKeeper.
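A minimal sketch of the check on the resource side might look like the following. `FencedStore` and `StaleTokenError` are hypothetical names, and the token values 33 and 34 are arbitrary. One assumption to note: this sketch accepts a repeat of the current highest token, so the same holder can issue several writes under one grant, and rejects only tokens that are strictly older.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStore:
    """Toy resource that remembers the highest fencing token it has seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # Reject anything older than the newest token already accepted.
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self.highest_token}"
            )
        self.highest_token = token
        self.value = value


store = FencedStore()

# Client A was granted token 33, then paused. Client B was granted 34.
store.write(34, "from B")

try:
    store.write(33, "from A")  # A wakes up with its old token
    rejected = False
except StaleTokenError:
    rejected = True

print(rejected, store.value)  # True from B
```

The store never consults the lock service. It only compares integers, which is why the service needs to be monotonic rather than perfect.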


The protected resource must be able to check the token. A database can enforce the check with a conditional update, an object store can enforce it with a compare‑and‑set on a version field, but a plain file on a simple filesystem cannot. When the resource cannot verify a token the lock is only an efficiency lock wearing a correctness costume.
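As one possible illustration of a conditional update, the sketch below uses Python's built-in `sqlite3` module. The `doc` table, its `last_token` column, and the `fenced_update` helper are assumptions made for this example; a real schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc ("
    "id INTEGER PRIMARY KEY, body TEXT, last_token INTEGER NOT NULL)"
)
conn.execute("INSERT INTO doc (id, body, last_token) VALUES (1, '', 0)")
conn.commit()


def fenced_update(conn, doc_id, body, token):
    """Apply the write only if the token is not older than the stored one."""
    cur = conn.execute(
        "UPDATE doc SET body = ?, last_token = ? "
        "WHERE id = ? AND last_token <= ?",
        (body, token, doc_id, token),
    )
    conn.commit()
    # rowcount is 0 when the WHERE clause filtered the row out.
    return cur.rowcount == 1


ok_b = fenced_update(conn, 1, "from B", 34)  # newer token: applied
ok_a = fenced_update(conn, 1, "from A", 33)  # stale token: no row matches
body = conn.execute("SELECT body FROM doc WHERE id = 1").fetchone()[0]

print(ok_b, ok_a, body)  # True False from B
```

The comparison and the write happen in a single statement, so there is no gap between checking the token and applying the update.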

Backing stores that provide a monotonic number for free include ZooKeeper and etcd. These systems expose a revision or a sequence number that can be used directly as a fencing token, removing the need to maintain a separate counter. Redis can also be used, but the caller must increment a counter key atomically and must guard the release path with a Lua script that deletes the lock key only if it still holds the caller's own value.
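For the Redis case, a rough sketch could look like this. It assumes `client` is a redis-py `redis.Redis` instance connected to a single Redis node; the key names (`lock:<name>`, `lock:<name>:token`) and the helper functions are hypothetical. As a design choice of this sketch, the acquire step is also wrapped in a Lua script so that taking the lock and incrementing the counter happen as one atomic step.

```python
import uuid

# Both scripts run atomically inside Redis.
# Acquire: SET NX PX succeeds only if the lock key is absent; the counter is
# bumped in the same script so each grant gets a fresh number.
ACQUIRE_SCRIPT = """
if redis.call("set", KEYS[1], ARGV[1], "NX", "PX", ARGV[2]) then
    return redis.call("incr", KEYS[2])
end
return 0
"""

# Release: delete the lock key only if it still holds this owner's value.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def acquire(client, name, ttl_ms):
    """Return (owner_id, fencing_token), or None if the lock is taken."""
    owner_id = uuid.uuid4().hex  # random value that identifies this holder
    token = client.eval(
        ACQUIRE_SCRIPT, 2,
        f"lock:{name}", f"lock:{name}:token",
        owner_id, ttl_ms,
    )
    return (owner_id, token) if token else None


def release(client, name, owner_id):
    """Return True if the lock was still ours and has been deleted."""
    return client.eval(RELEASE_SCRIPT, 1, f"lock:{name}", owner_id) == 1
```

Without the value check in the release script, a holder whose lease has already expired could delete a lock that now belongs to someone else. The token returned by `acquire` still has to be verified by the protected resource; Redis only issues it.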

Heartbeat renewal can keep a long‑running holder from losing its lease early, but it does not solve the pause problem. If the holder is frozen the heartbeat stops, the lease expires, and the split‑brain scenario returns. Heartbeats only reduce the window in which a healthy but slow holder loses its lock.
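The difference between a healthy and a frozen holder can be sketched with a toy lease. The `Lease` class, the 10-second TTL, and the timestamps are assumptions chosen for illustration; time is passed in explicitly rather than read from a real clock.

```python
class Lease:
    """Toy lease that a holder must renew before it expires."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expires_at = 0.0

    def grant(self, now):
        self.expires_at = now + self.ttl

    def renew(self, now):
        """Heartbeat: extend the lease only if it has not already expired."""
        if now >= self.expires_at:
            return False
        self.expires_at = now + self.ttl
        return True


lease = Lease(ttl=10)
lease.grant(now=0)

# Healthy holder: a heartbeat every 4 s keeps the lease alive.
healthy = [lease.renew(now=t) for t in (4, 8, 12, 16)]

# Frozen holder: no heartbeat after t=16, so the lease lapsed at t=26.
after_pause = lease.renew(now=40)

print(healthy, after_pause)  # [True, True, True, True] False
```

The failed renewal at t=40 is the first moment the holder can learn it lost the lock, and a write issued just before that check would still go out. Only a token check at the resource closes that gap.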

Redlock improves availability across multiple Redis instances, but it does not eliminate the pause issue or the need for fencing tokens. It is a useful component for higher availability, not a complete correctness solution.

When the lock service fails it should fail closed, meaning no client can acquire the lock if the service is unavailable. This prevents the unsafe situation where the service hands the same lock to two clients at the same time. The safe failure mode is preferable to the unsafe one even though it reduces availability.
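On the client side, failing closed can be as simple as treating an unreachable service as "not acquired". The sketch below uses a hypothetical `LockServiceUnavailable` exception and stand-in backend classes; a real client library would raise its own error types.

```python
class LockServiceUnavailable(Exception):
    """Raised by the (hypothetical) backend when it cannot be reached."""


def acquire_fail_closed(backend, name):
    """Return a fencing token, or None if the lock cannot be proven held."""
    try:
        return backend.acquire(name)
    except LockServiceUnavailable:
        # Fail closed: an unreachable service means "not acquired",
        # never "assume the lock is free".
        return None


class DownBackend:
    """Stand-in for a lock service that is currently unreachable."""

    def acquire(self, name):
        raise LockServiceUnavailable("lock service unreachable")


class UpBackend:
    """Stand-in for a healthy lock service that grants token 7."""

    def acquire(self, name):
        return 7


token_when_down = acquire_fail_closed(DownBackend(), "job")
token_when_up = acquire_fail_closed(UpBackend(), "job")

print(token_when_down, token_when_up)  # None 7
```

The caller skips the protected work whenever the result is `None`, which trades availability for safety.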

Interviewers often probe deeper with follow‑up questions. They may ask what happens if two clients receive the same fencing token, how to handle a lock service outage, or whether the lock can be avoided entirely. The answer to the token duplication question is that the token source must be atomic; an INCR operation in Redis or a zxid in ZooKeeper guarantees uniqueness. If the lock service goes down the correct reaction is to block all acquisitions, which preserves safety at the cost of availability.

The broader pattern is to state the guarantee, locate the pause, fence at the resource, and pick a backing store that hands a monotonic token for free. This reasoning turns a trick question into a routine one and is the core of the System Design Pocket Guide: Fundamentals, which walks through coordination primitives, leases, and consensus with the same lens.


For further reading, the book System Design Pocket Guide: Fundamentals provides a concise walkthrough of these failure modes and the design choices that keep data consistent under concurrency. My project Hermes IDE is an IDE for developers who ship with Claude Code and other AI coding tools. My site xgabriel.com hosts my writings.

#distributed systems #Locks #fencing-tokens #Redis #system-design
