
Distributed Locks with Redis: What Works, What Breaks

A practical look at distributed locking with Redis: SET NX EX, Redlock, fencing tokens, and the failure modes that cause data corruption.

5 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • How SET NX EX implements a basic lock
  • Why naive locks corrupt data under GC pauses
  • What Redlock claims and what critics dispute
  • Why fencing tokens are the real correctness story
  • When to avoid locks entirely

Prerequisites

  • Familiarity with how APIs work
  • Basic Redis knowledge

What and Why

A distributed lock is a mutual exclusion primitive across multiple processes that share no memory. You reach for one when two workers might process the same job, two services might both decrement the same inventory, or two cron jobs might fire at the same wall-clock time.

Redis is a popular choice because it is fast, ubiquitous, and exposes the right primitives. But distributed locking is one of those problems where many implementations look right and are subtly wrong. The failure modes are silent: duplicate processing, double charges, lost updates.

Mental Model

A lock has three responsibilities: acquire exclusively, prevent other holders, and release safely. The hard part in a distributed setting is that the lock holder may pause (GC, swap, network), and the lock may expire while the holder still believes it holds it. By then another process has acquired the lock, and the first one wakes up and writes anyway.

This is not a hypothetical. JVM GC pauses of multiple seconds do happen, especially with large heaps. Kubernetes pods get evicted. Network partitions happen. Any lock implementation has to assume the holder can be arbitrarily delayed.

Architecture

The minimal Redis lock:

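import uuid
import redis

class LockUnavailable(Exception):
    pass

r = redis.Redis()
token = str(uuid.uuid4())  # random value that identifies this holder

# Acquire: set the key only if it is absent (NX), with a 30 s TTL (EX)
ok = r.set("lock:order:42", token, nx=True, ex=30)
if not ok:
    raise LockUnavailable("lock:order:42 is held by another client")

try:
    do_critical_section()  # placeholder for the protected work
finally:
    # Release only if we still own it (Lua for atomicity)
    r.eval("""
        if redis.call('get', KEYS[1]) == ARGV[1] then
            return redis.call('del', KEYS[1])
        end
        return 0
    """, 1, "lock:order:42", token)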

The token is a random UUID. It identifies the holder. The Lua script ensures we never delete someone else’s lock when ours has expired and theirs has been acquired.
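The acquire and release steps above can be packaged as a context manager so the ownership check is hard to forget. This is a minimal sketch, assuming `client` behaves like a redis-py client (it only needs `set(name, value, nx=..., ex=...)` and `eval(script, numkeys, *keys_and_args)`). The names `redis_lock`, `RELEASE_SCRIPT`, and `LockUnavailable` are illustrative, not part of any library.

```python
import uuid
from contextlib import contextmanager

# Compare-and-delete: remove the key only if it still holds our token.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


class LockUnavailable(Exception):
    """Raised when another client currently holds the lock."""


@contextmanager
def redis_lock(client, key, ttl_seconds=30):
    """Hold `key` for the duration of the with-block (hypothetical helper).

    `client` is assumed to behave like a redis-py client.
    """
    token = str(uuid.uuid4())  # random value that identifies this holder
    if not client.set(key, token, nx=True, ex=ttl_seconds):
        raise LockUnavailable(key)
    try:
        yield token
    finally:
        # Returns 0 if the lock expired and now belongs to someone else;
        # in that case nothing is deleted.
        client.eval(RELEASE_SCRIPT, 1, key, token)
```

Usage would look like `with redis_lock(redis.Redis(), "lock:order:42"): do_critical_section()`. The wrapper only tidies up acquisition and release; it does nothing about a holder that pauses past the TTL.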

Client A             Redis              Client B
   |                   |                   |
   |---acquire lock--->|                   |
   |<--ok (TTL 30s)----|                   |
   |                   |                   |
[ GC pause 35s ]       |                   |
   |             [ TTL expires ]           |
   |                   |<--acquire lock----|
   |                   |---ok (TTL 30s)--->|
   |                   |                   |
   |---write---------->|<---------write----|
   X  both write, both think they hold the lock

Lock expiry during a GC pause causes double-write

This is why a lock with a TTL is, on its own, insufficient for correctness, no matter how many Redis instances back it. You need either a fencing token or a different design.
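The timeline in the diagram can be replayed without a Redis server. The sketch below uses a manually advanced clock and an in-memory stand-in for SET NX EX; `FakeClock` and `TtlLock` are made up for illustration and are not Redis APIs.

```python
class FakeClock:
    """Manually advanced clock so the example needs no real sleeping."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class TtlLock:
    """In-memory stand-in for SET NX EX; not a real Redis client."""

    def __init__(self, clock):
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner, ttl):
        if self.owner is not None and self.clock.now < self.expires_at:
            return False  # still held and not yet expired
        self.owner, self.expires_at = owner, self.clock.now + ttl
        return True


clock = FakeClock()
lock = TtlLock(clock)
writes = []  # the shared resource, which checks nothing

assert lock.acquire("A", ttl=30)  # A acquires at t=0
clock.advance(35)                 # A is paused for 35 s; the TTL lapses
assert lock.acquire("B", ttl=30)  # B acquires at t=35
writes.append("B")                # B writes while holding the lock
writes.append("A")                # A wakes up and writes, unaware it lost the lock

print(writes)  # ['B', 'A']
```

Both writes land because nothing on the resource side knows which holder is current. The lock behaved exactly as specified and the data was still written twice.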

Trade-offs

Single-node Redis lock. Simple, fast, works for advisory locking (best-effort dedup). It breaks on Redis failover: replication is asynchronous, so a replica can be promoted before it has received the lock key, and the new primary will then grant the same lock to a second client.

Redlock. Antirez’s algorithm: acquire the lock on N independent Redis nodes and claim success only if a majority grant it before the lock’s validity time runs out. It aims to survive the loss of a minority of nodes. Martin Kleppmann famously argued that it provides no safety guarantee under process pauses (such as GC) or clock drift. Antirez countered that it is safe under specific timing assumptions. The practical verdict: Redlock tolerates Redis node crashes but does not solve the pause problem.
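The majority rule can be sketched in a few lines. This is a simplified illustration rather than a complete Redlock implementation: it assumes each entry in `nodes` is a redis-py-style client for an independent server, it omits per-node timeouts and retries, and the drift allowance (1% of the TTL plus 2 ms) is an assumed value. `acquire_redlock` is a hypothetical name.

```python
import time

# Compare-and-delete, used to undo partial acquisitions.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


def acquire_redlock(nodes, key, token, ttl_ms,
                    now_ms=lambda: time.monotonic() * 1000):
    """Try to take `key` on a majority of `nodes`.

    Returns the remaining validity time in ms on success, or 0 on failure.
    """
    start = now_ms()
    acquired = 0
    for node in nodes:
        try:
            if node.set(key, token, nx=True, px=ttl_ms):
                acquired += 1
        except Exception:
            pass  # an unreachable node does not count toward the quorum

    elapsed = now_ms() - start
    drift = ttl_ms * 0.01 + 2  # assumed clock-drift allowance
    validity = ttl_ms - elapsed - drift
    quorum = len(nodes) // 2 + 1

    if acquired >= quorum and validity > 0:
        return validity

    # Failed: release whatever we did get so other clients are not blocked.
    for node in nodes:
        try:
            node.eval(RELEASE_SCRIPT, 1, key, token)
        except Exception:
            pass
    return 0
```

Note what the returned validity time means: the caller is expected to finish within it. Nothing here stops a caller that pauses for longer and then carries on, which is the crux of the criticism.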

Fencing tokens. The only general fix. Each lock acquisition returns a monotonically increasing token. The downstream system (database, storage) records the highest token it has seen and rejects writes with a lower token. If A holds token 7, pauses, and B acquires token 8 and writes, then A’s later write with token 7 is rejected by the storage.

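token = lock_acquire("lock:order:42")  # hypothetical helper; returns 7 here
# Apply the write only if no newer token has been seen, and record ours atomically
db.execute(
    "UPDATE orders SET status = ?, max_token = ? WHERE id = ? AND max_token <= ?",
    (new_status, token, 42, token),
)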

Google Chubby and ZooKeeper support this pattern: Chubby through lock sequencers, ZooKeeper through values such as the zxid or a znode version number. Redis can simulate it with INCR on a separate counter, but you have to enforce the token check in the downstream system. If the downstream cannot enforce it, Redlock cannot save you.
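To make the storage-side check concrete, here is a self-contained sketch using SQLite in place of the real database. The `orders` schema, the `max_token` column, and the helper names are assumptions for illustration; `itertools.count` stands in for Redis INCR on a counter key and starts at 7 to match the example above.

```python
import itertools
import sqlite3

# Stand-in for Redis INCR: every acquisition gets a strictly larger number.
_next_token = itertools.count(7)


def acquire_with_token():
    """Pretend to acquire the lock and return its fencing token (hypothetical)."""
    return next(_next_token)


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, status TEXT, max_token INTEGER NOT NULL DEFAULT 0)"
)
db.execute("INSERT INTO orders (id, status) VALUES (42, 'new')")


def fenced_update(order_id, status, token):
    """Apply the write only if no newer token has been recorded for this row."""
    cur = db.execute(
        "UPDATE orders SET status = ?, max_token = ? "
        "WHERE id = ? AND max_token <= ?",
        (status, token, order_id, token),
    )
    db.commit()
    return cur.rowcount == 1  # False means the write was fenced off


token_a = acquire_with_token()  # A gets 7, then pauses past its TTL
token_b = acquire_with_token()  # B gets 8 after A's lock expired

print(fenced_update(42, "paid", token_b))       # True: B's write lands
print(fenced_update(42, "cancelled", token_a))  # False: A's stale write is rejected
```

The comparison and the token update happen in one statement, so there is no window between checking the token and writing. The `<=` lets the current holder write more than once with the same token.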

Don’t lock if you can avoid it. Idempotency keys, optimistic concurrency control (compare-and-set), and event sourcing eliminate many lock use cases. A unique constraint on (order_id, charge_attempt_id) prevents double charges without a lock.
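The unique-constraint approach can be shown with SQLite. The `charges` table and the `record_charge` helper are hypothetical; the point is only that the database, not a lock, rejects the duplicate.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE charges (
        order_id INTEGER NOT NULL,
        charge_attempt_id TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        UNIQUE (order_id, charge_attempt_id)
    )
""")


def record_charge(order_id, charge_attempt_id, amount_cents):
    """Insert the charge once; a retry with the same attempt ID is a no-op."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO charges VALUES (?, ?, ?)",
                (order_id, charge_attempt_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: this attempt was already recorded


print(record_charge(42, "attempt-1", 1999))  # True
print(record_charge(42, "attempt-1", 1999))  # False: the retry is deduplicated
```

A worker that pauses and retries late gets the same answer as one that retries immediately, which is why this holds up where a TTL lock does not.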

Practical Tips

  1. Always set a TTL. Forever-locks become outages when the holder crashes.
  2. Always check ownership on release. Use the Lua script above. Never DEL blindly.
  3. Make the critical section short. The longer the section, the higher the chance that a GC pause or other stall pushes it past the TTL.
  4. Renew, don’t extend hope. If your work might run long, refresh the TTL periodically while you hold it, checking ownership on each renewal. Stop renewing as soon as the work is done.
  5. Use fencing tokens for correctness. Treat the Redis lock as a coordination hint, not a correctness boundary, unless the protected resource enforces tokens.
  6. Prefer idempotency for retries. Locks plus retries plus TTL is a recipe for duplicates. Idempotency keys solve the same problem more reliably.
  7. Monitor lock acquisition time. If p99 acquisition time is climbing, you have contention or a hot key; address it before it causes timeouts.
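Tips 2 and 4 combine naturally: a renewal should check ownership the same way a release does. The sketch below assumes a redis-py-style `client` with an `eval` method. `renew`, `Renewer`, and the choice to renew at one third of the TTL are illustrative assumptions, not a library API.

```python
import threading

# Extend the TTL only if the key still holds our token.
RENEW_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
end
return 0
"""


def renew(client, key, token, ttl_ms):
    """Return True if we still own the lock and its TTL was refreshed."""
    return client.eval(RENEW_SCRIPT, 1, key, token, ttl_ms) == 1


class Renewer(threading.Thread):
    """Background thread that refreshes the lock until stop() is called."""

    def __init__(self, client, key, token, ttl_ms):
        super().__init__(daemon=True)
        self.client, self.key = client, key
        self.token, self.ttl_ms = token, ttl_ms
        self.lost = threading.Event()  # set if a renewal finds the lock gone
        self._stop_event = threading.Event()

    def run(self):
        # Wake up every ttl/3 (assumed ratio) so a missed beat is survivable.
        while not self._stop_event.wait(self.ttl_ms / 3000):
            if not renew(self.client, self.key, self.token, self.ttl_ms):
                self.lost.set()
                return

    def stop(self):
        self._stop_event.set()
        self.join()
```

The worker would start a `Renewer` after acquiring, check `lost` before each externally visible write, and call `stop()` before releasing. Renewal narrows the window but does not close it: a process that is paused cannot renew either, so fencing tokens are still what protects the data.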

Wrap-up

Redis distributed locks are a useful tool when you understand the boundaries. They prevent honest workers from stepping on each other; they do not prevent a paused worker from corrupting data. For mutual exclusion that affects state, pair the lock with fencing tokens or replace it with idempotency and compare-and-set. The teams I have seen burned by distributed locks were not running bad code; they were assuming a lock meant exclusivity. It does not, not in this world. Design for the pause.