System Design: Distributed Rate Limiter

Requirements

  • Enforce per-user and per-IP limits (e.g., 100 req/min), tolerate short bursts, keep the check low latency on the hot path, and serve traffic globally (example policy shapes below).
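
One possible shape for these policies, sketched in Python; the field names and numbers are illustrative, not a fixed schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class LimitPolicy:
    scope: str          # "user" or "ip"
    rate_per_min: int   # steady-state allowance
    burst: int          # extra requests tolerated in a short spike

# Illustrative values only; real policies would be per-tenant configuration.
POLICIES = [
    LimitPolicy(scope="user", rate_per_min=100, burst=20),
    LimitPolicy(scope="ip", rate_per_min=300, burst=50),
]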

Approaches

  • Token bucket in Redis via Lua scripts (atomic per key); or sliding window counters (sketched below).
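
For the sliding-window alternative, a minimal redis-py sketch; the client wiring and 60 s window are assumptions, and denied requests still increment the counter, which is acceptable for an estimate:

import time
import redis

r = redis.Redis()  # assumed local Redis; host/port are deployment-specific

def sliding_window_allow(key: str, limit: int, window_s: int = 60) -> bool:
    """Sliding-window counter: current fixed window plus the previous one,
    weighted by how much of it still overlaps the sliding window."""
    now = time.time()
    cur = int(now // window_s)
    pipe = r.pipeline()
    pipe.incr(f"{key}:{cur}")                  # count this request in the current window
    pipe.expire(f"{key}:{cur}", window_s * 2)  # let old windows age out
    pipe.get(f"{key}:{cur - 1}")
    cur_count, _, prev_raw = pipe.execute()
    prev_count = int(prev_raw or 0)
    overlap = 1.0 - (now % window_s) / window_s
    return prev_count * overlap + cur_count <= limit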

Architecture

Clients → Gateway → Limiter SDK (in-process) → sharded Redis/Memcache cluster → fallback to local estimators when the store is unreachable (check path sketched below).
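
A rough sketch of the SDK check path, with the store and fallback behind assumed interfaces; names like CentralStore and LimiterSDK are illustrative:

from typing import Protocol

class CentralStore(Protocol):
    def check(self, key: str) -> bool: ...   # e.g. the token-bucket Lua call against sharded Redis

class LocalEstimator(Protocol):
    def allow(self, key: str) -> bool: ...   # stricter in-process fallback

class LimiterSDK:
    """Gateway-embedded check path."""
    def __init__(self, store: CentralStore, fallback: LocalEstimator):
        self.store, self.fallback = store, fallback

    def allow(self, tenant: str, user: str) -> bool:
        key = f"{tenant}:{user}"
        try:
            return self.store.check(key)      # normal path: central, atomic decision
        except ConnectionError:
            return self.fallback.allow(key)   # store unreachable: degrade locally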

Data model

key = tenant:user:minute → fixed-window counter; or key = tenant:user → hash of (tokens, last_ts) for the token bucket (key helpers sketched below).
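
Key construction for both schemes as hypothetical helpers; the prefixing and field names are assumptions:

import time
from typing import Optional

def window_counter_key(tenant: str, user: str, now: Optional[float] = None) -> str:
    # Scheme A: one INCR counter per (identity, minute); expires shortly after the minute ends.
    minute = int((now if now is not None else time.time()) // 60)
    return f"{tenant}:{user}:{minute}"

def bucket_key(tenant: str, user: str) -> str:
    # Scheme B: one hash per identity with fields 't' (tokens) and 'ts' (last refill, ms).
    return f"{tenant}:{user}"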

Consistency

  • Prefer strong atomic per-key ops (Lua script) within a region; accept eventual consistency across regions by routing each key to its home region; shadow write to a secondary region for warm failover (sketch below).
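
One way the shadow write could look, assuming the per-key check is exposed as a callable per region; replaying the same check against the secondary is an assumption of this sketch:

from concurrent.futures import ThreadPoolExecutor

_shadow = ThreadPoolExecutor(max_workers=2)

def check_with_shadow(primary_check, secondary_check, key: str) -> bool:
    # primary_check / secondary_check: callables running the atomic per-key check
    # (e.g. the Lua script) against the home and secondary regions.
    allowed = primary_check(key)                    # the decision: strong, per key, home region
    _shadow.submit(_quietly, secondary_check, key)  # best-effort copy; eventual across regions
    return allowed

def _quietly(fn, key):
    try:
        fn(key)
    except Exception:
        pass  # the shadow path must never fail the caller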

SLOs

  • Limit-check P95 < 5 ms; availability 99.99% (degrade gracefully to stricter local limits during a Redis outage).

Capacity

  • 1M checks/s: shard keys across 10 Redis primaries (~100k ops/s each); pipeline operations to cut round trips (shard routing sketched below).
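
A minimal hash-routing sketch; the shard addresses are hypothetical:

import zlib

SHARD_ADDRS = [f"redis-primary-{i}.internal:6379" for i in range(10)]  # hypothetical addresses

def shard_for(key: str) -> str:
    # Stable hash routing: ~1M checks/s over 10 primaries ≈ 100k ops/s per primary.
    return SHARD_ADDRS[zlib.crc32(key.encode()) % len(SHARD_ADDRS)]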

Failure modes

  • Hot keys → split the logical key into sub-buckets (randomized suffix) and scale the limit per bucket, or use hierarchical keys; see the sketch after this list.
  • Region outage → fail open or fail closed, chosen per product policy.
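
A sketch of the hot-key bucketization above; the bucket count and suffix scheme are illustrative:

import random

def bucketized(base_key: str, limit: int, n_buckets: int = 8):
    # Spread one hot logical key across n sub-keys, each enforcing limit // n_buckets,
    # so no single Redis key (or shard) absorbs all of the traffic.
    sub_key = f"{base_key}:{random.randrange(n_buckets)}"
    return sub_key, max(1, limit // n_buckets)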

Lua token bucket (atomic Redis script)

-- Token-bucket check-and-consume; Redis runs the whole script atomically per call.
-- KEYS[1]=bucket key, ARGV[1]=now_ms, ARGV[2]=rate_per_s, ARGV[3]=burst
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local burst = tonumber(ARGV[3])
-- Missing fields mean a fresh bucket: start full, stamped at now.
local tokens = tonumber(redis.call('HGET', KEYS[1], 't') or burst)
local ts = tonumber(redis.call('HGET', KEYS[1], 'ts') or now)
-- Refill in proportion to elapsed time; guard against clock skew; cap at burst.
local elapsed = math.max(0, now - ts)
tokens = math.min(burst, tokens + elapsed * rate / 1000)
local allowed = tokens >= 1 and 1 or 0
if allowed == 1 then tokens = tokens - 1 end
-- HMSET is deprecated; HSET accepts multiple field/value pairs since Redis 4.0.
redis.call('HSET', KEYS[1], 't', tokens, 'ts', now)
redis.call('PEXPIRE', KEYS[1], 60000)  -- idle buckets expire after 60 s
return allowed
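
A minimal caller sketch using redis-py's register_script, assuming the script above is saved as token_bucket.lua; the filename, rate, and burst values are illustrative:

import time
import redis

r = redis.Redis()  # assumed connection to the shard owning this key
# Assumes the Lua script above is saved next to this file as token_bucket.lua.
with open("token_bucket.lua") as f:
    token_bucket = r.register_script(f.read())

def allow(tenant: str, user: str, rate_per_s: float = 100 / 60, burst: int = 20) -> bool:
    key = f"{tenant}:{user}"
    now_ms = int(time.time() * 1000)
    # The script returns 1 (allowed) or 0 (rejected).
    return token_bucket(keys=[key], args=[now_ms, rate_per_s, burst]) == 1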

Failover policy

  • Redis down: enforce a stricter local in-process leaky bucket (sketch below); log decisions to the audit trail; return to central enforcement once Redis is healthy.
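
A minimal in-process leaky bucket for the fallback path; the local rate and capacity are illustrative and deliberately stricter than the central limit, since every process enforces its own copy:

import threading
import time

class LocalLeakyBucket:
    """Stricter per-process fallback used while Redis is unreachable."""
    def __init__(self, rate_per_s: float = 0.5, capacity: float = 5.0):
        self.rate, self.capacity = rate_per_s, capacity
        self.level = 0.0
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            self.level = max(0.0, self.level - (now - self.last) * self.rate)  # drain
            self.last = now
            if self.level + 1 <= self.capacity:
                self.level += 1          # admit: one unit enters the bucket
                return True
            return False                 # bucket full: reject locally

Audit logging and re-syncing with the central store on recovery are left out of the sketch.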