System Design: Distributed Rate Limiter
Requirements
- Enforce per-user and per-IP limits (e.g., 100 req/min), handle bursts, keep check latency low, and work across regions.
Approaches
- Token bucket in Redis with Lua scripts (atomic); or sliding window counters.
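For the sliding-window-counter option, a minimal Python sketch using the redis-py client; the key layout, 100 req/min limit, and two-minute TTL are illustrative assumptions, not fixed by this design.

import time
import redis  # redis-py client, assumed

r = redis.Redis()

def allow_sliding_window(tenant, user, limit=100):
    # Approximate a rolling 60 s window from the current and previous
    # fixed 1-minute counters, weighting the previous one by its overlap.
    now = time.time()
    cur_min = int(now // 60)
    cur_key = f"{tenant}:{user}:{cur_min}"
    prev_key = f"{tenant}:{user}:{cur_min - 1}"
    pipe = r.pipeline()
    pipe.incr(cur_key)          # count this request in the current minute
    pipe.expire(cur_key, 120)   # keep at most two windows of counters
    pipe.get(prev_key)
    cur_count, _, prev_raw = pipe.execute()
    prev_count = int(prev_raw or 0)
    weight = 1.0 - (now % 60) / 60.0
    return prev_count * weight + cur_count <= limit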
Architecture
Clients → Gateway → Limiter SDK → Redis/Memcached cluster (sharded) → local in-process estimators as fallback.
Data model
key = tenant:user:minute → counters; or key = tenant:user → (tokens, last_ts).
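For concreteness, the two layouts as small Python helpers; the field names 't' and 'ts' match the Lua script below, everything else is illustrative.

def minute_counter_key(tenant, user, epoch_minute):
    # Sliding-window layout: one integer counter per (tenant, user, minute).
    return f"{tenant}:{user}:{epoch_minute}"

def bucket_key(tenant, user):
    # Token-bucket layout: one hash per (tenant, user) holding
    # fields 't' (tokens remaining) and 'ts' (last refill, ms).
    return f"{tenant}:{user}"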
Consistency
- Prefer strongly atomic per-key ops (a Lua script on a single shard); accept eventual consistency across regions by keeping each key in the user's home region; shadow-write to a secondary region asynchronously (sketch below).
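One way to realize this, as a hedged Python sketch: decide synchronously against the home-region Redis and mirror state to the secondary region best-effort; the hostnames, TTL, and the simple INCR-based check are assumptions.

from concurrent.futures import ThreadPoolExecutor
import redis

local_redis = redis.Redis(host="redis.home-region.internal")        # assumed endpoint
remote_redis = redis.Redis(host="redis.secondary-region.internal")  # assumed endpoint
shadow_pool = ThreadPoolExecutor(max_workers=4)

def incr_with_shadow(key):
    # Strong, atomic decision on the key's home shard...
    count = local_redis.incr(key)
    local_redis.expire(key, 120)
    # ...then a fire-and-forget copy to the secondary region; failures here
    # are tolerated, which is what eventual consistency across regions means.
    shadow_pool.submit(lambda: remote_redis.set(key, count, ex=120))
    return count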
SLOs
- P95 check latency < 5 ms; 99.99% availability (degrade gracefully to stricter local limits during a Redis outage).
Capacity
- 1M checks/s: shard across 10 Redis primaries (~100k checks/s each); pipeline operations to amortize round trips.
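A sketch of the key-to-shard mapping, assuming ten primaries and made-up hostnames; consistent hashing or Redis Cluster slots would serve the same purpose.

import hashlib
import redis

# Ten primaries, each sized for roughly 100k checks/s (hostnames illustrative).
SHARDS = [redis.Redis(host=f"redis-{i}.internal") for i in range(10)]

def shard_for(key):
    # Stable hash -> shard index, so a given tenant:user key always lands on
    # the same primary and the Lua script stays atomic on a single node.
    digest = hashlib.sha1(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]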
Failure modes
- Hot keys → split them into randomized sub-buckets (bucketize) or use hierarchical keys; see the sketch after this list.
- Region outage → fail open/closed per product policy.
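A sketch of the hot-key split referenced above: fan a hot key out over N sub-keys, each enforcing 1/N of the limit; the fan-out of 8 is an assumption.

import random

SUBBUCKETS = 8  # assumed fan-out for hot keys

def bucketized_key(base_key):
    # Spread writes for a hot key across sub-keys; enforce limit/SUBBUCKETS
    # per sub-bucket so the aggregate limit is approximately preserved.
    return f"{base_key}:{random.randrange(SUBBUCKETS)}"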
Lua script (token bucket)
-- KEYS[1]=key, ARGV[1]=now_ms, ARGV[2]=rate_per_s, ARGV[3]=burst
local now=tonumber(ARGV[1])
local rate=tonumber(ARGV[2])
local burst=tonumber(ARGV[3])
-- Current token count and last-refill timestamp; defaults for a new key.
local tokens=tonumber(redis.call('HGET', KEYS[1], 't') or burst)
local ts=tonumber(redis.call('HGET', KEYS[1], 'ts') or now)
-- Refill at rate_per_s tokens per second, capped at the burst size.
tokens=math.min(burst, tokens + (now-ts)*rate/1000)
local allowed=tokens>=1 and 1 or 0
if allowed==1 then tokens=tokens-1 end
-- Persist state; expire idle buckets after 60 s. HSET replaces the deprecated HMSET.
redis.call('HSET', KEYS[1], 't', tokens, 'ts', now)
redis.call('PEXPIRE', KEYS[1], 60000)
return allowed
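To call the script from the limiter SDK, one option is redis-py's register_script, which loads the script once and uses EVALSHA on later calls; the file name, rate, and burst values below are illustrative.

import time
import redis

r = redis.Redis()
# Load once; redis-py sends EVALSHA on subsequent invocations.
token_bucket = r.register_script(open("token_bucket.lua").read())

def allow(tenant, user, rate_per_s=100/60, burst=20):
    now_ms = int(time.time() * 1000)
    key = f"{tenant}:{user}"
    return token_bucket(keys=[key], args=[now_ms, rate_per_s, burst]) == 1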
Failover policy
- Redis down: enforce a stricter local in-process leaky bucket; write an audit log entry; switch back to the central limiter once Redis is healthy.
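A minimal sketch of that local fallback, assuming the stricter drain rate and capacity shown are placeholders and that the central check raises redis.ConnectionError when Redis is unreachable.

import threading
import time
import redis
from collections import defaultdict

class LocalLeakyBucket:
    # Stricter in-process leaky bucket (used as a meter) for Redis outages.
    def __init__(self, drain_per_s, capacity):
        self.drain, self.capacity = drain_per_s, capacity
        self.level, self.ts = 0.0, time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # The bucket drains at a constant rate; each admitted request adds one unit.
            self.level = max(0.0, self.level - (now - self.ts) * self.drain)
            self.ts = now
            if self.level + 1.0 <= self.capacity:
                self.level += 1.0
                return True
            return False

# Deliberately stricter than the central 100 req/min so a fleet of degraded
# limiters still stays under the global budget in aggregate (values assumed).
fallback = defaultdict(lambda: LocalLeakyBucket(drain_per_s=0.5, capacity=5))

def check(key, central_allow):
    try:
        return central_allow(key)        # e.g. the Lua token bucket above
    except redis.ConnectionError:
        # Degrade to the local bucket; real code would also emit an audit record.
        return fallback[key].allow()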