System Design: Twitter (Timeline, Fanout, Search, Trends)
Goal: newsfeed of short posts with follows, likes/retweets, media, search, and trends, at massive read/write scale with low latency.
Requirements
- Functional: post tweet (text/media), follow/unfollow, home timeline, user timeline, likes/retweets/replies, search, trends, notifications.
- Non‑functional: P95 read < 200 ms, write < 500 ms; heavy read skew (reads dominate writes); hot‑key protection; multi‑region DR.
High‑level architecture
Clients → API Gateway → (Auth, Rate‑Limit)
├─ Tweet Write Service → Tweets Store (Cassandra/S3) → Media Store (S3/CDN)
├─ Fanout Service (to caches/queues)
├─ Timeline Read Service → Redis/Timeline Cache → Tweets Store
├─ Graph Service (Follows) → Graph DB/Cache
├─ Engagements (Likes/RT) → Counters Store (Redis/CRDT + Cassandra)
└─ Search Indexer → Kafka → Elastic/OpenSearch
Data model
- Tweets: tweet_id (partition key), user_id, ts, text, media, counters in a wide‑column store; payload persisted to S3 for cold storage.
- Follows graph: adjacency lists in KV (sorted sets) + caching.
- Timelines: precomputed home timelines in Redis lists (fanout‑to‑cache), fallback to on‑read merge.
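A minimal sketch of the cache-side layout implied above, assuming Redis sorted sets for the follow graph and capped Redis lists for precomputed home timelines (key names are illustrative):

import time
import redis

r = redis.Redis()

def follow(follower_id: int, followee_id: int) -> None:
    # Adjacency lists as sorted sets, scored by follow time.
    now = time.time()
    r.zadd(f"followers:{followee_id}", {follower_id: now})
    r.zadd(f"following:{follower_id}", {followee_id: now})

def push_to_home(follower_id: int, tweet_id: int, max_len: int = 800) -> None:
    # Precomputed home timeline: newest-first list, capped to bound memory.
    key = f"home:{follower_id}"
    pipe = r.pipeline()
    pipe.lpush(key, tweet_id)
    pipe.ltrim(key, 0, max_len - 1)
    pipe.execute()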
Write path
1) Validate/auth, create tweet row (idempotency key), enqueue to Kafka.
2) Fanout: for users with small follower counts, push tweet_id into each follower’s home list (Redis). For super‑users, mark as pull‑on‑read.
3) Index into search; media to CDN.
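A hedged sketch of steps 1–2 on the write path, with in-memory stand-ins for the tweets table and the Kafka topic; the idempotency-key check is what makes client retries safe (helper and key names are illustrative):

import json
import random
import time

def create_tweet(store: dict, queue: list, user_id: int, text: str,
                 idempotency_key: str) -> int:
    # Idempotency: a retry with the same key returns the original tweet_id
    # instead of creating a duplicate row.
    existing = store.get(("idem", user_id, idempotency_key))
    if existing is not None:
        return existing

    tweet_id = (time.time_ns() << 16) | random.getrandbits(16)  # time-sortable id stand-in
    store[("tweet", tweet_id)] = {"user_id": user_id, "ts": time.time(), "text": text}
    store[("idem", user_id, idempotency_key)] = tweet_id

    # Enqueue for async fanout, search indexing, and media processing.
    queue.append(json.dumps({"tweet_id": tweet_id, "user_id": user_id}))
    return tweet_id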
Read path (Home timeline)
1) Read tweet_ids from home list (Redis). On cache miss or pull‑user, merge K sorted lists (followees) from Timeline Store.
2) Hydrate tweets from Tweets Store; batch get; enrich counters from Redis; return.
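A sketch of the pull-side fallback in step 1, assuming each followee's user timeline is available as a list of (ts, tweet_id) pairs already sorted newest-first; hydration in step 2 would then be a batched multiget against the Tweets Store:

import heapq
from collections.abc import Iterable
from itertools import islice

def merge_timelines(followee_timelines: Iterable[list[tuple[float, int]]],
                    page_size: int = 100) -> list[int]:
    # K-way merge of per-followee timelines; reverse=True keeps the merged
    # stream newest-first because each input is already sorted descending.
    merged = heapq.merge(*followee_timelines, reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, page_size)]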
Hot key and super‑user strategy
- Hybrid fanout: push for normal users, pull for super‑users; also use sharded queues and backpressure.
- Token buckets per user; coalesce duplicate writes; cache stampede protection with single‑flight.
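A minimal in-process single-flight sketch for the stampede protection mentioned above; across instances the same idea would use a short-lived lock or fetch lease in Redis (names are illustrative):

import threading
from typing import Callable

_cache: dict[str, object] = {}
_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def single_flight(key: str, loader: Callable[[], object]) -> object:
    # Only one concurrent caller recomputes a missing key; the rest wait
    # and then read the freshly cached value.
    val = _cache.get(key)
    if val is not None:
        return val
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        val = _cache.get(key)          # re-check after acquiring the lock
        if val is None:
            val = loader()
            _cache[key] = val
        return val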
Counters (likes/retweets)
- Redis sharded counters with periodic flush to Cassandra; CRDT (G‑Counter) across regions.
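A sketch of sharded counters with a read-side sum, assuming N Redis shards per tweet to spread a hot key; a periodic job would flush the summed value to Cassandra (key names are illustrative):

import random
import redis

r = redis.Redis()
SHARDS = 16

def incr_like(tweet_id: int) -> None:
    # Spread increments across shards so one viral tweet doesn't pin one key.
    r.incr(f"likes:{tweet_id}:{random.randrange(SHARDS)}")

def read_likes(tweet_id: int) -> int:
    # Sum all shards; the same sum is what the periodic flush persists.
    values = r.mget([f"likes:{tweet_id}:{s}" for s in range(SHARDS)])
    return sum(int(v) for v in values if v is not None)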
Search and trends
- Ingest to Kafka → consumers update inverted index (Elastic/OpenSearch) with fields (text, hashtags, user).
- Trends: streaming aggregations on hashtag counts per geo/window (Flink/Beam) with anomaly detection.
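A toy sliding-window hashtag counter to illustrate the trends aggregation; the real pipeline would be a Flink/Beam job keyed by geo and hashtag, and the window size here is illustrative:

import re
import time
from collections import Counter, deque
from typing import Optional

WINDOW_SEC = 600
_events: deque = deque()        # (timestamp, hashtag) pairs inside the window
_counts: Counter = Counter()

def observe(text: str, now: Optional[float] = None) -> None:
    now = now if now is not None else time.time()
    for tag in re.findall(r"#\w+", text.lower()):
        _events.append((now, tag))
        _counts[tag] += 1
    # Expire events that slid out of the window.
    while _events and _events[0][0] < now - WINDOW_SEC:
        _, old = _events.popleft()
        _counts[old] -= 1

def top_trends(k: int = 10) -> list:
    return [(t, c) for t, c in _counts.most_common(k) if c > 0]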
Moderation and abuse
- ML classifiers for spam/toxicity; shadow bans; rate‑limits; IP/device reputation; human review queue.
Multi‑region
- Active‑active reads; active‑passive writes for tweets; CRDT counters; asynchronous index replication.
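A minimal G‑Counter sketch for cross-region counts: each region increments only its own slot and merge is an element-wise max, so replicas converge regardless of delivery order (region names are illustrative):

from collections import defaultdict

class GCounter:
    """Grow-only counter CRDT: one slot per region, merge = element-wise max."""

    def __init__(self) -> None:
        self.slots = defaultdict(int)

    def incr(self, region: str, n: int = 1) -> None:
        self.slots[region] += n

    def value(self) -> int:
        return sum(self.slots.values())

    def merge(self, other: "GCounter") -> None:
        for region, count in other.slots.items():
            self.slots[region] = max(self.slots[region], count)

# us-east and eu-west replicas exchange state and converge on the same total.
a, b = GCounter(), GCounter()
a.incr("us-east", 3); b.incr("eu-west", 2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5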
APIs
POST /v1/tweets { text, media_ids[] }
GET /v1/timeline/home?cursor=...
GET /v1/timeline/{user}
POST /v1/tweets/{id}/like
Testing/observability
- Canary releases; tail latency alarms; cardinality‑aware dashboards; chaos on fanout queues.
Capacity & back‑of‑the‑envelope
- Assumptions: 200M MAU, 20M DAU, 2M peak QPS reads, 50k peak TPS writes.
- Average tweet 280 bytes + metadata → ~0.5–1 KB/tweet in KV (excl. media). Daily 50M tweets → ~50 GB/day (hot store), archive to S3.
- Home timeline: 100 tweets/page; Redis list node ~60 bytes → per user cache ~6 KB for head; for 20M DAU cached head ~120 GB across shards.
- Fanout workers: if avg fanout 200 followers/tweet, 50k TPS → 10M inserts/s naive; hybrid push/pull to keep inserts <2M/s, remainder pulled on read.
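The same numbers spelled out; every input is one of the stated assumptions:

# Back-of-the-envelope, using the assumptions above.
tweets_per_day = 50_000_000
bytes_per_tweet = 1_000                        # ~1 KB in KV, excluding media
print(tweets_per_day * bytes_per_tweet / 1e9)  # ~50 GB/day of hot storage

dau = 20_000_000
head_bytes = 100 * 60                          # 100 tweets/page x ~60 B per list node
print(dau * head_bytes / 1e9)                  # ~120 GB of cached timeline heads

write_tps = 50_000
avg_fanout = 200
print(write_tps * avg_fanout)                  # 10,000,000 naive inserts/s -> hybrid fanout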
SLOs & error budgets
- Home timeline P95 < 200 ms; Tweet publish ACK P95 < 500 ms; Availability 99.9%.
- Error budget resets every 30 days; protect it with circuit breakers on fanout and single‑flight caching.
Consistency & trade‑offs
- Eventual consistency for timelines and counters; strong consistency on tweet write path (idempotency key, single writer per tweet id).
- Read repairs during hydration; monotonic timeline per user using per‑followee cursors.
Bottlenecks & mitigations
- Super‑user hot keys: switch to pull, sample followers for push; cache pinning + request coalescing.
- Search indexing lag: prioritize fresh content pipeline; separate cold backfills.
Evolution plan
- Phase 1: push‑only for small scale; Phase 2: hybrid fanout with on‑read merge; Phase 3: multi‑region with CRDT counters and geo‑proximal reads.
Detailed APIs (sample)
POST /v1/tweets { text, media_ids[], reply_to?, audience? } -> { id }
GET /v1/timeline/home?cursor=... -> { items:[{tweet_id, user, text, media, counters}], next_cursor }
POST /v1/tweets/{id}/like -> 204
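A hedged client-side sketch of cursor pagination against the home timeline endpoint above; the base URL and bearer token are placeholders:

import requests

BASE = "https://api.example.com"               # placeholder
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder

def iter_home_timeline(max_pages: int = 5):
    # Walk GET /v1/timeline/home page by page until next_cursor runs out.
    cursor = None
    for _ in range(max_pages):
        params = {"cursor": cursor} if cursor else {}
        resp = requests.get(f"{BASE}/v1/timeline/home",
                            headers=HEADERS, params=params, timeout=5)
        resp.raise_for_status()
        body = resp.json()
        yield from body["items"]
        cursor = body.get("next_cursor")
        if not cursor:
            break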
Data model (DDL sketch)
CREATE TABLE tweets (
  tweet_id bigint PRIMARY KEY,
  user_id bigint NOT NULL,
  ts timestamp NOT NULL,
  text text,
  media jsonb,
  visibility smallint,
  attrs jsonb
);
CREATE TABLE timeline_items (
  user_id bigint,                      -- partition key: one partition per reader
  tweet_id bigint,                     -- newest-first reads when ids are time-sortable
  ts timestamp,
  PRIMARY KEY (user_id, tweet_id)
);
Capacity math (BoE)
- Home timeline reads at 2M QPS (P95 200 ms) → Redis cluster of ~14 shards at ~150k QPS each; cache size ~120 GB for timeline heads.
- Storage: 50M tweets/day @1 KB → 50 GB/day; 2‑year hot tier ~36 TB (before compression).
Consistency matrix
- Tweet create: strong (single writer, idempotency key)
- Timeline insert: eventual; reconcile on read with per‑followee cursors
- Counters: eventual with periodic flush to base store
Failure drill runbook
- Fanout backlog > threshold → switch heavy authors to pull, throttle enrichment, shed “who to follow”.
- Redis shard hot → enable request coalescing, migrate keys, temporarily reduce page size.
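A sketch of the first runbook step as an automated guard, assuming a fanout-backlog gauge and a per-author pull-mode flag; the thresholds are illustrative:

BACKLOG_HIGH = 5_000_000      # queued fanout inserts that trigger the switch (illustrative)
FOLLOWER_HEAVY = 100_000      # authors above this flip to pull-on-read first (illustrative)

def apply_backpressure(backlog: int, author_followers: dict, pull_mode: set) -> None:
    # When the fanout backlog crosses the threshold, stop pushing the heaviest
    # authors' tweets into follower home lists and serve them pull-on-read.
    if backlog <= BACKLOG_HIGH:
        return
    for author_id, followers in author_followers.items():
        if followers >= FOLLOWER_HEAVY:
            pull_mode.add(author_id)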
Testing plan
- Load tests with mixed read/write and super‑user spikes; chaos on fanout queue; cache stampede simulations.
What Interviewers Look For
Social Media Feed Skills
- Timeline Generation
  - Fan-out vs pull strategy
  - Hybrid approach for super-users
  - Cache optimization
  - Red Flags: Push-only, no pull strategy, cache stampedes
- Hot Key Handling
  - Super-user strategies
  - Request coalescing
  - Cache pinning
  - Red Flags: No hot key handling, cache stampedes, poor performance
- Counter Management
  - Likes/retweets counters
  - CRDT for multi-region
  - Periodic flush to DB
  - Red Flags: Inconsistent counters, no CRDT, race conditions
Distributed Systems Skills
- Read/Write Optimization
  - Read-heavy optimization
  - Write path efficiency
  - Red Flags: Poor read performance, slow writes, bottlenecks
- Multi-Region Design
  - Active-active reads
  - CRDT counters
  - Geo-proximal reads
  - Red Flags: Single region, no replication, high latency
- Search & Trends
  - Real-time indexing
  - Streaming aggregations
  - Red Flags: Slow indexing, no trends, poor search
Problem-Solving Approach
- Scale Challenges
  - Super-user fanout
  - Hot key protection
  - Cache optimization
  - Red Flags: No scale consideration, bottlenecks, poor performance
- Edge Cases
  - Fanout backlog
  - Cache stampedes
  - Search lag
  - Red Flags: Ignoring edge cases, no handling
- Trade-off Analysis
  - Consistency vs latency
  - Push vs pull
  - Red Flags: No trade-offs, dogmatic choices
System Design Skills
- Component Design
  - Tweet service
  - Fanout service
  - Timeline service
  - Red Flags: Monolithic, unclear boundaries
- Caching Strategy
  - Timeline caching
  - Counter caching
  - Cache stampede protection
  - Red Flags: No caching, poor strategy, stampedes
- Moderation & Abuse
  - ML classifiers
  - Rate limiting
  - Human review
  - Red Flags: No moderation, abuse issues, no safety
Communication Skills
- Feed Architecture Explanation
  - Can explain fan-out strategy
  - Understands hot key handling
  - Red Flags: No understanding, vague explanations
- Scale Explanation
  - Can explain scaling strategies
  - Understands bottlenecks
  - Red Flags: No understanding, vague explanations
Meta-Specific Focus
- Social Media Expertise
  - Feed generation knowledge
  - Scale understanding
  - Key: Show social media expertise
- Performance & Scale Balance
  - Low latency design
  - High throughput
  - Key: Demonstrate performance/scale balance