System Design: Chat/Messaging
Requirements
- One-to-one and group chats, delivery/read receipts, presence, search, attachments, optional end-to-end (E2E) encryption.
Architecture
Clients → Gateway (auth) → WebSocket fanout + pub/sub (Kafka) → message store (Cassandra); search index (Elasticsearch) updated asynchronously; attachments in S3 served via CDN
Ordering/IDs
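- Per-conversation monotonic IDs via time+shard (Snowflake-style) or a per-partition sequence; clients sort by msg_id to resolve out-of-order arrival. Snowflake-style IDs are only roughly time-ordered across generators, so strict ordering needs a single sequence per conversation.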
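The time+shard option can be sketched as below. This is a minimal illustration, not a reference implementation: the bit widths, the `EPOCH_MS` value, and the `SnowflakeGenerator` name are assumptions chosen for the example, and the generator only guarantees increasing IDs within one process.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch (assumption)
SHARD_BITS = 10               # assumed layout: 10 bits shard, 12 bits sequence
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    """Generates time-sortable 64-bit-style IDs: timestamp | shard | sequence."""

    def __init__(self, shard_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= shard_id < (1 << SHARD_BITS):
            raise ValueError("shard_id out of range")
        self._shard_id = shard_id
        self._clock = clock  # injectable so tests can freeze time
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            # If the wall clock stalls or steps back, reuse the last timestamp
            # so IDs from this generator never decrease.
            if now < self._last_ms:
                now = self._last_ms
            if now == self._last_ms:
                self._seq += 1
                if self._seq > MAX_SEQ:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now += 1
                    self._seq = 0
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (SHARD_BITS + SEQ_BITS))
                | (self._shard_id << SEQ_BITS)
                | self._seq
            )
```

Routing every message of a conversation to the same shard makes these IDs a usable per-conversation order; IDs minted by different shards in the same millisecond compare by shard number, not by arrival.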
Data model
conversations(id, members, type)
messages(conv_id, msg_id, sender, ts, body, status)
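As a rough sketch, the two records above might map to typed structures like the following. The enum values (`ConvType`, `MsgStatus`) and the choice of an integer `ts` in milliseconds are assumptions for illustration; the notes only name the fields.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ConvType(Enum):  # assumed values; the notes only say "type"
    DIRECT = "direct"
    GROUP = "group"


class MsgStatus(Enum):  # assumed values; the notes only say "status"
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"


@dataclass(frozen=True)
class Conversation:
    id: str
    members: frozenset[str]
    type: ConvType


@dataclass(frozen=True)
class Message:
    conv_id: str
    msg_id: int
    sender: str
    ts: int  # assumed: epoch milliseconds
    body: str
    status: MsgStatus = MsgStatus.SENT

    @property
    def key(self):
        # (conv_id, msg_id) identifies a message; in a wide-column store,
        # conv_id would be a natural partition key and msg_id the sort key.
        return (self.conv_id, self.msg_id)
```

Sorting messages by `key` groups them by conversation and orders them by `msg_id` within each one.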
SLOs
- Send ACK P95 < 200 ms; end-to-end delivery < 1 s; presence convergence < 2 s.
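To make the P95 target concrete, here is one way to check it against a batch of measured latencies. It uses the nearest-rank percentile definition, which is an assumption; monitoring systems often interpolate or use histograms instead. The function names are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_send_ack_slo(latencies_ms, threshold_ms=200):
    """True when the P95 send-ACK latency is under the threshold."""
    return percentile(latencies_ms, 95) < threshold_ms
```

With 100 samples, the P95 is the 95th smallest value, so up to five slow outliers can exceed 200 ms without breaching the SLO.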
Consistency
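- At-least-once delivery over pub/sub; message writes are idempotent, keyed by (conv_id, msg_id), so redelivered messages do not create duplicates.
- Read-your-writes via sticky reads (a client's reads go to the replica that accepted its writes).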
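The idempotent-write rule can be illustrated with an in-memory stand-in for the message store. `MessageStore` is a hypothetical class for this sketch; a real store would enforce the same rule through its primary key or a conditional write.

```python
class MessageStore:
    """In-memory stand-in for a store keyed by (conv_id, msg_id)."""

    def __init__(self):
        self._rows = {}

    def put(self, conv_id, msg_id, body):
        """Store a message once. Returns True if newly stored, False for a duplicate."""
        key = (conv_id, msg_id)
        if key in self._rows:
            return False  # redelivery from the at-least-once pipeline: ignore
        self._rows[key] = body
        return True

    def history(self, conv_id):
        """Message bodies for one conversation, ordered by msg_id."""
        keys = sorted(k for k in self._rows if k[0] == conv_id)
        return [self._rows[k] for k in keys]
```

Because the consumer can safely apply the same message twice, the pub/sub layer is free to retry on timeouts without coordinating with the store.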
Failure modes
- Hot group → split shards, partial fanout. High load → degrade typing indicators.
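The notes do not define partial fanout, so the sketch below shows one plausible reading: push only to members who are currently connected and let the rest pull history on reconnect, while dropping typing events once load passes a threshold. The `plan_fanout` name, the `load` scale (0.0 to 1.0), and the 0.8 threshold are all assumptions.

```python
def plan_fanout(members, online, event_type, load, shed_threshold=0.8):
    """Return the set of members to push an event to right now.

    members: all member IDs of the conversation
    online: member IDs with a live connection
    event_type: "message" or "typing"
    load: current utilisation estimate between 0.0 and 1.0 (assumed metric)
    """
    # Typing indicators are best-effort: shed them first when overloaded.
    if event_type == "typing" and load >= shed_threshold:
        return set()
    # Partial fanout: push only to connected members; offline members
    # catch up from the message store when they reconnect.
    return set(members) & set(online)
```

Messages are never dropped by this planner; only the ephemeral typing signal is sacrificed under load.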
Detailed APIs
POST /v1/messages { conv_id, body, attachments? } -> { msg_id }
GET /v1/conversations/{id}/history?cursor=...
POST /v1/receipts { conv_id, msg_id, type: "delivered" | "read" }
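The `cursor` parameter on the history endpoint is not specified further, so the following is only a sketch of one common approach: an opaque cursor that encodes the last `msg_id` returned. The response shape (`messages`, `next_cursor`), the ascending page order, and the helper names are assumptions.

```python
import base64
import json


def encode_cursor(last_msg_id):
    """Pack the last seen msg_id into an opaque, URL-safe token."""
    raw = json.dumps({"after": last_msg_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    """Recover the msg_id stored in a cursor token."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def history_page(messages, cursor=None, limit=50):
    """Return one page of history.

    messages: list of dicts with a "msg_id" key, sorted ascending by msg_id.
    """
    after = decode_cursor(cursor) if cursor else None
    page = [m for m in messages if after is None or m["msg_id"] > after][:limit]
    has_more = bool(page) and page[-1]["msg_id"] < messages[-1]["msg_id"]
    next_cursor = encode_cursor(page[-1]["msg_id"]) if has_more else None
    return {"messages": page, "next_cursor": next_cursor}
```

Keying the cursor on `msg_id` rather than an offset keeps pages stable when new messages arrive between requests.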
Retention & search
- Retention policies per workspace; legal hold; search indexes updated async with privacy filters.
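A retention sweep that respects legal hold might look like the sketch below. The `workspace_id` field on messages, the policy expressed in days, and holds tracked per conversation are assumptions made for the example; the notes do not say at what granularity holds apply.

```python
SECONDS_PER_DAY = 86_400


def select_purgeable(messages, policies, legal_holds, now):
    """Pick messages that may be deleted.

    messages: dicts with "workspace_id", "conv_id", and "ts" (epoch seconds)
    policies: workspace_id -> retention period in days
    legal_holds: set of conv_ids that must not be purged
    now: current time in epoch seconds
    """
    purgeable = []
    for m in messages:
        days = policies.get(m["workspace_id"])
        if days is None:
            continue  # no policy configured: keep indefinitely
        if m["conv_id"] in legal_holds:
            continue  # legal hold overrides retention expiry
        if m["ts"] < now - days * SECONDS_PER_DAY:
            purgeable.append(m)
    return purgeable
```

Anything purged from the message store would also need removing from the search index, which is updated asynchronously and can lag behind.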
Test plan
- WebSocket connection longevity on mobile networks; presence convergence; ordered delivery under network partition.
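The ordered-delivery case can be exercised with a small simulation: deliver messages shuffled and with duplicates, then check that the client view still shows each message once and in order. `ConversationView` and `simulate_flaky_delivery` are hypothetical helpers for this sketch and stand in for a real client and a real fault-injection harness.

```python
import random


class ConversationView:
    """Client-side view that dedupes by msg_id and renders in msg_id order."""

    def __init__(self):
        self._by_id = {}

    def receive(self, msg_id, body):
        # The first copy wins; later duplicates are ignored.
        self._by_id.setdefault(msg_id, body)

    def ordered(self):
        return [self._by_id[k] for k in sorted(self._by_id)]


def simulate_flaky_delivery(sent, seed=0):
    """Return the sent messages shuffled, with about half of them duplicated."""
    rng = random.Random(seed)
    delivered = sent + rng.sample(sent, k=len(sent) // 2)
    rng.shuffle(delivered)
    return delivered
```

A test would feed the simulated stream into the view and compare `ordered()` with the original send order, repeating over several seeds.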
What Interviewers Look For
Real-Time Messaging Skills
- WebSocket Architecture
  - Connection management
  - Message fanout
  - Pub/Sub integration
  - Red Flags: Polling, high latency, no real-time
- Message Ordering
  - Per-conversation ordering
  - Monotonic IDs
  - Client-side resolution
  - Red Flags: No ordering, out-of-order messages, poor UX
- Delivery Guarantees
  - At-least-once delivery
  - Read receipts
  - Delivery receipts
  - Red Flags: Message loss, no receipts, unreliable
Distributed Systems Skills
- Presence System
  - Online/offline status
  - Convergence time < 2 s
  - Red Flags: Slow presence, inaccurate status, poor UX
- Message Storage
  - Cassandra for messages
  - Search indexing
  - Retention policies
  - Red Flags: No storage, slow search, no retention
- Scalability Design
  - Horizontal scaling
  - Sharding strategy
  - Red Flags: Vertical scaling, no sharding, bottlenecks
Problem-Solving Approach
- Hot Group Handling
  - Shard splitting
  - Partial fanout
  - Red Flags: No hot group handling, bottlenecks, poor performance
- Edge Cases
  - Network partitions
  - Connection failures
  - Message duplicates
  - Red Flags: Ignoring edge cases, no handling
- Trade-off Analysis
  - Consistency vs latency
  - Ordering vs performance
  - Red Flags: No trade-offs, dogmatic choices
System Design Skills
- Component Design
  - Message service
  - Presence service
  - Search service
  - Red Flags: Monolithic, unclear boundaries
- Security Design
  - E2E encryption (optional)
  - Authentication
  - Authorization
  - Red Flags: No security, insecure, vulnerabilities
- Attachments Handling
  - S3 storage
  - CDN delivery
  - Red Flags: No attachments, slow delivery, high costs
Communication Skills
- Messaging Architecture Explanation
  - Can explain WebSocket design
  - Understands ordering
  - Red Flags: No understanding, vague explanations
- Scale Explanation
  - Can explain scaling strategies
  - Understands bottlenecks
  - Red Flags: No understanding, vague explanations
Meta-Specific Focus
- Real-Time Systems Expertise
  - WebSocket knowledge
  - Low-latency design
  - Key: Show real-time systems expertise
- Reliability Focus
  - Message delivery guarantees
  - Presence accuracy
  - Key: Demonstrate reliability focus