System Design: Metrics & Logging Pipeline
Requirements
- Ingest 10M log events/sec and 10M metric samples/sec; support retention, indexing, and alerting.
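A rough back-of-envelope sizing helps turn these rates into bandwidth and storage numbers. The sketch below takes only the two ingest rates from the requirements; the average event size, sample size, and compression ratio are illustrative assumptions, not figures from this design.

```python
# Back-of-envelope sizing for the stated ingest rates.
LOG_EVENTS_PER_SEC = 10_000_000      # from the requirements
METRIC_SAMPLES_PER_SEC = 10_000_000  # from the requirements

# ASSUMPTIONS (not from the requirements); replace with measured values.
AVG_LOG_BYTES = 500        # assumed average raw log event size
AVG_SAMPLE_BYTES = 16      # assumed raw sample: 8-byte timestamp + 8-byte float
LOG_COMPRESSION_RATIO = 5  # assumed on-disk compression for logs

SECONDS_PER_DAY = 86_400


def bytes_per_sec(events_per_sec: int, avg_bytes: int) -> int:
    """Raw ingest bandwidth in bytes per second."""
    return events_per_sec * avg_bytes


def daily_bytes(rate_bytes_per_sec: int, compression_ratio: float = 1.0) -> float:
    """Bytes stored per day after applying a compression ratio."""
    return rate_bytes_per_sec * SECONDS_PER_DAY / compression_ratio


log_rate = bytes_per_sec(LOG_EVENTS_PER_SEC, AVG_LOG_BYTES)
metric_rate = bytes_per_sec(METRIC_SAMPLES_PER_SEC, AVG_SAMPLE_BYTES)
log_tb_per_day = daily_bytes(log_rate, LOG_COMPRESSION_RATIO) / 1e12

print(f"log ingest:     {log_rate / 1e9:.1f} GB/s")
print(f"metric ingest:  {metric_rate / 1e6:.0f} MB/s")
print(f"log storage:    {log_tb_per_day:.1f} TB/day (compressed)")
```

Under these assumed sizes, logs dominate both bandwidth and storage, which is why retention and tiering decisions mostly target the log path.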
Architecture
Agents → Kafka (multi-tenant topics) → Stream processors (PII redaction, sampling) → storage:
- Logs: Elasticsearch/ClickHouse (hot) + S3 (cold)
- Metrics: Prometheus remote write → TSDB (Cortex/Mimir) + downsampling
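The stream-processor stage can be sketched as a pure function applied to each event. This is a minimal illustration, assuming events are dicts with hypothetical `level`, `trace_id`, and `message` fields; the email-only redaction pattern and the "always keep ERROR" policy are also assumptions, and a real redactor would cover many more PII types.

```python
import hashlib
import re
from typing import Optional

# Assumed PII pattern: email addresses only, for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(message: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", message)


def keep_event(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace_id always gets the same decision,
    so all events of one trace are kept or dropped together."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def process(event: dict, sample_rate: float = 0.1) -> Optional[dict]:
    """Return the redacted event, or None if it is sampled out.

    Assumed policy: ERROR events bypass sampling.
    """
    is_error = event.get("level") == "ERROR"
    if not is_error and not keep_event(event["trace_id"], sample_rate):
        return None
    return {**event, "message": redact_pii(event["message"])}
```

Hashing the trace ID rather than drawing a random number keeps the decision stable across processor instances, so a retried or replayed event is sampled the same way.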
SLOs
- 99% of logs are searchable in < 1 min; metrics scrape latency is < 10 s.
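To make the log SLO measurable, one option is to record, per event, the delay until it becomes searchable and compute the fraction under the threshold. The sketch below assumes such per-event latencies are available (for example from synthetic canary events); the function names are hypothetical.

```python
from typing import Sequence


def searchable_within(latencies_s: Sequence[float], threshold_s: float = 60.0) -> float:
    """Fraction of log events that became searchable in under threshold_s seconds."""
    if not latencies_s:
        return 1.0  # assumption: no traffic counts as meeting the SLO
    ok = sum(1 for latency in latencies_s if latency < threshold_s)
    return ok / len(latencies_s)


def slo_met(latencies_s: Sequence[float],
            threshold_s: float = 60.0,
            target: float = 0.99) -> bool:
    """True if at least `target` of events were searchable within the threshold."""
    return searchable_within(latencies_s, threshold_s) >= target
```

For example, 99 events at 5 s and one at 120 s gives a ratio of 0.99, which just meets the 99% target; a second slow event would breach it.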
Multi-region
- Local ingest + cross-region replication; query federation; cost-aware retention.
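Query federation can be sketched as a fan-out to each region followed by a merge of the time-ordered results. In this illustration each regional backend is a hypothetical callable that takes a query string and returns `(timestamp, message)` rows already sorted by timestamp; a real backend would be an HTTP or gRPC client with timeouts and partial-failure handling.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Row = Tuple[float, str]             # (timestamp, message) from one region
MergedRow = Tuple[float, str, str]  # (timestamp, region, message)


def federated_query(backends: Dict[str, Callable[[str], List[Row]]],
                    query: str,
                    limit: int) -> List[MergedRow]:
    """Fan the query out to every region in parallel, then merge by timestamp."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {region: pool.submit(search, query)
                   for region, search in backends.items()}
        per_region = [
            [(ts, region, msg) for ts, msg in future.result()]
            for region, future in futures.items()
        ]
    # heapq.merge lazily merges already-sorted inputs, so only `limit`
    # rows are materialized in the final result.
    merged = heapq.merge(*per_region)
    return list(itertools.islice(merged, limit))
```

Usage with two stubbed regions:

```python
backends = {
    "us": lambda q: [(1.0, "a"), (4.0, "d")],
    "eu": lambda q: [(2.0, "b"), (3.0, "c")],
}
print(federated_query(backends, "error", limit=3))
```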
Failure modes
- Backpressure to agents (buffer + sampling); hot shards → rebalancing; index throttling.
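The "buffer + sampling" response to backpressure can be sketched as a bounded agent-side queue that starts sampling once it passes a high watermark and drops outright when full. The class name, watermark, and sample rate below are illustrative assumptions, not part of any particular agent.

```python
import random
from collections import deque
from typing import Any, List, Optional


class AgentBuffer:
    """Bounded buffer that degrades to sampling when the downstream is slow."""

    def __init__(self, capacity: int, high_watermark: float = 0.8,
                 sample_rate: float = 0.1,
                 rng: Optional[random.Random] = None) -> None:
        self.capacity = capacity
        self.high_watermark = high_watermark  # fill ratio where sampling starts
        self.sample_rate = sample_rate        # keep probability above the watermark
        self.rng = rng if rng is not None else random.Random()
        self.queue: deque = deque()
        self.dropped = 0

    def offer(self, event: Any) -> bool:
        """Try to enqueue an event; return False if it was dropped."""
        size = len(self.queue)
        if size >= self.capacity:
            self.dropped += 1  # buffer full: hard drop
            return False
        above_watermark = size / self.capacity >= self.high_watermark
        if above_watermark and self.rng.random() >= self.sample_rate:
            self.dropped += 1  # under pressure: sampled out
            return False
        self.queue.append(event)
        return True

    def drain(self, max_batch: int) -> List[Any]:
        """Remove and return up to max_batch events in arrival order."""
        batch = []
        while self.queue and len(batch) < max_batch:
            batch.append(self.queue.popleft())
        return batch
```

Counting drops explicitly matters: the `dropped` counter is itself a metric worth exporting, so that data loss under pressure is visible rather than silent.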
What Interviewers Look For
Observability Systems Skills
- High-Throughput Ingestion
  - 10M events/sec logs
  - 10M samples/sec metrics
  - Kafka for buffering
  - Red Flags: Low throughput, bottlenecks, poor ingestion
- Stream Processing
  - PII redaction
  - Sampling strategies
  - Multi-tenant support
  - Red Flags: No processing, no sampling, security issues
- Storage & Retention
  - Hot storage (Elasticsearch/ClickHouse)
  - Cold storage (S3)
  - Retention policies
  - Red Flags: No retention, high costs, poor storage
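The hot/cold split and retention policy can be expressed as a simple age-based rule. The sketch below is one possible policy; the 7-day and 90-day boundaries are assumed values for illustration, and in practice they would be set per tenant or per log class.

```python
from datetime import datetime, timedelta, timezone

# ASSUMED retention boundaries; not specified by the design.
HOT_DAYS = 7     # searchable in Elasticsearch/ClickHouse
COLD_DAYS = 90   # archived in S3, then deleted


def storage_tier(event_time: datetime, now: datetime,
                 hot_days: int = HOT_DAYS, cold_days: int = COLD_DAYS) -> str:
    """Return 'hot', 'cold', or 'expired' based on the event's age."""
    age = now - event_time
    if age <= timedelta(days=hot_days):
        return "hot"
    if age <= timedelta(days=cold_days):
        return "cold"
    return "expired"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=1), now))
print(storage_tier(now - timedelta(days=30), now))
print(storage_tier(now - timedelta(days=120), now))
```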
Distributed Systems Skills
- Multi-Region Design
  - Local ingest
  - Cross-region replication
  - Query federation
  - Red Flags: Single region, no replication, poor queries
- Scalability Design
  - Horizontal scaling
  - Sharding strategy
  - Load balancing
  - Red Flags: Vertical scaling, no sharding, bottlenecks
- Cost Optimization
  - Downsampling
  - Retention policies
  - Storage tiers
  - Red Flags: No optimization, high costs, inefficient
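Downsampling, listed above as a cost lever, can be illustrated by rolling raw samples up into fixed-width time buckets. This is a minimal sketch assuming samples are `(unix_timestamp, value)` pairs and a 5-minute step; keeping both the average and the maximum per bucket is an assumed choice that preserves spikes an average alone would hide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample(samples: Iterable[Tuple[float, float]],
               step_s: int = 300) -> List[Tuple[int, float, float]]:
    """Roll samples up into step_s-wide buckets.

    Returns (bucket_start, average, maximum) tuples sorted by bucket_start.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        bucket_start = int(ts) // step_s * step_s
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values), max(values))
        for start, values in sorted(buckets.items())
    ]


raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 10.0)]
print(downsample(raw))
```

With a 10 s raw interval and a 300 s step, each bucket replaces about 30 samples with one row, at the cost of losing sub-bucket detail.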
Problem-Solving Approach
- Failure Handling
  - Backpressure management
  - Hot shard rebalancing
  - Index throttling
  - Red Flags: Ignoring failures, no handling, poor recovery
- Edge Cases
  - Traffic spikes
  - Storage capacity
  - Query performance
  - Red Flags: Ignoring edge cases, no handling
- Trade-off Analysis
  - Cost vs retention
  - Latency vs accuracy
  - Red Flags: No trade-offs, dogmatic choices
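Hot-shard handling usually has two parts: detecting the skew and spreading the offending keys. The sketch below shows one simple approach under stated assumptions: a shard is "hot" if its load exceeds a multiple of the mean (the factor of 2 is arbitrary), and hot tenants are spread by appending a salt to the partition key. The function names and the salt count are hypothetical.

```python
import zlib
from typing import Dict, List, Set


def find_hot_shards(load_by_shard: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return shards whose load exceeds factor x the mean load, sorted by name."""
    if not load_by_shard:
        return []
    mean = sum(load_by_shard.values()) / len(load_by_shard)
    return sorted(s for s, load in load_by_shard.items() if load > factor * mean)


def partition_key(tenant_id: str, event_id: str,
                  hot_tenants: Set[str], salts: int = 8) -> str:
    """Spread a hot tenant over `salts` sub-keys; other tenants keep one key."""
    if tenant_id in hot_tenants:
        salt = zlib.crc32(event_id.encode("utf-8")) % salts
        return f"{tenant_id}#{salt}"
    return tenant_id


print(find_hot_shards({"s0": 100, "s1": 120, "s2": 900, "s3": 80}))
```

Salting trades write balance for read cost: queries for a salted tenant must fan out across all of its sub-keys, which is one concrete instance of the trade-off analysis listed above.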
System Design Skills
- Component Design
  - Ingestion service
  - Stream processors
  - Storage services
  - Red Flags: Monolithic, unclear boundaries
- Data Pipeline
  - Kafka → Processors → Storage
  - Clear data flow
  - Red Flags: No pipeline, unclear flow
- Query Design
  - Search optimization
  - Indexing strategy
  - Red Flags: Slow queries, missing indexes, poor performance
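One common indexing strategy for logs is time-partitioned indices, so that a query only touches the partitions overlapping its time range and old data is dropped by deleting whole indices. The sketch below assumes a hypothetical daily naming scheme of `logs-<tenant>-<YYYY-MM-DD>`; the scheme is illustrative and not taken from this design.

```python
from datetime import date, timedelta
from typing import List


def indices_for_range(tenant: str, start: date, end: date,
                      prefix: str = "logs") -> List[str]:
    """Names of the daily indices covering [start, end], inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [
        f"{prefix}-{tenant}-{(start + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]


print(indices_for_range("acme", date(2024, 1, 30), date(2024, 2, 1)))
```

Including the tenant in the index name also gives a natural isolation boundary for multi-tenant retention, at the cost of more (smaller) indices to manage.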
Communication Skills
- Pipeline Explanation
  - Can explain ingestion flow
  - Understands processing
  - Red Flags: No understanding, vague explanations
- Storage Explanation
  - Can explain hot/cold storage
  - Understands retention
  - Red Flags: No understanding, vague
Meta-Specific Focus
- Observability Expertise
  - Metrics/logging knowledge
  - High-throughput design
  - Key: Show observability expertise
- Cost & Scale Balance
  - Cost optimization
  - High throughput
  - Key: Demonstrate cost/scale balance