System Design: Metrics & Logging Pipeline

Requirements

Ingest 10M log events/sec and 10M metric samples/sec; support retention, indexing, and alerting.
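
The two ingest rates can be turned into rough bandwidth and storage numbers. The sketch below is a back-of-envelope estimate only: the rates come from the requirements, while the per-item sizes (`AVG_LOG_BYTES`, `AVG_SAMPLE_BYTES`) are assumptions chosen for illustration and should be replaced with measured values.

```python
# Back-of-envelope ingest sizing.
LOG_EVENTS_PER_SEC = 10_000_000       # from the requirements
METRIC_SAMPLES_PER_SEC = 10_000_000   # from the requirements
AVG_LOG_BYTES = 500                   # ASSUMPTION: average raw log event size
AVG_SAMPLE_BYTES = 2                  # ASSUMPTION: compressed bytes per metric sample
SECONDS_PER_DAY = 86_400


def gb_per_sec(rate_per_sec: int, item_bytes: int) -> float:
    """Sustained ingest bandwidth in GB/s (decimal gigabytes)."""
    return rate_per_sec * item_bytes / 1e9


def tb_per_day(rate_per_sec: int, item_bytes: int) -> float:
    """Daily volume in TB (decimal terabytes), before replication."""
    return rate_per_sec * item_bytes * SECONDS_PER_DAY / 1e12


log_gb_per_sec = gb_per_sec(LOG_EVENTS_PER_SEC, AVG_LOG_BYTES)
log_tb_per_day = tb_per_day(LOG_EVENTS_PER_SEC, AVG_LOG_BYTES)
metric_tb_per_day = tb_per_day(METRIC_SAMPLES_PER_SEC, AVG_SAMPLE_BYTES)

print(f"logs: {log_gb_per_sec:.1f} GB/s, {log_tb_per_day:.0f} TB/day")
print(f"metrics: {metric_tb_per_day:.3f} TB/day")
```

Under these assumed sizes, logs dominate the storage bill by two orders of magnitude, which is why sampling and cold tiering are aimed at logs first.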

Architecture

Agents → Kafka (multi-tenant topics) → Stream processors (PII redaction, sampling) → storage:
- Logs: Elasticsearch/ClickHouse (hot) + S3 (cold)
- Metrics: Prometheus remote write → TSDB (Cortex/Mimir) + downsampling
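
The stream-processor stage (PII redaction plus sampling) can be sketched as a pure function over one event. This is a minimal illustration, not a production redactor: the event shape (`tenant_id`, `event_id`, `message`), the email-only regex, and the per-tenant `sample_rates` map are all assumptions made for the example.

```python
import hashlib
import re

# ASSUMPTION: only email addresses are treated as PII here; a real pipeline
# would cover more patterns (tokens, phone numbers, card numbers, ...).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(message: str) -> str:
    """Replace every email address in the message with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", message)


def keep_event(tenant_id: str, event_id: str, sample_rate: float) -> bool:
    """Deterministic hash-based sampling: the same event always gets the
    same decision, so retries and replays do not change what is kept."""
    digest = hashlib.sha256(f"{tenant_id}:{event_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def process(event: dict, sample_rates: dict, default_rate: float = 1.0):
    """Return the redacted event, or None if it was sampled out."""
    rate = sample_rates.get(event["tenant_id"], default_rate)
    if not keep_event(event["tenant_id"], event["event_id"], rate):
        return None
    return {**event, "message": redact_pii(event["message"])}
```

Sampling before redaction avoids spending regex time on events that will be dropped anyway.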

SLOs

99% of logs searchable within 1 min of ingestion; metrics scrape latency < 10 s.
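
The log SLO can be checked against measured ingest-to-searchable latencies. The sketch below assumes those latencies are already collected as a list of seconds; the function names are hypothetical.

```python
def slo_compliance(latencies_s, threshold_s=60.0):
    """Fraction of events that became searchable within the threshold."""
    if not latencies_s:
        return 1.0  # no traffic means nothing violated the SLO
    within = sum(1 for latency in latencies_s if latency < threshold_s)
    return within / len(latencies_s)


def meets_slo(latencies_s, target=0.99, threshold_s=60.0):
    """True when at least `target` of events were searchable in time."""
    return slo_compliance(latencies_s, threshold_s) >= target
```

For example, 99 events at 5 s and one at 120 s gives a compliance of 0.99, which just meets the target; a second slow event would breach it.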

Multi-region

Local ingest + cross-region replication; query federation; cost-aware retention.
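
Query federation can be sketched as a fan-out to each region followed by a merge. This is a simplified illustration: each regional backend is assumed to be a callable taking `(query, limit)` and returning `(timestamp, line)` hits sorted newest first; that interface is hypothetical, not any real product's API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def federated_search(regional_backends, query, limit):
    """Query every region in parallel and return the newest `limit` hits.

    regional_backends: dict of region name -> callable(query, limit)
    Each callable must return hits sorted by timestamp, newest first.
    """
    if not regional_backends:
        return []
    with ThreadPoolExecutor(max_workers=len(regional_backends)) as pool:
        futures = [pool.submit(search, query, limit)
                   for search in regional_backends.values()]
        per_region = [future.result() for future in futures]
    # Each region's list is already sorted, so a k-way merge is enough.
    merged = heapq.merge(*per_region, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

Asking every region for `limit` hits is what makes the merged top `limit` correct even if all the newest hits live in one region.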

Failure modes

Backpressure to agents (buffer + sampling); hot shards → rebalancing; index throttling.
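
Agent-side backpressure (buffer + sampling) can be sketched as a bounded queue that starts sampling once it passes a high watermark and drops outright when full. The class name, watermark, and sample rate are assumptions for illustration.

```python
import random
from collections import deque


class AgentBuffer:
    """Bounded in-memory buffer that degrades to sampling under pressure."""

    def __init__(self, capacity, high_watermark=0.8,
                 pressure_sample_rate=0.1, rng=None):
        self.queue = deque()
        self.capacity = capacity
        self.high_watermark = high_watermark
        self.pressure_sample_rate = pressure_sample_rate
        self.rng = rng or random.Random()
        self.dropped = 0  # exported as a metric so drops stay visible

    def offer(self, event) -> bool:
        """Try to enqueue an event; return False if it was dropped."""
        fill = len(self.queue) / self.capacity
        if fill >= 1.0:
            self.dropped += 1          # full: hard drop
            return False
        if (fill >= self.high_watermark
                and self.rng.random() >= self.pressure_sample_rate):
            self.dropped += 1          # under pressure: keep only a sample
            return False
        self.queue.append(event)
        return True

    def drain(self, max_events):
        """Remove and return up to max_events in FIFO order for shipping."""
        batch = []
        while self.queue and len(batch) < max_events:
            batch.append(self.queue.popleft())
        return batch
```

Counting drops matters as much as the dropping itself: without the counter, sampled-away data looks like a quiet system.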

What Interviewers Look For

Observability Systems Skills

High-Throughput Ingestion
- 10M events/sec logs
- 10M samples/sec metrics
- Kafka for buffering
- Red Flags: Low throughput, bottlenecks, poor ingestion
Stream Processing
- PII redaction
- Sampling strategies
- Multi-tenant support
- Red Flags: No processing, no sampling, security issues
Storage & Retention
- Hot storage (Elasticsearch/ClickHouse)
- Cold storage (S3)
- Retention policies
- Red Flags: No retention, high costs, poor storage
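
A retention policy over hot and cold tiers can be expressed as a simple age check. The windows below (7 days hot, 90 days total) are assumptions for illustration; real values depend on query patterns, compliance, and budget.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=7)      # ASSUMPTION: searchable in the hot store
TOTAL_RETENTION = timedelta(days=90)   # ASSUMPTION: kept in S3 until this age


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Return where data of this age should live: hot, cold, or expired."""
    age = now - event_time
    if age < HOT_RETENTION:
        return "hot"
    if age < TOTAL_RETENTION:
        return "cold"
    return "expired"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=1), now))
print(storage_tier(now - timedelta(days=30), now))
print(storage_tier(now - timedelta(days=120), now))
```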

Distributed Systems Skills

Multi-Region Design
- Local ingest
- Cross-region replication
- Query federation
- Red Flags: Single region, no replication, poor queries
Scalability Design
- Horizontal scaling
- Sharding strategy
- Load balancing
- Red Flags: Vertical scaling, no sharding, bottlenecks
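
One way to sketch a sharding strategy that tolerates hot tenants is key salting: ordinary tenants hash to a single partition, while tenants flagged as hot are spread over a small fan-out. The `hot_tenants` map and the fan-out values are hypothetical and would normally come from traffic monitoring.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash that is stable across processes (unlike built-in hash())."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def partition_for(tenant_id, event_id, num_partitions, hot_tenants=None):
    """Pick a partition; hot tenants are salted across several partitions.

    hot_tenants: dict of tenant_id -> fan-out (number of salted sub-keys).
    """
    fanout = (hot_tenants or {}).get(tenant_id, 1)
    salt = stable_hash(event_id) % fanout
    return stable_hash(f"{tenant_id}#{salt}") % num_partitions
```

The trade-off is that a salted tenant loses per-tenant ordering and its queries must read from up to `fanout` partitions.
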
Cost Optimization
- Downsampling
- Retention policies
- Storage tiers
- Red Flags: No optimization, high costs, inefficient storage use
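
Downsampling can be illustrated by rolling raw samples up into fixed-width time buckets. This is a simplified sketch, not how Cortex or Mimir implement it; the `(timestamp_seconds, value)` input shape and the choice of avg/min/max aggregates are assumptions.

```python
from collections import defaultdict


def downsample(samples, step_s):
    """Aggregate (timestamp_seconds, value) samples into step_s-wide buckets.

    Returns a list of (bucket_start, avg, min, max), ordered by time.
    Keeping min and max alongside the average preserves spikes that an
    average alone would smooth away.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % step_s].append(value)
    return [
        (start, sum(values) / len(values), min(values), max(values))
        for start, values in sorted(buckets.items())
    ]


raw = [(0, 1.0), (15, 3.0), (299, 2.0), (300, 10.0)]
print(downsample(raw, step_s=300))
```

With a 300 s step, the four raw samples above collapse into two rows, which is the latency vs accuracy trade-off in miniature: fewer points to store and scan, coarser resolution for old data.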

Problem-Solving Approach

Failure Handling
- Backpressure management
- Hot shard rebalancing
- Index throttling
- Red Flags: Ignoring failures, no handling, poor recovery
Edge Cases
- Traffic spikes
- Storage capacity
- Query performance
- Red Flags: Ignoring edge cases, no handling
Trade-off Analysis
- Cost vs retention
- Latency vs accuracy
- Red Flags: No trade-offs, dogmatic choices

System Design Skills

Component Design
- Ingestion service
- Stream processors
- Storage services
- Red Flags: Monolithic, unclear boundaries
Data Pipeline
- Kafka → Processors → Storage
- Clear data flow
- Red Flags: No pipeline, unclear flow
Query Design
- Search optimization
- Indexing strategy
- Red Flags: Slow queries, missing indexes, poor performance

Communication Skills

Pipeline Explanation
- Can explain ingestion flow
- Understands processing
- Red Flags: No understanding, vague explanations
Storage Explanation
- Can explain hot/cold storage
- Understands retention
- Red Flags: No understanding, vague explanations

Meta-Specific Focus

Observability Expertise
- Metrics/logging knowledge
- High-throughput design
- Key: Show observability expertise
Cost & Scale Balance
- Cost optimization
- High throughput
- Key: Demonstrate cost/scale balance

Robina Li
