Distributed System Design Ecosystem

A reference guide to common categories in distributed system design, the typical components in each category, and their common usage. Use this as a mental checklist when tackling system design interviews or architecting real systems.

Component Reference by Technology

Systematic breakdown by category, representative technologies, and typical usage. Use these tables when you need to name concrete systems (e.g. “Kafka for event streaming”) rather than abstract concepts.

1. Messaging & Event Streaming

Category	Examples	Common Usage
Message Queue / Broker	RabbitMQ, ActiveMQ	Decouple services, handle asynchronous tasks, reliable delivery.
Distributed Log / Event Streaming	Kafka, Pulsar, Redpanda	High-throughput event streaming, log aggregation, real-time analytics, pub-sub communication.
Pub/Sub Systems	NATS, Google Pub/Sub	Broadcast events to multiple consumers, event-driven microservices.

2. Databases

Category	Examples	Common Usage
Relational DB	MySQL, PostgreSQL, MariaDB	Strong consistency, structured data, transactions.
NoSQL Key-Value Store	Redis, DynamoDB, Riak	Low-latency access, caching, session storage, high throughput.
Document Store	MongoDB, Couchbase	Flexible schema, JSON-style data, content storage.
Wide Column Store	Cassandra, HBase	Large-scale writes, time-series data, distributed tables.
Graph DB	Neo4j, JanusGraph	Relationships-heavy data, social networks, recommendation engines.

3. Caching & In-Memory Systems

Category	Examples	Common Usage
In-Memory Cache	Redis, Memcached	Reduce DB load, fast data access, session storage.
Distributed Cache	Hazelcast, Ignite	Shared cache across nodes in a cluster, scale horizontally.

4. Storage Systems

Category	Examples	Common Usage
Object Storage	AWS S3, MinIO	Store unstructured data like images, videos, logs.
Distributed File System	HDFS, Ceph	Large-scale batch processing, big data analytics, high durability.
Block Storage	EBS, OpenStack Cinder	Persistent storage for VMs and containers.

5. Coordination & Configuration

Category	Examples	Common Usage
Distributed Coordination	Zookeeper, Consul	Leader election, service discovery, distributed locks.
Configuration Management	etcd, Consul	Store cluster metadata, maintain consistent config across nodes.

6. Search & Analytics

Category	Examples	Common Usage
Search Engine	Elasticsearch, Solr	Full-text search, log analytics, metrics search.
Analytics / OLAP	ClickHouse, Druid	Real-time analytics, aggregate queries on big data.

7. API & Gateway Layers

Category	Examples	Common Usage
API Gateway	Kong, NGINX, Envoy	Routing, authentication, rate-limiting, service entry point.
Load Balancer	HAProxy, NGINX, AWS ELB	Distribute traffic across services, high availability.

8. Stream Processing / Computation

Category	Examples	Common Usage
Stream Processing	Apache Flink, Kafka Streams, Spark Streaming	Real-time transformations, aggregations on events.
Batch Processing	Apache Spark, Hadoop MapReduce	Large-scale offline computations.

9. Monitoring & Observability

Category	Examples	Common Usage
Metrics & Monitoring	Prometheus, Grafana	Track service health, alerting, system metrics.
Tracing / Logging	Jaeger, ELK stack	Debug distributed requests, trace latency, centralized logging.

10. Distributed Infrastructure

Category	Examples	Common Usage
Service Mesh	Istio, Linkerd	Traffic management, observability, secure communication between services.
Container Orchestration	Kubernetes, Nomad	Deploy, scale, and manage containers in clusters.
Distributed Task Scheduler	Airflow, Celery	Manage workflows, schedule jobs across nodes.

Quick takeaways:

Messaging & Event Streaming → decouple services, real-time communication
Databases & Storage → persistence, scale, reliability
Coordination & Config → consistency, leader election, cluster metadata
Monitoring & Observability → track, debug, alert
Compute Layers → process data (real-time or batch)

Conceptual Categories (Consistency, Partitioning, etc.)

The sections below organize the same ecosystem by concepts (consistency, partitioning, fault tolerance) rather than by technology. Use them to reason about trade-offs and to answer “how” questions (e.g. how do you shard? how do you handle failures?).

1. Consistency & Replication

Category	Common Items	Common Usage
Consistency models	Strong (linearizable), eventual, causal, read-your-writes, session	Choose by read/write latency and correctness needs; strong for financial/leaderboard, eventual for feeds/social.
Replication topologies	Leader–follower (primary–replica), multi-leader, leaderless (Dynamo-style)	Leader–follower for simple strong consistency; multi-leader for multi-region; leaderless for availability.
Reconciliation	Quorum (R/W), vector clocks, version vectors, last-write-wins (LWW), CRDTs	Quorum for consensus; vector clocks for ordering; LWW for simple conflict resolution; CRDTs for conflict-free merges.
Sync vs async	Synchronous replication, asynchronous replication	Sync for durability/consistency; async for lower latency and higher throughput at cost of staleness.

When to mention: Replication strategy, CAP trade-offs, multi-datacenter design, conflict resolution.

2. Partitioning & Sharding

Category	Common Items	Common Usage
Partitioning strategies	Hash-based, range-based, directory-based, hybrid	Hash for even load; range for range queries and sorting; directory for flexible rebalancing.
Consistent hashing	Virtual nodes, bounded load, rendezvous hashing	Minimize reassignment on add/remove nodes; used in caches, CDNs, distributed storage.
Partition key design	Single key, composite key, key + sort key	Avoid hot partitions; support access patterns (e.g. user_id + timestamp).
Rebalancing	Full scan, range move, double-write during migration	Minimize data movement and downtime; often combined with directory or consistent hashing.

When to mention: Scaling storage or compute, hot spots, data locality, schema design.

3. Messaging & Communication

Category	Common Items	Common Usage
Message queues	Kafka, RabbitMQ, SQS, Azure Service Bus, Redis Streams	Decouple producers/consumers; buffer spikes; at-least-once or exactly-once delivery.
Pub/sub	Kafka topics, Google Pub/Sub, AWS SNS/SQS, Redis Pub/Sub	Fan-out to many subscribers; event-driven and real-time pipelines.
Request/response	REST, gRPC, GraphQL, Thrift	Service-to-service and client–server; gRPC for performance, REST for simplicity.
Delivery guarantees	At-most-once, at-least-once, exactly-once	Trade off simplicity vs duplicates vs implementation cost (ids, idempotency, transactions).
Patterns	Request–reply, fan-out, fan-in, priority queue, dead-letter queue (DLQ)	Model workflows, retries, and failure handling.

When to mention: Async processing, event sourcing, microservices communication, backpressure.

4. Coordination & Consensus

Category	Common Items	Common Usage
Consensus algorithms	Paxos, Raft, Zab (ZooKeeper), Viewstamped Replication	Replicated logs, cluster metadata, configuration.
Coordination services	ZooKeeper, etcd, Consul	Leader election, distributed locks, service discovery, config storage.
Distributed locks	Redis SET NX, ZooKeeper sequences, database advisory locks	Mutual exclusion across processes; use with TTL or heartbeats.
Leader election	Bully, ring, Raft leader	Single writer or coordinator per partition/shard.
Discovery	etcd, Consul, Eureka, DNS, K8s Services	Find healthy instances; support load balancing and failover.

When to mention: Multi-node coordination, strong consistency, avoiding split-brain, deployment and config.

5. Storage

Category	Common Items	Common Usage
Relational DB	PostgreSQL, MySQL, Aurora, Spanner	Transactions, joins, strong consistency; vertical + read replicas + sharding for scale.
Key–value	Redis, DynamoDB, Riak, etcd	Caching, session store, simple CRUD by key; low latency.
Document	MongoDB, Couchbase, Firestore	Flexible schema, nested documents, query by fields.
Wide-column	Cassandra, HBase, Bigtable	High write throughput, time-series, log data; tunable consistency.
Search	Elasticsearch, OpenSearch, Solr	Full-text search, aggregations, log analytics.
Object/blob	S3, GCS, Azure Blob, MinIO	Large binaries, backups, data lakes; eventual consistency.
Caching	Redis, Memcached, Varnish, CDN	Reduce DB load and latency; cache-aside, write-through, write-behind.
CDN	CloudFront, Cloudflare, Akamai	Static assets, API caching at edge; lower latency and origin load.

When to mention: Data model, access patterns, durability, latency, cost.

6. Load Balancing & Routing

Category	Common Items	Common Usage
LB algorithms	Round-robin, least connections, weighted, IP hash, consistent hash	Spread traffic; avoid overloaded or unhealthy nodes.
Layers	DNS, L4 (TCP/UDP), L7 (HTTP/gRPC)	DNS for geo/failover; L4 for raw throughput; L7 for routing by path/header.
Components	Nginx, HAProxy, Envoy, AWS ALB/NLB, cloud LB	Ingress, internal routing, health checks, TLS termination.
Service mesh	Istio, Linkerd, Consul Connect	mTLS, retries, timeouts, observability across services.

When to mention: High availability, scaling stateless services, canary/blue-green.

7. Fault Tolerance & Resilience

Category	Common Items	Common Usage
Retry	Exponential backoff, jitter, max attempts	Transient failures; avoid thundering herd.
Timeout	Connection, read, total request	Prevent hanging calls; propagate and set at each layer.
Circuit breaker	Open / half-open / closed	Stop calling failing dependencies; fail fast and recover.
Bulkhead	Thread pools, connection limits per dependency	Isolate failures; one bad dependency doesn’t starve others.
Idempotency	Idempotency keys, dedupe by request ID	Safe retries; exactly-once semantics where needed.
Graceful degradation	Fallbacks, cached/stale data, feature flags	Maintain partial service when dependencies fail.
Health checks	Liveness, readiness, dependency checks	LB and orchestrator use these to route traffic and restart.

When to mention: Reliability, cascading failures, SLA, client experience under failure.

8. Security & Access Control

Category	Common Items	Common Usage
Authentication	JWT, OAuth2, OIDC, API keys, mTLS	Identify users and services; SSO and delegated access.
Authorization	RBAC, ABAC, ACLs, policy engines	Who can do what on which resource.
Rate limiting	Token bucket, leaky bucket, sliding window, per-user/per-IP	Protect APIs and backends; fairness and cost control.
Encryption	TLS in transit, encryption at rest (KMS), field-level	Confidentiality and compliance.
Secrets	Vault, K8s Secrets, managed secret managers	Store and rotate DB credentials, API keys.

When to mention: Multi-tenant systems, public APIs, compliance, zero-trust.

9. Observability & Operations

Category	Common Items	Common Usage
Logging	Structured logs (JSON), log levels, correlation IDs	Debugging, audit, tracing requests across services.
Metrics	Counters, gauges, histograms, percentiles (p50/p99)	Latency, throughput, error rate; SLOs and alerting.
Tracing	OpenTelemetry, Jaeger, Zipkin, X-Ray	Distributed trace IDs; find bottlenecks and failure paths.
Alerting	PagerDuty, Opsgenie, Slack, on-call runbooks	React to SLO breaches and incidents.
Dashboards	Grafana, Datadog, CloudWatch	Visualize metrics and logs; capacity and trend analysis.

When to mention: Debugging at scale, SLO/SLA, incident response, capacity planning.

10. Compute & Scheduling

Category	Common Items	Common Usage
Job queues	Celery, Sidekiq, K8s Jobs, Lambda, Step Functions	Async and batch jobs; retries and backoff.
Orchestration	Kubernetes, Mesos, Nomad, ECS	Deploy and scale containers; placement and resource limits.
Workflow	Temporal, Cadence, Airflow, Prefect	Durable workflows, human-in-the-loop, data pipelines.
Serverless	Lambda, Cloud Functions, Knative	Event-driven, scale-to-zero; short, stateless tasks.

When to mention: Background processing, batch vs real-time, resource and cost optimization.

How to Use This in an Interview

Clarify requirements — Then map them to categories above (e.g. “strong consistency” → Consistency & Replication; “10M QPS” → Partitioning, Load Balancing, Storage).
Name categories first — e.g. “We need replication, partitioning, and a message queue,” then drill into specific items.
Justify choices — For each item, briefly say why (e.g. “Eventual consistency because reads can be stale and we need low latency”).
Trade-offs — Refer back to categories when discussing trade-offs (e.g. consistency vs availability, sync vs async replication).

Quick Reference: Category → Typical Questions

Category	Interview angles
Consistency & Replication	How do you replicate across regions? How do you resolve conflicts?
Partitioning	How do you shard? How do you avoid hot partitions?
Messaging	How do you decouple services? How do you guarantee delivery?
Coordination	How do you elect a leader? How do you implement a distributed lock?
Storage	Why this DB? How do you scale reads/writes?
Load Balancing	How do you scale out? How do you do canary releases?
Fault Tolerance	How do you handle failures? How do you make retries safe?
Security	How do you authenticate/authorize? How do you rate limit?
Observability	How do you debug in production? How do you define SLOs?
Compute	How do you run background jobs? How do you orchestrate workflows?

System Scale Analysis

Concrete throughput, capacity, and scale numbers help you sanity-check designs and answer “how big can this get?” in interviews. Numbers below are indicative (hardware, config, and workload vary); use them as order-of-magnitude references.

Messaging & Event Streaming

System	Typical scale (setup)	Notes
Kafka	~100K–600K+ msg/s per broker; ~100–600+ MB/s write throughput on a 3-broker cluster (e.g. 8 vCPU, NVMe, tuned producer)	Single broker: ~25–95+ MB/s with batching/compression. Cluster scales linearly with brokers. p99 latency ~5 ms at ~200 MB/s. Tuning: `batch.size`, `linger.ms`, `compression.type=lz4`, `acks=1` for throughput.
RabbitMQ	~10K–50K msg/s per node (small messages); lower for large payloads	Throughput depends on message size, persistence (disk vs RAM), and cluster size. Use for moderate throughput and flexible routing.
Google Pub/Sub	~1M msg/s per topic (managed; scales with partitions and subscribers)	Fully managed; throughput and retention are service limits, not single-machine limits.

Databases

System	Typical scale (setup)	Notes
PostgreSQL	No hard DB size limit; single table up to ~32 TB (default 8 KB block size); single field up to 1 GB	Practical limit is disk and performance. Single instance: often hundreds of GB–few TB before sharding/read replicas. Extensions (e.g. Citus) add sharding for larger scale.
MySQL	Single table ~64 TB (InnoDB); practical single-instance ~1–10 TB depending on workload	Scale via read replicas, then sharding or Vitess/ProxySQL for very large datasets.
Redis (single node)	~100K–1.2M ops/s per node (simple GET/SET); ~1M+ RPS on larger instances (e.g. r7g.4xlarge)	Cluster: linear scaling (e.g. ~10M ops/s on 6 nodes, ~200M ops/s on 40 nodes with Redis Enterprise). Sub-ms latency typical.
DynamoDB	~3K–10K WCU per partition (write); ~3K–10K RCU per partition (read); auto-scaling and partition splitting	Throughput and storage scale with partitions; no fixed “max DB size”—pay per request and storage.
Cassandra	Linear write scaling with nodes; ~10K–30K+ writes/s per node depending on schema and hardware	No single “max size”; cluster size and replication factor determine capacity. Proper data modeling and vnodes (e.g. 4–16 tokens per node) matter.
MongoDB	Single replica set ~TB range; sharded clusters 100+ TB with many shards	Throughput scales with shards and read preference (primary vs secondaries).

Caching & In-Memory

System	Typical scale (setup)	Notes
Redis	See Databases above; ~100K–1M+ ops/s per node, sub-ms latency	Same product used as cache and as KV store; cluster mode for horizontal scale.
Memcached	~100K–500K+ ops/s per node (GET-heavy)	Simpler than Redis; multi-threaded; no persistence. Scale by adding nodes.

Storage Systems

System	Typical scale (setup)	Notes
AWS S3	Unlimited objects and total size; 3,500 PUT / 5,500 GET per second per prefix (request rate best practices)	Scale by prefix design (shard prefixes for higher aggregate throughput). No single “max bucket size.”
HDFS	PB-scale with hundreds/thousands of nodes; single NameNode metadata in millions of files	Throughput scales with DataNodes and replication; block size (e.g. 128 MB) affects large-file throughput.

Search & Analytics

System	Typical scale (setup)	Notes
Elasticsearch	~10–50 GB per shard recommended; <200M documents per shard; ~1000 shards per (non-frozen) data node default	Total cluster size = nodes × shard capacity. Oversharding hurts performance; scale by adding nodes and reindexing/shrinking if needed.
ClickHouse	TB–PB per cluster; billions of rows per table; high compression	Optimized for analytical queries; throughput depends on schema, compression, and hardware.

Coordination & Configuration

System	Typical scale (setup)	Notes
ZooKeeper	~10K–100K+ ops/s for reads; writes lower (consensus); KB–MB for znodes	Suited for metadata and coordination, not bulk data. Scale by ensemble size (3, 5, 7 nodes typical).
etcd	~10K+ writes/s (small values); ~100K+ reads/s; multi-GB storage per cluster	Used by Kubernetes; scale by cluster size and resource limits.

Why Scale Numbers Matter in Interviews

Sizing: “We need 1M events/s → Kafka with N brokers” or “We have 50 TB → PostgreSQL + Citus or Cassandra.”
Bottlenecks: “Single Redis node caps at ~1M ops/s; we’ll need a cluster for 10M.”
Trade-offs: “PostgreSQL single table is 32 TB; beyond that we shard or move to a distributed store.”
Realism: Avoid “Kafka can do infinite throughput” or “PostgreSQL can’t hold more than 1 GB”—use order-of-magnitude numbers instead.

This ecosystem view helps you structure your answer (by category), recall standard building blocks (items), and explain their use in a given design. The component reference gives you concrete technology names; the conceptual categories help you reason about trade-offs; the scale analysis grounds designs in realistic numbers. Keep this as a mental map when practicing system design problems.

Robina Li