System Design Overview: Cloud-Native Architectures
A practical map of cloud system design: entry points, compute patterns, data stores, async pipelines, resilience, security, and observability.
Reference blueprint
Clients → CDN/Edge → API Gateway (AuthN/Z, WAF, Rate‑limit)
├── REST/GraphQL Services (K8s) → Cache (Redis)
├── Async Workers (Queues/Streams)
├── Batch/ETL (Airflow/Spark)
└── Admin/Backoffice
State: OLTP (Postgres/MySQL), NoSQL (Dynamo/Cassandra), Blob (S3/GCS), Search (Elastic), Analytics (BigQuery/ClickHouse)
Infra: Kubernetes, Service Mesh, IaC (Terraform), Secrets (KMS)
Core patterns
- API gateway: authentication (OIDC), request shaping, routing, circuit breakers.
- Microservices on K8s: autoscaling, HPA, pod disruption budgets, rolling and canary deploys.
- Caching: edge (CDN), app‑side Redis, database read replicas; cache‑aside + TTL + stampede control.
- Messaging: queues (SQS/PubSub) for async work; streams (Kafka) for event sourcing and fanout.
- Data stores: choose per access pattern (OLTP for transactions, NoSQL for key‑value/scale, search for text, columnar for analytics).
Reliability
- Graceful degradation and feature flags; backpressure; rate limits.
- Multi‑AZ by default; multi‑region active‑active reads where feasible; RPO/RTO defined and tested.
- Idempotency keys for mutable APIs; retries with jitter; dead‑letter queues.
Security
- Principle of least privilege (IAM); encryption in transit (mTLS) and at rest (KMS).
- Secret management, key rotation, short‑lived tokens; audit trails (WORM where needed).
- WAF, bot protection, and abuse detection on edges.
Observability
- Structured logs, metrics, and traces (OpenTelemetry); SLOs and burn‑rate alerts.
- Blackbox probes and synthetic tests; chaos experiments with blast radius limits.
Cost & performance
- Right‑size instances; autoscale; spot capacity for batch.
- Tune hot paths (lock contention, connection pools); profile with p99 focus.
Interview checklist
- Entry → gateway → services → data; sync vs async; cache strategy; failure modes; rollout plan; SLOs.
SLO & capacity templates
SLOs: API p95 < 200 ms, availability 99.9%, error rate < 0.1%
Capacity (BoE): QPS=____, payload=____ KB → egress/day=____ GB; cache hit target=____; DB IOPS needed=____.
Failure drill menu
- Regional outage, cache cluster loss, message backlog, DB primary failover, provider throttle—define runbooks and automate game days.