System Design: ChatGPT-Style LLM Service (Serving, Caching, Safety)
Goal: low‑latency, high‑availability text generation (and tool use) with safety, rate limiting, and observability.
Requirements
- Streaming tokens (time to first token < 200 ms), high batch throughput, multi‑tenant quotas, session history, tool use (function calling), file handling / RAG.
Architecture overview
Clients → API Gateway (AuthN/Z, rate limiting, quotas) → Orchestrator (routing, context mgmt, tool calls)
Orchestrator → Inference Fleet (GPU/TPU) → KV Cache (paged attention)
Orchestrator → Safety Filters (pre/post)
Orchestrator → Retrieval (vector store, doc store) for RAG
Orchestrator → Event Bus (telemetry)
Inference serving
- Models sharded across GPUs with tensor/pipeline parallelism; request batching and scheduling keep utilization high.
- KV cache reuse across turns; paged‑attention cache in GPU/host memory tiers.
- Multi‑model routing (cost/latency/quality); fallback to smaller models under load.
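A minimal load-aware routing sketch; pool names, thresholds, and the fallback policy are illustrative assumptions, not a prescribed design:

```python
# Load-aware model routing with fallback to a smaller pool when the preferred
# pool is saturated. Pool names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str          # e.g. "large-70b", "small-8b"
    queue_depth: int   # requests waiting for a decode slot
    max_queue: int     # beyond this the pool is considered saturated

def route(pools: list[Pool], prefer: str) -> Pool:
    by_name = {p.name: p for p in pools}
    preferred = by_name[prefer]
    if preferred.queue_depth < preferred.max_queue:
        return preferred
    fallbacks = [p for p in pools if p.name != prefer]
    if not fallbacks:
        return preferred
    # Degrade to the least-loaded alternative instead of queueing indefinitely.
    return min(fallbacks, key=lambda p: p.queue_depth / p.max_queue)

pools = [Pool("large-70b", queue_depth=120, max_queue=100),
         Pool("small-8b", queue_depth=10, max_queue=200)]
print(route(pools, prefer="large-70b").name)  # "small-8b" while the large pool is saturated
```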
Context and tools
- Conversation store (compressed) with truncation strategies; tools/plugins invoked via JSON schemas.
- Function calling: orchestrator validates args; tool sandbox with timeouts and quotas.
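A minimal sketch of the validate-then-sandbox path, assuming the `jsonschema` package and a hypothetical `get_weather` tool (a real sandbox would isolate the tool in a separate process):

```python
# Validate model-produced tool arguments against the tool's JSON schema, then
# run the tool under a timeout budget. Tool name, schema, and limits are
# illustrative assumptions.
import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as ToolTimeout
from jsonschema import validate, ValidationError  # pip install jsonschema

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
    "additionalProperties": False,
}

def get_weather(city: str) -> dict:   # stand-in tool implementation
    return {"city": city, "temp_c": 21}

def call_tool(raw_args: str, timeout_s: float = 2.0) -> dict:
    args = json.loads(raw_args)       # arguments emitted by the model
    try:
        validate(instance=args, schema=WEATHER_SCHEMA)
    except ValidationError as e:
        return {"error": f"invalid arguments: {e.message}"}
    pool = ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(get_weather, **args)
    try:
        return fut.result(timeout=timeout_s)
    except ToolTimeout:
        return {"error": "tool timed out"}  # a real sandbox would kill a subprocess
    finally:
        pool.shutdown(wait=False)

print(call_tool('{"city": "Oslo"}'))
```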
Retrieval (RAG)
- Embeddings service builds vectors for docs; chunking + metadata; ANN index (HNSW/IVF‑PQ).
- At query time: recall top‑k, re‑rank, synthesize context; guard context length with budgeters.
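A back-of-the-envelope sketch of the query path; the ANN scores and re-ranker are stubbed here, where a real system would use an HNSW/IVF-PQ index and a cross-encoder:

```python
# Recall top-k chunks, re-rank, then pack the context under a token budget.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    ann_score: float     # similarity score from the ANN index
    rerank_score: float  # cross-encoder score (stubbed)

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # rough heuristic: ~4 characters per token

def build_context(chunks: list[Chunk], k: int = 20, budget_tokens: int = 1500) -> str:
    recalled = sorted(chunks, key=lambda c: c.ann_score, reverse=True)[:k]
    reranked = sorted(recalled, key=lambda c: c.rerank_score, reverse=True)
    picked, used = [], 0
    for c in reranked:
        cost = approx_tokens(c.text)
        if used + cost > budget_tokens:
            break                    # the budgeter guards the context length
        picked.append(c.text)
        used += cost
    return "\n---\n".join(picked)
```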
Caching and dedup
- Prompt/result cache (normalized input) for deterministic prompts; stage caches (embedding, retrieval, decode prefixes).
- Safety: never cache sensitive PII; encrypt at rest.
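A sketch of cache-key normalization under these rules; the temperature cutoff and the specific normalization steps are assumptions:

```python
# Prompt/result cache keyed by a normalized request: whitespace and option
# ordering are normalized so equivalent prompts hit the same entry; sampled
# (temperature > 0) requests bypass the cache. PII-bearing prompts should be
# excluded upstream and the cache encrypted at rest.
import hashlib
import json

def cache_key(model: str, messages: list[dict], params: dict) -> str | None:
    if params.get("temperature", 1.0) > 0:
        return None   # only cache deterministic requests
    normalized = {
        "model": model,
        "messages": [
            {"role": m["role"], "content": " ".join(m["content"].split())}
            for m in messages
        ],
        "params": params,
    }
    blob = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()
```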
Safety and policy
- Pre‑filters (regex/keyword/ML) and post‑filters (classifier) for harmful content; red teaming and appeals.
- Audit logs of prompts/outputs; data retention policies per tenant.
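A minimal pre/post filter sketch; the patterns and the classifier stub are placeholders for real keyword lists and moderation models:

```python
# Cheap regex/keyword pre-filter before the model, classifier post-filter on
# the generated text. Patterns and threshold are illustrative assumptions.
import re

BLOCKLIST = [re.compile(p, re.I) for p in (r"\bcredit card number\b", r"\bssn\b")]

def pre_filter(prompt: str) -> bool:
    """Return True if the prompt may proceed to the model."""
    return not any(p.search(prompt) for p in BLOCKLIST)

def post_filter(output: str, classify=lambda text: 0.0) -> bool:
    """Return True if the output may be shown; `classify` stands in for an ML
    moderation model returning a harm probability in [0, 1]."""
    return classify(output) < 0.5
```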
Observability and reliability
- Token‑level metrics (TTFT, TPS, errors); autoscaling; circuit breakers; brownout mode (shorter max tokens) under load.
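A sketch of per-request token metrics (TTFT and decode tokens/sec), the signals the SLO monitors and autoscaler act on:

```python
# Record time-to-first-token and decode throughput for one streamed response.
import time

class TokenMetrics:
    def __init__(self):
        self.start = time.monotonic()
        self.first_token_at = None
        self.tokens = 0

    def on_token(self):
        now = time.monotonic()
        if self.first_token_at is None:
            self.first_token_at = now   # marks TTFT
        self.tokens += 1

    def summary(self) -> dict:
        now = time.monotonic()
        ttft_ms = None if self.first_token_at is None else (self.first_token_at - self.start) * 1000
        decode_s = max(now - (self.first_token_at or now), 1e-9)
        return {"ttft_ms": ttft_ms, "tokens_per_sec": self.tokens / decode_s}
```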
APIs
POST /v1/chat/completions { model, messages[], tools?, tool_choice? }
POST /v1/embeddings { model, input }
Capacity planning
- Target: 100k concurrent sessions, TTFT < 200 ms, 50 tokens/s median decode.
- GPU math: model weights need ~40 GB per replica; with paged attention, the KV cache holds ~128k tokens/GPU; plan cache tiers (GPU HBM → CPU RAM → SSD) with eviction.
- Routing: batch size 8–16 to keep utilization > 60% without harming latency; autoscale on queue depth and TTFT SLO.
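A back-of-the-envelope check of the cache figure, using assumed model dimensions (not the document's exact model):

```python
# KV-cache sizing sketch: assume an 80 GB GPU, ~40 GB of weights per replica,
# and illustrative grouped-query-attention dimensions.
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_elem = 2                                     # fp16/bf16
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
gpu_hbm_gb, weights_gb = 80, 40
cache_budget_bytes = (gpu_hbm_gb - weights_gb) * 1024**3
tokens_per_gpu = cache_budget_bytes // kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token -> ~{tokens_per_gpu:,} cached tokens/GPU")
# ~192 KiB/token and ~218k tokens/GPU under these assumptions, the same order
# of magnitude as the ~128k figure above.
```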
SLOs & safety
- SLOs: TTFT, tokens/sec, error rate; enforce brownouts (cap on max tokens) when the error budget is violated.
- Safety: pre/post filters with allow/deny lists; privacy guardrails per tenant; audit retention limits.
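A sketch of brownout enforcement; the SLO threshold and cap are illustrative:

```python
# When the TTFT SLO is being violated, cap max_tokens on new requests instead
# of rejecting them outright.
def effective_max_tokens(requested: int, p95_ttft_ms: float,
                         slo_ms: float = 200, brownout_cap: int = 256) -> int:
    if p95_ttft_ms > slo_ms:
        return min(requested, brownout_cap)  # shed load by shortening outputs
    return requested
```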
Failure modes
- GPU node loss: retry to alternate pool; preserve cache keys when possible; degrade to smaller model if capacity constrained.
- RAG backends slow: fall back to the last known context or skip retrieval, tagging the response with a warning (see the sketch below).
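A sketch of the slow-retrieval fallback, bounding the call with a timeout and tagging the degraded response; names are illustrative:

```python
# Bound retrieval with a timeout; on failure, reuse the last known context and
# attach a warning tag so the caller can disclose the degradation.
import concurrent.futures as cf

def retrieve_with_fallback(retrieve, query: str, last_known_context: str,
                           timeout_s: float = 0.3) -> tuple[str, list[str]]:
    pool = cf.ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(retrieve, query)
    try:
        return fut.result(timeout=timeout_s), []
    except Exception:
        return last_known_context, ["retrieval_degraded"]  # warning tag
    finally:
        pool.shutdown(wait=False)
```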
Detailed APIs
POST /v1/chat/completions { model, messages[], tools?, tool_choice?, stream? }
Streaming responses (stream=true) arrive as SSE events:
event: chunk
data: { role, delta, usage? }
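A client-side sketch of consuming the stream, assuming the event/data layout above; the URL and auth token are placeholders:

```python
# Read SSE chunks line by line and yield each incremental text delta.
import json
import requests

def stream_chat(url: str, payload: dict, token: str):
    with requests.post(url, json={**payload, "stream": True},
                       headers={"Authorization": f"Bearer {token}"},
                       stream=True) as r:
        for raw in r.iter_lines(decode_unicode=True):
            if raw and raw.startswith("data: "):
                chunk = json.loads(raw[len("data: "):])
                yield chunk.get("delta", "")

# for piece in stream_chat("https://api.example.com/v1/chat/completions",
#                          {"model": "m", "messages": []}, token="..."):
#     print(piece, end="", flush=True)
```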
Orchestrator design
- Router selects model/pool; KV cache coordinator attaches cache id; streaming gateway multiplexes SSE chunks.
- Tooling: JSON schema validation and safe tool sandbox with timeouts; retries with circuit breaker per tool.
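A minimal per-tool circuit breaker sketch; the failure threshold and cool-down are illustrative:

```python
# After N consecutive failures the breaker opens and tool calls are rejected
# until a cool-down elapses, then one trial call is allowed (half-open).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success: bool):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
```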
Capacity BoE
- 100k concurrent streams @ 50 tok/s → 5M tok/s; plan GPU count for model throughput; batch size 8–16.
- KV cache: 128k tokens/GPU tier; promote hot sessions; evict with LRU per tenant.
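The throughput math, with an assumed per-GPU decode rate (which depends heavily on model size and batching):

```python
# Fleet sizing from the numbers above; per-GPU throughput is an assumption.
concurrent_streams = 100_000
tokens_per_stream_per_s = 50
aggregate_tok_s = concurrent_streams * tokens_per_stream_per_s   # 5,000,000 tok/s
assumed_gpu_decode_tok_s = 2_500     # illustrative batched decode rate per GPU
gpus_needed = -(-aggregate_tok_s // assumed_gpu_decode_tok_s)    # ceiling division
print(aggregate_tok_s, gpus_needed)  # 5000000, 2000 (before headroom/redundancy)
```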
Testing & eval
- Load gen with mixed prompts/tools; TTFT SLO monitors; shadow deploys for new model versions with guardrails.
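A sketch of shadow deployment: mirror a sample of live traffic to the candidate model version, log comparisons, and never serve the shadow output; the sampling rate and logged fields are assumptions:

```python
# Users always get the primary model; a fraction of requests is replayed
# against the candidate purely for offline comparison.
import random

def handle_request(req, primary, candidate, shadow_rate=0.05, log=print):
    response = primary(req)
    if random.random() < shadow_rate:
        try:
            shadow = candidate(req)   # fire-and-forget in a real system
            log({"event": "shadow_compare",
                 "primary_len": len(response), "shadow_len": len(shadow)})
        except Exception as e:
            log({"event": "shadow_error", "error": str(e)})
    return response
```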
What Interviewers Look For
LLM/AI Systems Skills
- Inference Optimization
  - GPU/TPU utilization
  - Batching strategies
  - KV cache management
  - Red Flags: Poor GPU utilization, no batching, inefficient caching
- Latency Optimization
  - Time-to-first-token (TTFT) < 200 ms
  - Streaming architecture
  - Paged attention
  - Red Flags: High latency, no streaming, poor TTFT
- Model Serving
  - Multi-model routing
  - Fallback strategies
  - Auto-scaling
  - Red Flags: Single model, no fallback, poor scaling
Distributed Systems Skills
- Caching Strategy
  - Prompt/result caching
  - KV cache tiers
  - Cache eviction policies
  - Red Flags: No caching, poor strategy, cache misses
- RAG Architecture
  - Vector search
  - Embedding generation
  - Context management
  - Red Flags: No RAG, inefficient search, context overflow
- Safety & Policy
  - Pre/post filters
  - Content moderation
  - Audit logging
  - Red Flags: No safety, no moderation, no audit
Problem-Solving Approach
- Cost Optimization
  - GPU cost management
  - Cache efficiency
  - Model selection
  - Red Flags: High costs, no optimization, inefficient
- Edge Cases
  - GPU failures
  - Slow RAG backends
  - Cache misses
  - Red Flags: Ignoring edge cases, no handling
- Trade-off Analysis
  - Latency vs cost
  - Quality vs speed
  - Red Flags: No trade-offs, dogmatic choices
System Design Skills
- Component Design
  - Orchestrator
  - Inference fleet
  - Safety filters
  - Red Flags: Monolithic, unclear boundaries
- Observability
  - Token-level metrics
  - SLO monitoring
  - Circuit breakers
  - Red Flags: No metrics, no monitoring, no observability
- Reliability
  - Auto-scaling
  - Brownout mode
  - Graceful degradation
  - Red Flags: No scaling, no degradation, poor reliability
Communication Skills
- LLM Architecture Explanation
  - Can explain inference serving
  - Understands caching strategies
  - Red Flags: No understanding, vague explanations
- Performance Explanation
  - Can explain latency optimization
  - Understands cost trade-offs
  - Red Flags: No understanding, vague explanations
Meta-Specific Focus
- AI/ML Systems Expertise
  - Deep LLM knowledge
  - Inference optimization
  - Key: Show AI/ML systems expertise
- Performance & Cost Balance
  - Latency optimization
  - Cost efficiency
  - Key: Demonstrate performance/cost balance