Introduction
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It collects metrics from configured targets, stores them in a time-series database, and provides a powerful query language for analysis. Understanding Prometheus is essential for system design interviews involving observability and monitoring.
This guide covers:
- Prometheus Fundamentals: Architecture, metrics, and data model
- PromQL: Query language for metrics analysis
- Service Discovery: Automatic target discovery
- Alerting: Alert rules and Alertmanager
- Best Practices: Metric naming, labeling, and performance
What is Prometheus?
Prometheus is a monitoring system that provides:
- Metrics Collection: Pulls metrics from targets
- Time-Series Database: Stores metrics efficiently
- Query Language: PromQL for analysis
- Alerting: Alertmanager for notifications
- Multi-Dimensional Data Model: Labels for flexible querying
Key Concepts
Metric: A time series identified by a name and a set of labels
Label: A key-value pair that adds a dimension to a metric
Target: An endpoint that Prometheus scrapes for metrics
Scrape: A single collection of metrics from a target
Job: A group of targets that serve the same purpose (e.g., replicas of one service)
Instance: A single target within a job, typically identified as host:port
Architecture
High-Level Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Service │────▶│ Service │────▶│ Service │
│ A │ │ B │ │ C │
│ (Metrics) │ │ (Metrics) │ │ (Metrics) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└────────────────────┴────────────────────┘
│
│ Scrape (Pull)
│
▼
┌─────────────────────────┐
│ Prometheus Server │
│ │
│ ┌──────────┐ │
│ │ Scrape │ │
│ │ Targets │ │
│ └────┬─────┘ │
│ │ │
│ ┌────┴─────┐ │
│ │ Time- │ │
│ │ Series │ │
│ │ Database │ │
│ └──────────┘ │
│ │
│ ┌───────────────────┐ │
│ │ PromQL │ │
│ │ (Queries) │ │
│ └───────────────────┘ │
└──────┬──────────────────┘
│
┌─────────────┴─────────────┐
│ │
┌──────▼──────┐ ┌───────▼──────┐
│ Grafana │ │ Alertmanager │
│ (Dashboards)│ │ (Alerts) │
└─────────────┘ └─────────────┘
Explanation:
- Services: Applications, servers, or infrastructure components that expose metrics endpoints (e.g., HTTP /metrics).
- Prometheus Server: Collects metrics by scraping targets, stores them in a time-series database, and provides a query language (PromQL).
- Scrape Targets: Services or exporters that expose metrics in Prometheus format. Prometheus pulls metrics from these targets.
- Time-Series Database: Stores metrics data with timestamps. Prometheus uses its own efficient storage format.
- PromQL (Queries): Query language for retrieving and aggregating metrics data.
- Grafana: Visualization tool that queries Prometheus and displays dashboards.
- Alertmanager: Handles alerts generated by Prometheus and routes them to notification channels.
Core Architecture
┌─────────────────────────────────────────────────────────┐
│ Prometheus Server │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Scrape Targets │ │
│ │ (HTTP Endpoints, Exporters) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Time-Series Database │ │
│ │ (Local Storage) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ PromQL Query Engine │ │
│ │ (Query Language) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Rule Evaluation │ │
│ │ (Alerting & Recording Rules → Alertmanager) │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Metric Types
Counter
A monotonically increasing value that resets only when the process restarts:
http_requests_total{method="GET", status="200"} 1234
http_requests_total{method="POST", status="500"} 56
Use Cases:
- Request counts
- Error counts
- Bytes transferred
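A minimal Go sketch of counter instrumentation with the official client_golang library (the metric and label names are illustrative); counters expose only Inc and Add, and PromQL's rate()/increase() handle the reset that happens when a process restarts:
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative counter; promauto registers it with the default registry.
var bytesSent = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "bytes_sent_total",
        Help: "Total bytes sent to clients.",
    },
    []string{"endpoint"},
)

func RecordResponse(endpoint string, n int) {
    // Counters only go up: Inc() adds 1, Add() adds a non-negative amount.
    bytesSent.WithLabelValues(endpoint).Add(float64(n))
}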
Gauge
Value that can go up or down:
memory_usage_bytes{instance="server1"} 1024000
cpu_usage_percent{instance="server1"} 75.5
Use Cases:
- Current memory usage
- Active connections
- Queue size
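A short Go sketch (client_golang; the metric names are illustrative) showing the gauge operations Set, Inc, and Dec, plus a callback-style gauge that is evaluated at scrape time:
package metrics

import (
    "runtime"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative gauge tracking requests currently in flight.
var activeRequests = promauto.NewGauge(prometheus.GaugeOpts{
    Name: "active_requests",
    Help: "Number of requests currently being served.",
})

// A gauge can also be computed on each scrape via a callback.
var _ = promauto.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "app_goroutines",
        Help: "Current number of goroutines.",
    },
    func() float64 { return float64(runtime.NumGoroutine()) },
)

func HandleOneRequest(work func()) {
    activeRequests.Inc()       // request started
    defer activeRequests.Dec() // request finished
    work()
}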
Histogram
Distribution of observations in cumulative buckets (le means "less than or equal to"), plus a running sum and count:
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 200
http_request_duration_seconds_bucket{le="1.0"} 250
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_sum 150.5
http_request_duration_seconds_count 300
Use Cases:
- Request latency
- Response sizes
- Processing time
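Bucket boundaries are fixed at instrumentation time and should bracket the latencies you expect, since histogram_quantile interpolates within them. A Go sketch (client_golang; the metric name and bucket values are illustrative):
package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative histogram with explicit bucket boundaries in seconds.
var queryDuration = promauto.NewHistogram(prometheus.HistogramOpts{
    Name:    "db_query_duration_seconds",
    Help:    "Database query latency.",
    Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
    // Alternatives: prometheus.DefBuckets, or generated boundaries such as
    // prometheus.ExponentialBuckets(0.001, 2, 12) (start, factor, count).
})

func TimedQuery(run func()) {
    start := time.Now()
    run()
    // Each observation increments the matching cumulative buckets, _sum, and _count.
    queryDuration.Observe(time.Since(start).Seconds())
}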
Summary
Similar to a histogram, but exposes quantiles pre-computed on the client side:
http_request_duration_seconds{quantile="0.5"} 0.1
http_request_duration_seconds{quantile="0.9"} 0.5
http_request_duration_seconds{quantile="0.99"} 1.0
http_request_duration_seconds_sum 150.5
http_request_duration_seconds_count 300
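Because summary quantiles are computed inside each client process, they cannot be aggregated across instances; histograms are usually preferred when aggregation matters. A Go sketch (client_golang; the metric name and objectives are illustrative):
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Objectives maps each target quantile to its allowed estimation error.
var payloadSize = promauto.NewSummary(prometheus.SummaryOpts{
    Name:       "response_payload_bytes",
    Help:       "Response payload sizes.",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})

func RecordPayload(bytes int) {
    // Contributes to the streaming quantile estimates, _sum, and _count.
    payloadSize.Observe(float64(bytes))
}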
PromQL (Prometheus Query Language)
Basic Queries
Select metric:
http_requests_total
Filter by label:
http_requests_total{method="GET"}
Multiple label filters:
http_requests_total{method="GET", status="200"}
Rate and Increase
Per-second average rate of increase over the last 5 minutes:
rate(http_requests_total[5m])
Total increase over the last 5 minutes:
increase(http_requests_total[5m])
Aggregation
Sum:
sum(http_requests_total)
Average:
avg(cpu_usage_percent)
Group by:
sum(http_requests_total) by (method)
Max/Min:
max(memory_usage_bytes)
min(cpu_usage_percent)
Functions
Rate:
rate(http_requests_total[5m])
Histogram Quantile (95th percentile computed from bucket rates):
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Time (process uptime in seconds):
time() - process_start_time_seconds
Label Replace (copy the host part of instance into a new service label):
label_replace(http_requests_total, "service", "$1", "instance", "(.*):.*")
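These queries can also be issued programmatically through the Prometheus HTTP API. A sketch using the official Go client API packages, assuming a Prometheus server reachable at http://localhost:9090 (the address and query are illustrative):
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Assumed address; point this at your own Prometheus server.
    client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
    if err != nil {
        log.Fatal(err)
    }
    promAPI := v1.NewAPI(client)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Instant query evaluated at "now"; warnings are non-fatal notices from the server.
    result, warnings, err := promAPI.Query(ctx, `rate(http_requests_total[5m])`, time.Now())
    if err != nil {
        log.Fatal(err)
    }
    if len(warnings) > 0 {
        log.Println("warnings:", warnings)
    }
    fmt.Println(result)
}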
Configuration
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s
Service Discovery
Kubernetes:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Consul:
scrape_configs:
  - job_name: 'consul'
    consul_sd_configs:
      - server: 'consul:8500'
Instrumentation
Go Client
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter: total requests, labeled by method and status code.
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "status"},
    )

    // Histogram: request latency in seconds, labeled by method.
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration",
        },
        []string{"method"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func handler(w http.ResponseWriter, r *http.Request) {
    timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method))
    defer timer.ObserveDuration()

    // Handle request
    httpRequestsTotal.WithLabelValues(r.Method, "200").Inc()
}

func main() {
    http.HandleFunc("/", handler)
    http.Handle("/metrics", promhttp.Handler()) // exposition endpoint scraped by Prometheus
    log.Fatal(http.ListenAndServe(":8080", nil))
}
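In practice, per-handler instrumentation is usually factored into middleware so every route records the same metrics consistently. A sketch that builds on the collectors defined above and captures the real status code with a small ResponseWriter wrapper (requires adding "strconv" to the imports; the helper names are illustrative):
// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

// instrument wraps any handler with the counter and histogram defined above.
func instrument(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method))
        defer timer.ObserveDuration()

        next.ServeHTTP(rec, r)

        httpRequestsTotal.WithLabelValues(r.Method, strconv.Itoa(rec.status)).Inc()
    })
}

// Usage in main, replacing the plain registration:
//   http.Handle("/", instrument(http.HandlerFunc(handler)))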
Python Client
from prometheus_client import Counter, Histogram, start_http_server

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status']
)

http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method']
)

def handle_request(method, status):
    # Time the block and count the request.
    with http_request_duration.labels(method=method).time():
        # Process request
        http_requests_total.labels(method=method, status=status).inc()

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)
Alerting
Alert Rules
alert_rules.yml:
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status="500"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: HighMemoryUsage
        expr: memory_usage_bytes / memory_total_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"
Alertmanager Configuration
alertmanager.yml:
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook:5000/alerts'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'

  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
Best Practices
1. Metric Naming
- Use descriptive names
- Follow naming conventions
- Use base units (seconds, bytes)
- Avoid high cardinality
Good:
http_requests_total
cpu_usage_percent
memory_usage_bytes
Bad:
requests
cpu
memory
2. Labeling
- Use labels for dimensions
- Avoid high cardinality labels
- Keep label names consistent
- Use appropriate label values
Good:
http_requests_total{method="GET", status="200", endpoint="/api/users"}
Bad:
http_requests_total{user_id="12345", request_id="abc123"}
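The second example is bad because Prometheus keeps one time series per unique combination of label values, so unbounded values such as user or request IDs multiply series without limit. A rough Go illustration of the effect (the metric and label names are hypothetical):
package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counter with one bounded label (method) and one unbounded label (user_id).
var requests = promauto.NewCounterVec(
    prometheus.CounterOpts{Name: "demo_requests_total", Help: "Demo counter."},
    []string{"method", "user_id"},
)

func main() {
    // Every new user_id value materializes a brand-new time series that
    // Prometheus must store and index: one million users means one million
    // series per method, which bloats memory and slows every query.
    for i := 0; i < 1000000; i++ {
        requests.WithLabelValues("GET", fmt.Sprintf("user-%d", i)).Inc()
    }
    // Prefer bounded labels (method, status, endpoint) and keep per-user
    // detail in logs or traces instead.
}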
3. Recording Rules
Pre-compute queries:
groups:
  - name: recording_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
4. Performance
- Choose a sensible scrape interval (shorter intervals increase load and storage)
- Use recording rules to pre-compute expensive queries
- Optimize queries (narrow label matchers, avoid very long range selectors)
- Monitor Prometheus itself
What Interviewers Look For
Monitoring Understanding
- Prometheus Concepts
  - Understanding of metrics, labels, and targets
  - PromQL query language
  - Alerting rules
  - Red Flags: No Prometheus understanding, wrong concepts, poor queries
- Observability
  - Metrics vs. logs vs. traces
  - Monitoring strategies
  - Alerting design
  - Red Flags: No observability understanding, poor monitoring, no alerts
- Performance
  - Metric cardinality
  - Query optimization
  - Storage efficiency
  - Red Flags: High cardinality, poor queries, no optimization
Problem-Solving Approach
- Metric Design
  - Choose appropriate metric types
  - Design labels
  - Avoid cardinality explosion
  - Red Flags: Wrong types, high cardinality, poor design
- Alerting Design
  - Define alert rules
  - Set appropriate thresholds
  - Route alerts properly
  - Red Flags: No alerts, wrong thresholds, poor routing
System Design Skills
- Observability Architecture
  - Monitoring system design
  - Metric collection
  - Alerting pipeline
  - Red Flags: No monitoring, poor architecture, no alerts
- Scalability
  - Handle high cardinality
  - Optimize queries
  - Scale Prometheus
  - Red Flags: No scaling, poor performance, no optimization
Communication Skills
- Clear Explanation
  - Explains monitoring concepts
  - Discusses trade-offs
  - Justifies design decisions
  - Red Flags: Unclear explanations, no justification, confusing delivery
Meta-Specific Focus
- Observability Expertise
  - Understanding of monitoring
  - Prometheus mastery
  - Alerting design
  - Key: Demonstrate observability expertise
- System Design Skills
  - Can design monitoring systems
  - Understands observability challenges
  - Makes informed trade-offs
  - Key: Show practical observability design skills
Summary
Prometheus Key Points:
- Metrics Collection: Pull-based metric collection
- Time-Series Database: Efficient storage of metrics
- PromQL: Powerful query language
- Multi-Dimensional Data Model: Labels for flexible querying
- Alerting: Alertmanager for notifications
- Service Discovery: Automatic target discovery
Common Use Cases:
- Application monitoring
- Infrastructure monitoring
- Service health checks
- Performance monitoring
- Alerting and notifications
- Capacity planning
Best Practices:
- Use appropriate metric types
- Design labels carefully
- Avoid high cardinality
- Use recording rules
- Optimize queries
- Set up proper alerting
- Monitor Prometheus itself
Prometheus is a powerful monitoring system that provides comprehensive observability for modern applications and infrastructure.