Introduction
Apache Druid is a high-performance, column-oriented, distributed data store designed for real-time analytics and OLAP queries. It’s optimized for sub-second queries on large datasets. Understanding Druid is essential for system design interviews involving real-time analytics, time-series analytics, and OLAP workloads.
This guide covers:
- Druid Fundamentals: Architecture, segments, and data ingestion
- Real-Time Ingestion: Streaming data ingestion
- Querying: SQL and native query APIs
- Performance: Indexing, compression, and optimization
- Best Practices: Data modeling, partitioning, and scaling
What is Apache Druid?
Apache Druid is an analytics database built around:
- Column-Oriented: Stores data by column, so queries read only the columns they touch
- Real-Time Ingestion: Streaming data becomes queryable within seconds of arrival
- Sub-Second Queries: Designed for interactive, sub-second analytical queries
- Time-Based Partitioning: Data is chunked by time, which suits time-series workloads
- Scalability: Each node type scales out horizontally
Key Concepts
Segment: The immutable unit of storage and replication; each segment covers one time chunk of a datasource
Datasource: Table-like entity that queries run against
Dimension: Column used for filtering and grouping
Metric: Numeric column used for aggregation
Time Column: The __time timestamp column that drives partitioning
Indexing Service: The Overlord and MiddleManager processes that run ingestion tasks and build segments
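A segment is identified by its datasource, the interval it covers, and a version (plus a partition number), so a segment holding one hour of the events datasource would have an identifier along these hypothetical lines:
events_2024-01-01T00:00:00.000Z_2024-01-01T01:00:00.000Z_2024-01-02T03:15:00.000Z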
Architecture
High-Level Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
│ Application │     │ Application │     │ Application │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
│ │ │
└────────────────────┴────────────────────┘
│
│ SQL Queries
│
▼
┌─────────────────────────┐
│ Druid Broker │
│ (Query Router) │
└──────┬──────────────────┘
│
┌─────────────┴─────────────┐
│ │
┌──────▼──────┐ ┌───────▼──────┐
│ Historical │ │ Real-Time │
│ Nodes │ │ Nodes │
│ (Segments) │ │ (Ingestion) │
└──────┬──────┘ └───────┬──────┘
│ │
└──────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ Deep Storage │
│ (S3, HDFS) │
└─────────────────────────┘
Explanation:
- Client Applications: Applications that query Druid for real-time and historical analytics (e.g., dashboards, analytics platforms, monitoring systems).
- Druid Broker: Routes queries to appropriate nodes (Historical or Real-Time) and merges results.
- Historical Nodes: Store and serve historical data segments. Optimized for fast analytical queries on historical data.
- Real-Time Nodes: Ingest and index streaming data, serving it for queries as it arrives. Completed segments are published to deep storage and handed off to Historical nodes. (In current Druid versions this role is played by indexing tasks running on MiddleManagers rather than standalone real-time nodes.)
- Deep Storage: Long-term storage for segments (e.g., S3, HDFS). Historical nodes load segments from deep storage.
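Because clients talk only to the Broker, issuing a query is a single HTTP call. A minimal sketch in Python, assuming a Broker on localhost:8082 (the default port) and the events datasource used throughout this guide:

# Minimal sketch: querying Druid through the Broker's SQL endpoint.
# Assumes a Broker on localhost:8082; adjust host/port for your cluster.
import json
import urllib.request

query = {"query": "SELECT country, COUNT(*) AS events FROM events "
                  "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
                  "GROUP BY country"}
req = urllib.request.Request(
    "http://localhost:8082/druid/v2/sql",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for row in json.loads(resp.read()):  # one JSON object per result row
        print(row)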
Core Architecture
┌─────────────────────────────────────────────────────────┐
│ Druid Cluster │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Coordinator │ │
│ │ (Segment Management, Load Balancing) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Broker │ │
│ │ (Query Routing, Result Merging) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Historical Nodes │ │
│ │ (Segment Storage, Query Execution) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Middle Manager │ │
│ │ (Real-Time Ingestion, Indexing) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Deep Storage │ │
│ │ (S3, HDFS, etc.) │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Data Model
Datasource
Schema:
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["user_id", "event_type", "country"]
      },
      "metricsSpec": [
        { "name": "count", "type": "count" },
        { "name": "revenue", "type": "doubleSum", "fieldName": "amount" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "minute"
      }
    }
  }
}
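Here segmentGranularity controls how data is chunked into segments (one interval per hour), while queryGranularity controls how finely timestamps are kept: they are truncated to the minute, which bounds the finest resolution queries can see and lets rows within the same minute be rolled up.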
Dimensions vs Metrics
Dimensions:
- Used for filtering and grouping
- String, long, float, or complex types
- Example: user_id, event_type, country
Metrics:
- Used for aggregation
- Numeric types
- Example: count, sum, min, max
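Rollup ties the two together: with queryGranularity set to minute, raw rows that share a truncated timestamp and identical dimension values are combined at ingestion. For example, these two hypothetical input events:

{"timestamp": "2024-01-01T10:15:02Z", "user_id": "u1", "event_type": "purchase", "country": "US", "amount": 5.0}
{"timestamp": "2024-01-01T10:15:40Z", "user_id": "u1", "event_type": "purchase", "country": "US", "amount": 7.0}

would be stored as a single row:

__time=2024-01-01T10:15:00Z, user_id=u1, event_type=purchase, country=US, count=2, revenue=12.0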
Data Ingestion
Batch Ingestion
Native Batch:
{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "*.json"
      },
      "inputFormat": {
        "type": "json"
      }
    }
  }
}
Hadoop Batch:
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://path/to/data"
      }
    }
  }
}
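Either spec is submitted as a task to the Overlord. A minimal sketch in Python, assuming an Overlord on localhost:8090 (the default port) and one of the specs above saved as task.json (the file name is illustrative):

# Minimal sketch: submitting a batch ingestion spec to the Overlord.
import json
import urllib.request

with open("task.json", "rb") as f:
    req = urllib.request.Request(
        "http://localhost:8090/druid/indexer/v1/task",
        data=f.read(),
        headers={"Content-Type": "application/json"},
    )
response = json.loads(urllib.request.urlopen(req).read())
print(response["task"])  # the task ID, which can be polled for status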
Real-Time Ingestion
Kafka Ingestion:
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      },
      "topic": "events",
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000
    }
  }
}
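Streaming specs are managed by a long-running supervisor rather than a one-shot task. A minimal sketch in Python, assuming an Overlord on localhost:8090 and the spec above saved as supervisor.json (the file name is illustrative):

# Minimal sketch: starting the Kafka supervisor and checking its health.
import json
import urllib.request

base = "http://localhost:8090/druid/indexer/v1/supervisor"
with open("supervisor.json", "rb") as f:
    req = urllib.request.Request(
        base, data=f.read(), headers={"Content-Type": "application/json"}
    )
supervisor_id = json.loads(urllib.request.urlopen(req).read())["id"]

# The status endpoint reports supervisor state and, for Kafka, consumer lag.
status = json.loads(urllib.request.urlopen(f"{base}/{supervisor_id}/status").read())
print(status["payload"]["state"])  # e.g. RUNNING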
Querying
SQL Queries
Basic Query:
SELECT
TIME_FLOOR(__time, 'PT1H') AS hour,
country,
COUNT(*) AS events
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY 1 DESC
Aggregation:
SELECT
event_type,
SUM(revenue) AS total_revenue,
COUNT(*) AS event_count
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY event_type
ORDER BY total_revenue DESC
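To see how the SQL layer translates a query into native queries like those in the next section, prefix it with EXPLAIN PLAN FOR:

EXPLAIN PLAN FOR
SELECT
  event_type,
  SUM(revenue) AS total_revenue
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY event_type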
Native Query API
Timeseries Query:
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "hour",
  "intervals": ["2024-01-01/2024-01-02"],
  "aggregations": [
    { "type": "count", "name": "events" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "amount" }
  ]
}
TopN Query:
{
  "queryType": "topN",
  "dataSource": "events",
  "granularity": "all",
  "dimension": "country",
  "metric": "revenue",
  "threshold": 10,
  "intervals": ["2024-01-01/2024-01-02"],
  "aggregations": [
    { "type": "doubleSum", "name": "revenue", "fieldName": "amount" }
  ]
}
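Note that topN is approximate: each data process ranks its local top results before the Broker merges them, so rankings can differ slightly from an exact groupBy with ORDER BY and LIMIT. Prefer groupBy when exact ranking matters more than speed.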
Performance Optimization
Segment Optimization
Segment Size:
- Target: 300-700 MB per segment (roughly 5 million rows)
- Adjust maxRowsPerSegment to balance query performance against ingestion overhead
Partitioning:
{
  "partitionsSpec": {
    "type": "hashed",
    "targetRowsPerSegment": 5000000
  }
}
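Hashed partitioning spreads rows evenly across segments; Druid also supports range partitioning on a dimension, which can speed up queries that filter on that dimension at the cost of potentially skewed segment sizes.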
Indexing
Bitmap Indexes:
- Built automatically for string dimensions
- Enable fast filtering without scanning rows
- Stored as compressed bitmaps (Roaring by default), so they stay compact as cardinality grows
Compression:
- Columns are compressed automatically (LZ4 by default for numeric columns)
- String columns are dictionary-encoded
- Columnar layout keeps storage small and scans cheap
Query Optimization
Use Time Intervals:
-- Good: Specific time range
SELECT * FROM events
WHERE __time >= '2024-01-01' AND __time < '2024-01-02'
-- Bad: No time filter
SELECT * FROM events
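Time filters matter because the Broker prunes segments by comparing the query's interval against each segment's interval; with a tight time range, most segments are never touched.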
Filter Early:
-- Good: Filter before aggregation
SELECT country, COUNT(*)
FROM events
WHERE country = 'US' AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country
-- Less efficient: Filter after aggregation
SELECT country, COUNT(*)
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country
HAVING country = 'US'
Best Practices
1. Data Modeling
- Use appropriate granularity
- Choose dimensions carefully
- Design metrics for queries
- Plan for time-based partitioning
2. Ingestion
- Use real-time for streaming
- Batch for historical data
- Optimize segment size
- Monitor ingestion lag
3. Query Performance
- Always use time intervals
- Filter on dimensions early
- Use appropriate aggregations
- Limit result sets
4. Scaling
- Scale historical nodes for storage
- Scale brokers for query throughput
- Monitor segment distribution
- Plan for capacity
What Interviewers Look For
Analytics Database Understanding
- Druid Concepts
- Understanding of columnar storage
- Real-time ingestion
- Time-based partitioning
- Red Flags: No Druid understanding, wrong model, no partitioning
- OLAP Patterns
- Aggregation strategies
- Time-series queries
- Dimension modeling
- Red Flags: Poor aggregations, no time-series, poor dimensions
- Performance
- Query optimization
- Segment design
- Indexing strategies
- Red Flags: No optimization, poor segments, no indexing
Problem-Solving Approach
- Data Modeling
- Dimension and metric design
- Time partitioning
- Granularity selection
- Red Flags: Poor design, no partitioning, wrong granularity
- Query Design
- Time interval usage
- Filter optimization
- Aggregation design
- Red Flags: No time intervals, poor filters, no aggregation
System Design Skills
- Analytics Architecture
- Druid cluster design
- Ingestion pipeline
- Query optimization
- Red Flags: No architecture, poor ingestion, no optimization
- Scalability
- Horizontal scaling
- Segment management
- Performance tuning
- Red Flags: No scaling, poor segments, no tuning
Communication Skills
- Clear Explanation
- Explains Druid concepts
- Discusses trade-offs
- Justifies design decisions
- Red Flags: Unclear explanations, no justification, confusing
Meta-Specific Focus
- Analytics Expertise
- Understanding of OLAP systems
- Druid mastery
- Real-time analytics
- Key: Demonstrate analytics expertise
- System Design Skills
- Can design analytics systems
- Understands OLAP challenges
- Makes informed trade-offs
- Key: Show practical analytics design skills
Summary
Apache Druid Key Points:
- Column-Oriented: Optimized for analytical scans over selected columns
- Real-Time Ingestion: Streaming data is queryable within seconds
- Sub-Second Queries: Fast queries over indexed, time-partitioned segments
- Time-Based Partitioning: Optimized for time-series workloads
- Scalability: Every node type scales horizontally
Common Use Cases:
- Real-time analytics dashboards
- Time-series analytics
- OLAP workloads
- Event analytics
- User behavior analysis
- Business intelligence
Best Practices:
- Use appropriate granularity
- Design dimensions and metrics carefully
- Always use time intervals in queries
- Optimize segment size
- Filter early in queries
- Monitor ingestion and query performance
- Plan for horizontal scaling
Apache Druid is a powerful analytics database optimized for real-time OLAP queries on large datasets.