Introduction

Apache Druid is a high-performance, column-oriented, distributed data store designed for real-time analytics and OLAP queries, optimized for sub-second responses on large datasets. Understanding Druid is essential for system design interviews involving real-time dashboards, time-series data, and OLAP workloads.

This guide covers:

  • Druid Fundamentals: Architecture, segments, and data ingestion
  • Real-Time Ingestion: Streaming data ingestion
  • Querying: SQL and native query APIs
  • Performance: Indexing, compression, and optimization
  • Best Practices: Data modeling, partitioning, and scaling

What is Apache Druid?

Apache Druid is an analytics database that:

  • Column-Oriented: Stores each column separately, so queries read only the columns they touch
  • Real-Time Ingestion: Ingests streaming data and makes it queryable within seconds
  • Sub-Second Queries: Designed for interactive, sub-second query latency
  • Time-Based Partitioning: Partitions data into time-based segments, a natural fit for time-series data
  • Scalability: Scales horizontally by adding nodes to each tier

Key Concepts

Segment: The unit of storage and replication; an immutable, columnar file covering one time chunk of a datasource

Datasource: The table-like entity that you ingest into and query

Dimension: Column used for filtering and grouping

Metric: Column used for aggregation (pre-aggregated at ingestion when rollup is enabled)

Time Column (__time): The required timestamp column that drives time-based partitioning

Indexing Service: The components (Overlord and Middle Managers) that run ingestion tasks and create segments
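
To make segments concrete, each segment identifier encodes its datasource, the time interval it covers, and a version (plus a partition number when one interval holds several segments). As an illustrative example, a segment of a datasource named events covering one hour might be identified as:

events_2024-01-01T00:00:00.000Z_2024-01-01T01:00:00.000Z_2024-01-02T10:00:00.000Z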

Architecture

High-Level Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
│ Application │     │ Application │     │ Application │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           │ SQL Queries
                           ▼
               ┌───────────────────────┐
               │      Druid Broker     │
               │     (Query Router)    │
               └───────────┬───────────┘
                           │
             ┌─────────────┴─────────────┐
             │                           │
      ┌──────▼──────┐             ┌──────▼──────┐
      │ Historical  │             │  Real-Time  │
      │   Nodes     │             │    Nodes    │
      │ (Segments)  │             │ (Ingestion) │
      └──────┬──────┘             └──────┬──────┘
             │                           │
             └─────────────┬─────────────┘
                           │
                           ▼
               ┌───────────────────────┐
               │      Deep Storage     │
               │       (S3, HDFS)      │
               └───────────────────────┘

Explanation:

  • Client Applications: Applications that query Druid for real-time and historical analytics (e.g., dashboards, analytics platforms, monitoring systems).
  • Druid Broker: Routes queries to appropriate nodes (Historical or Real-Time) and merges results.
  • Historical Nodes: Store and serve historical data segments. Optimized for fast analytical queries on historical data.
  • Real-Time Nodes: Ingest streaming data and serve queries on it before segments are persisted; once segments are published to deep storage, Historical nodes take over serving them. (In current Druid versions this role is played by ingestion tasks on Middle Managers rather than dedicated real-time nodes.)
  • Deep Storage: Long-term storage for segments (e.g., S3, HDFS). Historical nodes load segments from deep storage.
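
Clients typically reach the Broker over HTTP. As a minimal sketch, a SQL query is sent as a JSON body in a POST to the Broker's /druid/v2/sql endpoint (resultFormat controls the shape of the response):

{
  "query": "SELECT country, COUNT(*) AS events FROM events GROUP BY country",
  "resultFormat": "object"
}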

Core Architecture

┌──────────────────────────────────────────────────────┐
│                    Druid Cluster                     │
│                                                      │
│  ┌────────────────────────────────────────────────┐  │
│  │  Coordinator                                   │  │
│  │  (Segment Management, Load Balancing)          │  │
│  └────────────────────────────────────────────────┘  │
│                           │                          │
│  ┌────────────────────────────────────────────────┐  │
│  │  Broker                                        │  │
│  │  (Query Routing, Result Merging)               │  │
│  └────────────────────────────────────────────────┘  │
│                           │                          │
│  ┌────────────────────────────────────────────────┐  │
│  │  Historical Nodes                              │  │
│  │  (Segment Storage, Query Execution)            │  │
│  └────────────────────────────────────────────────┘  │
│                           │                          │
│  ┌────────────────────────────────────────────────┐  │
│  │  Middle Manager                                │  │
│  │  (Real-Time Ingestion, Indexing)               │  │
│  └────────────────────────────────────────────────┘  │
│                           │                          │
│  ┌────────────────────────────────────────────────┐  │
│  │  Deep Storage                                  │  │
│  │  (S3, HDFS, etc.)                              │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘

Data Model

Datasource

Schema (defined in the dataSchema section of an ingestion spec):

{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": ["user_id", "event_type", "country"]
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        },
        {
          "name": "revenue",
          "type": "doubleSum",
          "fieldName": "amount"
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "minute"
      }
    }
  }
}
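
Because queryGranularity is minute, Druid can roll up data at ingestion time: input rows whose timestamps truncate to the same minute and that share identical dimension values are collapsed into a single stored row with pre-aggregated metrics. A sketch of what this looks like for the schema above, assuming rollup is enabled:

Input events:

{"timestamp": "2024-01-01T00:00:05Z", "user_id": "u1", "event_type": "click", "country": "US", "amount": 1.5}
{"timestamp": "2024-01-01T00:00:42Z", "user_id": "u1", "event_type": "click", "country": "US", "amount": 2.5}

Stored row after minute-level rollup:

{"__time": "2024-01-01T00:00:00Z", "user_id": "u1", "event_type": "click", "country": "US", "count": 2, "revenue": 4.0}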

Dimensions vs Metrics

Dimensions:

  • Used for filtering and grouping
  • String, long, float, or complex types
  • Example: user_id, event_type, country

Metrics:

  • Used for aggregation
  • Numeric types
  • Example: count, sum, min, max

Data Ingestion

Batch Ingestion

Native Batch (showing the ioConfig; a complete task spec also includes the dataSchema shown earlier):

{
  "type": "index",
  "spec": {
    "ioConfig": {
      "type": "index",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "*.json"
      },
      "inputFormat": {
        "type": "json"
      }
    }
  }
}

Hadoop Batch:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "hdfs://path/to/data"
      }
    }
  }
}
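
Both batch task types are submitted to the Overlord's task API. As a sketch (taskId is the ID returned by the submit call):

POST /druid/indexer/v1/task                      (body: the JSON spec above)
GET  /druid/indexer/v1/task/{taskId}/status      (poll ingestion progress)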

Real-Time Ingestion

Kafka Ingestion (a supervisor spec, POSTed to the Overlord at /druid/indexer/v1/supervisor; the supervisor runs ingestion tasks continuously and tracks Kafka offsets):

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      },
      "topic": "events",
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000
    }
  }
}

Querying

SQL Queries

Basic Query:

SELECT 
  TIME_FLOOR(__time, 'PT1H') AS hour,
  country,
  COUNT(*) AS events
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY 1 DESC

Aggregation:

SELECT 
  event_type,
  SUM(revenue) AS total_revenue,
  COUNT(*) AS event_count
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY event_type
ORDER BY total_revenue DESC
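
Druid SQL also provides approximate aggregations, which trade a small, bounded error for large savings in time and memory; this matters for high-cardinality dimensions such as user_id. A sketch using the built-in APPROX_COUNT_DISTINCT:

SELECT 
  TIME_FLOOR(__time, 'PT1H') AS hour,
  APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1 DESC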

Native Query API

Timeseries Query:

{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "hour",
  "intervals": ["2024-01-01/2024-01-02"],
  "aggregations": [
    {
      "type": "count",
      "name": "events"
    },
    {
      "type": "doubleSum",
      "name": "revenue",
      "fieldName": "amount"
    }
  ]
}

TopN Query:

{
  "queryType": "topN",
  "dataSource": "events",
  "granularity": "all",
  "dimension": "country",
  "metric": "revenue",
  "threshold": 10,
  "intervals": ["2024-01-01/2024-01-02"],
  "aggregations": [
    {
      "type": "doubleSum",
      "name": "revenue",
      "fieldName": "amount"
    }
  ]
}
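
Native queries are POSTed as JSON to the Broker at /druid/v2/. When grouping on more than one dimension, the native groupBy query type applies; a minimal sketch against the events datasource defined earlier (note that with rollup, the ingested count metric is re-aggregated with longSum, not count):

{
  "queryType": "groupBy",
  "dataSource": "events",
  "granularity": "day",
  "dimensions": ["country", "event_type"],
  "intervals": ["2024-01-01/2024-01-08"],
  "aggregations": [
    {
      "type": "longSum",
      "name": "events",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "revenue",
      "fieldName": "amount"
    }
  ]
}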

Performance Optimization

Segment Optimization

Segment Size:

  • Target: 300-700MB per segment
  • Adjust maxRowsPerSegment
  • Balance query performance and ingestion

Partitioning:

{
  "partitionsSpec": {
    "type": "hashed",
    "targetRowsPerSegment": 5000000
  }
}
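
If most queries filter on a single dimension, secondary partitioning on that dimension can improve segment pruning and data locality. A sketch using single_dim partitioning, assuming country is the dominant filter in the workload:

{
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "country",
    "targetRowsPerSegment": 5000000
  }
}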

Indexing

Bitmap Indexes:

  • Built automatically for string dimensions
  • Filters resolve by combining bitmaps instead of scanning rows
  • Compressed (Roaring bitmaps by default), so they stay compact as cardinality grows

Compression:

  • Column data is compressed automatically (LZ4 by default)
  • String columns are dictionary-encoded
  • Columnar layout plus compression keeps storage and scan costs low

Query Optimization

Use Time Intervals (Druid prunes segments by time, so a time filter bounds how many segments each query scans):

-- Good: Specific time range
SELECT * FROM events
WHERE __time >= '2024-01-01' AND __time < '2024-01-02'

-- Bad: No time filter
SELECT * FROM events

Filter Early:

-- Good: Filter before aggregation
SELECT country, COUNT(*) 
FROM events
WHERE country = 'US' AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country

-- Less efficient: Filter after aggregation
SELECT country, COUNT(*) 
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country
HAVING country = 'US'
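
To check how Druid translates a SQL query into native queries before tuning it, prepend EXPLAIN PLAN FOR. A quick sketch:

EXPLAIN PLAN FOR
SELECT country, COUNT(*) AS events
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country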

Best Practices

1. Data Modeling

  • Use appropriate granularity
  • Choose dimensions carefully
  • Design metrics for queries
  • Plan for time-based partitioning

2. Ingestion

  • Use real-time for streaming
  • Batch for historical data
  • Optimize segment size
  • Monitor ingestion lag

3. Query Performance

  • Always use time intervals
  • Filter on dimensions early
  • Use appropriate aggregations
  • Limit result sets

4. Scaling

  • Scale historical nodes for storage
  • Scale brokers for query throughput
  • Monitor segment distribution
  • Plan for capacity

What Interviewers Look For

Analytics Database Understanding

  1. Druid Concepts
    • Understanding of columnar storage
    • Real-time ingestion
    • Time-based partitioning
    • Red Flags: No Druid understanding, wrong model, no partitioning
  2. OLAP Patterns
    • Aggregation strategies
    • Time-series queries
    • Dimension modeling
    • Red Flags: Poor aggregations, no time-series, poor dimensions
  3. Performance
    • Query optimization
    • Segment design
    • Indexing strategies
    • Red Flags: No optimization, poor segments, no indexing

Problem-Solving Approach

  1. Data Modeling
    • Dimension and metric design
    • Time partitioning
    • Granularity selection
    • Red Flags: Poor design, no partitioning, wrong granularity
  2. Query Design
    • Time interval usage
    • Filter optimization
    • Aggregation design
    • Red Flags: No time intervals, poor filters, no aggregation

System Design Skills

  1. Analytics Architecture
    • Druid cluster design
    • Ingestion pipeline
    • Query optimization
    • Red Flags: No architecture, poor ingestion, no optimization
  2. Scalability
    • Horizontal scaling
    • Segment management
    • Performance tuning
    • Red Flags: No scaling, poor segments, no tuning

Communication Skills

  1. Clear Explanation
    • Explains Druid concepts
    • Discusses trade-offs
    • Justifies design decisions
    • Red Flags: Unclear explanations, no justification, confusing delivery

Meta-Specific Focus

  1. Analytics Expertise
    • Understanding of OLAP systems
    • Druid mastery
    • Real-time analytics
    • Key: Demonstrate analytics expertise
  2. System Design Skills
    • Can design analytics systems
    • Understands OLAP challenges
    • Makes informed trade-offs
    • Key: Show practical analytics design skills

Summary

Apache Druid Key Points:

  • Column-Oriented: Optimized for analytics queries
  • Real-Time Ingestion: Stream data ingestion
  • Sub-Second Queries: Fast query performance
  • Time-Based Partitioning: Optimized for time-series
  • Scalability: Horizontal scaling

Common Use Cases:

  • Real-time analytics dashboards
  • Time-series analytics
  • OLAP workloads
  • Event analytics
  • User behavior analysis
  • Business intelligence

Best Practices:

  • Use appropriate granularity
  • Design dimensions and metrics carefully
  • Always use time intervals in queries
  • Optimize segment size
  • Filter early in queries
  • Monitor ingestion and query performance
  • Plan for horizontal scaling

Apache Druid is a powerful analytics database optimized for real-time OLAP queries on large datasets.