Elasticsearch: Comprehensive Guide to Distributed Search and Analytics Engine

Introduction

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It provides near real-time search capabilities, powerful analytics, and horizontal scalability, making it ideal for search engines, log analytics, and data exploration.

This guide covers:

Elasticsearch Fundamentals: Core concepts, architecture, and components
Indexing and Search: How to index and search data
Aggregations: Analytics and data exploration
Scalability: Cluster management and horizontal scaling
Use Cases: Real-world applications and patterns
Best Practices: Performance, reliability, and optimization

What is Elasticsearch?

Elasticsearch is a distributed search and analytics engine that offers:

Full-Text Search: Powerful text search with relevance scoring
Real-Time: Near real-time indexing and search
Scalability: Horizontal scaling with automatic sharding
Analytics: Aggregations for data analysis
RESTful API: Simple HTTP-based interface
Schema-Free: JSON document storage with dynamic mapping

Key Concepts

Index: A collection of documents (similar to a database in SQL)

Document: A JSON object stored in an index (similar to a row in SQL)

Type: (Deprecated in 7.x+) A category of documents within an index

Shard: A subset of an index (for horizontal scaling)

Replica: A copy of a shard (for high availability)

Node: A single Elasticsearch instance

Cluster: A collection of nodes working together

Architecture

High-Level Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│   Client    │────▶│   Client    │
│ Application │     │ Application │     │ Application │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │  Elasticsearch Cluster  │
              │                         │
              │  ┌──────────┐           │
              │  │  Master  │           │
              │  │  Node    │           │
              │  └────┬─────┘           │
              │       │                 │
              │  ┌────┴─────┐           │
              │  │ Data     │           │
              │  │ Nodes    │           │
              │  └──────────┘           │
              │                         │
              │  ┌───────────────────┐  │
              │  │  Indices          │  │
              │  │  (Shards)         │  │
              │  └───────────────────┘  │
              └─────────────────────────┘

Explanation:

Client Applications: Applications that send search and indexing requests to Elasticsearch (e.g., web applications, log aggregators).
Elasticsearch Cluster: A collection of Elasticsearch nodes working together to store and search data.
Master Node: One node in the cluster that manages cluster state and coordinates operations (can also be a data node).
Data Nodes: Nodes that store data (indices and shards) and perform search operations.
Indices (Shards): Logical partitions of data distributed across nodes for scalability and performance.

Core Architecture

┌─────────────────────────────────────────────────────────┐
│                    Elasticsearch Cluster                 │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │  Node 1  │  │  Node 2  │  │  Node 3  │              │
│  │ (Master) │  │          │  │          │              │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘              │
│       │             │             │                     │
│  ┌────▼─────────────▼─────────────▼─────┐              │
│  │         Index: "products"            │              │
│  │  ┌──────────┐  ┌──────────┐         │              │
│  │  │ Shard 0  │  │ Shard 1  │         │              │
│  │  │ (Primary)│  │ (Primary)│         │              │
│  │  └────┬─────┘  └────┬─────┘         │              │
│  │       │             │                │              │
│  │  ┌────▼─────┐  ┌────▼─────┐         │              │
│  │  │ Replica  │  │ Replica  │         │              │
│  │  │  Shard 0 │  │  Shard 1 │         │              │
│  │  └──────────┘  └──────────┘         │              │
│  └─────────────────────────────────────┘              │
└─────────────────────────────────────────────────────────┘

Common Use Cases

1. Full-Text Search

Search through large volumes of text data with relevance scoring.

Use Cases:

Product search (e-commerce)
Content search (blogs, articles)
Document search
User search

Example:

// Index a document
POST /products/_doc/1
{
  "title": "iPhone 15 Pro",
  "description": "Latest iPhone with advanced camera",
  "price": 999,
  "category": "electronics"
}

// Search
GET /products/_search
{
  "query": {
    "match": {
      "description": "camera phone"
    }
  }
}

2. Log Analytics

Store, search, and analyze log data in real-time.

Use Cases:

Application logs
System logs
Security logs
Error tracking

Example:

// Index log
POST /logs/_doc
{
  "timestamp": "2025-12-29T10:00:00Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "service": "payment-service",
  "user_id": "user_123"
}

// Search errors
GET /logs/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" }},
        { "range": { "timestamp": { "gte": "now-1h" }}}
      ]
    }
  }
}

3. Real-Time Analytics

Aggregate and analyze data in real-time.

Use Cases:

Metrics and monitoring
Business intelligence
Time-series analytics
Dashboard data

Example:

// Aggregation: Count errors by service
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "errors_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      }
    }
  }
}

Indexing and Search

Indexing Documents

Basic Indexing:

// Index a single document
POST /products/_doc/1
{
  "title": "Laptop",
  "price": 1299,
  "category": "electronics"
}

// Bulk indexing
POST /products/_bulk
{"index":{"_id":"1"}}
{"title":"Laptop","price":1299}
{"index":{"_id":"2"}}
{"title":"Phone","price":699}

Search Queries

Match Query (Full-Text):

GET /products/_search
{
  "query": {
    "match": {
      "title": "laptop computer"
    }
  }
}

Term Query (Exact Match):

GET /products/_search
{
  "query": {
    "term": {
      "category.keyword": "electronics"
    }
  }
}

Bool Query (Combined):

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "laptop" }}
      ],
      "filter": [
        { "range": { "price": { "gte": 1000, "lte": 2000 }}}
      ]
    }
  }
}

Aggregations

Metrics Aggregations

Average, Sum, Min, Max:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    },
    "max_price": {
      "max": { "field": "price" }
    }
  }
}

Bucket Aggregations

Terms Aggregation:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      }
    }
  }
}

Date Histogram:

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h"
      }
    }
  }
}

Scalability and Performance

Sharding Strategy

Shard Configuration:

// Create index with custom shard settings
PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Best Practices:

Shard Size: 10-50GB per shard
Shard Count: Plan for growth (hard to change later)
Replicas: At least 1 replica for high availability

Cluster Management

Cluster Health:

GET /_cluster/health
{
  "cluster_name": "my-cluster",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3
}

Node Roles:

Master Node: Manages cluster state
Data Node: Stores data and executes queries
Ingest Node: Pre-processes documents
Coordinating Node: Routes requests (default role)

Use Cases and Patterns

1. E-Commerce Search

Requirements:

Product search with relevance
Faceted search (filters)
Autocomplete
Spell correction

Implementation:

// Index with mapping
PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "price": { "type": "float" },
      "category": { "type": "keyword" }
    }
  }
}

// Faceted search
GET /products/_search
{
  "query": { "match": { "title": "laptop" }},
  "aggs": {
    "categories": {
      "terms": { "field": "category.keyword" }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}

2. Log Analytics (ELK Stack)

Components:

Elasticsearch: Storage and search
Logstash: Log processing
Kibana: Visualization

Pattern:

Logs → Logstash → Elasticsearch → Kibana

Index Template:

PUT /_template/logs-template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}

3. Time-Series Data

Use Cases:

Metrics and monitoring
IoT data
Financial data

Index Pattern:

// Daily indices: metrics-2025-12-29
PUT /metrics-2025-12-29
{
  "settings": {
    "number_of_shards": 1
  }
}

Best Practices

Index Design

1. Index per Time Period:

Daily indices: logs-2025-12-29
Weekly indices: logs-2025-w52
Monthly indices: logs-2025-12

2. Index Templates:

PUT /_template/logs-template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

3. Index Lifecycle Management:

Hot: Recent data (SSD, fast)
Warm: Older data (HDD, slower)
Cold: Archive data (cheap storage)
Delete: Very old data

Query Optimization

1. Use Filters for Exact Matches:

// Good: Filter (cached)
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "active" }}
      ]
    }
  }
}

// Bad: Query (not cached)
{
  "query": {
    "term": { "status": "active" }
  }
}

2. Limit Result Size:

GET /products/_search
{
  "size": 20,  // Limit results
  "from": 0    // Pagination
}

3. Use Source Filtering:

GET /products/_search
{
  "_source": ["title", "price"],  // Only return needed fields
  "query": { "match_all": {} }
}

Performance Tuning

1. Refresh Interval:

PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s"  // Reduce refresh frequency
  }
}

2. Bulk Operations:

// Bulk index (faster than individual requests)
POST /products/_bulk
{"index":{}}
{"title":"Product 1"}
{"index":{}}
{"title":"Product 2"}

3. Mapping Optimization:

Use keyword for exact matches
Use text for full-text search
Disable _source if not needed (saves storage)

Common Patterns

Autocomplete

Completion Suggester:

PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

// Search
GET /products/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "iph",
      "completion": {
        "field": "title"
      }
    }
  }
}

Fuzzy Search

Fuzzy Query:

GET /products/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "lapto",
        "fuzziness": "AUTO"
      }
    }
  }
}

Highlighting

Highlight Matches:

GET /products/_search
{
  "query": { "match": { "title": "laptop" }},
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

What Interviewers Look For

Search Engine Skills

Full-Text Search Understanding
- Understanding of indexing and search
- Relevance scoring
- Query types and optimization
- Red Flags: No search understanding, poor queries, no optimization
Distributed Search Architecture
- Sharding and replication
- Cluster management
- Horizontal scaling
- Red Flags: No sharding strategy, single node, no scaling
Analytics Capabilities
- Aggregations
- Real-time analytics
- Time-series handling
- Red Flags: No aggregations, no analytics, poor time-series

Problem-Solving Approach

Index Design
- Proper index structure
- Mapping optimization
- Lifecycle management
- Red Flags: Poor index design, no lifecycle, inefficient mappings
Query Optimization
- Efficient queries
- Filter vs query
- Performance tuning
- Red Flags: Inefficient queries, no optimization, poor performance

System Design Skills

Scalability
- Horizontal scaling
- Shard management
- Cluster design
- Red Flags: Vertical scaling only, no sharding, poor cluster design
Use Case Application
- Search engines
- Log analytics
- Real-time analytics
- Red Flags: Wrong use cases, no application, poor understanding

Communication Skills

Clear Explanation
- Explains search concepts
- Discusses trade-offs
- Justifies design decisions
- Red Flags: Unclear explanations, no justification, confusing

Meta-Specific Focus

Search Engine Expertise
- Understanding of search engines
- Elasticsearch mastery
- Real-world application
- Key: Demonstrate search engine expertise

Summary

Elasticsearch Key Points:

Distributed Search: Horizontal scaling with sharding
Real-Time: Near real-time indexing and search
Full-Text Search: Powerful text search with relevance
Analytics: Aggregations for data analysis
RESTful API: Simple HTTP-based interface

Common Use Cases:

Full-text search (e-commerce, content)
Log analytics (ELK stack)
Real-time analytics (metrics, dashboards)
Time-series data (monitoring, IoT)

Best Practices:

Index per time period
Use filters for exact matches
Optimize mappings
Bulk operations for performance
Index lifecycle management

Elasticsearch is a powerful tool for building search engines, log analytics, and real-time data exploration systems with horizontal scalability and near real-time performance.

Introduction

What is Elasticsearch?

Key Concepts

Architecture

High-Level Architecture

Core Architecture

Common Use Cases

1. Full-Text Search

2. Log Analytics

3. Real-Time Analytics

Indexing and Search

Indexing Documents

Search Queries

Aggregations

Metrics Aggregations

Bucket Aggregations

Scalability and Performance

Sharding Strategy

Cluster Management

Use Cases and Patterns

1. E-Commerce Search

2. Log Analytics (ELK Stack)

3. Time-Series Data

Best Practices

Index Design

Query Optimization

Performance Tuning

Common Patterns

Autocomplete

Fuzzy Search

Highlighting

What Interviewers Look For

Search Engine Skills

Problem-Solving Approach

System Design Skills

Communication Skills

Meta-Specific Focus

Summary

Related Posts

Recent Posts