Introduction
Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It provides near real-time search capabilities, powerful analytics, and horizontal scalability, making it ideal for search engines, log analytics, and data exploration.
This guide covers:
- Elasticsearch Fundamentals: Core concepts, architecture, and components
- Indexing and Search: How to index and search data
- Aggregations: Analytics and data exploration
- Scalability: Cluster management and horizontal scaling
- Use Cases: Real-world applications and patterns
- Best Practices: Performance, reliability, and optimization
What is Elasticsearch?
Elasticsearch is a distributed search and analytics engine that offers:
- Full-Text Search: Powerful text search with relevance scoring
- Real-Time: Near real-time indexing and search
- Scalability: Horizontal scaling with automatic sharding
- Analytics: Aggregations for data analysis
- RESTful API: Simple HTTP-based interface
- Schema-Flexible: JSON document storage with dynamic mapping (field types are inferred unless mapped explicitly)
Key Concepts
Index: A collection of documents (roughly comparable to a table in SQL now that types are gone)
Document: A JSON object stored in an index (similar to a row in SQL)
Type: A category of documents within an index (deprecated in 7.x, removed in 8.0)
Shard: A subset of an index (for horizontal scaling)
Replica: A copy of a shard (for high availability)
Node: A single Elasticsearch instance
Cluster: A collection of nodes working together
Architecture
High-Level Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │     │   Client    │     │   Client    │
│ Application │     │ Application │     │ Application │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │  Elasticsearch Cluster  │
              │                         │
              │       ┌──────────┐      │
              │       │  Master  │      │
              │       │   Node   │      │
              │       └────┬─────┘      │
              │            │            │
              │       ┌────┴─────┐      │
              │       │   Data   │      │
              │       │  Nodes   │      │
              │       └──────────┘      │
              │                         │
              │  ┌───────────────────┐  │
              │  │      Indices      │  │
              │  │     (Shards)      │  │
              │  └───────────────────┘  │
              └─────────────────────────┘
Explanation:
- Client Applications: Applications that send search and indexing requests to Elasticsearch (e.g., web applications, log aggregators).
- Elasticsearch Cluster: A collection of Elasticsearch nodes working together to store and search data.
- Master Node: The elected node that manages cluster state and coordinates cluster-wide operations such as index creation and shard allocation (it can also be a data node).
- Data Nodes: Nodes that store data (indices and shards) and perform search operations.
- Indices (Shards): Logical partitions of data distributed across nodes for scalability and performance.
Core Architecture
┌────────────────────────────────────────────────────────┐
│                 Elasticsearch Cluster                  │
│                                                        │
│        ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│        │  Node 1  │  │  Node 2  │  │  Node 3  │        │
│        │ (Master) │  │          │  │          │        │
│        └────┬─────┘  └────┬─────┘  └────┬─────┘        │
│             │             │             │              │
│        ┌────▼─────────────▼─────────────▼─────┐        │
│        │          Index: "products"           │        │
│        │    ┌──────────┐      ┌──────────┐    │        │
│        │    │ Shard 0  │      │ Shard 1  │    │        │
│        │    │ (Primary)│      │ (Primary)│    │        │
│        │    └────┬─────┘      └────┬─────┘    │        │
│        │         │                 │          │        │
│        │    ┌────▼─────┐      ┌────▼─────┐    │        │
│        │    │ Replica  │      │ Replica  │    │        │
│        │    │ Shard 0  │      │ Shard 1  │    │        │
│        │    └──────────┘      └──────────┘    │        │
│        └──────────────────────────────────────┘        │
└────────────────────────────────────────────────────────┘
Common Use Cases
1. Full-Text Search
Search through large volumes of text data with relevance scoring.
Use Cases:
- Product search (e-commerce)
- Content search (blogs, articles)
- Document search
- User search
Example:
// Index a document
POST /products/_doc/1
{
  "title": "iPhone 15 Pro",
  "description": "Latest iPhone with advanced camera",
  "price": 999,
  "category": "electronics"
}
// Search
GET /products/_search
{
  "query": {
    "match": {
      "description": "camera phone"
    }
  }
}
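To make the match query above more concrete, here is a minimal Python sketch of the idea behind full-text search: an inverted index that maps terms to documents, plus a deliberately simple score. It is an illustration only. Elasticsearch relies on Lucene analyzers and BM25 scoring, and the `tokenize`, `build_index`, and `match` names are hypothetical helpers, not part of any Elasticsearch API.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, field):
    """Map each term to {doc_id: term frequency} for one field."""
    index = defaultdict(dict)
    for doc_id, doc in docs.items():
        for term, freq in Counter(tokenize(doc[field])).items():
            index[term][doc_id] = freq
    return index

def match(index, query):
    """OR-match the query terms; score = summed term frequency (not BM25)."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return scores.most_common()

docs = {
    "1": {"description": "Latest iPhone with advanced camera"},
    "2": {"description": "Lightweight laptop for travel"},
}
index = build_index(docs, "description")
hits = match(index, "camera phone")
print(hits)  # document 1 matches on "camera"; "phone" is not a token of "iPhone"
```

Because a match query ORs its terms by default, one matching term is enough for a hit; the score only decides the ordering.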
2. Log Analytics
Store, search, and analyze log data in near real time.
Use Cases:
- Application logs
- System logs
- Security logs
- Error tracking
Example:
// Index log
POST /logs/_doc
{
  "timestamp": "2025-12-29T10:00:00Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "service": "payment-service",
  "user_id": "user_123"
}
// Search errors
GET /logs/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" }},
        { "range": { "timestamp": { "gte": "now-1h" }}}
      ]
    }
  }
}
3. Real-Time Analytics
Aggregate and analyze data in near real time.
Use Cases:
- Metrics and monitoring
- Business intelligence
- Time-series analytics
- Dashboard data
Example:
// Aggregation: Count errors by service
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "errors_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      }
    }
  }
}
Indexing and Search
Indexing Documents
Basic Indexing:
// Index a single document
POST /products/_doc/1
{
  "title": "Laptop",
  "price": 1299,
  "category": "electronics"
}
// Bulk indexing (one action line, then one source line, per document)
POST /products/_bulk
{"index":{"_id":"1"}}
{"title":"Laptop","price":1299}
{"index":{"_id":"2"}}
{"title":"Phone","price":699}
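The bulk body is newline-delimited JSON rather than a single JSON document, which is easy to get wrong when building it by hand. The sketch below shows one way to assemble it in Python using only the standard library. `build_bulk_body` is a hypothetical helper, and sending the request (with the `Content-Type: application/x-ndjson` header) is left out.

```python
import json

def build_bulk_body(docs):
    """Return an NDJSON _bulk body: one action line, then one source line, per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("1", {"title": "Laptop", "price": 1299}),
    ("2", {"title": "Phone", "price": 699}),
])
print(body, end="")
```

Each document must stay on a single line, so pretty-printed JSON cannot be used inside a bulk body.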
Search Queries
Match Query (Full-Text):
GET /products/_search
{
  "query": {
    "match": {
      "title": "laptop computer"
    }
  }
}
Term Query (Exact Match):
GET /products/_search
{
  "query": {
    "term": {
      "category.keyword": "electronics"
    }
  }
}
Bool Query (Combined):
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "laptop" }}
      ],
      "filter": [
        { "range": { "price": { "gte": 1000, "lte": 2000 }}}
      ]
    }
  }
}
Aggregations
Metrics Aggregations
Average, Sum, Min, Max:
GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    },
    "max_price": {
      "max": { "field": "price" }
    }
  }
}
Bucket Aggregations
Terms Aggregation:
GET /products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      }
    }
  }
}
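Conceptually, a terms aggregation is a group-by-and-count over one field that keeps only the largest buckets. The Python sketch below mimics that on an in-memory list. `terms_aggregation` is a hypothetical function, and the output only imitates the `key`/`doc_count` shape of a real response.

```python
from collections import Counter

def terms_aggregation(docs, field, size=10):
    """Count documents per distinct field value and keep the `size` largest buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common(size)]

docs = [
    {"title": "Laptop", "category": "electronics"},
    {"title": "Phone", "category": "electronics"},
    {"title": "Desk", "category": "furniture"},
]
buckets = terms_aggregation(docs, "category")
print(buckets)
```

In a real cluster each shard computes its own top terms and the coordinating node merges them, so counts for high-cardinality fields can be approximate.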
Date Histogram:
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h"
      }
    }
  }
}
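A date histogram rounds each timestamp down to the start of its interval and counts documents per bucket. Here is a small Python sketch of that rounding for hourly buckets. The function name and the bucket shape are illustrative assumptions, and unlike Elasticsearch this sketch does not emit empty buckets for hours with no documents.

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_histogram(timestamps):
    """Bucket ISO-8601 timestamps by the start of their UTC hour."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
        counts[dt.replace(minute=0, second=0, microsecond=0)] += 1
    return [
        {"key_as_string": key.strftime("%Y-%m-%dT%H:%M:%SZ"), "doc_count": n}
        for key, n in sorted(counts.items())
    ]

buckets = hourly_histogram([
    "2025-12-29T10:00:00Z",
    "2025-12-29T10:45:10Z",
    "2025-12-29T11:05:00Z",
])
print(buckets)
```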
Scalability and Performance
Sharding Strategy
Shard Configuration:
// Create index with custom shard settings
PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
Best Practices:
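- Shard Size: Aim for roughly 10-50GB per shard
- Shard Count: Plan for growth; the number of primary shards is fixed at index creation and can only be changed by splitting, shrinking, or reindexing
- Replicas: At least 1 replica for high availability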
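Two pieces of arithmetic sit behind these guidelines: estimating a shard count from the expected data size, and routing each document to a shard by hashing its ID. The Python sketch below illustrates both under stated assumptions. The 30 GB target is an arbitrary point inside the 10-50GB range, and CRC32 is only a deterministic stand-in for the Murmur3-based hash Elasticsearch actually uses; both function names are hypothetical.

```python
import math
import zlib

def primary_shards_for(total_gb, target_shard_gb=30):
    """Estimate a primary shard count from expected data size and a target shard size."""
    return max(1, math.ceil(total_gb / target_shard_gb))

def route(doc_id, number_of_shards):
    """Pick a shard from a hash of the routing value (the document ID by default).

    CRC32 is a stand-in here; Elasticsearch uses its own Murmur3-based hash.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

shards = primary_shards_for(total_gb=120)  # 120 GB at about 30 GB per shard
ids = [f"doc-{i}" for i in range(1000)]
# Count how many documents would land on a different shard with one more primary.
moved = sum(route(i, shards) != route(i, shards + 1) for i in ids)
print(shards, moved)
```

Because the shard is derived from a modulo over the shard count, adding a primary shard would send most existing documents to a different shard, which is why the count cannot simply be changed in place.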
Cluster Management
Cluster Health:
GET /_cluster/health
// Example response (abridged)
{
  "cluster_name": "my-cluster",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3
}
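The `status` field summarizes shard allocation: green means every primary and replica is assigned, yellow means all primaries are assigned but at least one replica is not, and red means at least one primary is unassigned. A minimal Python sketch of that rule follows. The list-of-dicts input is a hypothetical simplification, not the real cluster state format.

```python
def cluster_status(shards):
    """Derive green/yellow/red from a list of shard copies.

    Each entry has `primary` (bool) and `assigned` (bool); this shape is
    hypothetical and much simpler than the real cluster state.
    """
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"      # at least one primary is unassigned: some data is unavailable
    if any(not s["assigned"] for s in shards):
        return "yellow"   # all primaries assigned, but at least one replica is not
    return "green"        # every primary and replica is assigned

shards = [
    {"index": "products", "shard": 0, "primary": True, "assigned": True},
    {"index": "products", "shard": 0, "primary": False, "assigned": False},
]
print(cluster_status(shards))  # yellow
```

A single-node cluster with `number_of_replicas` set to 1 stays yellow for this reason: a replica is never allocated on the same node as its primary.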
Node Roles:
- Master Node: Manages cluster state
- Data Node: Stores data and executes queries
- Ingest Node: Pre-processes documents
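- Coordinating Node: Routes requests to the relevant shards and merges the results (every node does this by default; a node with no other roles is coordinating-only)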
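The coordinating role can be pictured as scatter-gather: the query is sent to every relevant shard, each shard returns its local top hits, and the coordinator merges them into a global top list. The Python sketch below simulates that flow in memory. The scoring (term count in the title) and the function names are illustrative assumptions, not how Elasticsearch scores or names anything.

```python
import heapq

def search_shard(shard_docs, term, size):
    """Local top-`size` hits on one shard, scored by term count in the title."""
    scored = ((doc["title"].lower().split().count(term), doc["id"]) for doc in shard_docs)
    return heapq.nlargest(size, (hit for hit in scored if hit[0] > 0))

def coordinate(shards, term, size=2):
    """Fan the query out to every shard, then merge the per-shard hits into a global top-`size`."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, size)]
    return [doc_id for _, doc_id in heapq.nlargest(size, partial)]

shards = [
    [{"id": "a", "title": "Laptop stand"}, {"id": "b", "title": "Gaming laptop with laptop bag"}],
    [{"id": "c", "title": "Phone case"}, {"id": "d", "title": "Laptop charger laptop dock laptop bag"}],
]
top = coordinate(shards, "laptop")
print(top)
```

Each shard only needs to return `size` candidates, because no document outside a shard's local top `size` can appear in the global top `size`.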
Use Cases and Patterns
1. E-Commerce Search
Requirements:
- Product search with relevance
- Faceted search (filters)
- Autocomplete
- Spell correction
Implementation:
// Index with mapping
PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "price": { "type": "float" },
      "category": { "type": "keyword" }
    }
  }
}
// Faceted search
// "category" is mapped as keyword above, so aggregate on it directly
// (there is no "category.keyword" sub-field in this mapping)
GET /products/_search
{
  "query": { "match": { "title": "laptop" }},
  "aggs": {
    "categories": {
      "terms": { "field": "category" }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}
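The price facet relies on how range buckets treat their edges: `from` is inclusive and `to` is exclusive, so a product priced at exactly 500 falls into the second bucket, not the first. The Python sketch below reproduces that bucketing on a plain list. `range_aggregation` is a hypothetical helper and the prices are made-up sample values.

```python
def range_aggregation(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        count = sum(low <= v < high for v in values)
        buckets.append({**r, "doc_count": count})
    return buckets

prices = [299, 500, 999, 1000, 1299]  # sample values for illustration
buckets = range_aggregation(prices, [
    {"to": 500},
    {"from": 500, "to": 1000},
    {"from": 1000},
])
print(buckets)
```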
2. Log Analytics (ELK Stack)
Components:
- Elasticsearch: Storage and search
- Logstash: Log processing
- Kibana: Visualization
Pattern:
Logs → Logstash → Elasticsearch → Kibana
Index Template:
// Composable index template (7.8+); the legacy PUT /_template endpoint is deprecated
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "message": { "type": "text" }
      }
    }
  }
}
3. Time-Series Data
Use Cases:
- Metrics and monitoring
- IoT data
- Financial data
Index Pattern:
// Daily indices: metrics-2025-12-29
PUT /metrics-2025-12-29
{
  "settings": {
    "number_of_shards": 1
  }
}
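Time-based index names are usually generated by the writer rather than typed by hand. The Python sketch below derives daily, weekly, and monthly names from a date. The `index_name` helper and the `w` week prefix are assumptions chosen to match the naming used in this guide, and the weekly variant uses ISO week numbering.

```python
from datetime import date

def index_name(prefix, day, period="daily"):
    """Build a time-based index name for the given date."""
    if period == "daily":
        return f"{prefix}-{day:%Y-%m-%d}"
    if period == "weekly":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{prefix}-{iso_year}-w{iso_week:02d}"
    if period == "monthly":
        return f"{prefix}-{day:%Y-%m}"
    raise ValueError(f"unknown period: {period}")

print(index_name("metrics", date(2025, 12, 29)))          # metrics-2025-12-29
print(index_name("logs", date(2025, 12, 24), "weekly"))   # logs-2025-w52
print(index_name("logs", date(2025, 12, 29), "monthly"))  # logs-2025-12
```

One edge case is worth knowing: ISO weeks can belong to the neighbouring year, so 2025-12-29 falls in week 1 of 2026 and would be written to `logs-2026-w01`.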
Best Practices
Index Design
1. Index per Time Period:
- Daily indices: logs-2025-12-29
- Weekly indices: logs-2025-w52
- Monthly indices: logs-2025-12
2. Index Templates:
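// Composable index template (7.8+); the legacy PUT /_template endpoint is deprecated
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}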
3. Index Lifecycle Management:
- Hot: Recent data (SSD, fast)
- Warm: Older data (HDD, slower)
- Cold: Archive data (cheap storage)
- Delete: Very old data
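The phases above amount to a lookup from index age to storage tier. The Python sketch below shows that mapping with hypothetical thresholds of 7, 30, and 90 days. A real lifecycle policy sets its own ages per phase, and Elasticsearch evaluates them itself, so this is only a way to reason about the tiers.

```python
from datetime import date

# Hypothetical age thresholds in days; real policies set these per phase.
PHASES = [(7, "hot"), (30, "warm"), (90, "cold")]

def lifecycle_phase(index_day, today):
    """Return the phase for an index created on `index_day`."""
    age_days = (today - index_day).days
    for max_age, phase in PHASES:
        if age_days < max_age:
            return phase
    return "delete"

today = date(2025, 12, 29)
for day in (date(2025, 12, 28), date(2025, 12, 10), date(2025, 10, 15), date(2025, 1, 1)):
    print(day, lifecycle_phase(day, today))
```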
Query Optimization
1. Use Filters for Exact Matches:
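// Good: filter context (no scoring, and results can be cached)
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "active" }}
      ]
    }
  }
}
// Less efficient: query context (computes a relevance score that an exact match does not need)
{
  "query": {
    "term": { "status": "active" }
  }
}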
2. Limit Result Size:
GET /products/_search
{
  "size": 20, // Limit results
  "from": 0 // Pagination
}
3. Use Source Filtering:
GET /products/_search
{
  "_source": ["title", "price"], // Only return needed fields
  "query": { "match_all": {} }
}
Performance Tuning
1. Refresh Interval:
PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s" // Reduce refresh frequency
  }
}
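The refresh interval is what makes search "near" real time: newly indexed documents sit in a buffer and only become searchable at the next refresh, which happens roughly every second by default. The toy Python class below models that behaviour. `ToyIndex` is purely illustrative and leaves out segments, the translog, and everything else Lucene does.

```python
class ToyIndex:
    """Documents land in a buffer and only become searchable after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to search

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        """Make buffered documents visible to search."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, title):
        return [d for d in self._searchable if d["title"] == title]

idx = ToyIndex()
idx.index({"title": "Laptop"})
before = idx.search("Laptop")   # empty: no refresh has happened yet
idx.refresh()
after = idx.search("Laptop")    # the document is now visible
print(len(before), len(after))
```

Raising `refresh_interval` trades visibility latency for indexing throughput, since fewer refreshes mean fewer small segments to create and merge.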
2. Bulk Operations:
// Bulk index (faster than individual requests)
POST /products/_bulk
{"index":{}}
{"title":"Product 1"}
{"index":{}}
{"title":"Product 2"}
3. Mapping Optimization:
- Use keyword for exact matches
- Use text for full-text search
- Disable _source only if you never need the original document (saves storage, but updates, reindexing, and highlighting depend on it)
Common Patterns
Autocomplete
Completion Suggester:
PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}
// Search
GET /products/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "iph",
      "completion": {
        "field": "title"
      }
    }
  }
}
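The completion suggester is a prefix lookup over a structure built at index time. The Python sketch below shows the same idea with a sorted list and binary search. This is a stand-in, not the in-memory finite state transducer Elasticsearch uses, and `suggest` is a hypothetical function.

```python
from bisect import bisect_left

def suggest(sorted_titles, prefix, size=5):
    """Return up to `size` titles starting with `prefix` (case-insensitive).

    `sorted_titles` must be sorted by lowercase value.
    """
    keys = [t.lower() for t in sorted_titles]
    start = bisect_left(keys, prefix.lower())
    matches = []
    for title, key in zip(sorted_titles[start:], keys[start:]):
        if not key.startswith(prefix.lower()):
            break  # sorted order means no later title can match
        matches.append(title)
        if len(matches) == size:
            break
    return matches

titles = sorted(["iPhone 15 Pro", "iPad Air", "iPhone 15", "Laptop"], key=str.lower)
print(suggest(titles, "iph"))
```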
Fuzzy Search
Fuzzy Query:
GET /products/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "lapto",
        "fuzziness": "AUTO"
      }
    }
  }
}
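Fuzzy matching is based on edit distance, and `"AUTO"` picks the allowed number of edits from the term length: none for 1-2 characters, one for 3-5, and two for longer terms. The Python sketch below implements that rule with plain Levenshtein distance. It is a simplification: Elasticsearch also counts a transposition of two adjacent characters as a single edit by default, and it matches against indexed terms rather than a Python list.

```python
def auto_fuzziness(term):
    """Edits allowed by "AUTO": 0 for 1-2 characters, 1 for 3-5, 2 for longer terms."""
    return 0 if len(term) <= 2 else 1 if len(term) <= 5 else 2

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over one rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            prev, row[j] = row[j], min(
                row[j] + 1,          # delete ca
                row[j - 1] + 1,      # insert cb
                prev + (ca != cb),   # substitute (free when equal)
            )
    return row[len(b)]

def fuzzy_match(value, terms):
    """Return the indexed terms within the AUTO edit budget of `value`."""
    budget = auto_fuzziness(value)
    return [t for t in terms if edit_distance(value, t) <= budget]

print(fuzzy_match("lapto", ["laptop", "lamp", "phone"]))
```

Here "lapto" has five characters, so one edit is allowed, and "laptop" is one insertion away.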
Highlighting
Highlight Matches:
GET /products/_search
{
  "query": { "match": { "title": "laptop" }},
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
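The response to a highlighted search includes fragments of the field with matches wrapped in tags, `<em>` by default. The Python sketch below imitates that wrapping with a regular expression. It is only an approximation: the real highlighters work on analyzed tokens and return fragments, and `highlight` here is a hypothetical helper.

```python
import re

def highlight(text, query, pre_tag="<em>", post_tag="</em>"):
    """Wrap whole-word, case-insensitive matches of any query term in tags."""
    terms = [re.escape(t) for t in query.split()]
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)

print(highlight("Gaming Laptop with laptop bag", "laptop"))
```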
What Interviewers Look For
Search Engine Skills
- Full-Text Search Understanding
  - Understanding of indexing and search
  - Relevance scoring
  - Query types and optimization
  - Red Flags: No search understanding, poor queries, no optimization
- Distributed Search Architecture
  - Sharding and replication
  - Cluster management
  - Horizontal scaling
  - Red Flags: No sharding strategy, single node, no scaling
- Analytics Capabilities
  - Aggregations
  - Real-time analytics
  - Time-series handling
  - Red Flags: No aggregations, no analytics, poor time-series
Problem-Solving Approach
- Index Design
  - Proper index structure
  - Mapping optimization
  - Lifecycle management
  - Red Flags: Poor index design, no lifecycle, inefficient mappings
- Query Optimization
  - Efficient queries
  - Filter vs query
  - Performance tuning
  - Red Flags: Inefficient queries, no optimization, poor performance
System Design Skills
- Scalability
  - Horizontal scaling
  - Shard management
  - Cluster design
  - Red Flags: Vertical scaling only, no sharding, poor cluster design
- Use Case Application
  - Search engines
  - Log analytics
  - Real-time analytics
  - Red Flags: Wrong use cases, no application, poor understanding
Communication Skills
- Clear Explanation
  - Explains search concepts
  - Discusses trade-offs
  - Justifies design decisions
  - Red Flags: Unclear explanations, no justification, confusing
Meta-Specific Focus
- Search Engine Expertise
  - Understanding of search engines
  - Elasticsearch mastery
  - Real-world application
  - Key: Demonstrate search engine expertise
Summary
Elasticsearch Key Points:
- Distributed Search: Horizontal scaling with sharding
- Real-Time: Near real-time indexing and search
- Full-Text Search: Powerful text search with relevance
- Analytics: Aggregations for data analysis
- RESTful API: Simple HTTP-based interface
Common Use Cases:
- Full-text search (e-commerce, content)
- Log analytics (ELK stack)
- Real-time analytics (metrics, dashboards)
- Time-series data (monitoring, IoT)
Best Practices:
- Index per time period
- Use filters for exact matches
- Optimize mappings
- Bulk operations for performance
- Index lifecycle management
Elasticsearch is a powerful tool for building search engines, log analytics, and real-time data exploration systems with horizontal scalability and near real-time performance.