Elasticsearch

What is Elasticsearch?

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It is designed for horizontal scalability, reliability, and near real-time search: newly indexed documents become searchable after a short refresh delay rather than instantly.

  • Type: Distributed search and analytics engine
  • Written in: Java
  • License: Elastic License 2.0 / SSPL (since 7.11; AGPLv3 added as an option in 2024); Apache 2.0 through 7.10 and in the OpenSearch fork
  • Protocol: REST API over HTTP/HTTPS
  • Default Port: 9200 (HTTP), 9300 (Transport)
  • Part of: Elastic Stack (ELK: Elasticsearch, Logstash, Kibana)

Core Concepts

Terminology

Concept  | Description                   | Analogy (RDBMS)
Index    | Collection of documents       | Database
Document | JSON object (unit of data)    | Row
Field    | Key-value in document         | Column
Mapping  | Schema definition             | Table Schema
Shard    | Horizontal partition of index | Partition
Replica  | Copy of a shard               | Replica
Node     | Single ES server instance     | Server
Cluster  | Group of nodes                | Cluster
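
To make the Shard row concrete: a document is assigned to a primary shard by hashing its routing value (the document `_id` by default) and taking the result modulo the number of primary shards. The sketch below only illustrates that idea. It uses `String.hashCode()` as a stand-in for the Murmur3 hash Elasticsearch uses internally, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of hash-based shard routing.
// Assumption: String.hashCode() stands in for Elasticsearch's Murmur3 hash.
final class ShardRoutingSketch {

    // Maps a routing value (by default the document _id) to a primary shard number.
    static int shardFor(String routing, int numPrimaryShards) {
        if (numPrimaryShards <= 0) {
            throw new IllegalArgumentException("numPrimaryShards must be positive");
        }
        // floorMod keeps the result in [0, numPrimaryShards) even for negative hashes.
        return Math.floorMod(routing.hashCode(), numPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 3;
        Map<Integer, Integer> docsPerShard = new HashMap<>();
        for (int id = 1; id <= 1000; id++) {
            docsPerShard.merge(shardFor(String.valueOf(id), shards), 1, Integer::sum);
        }
        System.out.println("documents per shard: " + docsPerShard);
    }
}
```

Because the shard number depends on the primary shard count, that count is effectively fixed when the index is created; changing it means reindexing into a new index.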

Architecture

[Figure: Elasticsearch cluster architecture]

Node Types

Type         | Role
Master       | Cluster management, index creation
Data         | Stores data, executes searches
Ingest       | Pre-processing documents
Coordinating | Routes requests, aggregates results
ML           | Machine learning jobs
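
The coordinating role in the table can be pictured as scatter-gather: the search is sent to every relevant shard, each shard returns its own top hits, and the coordinating node merges them into one global result. The following is a minimal sketch of the merge step only, with a hypothetical `Hit` type; it is not how Elasticsearch implements it internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "gather" step on a coordinating node.
final class ScatterGatherSketch {

    // A scored hit as one shard might report it (names are illustrative).
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merges each shard's local top hits into a global top-N, highest score first.
    static List<Hit> mergeTopN(List<List<Hit>> perShardHits, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            all.addAll(shardHits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = List.of(
            List.of(new Hit("a", 3.2), new Hit("b", 1.1)),
            List.of(new Hit("c", 2.7), new Hit("d", 0.4)));
        for (Hit h : mergeTopN(shards, 3)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```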

Core Features

[Figure: Elasticsearch core features]


Common Use Cases

1. Full-Text Search

// Index a document
PUT /products/_doc/1
{
    "name": "Apple iPhone 15 Pro",
    "description": "Latest iPhone with A17 Pro chip",
    "price": 999,
    "category": "electronics",
    "tags": ["smartphone", "apple", "ios"]
}

// Search
GET /products/_search
{
    "query": {
        "multi_match": {
            "query": "iphone pro",
            "fields": ["name^3", "description", "tags"],
            "fuzziness": "AUTO"
        }
    }
}
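
The `"fuzziness": "AUTO"` setting above chooses an edit-distance budget from the length of each query term: terms of 0-2 characters must match exactly, 3-5 characters allow one edit, and longer terms allow two. The sketch below shows that rule with a plain Levenshtein distance. It is an illustration under stated assumptions: transpositions are not treated as a single edit here, and the class and method names are hypothetical.

```java
// Hypothetical sketch of how "fuzziness": "AUTO" picks an edit-distance budget.
final class FuzzinessSketch {

    // AUTO defaults: 0-2 chars exact, 3-5 chars one edit, longer terms two edits.
    static int autoMaxEdits(String term) {
        int len = term.length();
        if (len <= 2) return 0;
        if (len <= 5) return 1;
        return 2;
    }

    // Classic Levenshtein distance (insert, delete, substitute).
    // Assumption: a transposition counts as two edits in this sketch.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the indexed term is within the AUTO budget of the query term.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm) {
        return levenshtein(queryTerm, indexedTerm) <= autoMaxEdits(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatches("iphne", "iphone"));
        System.out.println(fuzzyMatches("io", "ios"));
    }
}
```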

2. Log Analytics (ELK Stack)

// Log document structure
{
    "@timestamp": "2024-01-15T10:30:00Z",
    "level": "ERROR",
    "service": "payment-service",
    "message": "Payment failed for user 123",
    "trace_id": "abc123",
    "user_id": "user_123",
    "error_code": "INSUFFICIENT_FUNDS"
}

// Query logs
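GET /logs-*/_search
{
    "query": {
        "bool": {
            // Exact-value conditions go in filter context. A match query on an analyzed
            // "service" field would split "payment-service" into "payment" and "service"
            // and match either token. The .keyword sub-fields assume default dynamic mapping.
            "filter": [
                { "term": { "level.keyword": "ERROR" } },
                { "term": { "service.keyword": "payment-service" } },
                { "range": { "@timestamp": { "gte": "now-1h" } } }
            ]
        }
    },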
    "aggs": {
        "errors_by_code": {
            "terms": { "field": "error_code.keyword" }
        }
    }
}
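
3. Product Search

// Assumes the Elasticsearch Java API client (co.elastic.clients) and Spring.
// Product, ProductSearchRequest, SearchResult, and mapResponse are application types not shown here.
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.json.JsonData;
import org.springframework.stereotype.Service;

import java.io.IOException;

@Service
public class ProductSearchService {

    private final ElasticsearchClient client;

    // Constructor injection supplies the final client field.
    public ProductSearchService(ElasticsearchClient client) {
        this.client = client;
    }

    // client.search performs network I/O and declares IOException.
    public SearchResult<Product> search(ProductSearchRequest request) throws IOException {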
        SearchResponse<Product> response = client.search(s -> s
            .index("products")
            .query(q -> q
                .bool(b -> {
                    // Full-text search
                    if (request.getQuery() != null) {
                        b.must(m -> m
                            .multiMatch(mm -> mm
                                .query(request.getQuery())
                                .fields("name^3", "description", "brand^2")
                                .fuzziness("AUTO")
                            )
                        );
                    }

                    // Filters
                    if (request.getCategory() != null) {
                        b.filter(f -> f
                            .term(t -> t.field("category").value(request.getCategory()))
                        );
                    }

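                    if (request.getMinPrice() != null || request.getMaxPrice() != null) {
                        b.filter(f -> f
                            .range(r -> {
                                r.field("price");
                                // Set each bound only when it was supplied, so a missing
                                // bound is never sent as null.
                                if (request.getMinPrice() != null) {
                                    r.gte(JsonData.of(request.getMinPrice()));
                                }
                                if (request.getMaxPrice() != null) {
                                    r.lte(JsonData.of(request.getMaxPrice()));
                                }
                                return r;
                            })
                        );
                    }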

                    return b;
                })
            )
            .aggregations("categories", a -> a
                .terms(t -> t.field("category.keyword"))
            )
            .aggregations("price_ranges", a -> a
                .range(r -> r
                    .field("price")
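                    // The Java API client names this type AggregationRange
                    // (package co.elastic.clients.elasticsearch._types.aggregations).
                    .ranges(
                        AggregationRange.of(rr -> rr.to(50.0)),
                        AggregationRange.of(rr -> rr.from(50.0).to(100.0)),
                        AggregationRange.of(rr -> rr.from(100.0))
                    )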
                )
            )
            .highlight(h -> h
                .fields("name", f -> f)
                .fields("description", f -> f)
            )
            .from(request.getOffset())
            .size(request.getLimit()),
            Product.class
        );

        return mapResponse(response);
    }
}

4. Autocomplete / Suggestions

// Mapping with completion suggester
PUT /products
{
    "mappings": {
        "properties": {
            "name": { "type": "text" },
            "suggest": {
                "type": "completion",
                "contexts": [
                    { "name": "category", "type": "category" }
                ]
            }
        }
    }
}

// Index with suggestions
PUT /products/_doc/1
{
    "name": "Apple iPhone 15",
    "suggest": {
        "input": ["iphone", "iphone 15", "apple iphone"],
        "contexts": { "category": "electronics" }
    }
}

// Autocomplete query
GET /products/_search
{
    "suggest": {
        "product-suggest": {
            "prefix": "iph",
            "completion": {
                "field": "suggest",
                "size": 5,
                "contexts": {
                    "category": "electronics"
                },
                "fuzzy": { "fuzziness": 1 }
            }
        }
    }
}
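
Conceptually, the completion suggester answers "which stored inputs start with this prefix?" from an in-memory structure built at index time. The sketch below imitates that lookup with a sorted set and a range scan. It is a stand-in only: contexts, fuzzy matching, and weight-based ranking are omitted, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical prefix-lookup sketch; a sorted set stands in for the suggester's index structure.
final class PrefixSuggestSketch {

    private final TreeSet<String> inputs = new TreeSet<>();

    // Registers one suggestion input, normalized to lower case.
    void add(String input) {
        inputs.add(input.toLowerCase(Locale.ROOT));
    }

    // Returns up to `size` inputs that start with the prefix, in lexicographic order.
    List<String> suggest(String prefix, int size) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // tailSet starts at the first entry >= prefix; matching entries are contiguous.
        for (String candidate : inputs.tailSet(p, true)) {
            if (!candidate.startsWith(p) || out.size() >= size) {
                break;
            }
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggestSketch suggester = new PrefixSuggestSketch();
        suggester.add("iphone");
        suggester.add("iphone 15");
        suggester.add("apple iphone");
        System.out.println(suggester.suggest("iph", 5));
    }
}
```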

5. Geo-Spatial Search

// Mapping
PUT /stores
{
    "mappings": {
        "properties": {
            "name": { "type": "text" },
            "location": { "type": "geo_point" }
        }
    }
}

// Index store
PUT /stores/_doc/1
{
    "name": "Downtown Store",
    "location": { "lat": 40.7128, "lon": -74.0060 }
}

// Find stores within radius
GET /stores/_search
{
    "query": {
        "geo_distance": {
            "distance": "10km",
            "location": { "lat": 40.73, "lon": -73.99 }
        }
    },
    "sort": [
        {
            "_geo_distance": {
                "location": { "lat": 40.73, "lon": -73.99 },
                "order": "asc",
                "unit": "km"
            }
        }
    ]
}
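
The `geo_distance` filter and sort above both rest on a great-circle distance between two lat/lon points. A minimal sketch of that calculation using the haversine formula follows. It assumes a spherical Earth with a mean radius of 6371 km, and the class name is hypothetical; Elasticsearch's own distance computation may differ in detail.

```java
// Hypothetical great-circle distance sketch using the haversine formula.
final class GeoDistanceSketch {

    // Assumption: spherical Earth with mean radius 6371 km.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Distance in kilometers between two points given in decimal degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Store location and query point taken from the example requests.
        double km = haversineKm(40.7128, -74.0060, 40.73, -73.99);
        System.out.printf("distance: %.2f km, within 10km: %b%n", km, km <= 10.0);
    }
}
```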

6. Aggregations / Analytics

// Sales analytics
GET /orders/_search
{
    "size": 0,
    "aggs": {
        "sales_over_time": {
            "date_histogram": {
                "field": "order_date",
                "calendar_interval": "day"
            },
            "aggs": {
                "total_sales": { "sum": { "field": "amount" } },
                "avg_order_value": { "avg": { "field": "amount" } }
            }
        },
        "top_categories": {
            "terms": { "field": "category.keyword", "size": 10 },
            "aggs": {
                "revenue": { "sum": { "field": "amount" } }
            }
        },
        "revenue_percentiles": {
            "percentiles": { "field": "amount" }
        }
    }
}
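
The `sales_over_time` aggregation is essentially a group-by on the calendar day with `sum` and `avg` computed per bucket. The same shape can be sketched in plain Java with streams. The `Order` type is hypothetical, and days are bucketed in UTC, which is the default time zone for `date_histogram`.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of a daily date_histogram with sum and avg sub-aggregations.
final class DateHistogramSketch {

    static final class Order {
        final Instant orderDate;
        final double amount;

        Order(Instant orderDate, double amount) {
            this.orderDate = orderDate;
            this.amount = amount;
        }
    }

    // Buckets orders by UTC calendar day; each bucket carries sum, average, and count.
    static Map<LocalDate, DoubleSummaryStatistics> salesPerDay(List<Order> orders) {
        return orders.stream().collect(Collectors.groupingBy(
            (Order o) -> o.orderDate.atZone(ZoneOffset.UTC).toLocalDate(),
            TreeMap::new,
            Collectors.summarizingDouble((Order o) -> o.amount)));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
            new Order(Instant.parse("2024-01-15T10:30:00Z"), 10.0),
            new Order(Instant.parse("2024-01-15T18:00:00Z"), 30.0),
            new Order(Instant.parse("2024-01-16T09:00:00Z"), 5.0));
        salesPerDay(orders).forEach((day, stats) ->
            System.out.println(day + " total=" + stats.getSum() + " avg=" + stats.getAverage()));
    }
}
```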

Query Types

Full-Text Queries

// Match (analyzed)
{ "match": { "message": "quick brown fox" } }

// Match Phrase
{ "match_phrase": { "message": "quick brown fox" } }

// Multi-Match
{ "multi_match": { "query": "search text", "fields": ["title^2", "body"] } }

// Query String (Lucene syntax)
{ "query_string": { "query": "title:elasticsearch AND status:published" } }

Term-Level Queries

// Term (exact match, not analyzed)
{ "term": { "status": "published" } }

// Terms (multiple values)
{ "terms": { "status": ["published", "draft"] } }

// Range
{ "range": { "price": { "gte": 10, "lte": 100 } } }

// Exists
{ "exists": { "field": "email" } }

// Prefix
{ "prefix": { "username": "joh" } }

// Wildcard
{ "wildcard": { "email": "*@gmail.com" } }

Compound Queries

// Bool Query
{
    "bool": {
        "must": [ ... ],      // AND, affects score
        "should": [ ... ],    // OR, affects score
        "must_not": [ ... ],  // NOT, no score
        "filter": [ ... ]     // AND, no score (cached)
    }
}
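
The four clause types differ in whether they decide matching, scoring, or both. The sketch below models that with predicates. It is a simplification under stated assumptions: every matching `must` or `should` clause adds a flat 1.0 instead of a real relevance score, `should` becomes mandatory only when there is no `must` or `filter` clause, and all names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.function.Predicate;

// Hypothetical model of bool query semantics: who must match, and who contributes to the score.
final class BoolQuerySketch {

    // Returns the document's score, or empty when the document does not match.
    // Assumption: each matching scoring clause contributes a flat 1.0.
    static OptionalDouble score(Map<String, String> doc,
                                List<Predicate<Map<String, String>>> must,
                                List<Predicate<Map<String, String>>> should,
                                List<Predicate<Map<String, String>>> mustNot,
                                List<Predicate<Map<String, String>>> filter) {
        for (Predicate<Map<String, String>> p : filter) {
            if (!p.test(doc)) return OptionalDouble.empty();   // required, no score
        }
        for (Predicate<Map<String, String>> p : mustNot) {
            if (p.test(doc)) return OptionalDouble.empty();    // excluded, no score
        }
        double score = 0.0;
        for (Predicate<Map<String, String>> p : must) {
            if (!p.test(doc)) return OptionalDouble.empty();   // required, scored
            score += 1.0;
        }
        int shouldMatched = 0;
        for (Predicate<Map<String, String>> p : should) {
            if (p.test(doc)) {                                 // optional, scored
                shouldMatched++;
                score += 1.0;
            }
        }
        boolean shouldRequired = must.isEmpty() && filter.isEmpty() && !should.isEmpty();
        if (shouldRequired && shouldMatched == 0) return OptionalDouble.empty();
        return OptionalDouble.of(score);
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> isError = d -> "ERROR".equals(d.get("level"));
        Predicate<Map<String, String>> isPayment = d -> "payment-service".equals(d.get("service"));
        Map<String, String> doc = Map.of("level", "ERROR", "service", "payment-service");
        System.out.println(score(doc, List.of(isError), List.of(isPayment), List.of(), List.of()));
    }
}
```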

Mapping & Analyzers

Field Types

Type                       | Description
text                       | Analyzed full-text
keyword                    | Exact value (not analyzed)
long, integer, short, byte | Numeric
double, float              | Floating point
boolean                    | true/false
date                       | Date/datetime
geo_point                  | Lat/lon
geo_shape                  | Polygons, etc.
nested                     | Array of objects
object                     | JSON object

Custom Analyzer

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "my_stemmer"]
                }
            },
            "filter": {
                "my_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}
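
An analyzer is a pipeline: a tokenizer splits the text, then token filters rewrite each token in order. The sketch below imitates the first two filters of `my_analyzer` in plain Java. It is a rough approximation: splitting on non-alphanumeric characters stands in for the `standard` tokenizer, Unicode decomposition stands in for `asciifolding`, and the stemmer step is left out entirely.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical approximation of a tokenizer + lowercase + asciifolding chain (no stemming).
final class AnalyzerSketch {

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizer stand-in: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // "lowercase" filter
            String lower = raw.toLowerCase(Locale.ROOT);
            // Rough "asciifolding": decompose accented letters, then drop the combining marks.
            String folded = Normalizer.normalize(lower, Normalizer.Form.NFD)
                                      .replaceAll("\\p{M}", "");
            tokens.add(folded);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Caf\u00e9 R\u00e9sum\u00e9, QUICK brown-fox"));
    }
}
```

The real tokens for a given text can be checked against a cluster with the `_analyze` API.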

Index Management

Index Lifecycle

[Figure: Elasticsearch index lifecycle]

Index Templates

PUT /_index_template/logs_template
{
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1
        },
        "mappings": {
            "properties": {
                "@timestamp": { "type": "date" },
                "message": { "type": "text" }
            }
        }
    }
}

Reindexing

POST /_reindex
{
    "source": { "index": "old_index" },
    "dest": { "index": "new_index" }
}

Performance Optimization

Indexing

// Bulk indexing
POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Product 1", "price": 10 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Product 2", "price": 20 }
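
The `_bulk` body is newline-delimited JSON: one action line, then one source line per document, with a newline after every line including the last. A minimal sketch of assembling such a body in Java follows. It assumes each source is already a single-line JSON string and that index names and IDs need no escaping; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles an NDJSON body for POST /_bulk.
final class BulkBodySketch {

    // Assumption: sources are single-line JSON strings; index and ids contain no characters that need escaping.
    static String buildBulkBody(String index, Map<String, String> sourcesById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : sourcesById.entrySet()) {
            // Action line: what to do and where.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(e.getKey()).append("\"}}\n");
            // Source line: the document itself, followed by the required newline.
            body.append(e.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"name\":\"Product 1\",\"price\":10}");
        docs.put("2", "{\"name\":\"Product 2\",\"price\":20}");
        System.out.print(buildBulkBody("products", docs));
    }
}
```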

Search

// Use filter context for non-scoring queries
{
    "query": {
        "bool": {
            "must": { "match": { "title": "search" } },
            "filter": [
                { "term": { "status": "published" } },
                { "range": { "date": { "gte": "2024-01-01" } } }
            ]
        }
    }
}

// Limit fields returned
{
    "_source": ["title", "date"],
    "query": { ... }
}

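// Use search_after for deep pagination.
// search_after holds the sort values of the last hit on the previous page, in sort order.
// Recent versions discourage sorting on _id; a keyword copy of the ID is a common tiebreaker instead.
{
    "search_after": [1234, "doc_id"],
    "sort": [{ "date": "desc" }, { "_id": "asc" }]
}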
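
The idea behind `search_after` is cursor pagination: instead of skipping N hits, each request resumes strictly after the sort values of the last hit already seen. The sketch below simulates this over an in-memory list sorted by date descending, then ID ascending. The `Doc` type and method names are hypothetical, and the linear scan is for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory simulation of search_after style cursor pagination.
final class SearchAfterSketch {

    static final class Doc {
        final long date;
        final String id;

        Doc(long date, String id) {
            this.date = date;
            this.id = id;
        }
    }

    // Sort order: date descending, then id ascending as a unique tiebreaker.
    static final Comparator<Doc> SORT =
        Comparator.comparingLong((Doc d) -> d.date).reversed().thenComparing((Doc d) -> d.id);

    // Returns up to `size` docs that come strictly after the cursor (null means first page).
    static List<Doc> nextPage(List<Doc> sortedDocs, Doc after, int size) {
        List<Doc> out = new ArrayList<>();
        for (Doc d : sortedDocs) {
            if (after != null && SORT.compare(d, after) <= 0) {
                continue; // at or before the cursor: already returned
            }
            out.add(d);
            if (out.size() == size) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>(List.of(
            new Doc(100, "d"), new Doc(300, "b"), new Doc(200, "c"),
            new Doc(300, "a"), new Doc(100, "e")));
        docs.sort(SORT);
        Doc cursor = null;
        while (true) {
            List<Doc> hits = nextPage(docs, cursor, 2);
            if (hits.isEmpty()) {
                break;
            }
            for (Doc d : hits) {
                System.out.println(d.date + " " + d.id);
            }
            cursor = hits.get(hits.size() - 1); // last hit's sort values become the next cursor
        }
    }
}
```

The tiebreaker matters: without a unique final sort key, documents sharing the same date could be skipped or repeated across pages.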

Trade-offs

Pros                      | Cons
Powerful full-text search | Complex to operate at scale
Near real-time            | Eventually consistent
Horizontal scalability    | Memory intensive
Rich query DSL            | Not for primary data store
Great for analytics       | Expensive for high cardinality
Schema-free               | Mapping changes can be painful
REST API                  | No transactions
Aggregations              | No joins (denormalize)

Performance Characteristics

Metric           | Typical Value
Index latency    | ~1 second until searchable (NRT; default refresh interval)
Search latency   | 10-100 ms
Throughput       | 10,000+ docs/sec/node
Shard size       | 10-50 GB recommended
Shards per index | 1-5 (avoid over-sharding)
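
The shard-size guideline turns into simple arithmetic: divide the expected index size by a target shard size and round up. A small sketch of that calculation follows, with a hypothetical helper name. The 30 GB target in the example is an arbitrary value inside the 10-50 GB range, not a recommendation from the source.

```java
// Hypothetical helper: estimate a primary shard count from expected index size.
final class ShardSizingSketch {

    // Rounds up so no shard exceeds the target size; always returns at least 1.
    static int primaryShards(double expectedIndexSizeGb, double targetShardSizeGb) {
        if (expectedIndexSizeGb < 0 || targetShardSizeGb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (int) Math.max(1, Math.ceil(expectedIndexSizeGb / targetShardSizeGb));
    }

    public static void main(String[] args) {
        // 120 GB of data at a 30 GB target shard size.
        System.out.println(primaryShards(120, 30));
        // A small 10 GB index still needs one shard.
        System.out.println(primaryShards(10, 30));
    }
}
```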

When to Use Elasticsearch

Good For:

  • Full-text search
  • Log/event analytics
  • Product search
  • Autocomplete
  • Geo-spatial search
  • Metrics and monitoring
  • Security analytics (SIEM)

Not Good For:

  • Primary database
  • Transactional data
  • Strong consistency needs
  • Frequent updates to same document
  • Complex relational queries


Elasticsearch vs Alternatives

Feature          | Elasticsearch | Solr       | OpenSearch | Algolia
Full-text search | Excellent     | Excellent  | Excellent  | Excellent
Analytics        | Excellent     | Good       | Excellent  | Limited
Ease of use      | Good          | Moderate   | Good       | Excellent
Managed options  | Yes           | Limited    | Yes        | Yes (SaaS)
License          | Elastic/SSPL  | Apache 2.0 | Apache 2.0 | Proprietary
Real-time        | Yes           | Yes        | Yes        | Yes

Best Practices

  1. Right-size shards - 10-50GB per shard
  2. Don't over-shard - More shards ≠ better performance
  3. Use aliases - For zero-downtime reindexing
  4. Bulk for indexing - Never single document inserts at scale
  5. Use filters - For non-scoring queries (cached)
  6. Denormalize data - No joins, embed related data
  7. Set explicit mappings - Don't rely on dynamic mapping in production
  8. Index templates - For consistent settings across indices
  9. Separate hot/warm/cold - Lifecycle management
  10. Monitor cluster health - Yellow = some replica shards unassigned, Red = some primary shards unassigned (data unavailable)
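
The health colors in the last item follow from shard allocation: any unassigned primary makes the cluster red, otherwise any unassigned replica makes it yellow. A minimal sketch of that rule is shown below. The class name and the two counters are hypothetical, and real cluster health reports more detail than this.

```java
// Hypothetical sketch of how cluster health color follows from unassigned shard counts.
final class ClusterHealthSketch {

    enum Health { GREEN, YELLOW, RED }

    static Health health(int unassignedPrimaries, int unassignedReplicas) {
        if (unassignedPrimaries > 0) {
            return Health.RED;     // some data is not available
        }
        if (unassignedReplicas > 0) {
            return Health.YELLOW;  // all data available, redundancy reduced
        }
        return Health.GREEN;       // every primary and replica is assigned
    }

    public static void main(String[] args) {
        // A single-node cluster with number_of_replicas = 1 cannot place its replicas.
        System.out.println(health(0, 3));
    }
}
```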

Common API Endpoints

# Cluster
GET /_cluster/health
GET /_cluster/stats
GET /_cat/nodes?v
GET /_cat/indices?v
GET /_cat/shards?v

# Index
PUT /my_index
DELETE /my_index
GET /my_index/_mapping
GET /my_index/_settings

# Document
PUT /my_index/_doc/1 { ... }
GET /my_index/_doc/1
DELETE /my_index/_doc/1
POST /my_index/_update/1 { "doc": { ... } }

# Search
GET /my_index/_search { "query": { ... } }
POST /my_index/_search { "query": { ... } }

# Bulk
POST /_bulk { ... }

# Analyze
GET /_analyze { "analyzer": "standard", "text": "Hello World" }