Elasticsearch

What is Elasticsearch?

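Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It is designed for horizontal scalability, reliability, and near real-time search: newly indexed documents become searchable after the next index refresh, about one second later by default.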

Type: Distributed search and analytics engine
Written in: Java
License: Elastic License 2.0 / SSPL (7.11 and later, with AGPLv3 added as an option in 2024); Apache 2.0 (7.10 and earlier, and the OpenSearch fork)
Protocol: REST API over HTTP/HTTPS
Default Port: 9200 (HTTP), 9300 (Transport)
Part of: Elastic Stack (ELK: Elasticsearch, Logstash, Kibana)

Core Concepts

Terminology

Concept	Description	Analogy (RDBMS)
Index	Collection of documents	Database
Document	JSON object (unit of data)	Row
Field	Key-value in document	Column
Mapping	Schema definition	Table Schema
Shard	Horizontal partition of index	Partition
Replica	Copy of a shard	Replica
Node	Single ES server instance	Server
Cluster	Group of nodes	Cluster
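
The Shard row can be made concrete with a small routing sketch. Elasticsearch picks the primary shard for a document by hashing its routing value (the document `_id` by default) and reducing it modulo the shard count. The sketch below is only an illustration: it substitutes Java's `String.hashCode()` for the Murmur3 hash the engine uses, so the shard numbers will not match a real cluster, and `ShardRoutingSketch` and `shardFor` are hypothetical names.

```java
final class ShardRoutingSketch {

    // Stand-in for document routing: hash(routing) mod number_of_primary_shards.
    // Assumption: String.hashCode() replaces the Murmur3 hash used by the real engine.
    static int shardFor(String routing, int primaryShards) {
        if (primaryShards < 1) {
            throw new IllegalArgumentException("primaryShards must be >= 1");
        }
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(routing.hashCode(), primaryShards);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"1", "2", "user_123"}) {
            System.out.println(id + " -> shard " + shardFor(id, 3));
        }
    }
}
```

Because the shard count is part of the formula, the number of primary shards of an existing index cannot simply be changed; doing so normally means reindexing (or using the split/shrink APIs).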

Architecture

[Figure: Elasticsearch cluster architecture]

Node Types

Type	Role
Master	Cluster management, index creation
Data	Stores data, executes searches
Ingest	Pre-processing documents
Coordinating	Routes requests, aggregates results
ML	Machine learning jobs

Core Features

[Figure: Elasticsearch core features]

Common Use Cases

1. Full-Text Search

// Index a document
PUT /products/_doc/1
{
    "name": "Apple iPhone 15 Pro",
    "description": "Latest iPhone with A17 Pro chip",
    "price": 999,
    "category": "electronics",
    "tags": ["smartphone", "apple", "ios"]
}

// Search
GET /products/_search
{
    "query": {
        "multi_match": {
            "query": "iphone pro",
            "fields": ["name^3", "description", "tags"],
            "fuzziness": "AUTO"
        }
    }
}
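
The `"fuzziness": "AUTO"` setting in the search above scales the allowed edit distance with the length of each query term: with the default thresholds, terms of 0-2 characters must match exactly, terms of 3-5 characters may differ by one edit, and longer terms by two. A minimal Java sketch of that rule follows. It uses plain Levenshtein distance, whereas Elasticsearch by default also counts a swap of two adjacent characters as a single edit; the class and method names are hypothetical.

```java
final class FuzzinessSketch {

    // Edit distance allowed by "fuzziness": "AUTO" with its default thresholds (3 and 6).
    static int autoMaxEdits(String term) {
        int len = term.length();
        if (len <= 2) return 0;   // 0-2 characters: exact match only
        if (len <= 5) return 1;   // 3-5 characters: one edit
        return 2;                 // 6+ characters: two edits
    }

    // Plain Levenshtein distance (insert, delete, substitute).
    // Simplification: transpositions are not treated as a single edit here.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the indexed term is within the AUTO edit budget of the query term.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm) {
        return levenshtein(queryTerm, indexedTerm) <= autoMaxEdits(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatches("iphnoe", "iphone")); // 6 chars, 2 edits allowed
        System.out.println(fuzzyMatches("pro", "pto"));       // 3 chars, 1 edit allowed
        System.out.println(fuzzyMatches("io", "ios"));        // 2 chars, exact match required
    }
}
```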

2. Log Analytics (ELK Stack)

// Log document structure
{
    "@timestamp": "2024-01-15T10:30:00Z",
    "level": "ERROR",
    "service": "payment-service",
    "message": "Payment failed for user 123",
    "trace_id": "abc123",
    "user_id": "user_123",
    "error_code": "INSUFFICIENT_FUNDS"
}

// Query logs
GET /logs-*/_search
{
    "query": {
        "bool": {
            "must": [
                { "match": { "level": "ERROR" } },
                { "match": { "service": "payment-service" } }
            ],
            "filter": [
                { "range": { "@timestamp": { "gte": "now-1h" } } }
            ]
        }
    },
    "aggs": {
        "errors_by_code": {
            "terms": { "field": "error_code.keyword" }
        }
    }
}

3. E-commerce Search

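import java.io.IOException;

import org.springframework.stereotype.Service;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.aggregations.AggregationRange;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.json.JsonData;

// Product, ProductSearchRequest, SearchResult and mapResponse are application
// types assumed to be defined elsewhere.
@Service
public class ProductSearchService {

    private final ElasticsearchClient client;

    public ProductSearchService(ElasticsearchClient client) {
        this.client = client;
    }

    public SearchResult<Product> search(ProductSearchRequest request) throws IOException {
        SearchResponse<Product> response = client.search(s -> s
            .index("products")
            .query(q -> q
                .bool(b -> {
                    // Full-text search (scored)
                    if (request.getQuery() != null) {
                        b.must(m -> m
                            .multiMatch(mm -> mm
                                .query(request.getQuery())
                                .fields("name^3", "description", "brand^2")
                                .fuzziness("AUTO")
                            )
                        );
                    }

                    // Filters (not scored). The term filter targets the keyword
                    // sub-field, the same field the "categories" aggregation uses.
                    if (request.getCategory() != null) {
                        b.filter(f -> f
                            .term(t -> t.field("category.keyword").value(request.getCategory()))
                        );
                    }

                    if (request.getMinPrice() != null || request.getMaxPrice() != null) {
                        b.filter(f -> f
                            // Untyped range builder of older 8.x clients; newer client
                            // releases split it into typed variants (number, date, term),
                            // so adapt this call to your client version.
                            .range(r -> {
                                r.field("price");
                                // Set only the bounds that were actually supplied
                                if (request.getMinPrice() != null) {
                                    r.gte(JsonData.of(request.getMinPrice()));
                                }
                                if (request.getMaxPrice() != null) {
                                    r.lte(JsonData.of(request.getMaxPrice()));
                                }
                                return r;
                            })
                        );
                    }

                    return b;
                })
            )
            // Facets for the result page
            .aggregations("categories", a -> a
                .terms(t -> t.field("category.keyword"))
            )
            .aggregations("price_ranges", a -> a
                .range(r -> r
                    .field("price")
                    .ranges(
                        AggregationRange.of(rr -> rr.to(50.0)),
                        AggregationRange.of(rr -> rr.from(50.0).to(100.0)),
                        AggregationRange.of(rr -> rr.from(100.0))
                    )
                )
            )
            .highlight(h -> h
                .fields("name", f -> f)
                .fields("description", f -> f)
            )
            // Offset pagination; suitable for shallow pages only
            .from(request.getOffset())
            .size(request.getLimit()),
            Product.class
        );

        return mapResponse(response);
    }
}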

4. Autocomplete / Suggestions

// Mapping with completion suggester
PUT /products
{
    "mappings": {
        "properties": {
            "name": { "type": "text" },
            "suggest": {
                "type": "completion",
                "contexts": [
                    { "name": "category", "type": "category" }
                ]
            }
        }
    }
}

// Index with suggestions
PUT /products/_doc/1
{
    "name": "Apple iPhone 15",
    "suggest": {
        "input": ["iphone", "iphone 15", "apple iphone"],
        "contexts": { "category": "electronics" }
    }
}

// Autocomplete query
GET /products/_search
{
    "suggest": {
        "product-suggest": {
            "prefix": "iph",
            "completion": {
                "field": "suggest",
                "size": 5,
                "contexts": {
                    "category": "electronics"
                },
                "fuzzy": { "fuzziness": 1 }
            }
        }
    }
}

5. Geo-Spatial Search

// Mapping
PUT /stores
{
    "mappings": {
        "properties": {
            "name": { "type": "text" },
            "location": { "type": "geo_point" }
        }
    }
}

// Index store
PUT /stores/_doc/1
{
    "name": "Downtown Store",
    "location": { "lat": 40.7128, "lon": -74.0060 }
}

// Find stores within radius
GET /stores/_search
{
    "query": {
        "geo_distance": {
            "distance": "10km",
            "location": { "lat": 40.73, "lon": -73.99 }
        }
    },
    "sort": [
        {
            "_geo_distance": {
                "location": { "lat": 40.73, "lon": -73.99 },
                "order": "asc",
                "unit": "km"
            }
        }
    ]
}
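
To see why the store indexed above matches the `10km` filter, the distance between the two points can be worked out by hand. The sketch below uses the haversine formula on a spherical Earth with a mean radius of 6371 km, which is an approximation chosen for illustration and not necessarily the exact calculation Elasticsearch performs; `GeoDistanceSketch` and `haversineKm` are hypothetical names.

```java
final class GeoDistanceSketch {

    // Assumption: spherical Earth with mean radius 6371 km.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometres between two lat/lon points (degrees).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Query point from the search vs. the "Downtown Store" location
        double km = haversineKm(40.73, -73.99, 40.7128, -74.0060);
        System.out.printf("distance: %.2f km, within 10km: %b%n", km, km <= 10.0);
    }
}
```

The two points are only a few kilometres apart, so the store falls inside the radius and sorts near the top of the `_geo_distance` ordering.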

6. Aggregations / Analytics

// Sales analytics
GET /orders/_search
{
    "size": 0,
    "aggs": {
        "sales_over_time": {
            "date_histogram": {
                "field": "order_date",
                "calendar_interval": "day"
            },
            "aggs": {
                "total_sales": { "sum": { "field": "amount" } },
                "avg_order_value": { "avg": { "field": "amount" } }
            }
        },
        "top_categories": {
            "terms": { "field": "category.keyword", "size": 10 },
            "aggs": {
                "revenue": { "sum": { "field": "amount" } }
            }
        },
        "revenue_percentiles": {
            "percentiles": { "field": "amount" }
        }
    }
}

Query Types

Full-Text Queries

// Match (analyzed)
{ "match": { "message": "quick brown fox" } }

// Match Phrase
{ "match_phrase": { "message": "quick brown fox" } }

// Multi-Match
{ "multi_match": { "query": "search text", "fields": ["title^2", "body"] } }

// Query String (Lucene syntax)
{ "query_string": { "query": "title:elasticsearch AND status:published" } }

Term-Level Queries

// Term (exact match, not analyzed)
{ "term": { "status": "published" } }

// Terms (multiple values)
{ "terms": { "status": ["published", "draft"] } }

// Range
{ "range": { "price": { "gte": 10, "lte": 100 } } }

// Exists
{ "exists": { "field": "email" } }

// Prefix
{ "prefix": { "username": "joh" } }

// Wildcard
{ "wildcard": { "email": "*@gmail.com" } }

Compound Queries

// Bool Query
{
    "bool": {
        "must": [ ... ],      // AND, affects score
        "should": [ ... ],    // OR, affects score
        "must_not": [ ... ],  // NOT, no score
        "filter": [ ... ]     // AND, no score (cached)
    }
}

Mapping & Analyzers

Field Types

Type	Description
`text`	Analyzed full-text
`keyword`	Exact value (not analyzed)
`long`, `integer`, `short`, `byte`	Integer numbers
`double`, `float`	Floating-point numbers
`boolean`	true/false
`date`	Date/datetime
`geo_point`	Lat/lon
`geo_shape`	Polygons, etc.
`nested`	Array of objects, each indexed so it can be queried independently
`object`	JSON object

Custom Analyzer

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "my_stemmer"]
                }
            },
            "filter": {
                "my_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }
    }
}
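
The analyzer above is a pipeline: the tokenizer splits the text, then each filter transforms the resulting tokens in order. The Java sketch below imitates the first stages only, as a rough approximation: splitting on non-alphanumeric characters stands in for the `standard` tokenizer, Unicode decomposition stands in for `asciifolding`, and the stemmer stage is left out. `AnalyzerSketch` and `analyze` are hypothetical names.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalyzerSketch {

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizer stage: split on anything that is not a letter or digit
        // (a rough stand-in for the Unicode-aware standard tokenizer).
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter yields an empty first element
            }
            // lowercase filter
            String token = raw.toLowerCase(Locale.ROOT);
            // asciifolding filter (approximation): decompose accented characters
            // and drop the combining marks
            token = Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
            // The stemmer filter is intentionally omitted from this sketch.
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Café RÉSUMÉ, Quick-Brown"));
    }
}
```

For the real token stream, including stemming, the `_analyze` API can be run against the index with `"analyzer": "my_analyzer"`.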

Index Management

Index Lifecycle

[Figure: Elasticsearch index lifecycle]

Index Templates

PUT /_index_template/logs_template
{
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1
        },
        "mappings": {
            "properties": {
                "@timestamp": { "type": "date" },
                "message": { "type": "text" }
            }
        }
    }
}

Reindexing

POST /_reindex
{
    "source": { "index": "old_index" },
    "dest": { "index": "new_index" }
}

Performance Optimization

Indexing

// Bulk indexing
POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Product 1", "price": 10 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Product 2", "price": 20 }
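
The `_bulk` body above is newline-delimited JSON: an action line followed by a source line for each document, with a newline after every line including the last. A minimal Java sketch of assembling such a body is shown below. It assumes that index names and IDs need no JSON escaping and that each source is already serialized as single-line JSON; `BulkBodySketch` and `build` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BulkBodySketch {

    // Builds a _bulk request body: one action line plus one source line per
    // document, each terminated by '\n' (the API requires the final newline).
    // sourcesById maps document ID -> already-serialized, single-line JSON source.
    static String build(String index, Map<String, String> sourcesById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : sourcesById.entrySet()) {
            body.append("{ \"index\": { \"_index\": \"").append(index)
                .append("\", \"_id\": \"").append(entry.getKey()).append("\" } }\n");
            body.append(entry.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>(); // keeps insertion order
        docs.put("1", "{ \"name\": \"Product 1\", \"price\": 10 }");
        docs.put("2", "{ \"name\": \"Product 2\", \"price\": 20 }");
        System.out.print(build("products", docs));
    }
}
```

The resulting string would be sent to `POST /_bulk` with the `Content-Type: application/x-ndjson` header.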

Search

// Use filter context for non-scoring queries
{
    "query": {
        "bool": {
            "must": { "match": { "title": "search" } },
            "filter": [
                { "term": { "status": "published" } },
                { "range": { "date": { "gte": "2024-01-01" } } }
            ]
        }
    }
}

// Limit fields returned
{
    "_source": ["title", "date"],
    "query": { ... }
}

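// Use search_after for deep pagination: pass the sort values of the last
// hit on the previous page; the second sort key acts as a tiebreaker
{
    "search_after": [1234, "doc_id"],
    "sort": [{ "date": "desc" }, { "_id": "asc" }]
}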
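
The `search_after` mechanics can be illustrated without a cluster. The sketch below keeps a small in-memory list, orders it the same way as the sort clause (date descending, then ID ascending), and returns only the hits that sort strictly after the cursor taken from the previous page. It is a simulation of the semantics under those assumptions, not client code; `SearchAfterSketch`, `Hit`, and `page` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SearchAfterSketch {

    static final class Hit {
        final long date;
        final String id;

        Hit(long date, String id) {
            this.date = date;
            this.id = id;
        }
    }

    // Same ordering as the sort clause: date descending, then id ascending as tiebreaker.
    static final Comparator<Hit> ORDER =
        Comparator.<Hit>comparingLong(h -> h.date).reversed().thenComparing(h -> h.id);

    // Returns up to `size` hits that sort strictly after `cursor` (null means first page).
    static List<Hit> page(List<Hit> index, Hit cursor, int size) {
        List<Hit> sorted = new ArrayList<>(index);
        sorted.sort(ORDER);
        List<Hit> out = new ArrayList<>();
        for (Hit hit : sorted) {
            if (cursor != null && ORDER.compare(hit, cursor) <= 0) {
                continue; // at or before the cursor: already returned on an earlier page
            }
            out.add(hit);
            if (out.size() == size) {
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Hit> index = List.of(new Hit(300, "a"), new Hit(200, "b"), new Hit(200, "a"),
                                  new Hit(100, "c"), new Hit(300, "b"));
        Hit cursor = null;
        List<Hit> hits;
        while (!(hits = page(index, cursor, 2)).isEmpty()) {
            for (Hit hit : hits) {
                System.out.println(hit.date + " " + hit.id);
            }
            // The sort values of the last hit become the next search_after cursor
            cursor = hits.get(hits.size() - 1);
        }
    }
}
```

Without the tiebreaker, two hits sharing the same date could be skipped or repeated at a page boundary, which is why the sort needs a second, unique key.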

Trade-offs

Pros	Cons
Powerful full-text search	Complex to operate at scale
Near real-time	Eventually consistent
Horizontal scalability	Memory intensive
Rich query DSL	Not suited as a primary data store
Great for analytics	High-cardinality aggregations are expensive
Schema-flexible (dynamic mapping)	Changing an existing field's mapping usually requires a reindex
REST API	No transactions
Aggregations	No joins (denormalize)

Performance Characteristics

Metric	Typical Value
Indexing-to-search visibility	~1 second (near real-time; default refresh interval)
Search latency	10-100ms
Throughput	10,000+ docs/sec/node (varies with document size and hardware)
Shard size	10-50 GB recommended
Shards per index	1-5 (avoid over-sharding)
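
The shard-size guideline in the table turns into simple arithmetic: divide the expected index size by the target shard size and round up. The sketch below is a rough sizing aid under that assumption only; it ignores growth, replicas, and query load, and `ShardSizingSketch` and `primaryShards` are hypothetical names.

```java
final class ShardSizingSketch {

    // Smallest primary shard count that keeps each shard at or below the target size.
    static int primaryShards(double indexSizeGb, double targetShardGb) {
        if (indexSizeGb <= 0 || targetShardGb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return Math.max(1, (int) Math.ceil(indexSizeGb / targetShardGb));
    }

    public static void main(String[] args) {
        System.out.println(primaryShards(120, 40)); // 120 GB at 40 GB per shard -> 3
        System.out.println(primaryShards(8, 30));   // small index -> 1
    }
}
```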

When to Use Elasticsearch

Good For:

- Full-text search
- Log/event analytics
- Product search
- Autocomplete
- Geo-spatial search
- Metrics and monitoring
- Security analytics (SIEM)

Not Good For:

- Primary database
- Transactional data
- Strong consistency needs
- Frequent updates to same document
- Complex relational queries

Elasticsearch vs Alternatives

Feature	Elasticsearch	Solr	OpenSearch	Algolia
Full-text search	Excellent	Excellent	Excellent	Excellent
Analytics	Excellent	Good	Excellent	Limited
Ease of use	Good	Moderate	Good	Excellent
Managed options	Yes	Limited	Yes	Yes (SaaS)
License	Elastic License 2.0/SSPL/AGPLv3	Apache 2.0	Apache 2.0	Proprietary
Real-time	Yes	Yes	Yes	Yes

Best Practices

Right-size shards - 10-50GB per shard
Don't over-shard - More shards ≠ better performance
Use aliases - For zero-downtime reindexing
Bulk for indexing - Avoid single-document inserts at scale
Use filters - For non-scoring queries (cached)
Denormalize data - No joins, embed related data
Set explicit mappings - Don't rely on dynamic mapping in production
Index templates - For consistent settings across indices
Separate hot/warm/cold - Lifecycle management
Monitor cluster health - Yellow = some replica shards unassigned, Red = some primary shards unassigned

Common API Endpoints

# Cluster
GET /_cluster/health
GET /_cluster/stats
GET /_cat/nodes?v
GET /_cat/indices?v
GET /_cat/shards?v

# Index
PUT /my_index
DELETE /my_index
GET /my_index/_mapping
GET /my_index/_settings

# Document
PUT /my_index/_doc/1 { ... }
GET /my_index/_doc/1
DELETE /my_index/_doc/1
POST /my_index/_update/1 { "doc": { ... } }

# Search
GET /my_index/_search { "query": { ... } }
POST /my_index/_search { "query": { ... } }

# Bulk
POST /_bulk { ... }

# Analyze
GET /_analyze { "analyzer": "standard", "text": "Hello World" }
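
Any of the endpoints above can be called with a plain HTTP client. The sketch below builds a `POST /my_index/_search` request with the JDK's `java.net.http` API (Java 11+). It assumes an unsecured local node on the default HTTP port 9200; a secured cluster would additionally need HTTPS and credentials. `RestCallSketch` and `searchRequest` are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RestCallSketch {

    // Assumption: unsecured local node on the default HTTP port.
    static final String BASE_URL = "http://localhost:9200";

    // Builds (but does not send) a search request for the given index.
    static HttpRequest searchRequest(String index, String queryJson) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/" + index + "/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(queryJson))
            .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = searchRequest("my_index", "{ \"query\": { \"match_all\": {} } }");
        System.out.println(request.method() + " " + request.uri());

        // Sending needs a reachable cluster, so it only happens when asked for.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```

For anything beyond quick checks, the official Java API client used in the e-commerce example gives typed requests and responses instead of hand-written JSON strings.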