HyperLogLog

Introduction

HyperLogLog (HLL) is a probabilistic data structure used to estimate the cardinality (count of distinct elements) of a set. It can count billions of unique items using only ~12 KB of memory.

Key Properties

Property         Description
---------------  ----------------------------------------------
Space efficient  O(log log n) space for n distinct elements
Fixed memory     ~12 KB for billions of elements
Fast             O(1) add, O(m) count (m = number of registers)
Approximate      Typical error rate of ~2%
Mergeable        Can combine HLLs from distributed systems

The Problem It Solves

Exact counting:
- Need to store all unique elements
- 1 billion UUIDs = ~16 GB memory

HyperLogLog:
- Estimate count with 2% error
- Only 12 KB memory

How It Works

Core Insight

When you hash random values, the probability of seeing a hash with:
- Leading 0: 50%
- Leading 00: 25%
- Leading 000: 12.5%
- Leading k zeros: 1/2^k

If you've seen a hash with k leading zeros, you've probably seen ~2^k distinct elements.
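This insight can be demonstrated with a minimal single-register sketch: hash a stream of distinct values, remember only the maximum rank (leading zeros + 1) ever seen, and use 2^(maxRank-1) as the estimate. The `mix` function below is the SplitMix64 finalizer, used here only as a stand-in for a proper hash.

```java
public class SingleRegisterSketch {
    // SplitMix64 finalizer, standing in for a well-distributed hash function
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public static void main(String[] args) {
        int n = 100_000;   // distinct elements fed in
        int maxRank = 0;
        for (int i = 0; i < n; i++) {
            // rank = number of leading zeros + 1 (position of the first 1-bit)
            int rank = Long.numberOfLeadingZeros(mix(i)) + 1;
            maxRank = Math.max(maxRank, rank);
        }
        // One register gives the right order of magnitude but is very noisy;
        // HLL tames the variance by averaging many registers.
        System.out.println("n = " + n + ", estimate ~ 2^" + (maxRank - 1)
                + " = " + (1L << (maxRank - 1)));
    }
}
```

A single register's estimate can easily be off by a factor of 2 or more in either direction, which is exactly why the full algorithm splits the hash across many registers and averages them.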

Algorithm Steps

1. Hash and Split

element → hash(element) → [prefix | suffix]
                          └─p bits─┘ └─rest─┘

Prefix determines which register (bucket)
Suffix is used to count leading zeros
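As a sketch of this split for a 32-bit hash (the hash constant here is an arbitrary illustrative value):

```java
public class HashSplit {
    public static void main(String[] args) {
        int p = 4;  // 2^4 = 16 registers
        int hash = 0b0010_0001_1010_0000_0000_0000_0000_0110; // example hash value

        int prefix = hash >>> (32 - p);             // top p bits: register index
        int suffix = hash & ((1 << (32 - p)) - 1);  // remaining 32 - p bits

        System.out.println("register index = " + prefix);  // prints 2
        // leading zeros within the 28-bit suffix, plus 1 for the rank
        System.out.println("suffix rank    = "
                + (Integer.numberOfLeadingZeros(suffix) - p + 1));  // prints 4
    }
}
```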

2. Count Leading Zeros

int countLeadingZeros(String hashSuffix) {
    int count = 0;
    for (char bit : hashSuffix.toCharArray()) {
        if (bit == '0') {
            count++;
        } else {
            break;
        }
    }
    return count + 1; // Position of first 1-bit
}

3. Update Register

int registerIndex = hashPrefix; // Which register to update
int leadingZeros = countLeadingZeros(hashSuffix);
registers[registerIndex] = Math.max(registers[registerIndex], leadingZeros);

4. Estimate Cardinality

double estimate() {
    // Harmonic mean of 2^register values
    double indicator = 0;
    for (int register : registers) {
        indicator += Math.pow(2, -register);
    }
    double rawEstimate = alpha * m * m / indicator;
    return applyCorrections(rawEstimate);
}

Visual Example

m = 4 registers (using 2-bit prefix)

Stream: [A, B, C, D, E, A, F, G]

hash(A) = 00|101001  → Register 0, leading zeros = 0 → max(R[0], 1) = 1
hash(B) = 01|011010  → Register 1, leading zeros = 1 → max(R[1], 2) = 2
hash(C) = 10|000101  → Register 2, leading zeros = 3 → max(R[2], 4) = 4
hash(D) = 11|110000  → Register 3, leading zeros = 0 → max(R[3], 1) = 1
hash(E) = 00|001100  → Register 0, leading zeros = 2 → max(R[0], 3) = 3
hash(A) = 00|101001  → Register 0, leading zeros = 0 → max(R[0], 3) = 3 (no change)
hash(F) = 01|000010  → Register 1, leading zeros = 4 → max(R[1], 5) = 5
hash(G) = 10|100000  → Register 2, leading zeros = 0 → max(R[2], 4) = 4 (no change)

Final registers: [3, 5, 4, 1]

Estimate = α × m × m / (2^-3 + 2^-5 + 2^-4 + 2^-1)
        ≈ 0.673 × 4 × 4 / (0.125 + 0.031 + 0.0625 + 0.5)
        ≈ 10.77 / 0.72
        ≈ 15 (actual = 7; with only 4 registers the estimate is noisy, but the error shrinks as m grows)
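The arithmetic above can be checked directly in code, plugging in the final register values from the example:

```java
public class WorkedEstimate {
    public static void main(String[] args) {
        int[] registers = {3, 5, 4, 1};  // final registers from the example
        int m = registers.length;        // 4 registers
        double alpha = 0.673;            // bias constant used in the example

        double indicator = 0;
        for (int r : registers) {
            indicator += Math.pow(2, -r); // harmonic-mean denominator
        }
        double estimate = alpha * m * m / indicator;
        System.out.printf("estimate = %.2f (actual distinct = 7)%n", estimate);
    }
}
```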

Mathematical Foundation

Register Count

m = 2^p registers
p = precision parameter (typically 4-16)

More registers = more accuracy = more memory

Standard Error

Standard Error ≈ 1.04 / √m

For m = 16,384 (p = 14):
Error ≈ 1.04 / √16384 ≈ 0.81%

Memory Usage

Memory = m × register_size
       = 2^p × 6 bits (a 6-bit register stores values 0-63, enough for the maximum rank)

For p = 14:
Memory = 16,384 × 6 bits = 12 KB

Precision vs Memory Trade-off

p    Registers  Memory     Error Rate
---  ---------  ---------  ----------
4    16         12 bytes   26%
10   1,024      768 bytes  3.25%
12   4,096      3 KB       1.625%
14   16,384     12 KB      0.81%
16   65,536     48 KB      0.41%

Common choice: p = 14 (~12 KB, ~0.81% error)
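Every row of the trade-off table follows from the two formulas above, so it can be regenerated mechanically:

```java
public class PrecisionTradeoff {
    public static void main(String[] args) {
        for (int p : new int[]{4, 10, 12, 14, 16}) {
            int m = 1 << p;                        // number of registers
            int memoryBytes = m * 6 / 8;           // 6-bit registers
            double errorPct = 1.04 / Math.sqrt(m) * 100;
            System.out.printf("p=%2d  m=%,7d  memory=%,7d bytes  error=%.2f%%%n",
                    p, m, memoryBytes, errorPct);
        }
    }
}
```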


Implementation

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HyperLogLog {
    private final int p;        // precision
    private final int m;        // number of registers (2^p)
    private final int[] registers;
    private final double alpha; // bias correction constant

    public HyperLogLog(int precision) {
        this.p = precision;
        this.m = 1 << precision; // 2^p registers
        this.registers = new int[m];

        // Alpha constant for bias correction
        if (m == 16) {
            alpha = 0.673;
        } else if (m == 32) {
            alpha = 0.697;
        } else if (m == 64) {
            alpha = 0.709;
        } else {
            alpha = 0.7213 / (1 + 1.079 / m);
        }
    }

    public HyperLogLog() {
        this(14); // default precision
    }

    private int hash(String item) {
        return Hashing.murmur3_32_fixed()
                .hashString(item, StandardCharsets.UTF_8)
                .asInt();
    }

    private int getRegisterIndex(int hashVal) {
        // First p bits determine register
        return hashVal >>> (32 - p);
    }

    private int countLeadingZeros(int hashVal) {
        // Count leading zeros in remaining bits
        int remaining = hashVal & ((1 << (32 - p)) - 1);
        if (remaining == 0) {
            return 32 - p;
        }
        return (32 - p) - (32 - Integer.numberOfLeadingZeros(remaining));
    }

    public void add(String item) {
        int hashVal = hash(item);
        int registerIdx = getRegisterIndex(hashVal);
        int leadingZeros = countLeadingZeros(hashVal) + 1;
        registers[registerIdx] = Math.max(registers[registerIdx], leadingZeros);
    }

    public long count() {
        // Raw estimate using harmonic mean
        double indicator = 0;
        for (int r : registers) {
            indicator += Math.pow(2.0, -r);
        }
        double rawEstimate = alpha * m * m / indicator;

        // Small range correction
        if (rawEstimate <= 2.5 * m) {
            int zeros = 0;
            for (int r : registers) {
                if (r == 0) zeros++;
            }
            if (zeros > 0) {
                return Math.round(m * Math.log((double) m / zeros));
            }
        }

        // Large range correction
        if (rawEstimate > (1L << 32) / 30.0) {
            return Math.round(-(1L << 32) * Math.log(1 - rawEstimate / (1L << 32)));
        }

        return Math.round(rawEstimate);
    }

    /** Merge another HLL into this one */
    public void merge(HyperLogLog other) {
        if (this.p != other.p) {
            throw new IllegalArgumentException("Precision mismatch");
        }
        for (int i = 0; i < m; i++) {
            registers[i] = Math.max(registers[i], other.registers[i]);
        }
    }

    // Expose registers for union/intersection operations
    public int[] getRegisters() { return registers; }
    public int getP() { return p; }
    public int getM() { return m; }

    // Usage
    public static void main(String[] args) {
        HyperLogLog hll = new HyperLogLog(14);

        // Add 1 million unique items
        for (int i = 0; i < 1_000_000; i++) {
            hll.add("user_" + i);
        }

        System.out.println("Estimated: " + hll.count());  // ~1,000,000 (+-0.81%)
        System.out.println("Memory: " + (hll.registers.length * 6 / 8) + " bytes"); // ~12 KB
    }
}

Use Cases

1. Unique Visitors Counting

// Website analytics
HyperLogLog dailyVisitors = new HyperLogLog();

void trackVisit(String userId) {
    dailyVisitors.add(userId);
}

long getUniqueVisitors() {
    return dailyVisitors.count();
}

2. Database Cardinality Estimation

-- Exact count: accurate but slow and memory-hungry on large tables
SELECT COUNT(DISTINCT user_id) FROM events;

-- With HLL extension
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id)))
FROM events;

3. Distributed Unique Counting

// Each server maintains local HLL
HyperLogLog server1Hll = new HyperLogLog();
HyperLogLog server2Hll = new HyperLogLog();
HyperLogLog server3Hll = new HyperLogLog();

// Merge for global count
HyperLogLog globalHll = new HyperLogLog();
globalHll.merge(server1Hll);
globalHll.merge(server2Hll);
globalHll.merge(server3Hll);

long totalUnique = globalHll.count();

4. Real-Time Analytics Dashboard

Problem: Count unique users per minute/hour/day

Solution:
- Minute HLL: Track current minute
- Hour HLL: Merge minute HLLs
- Day HLL: Merge hour HLLs

Each level is independently queryable
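The rollup can be sketched as follows. `MiniHll` here is a toy mergeable sketch (16 registers, merged by register-wise max) standing in for a real HLL; the class and names are invented for illustration, and in practice the document's `HyperLogLog` class or Redis PFMERGE would play this role.

```java
public class RollupDemo {
    /** Toy mergeable sketch standing in for a real HLL. */
    static class MiniHll {
        final int[] registers = new int[16];   // p = 4

        void add(String item) {
            int h = item.hashCode() * 0x9E3779B9;  // cheap extra mixing
            int idx = h >>> 28;                    // top 4 bits: register index
            int rank = Integer.numberOfLeadingZeros((h << 4) | 1) + 1;
            registers[idx] = Math.max(registers[idx], rank);
        }

        void merge(MiniHll other) {                // register-wise max
            for (int i = 0; i < registers.length; i++)
                registers[i] = Math.max(registers[i], other.registers[i]);
        }
    }

    public static void main(String[] args) {
        MiniHll[] minutes = new MiniHll[60];
        for (int i = 0; i < 60; i++) minutes[i] = new MiniHll();

        // Simulate traffic: each visit lands in its minute bucket
        minutes[0].add("alice");
        minutes[0].add("bob");
        minutes[59].add("alice");                  // repeat visitor

        // Hour view = merge of its minute sketches; a day view would
        // merge hour sketches the same way
        MiniHll hour = new MiniHll();
        for (MiniHll minute : minutes) hour.merge(minute);
    }
}
```

Because merge is just a register-wise max, the hour and day views cost nothing extra per event: they are built on demand from the sketches already collected.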

5. Network Traffic Analysis

// Count unique source IPs
HyperLogLog uniqueSources = new HyperLogLog();

void processPacket(Packet packet) {
    uniqueSources.add(packet.getSourceIp());

    if (uniqueSources.count() > THRESHOLD) {
        alert("Possible DDoS - too many unique IPs");
    }
}

HyperLogLog Operations

Merge (Union)

/** Count distinct elements in either set */
static HyperLogLog union(HyperLogLog hll1, HyperLogLog hll2) {
    if (hll1.getP() != hll2.getP()) {
        throw new IllegalArgumentException("Precision mismatch");
    }
    HyperLogLog result = new HyperLogLog(hll1.getP());
    for (int i = 0; i < result.getM(); i++) {
        result.getRegisters()[i] = Math.max(
            hll1.getRegisters()[i], hll2.getRegisters()[i]
        );
    }
    return result;
}

// |A ∪ B|

Intersection (Approximate)

/** Estimate |A ∩ B| using inclusion-exclusion */
static long intersectionEstimate(HyperLogLog hll1, HyperLogLog hll2, HyperLogLog hllUnion) {
    // |A ∩ B| = |A| + |B| - |A ∪ B|; clamp at 0 because estimation
    // error can push the difference negative
    return Math.max(0, hll1.count() + hll2.count() - hllUnion.count());
}

Note: Intersection is less accurate than union.


HyperLogLog vs Alternatives

Method           Space         Error                      Operations
---------------  ------------  -------------------------  -----------------
Exact (HashSet)  O(n)          0%                         Add, Count
Linear Counting  O(n)          Low                        Add, Count
LogLog           O(log log n)  ~1.30/√m                   Add, Count
HyperLogLog      O(log log n)  ~1.04/√m                   Add, Count, Merge
HyperLogLog++    O(log log n)  Better at low cardinality  Add, Count, Merge

Evolution

Linear Counting → LogLog → SuperLogLog → HyperLogLog → HyperLogLog++
(Most space)                                          (Best accuracy)

Common Interview Questions

1. How does HyperLogLog achieve O(log log n) space?

- We're not storing elements, just "max leading zeros seen"
- Leading zeros ~ log(cardinality)
- We need log(log(n)) bits to store count of leading zeros
- With m registers: O(m × log log n) bits

2. Why use harmonic mean instead of arithmetic mean?

Harmonic mean is more robust to outliers.

If one register has an unusually high value (rare event),
arithmetic mean would overestimate significantly.
Harmonic mean (via 2^-r) diminishes the impact of outliers.
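A quick numeric comparison makes the difference concrete. Suppose nine registers agree and one holds an outlier from a rare long zero-run:

```java
public class MeanComparison {
    public static void main(String[] args) {
        // Nine registers agree (value 5); one saw a rare long zero-run (value 20)
        int[] registers = {5, 5, 5, 5, 5, 5, 5, 5, 5, 20};
        int m = registers.length;

        double arithmetic = 0, indicator = 0;
        for (int r : registers) {
            arithmetic += Math.pow(2, r);   // arithmetic mean of 2^r
            indicator += Math.pow(2, -r);   // harmonic-mean denominator
        }
        arithmetic /= m;
        double harmonic = m / indicator;

        // The outlier drags the arithmetic mean to ~10^5, while the
        // harmonic mean stays near 2^5 = 32
        System.out.printf("arithmetic: %.1f, harmonic: %.1f%n", arithmetic, harmonic);
    }
}
```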

3. What is the bias correction for?

Raw HLL estimate has systematic bias:
- Small cardinalities: Use Linear Counting (count empty registers)
- Large cardinalities: Adjust for hash collision probability
- Medium range: Use standard HLL estimate

This is the "HyperLogLog++" improvement.

4. How accurate is the intersection estimate?

Less accurate than union because:
|A ∩ B| = |A| + |B| - |A ∪ B|

Errors compound:
- If each has 2% error
- Intersection could have 6% error

For small intersections, relative error is very high.
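A worked example with hypothetical numbers shows how badly this can go. Take |A| = |B| = 100 and |A ∪ B| = 190, so the true intersection is 10:

```java
public class IntersectionError {
    public static void main(String[] args) {
        // Hypothetical true values: |A| = |B| = 100, |A ∪ B| = 190, |A ∩ B| = 10
        double a = 100, b = 100, union = 190;
        double err = 0.02;                  // 2% per-sketch error
        double trueInter = a + b - union;

        // Worst case: A and B overestimated, union underestimated,
        // so the errors all add through the subtraction
        double worst = a * (1 + err) + b * (1 + err) - union * (1 - err);
        System.out.printf("true = %.0f, worst-case estimate = %.1f (%.0f%% relative error)%n",
                trueInter, worst, (worst - trueInter) / trueInter * 100);
    }
}
```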

5. Can HyperLogLog handle deletions?

No! Like Bloom filters, HLL only supports:
- Add element
- Estimate count
- Merge with another HLL

To "delete": Create new HLL without the element

Real-World Implementations

System         Usage
-------------  -----------------------------------
Redis          PFADD, PFCOUNT, PFMERGE commands
PostgreSQL     postgresql-hll extension (hll type)
BigQuery       APPROX_COUNT_DISTINCT()
Presto         approx_distinct()
Spark          approx_count_distinct()
Druid          hyperUnique aggregator
Elasticsearch  cardinality aggregation

Redis Example

# Add elements
PFADD visitors user1 user2 user3

# Get count
PFCOUNT visitors
# Returns: 3

# Merge multiple HLLs
PFMERGE total day1 day2 day3
PFCOUNT total

HyperLogLog++ Improvements

Google's improvements over standard HLL:

1. Sparse Representation

For small cardinalities:
- Store (register_index, value) pairs
- Converts to dense when threshold exceeded
- Saves memory for small sets
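The sparse-to-dense switch can be sketched roughly as below. The class, method names, and the conversion threshold are invented for illustration; HLL++ uses a more compact pair encoding and a different cutoff.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseRegisters {
    private Map<Integer, Integer> sparse = new HashMap<>(); // (index, value) pairs
    private int[] dense;                 // null while still sparse
    private final int m;
    private final int threshold;

    public SparseRegisters(int m) {
        this.m = m;
        this.threshold = m / 4;          // illustrative cutoff, not HLL++'s
    }

    public void update(int index, int rank) {
        if (dense != null) {
            dense[index] = Math.max(dense[index], rank);
            return;
        }
        sparse.merge(index, rank, Math::max);
        if (sparse.size() > threshold) { // too many pairs: convert to dense
            dense = new int[m];
            sparse.forEach((i, r) -> dense[i] = r);
            sparse = null;
        }
    }

    public int get(int index) {
        return dense != null ? dense[index] : sparse.getOrDefault(index, 0);
    }

    public static void main(String[] args) {
        SparseRegisters sr = new SparseRegisters(1 << 14);
        sr.update(7, 3);
        System.out.println("r[7] = " + sr.get(7)); // still sparse: one pair stored
    }
}
```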

2. Bias Correction

- Empirically derived correction values
- Better accuracy for small to medium cardinalities
- Lookup table based on raw estimate

3. 64-bit Hashing

- Standard HLL uses 32-bit hashes
- HLL++ uses 64-bit hashes
- Better for very large cardinalities

Comparison Summary

Cardinality    Exact (HashSet)  HyperLogLog
-------------  ---------------  -----------
1,000          ~40 KB           12 KB
100,000        ~4 MB            12 KB
10,000,000     ~400 MB          12 KB
1,000,000,000  ~40 GB           12 KB

HyperLogLog memory is constant regardless of cardinality!


Key Takeaways

  1. Count distinct - Estimates unique elements in a set/stream
  2. Constant memory - ~12 KB for any cardinality (with p=14)
  3. ~0.81% error - Typical with standard precision
  4. Mergeable - Perfect for distributed systems
  5. No deletions - Add-only data structure
  6. Leading zeros insight - Rare patterns indicate more elements
  7. Used everywhere - Redis, databases, analytics systems