HyperLogLog

Introduction

HyperLogLog (HLL) is a probabilistic data structure used to estimate the cardinality (count of distinct elements) of a set. It can count billions of unique items using only ~12 KB of memory.

Key Properties

Property         Description
---------------  ----------------------------------------------
Space efficient  O(log log n) space for n distinct elements
Fixed memory     ~12 KB for billions of elements
Fast             O(1) add, O(m) count (m = number of registers)
Approximate      Typical error rate of ~2%
Mergeable        Can combine HLLs from distributed systems

The Problem It Solves

Exact counting:
- Need to store all unique elements
- 1 billion UUIDs = ~16 GB memory

HyperLogLog:
- Estimate count with 2% error
- Only 12 KB memory

How It Works

Core Insight

When you hash random values, the probability of seeing a hash with:
- Leading 0: 50%
- Leading 00: 25%
- Leading 000: 12.5%
- Leading k zeros: 1/2^k

If you've seen a hash with k leading zeros, you've probably seen ~2^k distinct elements.
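This insight can be demonstrated with a minimal single-register sketch: hash a stream of distinct values, remember only the maximum rank (leading zeros + 1) ever seen, and use 2^(maxRank-1) as the estimate. The `mix` function below is the SplitMix64 finalizer, used here only as a stand-in for a proper hash.

```java
public class SingleRegisterSketch {
    // SplitMix64 finalizer, standing in for a well-distributed hash function
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public static void main(String[] args) {
        int n = 100_000;   // distinct elements fed in
        int maxRank = 0;
        for (int i = 0; i < n; i++) {
            // rank = number of leading zeros + 1 (position of the first 1-bit)
            int rank = Long.numberOfLeadingZeros(mix(i)) + 1;
            maxRank = Math.max(maxRank, rank);
        }
        // One register gives the right order of magnitude but is very noisy;
        // HLL tames the variance by averaging many registers.
        System.out.println("n = " + n + ", estimate ~ 2^" + (maxRank - 1)
                + " = " + (1L << (maxRank - 1)));
    }
}
```

A single register's estimate can easily be off by a factor of 2 or more in either direction, which is exactly why the full algorithm splits the hash across many registers and averages them.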

Algorithm Steps

1. Hash and Split

element → hash(element) → [prefix | suffix]
                          └─p bits─┘ └─rest─┘

Prefix determines which register (bucket)
Suffix is used to count leading zeros
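As a sketch of this split for a 32-bit hash (the hash constant here is an arbitrary illustrative value):

```java
public class HashSplit {
    public static void main(String[] args) {
        int p = 4;  // 2^4 = 16 registers
        int hash = 0b0010_0001_1010_0000_0000_0000_0000_0110; // example hash value

        int prefix = hash >>> (32 - p);             // top p bits: register index
        int suffix = hash & ((1 << (32 - p)) - 1);  // remaining 32 - p bits

        System.out.println("register index = " + prefix);  // prints 2
        // leading zeros within the 28-bit suffix, plus 1 for the rank
        System.out.println("suffix rank    = "
                + (Integer.numberOfLeadingZeros(suffix) - p + 1));  // prints 4
    }
}
```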

2. Count Leading Zeros

int countLeadingZeros(String hashSuffix) {
    int count = 0;
    for (char bit : hashSuffix.toCharArray()) {
        if (bit == '0') {
            count++;
        } else {
            break;
        }
    }
    return count + 1; // Position of first 1-bit
}

3. Update Register

int registerIndex = hashPrefix; // Which register to update
int leadingZeros = countLeadingZeros(hashSuffix);
registers[registerIndex] = Math.max(registers[registerIndex], leadingZeros);

4. Estimate Cardinality

double estimate() {
    // Harmonic mean of 2^register values
    double indicator = 0;
    for (int register : registers) {
        indicator += Math.pow(2, -register);
    }
    double rawEstimate = alpha * m * m / indicator;
    return applyCorrections(rawEstimate);
}

Visual Example

m = 4 registers (using 2-bit prefix)

Stream: [A, B, C, D, E, A, F, G]

hash(A) = 00|101001  → Register 0, leading zeros = 0 → max(R[0], 1) = 1
hash(B) = 01|011010  → Register 1, leading zeros = 1 → max(R[1], 2) = 2
hash(C) = 10|000101  → Register 2, leading zeros = 3 → max(R[2], 4) = 4
hash(D) = 11|110000  → Register 3, leading zeros = 0 → max(R[3], 1) = 1
hash(E) = 00|001100  → Register 0, leading zeros = 2 → max(R[0], 3) = 3
hash(A) = 00|101001  → Register 0, leading zeros = 0 → max(R[0], 3) = 3 (no change)
hash(F) = 01|000010  → Register 1, leading zeros = 4 → max(R[1], 5) = 5
hash(G) = 10|100000  → Register 2, leading zeros = 0 → max(R[2], 4) = 4 (no change)

Final registers: [3, 5, 4, 1]

Estimate = α × m × m / (2^-3 + 2^-5 + 2^-4 + 2^-1)
        ≈ 0.673 × 4 × 4 / (0.125 + 0.031 + 0.0625 + 0.5)
        ≈ 10.77 / 0.72
        ≈ 15 (actual = 7; with only 4 registers the estimate is noisy, but the error shrinks as m grows)
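The arithmetic above can be checked directly in code, plugging in the final register values from the example:

```java
public class WorkedEstimate {
    public static void main(String[] args) {
        int[] registers = {3, 5, 4, 1};  // final registers from the example
        int m = registers.length;        // 4 registers
        double alpha = 0.673;            // bias constant used in the example

        double indicator = 0;
        for (int r : registers) {
            indicator += Math.pow(2, -r); // harmonic-mean denominator
        }
        double estimate = alpha * m * m / indicator;
        System.out.printf("estimate = %.2f (actual distinct = 7)%n", estimate);
    }
}
```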

Mathematical Foundation

Register Count

m = 2^p registers
p = precision parameter (typically 4-16)

More registers = more accuracy = more memory

Standard Error

Standard Error ≈ 1.04 / √m

For m = 16,384 (p = 14):
Error ≈ 1.04 / √16384 ≈ 0.81%

Memory Usage

Memory = m × register_size
       = 2^p × 6 bits (a 6-bit register stores values 0-63, enough for the maximum rank)

For p = 14:
Memory = 16,384 × 6 bits = 12 KB

Precision vs Memory Trade-off

p    Registers  Memory     Error Rate
---  ---------  ---------  ----------
4    16         12 bytes   26%
10   1,024      768 bytes  3.25%
12   4,096      3 KB       1.625%
14   16,384     12 KB      0.81%
16   65,536     48 KB      0.41%

Common choice: p = 14 (~12 KB, ~0.81% error)
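Every row of the trade-off table follows from the two formulas above, so it can be regenerated mechanically:

```java
public class PrecisionTradeoff {
    public static void main(String[] args) {
        for (int p : new int[]{4, 10, 12, 14, 16}) {
            int m = 1 << p;                        // number of registers
            int memoryBytes = m * 6 / 8;           // 6-bit registers
            double errorPct = 1.04 / Math.sqrt(m) * 100;
            System.out.printf("p=%2d  m=%,7d  memory=%,7d bytes  error=%.2f%%%n",
                    p, m, memoryBytes, errorPct);
        }
    }
}
```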


Implementation

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HyperLogLog {
    private final int p;        // precision
    private final int m;        // number of registers (2^p)
    private final int[] registers;
    private final double alpha; // bias correction constant

    public HyperLogLog(int precision) {
        this.p = precision;
        this.m = 1 << precision; // 2^p registers
        this.registers = new int[m];

        // Alpha constant for bias correction
        if (m == 16) {
            alpha = 0.673;
        } else if (m == 32) {
            alpha = 0.697;
        } else if (m == 64) {
            alpha = 0.709;
        } else {
            alpha = 0.7213 / (1 + 1.079 / m);
        }
    }

    public HyperLogLog() {
        this(14); // default precision
    }

    private int hash(String item) {
        return Hashing.murmur3_32_fixed()
                .hashString(item, StandardCharsets.UTF_8)
                .asInt();
    }

    private int getRegisterIndex(int hashVal) {
        // First p bits determine register
        return hashVal >>> (32 - p);
    }

    private int countLeadingZeros(int hashVal) {
        // Count leading zeros in remaining bits
        int remaining = hashVal & ((1 << (32 - p)) - 1);
        if (remaining == 0) {
            return 32 - p;
        }
        return (32 - p) - (32 - Integer.numberOfLeadingZeros(remaining));
    }

    public void add(String item) {
        int hashVal = hash(item);
        int registerIdx = getRegisterIndex(hashVal);
        int leadingZeros = countLeadingZeros(hashVal) + 1;
        registers[registerIdx] = Math.max(registers[registerIdx], leadingZeros);
    }

    public long count() {
        // Raw estimate using harmonic mean
        double indicator = 0;
        for (int r : registers) {
            indicator += Math.pow(2.0, -r);
        }
        double rawEstimate = alpha * m * m / indicator;

        // Small range correction
        if (rawEstimate <= 2.5 * m) {
            int zeros = 0;
            for (int r : registers) {
                if (r == 0) zeros++;
            }
            if (zeros > 0) {
                return Math.round(m * Math.log((double) m / zeros));
            }
        }

        // Large range correction
        if (rawEstimate > (1L << 32) / 30.0) {
            return Math.round(-(1L << 32) * Math.log(1 - rawEstimate / (1L << 32)));
        }

        return Math.round(rawEstimate);
    }

    /** Merge another HLL into this one */
    public void merge(HyperLogLog other) {
        if (this.p != other.p) {
            throw new IllegalArgumentException("Precision mismatch");
        }
        for (int i = 0; i < m; i++) {
            registers[i] = Math.max(registers[i], other.registers[i]);
        }
    }

    // Expose registers for union/intersection operations
    public int[] getRegisters() { return registers; }
    public int getP() { return p; }
    public int getM() { return m; }

    // Usage
    public static void main(String[] args) {
        HyperLogLog hll = new HyperLogLog(14);

        // Add 1 million unique items
        for (int i = 0; i < 1_000_000; i++) {
            hll.add("user_" + i);
        }

        System.out.println("Estimated: " + hll.count());  // ~1,000,000 (+-0.81%)
        System.out.println("Memory: " + (hll.registers.length * 6 / 8) + " bytes"); // ~12 KB
    }
}

Use Cases

1. Unique Visitors Counting

// Website analytics
HyperLogLog dailyVisitors = new HyperLogLog();

void trackVisit(String userId) {
    dailyVisitors.add(userId);
}

long getUniqueVisitors() {
    return dailyVisitors.count();
}

2. Database Cardinality Estimation

-- Exact count: accurate but slow and memory-hungry on large tables
SELECT COUNT(DISTINCT user_id) FROM events;

-- With HLL extension
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id)))
FROM events;

3. Distributed Unique Counting

// Each server maintains local HLL
HyperLogLog server1Hll = new HyperLogLog();
HyperLogLog server2Hll = new HyperLogLog();
HyperLogLog server3Hll = new HyperLogLog();

// Merge for global count
HyperLogLog globalHll = new HyperLogLog();
globalHll.merge(server1Hll);
globalHll.merge(server2Hll);
globalHll.merge(server3Hll);

long totalUnique = globalHll.count();

4. Real-Time Analytics Dashboard

Problem: Count unique users per minute/hour/day

Solution:
- Minute HLL: Track current minute
- Hour HLL: Merge minute HLLs
- Day HLL: Merge hour HLLs

Each level is independently queryable
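The rollup can be sketched as follows. `MiniHll` here is a toy mergeable sketch (16 registers, merged by register-wise max) standing in for a real HLL; the class and names are invented for illustration, and in practice the document's `HyperLogLog` class or Redis PFMERGE would play this role.

```java
public class RollupDemo {
    /** Toy mergeable sketch standing in for a real HLL. */
    static class MiniHll {
        final int[] registers = new int[16];   // p = 4

        void add(String item) {
            int h = item.hashCode() * 0x9E3779B9;  // cheap extra mixing
            int idx = h >>> 28;                    // top 4 bits: register index
            int rank = Integer.numberOfLeadingZeros((h << 4) | 1) + 1;
            registers[idx] = Math.max(registers[idx], rank);
        }

        void merge(MiniHll other) {                // register-wise max
            for (int i = 0; i < registers.length; i++)
                registers[i] = Math.max(registers[i], other.registers[i]);
        }
    }

    public static void main(String[] args) {
        MiniHll[] minutes = new MiniHll[60];
        for (int i = 0; i < 60; i++) minutes[i] = new MiniHll();

        // Simulate traffic: each visit lands in its minute bucket
        minutes[0].add("alice");
        minutes[0].add("bob");
        minutes[59].add("alice");                  // repeat visitor

        // Hour view = merge of its minute sketches; a day view would
        // merge hour sketches the same way
        MiniHll hour = new MiniHll();
        for (MiniHll minute : minutes) hour.merge(minute);
    }
}
```

Because merge is just a register-wise max, the hour and day views cost nothing extra per event: they are built on demand from the sketches already collected.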

5. Network Traffic Analysis

// Count unique source IPs
HyperLogLog uniqueSources = new HyperLogLog();

void processPacket(Packet packet) {
    uniqueSources.add(packet.getSourceIp());

    if (uniqueSources.count() > THRESHOLD) {
        alert("Possible DDoS - too many unique IPs");
    }
}

HyperLogLog Operations

Merge (Union)

/** Count distinct elements in either set */
static HyperLogLog union(HyperLogLog hll1, HyperLogLog hll2) {
    if (hll1.getP() != hll2.getP()) {
        throw new IllegalArgumentException("Precision mismatch");
    }
    HyperLogLog result = new HyperLogLog(hll1.getP());
    for (int i = 0; i < result.getM(); i++) {
        result.getRegisters()[i] = Math.max(
            hll1.getRegisters()[i], hll2.getRegisters()[i]
        );
    }
    return result;
}

// |A ∪ B|

Intersection (Approximate)

/** Estimate |A ∩ B| using inclusion-exclusion */
static long intersectionEstimate(HyperLogLog hll1, HyperLogLog hll2, HyperLogLog hllUnion) {
    // |A ∩ B| = |A| + |B| - |A ∪ B|; clamp at 0 because estimation
    // error can push the difference negative
    return Math.max(0, hll1.count() + hll2.count() - hllUnion.count());
}

Note: Intersection is less accurate than union.


HyperLogLog vs Alternatives

Method           Space         Error                      Operations
---------------  ------------  -------------------------  -----------------
Exact (HashSet)  O(n)          0%                         Add, Count
Linear Counting  O(n)          Low                        Add, Count
LogLog           O(log log n)  ~1.30/√m                   Add, Count
HyperLogLog      O(log log n)  ~1.04/√m                   Add, Count, Merge
HyperLogLog++    O(log log n)  Better at low cardinality  Add, Count, Merge

Evolution

Linear Counting → LogLog → SuperLogLog → HyperLogLog → HyperLogLog++
(Most space)                                          (Best accuracy)

Common Interview Questions

1. How does HyperLogLog achieve O(log log n) space?

- We're not storing elements, just "max leading zeros seen"
- Leading zeros ~ log(cardinality)
- We need log(log(n)) bits to store count of leading zeros
- With m registers: O(m × log log n) bits

2. Why use harmonic mean instead of arithmetic mean?

Harmonic mean is more robust to outliers.

If one register has an unusually high value (rare event),
arithmetic mean would overestimate significantly.
Harmonic mean (via 2^-r) diminishes the impact of outliers.
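A quick numeric comparison makes the difference concrete. Suppose nine registers agree and one holds an outlier from a rare long zero-run:

```java
public class MeanComparison {
    public static void main(String[] args) {
        // Nine registers agree (value 5); one saw a rare long zero-run (value 20)
        int[] registers = {5, 5, 5, 5, 5, 5, 5, 5, 5, 20};
        int m = registers.length;

        double arithmetic = 0, indicator = 0;
        for (int r : registers) {
            arithmetic += Math.pow(2, r);   // arithmetic mean of 2^r
            indicator += Math.pow(2, -r);   // harmonic-mean denominator
        }
        arithmetic /= m;
        double harmonic = m / indicator;

        // The outlier drags the arithmetic mean to ~10^5, while the
        // harmonic mean stays near 2^5 = 32
        System.out.printf("arithmetic: %.1f, harmonic: %.1f%n", arithmetic, harmonic);
    }
}
```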

3. What is the bias correction for?

Raw HLL estimate has systematic bias:
- Small cardinalities: Use Linear Counting (count empty registers)
- Large cardinalities: Adjust for hash collision probability
- Medium range: Use standard HLL estimate

This is the "HyperLogLog++" improvement.

4. How accurate is the intersection estimate?

Less accurate than union because:
|A ∩ B| = |A| + |B| - |A ∪ B|

Errors compound:
- If each has 2% error
- Intersection could have 6% error

For small intersections, relative error is very high.
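A worked example with hypothetical numbers shows how badly this can go. Take |A| = |B| = 100 and |A ∪ B| = 190, so the true intersection is 10:

```java
public class IntersectionError {
    public static void main(String[] args) {
        // Hypothetical true values: |A| = |B| = 100, |A ∪ B| = 190, |A ∩ B| = 10
        double a = 100, b = 100, union = 190;
        double err = 0.02;                  // 2% per-sketch error
        double trueInter = a + b - union;

        // Worst case: A and B overestimated, union underestimated,
        // so the errors all add through the subtraction
        double worst = a * (1 + err) + b * (1 + err) - union * (1 - err);
        System.out.printf("true = %.0f, worst-case estimate = %.1f (%.0f%% relative error)%n",
                trueInter, worst, (worst - trueInter) / trueInter * 100);
    }
}
```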

5. Can HyperLogLog handle deletions?

No! Like Bloom filters, HLL only supports:
- Add element
- Estimate count
- Merge with another HLL

To "delete": Create new HLL without the element

Real-World Implementations

System         Usage
-------------  -----------------------------------
Redis          PFADD, PFCOUNT, PFMERGE commands
PostgreSQL     postgresql-hll extension (hll type)
BigQuery       APPROX_COUNT_DISTINCT()
Presto         approx_distinct()
Spark          approx_count_distinct()
Druid          hyperUnique aggregator
Elasticsearch  cardinality aggregation

Redis Example

# Add elements
PFADD visitors user1 user2 user3

# Get count
PFCOUNT visitors
# Returns: 3

# Merge multiple HLLs
PFMERGE total day1 day2 day3
PFCOUNT total

HyperLogLog++ Improvements

Google's improvements over standard HLL:

1. Sparse Representation

For small cardinalities:
- Store (register_index, value) pairs
- Converts to dense when threshold exceeded
- Saves memory for small sets
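The sparse-to-dense switch can be sketched roughly as below. The class, method names, and the conversion threshold are invented for illustration; HLL++ uses a more compact pair encoding and a different cutoff.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseRegisters {
    private Map<Integer, Integer> sparse = new HashMap<>(); // (index, value) pairs
    private int[] dense;                 // null while still sparse
    private final int m;
    private final int threshold;

    public SparseRegisters(int m) {
        this.m = m;
        this.threshold = m / 4;          // illustrative cutoff, not HLL++'s
    }

    public void update(int index, int rank) {
        if (dense != null) {
            dense[index] = Math.max(dense[index], rank);
            return;
        }
        sparse.merge(index, rank, Math::max);
        if (sparse.size() > threshold) { // too many pairs: convert to dense
            dense = new int[m];
            sparse.forEach((i, r) -> dense[i] = r);
            sparse = null;
        }
    }

    public int get(int index) {
        return dense != null ? dense[index] : sparse.getOrDefault(index, 0);
    }

    public static void main(String[] args) {
        SparseRegisters sr = new SparseRegisters(1 << 14);
        sr.update(7, 3);
        System.out.println("r[7] = " + sr.get(7)); // still sparse: one pair stored
    }
}
```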

2. Bias Correction

- Empirically derived correction values
- Better accuracy for small to medium cardinalities
- Lookup table based on raw estimate

3. 64-bit Hashing

- Standard HLL uses 32-bit hashes
- HLL++ uses 64-bit hashes
- Better for very large cardinalities

Comparison Summary

Cardinality    Exact (HashSet)  HyperLogLog
-------------  ---------------  -----------
1,000          ~40 KB           12 KB
100,000        ~4 MB            12 KB
10,000,000     ~400 MB          12 KB
1,000,000,000  ~40 GB           12 KB

HyperLogLog memory is constant regardless of cardinality!


Key Takeaways

  1. Count distinct - Estimates unique elements in a set/stream
  2. Constant memory - ~12 KB for any cardinality (with p=14)
  3. ~0.81% error - Typical with standard precision
  4. Mergeable - Perfect for distributed systems
  5. No deletions - Add-only data structure
  6. Leading zeros insight - Rare patterns indicate more elements
  7. Used everywhere - Redis, databases, analytics systems