
Count-Min Sketch

Introduction

Count-Min Sketch is a probabilistic data structure used to estimate the frequency of elements in a data stream. It answers the question: "How many times have I seen element X?"

Key Properties

Property              Description
--------------------  ------------------------------------------------
Space efficient       Sub-linear space: O(ε⁻¹ × log(δ⁻¹))
Fast                  O(d) for update and query (d = depth)
Never underestimates  Actual count ≤ estimated count
May overestimate      Due to hash collisions
Supports increments   Can add arbitrary counts (not just 1)
Mergeable             Sketches from distributed systems can be combined

Use Cases

  • Heavy hitters detection - Find most frequent items
  • Network traffic monitoring - Count packets per IP
  • Click stream analysis - Popular URLs/products
  • NLP - Word frequency estimation
  • Approximate query processing - Database aggregations

How It Works

Structure

  • 2D array of counters: d rows × w columns
  • d independent hash functions (one per row)
  • w = width (more width = more accuracy)
  • d = depth (more depth = lower error probability)
Hash Functions: h1, h2, h3, ... hd

         Column 0   1   2   3   4   5   6   7   ...   w-1
Row 0    [  0  |  0  |  5  |  0  |  3  |  0  |  0  |  2  | ... ]  ← h1
Row 1    [  0  |  2  |  0  |  0  |  0  |  4  |  0  |  0  | ... ]  ← h2
Row 2    [  1  |  0  |  0  |  3  |  0  |  0  |  0  |  0  | ... ]  ← h3
Row 3    [  0  |  0  |  0  |  0  |  2  |  0  |  1  |  0  | ... ]  ← h4
Row d-1  [  0  |  0  |  0  |  0  |  0  |  0  |  3  |  0  | ... ]  ← hd

Update Operation (Add element x with count c)

For each row i from 0 to d-1:
    column = hash_i(x) % w
    counters[i][column] += c

Query Operation (Estimate count of x)

min_count = infinity
For each row i from 0 to d-1:
    column = hash_i(x) % w
    min_count = min(min_count, counters[i][column])
return min_count

Why minimum? Each counter may be inflated by collisions. Taking minimum gives the least inflated estimate.


Visual Example

Stream: [A, B, A, C, A, B, D, A, C, A]

Insert A (appears 5 times):
h1(A) = 2, h2(A) = 5, h3(A) = 1

After all insertions (A×5, B×2, C×2, D×1):
         0   1   2   3   4   5   6   7
Row 0 [  0 | 0 | 5 | 0 | 0 | 4 | 1 | 0 ]  h1: A→2, B→5, C→5, D→6
Row 1 [  0 | 2 | 3 | 0 | 0 | 5 | 0 | 0 ]  h2: A→5, B→1, C→2, D→2
Row 2 [  0 | 5 | 3 | 0 | 2 | 0 | 0 | 0 ]  h3: A→1, B→2, C→4, D→2

Query A (true count 5):
- Row 0, Column 2: 5
- Row 1, Column 5: 5
- Row 2, Column 1: 5
- Estimate: min(5, 5, 5) = 5 ✓ Exact!

Query B (true count 2):
- Row 0, Column 5: 4 (C collides here!)
- Row 1, Column 1: 2
- Row 2, Column 2: 3 (D collides here!)
- Estimate: min(4, 2, 3) = 2 ✓ Exact, because row 1 is collision-free

Query E (never appeared):
- Row 0, Column 3: 0
- Estimate: 0 (could be > 0 if E collided with other items in every row)

Error Bounds

Parameters

w = ⌈e/ε⌉       (width, e ≈ 2.718)
d = ⌈ln(1/δ)⌉   (depth)

Where:
- ε = error factor (smaller = more accurate)
- δ = probability of exceeding error bound

Guarantee

True count ≤ Estimated count ≤ True count + ε × N

With probability ≥ 1 - δ

Where N = total count of all elements

Example

For ε = 0.01 and δ = 0.001:
- Width w = ⌈e/0.01⌉ = 272
- Depth d = ⌈ln(1000)⌉ = 7
- Space: 272 × 7 = 1,904 counters
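These numbers can be checked mechanically. A minimal sketch of the two formulas (the helper names `optimalWidth` and `optimalDepth` are mine, purely illustrative):

```java
public class CmsParams {
    // w = ceil(e / epsilon)
    static int optimalWidth(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    // d = ceil(ln(1 / delta))
    static int optimalDepth(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }

    public static void main(String[] args) {
        int w = optimalWidth(0.01);   // 272
        int d = optimalDepth(0.001);  // 7
        System.out.println(w + " x " + d + " = " + (w * d) + " counters");
    }
}
```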


Implementation

// Requires Guava on the classpath (com.google.common.hash)
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class CountMinSketch {
    private final int width;
    private final int depth;
    private final int[][] counters;

    public CountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counters = new int[depth][width];
    }

    private int hash(String item, int seed) {
        int h = Hashing.murmur3_32_fixed(seed)
                .hashString(item, StandardCharsets.UTF_8)
                .asInt();
        return Math.floorMod(h, width);
    }

    /** Add item with given count */
    public void add(String item, int count) {
        for (int i = 0; i < depth; i++) {
            int col = hash(item, i);
            counters[i][col] += count;
        }
    }

    public void add(String item) {
        add(item, 1);
    }

    /** Estimate frequency of item */
    public int estimate(String item) {
        int minCount = Integer.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            int col = hash(item, i);
            minCount = Math.min(minCount, counters[i][col]);
        }
        return minCount;
    }

    /** Merge another sketch into this one */
    public void merge(CountMinSketch other) {
        if (this.width != other.width || this.depth != other.depth) {
            throw new IllegalArgumentException("Sketches must have same dimensions");
        }
        for (int i = 0; i < depth; i++) {
            for (int j = 0; j < width; j++) {
                this.counters[i][j] += other.counters[i][j];
            }
        }
    }

    /** Create with optimal parameters */
    public static CountMinSketch createOptimal(double epsilon, double delta) {
        int width = (int) Math.ceil(Math.E / epsilon);
        int depth = (int) Math.ceil(Math.log(1.0 / delta));
        return new CountMinSketch(width, depth);
    }

    public int getWidth() { return width; }
    public int getDepth() { return depth; }
    public int[][] getCounters() { return counters; }

    // Usage
    public static void main(String[] args) {
        CountMinSketch cms = CountMinSketch.createOptimal(0.001, 0.01);

        // Stream processing
        String[] stream = {"apple", "banana", "apple", "cherry", "apple", "banana"};
        for (String item : stream) {
            cms.add(item);
        }

        System.out.println(cms.estimate("apple"));   // ~3
        System.out.println(cms.estimate("banana"));  // ~2
        System.out.println(cms.estimate("grape"));   // ~0 (or small overestimate)
    }
}

Finding Heavy Hitters

Problem

Find all elements with frequency > threshold in a stream.

Solution: Count-Min Sketch + Heap

import java.util.*;
import java.util.stream.Collectors;

public class HeavyHitters {
    private final int k; // Number of heavy hitters to track
    private final CountMinSketch cms;
    private HashMap<String, Integer> candidates; // item -> estimated count

    public HeavyHitters(int k, double epsilon) {
        this.k = k;
        this.cms = CountMinSketch.createOptimal(epsilon, 0.01);
        this.candidates = new HashMap<>();
    }

    public void add(String item) {
        cms.add(item);
        int estimated = cms.estimate(item);
        candidates.put(item, estimated);

        // Prune if too many candidates
        if (candidates.size() > 10 * k) {
            prune();
        }
    }

    private void prune() {
        // Keep only top candidates
        candidates = candidates.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(2 * k)
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue,
                        (e1, e2) -> e1, HashMap::new));
    }

    public List<Map.Entry<String, Integer>> getHeavyHitters() {
        // Re-estimate and return top k
        for (String item : candidates.keySet()) {
            candidates.put(item, cms.estimate(item));
        }
        return candidates.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}

Count-Min Sketch vs Alternatives

Structure         Space              Query        Updates  Use Case
----------------  -----------------  -----------  -------  ------------------
Hash Map          O(n)               O(1) exact   O(1)     Small datasets
Count-Min Sketch  O(1/ε × log(1/δ))  O(d) approx  O(d)     Large streams
Count Sketch      O(1/ε²)            O(d) approx  O(d)     Unbiased estimates
Lossy Counting    O(1/ε)             O(1)         O(1)     Heavy hitters

Count-Min vs Count Sketch

Feature        Count-Min Sketch      Count Sketch
-------------  --------------------  -------------------------------------
Estimate bias  Always overestimates  Unbiased (can under- or overestimate)
Space          O(1/ε)                O(1/ε²)
Better for     Point queries         Heavy hitters with negative counts
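For contrast, a minimal Count Sketch in the same style: each row also has a sign hash sᵢ(x) ∈ {−1, +1}, and the query takes the median of signed counters rather than the minimum. The hashes here are toy functions derived from hashCode, purely illustrative:

```java
import java.util.Arrays;

public class MiniCountSketch {
    private final int width, depth;
    private final int[][] counters;

    public MiniCountSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counters = new int[depth][width];
    }

    // Toy hashes; real implementations use independent hash families
    private int bucket(String item, int row) {
        return Math.floorMod(item.hashCode() * (31 * row + 17) + row, width);
    }

    private int sign(String item, int row) {
        return ((item.hashCode() >>> row) & 1) == 0 ? 1 : -1;
    }

    public void add(String item, int count) {
        for (int i = 0; i < depth; i++) {
            counters[i][bucket(item, i)] += sign(item, i) * count;
        }
    }

    public int estimate(String item) {
        int[] est = new int[depth];
        for (int i = 0; i < depth; i++) {
            est[i] = sign(item, i) * counters[i][bucket(item, i)];
        }
        Arrays.sort(est);
        return est[depth / 2];  // median (middle element for odd depth)
    }

    public static void main(String[] args) {
        MiniCountSketch cs = new MiniCountSketch(8, 3);
        cs.add("A", 5);
        // With a single item there are no collisions, so the estimate is exact
        System.out.println(cs.estimate("A"));  // 5
    }
}
```

Because colliding items cancel out in expectation (random ±1 signs), the estimate is unbiased, at the cost of the wider O(1/ε²) rows shown in the table above.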

Common Interview Questions

1. Why take minimum across rows?

Each counter may include counts from colliding elements.
Collisions only ADD to the count (never subtract).
Therefore:
- True count is always ≤ any single counter
- Minimum counter is closest to true count

2. Can Count-Min Sketch underestimate?

No! It only overestimates (or gives exact count).

Reason: Counters are only incremented, never decremented.
Any collision adds to the count, never subtracts.

3. How to handle deletions?

Option 1: Allow negative counts (loses the never-underestimate guarantee)
Option 2: Use Count Sketch (designed for signed updates)
Option 3: Periodically rebuild the sketch from retained data
          (conservative update reduces overestimation but does not
          itself support deletions)

4. What's the space complexity?

Space = width × depth × counter_size
      = O(e/ε) × O(ln(1/δ)) × O(log(max_count))
      = O(1/ε × log(1/δ)) (ignoring counter size)

For ε=0.001, δ=0.001:
≈ 2718 × 7 × 4 bytes = ~76 KB

5. How to use in distributed systems?

Count-Min Sketches are mergeable!

Node 1: CMS1 ─┐
Node 2: CMS2 ─┼──▶ Merge all counters element-wise ──▶ Combined CMS
Node 3: CMS3 ─┘

merge(CMS1, CMS2, CMS3):
    for each position (i, j):
        result[i][j] = CMS1[i][j] + CMS2[i][j] + CMS3[i][j]
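The element-wise merge above can be written directly. A minimal sketch (class name and array dimensions are illustrative):

```java
public class SketchMerge {
    // Element-wise sum of two counter arrays with identical dimensions
    static int[][] merge(int[][] a, int[][] b) {
        if (a.length != b.length || a[0].length != b[0].length) {
            throw new IllegalArgumentException("Sketches must have same dimensions");
        }
        int[][] result = new int[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                result[i][j] = a[i][j] + b[i][j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two tiny 2x4 sketches from different nodes
        int[][] node1 = {{1, 0, 2, 0}, {0, 3, 0, 0}};
        int[][] node2 = {{0, 1, 2, 0}, {4, 0, 0, 1}};
        int[][] combined = merge(node1, node2);
        System.out.println(java.util.Arrays.deepToString(combined));
        // [[1, 1, 4, 0], [4, 3, 0, 1]]
    }
}
```

This works because all nodes must use the same width, depth, and hash functions; the merged sketch is then identical to one built from the concatenated streams.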

Real-World Applications

1. Network Traffic Monitoring

// Count packets per source IP
CountMinSketch packetCounter = new CountMinSketch(10000, 5);

public void processPacket(Packet packet) {
    String sourceIp = packet.getSourceIp();
    packetCounter.add(sourceIp);

    // Detect potential DDoS sources (THRESHOLD is application-defined)
    if (packetCounter.estimate(sourceIp) > THRESHOLD) {
        flagSuspicious(sourceIp);
    }
}

2. Database Query Optimization

Query: SELECT COUNT(*) FROM logs WHERE user_id = ?

Without CMS: Full table scan
With CMS: Approximate count in O(1)

3. Trending Topics Detection

public class TrendingTopics {
    private CountMinSketch currentWindow;
    private CountMinSketch previousWindow;

    public TrendingTopics() {
        this.currentWindow = new CountMinSketch(1000, 5);
        this.previousWindow = new CountMinSketch(1000, 5);
    }

    public void addTopic(String topic) {
        currentWindow.add(topic);
    }

    /** Swap windows at the end of each interval (otherwise previousWindow stays empty) */
    public void rotateWindow() {
        previousWindow = currentWindow;
        currentWindow = new CountMinSketch(1000, 5);
    }

    public int getTrendingScore(String topic) {
        int current = currentWindow.estimate(topic);
        int previous = previousWindow.estimate(topic);
        return current - previous; // Growth rate
    }
}

4. Cache Admission Policy

TinyLFU: Use CMS to track access frequency
Only admit items to cache if CMS count > threshold
Prevents cache pollution from one-time accesses
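A minimal sketch of the admission decision. Note the exact HashMap counter here is a stand-in for the CMS that TinyLFU actually uses, and the class and method names are mine:

```java
import java.util.HashMap;
import java.util.Map;

public class AdmissionPolicy {
    // Stand-in frequency table; TinyLFU uses a Count-Min Sketch here
    private final Map<String, Integer> freq = new HashMap<>();

    public void recordAccess(String key) {
        freq.merge(key, 1, Integer::sum);
    }

    /** Admit the candidate only if it is accessed more often than the eviction victim. */
    public boolean admit(String candidate, String victim) {
        return freq.getOrDefault(candidate, 0) > freq.getOrDefault(victim, 0);
    }

    public static void main(String[] args) {
        AdmissionPolicy policy = new AdmissionPolicy();
        for (int i = 0; i < 3; i++) policy.recordAccess("hot");
        policy.recordAccess("one-off");
        System.out.println(policy.admit("hot", "one-off"));  // true
        System.out.println(policy.admit("one-off", "hot"));  // false
    }
}
```

A one-time access never displaces a frequently accessed entry, which is exactly the cache-pollution protection described above.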

Optimizations

Conservative Update

Reduce overestimation:

public void addConservative(String item, int count) {
    // Every counter for this item must reach at least
    // (current estimate + count), otherwise a later query
    // could underestimate when count > 1
    int target = estimate(item) + count;

    for (int i = 0; i < depth; i++) {
        int col = hash(item, i);
        counters[i][col] = Math.max(counters[i][col], target);
    }
}
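A self-contained comparison of standard vs. conservative update (toy hash, small dimensions, all names mine). The two checked properties are the known guarantees of the max-based formulation: the conservative estimate never exceeds the standard one, and neither underestimates on insert-only streams:

```java
import java.util.HashMap;
import java.util.Map;

public class ConservativeDemo {
    final int width = 8, depth = 3;
    final int[][] std = new int[depth][width];   // standard update counters
    final int[][] cons = new int[depth][width];  // conservative update counters

    // Toy seeded hash, purely illustrative
    int hash(String item, int row) {
        return Math.floorMod(item.hashCode() * (31 * row + 17) + row, width);
    }

    int estimate(int[][] counters, String item) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counters[i][hash(item, i)]);
        }
        return min;
    }

    void add(String item, int count) {
        // Standard update: every row's counter is incremented
        for (int i = 0; i < depth; i++) {
            std[i][hash(item, i)] += count;
        }
        // Conservative update: raise each counter only as far as needed
        int target = estimate(cons, item) + count;
        for (int i = 0; i < depth; i++) {
            int col = hash(item, i);
            cons[i][col] = Math.max(cons[i][col], target);
        }
    }

    public static void main(String[] args) {
        ConservativeDemo demo = new ConservativeDemo();
        String[] stream = {"A", "B", "A", "C", "A", "B", "D", "A", "C", "A"};
        Map<String, Integer> exact = new HashMap<>();
        for (String s : stream) {
            demo.add(s, 1);
            exact.merge(s, 1, Integer::sum);
        }
        // true count <= conservative estimate <= standard estimate
        for (Map.Entry<String, Integer> e : exact.entrySet()) {
            System.out.println(e.getKey() + ": true=" + e.getValue()
                    + " conservative=" + demo.estimate(demo.cons, e.getKey())
                    + " standard=" + demo.estimate(demo.std, e.getKey()));
        }
    }
}
```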

Time-Decayed Counting

Weight recent events more:

// LAMBDA (decay rate) and currentTime are assumed fields of the enclosing class
public void addWithDecay(String item, long timestamp) {
    double decayFactor = Math.exp(-LAMBDA * (currentTime - timestamp));
    add(item, (int) Math.round(decayFactor));
}


Key Takeaways

  1. Approximate frequency estimation - Trade accuracy for space
  2. Never underestimates - Always safe for "at least N" queries
  3. May overestimate - Due to hash collisions
  4. Sublinear space - Fixed size regardless of stream length
  5. Mergeable - Perfect for distributed systems
  6. Fast operations - O(d) where d is typically 5-10
  7. Best for - High-volume streams where exact counts aren't needed