Count-Min Sketch¶
Introduction¶
Count-Min Sketch is a probabilistic data structure used to estimate the frequency of elements in a data stream. It answers the question: "How many times have I seen element X?"
Key Properties¶
| Property | Description |
|---|---|
| Space efficient | Sub-linear space O(ε⁻¹ × log(δ⁻¹)) |
| Fast | O(d) for update and query (d = depth) |
| Never underestimates | Actual count ≤ Estimated count |
| May overestimate | Due to hash collisions |
| Supports increment | Can add counts (not just 1) |
| Mergeable | Can combine sketches from distributed systems |
Use Cases¶
- Heavy hitters detection - Find most frequent items
- Network traffic monitoring - Count packets per IP
- Click stream analysis - Popular URLs/products
- NLP - Word frequency estimation
- Approximate query processing - Database aggregations
How It Works¶
Structure¶
- 2D array of counters: d rows × w columns
- d independent hash functions (one per row)
- w = width (more width = more accuracy)
- d = depth (more depth = lower error probability)
Hash Functions: h1, h2, h3, ... hd
Column 0 1 2 3 4 5 6 7 ... w-1
Row 0 [ 0 | 0 | 5 | 0 | 3 | 0 | 0 | 2 | ... ] ← h1
Row 1 [ 0 | 2 | 0 | 0 | 0 | 4 | 0 | 0 | ... ] ← h2
Row 2 [ 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | ... ] ← h3
Row 3 [ 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | ... ] ← h4
⋮
Row d-1 [ 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | ... ] ← hd
Update Operation (Add element x with count c)¶
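The update mirrors the query below, but increments each row's counter instead of taking the minimum:
For each row i from 0 to d-1:
    column = hash_i(x) % w
    counters[i][column] += c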
Query Operation (Estimate count of x)¶
min_count = infinity
For each row i from 0 to d-1:
column = hash_i(x) % w
min_count = min(min_count, counters[i][column])
return min_count
Why minimum? Each counter may be inflated by collisions but never deflated, so the minimum across rows is the least inflated estimate and still an upper bound on the true count.
Visual Example¶
Stream: [A, B, A, C, A, B, D, A, C, A]
Insert A (appears 5 times):
h1(A) = 2, h2(A) = 5, h3(A) = 1
After all insertions:
0 1 2 3 4 5 6 7
Row 0 [ 0 | 0 | 5 | 0 | 0 | 4 | 1 | 0 ] h1: A→2, B→5, C→5, D→6
Row 1 [ 0 | 2 | 3 | 0 | 0 | 5 | 0 | 0 ] h2: A→5, B→1, C→2, D→2
Row 2 [ 0 | 5 | 3 | 0 | 2 | 0 | 0 | 0 ] h3: A→1, B→2, C→4, D→2
Query A:
- Row 0, Column 2: 5
- Row 1, Column 5: 5
- Row 2, Column 1: 5
- Estimate: min(5, 5, 5) = 5 ✓ Exact!
Query B (appears 2 times):
- Row 0, Column 5: 4 (B and C collide here, so this counter is inflated)
- Row 1, Column 1: 2
- Row 2, Column 2: 3 (B and D collide here)
- Estimate: min(4, 2, 3) = 2 ✓ Exact!
Query E (never appeared):
- Row 0, Column 3: 0 (once any row reads 0, the minimum is 0)
- Estimate: 0 (a non-member can still get a small overestimate if it collides in every row)
Error Bounds¶
Parameters¶
w = ⌈e/ε⌉ (width, e ≈ 2.718)
d = ⌈ln(1/δ)⌉ (depth)
Where:
- ε = error bound, expressed as a fraction of the total count N (smaller = more accurate)
- δ = probability that an estimate exceeds the error bound
Guarantee¶
True count ≤ Estimated count ≤ True count + ε × N
With probability ≥ 1 - δ
Where N = total count of all elements
Example¶
For ε = 0.01 and δ = 0.001:
- Width w = ⌈e/0.01⌉ = 272
- Depth d = ⌈ln(1000)⌉ = 7
- Space: 272 × 7 = 1,904 counters
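Plugging the same values into the createOptimal factory from the implementation below reproduces these dimensions:
CountMinSketch cms = CountMinSketch.createOptimal(0.01, 0.001);
System.out.println(cms.getWidth()); // 272
System.out.println(cms.getDepth()); // 7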
Implementation¶
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
public class CountMinSketch {
private final int width;
private final int depth;
private final int[][] counters;
public CountMinSketch(int width, int depth) {
this.width = width;
this.depth = depth;
this.counters = new int[depth][width];
}
private int hash(String item, int seed) {
int h = Hashing.murmur3_32_fixed(seed)
.hashString(item, StandardCharsets.UTF_8)
.asInt();
return Math.floorMod(h, width);
}
/** Add item with given count */
public void add(String item, int count) {
for (int i = 0; i < depth; i++) {
int col = hash(item, i);
counters[i][col] += count;
}
}
public void add(String item) {
add(item, 1);
}
/** Estimate frequency of item */
public int estimate(String item) {
int minCount = Integer.MAX_VALUE;
for (int i = 0; i < depth; i++) {
int col = hash(item, i);
minCount = Math.min(minCount, counters[i][col]);
}
return minCount;
}
/** Merge another sketch into this one */
public void merge(CountMinSketch other) {
if (this.width != other.width || this.depth != other.depth) {
throw new IllegalArgumentException("Sketches must have same dimensions");
}
for (int i = 0; i < depth; i++) {
for (int j = 0; j < width; j++) {
this.counters[i][j] += other.counters[i][j];
}
}
}
/** Create with optimal parameters */
public static CountMinSketch createOptimal(double epsilon, double delta) {
int width = (int) Math.ceil(Math.E / epsilon);
int depth = (int) Math.ceil(Math.log(1.0 / delta));
return new CountMinSketch(width, depth);
}
public int getWidth() { return width; }
public int getDepth() { return depth; }
public int[][] getCounters() { return counters; }
// Usage
public static void main(String[] args) {
CountMinSketch cms = CountMinSketch.createOptimal(0.001, 0.01);
// Stream processing
String[] stream = {"apple", "banana", "apple", "cherry", "apple", "banana"};
for (String item : stream) {
cms.add(item);
}
System.out.println(cms.estimate("apple")); // ~3
System.out.println(cms.estimate("banana")); // ~2
System.out.println(cms.estimate("grape")); // ~0 (or small overestimate)
}
}
Finding Heavy Hitters¶
Problem¶
Find all elements with frequency > threshold in a stream.
Solution: Count-Min Sketch + Heap¶
import java.util.*;
import java.util.stream.Collectors;
public class HeavyHitters {
private final int k; // Number of heavy hitters to track
private final CountMinSketch cms;
private HashMap<String, Integer> candidates; // item -> estimated count
public HeavyHitters(int k, double epsilon) {
this.k = k;
this.cms = CountMinSketch.createOptimal(epsilon, 0.01);
this.candidates = new HashMap<>();
}
public void add(String item) {
cms.add(item);
int estimated = cms.estimate(item);
candidates.put(item, estimated);
// Prune if too many candidates
if (candidates.size() > 10 * k) {
prune();
}
}
private void prune() {
// Keep only top candidates
candidates = candidates.entrySet().stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
.limit(2 * k)
.collect(Collectors.toMap(
Map.Entry::getKey, Map.Entry::getValue,
(e1, e2) -> e1, HashMap::new));
}
public List<Map.Entry<String, Integer>> getHeavyHitters() {
// Re-estimate and return top k
for (String item : candidates.keySet()) {
candidates.put(item, cms.estimate(item));
}
return candidates.entrySet().stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
.limit(k)
.collect(Collectors.toList());
}
}
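A quick usage sketch of the class above, reusing the fruit stream from the CMS example (values are approximate, since estimates come from the sketch):
HeavyHitters hh = new HeavyHitters(2, 0.01);
String[] stream = {"apple", "banana", "apple", "cherry", "apple", "banana"};
for (String item : stream) {
    hh.add(item);
}
// Top-2 by estimated count, e.g. apple ≈ 3 then banana ≈ 2
hh.getHeavyHitters().forEach(e ->
    System.out.println(e.getKey() + " -> " + e.getValue()));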
Count-Min Sketch vs Alternatives¶
| Structure | Space | Query | Updates | Use Case |
|---|---|---|---|---|
| Hash Map | O(n) | O(1) exact | O(1) | Small datasets |
| Count-Min Sketch | O(1/ε × log(1/δ)) | O(d) approx | O(d) | Large streams |
| Count Sketch | O(1/ε²) | O(d) approx | O(d) | Unbiased estimates |
| Lossy Counting | O(1/ε) | O(1) | O(1) | Heavy hitters |
Count-Min vs Count Sketch¶
| Feature | Count-Min Sketch | Count Sketch |
|---|---|---|
| Estimate bias | Always overestimates | Unbiased (can under/over) |
| Space | O(1/ε) | O(1/ε²) |
| Better for | Point queries | Heavy hitters with negative counts |
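For contrast, a minimal Count Sketch showing where the ±1 sign hash and the median come in. This is illustrative only; it reuses the same murmur3 hashing idea as the CMS class above, with a second seed range for the sign hash.
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
public class CountSketch {
    private final int width;
    private final int depth;
    private final long[][] counters;
    public CountSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counters = new long[depth][width];
    }
    private int bucket(String item, int row) {
        int h = Hashing.murmur3_32_fixed(row)
                .hashString(item, StandardCharsets.UTF_8).asInt();
        return Math.floorMod(h, width);
    }
    // ±1 sign hash, seeded outside the bucket-hash seed range
    private int sign(String item, int row) {
        int h = Hashing.murmur3_32_fixed(depth + row)
                .hashString(item, StandardCharsets.UTF_8).asInt();
        return (h & 1) == 0 ? 1 : -1;
    }
    public void add(String item, long count) {
        for (int i = 0; i < depth; i++) {
            counters[i][bucket(item, i)] += sign(item, i) * count;
        }
    }
    /** Median of the signed per-row readings; colliding items cancel in expectation. */
    public long estimate(String item) {
        long[] perRow = new long[depth];
        for (int i = 0; i < depth; i++) {
            perRow[i] = sign(item, i) * counters[i][bucket(item, i)];
        }
        Arrays.sort(perRow);
        return perRow[depth / 2];
    }
}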
Common Interview Questions¶
1. Why take minimum across rows?¶
Each counter may include counts from colliding elements.
Collisions only ADD to the count (never subtract).
Therefore:
- The true count of x is ≤ every counter that x hashes to
- So the minimum of those counters is the closest (and still safe) estimate
2. Can Count-Min Sketch underestimate?¶
No! It only overestimates (or gives exact count).
Reason: Counters are only incremented, never decremented.
Any collision adds to the count, never subtracts.
3. How to handle deletions?¶
Option 1: Allow negative updates (not standard CMS; the never-underestimates guarantee is lost)
Option 2: Use Count Sketch, which is designed for signed updates
Option 3: Use Conservative Update to reduce overestimation (it doesn't delete, but it limits inflation)
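A minimal sketch of Option 1 against the CountMinSketch class above (note the caveat: after decrements, estimates may undershoot the true count):
// Decrement-capable variant: reuses add() with a negative count.
// This breaks the never-underestimates guarantee discussed above.
public void remove(String item, int count) {
    add(item, -count);
}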
4. What's the space complexity?¶
Space = width × depth × counter_size
= O(e/ε) × O(ln(1/δ)) × O(log(max_count))
= O(1/ε × log(1/δ)) (ignoring counter size)
For ε=0.001, δ=0.001:
≈ 2718 × 7 × 4 bytes = ~76 KB
5. How to use in distributed systems?¶
Count-Min Sketches are mergeable!
Node 1: CMS1 ─┐
Node 2: CMS2 ─┼──▶ Merge all counters element-wise ──▶ Combined CMS
Node 3: CMS3 ─┘
merge(CMS1, CMS2, CMS3):
for each position (i, j):
result[i][j] = CMS1[i][j] + CMS2[i][j] + CMS3[i][j]
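With the Java class above, the coordinator just folds the per-node sketches together, assuming cms1, cms2, and cms3 were all created with identical width and depth:
CountMinSketch combined = new CountMinSketch(cms1.getWidth(), cms1.getDepth());
combined.merge(cms1);
combined.merge(cms2);
combined.merge(cms3);
int globalCount = combined.estimate("some-item"); // ≥ the true global count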
Real-World Applications¶
1. Network Traffic Monitoring¶
// Count packets per source IP
CountMinSketch packetCounter = new CountMinSketch(10000, 5);
public void processPacket(Packet packet) {
String sourceIp = packet.getSourceIp();
packetCounter.add(sourceIp);
// Detect DDoS
if (packetCounter.estimate(sourceIp) > THRESHOLD) {
flagSuspicious(sourceIp);
}
}
2. Database Query Optimization¶
Query: SELECT COUNT(*) FROM logs WHERE user_id = ?
Without CMS: scan the table (or an index) and count matching rows
With CMS: approximate count in O(d) time, independent of table size
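A hedged sketch of how an ingest path might keep such a sketch in step with the table (the method and field names here are hypothetical):
CountMinSketch userIdCounts = CountMinSketch.createOptimal(0.001, 0.01);
// Called for every row inserted into logs
public void onLogInsert(String userId) {
    userIdCounts.add(userId);
}
// Approximate answer for: SELECT COUNT(*) FROM logs WHERE user_id = ?
public int approximateCount(String userId) {
    return userIdCounts.estimate(userId); // never below the true count
}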
3. Trending Topics Detection¶
public class TrendingTopics {
private CountMinSketch currentWindow;
private CountMinSketch previousWindow;
public TrendingTopics() {
this.currentWindow = new CountMinSketch(1000, 5);
this.previousWindow = new CountMinSketch(1000, 5);
}
public void addTopic(String topic) {
currentWindow.add(topic);
}
public int getTrendingScore(String topic) {
int current = currentWindow.estimate(topic);
int previous = previousWindow.estimate(topic);
return current - previous; // Growth rate
}
}
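One piece the class above leaves implicit is how the windows advance; a minimal rotation hook, assumed to be called by a scheduler at the end of each window:
public void rotateWindow() {
    // The just-finished window becomes the baseline; start a fresh one.
    previousWindow = currentWindow;
    currentWindow = new CountMinSketch(1000, 5);
}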
4. Cache Admission Policy¶
TinyLFU: Use CMS to track access frequency
Only admit an item if its estimated frequency beats the would-be eviction victim's (or a fixed threshold)
Prevents cache pollution from one-time accesses
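A hedged sketch of that admission check (recordAccess/shouldAdmit are illustrative names, not any particular cache library's API):
private final CountMinSketch accessFrequency = CountMinSketch.createOptimal(0.001, 0.01);
public void recordAccess(String key) {
    accessFrequency.add(key);
}
// Admit only if the candidate is accessed more often than the entry it would evict;
// one-time accesses lose this comparison and never pollute the cache.
public boolean shouldAdmit(String candidateKey, String victimKey) {
    return accessFrequency.estimate(candidateKey) > accessFrequency.estimate(victimKey);
}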
Optimizations¶
Conservative Update¶
Reduce overestimation:
public void addConservative(String item, int count) {
    // Current minimum estimate, i.e. the tightest known upper bound before the update
    int minVal = estimate(item);
    int target = minVal + count;
    // Raise each counter only as far as needed, never past min + count;
    // this keeps the no-underestimate guarantee while adding less inflation.
    for (int i = 0; i < depth; i++) {
        int col = hash(item, i);
        counters[i][col] = Math.max(counters[i][col], target);
    }
}
Time-Decayed Counting¶
Weight recent events more:
// LAMBDA (decay rate) and currentTime are assumed fields of the surrounding class.
// With integer counters the decayed weight rounds to 0 or 1, so a practical
// variant of this idea uses double[][] counters (or scales the weight up).
public void addWithDecay(String item, long timestamp) {
    double decayFactor = Math.exp(-LAMBDA * (currentTime - timestamp));
    add(item, (int) Math.round(decayFactor));
}
Key Takeaways¶
- Approximate frequency estimation - Trade accuracy for space
- Never underestimates - Always safe for "at least N" queries
- May overestimate - Due to hash collisions
- Sublinear space - Fixed size regardless of stream length
- Mergeable - Perfect for distributed systems
- Fast operations - O(d) where d is typically 5-10
- Best for - High-volume streams where exact counts aren't needed