Count-Min Sketch¶
Introduction¶
Count-Min Sketch is a probabilistic data structure used to estimate the frequency of elements in a data stream. It answers the question: "How many times have I seen element X?"
Key Properties¶
| Property | Description |
|---|---|
| Space efficient | Sub-linear space O(ε⁻¹ × log(δ⁻¹)) |
| Fast | O(d) for update and query (d = depth) |
| Never underestimates | Actual count ≤ Estimated count |
| May overestimate | Due to hash collisions |
| Supports increment | Can add counts (not just 1) |
| Mergeable | Can combine sketches from distributed systems |
Use Cases¶
- Heavy hitters detection - Find most frequent items
- Network traffic monitoring - Count packets per IP
- Click stream analysis - Popular URLs/products
- NLP - Word frequency estimation
- Approximate query processing - Database aggregations
How It Works¶
Structure¶
- 2D array of counters: d rows × w columns
- d independent hash functions (one per row)
- w = width (more width = more accuracy)
- d = depth (more depth = lower error probability)
Hash Functions: h1, h2, h3, ... hd
Column 0 1 2 3 4 5 6 7 ... w-1
Row 0 [ 0 | 0 | 5 | 0 | 3 | 0 | 0 | 2 | ... ] ← h1
Row 1 [ 0 | 2 | 0 | 0 | 0 | 4 | 0 | 0 | ... ] ← h2
Row 2 [ 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | ... ] ← h3
Row 3 [ 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | ... ] ← h4
⋮
Row d-1 [ 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | ... ] ← hd
Update Operation (Add element x with count c)¶
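The update mirrors the query below, but increments each row's counter instead of taking the minimum:
For each row i from 0 to d-1:
    column = hash_i(x) % w
    counters[i][column] += c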
Query Operation (Estimate count of x)¶
min_count = infinity
For each row i from 0 to d-1:
column = hash_i(x) % w
min_count = min(min_count, counters[i][column])
return min_count
Why minimum? Each counter may be inflated by collisions but never deflated, so the minimum across rows is the least inflated estimate and still an upper bound on the true count.
Visual Example¶
Stream: [A, B, A, C, A, B, D, A, C, A]
Insert A (appears 5 times):
h1(A) = 2, h2(A) = 5, h3(A) = 1
After all insertions:
0 1 2 3 4 5 6 7
Row 0 [ 0 | 0 | 5 | 0 | 0 | 4 | 1 | 0 ] h1: A→2, B→5, C→5, D→6
Row 1 [ 0 | 2 | 3 | 0 | 0 | 5 | 0 | 0 ] h2: A→5, B→1, C→2, D→2
Row 2 [ 0 | 5 | 3 | 0 | 2 | 0 | 0 | 0 ] h3: A→1, B→2, C→4, D→2
Query A:
- Row 0, Column 2: 5
- Row 1, Column 5: 5
- Row 2, Column 1: 5
- Estimate: min(5, 5, 5) = 5 ✓ Exact!
Query B (appears 2 times):
- Row 0, Column 5: 4 (B and C collide here, so this counter is inflated)
- Row 1, Column 1: 2
- Row 2, Column 2: 3 (B and D collide here)
- Estimate: min(4, 2, 3) = 2 ✓ Exact!
Query E (never appeared):
- Row 0, Column 3: 0 (once any row reads 0, the minimum is 0)
- Estimate: 0 (a non-member can still get a small overestimate if it collides in every row)
Error Bounds¶
Parameters¶
w = ⌈e/ε⌉ (width, e ≈ 2.718)
d = ⌈ln(1/δ)⌉ (depth)
Where:
- ε = error bound, expressed as a fraction of the total count N (smaller = more accurate)
- δ = probability that an estimate exceeds the error bound
Guarantee¶
True count ≤ Estimated count ≤ True count + ε × N
With probability ≥ 1 - δ
Where N = total count of all elements
Example¶
For ε = 0.01 and δ = 0.001:
- Width w = ⌈e/0.01⌉ = 272
- Depth d = ⌈ln(1000)⌉ = 7
- Space: 272 × 7 = 1,904 counters
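Plugging the same values into the createOptimal factory from the implementation below reproduces these dimensions:
CountMinSketch cms = CountMinSketch.createOptimal(0.01, 0.001);
System.out.println(cms.getWidth()); // 272
System.out.println(cms.getDepth()); // 7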
Implementation¶
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
public class CountMinSketch {
private final int width;
private final int depth;
private final int[][] counters;
public CountMinSketch(int width, int depth) {
this.width = width;
this.depth = depth;
this.counters = new int[depth][width];
}
private int hash(String item, int seed) {
int h = Hashing.murmur3_32_fixed(seed)
.hashString(item, StandardCharsets.UTF_8)
.asInt();
return Math.floorMod(h, width);
}
/** Add item with given count */
public void add(String item, int count) {
for (int i = 0; i < depth; i++) {
int col = hash(item, i);
counters[i][col] += count;
}
}
public void add(String item) {
add(item, 1);
}
/** Estimate frequency of item */
public int estimate(String item) {
int minCount = Integer.MAX_VALUE;
for (int i = 0; i < depth; i++) {
int col = hash(item, i);
minCount = Math.min(minCount, counters[i][col]);
}
return minCount;
}
/** Merge another sketch into this one */
public void merge(CountMinSketch other) {
if (this.width != other.width || this.depth != other.depth) {
throw new IllegalArgumentException("Sketches must have same dimensions");
}
for (int i = 0; i < depth; i++) {
for (int j = 0; j < width; j++) {
this.counters[i][j] += other.counters[i][j];
}
}
}
/** Create with optimal parameters */
public static CountMinSketch createOptimal(double epsilon, double delta) {
int width = (int) Math.ceil(Math.E / epsilon);
int depth = (int) Math.ceil(Math.log(1.0 / delta));
return new CountMinSketch(width, depth);
}
public int getWidth() { return width; }
public int getDepth() { return depth; }
public int[][] getCounters() { return counters; }
// Usage
public static void main(String[] args) {
CountMinSketch cms = CountMinSketch.createOptimal(0.001, 0.01);
// Stream processing
String[] stream = {"apple", "banana", "apple", "cherry", "apple", "banana"};
for (String item : stream) {
cms.add(item);
}
System.out.println(cms.estimate("apple")); // ~3
System.out.println(cms.estimate("banana")); // ~2
System.out.println(cms.estimate("grape")); // ~0 (or small overestimate)
}
}
Finding Heavy Hitters¶
Problem¶
Find all elements with frequency > threshold in a stream.
Solution: Count-Min Sketch + Heap¶
import java.util.*;
import java.util.stream.Collectors;
public class HeavyHitters {
private final int k; // Number of heavy hitters to track
private final CountMinSketch cms;
private HashMap<String, Integer> candidates; // item -> estimated count
public HeavyHitters(int k, double epsilon) {
this.k = k;
this.cms = CountMinSketch.createOptimal(epsilon, 0.01);
this.candidates = new HashMap<>();
}
public void add(String item) {
cms.add(item);
int estimated = cms.estimate(item);
candidates.put(item, estimated);
// Prune if too many candidates
if (candidates.size() > 10 * k) {
prune();
}
}
private void prune() {
// Keep only top candidates
candidates = candidates.entrySet().stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
.limit(2 * k)
.collect(Collectors.toMap(
Map.Entry::getKey, Map.Entry::getValue,
(e1, e2) -> e1, HashMap::new));
}
public List<Map.Entry<String, Integer>> getHeavyHitters() {
// Re-estimate and return top k
for (String item : candidates.keySet()) {
candidates.put(item, cms.estimate(item));
}
return candidates.entrySet().stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
.limit(k)
.collect(Collectors.toList());
}
}
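A quick usage sketch of the class above, reusing the fruit stream from the CMS example (values are approximate, since estimates come from the sketch):
HeavyHitters hh = new HeavyHitters(2, 0.01);
String[] stream = {"apple", "banana", "apple", "cherry", "apple", "banana"};
for (String item : stream) {
    hh.add(item);
}
// Top-2 by estimated count, e.g. apple ≈ 3 then banana ≈ 2
hh.getHeavyHitters().forEach(e ->
    System.out.println(e.getKey() + " -> " + e.getValue()));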
Count-Min Sketch vs Alternatives¶
| Structure | Space | Query | Updates | Use Case |
|---|---|---|---|---|
| Hash Map | O(n) | O(1) exact | O(1) | Small datasets |
| Count-Min Sketch | O(1/ε × log(1/δ)) | O(d) approx | O(d) | Large streams |
| Count Sketch | O(1/ε²) | O(d) approx | O(d) | Unbiased estimates |
| Lossy Counting | O(1/ε) | O(1) | O(1) | Heavy hitters |
Count-Min vs Count Sketch¶
| Feature | Count-Min Sketch | Count Sketch |
|---|---|---|
| Estimate bias | Always overestimates | Unbiased (can under/over) |
| Space | O(1/ε) | O(1/ε²) |
| Better for | Point queries | Heavy hitters with negative counts |
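For contrast, a minimal Count Sketch showing where the ±1 sign hash and the median come in. This is illustrative only; it reuses the same murmur3 hashing idea as the CMS class above, with a second seed range for the sign hash.
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
public class CountSketch {
    private final int width;
    private final int depth;
    private final long[][] counters;
    public CountSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counters = new long[depth][width];
    }
    private int bucket(String item, int row) {
        int h = Hashing.murmur3_32_fixed(row)
                .hashString(item, StandardCharsets.UTF_8).asInt();
        return Math.floorMod(h, width);
    }
    // ±1 sign hash, seeded outside the bucket-hash seed range
    private int sign(String item, int row) {
        int h = Hashing.murmur3_32_fixed(depth + row)
                .hashString(item, StandardCharsets.UTF_8).asInt();
        return (h & 1) == 0 ? 1 : -1;
    }
    public void add(String item, long count) {
        for (int i = 0; i < depth; i++) {
            counters[i][bucket(item, i)] += sign(item, i) * count;
        }
    }
    /** Median of the signed per-row readings; colliding items cancel in expectation. */
    public long estimate(String item) {
        long[] perRow = new long[depth];
        for (int i = 0; i < depth; i++) {
            perRow[i] = sign(item, i) * counters[i][bucket(item, i)];
        }
        Arrays.sort(perRow);
        return perRow[depth / 2];
    }
}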
Common Interview Questions¶
1. Why take minimum across rows?¶
Each counter may include counts from colliding elements.
Collisions only ADD to the count (never subtract).
Therefore:
- The true count of x is ≤ every counter that x hashes to
- So the minimum of those counters is the closest (and still safe) estimate
2. Can Count-Min Sketch underestimate?¶
No! It only overestimates (or gives exact count).
Reason: Counters are only incremented, never decremented.
Any collision adds to the count, never subtracts.
3. How to handle deletions?¶
Option 1: Allow negative updates (not standard CMS; the never-underestimates guarantee is lost)
Option 2: Use Count Sketch, which is designed for signed updates
Option 3: Use Conservative Update to reduce overestimation (it doesn't delete, but it limits inflation)
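A minimal sketch of Option 1 against the CountMinSketch class above (note the caveat: after decrements, estimates may undershoot the true count):
// Decrement-capable variant: reuses add() with a negative count.
// This breaks the never-underestimates guarantee discussed above.
public void remove(String item, int count) {
    add(item, -count);
}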
4. What's the space complexity?¶
Space = width × depth × counter_size
= O(e/ε) × O(ln(1/δ)) × O(log(max_count))
= O(1/ε × log(1/δ)) (ignoring counter size)
For ε=0.001, δ=0.001:
≈ 2718 × 7 × 4 bytes = ~76 KB
5. How to use in distributed systems?¶
Count-Min Sketches are mergeable!
Node 1: CMS1 ─┐
Node 2: CMS2 ─┼──▶ Merge all counters element-wise ──▶ Combined CMS
Node 3: CMS3 ─┘
merge(CMS1, CMS2, CMS3):
for each position (i, j):
result[i][j] = CMS1[i][j] + CMS2[i][j] + CMS3[i][j]
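With the Java class above, the coordinator just folds the per-node sketches together, assuming cms1, cms2, and cms3 were all created with identical width and depth:
CountMinSketch combined = new CountMinSketch(cms1.getWidth(), cms1.getDepth());
combined.merge(cms1);
combined.merge(cms2);
combined.merge(cms3);
int globalCount = combined.estimate("some-item"); // ≥ the true global count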
Real-World Applications¶
1. Network Traffic Monitoring¶
// Count packets per source IP
CountMinSketch packetCounter = new CountMinSketch(10000, 5);
public void processPacket(Packet packet) {
String sourceIp = packet.getSourceIp();
packetCounter.add(sourceIp);
// Detect DDoS
if (packetCounter.estimate(sourceIp) > THRESHOLD) {
flagSuspicious(sourceIp);
}
}
2. Database Query Optimization¶
Query: SELECT COUNT(*) FROM logs WHERE user_id = ?
Without CMS: scan the table (or an index) and count matching rows
With CMS: approximate count in O(d) time, independent of table size
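A hedged sketch of how an ingest path might keep such a sketch in step with the table (the method and field names here are hypothetical):
CountMinSketch userIdCounts = CountMinSketch.createOptimal(0.001, 0.01);
// Called for every row inserted into logs
public void onLogInsert(String userId) {
    userIdCounts.add(userId);
}
// Approximate answer for: SELECT COUNT(*) FROM logs WHERE user_id = ?
public int approximateCount(String userId) {
    return userIdCounts.estimate(userId); // never below the true count
}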
3. Trending Topics Detection¶
public class TrendingTopics {
private CountMinSketch currentWindow;
private CountMinSketch previousWindow;
public TrendingTopics() {
this.currentWindow = new CountMinSketch(1000, 5);
this.previousWindow = new CountMinSketch(1000, 5);
}
public void addTopic(String topic) {
currentWindow.add(topic);
}
public int getTrendingScore(String topic) {
int current = currentWindow.estimate(topic);
int previous = previousWindow.estimate(topic);
return current - previous; // Growth rate
}
}
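One piece the class above leaves implicit is how the windows advance; a minimal rotation hook, assumed to be called by a scheduler at the end of each window:
public void rotateWindow() {
    // The just-finished window becomes the baseline; start a fresh one.
    previousWindow = currentWindow;
    currentWindow = new CountMinSketch(1000, 5);
}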
4. Cache Admission Policy¶
TinyLFU: Use CMS to track access frequency
Only admit an item if its estimated frequency beats the would-be eviction victim's (or a fixed threshold)
Prevents cache pollution from one-time accesses
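A hedged sketch of that admission check (recordAccess/shouldAdmit are illustrative names, not any particular cache library's API):
private final CountMinSketch accessFrequency = CountMinSketch.createOptimal(0.001, 0.01);
public void recordAccess(String key) {
    accessFrequency.add(key);
}
// Admit only if the candidate is accessed more often than the entry it would evict;
// one-time accesses lose this comparison and never pollute the cache.
public boolean shouldAdmit(String candidateKey, String victimKey) {
    return accessFrequency.estimate(candidateKey) > accessFrequency.estimate(victimKey);
}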
Optimizations¶
Conservative Update¶
Reduce overestimation:
public void addConservative(String item, int count) {
    // Current minimum estimate, i.e. the tightest known upper bound before the update
    int minVal = estimate(item);
    int target = minVal + count;
    // Raise each counter only as far as needed, never past min + count;
    // this keeps the no-underestimate guarantee while adding less inflation.
    for (int i = 0; i < depth; i++) {
        int col = hash(item, i);
        counters[i][col] = Math.max(counters[i][col], target);
    }
}
Time-Decayed Counting¶
Weight recent events more:
// LAMBDA (decay rate) and currentTime are assumed fields of the surrounding class.
// With integer counters the decayed weight rounds to 0 or 1, so a practical
// variant of this idea uses double[][] counters (or scales the weight up).
public void addWithDecay(String item, long timestamp) {
    double decayFactor = Math.exp(-LAMBDA * (currentTime - timestamp));
    add(item, (int) Math.round(decayFactor));
}
Key Takeaways¶
- Approximate frequency estimation - Trade accuracy for space
- Never underestimates - Always safe for "at least N" queries
- May overestimate - Due to hash collisions
- Sublinear space - Fixed size regardless of stream length
- Mergeable - Perfect for distributed systems
- Fast operations - O(d) where d is typically 5-10
- Best for - High-volume streams where exact counts aren't needed