Bloom Filter¶
Introduction¶
A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can tell you: - "Definitely not in set" - 100% accurate - "Probably in set" - May have false positives
Key Properties¶
| Property | Description |
|---|---|
| Space efficient | Much smaller than storing actual elements |
| Fast | O(k) for insert and lookup (k = number of hash functions) |
| No false negatives | If it says "no", element is definitely not present |
| False positives possible | If it says "yes", element might not be present |
| No deletion | Standard Bloom filter doesn't support deletion |
How It Works¶
Structure¶
- Bit array of size
m(initialized to all 0s) kindependent hash functions
Insert Operation¶
For element x:
1. Compute k hash values: h1(x), h2(x), ..., hk(x)
2. Set bits at positions h1(x), h2(x), ..., hk(x) to 1
Lookup Operation¶
For element x:
1. Compute k hash values: h1(x), h2(x), ..., hk(x)
2. Check all k positions
3. If ALL bits are 1 → "Probably in set"
4. If ANY bit is 0 → "Definitely not in set"
Visual Example¶
False Positive Probability¶
Formula¶
P(false positive) ≈ (1 - e^(-kn/m))^k
Where:
- m = size of bit array
- n = number of elements inserted
- k = number of hash functions
Optimal Number of Hash Functions¶
Optimal Bit Array Size¶
Quick Reference Table¶
| False Positive Rate | Bits per Element | Hash Functions |
|---|---|---|
| 1% | 9.6 bits | 7 |
| 0.1% | 14.4 bits | 10 |
| 0.01% | 19.2 bits | 13 |
Rule of thumb: ~10 bits per element for 1% false positive rate
Implementation¶
Python Implementation¶
import mmh3 # MurmurHash3
from bitarray import bitarray
class BloomFilter:
def __init__(self, size: int, num_hashes: int):
self.size = size
self.num_hashes = num_hashes
self.bit_array = bitarray(size)
self.bit_array.setall(0)
def _get_hash_positions(self, item: str) -> list:
positions = []
for seed in range(self.num_hashes):
hash_val = mmh3.hash(item, seed) % self.size
positions.append(hash_val)
return positions
def add(self, item: str):
for pos in self._get_hash_positions(item):
self.bit_array[pos] = 1
def contains(self, item: str) -> bool:
return all(
self.bit_array[pos]
for pos in self._get_hash_positions(item)
)
@classmethod
def create_optimal(cls, expected_items: int, fp_rate: float):
"""Create Bloom filter with optimal parameters"""
import math
m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
k = int((m / expected_items) * math.log(2))
return cls(m, k)
# Usage
bf = BloomFilter.create_optimal(expected_items=1000000, fp_rate=0.01)
bf.add("user123")
bf.add("user456")
print(bf.contains("user123")) # True (correct)
print(bf.contains("user789")) # False (correct) or True (false positive)
Use Cases¶
1. Database Query Optimization¶
Used by: HBase, Cassandra, LevelDB, RocksDB
2. Web Crawler (Avoid Re-crawling)¶
visited_urls = BloomFilter(size=10_000_000, num_hashes=7)
def should_crawl(url):
if visited_urls.contains(url):
return False # Already visited (probably)
visited_urls.add(url)
return True
3. Spell Checker¶
4. Content Recommendation (Avoid Re-showing)¶
shown_to_user = BloomFilter(...)
def recommend(user_id, item_id):
key = f"{user_id}:{item_id}"
if shown_to_user.contains(key):
return False # Already shown
shown_to_user.add(key)
return True
5. Weak Password Detection¶
Store common passwords in Bloom Filter
Check new passwords against it
False positive = reject good password (acceptable)
False negative = never happens (security maintained)
6. Distributed Systems (Reduce Network Calls)¶
Variants¶
Counting Bloom Filter¶
- Replace bits with counters
- Supports deletion
- Higher space overhead
Scalable Bloom Filter¶
- Grows dynamically as elements added
- Chain of Bloom filters with decreasing FP rates
- Each new filter has stricter FP rate
Cuckoo Filter¶
- Supports deletion
- Better space efficiency for low FP rates
- Can return actual fingerprint (partial match)
Quotient Filter¶
- Cache-friendly
- Supports deletion
- Mergeable
Bloom Filter vs Alternatives¶
| Structure | Space | Lookup | Insert | Delete | False Positives |
|---|---|---|---|---|---|
| Bloom Filter | O(n) bits | O(k) | O(k) | ❌ | Yes |
| Counting Bloom | O(n) × 4 bits | O(k) | O(k) | ✅ | Yes |
| Cuckoo Filter | O(n) bits | O(1) | O(1) | ✅ | Yes |
| Hash Set | O(n) × element size | O(1) | O(1) | ✅ | No |
Common Interview Questions¶
1. Why can't we delete from a Bloom Filter?¶
Problem: Multiple elements may share the same bit position
Example:
- "apple" sets bits at positions 1, 4, 7
- "banana" sets bits at positions 1, 5, 9
- Deleting "apple" by clearing bit 1 would affect "banana" lookup
Solution: Use Counting Bloom Filter
2. How to choose optimal parameters?¶
Given: n (expected elements), p (desired false positive rate)
Calculate:
- m (bits) = -n × ln(p) / (ln(2))²
- k (hashes) = (m/n) × ln(2)
Example: n=1M, p=1%
- m = -1,000,000 × ln(0.01) / 0.4804 ≈ 9.6M bits ≈ 1.2 MB
- k = 9.6 × 0.693 ≈ 7 hash functions
3. What happens when Bloom Filter fills up?¶
- False positive rate increases
- Never becomes 100% (unless all bits are 1)
- Solutions:
1. Create larger filter and rebuild
2. Use Scalable Bloom Filter
3. Accept higher FP rate
4. How to check if an element was definitely added?¶
You can't! Bloom Filter only tells you:
- "Definitely NOT added" (bit is 0)
- "Possibly added" (could be false positive)
If you need certainty, use secondary storage for confirmation.
5. How to implement "recently seen" with expiration?¶
Real-World Usage¶
| System | Use Case |
|---|---|
| Google BigTable | Avoid disk reads for non-existent rows |
| Apache Cassandra | SSTable lookup optimization |
| Apache HBase | Block cache optimization |
| Medium | Avoid showing read articles |
| Ethereum | Log event filtering |
| Chrome | Safe browsing (malicious URL check) |
| Akamai | CDN cache optimization |
| Bitcoin | SPV wallet transaction filtering |
Space Comparison¶
For 1 million items:
| Structure | Space |
|---|---|
| Hash Set (strings avg 20 bytes) | ~20 MB |
| Bloom Filter (1% FP) | ~1.2 MB |
| Bloom Filter (0.1% FP) | ~1.8 MB |
Savings: 10-15x less memory
Key Takeaways¶
- Space efficient: 10 bits/element for 1% false positive
- No false negatives: "No" always means "definitely no"
- Fast operations: O(k) where k is small (typically 3-10)
- No deletion: Use Counting Bloom Filter if needed
- Choose wisely: Size based on expected elements and acceptable FP rate
- Best for: Membership testing where false positives are acceptable