Bloom Filter¶

Introduction¶

A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can tell you: - "Definitely not in set" - 100% accurate - "Probably in set" - May have false positives

Key Properties¶

Property	Description
Space efficient	Much smaller than storing actual elements
Fast	O(k) for insert and lookup (k = number of hash functions)
No false negatives	If it says "no", element is definitely not present
False positives possible	If it says "yes", element might not be present
No deletion	Standard Bloom filter doesn't support deletion

How It Works¶

Structure¶

Bit array of size m (initialized to all 0s)
k independent hash functions

Insert Operation¶

For element x:
1. Compute k hash values: h1(x), h2(x), ..., hk(x)
2. Set bits at positions h1(x), h2(x), ..., hk(x) to 1

Lookup Operation¶

For element x:
1. Compute k hash values: h1(x), h2(x), ..., hk(x)
2. Check all k positions
3. If ALL bits are 1 → "Probably in set"
4. If ANY bit is 0 → "Definitely not in set"

Visual Example¶

Bloom Filter Visual Example

False Positive Probability¶

Formula¶

P(false positive) ≈ (1 - e^(-kn/m))^k

Where:
- m = size of bit array
- n = number of elements inserted
- k = number of hash functions

Optimal Number of Hash Functions¶

k = (m/n) × ln(2) ≈ 0.693 × (m/n)

Optimal Bit Array Size¶

m = -n × ln(p) / (ln(2))²

Where p = desired false positive probability

Quick Reference Table¶

False Positive Rate	Bits per Element	Hash Functions
1%	9.6 bits	7
0.1%	14.4 bits	10
0.01%	19.2 bits	13

Rule of thumb: ~10 bits per element for 1% false positive rate

Implementation¶

Python Implementation¶

import mmh3  # MurmurHash3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size: int, num_hashes: int):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def _get_hash_positions(self, item: str) -> list:
        positions = []
        for seed in range(self.num_hashes):
            hash_val = mmh3.hash(item, seed) % self.size
            positions.append(hash_val)
        return positions

    def add(self, item: str):
        for pos in self._get_hash_positions(item):
            self.bit_array[pos] = 1

    def contains(self, item: str) -> bool:
        return all(
            self.bit_array[pos]
            for pos in self._get_hash_positions(item)
        )

    @classmethod
    def create_optimal(cls, expected_items: int, fp_rate: float):
        """Create Bloom filter with optimal parameters"""
        import math
        m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        k = int((m / expected_items) * math.log(2))
        return cls(m, k)


# Usage
bf = BloomFilter.create_optimal(expected_items=1000000, fp_rate=0.01)
bf.add("user123")
bf.add("user456")

print(bf.contains("user123"))  # True (correct)
print(bf.contains("user789"))  # False (correct) or True (false positive)

Use Cases¶

1. Database Query Optimization¶

Database Query Optimization with Bloom Filter

Used by: HBase, Cassandra, LevelDB, RocksDB

2. Web Crawler (Avoid Re-crawling)¶

visited_urls = BloomFilter(size=10_000_000, num_hashes=7)

def should_crawl(url):
    if visited_urls.contains(url):
        return False  # Already visited (probably)
    visited_urls.add(url)
    return True

3. Spell Checker¶

Dictionary in Bloom Filter (memory efficient)
Check if word exists before suggesting corrections

4. Content Recommendation (Avoid Re-showing)¶

shown_to_user = BloomFilter(...)

def recommend(user_id, item_id):
    key = f"{user_id}:{item_id}"
    if shown_to_user.contains(key):
        return False  # Already shown
    shown_to_user.add(key)
    return True

5. Weak Password Detection¶

Store common passwords in Bloom Filter
Check new passwords against it
False positive = reject good password (acceptable)
False negative = never happens (security maintained)

6. Distributed Systems (Reduce Network Calls)¶

Distributed Systems - Reduce Network Calls

Variants¶

Counting Bloom Filter¶

Replace bits with counters
Supports deletion
Higher space overhead

Counting Bloom Filter

Scalable Bloom Filter¶

Grows dynamically as elements added
Chain of Bloom filters with decreasing FP rates
Each new filter has stricter FP rate

Cuckoo Filter¶

Supports deletion
Better space efficiency for low FP rates
Can return actual fingerprint (partial match)

Quotient Filter¶

Cache-friendly
Supports deletion
Mergeable

Bloom Filter vs Alternatives¶

Structure	Space	Lookup	Insert	Delete	False Positives
Bloom Filter	O(n) bits	O(k)	O(k)	❌	Yes
Counting Bloom	O(n) × 4 bits	O(k)	O(k)	✅	Yes
Cuckoo Filter	O(n) bits	O(1)	O(1)	✅	Yes
Hash Set	O(n) × element size	O(1)	O(1)	✅	No

Common Interview Questions¶

1. Why can't we delete from a Bloom Filter?¶

Problem: Multiple elements may share the same bit position

Example:
- "apple" sets bits at positions 1, 4, 7
- "banana" sets bits at positions 1, 5, 9
- Deleting "apple" by clearing bit 1 would affect "banana" lookup

Solution: Use Counting Bloom Filter

2. How to choose optimal parameters?¶

Given: n (expected elements), p (desired false positive rate)

Calculate:
- m (bits) = -n × ln(p) / (ln(2))²
- k (hashes) = (m/n) × ln(2)

Example: n=1M, p=1%
- m = -1,000,000 × ln(0.01) / 0.4804 ≈ 9.6M bits ≈ 1.2 MB
- k = 9.6 × 0.693 ≈ 7 hash functions

3. What happens when Bloom Filter fills up?¶

- False positive rate increases
- Never becomes 100% (unless all bits are 1)
- Solutions:
  1. Create larger filter and rebuild
  2. Use Scalable Bloom Filter
  3. Accept higher FP rate

4. How to check if an element was definitely added?¶

You can't! Bloom Filter only tells you:
- "Definitely NOT added" (bit is 0)
- "Possibly added" (could be false positive)

If you need certainty, use secondary storage for confirmation.

5. How to implement "recently seen" with expiration?¶

Time-Windowed Bloom Filters

Real-World Usage¶

System	Use Case
Google BigTable	Avoid disk reads for non-existent rows
Apache Cassandra	SSTable lookup optimization
Apache HBase	Block cache optimization
Medium	Avoid showing read articles
Ethereum	Log event filtering
Chrome	Safe browsing (malicious URL check)
Akamai	CDN cache optimization
Bitcoin	SPV wallet transaction filtering

Space Comparison¶

For 1 million items:

Structure	Space
Hash Set (strings avg 20 bytes)	~20 MB
Bloom Filter (1% FP)	~1.2 MB
Bloom Filter (0.1% FP)	~1.8 MB

Savings: 10-15x less memory

Key Takeaways¶

Space efficient: 10 bits/element for 1% false positive
No false negatives: "No" always means "definitely no"
Fast operations: O(k) where k is small (typically 3-10)
No deletion: Use Counting Bloom Filter if needed
Choose wisely: Size based on expected elements and acceptable FP rate
Best for: Membership testing where false positives are acceptable