Skip to content

Bloom Filter

Introduction

A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can tell you: - "Definitely not in set" - 100% accurate - "Probably in set" - May have false positives

Key Properties

Property Description
Space efficient Much smaller than storing actual elements
Fast O(k) for insert and lookup (k = number of hash functions)
No false negatives If it says "no", element is definitely not present
False positives possible If it says "yes", element might not be present
No deletion Standard Bloom filter doesn't support deletion

How It Works

Structure

  • Bit array of size m (initialized to all 0s)
  • k independent hash functions

Insert Operation

For element x:
1. Compute k hash values: h1(x), h2(x), ..., hk(x)
2. Set bits at positions h1(x), h2(x), ..., hk(x) to 1

Lookup Operation

For element x:
1. Compute k hash values: h1(x), h2(x), ..., hk(x)
2. Check all k positions
3. If ALL bits are 1 → "Probably in set"
4. If ANY bit is 0 → "Definitely not in set"

Visual Example

Bloom Filter Visual Example


False Positive Probability

Formula

P(false positive) ≈ (1 - e^(-kn/m))^k

Where:
- m = size of bit array
- n = number of elements inserted
- k = number of hash functions

Optimal Number of Hash Functions

k = (m/n) × ln(2) ≈ 0.693 × (m/n)

Optimal Bit Array Size

m = -n × ln(p) / (ln(2))²

Where p = desired false positive probability

Quick Reference Table

False Positive Rate Bits per Element Hash Functions
1% 9.6 bits 7
0.1% 14.4 bits 10
0.01% 19.2 bits 13

Rule of thumb: ~10 bits per element for 1% false positive rate


Implementation

Python Implementation

import mmh3  # MurmurHash3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size: int, num_hashes: int):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def _get_hash_positions(self, item: str) -> list:
        positions = []
        for seed in range(self.num_hashes):
            hash_val = mmh3.hash(item, seed) % self.size
            positions.append(hash_val)
        return positions

    def add(self, item: str):
        for pos in self._get_hash_positions(item):
            self.bit_array[pos] = 1

    def contains(self, item: str) -> bool:
        return all(
            self.bit_array[pos]
            for pos in self._get_hash_positions(item)
        )

    @classmethod
    def create_optimal(cls, expected_items: int, fp_rate: float):
        """Create Bloom filter with optimal parameters"""
        import math
        m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        k = int((m / expected_items) * math.log(2))
        return cls(m, k)


# Usage
bf = BloomFilter.create_optimal(expected_items=1000000, fp_rate=0.01)
bf.add("user123")
bf.add("user456")

print(bf.contains("user123"))  # True (correct)
print(bf.contains("user789"))  # False (correct) or True (false positive)

Use Cases

1. Database Query Optimization

Database Query Optimization with Bloom Filter

Used by: HBase, Cassandra, LevelDB, RocksDB

2. Web Crawler (Avoid Re-crawling)

visited_urls = BloomFilter(size=10_000_000, num_hashes=7)

def should_crawl(url):
    if visited_urls.contains(url):
        return False  # Already visited (probably)
    visited_urls.add(url)
    return True

3. Spell Checker

Dictionary in Bloom Filter (memory efficient)
Check if word exists before suggesting corrections

4. Content Recommendation (Avoid Re-showing)

shown_to_user = BloomFilter(...)

def recommend(user_id, item_id):
    key = f"{user_id}:{item_id}"
    if shown_to_user.contains(key):
        return False  # Already shown
    shown_to_user.add(key)
    return True

5. Weak Password Detection

Store common passwords in Bloom Filter
Check new passwords against it
False positive = reject good password (acceptable)
False negative = never happens (security maintained)

6. Distributed Systems (Reduce Network Calls)

Distributed Systems - Reduce Network Calls


Variants

Counting Bloom Filter

  • Replace bits with counters
  • Supports deletion
  • Higher space overhead

Counting Bloom Filter

Scalable Bloom Filter

  • Grows dynamically as elements added
  • Chain of Bloom filters with decreasing FP rates
  • Each new filter has stricter FP rate

Cuckoo Filter

  • Supports deletion
  • Better space efficiency for low FP rates
  • Can return actual fingerprint (partial match)

Quotient Filter

  • Cache-friendly
  • Supports deletion
  • Mergeable

Bloom Filter vs Alternatives

Structure Space Lookup Insert Delete False Positives
Bloom Filter O(n) bits O(k) O(k) Yes
Counting Bloom O(n) × 4 bits O(k) O(k) Yes
Cuckoo Filter O(n) bits O(1) O(1) Yes
Hash Set O(n) × element size O(1) O(1) No

Common Interview Questions

1. Why can't we delete from a Bloom Filter?

Problem: Multiple elements may share the same bit position

Example:
- "apple" sets bits at positions 1, 4, 7
- "banana" sets bits at positions 1, 5, 9
- Deleting "apple" by clearing bit 1 would affect "banana" lookup

Solution: Use Counting Bloom Filter

2. How to choose optimal parameters?

Given: n (expected elements), p (desired false positive rate)

Calculate:
- m (bits) = -n × ln(p) / (ln(2))²
- k (hashes) = (m/n) × ln(2)

Example: n=1M, p=1%
- m = -1,000,000 × ln(0.01) / 0.4804 ≈ 9.6M bits ≈ 1.2 MB
- k = 9.6 × 0.693 ≈ 7 hash functions

3. What happens when Bloom Filter fills up?

- False positive rate increases
- Never becomes 100% (unless all bits are 1)
- Solutions:
  1. Create larger filter and rebuild
  2. Use Scalable Bloom Filter
  3. Accept higher FP rate

4. How to check if an element was definitely added?

You can't! Bloom Filter only tells you:
- "Definitely NOT added" (bit is 0)
- "Possibly added" (could be false positive)

If you need certainty, use secondary storage for confirmation.

5. How to implement "recently seen" with expiration?

Time-Windowed Bloom Filters


Real-World Usage

System Use Case
Google BigTable Avoid disk reads for non-existent rows
Apache Cassandra SSTable lookup optimization
Apache HBase Block cache optimization
Medium Avoid showing read articles
Ethereum Log event filtering
Chrome Safe browsing (malicious URL check)
Akamai CDN cache optimization
Bitcoin SPV wallet transaction filtering

Space Comparison

For 1 million items:

Structure Space
Hash Set (strings avg 20 bytes) ~20 MB
Bloom Filter (1% FP) ~1.2 MB
Bloom Filter (0.1% FP) ~1.8 MB

Savings: 10-15x less memory


Key Takeaways

  1. Space efficient: 10 bits/element for 1% false positive
  2. No false negatives: "No" always means "definitely no"
  3. Fast operations: O(k) where k is small (typically 3-10)
  4. No deletion: Use Counting Bloom Filter if needed
  5. Choose wisely: Size based on expected elements and acceptable FP rate
  6. Best for: Membership testing where false positives are acceptable