Bloom Filter

Recall our goal in the previous session: given a data d and a collection S, we want an efficient way to know:

If d is in S (membership query)
Add d to S (insertion) while maintaining not duplicate in S

Bit Array

We started from a bit array, e.g. set([0, 1, 4]) can be represented as:

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
-----------------------------
1 | 1 | 0 | 0 | 1 | 0 | 0 | 0

When a new data d comes, we hash it to a bit position and set the bit to 1

However, the problem is that we need the size of the bit array to be as large as the number of data we want to store. This is not efficient (sparse array)

Hash Function

We then introduced a hash function that maps a data d to a bit position in the array, effectively compressing the array size

(e.g. $h(x) = x \mod 8$ )

set([0, 1, 4, 15]) can be represented as:

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
-----------------------------
1 | 1 | 0 | 0 | 1 | 0 | 0 | 1

Collision Handling

But we have a problem: collisions.

When two data d1 and d2 are mapped to the same bit position, we cannot distinguish them anymore.

e.g. $h(8) = h(0) = 0$

So that we now need to store the data in the array as well

set([0, 1, 4, 15, 8]) can be represented as:

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
-----------------------------
0 | 1 | _ | _ | 4 | _ | _ | 15
8 |   |   |   |   |   |   |

But the smaller the array size, the more collisions we would have. If the array size is too small, the mapping table would effectively becomes a linked list:

set([0, 1, 4, 15, 8]) can be represented as:

0 | 1
-----
0 | 1
4 | 15
8 |

which now has O(n) lookup time (instead of O(1))

There is a trade-off between the array size and the number of collisions (and thus the lookup time) -> classic memory vs. time trade-off

Bloom Filter

Bloom filter is a probabilistic data structure that solves the problem of memory vs. accuracy trade-off.

The idea is - instead of only using one hash function - we use multiple hash functions to map a data d to multiple bit positions in the array.

# This is just a simple hash function that takes a string and returns an integer.
# Don't worry about the implementation details
def h(x, i, filter_size):
    h2 = lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 7919
    h3 = lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 4933
    return (hash(x) + i * h2(x) + i * i * h3(x)) % filter_size

h1 = h("Hello", 0, 10)
h2 = h("Hello", 1, 10)
h3 = h("Hello", 2, 10)
print(h1, h2, h3)

9 4 1

So, “hello” is mapped to 3 hash values: h1, h2, h3 and we set the corresponding bits to 1 (similar to the bit array approach)

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
-------------------------------------
0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1

To check whether “hello” is in the set, we check whether all the corresponding bits are 1. If not, we know that “hello” is not in the set. If yes, we know that “hello” is probably in the set (but not 100% sure)

Let’s store more data: “world”

h1 = h("alice", 0, 10)
h2 = h("alice", 1, 10)
h3 = h("alice", 2, 10)
print(h1, h2, h3)

8 1 2

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
-------------------------------------
0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1

Now, let’s check whether “bob” is exist:

h1 = h("bob", 0, 10)
h2 = h("bob", 1, 10)
h3 = h("bob", 2, 10)
print(h1, h2, h3)

9 6 5

Since bit 6 and 5 are 0, we are sure that “bob” is not in the set

h1 = h("catur", 0, 10)
h2 = h("catur", 1, 10)
h3 = h("catur", 2, 10)
print(h1, h2, h3)

4 5 2

“catur” is also definitely not in the set (bit 5 is 0)

h1 = h("levi", 0, 10)
h2 = h("levi", 1, 10)
h3 = h("levi", 2, 10)
print(h1, h2, h3)

4 4 4

Is “levi” in the set?

Bit 4 are 1, so we can conclude that “levi” is probably in the set (it can be a false positive)

Type I and Type II Error

x	Actual Positive	Actual Negative
Predicted True	True Positive	False Positive
Predicted False	False Negative	True Negative

False Positive means: we predict it is positive, but it is actually negative

Bloom Filter has no False Negative, but it has False Positive i.e. when Bloom Filter say: - it’s not in the set -> it’s definitely not in the set - it’s in the set -> it’s probably in the set

Full Implementation

Note: this is just a conceptual implementation, not a production-ready implementation

# This is just a simple hash function that takes a string and returns an integer.
# Don't worry about the implementation details
def h(x, i, filter_size):
    h2 = lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 7919
    h3 = lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 4933
    return (hash(x) + i * h2(x) + i * i * h3(x)) % filter_size

class BloomFilter:
    def __init__(self, size):
        self.size = size
        self.filter = [0] * size

    def add(self, item):
        for i in range(3):  # we are using three hashes
            index = h(item, i, self.size)
            self.filter[index] = 1

    def check(self, item):
        for i in range(3):  # we are using three hashes
            index = h(item, i, self.size)
            if self.filter[index] == 0:
                return False  # definitely not present
        return True  # possibly present

# Usage
bloom_size = 1000
bloom_filter = BloomFilter(bloom_size)

# Add some elements
bloom_filter.add("apple")
bloom_filter.add("banana")

# Check for elements
print(bloom_filter.check("apple"))  
print(bloom_filter.check("banana")) 
print(bloom_filter.check("cherry")) 
print(bloom_filter.check("durian"))

True
True
False
False

Note that, in a normal Bloom Filter implementation, we can’t remove an element from the set

Complexity

Space: O(m) where m is the size of the bit array, m can be << n (the number of data)
Lookup: O(k) where k is the number of hash functions

Optimal Parameters

The ideal values for m and k in a Bloom filter, where:

m is the size of the bit array.
k is the number of hash functions.

are dependent on two factors:

The number of elements to be inserted into the Bloom filter, denoted as n.
The acceptable false positive probability, denoted as p.

For the optimal number of hash functions k, the formula is:

$k = \frac{m}{n} \ln 2$

This is derived from the fact that the false positive rate is minimized when each of the k hash functions maps an inserted element to a bit that is set to 1 with the probability 1/2.

For the size of the bit array m, given the desired false positive probability p, the formula is:

$m = -\frac{n \ln p}{(\ln 2)^2}$

These equations are derived from the analysis that aims to minimize the probability of false positives.

Use Cases

Source: Wikipedia

Weak Password Detection

Weak passwords can be stored in a Bloom Filter. When a user tries to register a new password, we can check whether it is in the Bloom Filter. - If yes, we can reject the password (or double check) - If no, it is not a weak password

Used Username Detection

We store all the used usernames in a Bloom Filter. When a user tries to register a new username, we can check whether it is in the Bloom Filter. - If yes, we can double check to database whether it is really used - If no, it is not a used username, we can safely register it

Malicious URL Detection

Chrome used to use Bloom Filter to detect malicious URLs. When a user tries to visit a URL, Chrome can check whether it is in the Bloom Filter. - If yes, it probably a malicious URL, Chrome can warn the user (or double check) - If no, it is not a malicious URL, Chrome can safely visit it

Database

In databases, Bloom filters are used to reduce the disk lookups for non-existing rows or columns. They can quickly tell if a given data is not present in the database, thereby saving costly disk access.

Spell Checker

Bloom filters can be used to reduce the size of the dictionary, allowing for quick lookups to see if a word is in the dictionary or definitely not. - If yes, it’s probably a word - If no, we can warn the user that the word is not in the dictionary