# This is just a simple hash function that takes a string and returns an integer.
# Don't worry about the implementation details
def h(x, i, filter_size):
= lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 7919
h2 = lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 4933
h3 return (hash(x) + i * h2(x) + i * i * h3(x)) % filter_size
Bloom Filter
Recall our goal in the previous session: given a data d
and a collection S
, we want an efficient way to know:
- If
d
is inS
(membership query) - Add
d
toS
(insertion) while maintaining not duplicate inS
Bit Array
We started from a bit array, e.g. set([0, 1, 4])
can be represented as:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
-----------------------------
1 | 1 | 0 | 0 | 1 | 0 | 0 | 0
When a new data d
comes, we hash it to a bit position and set the bit to 1
However, the problem is that we need the size of the bit array to be as large as the number of data we want to store. This is not efficient (sparse array)
Hash Function
We then introduced a hash function that maps a data d
to a bit position in the array, effectively compressing the array size
(e.g. h(x) = x \mod 8)
set([0, 1, 4, 15])
can be represented as:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
-----------------------------
1 | 1 | 0 | 0 | 1 | 0 | 0 | 1
Collision Handling
But we have a problem: collisions.
When two data d1
and d2
are mapped to the same bit position, we cannot distinguish them anymore.
e.g. h(8) = h(0) = 0
So that we now need to store the data in the array as well
set([0, 1, 4, 15, 8])
can be represented as:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
-----------------------------
0 | 1 | _ | _ | 4 | _ | _ | 15
8 | | | | | | |
But the smaller the array size, the more collisions we would have. If the array size is too small, the mapping table would effectively becomes a linked list:
set([0, 1, 4, 15, 8])
can be represented as:
0 | 1
-----
0 | 1
4 | 15
8 |
which now has O(n)
lookup time (instead of O(1)
)
There is a trade-off between the array size and the number of collisions (and thus the lookup time) -> classic memory vs. time trade-off
Bloom Filter
Bloom filter is a probabilistic data structure that solves the problem of memory vs. accuracy trade-off.
The idea is - instead of only using one hash function - we use multiple hash functions to map a data d
to multiple bit positions in the array.
= h("Hello", 0, 10)
h1 = h("Hello", 1, 10)
h2 = h("Hello", 2, 10)
h3 print(h1, h2, h3)
9 4 1
So, “hello” is mapped to 3 hash values: h1
, h2
, h3
and we set the corresponding bits to 1 (similar to the bit array approach)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
-------------------------------------
0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1
To check whether “hello” is in the set, we check whether all the corresponding bits are 1. If not, we know that “hello” is not in the set. If yes, we know that “hello” is probably in the set (but not 100% sure)
Let’s store more data: “world”
= h("alice", 0, 10)
h1 = h("alice", 1, 10)
h2 = h("alice", 2, 10)
h3 print(h1, h2, h3)
8 1 2
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
-------------------------------------
0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1
Now, let’s check whether “bob” is exist:
= h("bob", 0, 10)
h1 = h("bob", 1, 10)
h2 = h("bob", 2, 10)
h3 print(h1, h2, h3)
9 6 5
Since bit 6 and 5 are 0, we are sure that “bob” is not in the set
= h("catur", 0, 10)
h1 = h("catur", 1, 10)
h2 = h("catur", 2, 10)
h3 print(h1, h2, h3)
4 5 2
“catur” is also definitely not in the set (bit 5 is 0)
= h("levi", 0, 10)
h1 = h("levi", 1, 10)
h2 = h("levi", 2, 10)
h3 print(h1, h2, h3)
4 4 4
Is “levi” in the set?
Bit 4 are 1, so we can conclude that “levi” is probably in the set (it can be a false positive)
Type I and Type II Error
x | Actual Positive | Actual Negative |
---|---|---|
Predicted True | True Positive | False Positive |
Predicted False | False Negative | True Negative |
False Positive means: we predict it is positive, but it is actually negative
Bloom Filter has no False Negative, but it has False Positive i.e. when Bloom Filter say: - it’s not in the set -> it’s definitely not in the set - it’s in the set -> it’s probably in the set
Full Implementation
Note: this is just a conceptual implementation, not a production-ready implementation
# This is just a simple hash function that takes a string and returns an integer.
# Don't worry about the implementation details
def h(x, i, filter_size):
= lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 7919
h2 = lambda x: int.from_bytes(x.encode('utf-8'), 'little') % 4933
h3 return (hash(x) + i * h2(x) + i * i * h3(x)) % filter_size
class BloomFilter:
def __init__(self, size):
self.size = size
self.filter = [0] * size
def add(self, item):
for i in range(3): # we are using three hashes
= h(item, i, self.size)
index self.filter[index] = 1
def check(self, item):
for i in range(3): # we are using three hashes
= h(item, i, self.size)
index if self.filter[index] == 0:
return False # definitely not present
return True # possibly present
# Usage
= 1000
bloom_size = BloomFilter(bloom_size)
bloom_filter
# Add some elements
"apple")
bloom_filter.add("banana")
bloom_filter.add(
# Check for elements
print(bloom_filter.check("apple"))
print(bloom_filter.check("banana"))
print(bloom_filter.check("cherry"))
print(bloom_filter.check("durian"))
True
True
False
False
Note that, in a normal Bloom Filter implementation, we can’t remove an element from the set
Complexity
- Space:
O(m)
wherem
is the size of the bit array,m
can be <<n
(the number of data) - Lookup:
O(k)
wherek
is the number of hash functions
Optimal Parameters
The ideal values for m
and k
in a Bloom filter, where:
m
is the size of the bit array.k
is the number of hash functions.
are dependent on two factors:
- The number of elements to be inserted into the Bloom filter, denoted as
n
. - The acceptable false positive probability, denoted as
p
.
For the optimal number of hash functions k
, the formula is:
k = \frac{m}{n} \ln 2
This is derived from the fact that the false positive rate is minimized when each of the k
hash functions maps an inserted element to a bit that is set to 1 with the probability 1/2
.
For the size of the bit array m
, given the desired false positive probability p
, the formula is:
m = -\frac{n \ln p}{(\ln 2)^2}
These equations are derived from the analysis that aims to minimize the probability of false positives.
Use Cases
Source: Wikipedia
Weak Password Detection
Weak passwords can be stored in a Bloom Filter. When a user tries to register a new password, we can check whether it is in the Bloom Filter. - If yes, we can reject the password (or double check) - If no, it is not a weak password
Used Username Detection
We store all the used usernames in a Bloom Filter. When a user tries to register a new username, we can check whether it is in the Bloom Filter. - If yes, we can double check to database whether it is really used - If no, it is not a used username, we can safely register it
Malicious URL Detection
Chrome used to use Bloom Filter to detect malicious URLs. When a user tries to visit a URL, Chrome can check whether it is in the Bloom Filter. - If yes, it probably a malicious URL, Chrome can warn the user (or double check) - If no, it is not a malicious URL, Chrome can safely visit it
Database
In databases, Bloom filters are used to reduce the disk lookups for non-existing rows or columns. They can quickly tell if a given data is not present in the database, thereby saving costly disk access.
Spell Checker
Bloom filters can be used to reduce the size of the dictionary, allowing for quick lookups to see if a word is in the dictionary or definitely not. - If yes, it’s probably a word - If no, we can warn the user that the word is not in the dictionary