Why bloom filters are cool
Published March 08, 2025
Recently I needed to store a few million hashes for a project. While researching options I came across bloom filters, and from what I read they seemed like the perfect solution.
Project Requirements
Without getting into the context of what the application is (maybe one day), here are my requirements.
- Cheaply store millions, and potentially up to billions of sha256 hashes.
- Be able to "index" these hashes and ship them in a way that can be queried offline and client-side by my application.
- I didn't need 100% accuracy. A failure-rate of 0.01% would be acceptable.
- The ability to query up to billions of hashes within milliseconds.
- Not shipping the values of hashes to clients would be a plus.
Why, and what are Bloom Filters?
From wikipedia:
a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set"
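To make that concrete, here is a toy bloom filter in pure Python (an illustrative sketch, not the library used later in this post): a bit array plus k hash functions. Adding an element sets k bits; a lookup answers "possibly in set" only if all k of its bits are set.

```python
import hashlib


class ToyBloomFilter:
    """A minimal bloom filter: a bit array plus k salted sha256 hashes."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by salting the hash input k different ways
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f'{salt}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        # True means "possibly in set"; False means "definitely not in set"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = ToyBloomFilter(num_bits=10_000, num_hashes=7)
bf.add('hello')
print('hello' in bf)  # always True: bloom filters have no false negatives
```

A false positive happens when an element that was never added happens to map to bits that are all already set by other elements, which is why a positive answer is only ever "possibly in set".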
Bloom filters in the wild:
- The Google Chrome web browser previously used a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).
- The Squid Web Proxy Cache uses Bloom filters for cache digests.
- Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain.
- Fruit flies use a modified version of Bloom filters to detect novelty of odors, with additional features including similarity of novel odor to that of previously experienced examples, and time elapsed since previous experience of the same odor.
Redis Bloom
While doing more research I found Redis bloom filters. After testing, I soon hit memory limits as I approached 1 billion hashes (likely a user error lol). More importantly, my requirements included offline access to the data, and shipping a Redis instance to each client was obviously not feasible.
Rust go brr!
After deciding Redis Bloom wasn't going to work, I found fastbloom, which is written in Rust and is supposed to be extremely fast. It also has a Python module, which means I don't have to fight the Rust borrow checker 🦀.
Installing fastbloom for python
Installation is pretty standard from pypi using pip:
$ pip install fastbloom-rs
Test data
I had a thicc list of 1 billion sha256 hashes to use as test data. This dataset was 61GB on disk.
$ wc -l 1billion-sha256-random-hashes.txt
1000000000 1billion-sha256-random-hashes.txt
$ du -sch 1billion-sha256-random-hashes.txt
61G 1billion-sha256-random-hashes.txt
61G total
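I'm not shipping a 61GB file with this post, but a smaller version of the dataset is easy to reproduce (a sketch; the filename and count here are just for illustration): hash random bytes with sha256 and write one hex digest per line.

```python
import hashlib
import os


def write_random_hashes(path: str, count: int) -> None:
    """Write `count` random sha256 hex digests to `path`, one per line."""
    with open(path, 'w') as f:
        for _ in range(count):
            digest = hashlib.sha256(os.urandom(32)).hexdigest()
            f.write(digest + '\n')


write_random_hashes('sample-sha256-hashes.txt', 10_000)
```

Each line is 64 hex characters plus a newline, so 1 billion lines works out to roughly the 61GB shown above.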
Creating the bloom filter
We are creating our filter with an error rate of 0.01:
import os

from fastbloom_rs import BloomFilter


def create_bloom(capacity: int, error_rate: float) -> BloomFilter:
    """Create an empty bloom filter with the given capacity and error rate."""
    return BloomFilter(capacity, error_rate)


def process_file_in_batches(file_path, bloom, batch_size=50_000):
    """Read the file in batches and add data to the bloom filter efficiently."""
    buffer = []
    with open(file_path, 'r') as file:
        for line in file:
            buffer.append(line.strip())  # Strip and store in buffer
            if len(buffer) >= batch_size:
                bloom.add_str_batch(buffer)  # Add batch
                buffer.clear()  # Clear buffer to free memory
    # Add any remaining elements in buffer
    if buffer:
        bloom.add_str_batch(buffer)


def main():
    # Where to save the bloom filter on disk
    bloom_file = 'hashes.bloom'
    # 61GB, contains 1 billion random sha256 hashes
    test_hashes_file = '1billion-sha256-random-hashes.txt'
    # Bloom filter parameters
    capacity = 1_000_000_000
    error_rate = 0.01

    # Check if the bloom filter already exists
    if os.path.exists(bloom_file):
        print(f'Bloom filter {bloom_file} already exists')
    else:
        print(f'Creating bloom filter {bloom_file}')
        bloom = create_bloom(capacity, error_rate)
        # Stream and process the file in batches
        process_file_in_batches(test_hashes_file, bloom, batch_size=10_000)
        print('Done processing hashes, saving bloom filter')
        bloom.save_to_file_with_hashes(bloom_file)


if __name__ == '__main__':
    main()
After running this we can see that we now have the bloom filter on disk, and it is only 1.1GB in size. It took just over 4 minutes to create the filter.
$ time python create_bloom.py
Creating bloom filter hashes.bloom
Done processing hashes, saving bloom filter
real 4m2.956s
user 3m28.012s
sys 0m21.589s
$ du -sch hashes.bloom
1.1G hashes.bloom
1.1G total
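That 1.1G is very close to what the textbook sizing formula predicts. For n elements and false-positive rate p, a bloom filter needs about m = -n·ln(p)/(ln 2)² bits and k = -log₂(p) hash functions. A quick sanity check for n = 1 billion and p = 0.01:

```python
import math

n = 1_000_000_000  # expected number of elements
p = 0.01           # target false-positive rate

bits = -n * math.log(p) / math.log(2) ** 2  # optimal number of bits
hashes = -math.log2(p)                      # optimal number of hash functions

print(f'{bits / 8 / 1024**3:.2f} GiB, {math.ceil(hashes)} hash functions')
# roughly 1.1 GiB and 7 hash functions
```

So the on-disk size we got is almost exactly the theoretical optimum for these parameters.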
Running a test of 10 random sha256 hashes
Testing was a great success, and went better than I had anticipated. This is why I think bloom filters are cool.
$ time python query_test.py
2b0d1209ef85cd0d749b76e02f1ead013e102274eb8683b2448c4c863e6b3abf => True
00018ba422b515e65f22ff789f092a6aaf4de445a9e45ae95b866c530db2b5f0 => True
121a91d7bb9fa90c8b398f2359e6005a41c3f5ac38dd0fd451142e36ca3af661 => True
ed1d139239ba5073b9cee3b11927610e15ab9c368a66a81d4d4a04b20a384349 => True
a183856530b28fd51f2730e01753ce4a07422a1ee9a0dbd68fd5ffd0d8447b44 => True
c7b4b33c167c5f0fd71a2459d672fb4458aa9a47870d554f5c7f541b1bd5fc2e => True
5ac0e40ad0a5e6dd990d91654576f29275082a4ba50eacda4ecd498f34031fda => True
4076fb50e890dcca4b1aa1f3bff32f08c2dc18848b7592143810518f2c3ad716 => True
e6d7e144d8e033471d398955b3332664b4601300eae95b6b80269598ffe8091d => True
2e01a1a47269243555afd8f3fd5120a0c868539e21c266fcedbcb3818e5cc6fa => True
real 0m0.214s
user 0m0.026s
sys 0m0.182s