Bloom filters debunked: Dispelling 30 Years of bad math with Coq!

https://gopiandcode.uk/logs/log-bloomfilters-debunked.html#fn.1

This was an interesting read, but it doesn’t really help in the way I expected. This is final derived proof for the false-positive rate of a bloom filter, which is impossible (for me) to use intuitively:

$$ \frac{1}{m^{k(n+1)}}\sum_{i=1}^m i^k i! \begin{pmatrix}m\\i\end{pmatrix} \begin{Bmatrix}kn\\i\end{Bmatrix} $$

All the code lives at: https://github.com/certichain/ceramist


While conceptually simple, this feature actually requires more engineering effort than one would expect - in particular, tracking the set of known malicious URLs in a practical manner turns out to be somewhat difficult. This is because, on one hand, trying to store the database of all known malicious URLs, which, bear in mind, may contain millions and millions of entries, is something that is just practically infeasible for most users. Conversely, sending every URL that a user visits to some external service, where it could be logged and data-mined by nefarious third parties, is something that most users should probably not be comfortable with.

Two key properties of Bloom filters:

  • No false negatives: This states that if a query for an URL in the Bloom filter returns negative, then the queried item can be guaranteed to not be present in the set of malicious URLs - i.e the Bloom filter is guaranteed to return a positive result for all known malicious URLs.
  • Low false positive rate: This property states that for any URL that is not in the set of malicious URLs, the likelihood of a Bloom filter query returning a positive result should be fairly low - thus minimising the number of unnecessary infractions on user privacy.

However, the widely cited expression for the false positive rate of a bloom filter is wrong!

A Bloom filter \(bf\) is a simply a combination of two separate components:

  • A bit-vector of \(n\) bits.
  • A sequence of \(k\) hash functions.

To use the Bloom filter as an approximate set, we can perform the standard set operations as follows:

  • Insertion: To insert a value \(x\) into a Bloom filter, simply hash it over each one of the \(k\) hash functions, and treating the hash outputs as indices into the bit-vector, raise the corresponding bits.
  • Query: To query for a value \(y\) in the Bloom filter, once again hash it over each one of the \(k\) hash functions, and then check whether all of the corresponding bits are raised.

As there are no operations that can unset these bits, we can be certain that any subsequent query for the inserted element must return true. In other words, if we get a negative result for a query, we can be certain that the item has never been inserted into the Bloom filter before.

Ideally, we’d like to characterise the frequency at which these kinds of results may occur, however this will generally be highly dependent on the particular choice of hash functions, limiting the scope of our analysis. As is the standard approach for these kinds of data structures, in order to get a more general result, we can instead treat the hash functions and their outputs as random

Edit