Cardinality Estimation in HyperLogLog
The HyperLogLog (HLL) cardinality estimator is one of the most interesting and useful things I’ve learned about in the last few years. I’ve made a few attempts to understand the HLL paper but I inevitably run aground on this sentence:
For instance, observing in the stream S at the beginning of a string a bitpattern 0 ρ−1 is more or less a likely indication that the cardinality n of S is at least 2 ρ.
This formulation always seems weirdly squishy – “more or less a likely indication” – so I’ve been puzzling over how to make sense of it.
Defining the Problem
To back up a little, the setup is this. You associate a random bit-string with each element in a set. Specifically, the bit-strings are random in the sense that each bit is chosen uniformly and independently of the others. So for example, the probability that a given bit-string starts with a 1 is .5. The probability that it starts with 01 is .25, etc. In general the probability that it starts with exactly ρ-1 zeroes followed by a one is 2-ρ. (The same is true of any specific prefix of length (ρ), of course.)
So far so good. The tricky bit is that in the HLL you are observing these bit-strings in a stream, and the only information you preserve from one to the next is the maximum number of leading zeroes that you’ve seen so far. From that you need to make an estimate of N, the number of distinct bit-strings you’ve seen, with the assumption that bit-strings map uniquely on to elements of the set and so N also gives an estimate of the cardinality of the set.
So, let’s define Z as the random variable representing the maximum number of leading zeroes observed in a set of N random bit-strings. A natural question to ask is, for a given value of N, what is the distribution of Z? If we know that, then we can find the maximum-likelihood estimate for N, by choosing N to maximize P(ẑ) for a specific observed value ẑ of Z. There’s probably a textbook answer to this question but I wanted to try to work it out from scratch in the interest of understanding it better.
Computing the Distribution
I tried approaching it by working through the combinatorics case-by-case. It starts out easily enough. Suppose N=2, then we have:
| ẑ | Possible bit-string prefix pairs | Probability of prefix pair | P(ẑ) |
|---|---|---|---|
| 0 | (1x,1x) | .25 | .25 |
| 1 | (1x,01) | .125 | .3125 |
| (01,1x) | .125 | ||
| (01,01) | .0625 | ||
| 2 | (1xx,001) | .0625 | .203125 |
| (01x,001) | .03125 | ||
| (001,1xx) | .0625 | ||
| (001,01x) | .03125 | ||
| (001,001) | .015625 |
This will obviously get out-of-hand fast, but it seems clear that the probabilities will fall off rapidly for ẑ > 2. It’s encouraging to see that the best estimate of N in this example is is 2ẑ.
I thought about this a bit more and realized that there’s a recurrence relation that leads to a straightforward dynamic-programming solution to compute the distribution for each value of N. (Which is great, because I’m befuddled by dynamic programming and need all the practice I can get).
It’s easiest to think about if you assume you just have the N bit-strings all lined up in a row and can consider them one after another. Suppose you’ve visited 5 bit-strings and the observed ẑ is 3, i.e. you’ve seen 3 leading zeros in at least one bit-string. There are three ways this could happen:
- You observed z < 3 in the first 4 bit-strings, and z = 3 in the fifth.
- You observed z = 3 in at least one of the first 4 bit-strings, and z < 3 in the fifth.
- You observed z = 3 in at least one of the first 4 bit-strings, and z = 3 also in the fifth.
(You can actually combine the second and third cases above but the code turns out a little nicer if you think about it this way.) This breakdown gives us the following recurrence relation:
Once I had the recurrence relation defined I of course wanted to code it up. This gist implements the above recurrence relation in Python using numpy.
Results
The code posted above has a couple of output functions for interrogating the results. One prints out the probability table up to maxN. Happily, this confirms the results I derived by hand above for the N=2 case.
More interestingly it can print out the most likely value of N for each value of ẑ up to some limit. I ran it for ẑ < 10, N < 10000 and got the following results.
| ẑ | Most likely N | P(ẑ | N) |
|---|---|---|
| 0 | 1 | 0.5 |
| 1 | 2 | 0.3125 |
| 2 | 5 | 0.275604248046875 |
| 3 | 11 | 0.26149056621483169 |
| 4 | 22 | 0.25559409343845607 |
| 5 | 44 | 0.25275771691991533 |
| 6 | 88 | 0.25136619509499902 |
| 7 | 177 | 0.25068007612282972 |
| 8 | 354 | 0.25033913265559699 |
| 9 | 709 | 0.25016941106842389 |
| 10 | 1419 | 0.25008466256395218 |
| 11 | 2839 | 0.25004231632716933 |
| 12 | 5678 | 0.25002115600302027 |
The most striking thing in the results is that the estimate of N is roughly exponential but actually grows a bit faster than exponentially. I believe the paper discusses correcting for this bias in the estimator but I need to go back and take a look. Perhaps that will be a topic for another day.