A HyperLogLog is a probabilistic data structure for counting unique values, i.e. for estimating the cardinality of a data set. Computing the exact count of distinct elements in a large data set is often necessary but computationally expensive. Say you need to calculate the number of distinct users who visited your website in the past week. Doing this with a traditional SQL query over a large data set would take a long time and a large amount of memory. But if an approximation is acceptable instead of an exact count, the HyperLogLog algorithm can produce it almost instantly using a very small amount of memory. In this post I’m going to give a quick introduction to Redis HyperLogLog, which I’ll call HLL in the rest of this article.
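To build some intuition before diving into the Redis commands, here is a toy, pure-Python sketch of the HyperLogLog idea (this is an illustration of the algorithm, not Redis’s actual implementation): hash each element, use the first few bits of the hash to pick a register, and have each register remember the longest run of leading zeros it has seen; the registers are then combined into a cardinality estimate.

```python
import hashlib
import math

def leading_zeros(x: int, bits: int) -> int:
    """Number of leading zero bits in an unsigned `bits`-wide integer."""
    if x == 0:
        return bits
    n = 0
    mask = 1 << (bits - 1)
    while x & mask == 0:
        n += 1
        mask >>= 1
    return n

class TinyHLL:
    """Toy HyperLogLog: m = 2**p registers, each remembering the longest
    run of leading zeros among the hashes routed to it."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash; the first p bits pick a register, the rest are ranked
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = leading_zeros(rest, 64 - self.p) + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Bias-corrected harmonic mean of the per-register estimates
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = TinyHLL()
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # close to 10000 (within a few percent)
```

Note how memory stays fixed at m small registers no matter how many elements you add — that is exactly the trade-off Redis exploits.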
Let’s start. First you need to start your Redis server. See my previous posts on how to install and run a Redis server instance.
Now connect to the Redis server instance by running redis-cli.
Redis provides the following three commands to work with HLLs: PFADD, PFCOUNT, and PFMERGE.
These commands are prefixed with PF in memory of Philippe Flajolet, the inventor of the HyperLogLog algorithm.
The basic idea of using Redis HyperLogLog is this: every time you see a new element, you add it to the count with PFADD, and when you want to retrieve the current approximation of the unique elements added so far, you use PFCOUNT.
Let’s say you want to calculate the unique visitor count for your website. When user “alice” first visits the site, you add her to an HLL named “visitors”.
> PFADD visitors alice
(integer) 1
PFADD returns 1 if the approximated cardinality of the HLL changed when adding the element. If not, it returns 0.
Next, users “bob” and “carol” visit the site, so you add them to the “visitors” HLL.
> PFADD visitors bob
(integer) 1
> PFADD visitors carol
(integer) 1
Redis also supports adding multiple elements with a single command:
> PFADD visitors bob carol
(integer) 1
Now, to count the number of unique visitors so far, use PFCOUNT.
> PFCOUNT visitors
(integer) 3
PFCOUNT caches the previously computed cardinality; the cache is invalidated by a subsequent PFADD only if that PFADD changes the cardinality. The returned cardinality of the observed set is not exact, but approximated with a standard error of 0.81%.
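That 0.81% figure comes from HyperLogLog’s theoretical standard error of 1.04/√m, where m is the number of registers; Redis uses m = 2¹⁴ = 16384 registers, which is why an HLL key never grows beyond roughly 12 KB. A quick check of the arithmetic:

```python
import math

m = 16384  # registers in Redis's HLL (2**14)
std_err = 1.04 / math.sqrt(m)
print(f"{std_err:.2%}")  # ≈ 0.81%
```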
Let’s say user “alice” visits the site once again, so you add her to the HLL as before:
> PFADD visitors alice
(integer) 0
But this time the output of the command is 0, since “alice” is already in the HLL. Calling PFCOUNT will return 3, not 4.
> PFCOUNT visitors
(integer) 3
Internally, an HLL is stored as a regular Redis string, so you can read it with GET and write it back with SET:

> GET visitors
"HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
> SET users "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
OK
> PFCOUNT users
(integer) 3
This is useful if the HyperLogLog needs to be copied or persisted elsewhere and later restored.
PFMERGE merges multiple HLLs into a single HLL whose value approximates the cardinality of the union of the source HyperLogLogs:
> PFADD subscribers alice ted
(integer) 1
> PFMERGE everyone visitors subscribers
OK
> PFCOUNT everyone
(integer) 4
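To see why the merged count is 4 and not 5: PFMERGE approximates a set union, so an element present in several HLLs (here “alice”) is counted only once. The exact-set analogue in Python:

```python
visitors = {"alice", "bob", "carol"}
subscribers = {"alice", "ted"}

# PFMERGE everyone visitors subscribers approximates this union
everyone = visitors | subscribers
print(len(everyone))  # prints 4 -- "alice" is counted only once
```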
The same result can be achieved by supplying multiple keys to PFCOUNT:
> PFCOUNT visitors subscribers
(integer) 4
When PFCOUNT is called with multiple keys, an on-the-fly merge of the HyperLogLogs is performed, which is slow; moreover, the cardinality of the union cannot be cached. So when used with multiple keys, PFCOUNT may take time on the order of magnitude of a millisecond. Keep in mind that single-key and multiple-key executions of PFCOUNT are semantically different and have different performance characteristics. When you need to run PFCOUNT with multiple keys several times, it is better to use PFMERGE first and then run PFCOUNT on the merged key.
That’s it. Now you should have a good understanding of Redis HyperLogLog, and you should be able to use it in your applications to simplify calculating distinct counts. Thanks for reading.