TLDR: Count Distinct With Minimal Memory - Flajolet Martin Algorithm
Date: 2020-12-06 Source: https://arpitbhayani.me/blogs/flajolet-martin
Overview
Approximate count-distinct in streams using the Flajolet-Martin algorithm. Dive into its intuition, implementation, and applications. Measuring the number of distinct elements from a stream of values is one of the most common utilities that finds its application in the field of Database Query Optimizations, Network Topology, Internet Routing, Big Data Analytics, and Data Mining.
Key Points
- Measuring the number of distinct elements from a stream of values is one of the most common utilities that finds its application in the field of Database Query Optimizations, Network Topology, Internet Routing, Big Data Analytics, and Data Mining.
- The intuition: Given a good uniform distribution of numbers, the probability that the rightmost set bit is at position 0 is 1/2, probability of rightmost set bit is at position 1 is 1/2 * 1/2 = 1/4, at position 2 it is 1/8 and so on. ! In general, we can say, the probability of the rightmost set bit, in binary presentation, to be at the position k in a uniform distribution of numbers is ! The probability of the rightmost set bit drops by a factor of 1/2 with every position from the Least Significant Bit to the Most Significant Bit. ! So if we keep on recording the position of the rightmost set bit, ρ, for every element in the stream (assuming uniform distribution) we should expect ρ = 0 to be 0.5, ρ = 1 to be 0.25, and so on.