MinHash - Fast Jaccard Similarity at Scale
Source: https://arpitbhayani.me/blogs/jaccard-minhash Date: 2020-11-08
Learn Jaccard Similarity and MinHash, a technique to efficiently estimate set similarity at scale for tasks like near-duplicate detection.
Set similarity measures find applications across the Computer Science spectrum, including user segmentation, finding near-duplicate webpages and documents, clustering, recommendation generation, sequence alignment, and many more. In this essay, we take a detailed look at a set-similarity measure called Jaccard's Similarity Coefficient and how its computation can be optimized using a neat technique called MinHash.
Jaccard Similarity Coefficient
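The Jaccard Similarity Coefficient quantifies how similar two finite sets are and is defined as the size of their intersection divided by the size of their union; for two sets A and B, J(A, B) = |A ∩ B| / |A ∪ B|. This similarity measure is very intuitive, and because the intersection can never be larger than the union, it is a real-valued measure bounded in the interval [0, 1]: 0 when the sets share no elements and 1 when they are identical.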
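To make the definition concrete, here is a minimal Python sketch of the coefficient computed directly from two sets. The function name `jaccard_similarity` and the sample word sets are illustrative assumptions, not part of the original essay. Returning 1.0 for two empty sets is also an assumed convention, since the ratio is undefined in that case.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two finite sets."""
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined.
        # Assumption: treat two empty sets as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical example: two tiny "documents" represented as sets of words.
doc_a = {"the", "quick", "brown", "fox"}
doc_b = {"the", "lazy", "brown", "dog"}

# The sets share 2 words ("the", "brown") out of 6 distinct words overall.
print(jaccard_similarity(doc_a, doc_b))
```

Here the two sets share 2 of their 6 distinct words, so the coefficient is 2/6, roughly 0.33. Disjoint sets give 0 and identical sets give 1, matching the bounds stated above.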




