TLDR: MinHash - Fast Jaccard Similarity at Scale
Date: 2020-11-08 Source: https://arpitbhayani.me/blogs/jaccard-minhash
Overview
Learn Jaccard Similarity and MinHash, a technique for efficiently estimating set similarity at scale, for tasks like near-duplicate detection.
Key Points
- Set similarity measures are used across Computer Science. Applications include user segmentation, finding near-duplicate webpages/documents, clustering, recommendation generation, sequence alignment, and many more.
- Jaccard Similarity Coefficient as Probability: The Jaccard Coefficient, |A ∩ B| / |A ∪ B|, can also be interpreted as the probability that an element picked uniformly at random from the universal set U = A ∪ B is present in both sets A and B. Figure: https://user-images.githubusercontent.com/4745789/98462221-8dc3bd00-21d8-11eb-95bf-5a9267e88b97.png Another analogy is a dart thrown uniformly at random at the region covered by A ∪ B: the Jaccard Coefficient is the chance that it lands in the intersection.
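
The probability interpretation can be checked numerically. The sketch below is a minimal illustration, not code from the original post: the function names `jaccard` and `jaccard_by_sampling`, the example sets, and the convention that two empty sets have similarity 1.0 are all assumptions made here. It draws elements uniformly from the union A ∪ B, which is the sample space under which the hit rate converges to the Jaccard Coefficient.

```python
import random


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard Coefficient: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_by_sampling(a: set, b: set, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the Jaccard Coefficient by "throwing darts" at A ∪ B.

    Each trial picks one element uniformly at random from the union and
    counts a hit when that element lies in the intersection.
    """
    union = sorted(a | b)  # sorted list so a fixed seed gives a repeatable run
    if not union:
        return 1.0  # same assumed convention as in jaccard()
    both = a & b
    rng = random.Random(seed)
    hits = sum(rng.choice(union) in both for _ in range(trials))
    return hits / trials


# Hypothetical example sets: 2 shared elements out of 8 distinct ones.
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

exact = jaccard(A, B)                  # 2 / 8 = 0.25
estimate = jaccard_by_sampling(A, B)   # close to 0.25, varies with the seed
print(exact, estimate)
```

Sampling is used here only to illustrate the interpretation; for sets that fit in memory the exact formula is both cheaper and more accurate.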