Set Distance Function
- AKA: Set Overlap Measure, Set Similarity Function.
- See: Intersection Set Operation, Bag-of-Words Vector.
- Jaccard Distance: Another common method for comparing strings, which is actually much more efficient to implement, is the so-called "Jaccard distance". The implementation in spell.JaccardDistance operates at the token level: it tokenizes both strings, then divides the number of distinct tokens the strings share by the number of distinct tokens that appear in either string. That ratio is the Jaccard proximity (similarity); the distance is one minus the ratio.
- TF/IDF Distance: LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. Varying the tokenizer yields different behaviors from the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.
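
The two token-based distances above can be illustrated with a minimal Python sketch. It is not LingPipe's code: the whitespace `tokenize` helper, the function names, the square-root dampening of term frequency, and the `log(N/df)` inverse document frequency are all assumptions chosen for illustration, and the exact weighting in spell.TfIdfDistance may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def jaccard_distance(a, b):
    # One minus (shared distinct tokens / distinct tokens in either string).
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0  # convention chosen here: two empty strings are identical
    return 1.0 - len(sa & sb) / len(sa | sb)


def tfidf_vector(text, doc_freq, num_docs):
    # Assumed weighting: sqrt-dampened term frequency times log(N / df).
    vec = {}
    for tok, tf in Counter(tokenize(text)).items():
        df = doc_freq.get(tok, 0)
        if df == 0:
            continue  # tokens never seen in the corpus get no weight
        vec[tok] = math.sqrt(tf) * math.log(num_docs / df)
    return vec


def tfidf_distance(a, b, corpus):
    # Document frequency: number of corpus documents containing each token.
    doc_freq = Counter(tok for doc in corpus for tok in set(tokenize(doc)))
    va = tfidf_vector(a, doc_freq, len(corpus))
    vb = tfidf_vector(b, doc_freq, len(corpus))
    dot = sum(w * vb.get(tok, 0.0) for tok, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # no weighted tokens to compare: treat as maximally distant
    # One minus cosine similarity, clamped against floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))
```

For example, `jaccard_distance("a b c", "b c d")` shares two of four distinct tokens, giving a distance of 0.5. With `tfidf_distance`, a token that appears in every corpus document receives weight zero under this weighting, so only the rarer tokens influence the result.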