tf-idf Vector Distance Function


A tf-idf Vector Distance Function is a cosine distance function between TF-IDF vectors (based on relative term frequency and inverse document frequency).



  • (Wikipedia, 2015) ⇒–idf Retrieved:2015-2-21.
    • tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

      The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

      Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

      One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
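
The simple ranking model described above can be sketched as follows. This is a minimal illustration, not the scheme used by any particular search engine: it assumes a lowercase whitespace tokenizer, raw counts as term frequency, and idf = log(n/df). The names `tokenize` and `rank_documents` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems vary widely.
    return text.lower().split()


def rank_documents(query, docs):
    """Return (doc_index, score) pairs sorted by descending score.

    The score of a document is the sum of tf * idf over the query terms,
    with tf = raw count in the document and idf = log(n / df).
    """
    n = len(docs)
    doc_counts = [Counter(tokenize(d)) for d in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    query_terms = tokenize(query)
    scored = []
    for i, counts in enumerate(doc_counts):
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue  # term never seen in the collection
            score += counts[term] * math.log(n / df[term])
        scored.append((i, score))

    # Stable sort: ties keep their original document order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked at the dog next door",
]
ranked = rank_documents("dog barked", docs)
```

With these three toy documents, the query "dog barked" ranks the third document first (it contains "dog" twice and the rare term "barked"), the second document next, and the first document last with a score of zero.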




    • QUOTE: Note that there are a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.

      Suppose we have a collection docs of n strings, which we will call documents in keeping with tradition. Further let df(t,docs) be the document frequency of token t, that is, the number of documents in which the token t appears. Then the inverse document frequency (IDF) of t is defined by:

      idf(t,docs) = sqrt(log(n/df(t,docs)))

      If the document frequency df(t,docs) of a term is zero, then idf(t,docs) is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.

      The term vector for a string is then defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by:

      tf(t,cs) = sqrt(count(t,cs))

      The term-frequency/inverse-document frequency (TF/IDF) vector tfIdf(cs,docs) for a character sequence cs over a collection of documents docs has a value tfIdf(cs,docs)(t) for term t defined by:

      tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)

      The proximity between character sequences cs1 and cs2 is defined as the cosine of their TF/IDF vectors:

      proximity(cs1,cs2) = cosine(tfIdf(cs1,docs),tfIdf(cs2,docs))

      Recall that the cosine of two vectors is the dot product of the vectors divided by the product of their lengths:

      cos(x,y) = x . y / (|x| * |y| )

      where dot products are defined by:

      x . y = Σi x[i] * y[i]

      and length is defined by:

      |x| = sqrt(x . x)

      Distance is then just 1 minus the proximity value.

      distance(cs1,cs2) = 1 - proximity(cs1,cs2)
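
The definitions quoted above can be put together in a short Python sketch. This is not LingPipe's code (LingPipe is a Java library); it only follows the quoted formulas. The whitespace tokenizer, the function names, and the choice to return a proximity of 0 when either vector has zero length are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the quoted definition is tokenizer-agnostic.
    return text.lower().split()


def train_idf(docs):
    """idf(t,docs) = sqrt(log(n / df(t,docs))) for every token seen in docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.sqrt(math.log(n / d)) for t, d in df.items()}


def tfidf_vector(cs, idf):
    """Sparse vector with tfIdf(t) = sqrt(count(t,cs)) * idf(t,docs).

    Tokens with document frequency zero have idf 0, so they are simply omitted.
    """
    counts = Counter(tokenize(cs))
    return {t: math.sqrt(c) * idf[t] for t, c in counts.items() if t in idf}


def cosine(x, y):
    dot = sum(v * y.get(t, 0.0) for t, v in x.items())
    len_x = math.sqrt(sum(v * v for v in x.values()))
    len_y = math.sqrt(sum(v * v for v in y.values()))
    if len_x == 0.0 or len_y == 0.0:
        return 0.0  # Assumption: zero-length vectors are treated as having no proximity.
    return dot / (len_x * len_y)


def proximity(cs1, cs2, idf):
    return cosine(tfidf_vector(cs1, idf), tfidf_vector(cs2, idf))


def distance(cs1, cs2, idf):
    return 1.0 - proximity(cs1, cs2, idf)


docs = ["new york city", "new york times", "los angeles times"]
idf = train_idf(docs)
d_same = distance("new york city", "new york city", idf)
d_near = distance("new york city", "new york times", idf)
d_far = distance("new york city", "los angeles times", idf)
```

On this toy collection, identical strings are at distance (approximately) 0, strings sharing no tokens are at distance 1, and "new york city" versus "new york times" falls in between. Because the cosine is symmetric in its arguments, so is the distance.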


    • QUOTE: TF/IDF Distance: LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.