# tf-idf Vector Distance Function

A tf-idf Vector Distance Function is a cosine distance function between TF-IDF vectors (based on relative term frequency and inverse document frequency).

**Context**:
- **domain:** 2 tf-idf Vectors; and an IDF Model (derived from the same multiset collection).
- **range:** a Distance Score.
- It can be calculated from the tf-idf weights [math]\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D)[/math].
- It can (often) be used as:
  - a String Distance Function, by mapping each string and the underlying Base Corpus to Multisets (however, it cannot handle the Word Semantic Challenge).
  - a Document Distance Function, by mapping each Document and the underlying Base Corpus to Multisets.
  - an Information Retrieval Ranking Function, to compare a Document's similarity and distance to a Keyword Query.
  - a TF-IDF Ranking Function.

**Example(s):**
- tf-idf Distance({a,b}, {b,a}, C) = 0.
- tf-idf Distance({a,b}, {c,d}, C) = 1.
- IF TF(a)=0.5, THEN TFIDF Distance({a,a,b}, {a,b,b}) = ???, because IDF(a) = ???.
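The first two examples can be reproduced with a short sketch. The IDF weights below stand in for a hypothetical IDF model C (the corpus and weights are illustrative assumptions, not from the source); note that for identical multisets the cosine distance is 0, and for multisets with no shared terms it is 1, regardless of the particular positive weights chosen:

```python
import math
from collections import Counter

def tfidf_vector(doc, idf):
    """tf-idf weight per term: raw term frequency times the term's idf weight."""
    tf = Counter(doc)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine_distance(u, v):
    """1 minus the cosine similarity of two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical IDF model C: any positive weights suffice for these two examples.
idf = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}

d_same = cosine_distance(tfidf_vector(["a", "b"], idf),
                         tfidf_vector(["b", "a"], idf))      # identical multisets
d_disjoint = cosine_distance(tfidf_vector(["a", "b"], idf),
                             tfidf_vector(["c", "d"], idf))  # no shared terms
print(d_same, d_disjoint)
```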

**Counter-Example(s):**

**See:** Term Vector Space Model; Stop-Words.

## References

### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Tf–idf Retrieved:2015-2-21.
- QUOTE: **tf–idf**, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
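The "simplest ranking function" mentioned above can be sketched as follows. The toy corpus, query, and the unsmoothed `idf` variant are illustrative assumptions; real search engines use more elaborate weighting and smoothing:

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (illustrative assumption).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum computing with qubits".split(),
]
n = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for d in docs for t in set(d))

def idf(t):
    """Unsmoothed inverse document frequency; 0 for unseen terms."""
    return math.log(n / df[t]) if df[t] else 0.0

def score(query, doc):
    """Simplest ranking function: sum the tf-idf weight of each query term."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t) for t in query)

query = ["cat", "mat"]
ranked = sorted(range(n), key=lambda i: score(query, docs[i]), reverse=True)
print(ranked)
```

Document 0 ranks first because it matches both query terms, including the rarer term "mat".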

### 2012

- http://en.wikipedia.org/wiki/Tf*idf
- QUOTE: The **tf*idf** weight (term frequency–inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. Variations of the tf*idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf*idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification. One of the simplest ranking functions is computed by summing the tf*idf for each query term; many more sophisticated ranking functions are variants of this simple model.


### 2011

- (Sammut & Webb, 2011) ⇒ Claude Sammut, and Geoffrey I. Webb. (2011). “TF-IDF.” In: *Encyclopedia of Machine Learning*, p. 986.

### 2010

- http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html
- QUOTE: Note that there are a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
Suppose we have a collection `docs` of `n` strings, which we will call documents in keeping with tradition. Further let `df(t,docs)` be the document frequency of token `t`, that is, the number of documents in which the token `t` appears. Then the inverse document frequency (IDF) of `t` is defined by:

`idf(t,docs) = sqrt(log(n/df(t,docs)))`

If the document frequency `df(t,docs)` of a term is zero, then `idf(t,docs)` is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.

The term vector for a string is then defined by its term frequencies. If `count(t,cs)` is the count of term `t` in character sequence `cs`, then the term frequency (TF) is defined by:

`tf(t,cs) = sqrt(count(t,cs))`

The term-frequency/inverse-document-frequency (TF/IDF) vector `tfIdf(cs,docs)` for a character sequence `cs` over a collection of documents `docs` has a value `tfIdf(cs,docs)(t)` for term `t` defined by:

`tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)`

The proximity between character sequences `cs1` and `cs2` is defined as the cosine of their TF/IDF vectors:

`proximity(cs1,cs2) = cosine(tfIdf(cs1,docs), tfIdf(cs2,docs))`

Recall that the cosine of two vectors is the dot product of the vectors divided by their lengths:

`cos(x,y) = x · y / (|x| * |y|)`

where dot products are defined by:

`x · y = Σ_i x[i] * y[i]`

and length is defined by:

`|x| = sqrt(x · x)`

Distance is then just 1 minus the proximity value:

`distance(cs1,cs2) = 1 - proximity(cs1,cs2)`
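The definitions above can be sketched directly in code. This is a minimal reading of the LingPipe documentation quoted here, not its actual implementation; whitespace tokenization and the example strings are assumptions (LingPipe uses pluggable tokenizers):

```python
import math
from collections import Counter

class TfIdfDistance:
    """Sketch of the symmetric TF/IDF distance described above:
    idf(t) = sqrt(log(n/df(t))), tf(t) = sqrt(count(t)),
    distance = 1 - cosine of the TF/IDF vectors."""

    def __init__(self, docs):
        self.n = len(docs)
        # Document frequency over whitespace tokens (assumed tokenizer).
        self.df = Counter(t for d in docs for t in set(d.split()))

    def _idf(self, t):
        df = self.df.get(t, 0)
        # Zero for unseen terms, so only trained terms affect comparison.
        return math.sqrt(math.log(self.n / df)) if df else 0.0

    def _vector(self, cs):
        counts = Counter(cs.split())
        return {t: math.sqrt(c) * self._idf(t) for t, c in counts.items()}

    def distance(self, cs1, cs2):
        u, v = self._vector(cs1), self._vector(cs2)
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        if nu == 0.0 or nv == 0.0:
            return 1.0
        return 1.0 - dot / (nu * nv)

# Example training collection (hypothetical strings).
m = TfIdfDistance(["john smith", "jon smith", "jane doe"])
print(m.distance("john smith", "jon smith"))
```

Unlike typical IR-style TF/IDF scores, this distance is symmetric: `m.distance(a, b)` always equals `m.distance(b, a)`.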


### 2009

- http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- QUOTE: **TF/IDF Distance**: LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.

### 2003

- (Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- QUOTE: Two strings [math]s[/math] and [math]t[/math] can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets [math]S[/math] and [math]T[/math] is simply [math]\frac{|S \cap T|}{|S \cup T|}[/math]. TFIDF or cosine similarity, which is widely used in the information retrieval community …
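The Jaccard measure quoted above is straightforward to sketch; whitespace tokenization and the example strings are illustrative assumptions:

```python
def jaccard(s, t):
    """Jaccard similarity between token sets: |S ∩ T| / |S ∪ T|."""
    S, T = set(s.split()), set(t.split())
    return len(S & T) / len(S | T) if S | T else 1.0

# Two of three distinct tokens are shared.
sim = jaccard("william w cohen", "william cohen")
print(sim)
```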