tf-idf Vector Distance Function

A tf-idf Vector Distance Function is a cosine distance function between TF-IDF vectors (based on relative term frequency and inverse document frequency).

Context:
- domain: two tf-idf Vectors and an IDF Model (derived from the same Multiset Set).
- range: a Distance Score.
- Its vector components can be calculated as [math]\displaystyle{ \mathrm{tf\text{-}idf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D) }[/math], and the distance as one minus the cosine of the two resulting vectors.
- It can (often) be used as:
  - a String Distance Function, by mapping each String and the underlying Base Corpus to Multisets (however, it cannot handle the Word Semantic Challenge).
  - a Document Distance Function, by mapping each Document and the underlying Base Corpus as Multisets.
  - an Information Retrieval Ranking Function, to compare Document similarity and distance to a Keyword Query.
  - a TF-IDF Ranking Function.
- ...
Example(s):
- tf-idf Distance({a,b},{b,a},C) = 0
- tf-idf Distance({a,b},{c,d},C) = 1
- IF TF(a)=0.5, THEN TFIDF Distance({a,a,b},{a,b,b})= ???, because IDF(a)= ???
- ...
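The first two examples can be checked with a minimal Python sketch. It assumes relative term frequency, a hypothetical IDF model that assigns every token the same positive weight, and a convention (not stated above) that the distance is 1 when either vector is all zeros. The function names are illustrative only.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Map a token multiset to a sparse tf-idf vector (relative tf times idf)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_distance(u, v):
    """Return 1 - cosine(u, v); assumed to be 1.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def tfidf_distance(x, y, idf):
    """Cosine distance between the tf-idf vectors of two token multisets."""
    return cosine_distance(tfidf_vector(x, idf), tfidf_vector(y, idf))


# Hypothetical IDF model: every token has the same positive weight.
idf = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}

d_same = tfidf_distance(["a", "b"], ["b", "a"], idf)      # same multiset, different order
d_disjoint = tfidf_distance(["a", "b"], ["c", "d"], idf)  # no shared tokens
print(d_same, d_disjoint)
```

Token order does not matter because each input is treated as a multiset, so the first distance is zero up to floating-point rounding. With no shared tokens the dot product is zero, so the second distance is 1.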
Counter-Example(s):
See: Term Vector Space Model; Stop-Word; TF-IDF-based Text-Item Feature Generation Algorithm.

References

2020

(Qi, 2020) ⇒ Zhang Qi. (2020). “The Text Classification of Theft Crime based on TF-IDF and XGBoost Model.” In: 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA).
- NOTE:
  - It utilizes 2622 preprocessed theft crime cases from a city spanning 2009-2019, aiming to enhance crime prediction accuracy using text classification.
  - It employs the TF-IDF (Term Frequency-Inverse Document Frequency) model for feature extraction, determining the relevance of words in the crime data documents.

2015

(Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Tf–idf Retrieved:2015-2-21.
- tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
  The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
  Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
  One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
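The simple ranking function described in the quote (summing tf-idf over the query terms) can be sketched as follows. This is a minimal illustration that assumes raw counts for tf and log(N/df) for idf, which is one common variant among several. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """idf(t) = log(N / df(t)) over tokenized documents (one common variant)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / c) for t, c in df.items()}


def score(query, doc, idf):
    """Sum tf(t, doc) * idf(t) over the query terms, with tf as the raw count."""
    counts = Counter(doc)
    return sum(counts[t] * idf.get(t, 0.0) for t in query)


# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
idf = idf_table(docs)
query = ["cat", "mat"]

# Document indices ordered from most to least relevant to the query.
ranking = sorted(range(len(docs)), key=lambda i: score(query, docs[i], idf), reverse=True)
print(ranking)
```

In this toy corpus "the" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any score. That is the offsetting effect the quote describes.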

2011

(Sammut & Webb, 2011) ⇒ Claude Sammut, and Geoffrey I. Webb. (2011). “TF-IDF.” In: (Sammut & Webb, 2011) p.986

2010

http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html
- QUOTE: Note that there are a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
  Suppose we have a collection docs of n strings, which we will call documents in keeping with tradition. Further let df(t,docs) be the document frequency of token t, that is, the number of documents in which the token t appears. Then the inverse document frequency (IDF) of t is defined by: idf(t,docs) = sqrt(log(n/df(t,docs))).
  If the document frequency df(t,docs) of a term is zero, then idf(t,docs) is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.
  The term vector for a string is then defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by: tf(t,cs) = sqrt(count(t,cs)) . The term-frequency/inverse-document frequency (TF/IDF) vector tfIdf(cs,docs) for a character sequence cs over a collection of documents ds has a value tfIdf(cs,docs)(t) for term t defined by: tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)
  The proximity between character sequences cs1 and cs2 is defined as the cosine of their TF/IDF vectors:
  dist(cs1,cs2) = 1 - cosine(tfIdf(cs1,docs),tfIdf(cs2,docs))
  
  Recall that the cosine of two vectors is the dot product of the vectors divided by their lengths:
  cos(x,y) = x · y / (|x| * |y|)
  where dot products are defined by:
  x · y = Σ_i x[i] * y[i]
  and length is defined by:
  |x| = sqrt(x · x)
  
  Distance is then just 1 minus the proximity value.
  
  distance(cs1,cs2) = 1 - proximity(cs1,cs2)
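The formulas in this quote can be collected into a short Python sketch. It is an illustrative re-implementation of the quoted definitions, not LingPipe's code. The class name, the use of pre-tokenized lists, and the choice to return proximity 0 when a vector has zero length are assumptions.

```python
import math
from collections import Counter


class TfIdfDistanceSketch:
    """Illustrative version of the quoted formulas (hypothetical class, not LingPipe's)."""

    def __init__(self, docs):
        # docs: list of token lists used to estimate document frequencies.
        self.n = len(docs)
        self.df = Counter(t for doc in docs for t in set(doc))

    def idf(self, term):
        """idf(t,docs) = sqrt(log(n / df(t,docs))), or 0 for unseen terms."""
        df = self.df.get(term, 0)
        if df == 0:
            return 0.0
        return math.sqrt(math.log(self.n / df))

    def vector(self, tokens):
        """tfIdf(cs,docs)(t) = sqrt(count(t,cs)) * idf(t,docs)."""
        return {t: math.sqrt(c) * self.idf(t) for t, c in Counter(tokens).items()}

    def proximity(self, tokens1, tokens2):
        """Cosine of the two TF/IDF vectors; assumed 0.0 if either has zero length."""
        u, v = self.vector(tokens1), self.vector(tokens2)
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        len_u = math.sqrt(sum(w * w for w in u.values()))
        len_v = math.sqrt(sum(w * w for w in v.values()))
        if len_u == 0.0 or len_v == 0.0:
            return 0.0
        return dot / (len_u * len_v)

    def distance(self, tokens1, tokens2):
        """distance = 1 - proximity."""
        return 1.0 - self.proximity(tokens1, tokens2)


# Hypothetical training collection of three tokenized "documents".
model = TfIdfDistanceSketch([["a", "b"], ["a", "c"], ["d"]])
print(model.distance(["a", "b"], ["a", "c"]))
```

Because both inputs are weighted in the same way before the cosine is taken, the result is symmetric in its two arguments, as the quote notes.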

2009

http://alias-i.com/lingpipe/demos/tutorial/stringCompare/read-me.html
- TF/IDF Distance: LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation. TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.

2003

(Cohen et al., 2003) ⇒ William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. (2003). “A Comparison of String Distance Metrics for Name-Matching Tasks.” In: Workshop on Information Integration on the Web (IIWeb-03).
- Two strings [math]\displaystyle{ s }[/math] and [math]\displaystyle{ t }[/math] can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets [math]\displaystyle{ S }[/math] and [math]\displaystyle{ T }[/math] is simply [math]\displaystyle{ \frac{|S \cap T|}{|S \cup T|} }[/math]. TFIDF or cosine similarity, which is widely used in the information retrieval community
  (Footnote 1: Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)
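For comparison with the tf-idf distance, the Jaccard similarity over word sets mentioned in this quote (intersection size divided by union size) can be sketched as below. Whitespace tokenization and the value returned for two empty strings are assumptions, and the function name is illustrative.

```python
def jaccard_similarity(s, t):
    """|S ∩ T| / |S ∪ T| over word sets; assumed 1.0 when both strings are empty."""
    s_set, t_set = set(s.split()), set(t.split())
    union = s_set | t_set
    if not union:
        return 1.0
    return len(s_set & t_set) / len(union)


# Two of three distinct tokens are shared, so the similarity is 2/3.
sim = jaccard_similarity("William W. Cohen", "William Cohen")
print(sim)
```

Unlike the tf-idf distance, this measure gives every token the same weight, so a rare surname and a common initial count equally.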
