tf-idf Score

From GM-RKB

A tf-idf Score is a non-negative real number score from a tf-idf function (for a vocabulary member relative to a multiset set member).

  • Context:
    • It can (typically) increase with respect to Set Member Frequency (frequent vocab members within a single multiset/document are more informative than rare items).
    • It can (typically) increase with respect to IDF Score (vocab members that are frequent across an entire multiset set/corpus are less informative than rare ones).
    • It can be a member of a tf-idf Vector.
  • Example(s):
    • [math]\displaystyle{ 0 }[/math], when every multiset contains the member, because the IDF Score is then [math]\displaystyle{ \log(1) = 0 }[/math].
    • [math]\displaystyle{ 0.0046... }[/math] for [math]\displaystyle{ \operatorname{tf-idf}(``\text{quaint}'',\text{doc}_{184}, \text{Newsgroups 20 corpus}) }[/math], i.e. [math]\displaystyle{ \frac{\log_{10}(200)}{500} = \frac{4}{2,000} \times \log_{10}(\frac{8,000}{40}) }[/math] (using a base-10 logarithm), if the word quaint is present 4 times in document [math]\displaystyle{ \text{doc}_{184} }[/math] with 2,000 words, and is contained in 40 documents from a corpus with 8,000 documents.
  • Counter-Example(s):
  • See: TF-IDF Ranking Function.
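
The arithmetic in the "quaint" example can be checked with a minimal sketch. It assumes the simple variant in which tf is the raw count divided by document length and idf is the logarithm of corpus size over document frequency. The function name `tf_idf`, its parameter names, and the base-10 default are illustrative assumptions, not part of any particular library.

```python
import math

def tf_idf(term_count, doc_length, docs_with_term, corpus_size, log=math.log10):
    """Hypothetical helper: (term_count / doc_length) * log(corpus_size / docs_with_term)."""
    tf = term_count / doc_length              # relative frequency within one document
    idf = log(corpus_size / docs_with_term)   # rarity of the term across the corpus
    return tf * idf

# "quaint": 4 occurrences in a 2,000-word document, found in 40 of 8,000 documents
score = tf_idf(4, 2000, 40, 8000)
print(round(score, 4))

# A term that occurs in every document gets idf = log(1) = 0, so the score is 0
everywhere = tf_idf(4, 2000, 8000, 8000)
print(everywhere)
```

With a base-10 logarithm the first score is about 0.0046; passing `log=math.log` (natural logarithm) instead gives about 0.0106, so the base should be stated whenever a numeric score is quoted.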


References

2009


2007

  1. Note that in the description of tf*idf weights, the word “document” is traditionally used since the original motivation was to retrieve documents. While the chapter will stick with the original terminology, in a recommendation system, the documents correspond to a text description of an item to be recommended. Note that the equations here are representative of the class of formulae called tf*idf. In general, tf*idf systems have weights that increase monotonically with term frequency and decrease monotonically with document frequency.
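
The monotonicity property stated in this note can be illustrated numerically. The sketch below assumes one representative member of the tf*idf class, weight = tf × log(N / df). The function name `tf_idf_weight` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def tf_idf_weight(tf, df, n_docs):
    """One representative tf*idf formula (an assumption): tf * log(n_docs / df)."""
    return tf * math.log(n_docs / df)

n_docs = 1000

# Hold document frequency fixed and raise term frequency: weights should not decrease
by_tf = [tf_idf_weight(tf, 10, n_docs) for tf in (1, 2, 5, 10)]

# Hold term frequency fixed and raise document frequency: weights should not increase
by_df = [tf_idf_weight(3, df, n_docs) for df in (1, 10, 100, 1000)]

print(by_tf == sorted(by_tf))                # increasing in term frequency
print(by_df == sorted(by_df, reverse=True))  # decreasing in document frequency
```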