tf-idf Score: Difference between revisions

From GM-RKB
=== 2007 ===
* ([[Pazzani & Billsus, 2007]]) ⇒ [[Michael J. Pazzani]], and [[Daniel Billsus]]. ([[2007]]). “Content-based Recommendation Systems.” In: The adaptive web. Springer Berlin Heidelberg, 2007.
** QUOTE: ... associated with a [[text term|term]] is a [[real number score|real number]] that represents the [[importance or relevance]]. This value is called the [[tf-idf Score|tf*idf weight]] ([[term-frequency times inverse document frequency]]). The [[tf-idf Score|tf*idf weight]], w(t,d), of a term t in a document d is a function of the frequency of t in the document (tf<sub>t,d</sub>), the number of documents that contain the term (df<sub>t</sub>) and the number of documents in the collection (N)<ref>Note that in the description of [[tf-idf Score|tf*idf weight]]s, the word “document” is traditionally used since the original motivation was to [[retrieve documents]]. While the chapter will stick with the original terminology, in a recommendation system, the documents correspond to a text description of an item to be recommended. Note that the equations here are representative of the class of formulae called [[tf*idf]]. In general, [[tf*idf system]]s have weights that increase monotonically with [[term frequency]] and decrease monotonically with [[document frequency]].</ref>
<references/>



Revision as of 20:45, 23 December 2019

A tf-idf Score is a non-negative real number score produced by a tf-idf function for a vocabulary member relative to one member of a set of multisets (e.g. a term relative to a document in a corpus).

  • Context:
    • It can (typically) increase with Set Member Frequency (a vocab member that occurs often within a single multiset/document is more informative about it than one that occurs rarely).
    • It can (typically) increase with IDF Score (a vocab member that occurs in many multisets across the entire set/corpus is less informative than one that occurs in few).
    • It can be a member of a tf-idf Vector.
  • Example(s):
    • [math]\displaystyle{ 0 }[/math], when every multiset contains the member.
    • [math]\displaystyle{ 0.0046... }[/math] for [math]\displaystyle{ \operatorname{tf-idf}(``\text{quaint}'',\text{doc}_{184}, \text{Newsgroups 20 corpus}) }[/math], i.e. [math]\displaystyle{ \frac{\log_{10}(200)}{500} \equiv \frac{4}{2{,}000} \times \log_{10}(\frac{8{,}000}{40}) }[/math], if the word quaint occurs 4 times in document [math]\displaystyle{ \text{doc}_{184} }[/math], which has 2,000 words, and is contained in 40 documents of a corpus with 8,000 documents.
  • Counter-Example(s):
  • See: TF-IDF Ranking Function.
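
The quaint example and the zero case above can be checked with a minimal sketch. It assumes one common tf-idf variant: relative term frequency multiplied by a base-10 logarithm of the inverse document frequency. The function name `tf_idf` and its parameter names are illustrative and do not come from the entry. Other variants use a different logarithm base or smoothing and give different values.

```python
import math


def tf_idf(term_count, doc_length, docs_with_term, corpus_size):
    """Hypothetical helper: relative term frequency times log10(N / df)."""
    # Share of the document's tokens taken up by the term.
    tf = term_count / doc_length
    # Base-10 IDF (an assumption); it is 0 when every document contains the term.
    idf = math.log10(corpus_size / docs_with_term)
    return tf * idf


# "quaint": 4 occurrences in a 2,000-word document, present in 40 of 8,000 documents.
quaint_score = tf_idf(term_count=4, doc_length=2000, docs_with_term=40, corpus_size=8000)
print(round(quaint_score, 4))

# A term that appears in every document of the corpus scores 0.
ubiquitous_score = tf_idf(term_count=4, doc_length=2000, docs_with_term=8000, corpus_size=8000)
print(ubiquitous_score)
```

Under these assumptions the first score is about 0.0046; with a natural logarithm the same counts would give roughly 0.0106, so the base should be stated whenever a score is quoted.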


References

2009


2007

  1. Note that in the description of tf*idf weights, the word “document” is traditionally used since the original motivation was to retrieve documents. While the chapter will stick with the original terminology, in a recommendation system, the documents correspond to a text description of an item to be recommended. Note that the equations here are representative of the class of formulae called tf*idf. In general, tf*idf systems have weights that increase monotonically with term frequency and decrease monotonically with document frequency.
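
The monotonicity property stated in the note can be illustrated with a small sketch. The `weight` function below (raw term frequency times the natural log of N over document frequency) is only one assumed member of the tf*idf family, not the specific formula from the cited chapter, and the counts are made up for illustration.

```python
import math


def weight(tf, df, n_docs):
    # Illustrative formula only: raw term frequency times log(N / df).
    return tf * math.log(n_docs / df)


N = 1000  # hypothetical collection size

# Vary term frequency while holding document frequency fixed.
by_tf = [weight(tf, df=10, n_docs=N) for tf in range(1, 6)]
# Vary document frequency while holding term frequency fixed.
by_df = [weight(3, df=df, n_docs=N) for df in (1, 10, 100, 1000)]

increasing = all(a < b for a, b in zip(by_tf, by_tf[1:]))
decreasing = all(a > b for a, b in zip(by_df, by_df[1:]))
print(increasing, decreasing)
```

For this particular formula the weight reaches 0 once the term appears in all N documents.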