tf-idf Matrix

From GM-RKB

A tf-idf Matrix is a real-valued score matrix composed of tf-idf vectors, so that each entry is a tf-idf score for one term in one document.

  • Context:
  • Example(s):
    • a Term-Document tf-idf Matrix, such as: [math]\displaystyle{ \begin{array}{c|ccccc} & d_1 & \cdots & d_{1,348} & \cdots & d_{20,181} \\ \hline \text{aardvark} & 0.036 & \cdots & 0.0 & \cdots & 0.0 \\ \vdots & \vdots & & \vdots & & \vdots \\ \text{midterm} & 0.0 & \cdots & 0.004 & \cdots & 0.0 \\ \vdots & \vdots & & \vdots & & \vdots \\ \text{zoo} & 0.081 & \cdots & 0.0 & \cdots & 0.0 \end{array} }[/math].
  • Counter-Example(s):
  • See: Co-Occurrence Matrix, Sparse Matrix.
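
A term-by-document matrix like the one in the example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, term frequency normalized by document length, and the unsmoothed weighting idf = ln(N / df). The function name `tfidf_matrix` and the three toy documents are hypothetical. Libraries generally use smoothed or otherwise adjusted variants, so their scores will differ.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is the tf-idf score of
    vocab[i] in docs[j] (rows are terms, columns are documents)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for term in vocab:
        idf = math.log(n_docs / df[term])  # assumed unsmoothed idf
        row = []
        for tokens in tokenized:
            # Term frequency relative to document length (an assumption).
            tf = tokens.count(term) / len(tokens) if tokens else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


# Hypothetical toy corpus.
docs = ["aardvark zoo aardvark", "midterm exam", "zoo exam"]
vocab, matrix = tfidf_matrix(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:10s}", [round(score, 3) for score in row])
```

Most entries are zero because most terms are absent from most documents, which is why such matrices are usually stored as sparse matrices.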


References

2007

  • (Tata & Patel, 2007) ⇒ Sandeep Tata, and Jignesh M. Patel. (2007). “Estimating the Selectivity of tf-idf based Cosine Similarity Predicates.” In: ACM SIGMOD Record, 36(2).
    • QUOTE: An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
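
To make the quantity in the quote concrete, the sketch below computes the exact selectivity of a cosine similarity predicate by brute force, that is, the fraction of stored tf-idf vectors whose cosine similarity to a query vector meets a threshold. This is not the estimation method of Tata & Patel, which is not described here. It only shows the value such an estimator would approximate. The function names, the vectors, and the 0.9 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def selectivity(query, vectors, threshold):
    """Fraction of vectors with cosine(query, vector) >= threshold."""
    if not vectors:
        return 0.0
    hits = sum(1 for v in vectors if cosine(query, v) >= threshold)
    return hits / len(vectors)


# Hypothetical tf-idf vectors, one per stored string or document.
vectors = [
    [0.7, 0.0, 0.1],
    [0.0, 0.2, 0.0],
    [0.6, 0.0, 0.2],
    [0.0, 0.0, 0.3],
]
query = [0.7, 0.0, 0.1]
print(selectivity(query, vectors, 0.9))
```

The brute-force scan touches every vector, which is the cost a query optimizer would want to avoid by estimating the selectivity instead.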