- (Xu et al., 2003) ⇒ Wei Xu, Xin Liu, Yihong Gong. (2003). “Document Clustering Based on Non-Negative Matrix Factorization.” In: Proceedings of the 26th ACM SIGIR Conference (SIGIR 2003). doi:10.1145/860435.860485
- (Chagoyen et al., 2006) ⇒ Monica Chagoyen, Pedro Carmona-Saez, Hagit Shatkay, Jose M Carazo, and Alberto Pascual-Montano. (2006). “Discovering semantic features in the literature: a foundation for building functional associations.” In: BMC Bioinformatics. 7:41.
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
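The clustering rule described above (factorize the term-document matrix, then assign each document to the axis on which it has the largest value) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the standard multiplicative update rules for the squared-error objective, random initialization, a fixed iteration count, and a toy 6-term by 4-document matrix. The names `nmf_cluster`, `n_iter`, `eps` and `seed` are hypothetical.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=200, eps=1e-9, seed=0):
    """Cluster the columns (documents) of a non-negative term-document matrix X.

    Approximates X (terms x docs) by U @ V.T with U, V >= 0, then labels each
    document with the index of its largest weight in V.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps  # terms x topics: one base topic per column
    V = rng.random((n, k)) + eps  # docs x topics: additive topic weights
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    # Rescale so each base topic has unit Euclidean length; U @ V.T is unchanged.
    norms = np.linalg.norm(U, axis=0) + eps
    U /= norms
    V *= norms
    return V.argmax(axis=1), U, V

# Toy corpus (assumed data): documents 0-1 use terms 0-2, documents 2-3 use terms 3-5.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
    [0, 0, 1, 1],
], dtype=float)

labels, U, V = nmf_cluster(X, k=2)
print(labels)
```

Because the factors are non-negative, each row of `V` can be read directly as topic weights, so the cluster label is a plain `argmax` with no further clustering step on the reduced representation.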
4.1 Data Corpora
We conducted the performance evaluations using the TDT2² and the Reuters³ document corpora. These two document corpora have been among the ideal test sets for document clustering purposes because documents in the corpora have been manually clustered based on their topics, and each document has been assigned one or more labels indicating which topic or topics it belongs to. The TDT2 corpus consists of 100 document clusters, each of which reports a major news event that occurred in 1998. It contains a total of 64527 documents from six news agencies (ABC, CNN, VOA, NYT, PRI and APW), among which 7803 documents have a unique category label. The number of documents for different news events is very unbalanced, ranging from 1 to 1485. In our experiments, we excluded those events with fewer than 5 documents, which left us with a total of 56 events. The final test set is still very unbalanced, with some large clusters more than 100 times larger than some small ones.
- 2 NIST topic detection and tracking corpus at http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
- 3 Reuters-21578, distribution 1.0 corpus at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
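The test-set selection described in Section 4.1 can be illustrated with a short sketch. It assumes that only documents with a unique label are counted, that the cut-off is a minimum event size (`min_docs`), and that labels arrive as a plain dictionary; the function name, the parameter, and the toy data are hypothetical and do not reflect the actual corpus format.

```python
from collections import Counter

def select_test_set(doc_labels, min_docs=5):
    """Keep uniquely labeled documents whose event has at least min_docs of them.

    doc_labels maps a document id to the list of event labels assigned to it.
    Returns a dict mapping each retained document id to its single label.
    """
    # Documents with exactly one label have an unambiguous ground-truth cluster.
    single = {doc: labs[0] for doc, labs in doc_labels.items() if len(labs) == 1}
    # Count the uniquely labeled documents per event.
    sizes = Counter(single.values())
    kept_events = {event for event, count in sizes.items() if count >= min_docs}
    return {doc: event for doc, event in single.items() if event in kept_events}

# Toy labels (assumed data): "d3" has two labels, event "B" is too small.
toy = {
    "d1": ["A"],
    "d2": ["A"],
    "d3": ["A", "B"],
    "d4": ["B"],
    "d5": ["A"],
}
selected = select_test_set(toy, min_docs=3)
print(selected)
```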
|2003 DocumentClustBasedOnNonNegMatFact||Wei Xu|
|Document Clustering Based on Non-Negative Matrix Factorization||Proceedings of the 26th ACM SIGIR Conference||http://mall.psy.ohio-state.edu/LexicalSemantics/XuLiuGong03.pdf||10.1145/860435.860485||2003|