2003 DocumentClustBasedOnNonNegMatFact

Subject Headings: Text Clustering Algorithm, Matrix Decomposition Algorithm.

Notes

Cited By

Quotes

Abstract

In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
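The assignment rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the factorization X ≈ UVᵀ is fitted with Lee and Seung's multiplicative updates for the squared-error objective, it uses plain Python lists so that it runs without dependencies, and the names `nmf_cluster`, `matmul`, `transpose`, and the toy matrix are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def nmf_cluster(X, k, iters=200, seed=0, eps=1e-9):
    """Cluster the columns (documents) of a non-negative term-document matrix.

    X is terms x documents. The factorization X ~ U V^T is fitted with
    multiplicative updates (an assumption of this sketch), and each document
    is assigned to the axis on which it has the largest coefficient.
    """
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    U = [[rng.random() + eps for _ in range(k)] for _ in range(m)]  # terms x k
    V = [[rng.random() + eps for _ in range(k)] for _ in range(n)]  # docs x k
    Xt = transpose(X)

    for _ in range(iters):
        # U <- U * (X V) / (U V^T V), element-wise
        XV = matmul(X, V)
        UVtV = matmul(U, matmul(transpose(V), V))
        U = [[U[i][j] * XV[i][j] / (UVtV[i][j] + eps) for j in range(k)]
             for i in range(m)]
        # V <- V * (X^T U) / (V U^T U), element-wise
        XtU = matmul(Xt, U)
        VUtU = matmul(V, matmul(transpose(U), U))
        V = [[V[i][j] * XtU[i][j] / (VUtU[i][j] + eps) for j in range(k)]
             for i in range(n)]

    # Rescale V by the column norms of U so that coefficients on different
    # axes are comparable (equivalent to normalizing each column of U).
    norms = [sum(U[i][j] ** 2 for i in range(m)) ** 0.5 for j in range(k)]
    V = [[V[i][j] * norms[j] for j in range(k)] for i in range(n)]

    # Cluster label of a document = index of its largest coefficient.
    return [max(range(k), key=lambda j: row[j]) for row in V]


# Toy term-document matrix: documents 0-1 and documents 2-3 use disjoint terms.
toy_X = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]
print(nmf_cluster(toy_X, k=2))
```

Because every factor stays non-negative, a document can only be built by adding base topics, which is what makes reading the label off the largest coefficient meaningful. The label numbering itself is arbitrary and depends on the random initialization.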

4.1 Data Corpora

We conducted the performance evaluations using the TDT2 and the Reuters document corpora. These two document corpora have been among the ideal test sets for document clustering purposes because documents in the corpora have been manually clustered based on their topics and each document has been assigned one or more labels indicating which topic/topics it belongs to. The TDT2 corpus consists of 100 document clusters, each of which reports a major news event that occurred in 1998. It contains a total of 64527 documents from six news agencies such as ABC, CNN, VOA, NYT, PRI and APW, among which 7803 documents have a unique category label. The number of documents for different news events is very unbalanced, ranging from 1 to 1485. In our experiments, we excluded those events with fewer than 5 documents, which left us with a total of 56 events. The final test set is still very unbalanced, with some large clusters more than 100 times larger than some small ones.
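The corpus preparation described above can be sketched as a small filtering step. This is an illustration only: it assumes the exclusion threshold means "fewer than 5 documents", and the function `filter_small_events`, the document ids, and the event names are hypothetical toy values, not part of the TDT2 corpus or its tooling.

```python
from collections import Counter


def filter_small_events(doc_labels, min_docs=5):
    """Keep uniquely labelled documents whose event has at least min_docs documents.

    doc_labels maps a document id to the list of event labels assigned to it.
    Returns a dict mapping each retained document id to its single event label.
    """
    # Step 1: keep only documents that carry exactly one category label.
    single = {doc: labels[0] for doc, labels in doc_labels.items()
              if len(labels) == 1}
    # Step 2: count documents per event and drop events below the threshold.
    counts = Counter(single.values())
    return {doc: event for doc, event in single.items()
            if counts[event] >= min_docs}


# Toy data (hypothetical): five single-label documents for one event,
# one multi-label document, and two documents for a second, smaller event.
toy_labels = {
    "d0": ["quake"], "d1": ["quake"], "d2": ["quake"],
    "d3": ["quake"], "d4": ["quake"],
    "d5": ["quake", "election"],
    "d6": ["election"], "d7": ["election"],
}
kept = filter_small_events(toy_labels)
print(sorted(kept))
```

Filtering on labels before counting matters here: a multi-label document should not help a small event reach the threshold.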


2003 DocumentClustBasedOnNonNegMatFact
Author: Wei Xu, Xin Liu, Yihong Gong
title: Document Clustering Based on Non-Negative Matrix Factorization
journal: Proceedings of the 26th ACM SIGIR Conference
titleUrl: http://mall.psy.ohio-state.edu/LexicalSemantics/XuLiuGong03.pdf
doi: 10.1145/860435.860485
year: 2003