1999 FastAndEffTextMiningUsingLinTimeDocClust

Jump to navigation Jump to search

Subject Headings: Text Clustering Algorithm.


Cited By


Author Keywords

Clustering, text mining, multi-document summarization


Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space, then clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-Measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tfidf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, “vector average damping,” that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.


  • (Cutting et al., 1992) ⇒ Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. (1992). “Scatter/Gather: a cluster-based approach to browsing large document collections.” In: Proceedings of the 15th ACM SIGIR Conference retrieval (SIGIR 1992).
  • Douglass R. Cutting, David R. Karger, Jan O. Pedersen, Constant interaction-time scatter/gather browsing of very large document collections, Proceedings of the 16th ACM SIGIR Conference retrieval, p.126-134, June 27-July 01, 1993, Pittsburgh, Pennsylvania, United States doi:10.1145/160688.160706
  • A. El-Hamdouchi, P. Willett, Hierarchic document classification using Ward's clustering method, Proceedings of the 9th ACM SIGIR Conference retrieval, p.149-156, September 1986, Palazzo dei Congressi, Pisa, Italy doi:10.1145/253168.253200
  • Faber, V. "Clustering and the Continuous k-Means Algorithm", Los Alamos Science, November 22, 1994.
  • Harman, D. and Ellen Voorhees., editors. Proceedings of the Fifth Text Retrieval Conference (TREC-5). National Institute of Standards and Technology, Department of Commerce. 1996.
  • Marti A. Hearst, Jan O. Pedersen, Reexamining the cluster hypothesis: scatter/gather on retrieval results, Proceedings of the 19th ACM SIGIR Conference retrieval, p.76-84, August 18-22, 1996, Zurich, Switzerland doi:10.1145/243199.243216
  • Lagus, K., Honkela, T., Kaski, S., and Kohonen, T. "Self- Organizing Maps of Document Collections: A New Approach to Interactive Exploration," In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
  • MacQueen, J. "Some methods for classification and analysis of multivariate observations," In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume I, Statistics. Edited by Lucien M. Le Cam and Jerzy Neyman. University of California Press. 1967.
  • Peter Pirolli, Patricia Schank, Marti Hearst, Christine Diehl, Scatter/gather browsing communicates the topic structure of a very large text collection, Proceedings of the SIGCHI conference on Human factors in computing systems: common ground, p.213-220, April 13-18, 1996, Vancouver, British Columbia, Canada doi:10.1145/238386.238489.
  • Mehran Sahami, Salim Yusufali, Michelle Q. W. Baldonaldo, SONIA: a service for organizing networked information autonomously, Proceedings of the third ACM conference on Digital libraries, p.200-209, June 23-26, 1998, Pittsburgh, Pennsylvania, United States doi:10.1145/276675.276697
  • Gerard M. Salton, Automatic text processing: the transformation, analysis, and retrieval of information by computer, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989.
  • Hinrich Schütze, Craig Silverstein, Projections for efficient document clustering, Proceedings of the 20th ACM SIGIR Conference retrieval, p.74-81, July 27-31, 1997, Philadelphia, Pennsylvania, United States.
  • Yiming Yang, Tom Pierce, Jaime Carbonell, A study of retrospective and on-line event detection, Proceedings of the 21st ACM SIGIR Conference retrieval, p.28-36, August 24-28, 1998, Melbourne, Australia doi:10.1145/290941.290953
  • Zamir, O., Oren Etzioni, Madani, O., and Karp, R. "Fast and Intuitive Clustering of Web Documents," In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, (1997).


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
1999 FastAndEffTextMiningUsingLinTimeDocClustBjornar Larsen
Chinatsu Aone
Fast and Effective Text Mining Using Linear-time Document Clusteringhttp://www.scils.rutgers.edu/~muresan/IR/Docs/Articles/sigkddLarsen1999.pdf10.1145/312129.312186