2000 AComparisonOfDocClustTechniques

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Text Clustering Algorithm, k-Means Clustering Algorithm, Hierarchical Clustering Algorithm.

Notes

Cited By

Quotes

Abstract

2. Evaluation of Cluster Quality

  • We use two metrics for evaluating cluster quality: entropy, which provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering, and the F-measure, which measures the effectiveness of a hierarchical clustering. (The F measure was recently extended to document hierarchies in [5].) Our results will show that the bisecting K-means algorithm has the best performance for these two measures of cluster quality.

4. Agglomerative Hierarchical Techniques

  • We used three different agglomerative hierarchical techniques for clustering documents.
    • Intra-Cluster Similarity Technique: This hierarchical technique looks at the similarity of all the documents in a cluster to their cluster centroid and is defined by ...
    • Centroid Similarity Technique: This hierarchical technique defines the similarity of two clusters to be the cosine similarity between the centroids of the two clusters.
    • UPGMA: This is the UPGMA scheme as described in [2]. It defines the cluster similarity as follows, where d1 and d2 are documents in cluster1 and cluster2, respectively.

6. Why agglomerative hierarchical clustering performs poorly

  • What distinguishes documents of different classes is the frequency with which the words are used. Furthermore, each document has only a subset of all words from the complete vocabulary. Also, because of the probabilistic nature of how words are distributed, any two documents may share many of the same words. Thus, we would expect that two documents could often be nearest neighbors without belonging to the same class.
  • Since, in many cases, the nearest neighbors of a document are of different classes, agglomerative hierarchical clustering will often put documents of the same class in the same cluster, even at the earliest stages of the clustering process. Because of the way that hierarchical clustering works, these “mistakes” cannot be fixed once they happen.
  • In cases where nearest neighbors are unreliable, a different approach is needed that relies on more global properties. (This issue was discussed in a non-document context in [3].) Since computing the cosine similarity of a document to a cluster centroid is the same as computing the average similarity of the document to all the cluster’s documents [6], K-means is implicitly making use of such a “global property” approach. We believe that this explains why K-means does better vis-à-vis agglomerative hierarchical clustering in the document domain, although this is not the case in some other domains.

7. Conclusions

  • This paper presented the results of an experimental study of some common document clustering techniques. In particular, we compared the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. For K-means we used a standard K-means and a variant of Kmeans, bisecting K-means. Our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested. In addition, the run time of bisecting K-means is very attractive when compared to that of agglomerative hierarchical clustering techniques - O(n) versus O(n^2).

References

  • [1] Cutting, D., Karger, D., Pedersen, J. and John W. Tukey, “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” SIGIR ‘92, 318– 329 (1992).
  • [2] Dubes, R. C. and Jain, A. K., Algorithms for Clustering Data, Prentice Hall (1988).
  • [3] Guha, S., Rastogi, R. and Shim, K., “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” ICDE 15 (1999).
  • [4] Kowalski, G., Information Retrieval Systems – Theory and Implementation, Kluwer Academic Publishers (1997).
  • [5] Larsen, B. and Aone, C., “Fast and Effective Text Mining Using Linear-time Document Clustering,” KDD-99, San Diego, California (1999).
  • [6] Steinbach, M., Karypis, G., Vipin Kumar, “A Comparison of Document Clustering Techniques,” University of Minnesota, Technical Report #00-034 (2000). http://www.cs.umn.edu/tech_reports/,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2000 AComparisonOfDocClustTechniquesVipin Kumar
George Karypis
Michael Steinbach
A Comparison of Document Clustering Techniqueshttp://www.cs.cmu.edu/~dunja/KDDpapers/Steinbach IR.pdf