2002 EvaluationOfHierClustAlgsForDocDatasets

From GM-RKB

Subject Headings: Text Clustering Algorithm, Hierarchical Clustering Algorithm, CLUTO System, Agglomerative Clustering Algorithm, Partitional Clustering Algorithm.

Notes

Quotes

Abstract

  • Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.

1. Introduction

  • Hierarchical clustering solutions, which are in the form of trees called dendrograms, are of great interest for a number of application domains. Hierarchical trees provide a view of the data at different levels of abstraction. The consistency of clustering solutions at different levels of granularity allows flat partitions of different granularity to be extracted during data analysis, making them ideal for interactive exploration and visualization. In addition, there are many times when clusters have subclusters, and a hierarchical structure is indeed a natural constraint on the underlying application domain (e.g., biological taxonomy, phylogenetic trees) [9]. Hierarchical clustering solutions have been primarily obtained using agglomerative algorithms [25, 17, 10, 11, 16], in which each object is initially assigned to its own cluster and then pairs of clusters are repeatedly merged until the whole tree is formed. However, partitional algorithms [20, 14, 22, 5, 31, 13, 27, 2, 8] can also be used to obtain hierarchical clustering solutions via a sequence of repeated bisections. In recent years, various researchers have recognized that partitional clustering algorithms are well-suited for clustering large document datasets due to their relatively low computational requirements [6, 18, 1, 26]. However, there is the common belief that in terms of clustering quality, partitional algorithms are actually inferior and less effective than their agglomerative counterparts. This belief is based both on experiments with low-dimensional datasets and on a limited number of studies in which agglomerative approaches outperformed partitional K-means based approaches. For example, Larsen and Aone [18] observed that group average greedy agglomerative clustering outperformed various partitional clustering algorithms on document datasets from TREC and Reuters.
  • In light of recent advances in partitional clustering [6, 18, 7, 2, 8], we revisited the question of whether or not agglomerative approaches generate superior hierarchical trees than partitional approaches do. The focus of this paper is to compare various agglomerative and partitional approaches for the task of obtaining hierarchical clustering solutions. The partitional methods that we compared use different clustering criterion functions to derive the solutions, and the agglomerative methods use different schemes for selecting the pair of clusters to merge next. For partitional clustering algorithms, we used six recently studied criterion functions [32] that have been shown to produce high-quality partitional clustering solutions. For agglomerative clustering algorithms, we evaluated three traditional merging criteria (i.e., single-link, complete-link, and group average (UPGMA)) and a new set of merging criteria derived from the six partitional criterion functions. Overall, we compared six partitional methods and nine agglomerative methods.
  • In addition to the traditional partitional and agglomerative algorithms, we developed a new class of agglomerative algorithms, in which we introduced intermediate clusters obtained by partitional clustering algorithms to constrain the space over which agglomeration decisions are made. We refer to them as constrained agglomerative algorithms. These algorithms generate hierarchical trees in two steps. First, for each of the intermediate partitional clusters, an agglomerative algorithm builds a hierarchical subtree. Second, the subtrees are combined into a single tree by building an upper tree using these subtrees as leaves.
  • We experimentally evaluated the performance of these methods to obtain hierarchical clustering solutions using twelve different datasets derived from various sources. Our experiments showed that partitional algorithms always generate better hierarchical clustering solutions than agglomerative algorithms, and that the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone and in most cases outperform partitional methods as well. We believe that the observed poor performance of agglomerative algorithms stems from the errors they make during early agglomeration. The superiority of partitional algorithms also suggests that partitional clustering algorithms are well-suited for obtaining hierarchical clustering solutions of large document datasets due to not only their relatively low computational requirements, but also comparable or better performance.
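The group-average (UPGMA) merging criterion discussed in the introduction can be sketched in a few lines. This is a hypothetical pure-Python illustration on 2-D points standing in for document vectors (a real document system would use cosine similarity over tf-idf vectors, as in the paper's setting); it is not the implementation the paper evaluates.

```python
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def upgma(points):
    """Group-average agglomerative clustering: repeatedly merge the pair
    of clusters with the smallest mean pairwise distance, recording the
    merge history until a single cluster remains."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def group_avg(c1, c2):
        # mean pairwise distance between members of the two clusters
        return sum(euclidean(points[i], points[j])
                   for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: group_avg(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges

# two tight pairs far apart: each pair merges before the final cross merge
history = upgma([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)])
```

The merge history is exactly the bottom-up construction of a dendrogram; note the quadratic pair scan at every step, which is one source of the high computational cost of agglomerative methods that the paper contrasts with partitional ones.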
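The repeated-bisection route to a hierarchical tree mentioned in the introduction can likewise be sketched. The split below uses a tiny 2-means on 1-D values purely as a stand-in for the six criterion functions of [32]; the function names and data are illustrative assumptions, not the paper's method.

```python
import random

def two_means(xs, iters=20, seed=0):
    """Minimal 2-means on 1-D values; returns the two parts."""
    rng = random.Random(seed)
    centers = rng.sample(xs, 2)
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for x in xs:
            parts[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
        # recompute centers; keep the old one if a part went empty
        centers = [sum(p) / len(p) if p else centers[i]
                   for i, p in enumerate(parts)]
    return parts

def repeated_bisection(xs):
    """Build a top-down binary tree by bisecting until singletons."""
    if len(xs) == 1:
        return xs[0]
    left, right = two_means(xs)
    if not left or not right:   # degenerate split: stop subdividing
        return tuple(xs)
    return (repeated_bisection(left), repeated_bisection(right))

def tree_leaves(t):
    """Flatten a nested-tuple tree back to its leaf values."""
    if not isinstance(t, tuple):
        return [t]
    return [x for sub in t for x in tree_leaves(sub)]

tree = repeated_bisection([0.0, 0.2, 5.0, 5.3, 9.0])
```

Each bisection costs roughly one k-means run over the current cluster, which is why this route scales better than all-pairs agglomeration on large collections.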
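The two-step constrained agglomerative scheme can be sketched as follows: agglomeration runs first within each intermediate partitional cluster, then over the resulting subtree roots. Single-link merging over 1-D points is used here purely for brevity, and the intermediate partition is supplied by hand where the paper would obtain it from a partitional algorithm; all names are illustrative assumptions.

```python
def leaves(t):
    """Flatten a nested-tuple tree to its leaf ids."""
    return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

def agglomerate(trees, dist):
    """Naive agglomeration: repeatedly merge the closest pair of trees."""
    trees = list(trees)
    while len(trees) > 1:
        best = None
        for i in range(len(trees)):
            for j in range(i + 1, len(trees)):
                d = dist(trees[i], trees[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]

def constrained_tree(partition, points):
    """Step 1: agglomerate within each intermediate partitional cluster.
    Step 2: agglomerate the resulting subtrees into the upper tree."""
    def single_link(t1, t2):
        return min(abs(points[a] - points[b])
                   for a in leaves(t1) for b in leaves(t2))
    subtrees = [agglomerate(part, single_link) for part in partition]
    return agglomerate(subtrees, single_link)

# 1-D "documents" and a hand-made intermediate partition
pts = {0: 0.0, 1: 0.2, 2: 5.0, 3: 5.3, 4: 9.0}
tree = constrained_tree([[0, 1], [2, 3, 4]], pts)
```

Because no merge may cross a partition boundary until the upper tree is built, early agglomeration errors between distant documents are ruled out by construction, which is the intuition the paper offers for the method's consistently better trees.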

7. Concluding Remarks

  • In this paper we experimentally evaluated nine agglomerative algorithms and six partitional algorithms to obtain hierarchical clustering solutions for document datasets. We also introduced a new class of agglomerative algorithms by constraining the agglomeration process using clusters obtained by partitional algorithms. Our experimental results showed that partitional methods produced better hierarchical solutions than agglomerative methods, and that the constrained agglomerative methods improved the clustering solutions obtained by agglomerative or partitional methods alone. These results suggest that the poor performance of agglomerative methods may be attributed to the merging errors they make during early stages, which can be eliminated to some extent by introducing partitional constraints.

References

  • 1. Charu C. Aggarwal, Stephen C. Gates, Philip S. Yu, On the merits of building categorization systems by supervised clustering, Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.352-356, August 15-18, 1999, San Diego, California, United States  doi:10.1145/312129.312279
  • 2. Daniel Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery, v.2 n.4, p.325-344, December 1998  doi:10.1023/A:1009740529316
  • 3. Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, Jerome Moore, Document Categorization and Query Generation on the World Wide Web Using WebACE, Artificial Intelligence Review, v.13 n.5-6, p.365-391, Dec. 1999
  • 4. Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, George Karypis, Vipin Kumar, Bamshad Mobasher, Jerome Moore, Kyle Hastings, Partitioning-based clustering for Web document categorization, Decision Support Systems, v.27 n.3, p.329-341, Dec. 1999  doi:10.1016/S0167-9236(99)00055-X
  • 5. Peter Cheeseman, John Stutz, Bayesian classification (AutoClass): theory and results, Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, Menlo Park, CA, 1996
  • 6. Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark  doi:10.1145/133160.133214
  • 7. Inderjit Dhillon, and D. Modha. Concept decomposition for large sparse text data using clustering. Technical Report Research Report RJ 10147, IBM Almaden Research Center, 1999.
  • 8. C. Ding, X. He, H. Zha, M. Gu, and H. Simon. Spectral min-max cut for graph partitioning and data clustering. Technical Report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, 2001.
  • 9. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2000
  • 10. Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD Conference, p.73-84, June 01-04, 1998, Seattle, Washington, United States
  • 11. Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, Proceedings of the 15th International Conference on Data Engineering, p.512, March 23-26, 1999
  • 12. Eui-Hong Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, Jerome Moore, WebACE: a Web agent for document categorization and exploration, Proceedings of the second International Conference on Autonomous agents, p.408-415, May 10-13, 1998, Minneapolis, Minnesota, United States  doi:10.1145/280765.280872
  • 13. E. Han, G. Karypis, Vipin Kumar, and B. Mobasher. Hypergraph based clustering in high dimensional data sets: A summary of results. Bulletin of the Technical Committee on Data Engineering, 21(1), 1998.
  • 14. Anil K. Jain, Richard C. Dubes, Algorithms for clustering data, Prentice-Hall, Inc., Upper Saddle River, NJ, 1988
  • 15. G. Karypis and E. Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Technical Report TR-00-016, Department of Computer Science, University of Minnesota, Minneapolis, (2000). Available on the WWW at URL http://www.cs.umn.edu/~karypis.
  • 16. George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, Chameleon: Hierarchical Clustering Using Dynamic Modeling, Computer, v.32 n.8, p.68-75, August 1999  doi:10.1109/2.781637
  • 17. B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86--101, 1967.
  • 18. Bjornar Larsen, Chinatsu Aone, Fast and effective text mining using linear-time document clustering, Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.16-22, August 15-18, 1999, San Diego, California, United States  doi:10.1145/312129.312186
  • 19. D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 1999.
  • 20. J. MacQueen. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281--297, 1967.
  • 21. J. Moore, E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, Vipin Kumar, and B. Mobasher. Web page categorization and feature selection using association rule and principal component clustering. In 7th Workshop on Information Technologies and Systems, Dec. 1997.
  • 22. Raymond T. Ng, Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining, Proceedings of the 20th International Conference on Very Large Data Bases, p.144-155, September 12-15, 1994
  • 23. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.
  • 24. Gerard M. Salton, Automatic text processing: the transformation, analysis, and retrieval of information by computer, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989
  • 25. P. H. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.
  • 26. M. Steinbach, G. Karypis, and Vipin Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
  • 27. Alexander Strehl, Joydeep Ghosh, A Scalable Approach to Balanced, High-Dimensional Clustering of Market-Baskets, Proceedings of the 7th International Conference on High Performance Computing, p.525-536, December 17-20, 2000
  • 28. S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.
  • 29. TREC. Text REtrieval conference. http://trec.nist.gov, 1999.
  • 30. Yahoo! Yahoo! http://www.yahoo.com.
  • 31. K. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, (C-20):68--86, 1971.
  • 32. (Zhao & Karypis, 2001) ⇒ Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01--40, Department of Computer Science, University of Minnesota, Minneapolis, MN, (2001). Available on the WWW at http://cs.umn.edu/~karypis/publications.

Ying Zhao, and George Karypis. (2002). "Evaluation of Hierarchical Clustering Algorithms for Document Datasets." In: Proceedings of the Conference on Information and Knowledge Management (CIKM 2002). http://www.cs.umn.edu/tech reports upload/tr2002/02-022.pdf doi:10.1145/584792.584877