2003 HierarchicalDocClusteringUsingFreqItemsets

From GM-RKB

Subject Headings: Text Clustering Algorithm, Text Document, Frequent Itemset.

Notes

  • It focuses on:
    • Reduced dimensionality, by using frequent itemsets.
    • High clustering accuracy.
    • The number of clusters as an optional input parameter.
    • Easy browsing with meaningful cluster descriptions.

Cited By

  • ~900

Quotes

Abstract

A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains only a small fraction of the words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.

1. Introduction

Document clustering has been studied intensively because of its wide applicability in areas such as web mining, search engines, information retrieval, and topological analysis. Unlike in document classification, no labeled documents are provided in document clustering. Although standard clustering techniques such as k-means can be applied to document clustering, they usually do not satisfy the special requirements for clustering documents: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. In addition, many existing document clustering algorithms require the user to specify the number of clusters as an input parameter and are not robust enough to handle different types of document sets in a real-world environment. For example, in some document sets the cluster size varies from a few to thousands of documents. This variation tremendously reduces the clustering accuracy of some state-of-the-art algorithms.

In this paper, we propose a novel approach, Frequent Itemset-based Hierarchical Clustering (FIHC), for document clustering based on the idea of frequent itemsets proposed by Agrawal et al. [1]. The intuition of our clustering criterion is that there are some frequent itemsets for each cluster (topic) in the document set, and different clusters share few frequent itemsets. A frequent itemset is a set of words that occur together in some minimum fraction of documents in a cluster. Therefore, a frequent itemset describes something common to many documents in a cluster. We use frequent itemsets to construct clusters and to organize clusters into a topic hierarchy. Here are the features of our approach.

Reduced dimensionality. We use only the frequent items that occur in some minimum fraction of documents in document vectors, which drastically reduces the dimensionality of the document set. Experiments show that clustering with reduced dimensionality is significantly more efficient and scalable. This decision is consistent with the finding from linguistics (Longman Lancaster Corpus) that only 3,000 words are required to cover 80% of written English text [13, 18], and it agrees with Zipf's law [21] and the findings of Mladenic et al. [12] and Yang et al. [20]. (A small illustrative sketch follows this feature list.)

High clustering accuracy. Experimental results show that the proposed approach, FIHC, outperforms the best document clustering algorithms in terms of accuracy. It is robust even when applied to large and complicated document sets.

Number of clusters as an optional input parameter. Many existing clustering algorithms require the user to specify the desired number of clusters as an input parameter. FIHC treats it only as an optional input parameter. Close to optimal clustering quality can be achieved even when this value is unknown.

Easy to browse with meaningful cluster description. The topic tree provides a sensible structure for browsing documents. Each cluster in the tree has a set of frequent itemsets as its description. Users may utilize them to navigate the topic tree.
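
To make the first feature concrete, the following is a minimal sketch (in Python, on toy data; all names are illustrative, and the paper itself relies on a scalable association-rule miner rather than this naive enumeration) of what it means for a word set to be frequent and of how keeping only frequent items shrinks the document vectors.

  from itertools import combinations

  def frequent_itemsets(docs, min_sup=0.5, max_size=2):
      """Return all word sets contained in at least a min_sup fraction
      of the documents. Naive enumeration, for illustration only."""
      n = len(docs)
      doc_sets = [set(d) for d in docs]
      # Level 1: frequent single items.
      vocab = set().union(*doc_sets)
      singles = {w for w in vocab
                 if sum(w in d for d in doc_sets) >= min_sup * n}
      result = {frozenset([w]) for w in singles}
      # Build larger itemsets from frequent single items only, since
      # any superset of an infrequent item is itself infrequent.
      for k in range(2, max_size + 1):
          for combo in combinations(sorted(singles), k):
              s = set(combo)
              if sum(s <= d for d in doc_sets) >= min_sup * n:
                  result.add(frozenset(s))
      return result

  docs = [["apple", "pie", "recipe"],
          ["apple", "tree", "garden"],
          ["apple", "pie", "baking"],
          ["garden", "tree", "soil"]]

  itemsets = frequent_itemsets(docs, min_sup=0.5)
  # Reduced dimensionality: only frequent items become features, so
  # rare words such as "recipe" or "soil" drop out of the vectors.
  features = sorted({w for s in itemsets if len(s) == 1 for w in s})
  vectors = [[d.count(w) for w in features] for d in docs]
  print(features)   # ['apple', 'garden', 'pie', 'tree']
  print(vectors)    # 4 documents x 4 features instead of 7

With min_sup = 0.5, the pairs {apple, pie} and {garden, tree} also come out frequent; in FIHC, itemsets like these seed the initial clusters and label the nodes of the topic tree.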

The outline of this paper is as follows. Section 2 briefly discusses some well-known clustering algorithms and some common preprocessing steps. Sections 3 and 4 present our algorithm in two stages, cluster construction and tree building, with a running example. Section 5 shows the experimental results and the comparison with other algorithms. Section 6 provides an analysis of our method. We conclude the paper and outline future directions of research in Section 7.

2. Related Work

Hierarchical and partitioning methods are two major categories of clustering algorithms. One popular approach in document clustering is agglomerative hierarchical clustering [5]. Algorithms in this family follow a similar template: compute the similarity between all pairs of clusters and then merge the most similar pair. Different agglomerative algorithms may employ different similarity measuring schemes. Recently, Steinbach et al. [14] showed that UPGMA [5, 9] is the most accurate one in its category. K-means and its variants [4, 9, 10] represent the category of partitioning clustering algorithms. [14] illustrates that one of the variants, bisecting k-means, outperforms basic k-means as well as the agglomerative approach in terms of accuracy and efficiency. The bisecting k-means algorithm first selects a cluster to split, then utilizes basic k-means to form two sub-clusters, and repeats until the desired number of clusters is reached.
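
For reference, here is a compact sketch of the bisecting k-means loop just described. It is a plain NumPy version under simplifying assumptions that are mine, not from [14]: Euclidean distance, the largest cluster is always the one chosen for splitting, and a fixed number of 2-means iterations.

  import numpy as np

  def bisecting_kmeans(X, k, n_trials=5, n_iter=20, seed=0):
      """Repeatedly split one cluster with basic 2-means until k
      clusters exist; returns a list of index arrays, one per cluster."""
      rng = np.random.default_rng(seed)
      clusters = [np.arange(len(X))]       # start: one cluster holding every point
      while len(clusters) < k:
          clusters.sort(key=len)
          target = clusters.pop()          # split the largest cluster
          best_labels, best_sse = None, np.inf
          for _ in range(n_trials):        # keep the best of several 2-means runs
              centers = X[rng.choice(target, size=2, replace=False)].astype(float)
              for _ in range(n_iter):      # basic k-means on the chosen cluster
                  dist = np.linalg.norm(X[target, None] - centers[None], axis=2)
                  labels = dist.argmin(axis=1)
                  for j in (0, 1):
                      if (labels == j).any():
                          centers[j] = X[target[labels == j]].mean(axis=0)
              sse = (dist.min(axis=1) ** 2).sum()
              if sse < best_sse:
                  best_sse, best_labels = sse, labels
          clusters += [target[best_labels == 0], target[best_labels == 1]]
      return clusters

  X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]])
  print([c.tolist() for c in bisecting_kmeans(X, k=3)])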

Wang et al. [17] introduce a new criterion for clustering transactions using frequent itemsets. In principle, this method can also be applied to document clustering by treating a document as a transaction; however, it does not create a hierarchy for browsing. The HFTC algorithm proposed by Beil et al. [2] attempts to address the special requirements of document clustering using the notion of frequent itemsets. HFTC greedily picks the next frequent itemset (representing the next cluster) so as to minimize the overlap among documents that contain both that itemset and some remaining itemset. The clustering result depends on the order in which itemsets are picked, which in turn depends on the greedy heuristic used. In our approach, we do not follow a sequential order of selecting clusters. Rather, we assign each document to the best cluster with all clusters available. Experiments show that this strategy produces better clusters and is more scalable.
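
To contrast the two strategies, here is a deliberately simplified sketch of the non-sequential assignment: every document sees all candidate clusters at once and joins the one it matches best. The plain overlap count below is only a stand-in for FIHC's actual score function (defined in Section 3 of the paper), and the names and data are made up.

  def assign_documents(docs, cluster_itemsets):
      """Assign each document to its best cluster with all candidates
      available at once, instead of consuming clusters in a greedy
      sequential order as HFTC does. `cluster_itemsets` maps a cluster
      label to the frequent itemset defining that cluster."""
      assignment = {}
      for i, doc in enumerate(docs):
          words = set(doc)
          # Best cluster = largest overlap between the document and
          # the cluster's defining itemset (a stand-in score, not the
          # paper's goodness measure).
          assignment[i] = max(cluster_itemsets,
                              key=lambda c: len(words & cluster_itemsets[c]))
      return assignment

  clusters = {"fruit": {"apple", "pie"}, "garden": {"garden", "tree"}}
  docs = [["apple", "pie", "recipe"], ["garden", "tree", "soil"]]
  print(assign_documents(docs, clusters))   # {0: 'fruit', 1: 'garden'}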

Similar to most document clustering algorithms, our method employs several preprocessing steps, including stop-word removal and stemming, on the document set. Each document is represented by a vector of frequencies of the remaining items within the document. As an extra preprocessing step, many document clustering algorithms replace the actual term frequency of an item with the weighted frequency, i.e., term frequency × inverse document frequency (TF×IDF), in the document vector. The idea is that if an item is too common across different documents, then it has little discriminating power, and vice versa [16]. Note that our algorithm applies TF×IDF after stemming; therefore, each document is actually represented by a vector of weighted frequencies. The effect of TF×IDF on our algorithm will be explained in Section 3.2.
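
As a quick illustration of the weighting step, here is a sketch that assumes the common idf = log(N / df) variant; the paper does not spell out which exact TF×IDF formula it uses.

  import math

  def tfidf_vectors(docs):
      """Replace each raw term frequency by tf * log(N / df), so items
      common across many documents receive little weight [16]."""
      n = len(docs)
      vocab = sorted({w for d in docs for w in d})
      df = {w: sum(w in set(d) for d in docs) for w in vocab}
      return [[d.count(w) * math.log(n / df[w]) for w in vocab]
              for d in docs]

  docs = [["apple", "pie"], ["apple", "tree"], ["apple", "pie", "pie"]]
  for vec in tfidf_vectors(docs):
      print([round(x, 2) for x in vec])
  # "apple" occurs in every document, so its idf is log(3/3) = 0 and
  # its weight vanishes: it has no discriminating power.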

7. Conclusions

Most traditional clustering methods do not satisfy the special requirements for document clustering, such as high dimensionality, high volume, and ease of browsing with meaningful cluster labels. In this paper, we present a new approach to address these issues. The novelty of this approach is that it exploits frequent itemsets for defining a cluster, organizing the cluster hierarchy, and reducing the dimensionality of document sets. The experimental results show that our approach outperforms its competitors in terms of accuracy, efficiency, and scalability.

References

  • Rakesh Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, September 1994.
  • F. Beil, Martin Ester, and X. Xu. Frequent term-based text clustering. In: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, 2002.
  • Classic. ftp://ftp.cs.cornell.edu/pub/smart/.
  • D. R. Cutting, D. R. Karger, J. O. Pedersen, and John W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In: Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318–329, 1992.
  • R. C. Dubes and A. K. Jain. Algorithms for Clustering Data. Prentice Hall College Div, Englewood Cliffs, NJ, March 1988.
  • (Han et al., 2000) ⇒ Jiawei Han, Jian Pei, and Y. Yin. (2000). “Mining Frequent Patterns Without Candidate Generation.” In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000).
  • J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. ACM SIGKDD Explorations, 2(1):58–64, July 2000.
  • G. Karypis. CLUTO 2.0 clustering toolkit, April 2002. http://www-users.cs.umn.edu/~karypis/cluto/.
  • L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, March 1990.
  • B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. KDD’99, pages 16–22, 1999.
  • D. D. Lewis. Reuters. http://www.research.att.com/~lewis/.
  • D. Mladenic and Marko Grobelnik. Feature selection for unbalanced class distribution and naive Bayes. In: Proceedings of the International Conference on Machine Learning (ICML), 1999.
  • I. S. P. Nation and J. Coady. Vocabulary and reading. In Carter and McCarthy (Eds.) Vocabulary and Language Teaching. Longman, London, 1988.
  • M. Steinbach, G. Karypis, and Vipin Kumar. A comparison of document clustering techniques. In: KDD Workshop on Text Mining, 2000.
  • Text REtrieval Conference (TREC) TIPSTER, 1999. http://trec.nist.gov/.
  • C. J. van Rijsbergen. Information Retrieval. Butterworth, London, 2nd edition, 1979.
  • K. Wang, C. Xu, and Bing Liu. Clustering transactions using large items. In: Proceedings of CIKM'99, pages 483–490, 1999.
  • R. Waring. Second language vocabulary acquisition, linguistic context and vocabulary task design. Summary of a paper presented at The British Council Conference in St Andrews Scotland, September 1995.
  • Yahoo! http://www.yahoo.com/.
  • Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In: Proceedings of the International Conference on Machine Learning (ICML), 1997.
  • G. K. Zipf. Selected Studies of the Principle of Relative Frequency in Language, 1932.

 Author(s): Martin Ester, Benjamin C. M. Fung, Ke Wang
 Title: Hierarchical Document Clustering using Frequent Itemsets
 Year: 2003
 URL: http://www.siam.org/proceedings/datamining/2003/dm03_06FungB.pdf