2002 FrequentTermbasedTextClustering

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Text Clustering Algorithm, Subspace Clustering Algorithm.

Notes

Cited By

2007

Quotes

Abstract

Text clustering methods can be used to structure large sets of text or hypertext documents. The well-known methods of text clustering, however, do not really address the special problems of text clustering: very high dimensionality of the data, very large size of the databases and understandability of the cluster description. In this paper, we introduce a novel approach which uses frequent item (term) sets for text clustering. Such frequent sets can be efficiently discovered using algorithms for association rule mining. To cluster based on frequent term sets, we measure the mutual overlap of frequent sets with respect to the sets of supporting documents. We present two algorithms for frequent term-based text clustering, FTC which creates flat clusterings and HFTC for hierarchical clustering. An experimental evaluation on classical text documents as well as on web documents demonstrates that the proposed algorithms obtain clusterings of comparable quality significantly more efficiently than state-of-the-art text clustering algorithms. Furthermore, our methods provide an understandable description of the discovered clusters by their frequent term sets.

1. Introduction

The world wide web continues to grow at an amazing speed. On the other hand, there is also a quickly growing number of text and hypertext documents managed in organizational intranets, representing the accumulated knowledge of organizations that becomes more and more important for their success in today’s information society. Due to the huge size, high dynamics, and large diversity of the web and of organizational intranets, it has become a very challenging task to find the truly relevant content for some user or purpose. For example, the standard web search engines have low precision, since typically a large number of irrelevant web pages is returned together with a small number of relevant pages. This phenomenon is mainly due to the fact that keywords specified by the user may occur in different contexts, consider for example the term " cluster ". Consequently, a web search engine typically returns long lists of results, but the user, in his limited amount of time, processes only the first few results. Thus, a lot of truly relevant information hidden in the long result lists will never be discovered. Text clustering methods can be applied to structure the large result set such that they can be interactively browsed by the user. Effective knowledge management is a major competitive advantage in today’s information society. To structure large sets of hypertexts available in a company’s intranet, again methods of text clustering can be used.

Compared to previous applications of clustering, three major challenges must be addressed for clustering (hyper) text databases (see also [1]):

A lot of different text clustering algorithms have been proposed in the literature, including Scatter/Gather (Cutting et al., 2002), SuffixTree Clustering (Zamir & Etzioni, 1998) and bisecting k-means (Steinbach et al.,2000). A recent comparison (Steinbach et al.,2000) demonstrates that bisecting k-means outperforms the other well-known techniques, in particular hierarchical clustering algorithms, with respect to clustering quality. Furthermore, this algorithm is efficient. However, bisecting k-means like most of the other algorithms does not really address the above mentioned problems of text clustering: it clusters the full high-dimensional vector space of term frequency vectors and the discovered means of the clusters do not provide an understandable description of the documents grouped in some cluster.

References

;

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2002 FrequentTermbasedTextClusteringMartin Ester
Xiaowei Xu
Florian Beil
Frequent Term-based Text ClusteringKDD-200210.1145/775047.7751102002