2002 FrequentTermbasedTextClustering

Subject Headings: Text Clustering Algorithm, Subspace Clustering Algorithm.

Notes

The world wide web continues to grow at an amazing speed. At the same time, a quickly growing number of text and hypertext documents is managed in organizational intranets, representing the accumulated knowledge of organizations, which becomes more and more important for their success in today’s information society. Due to the huge size, high dynamics, and large diversity of the web and of organizational intranets, it has become a very challenging task to find the truly relevant content for some user or purpose. For example, standard web search engines have low precision, since typically a large number of irrelevant web pages is returned together with a small number of relevant pages. This phenomenon is mainly due to the fact that keywords specified by the user may occur in different contexts; consider, for example, the term "cluster". Consequently, a web search engine typically returns long lists of results, but the user, with limited time, processes only the first few results. Thus, a lot of truly relevant information hidden in the long result lists will never be discovered. Text clustering methods can be applied to structure a large result set so that it can be interactively browsed by the user.

Effective knowledge management is a major competitive advantage in today’s information society. To structure large sets of hypertexts available in a company’s intranet, methods of text clustering can again be used.

Compared to previous applications of clustering, three major challenges must be addressed for clustering (hyper)text databases (see also [1]):

Many different text clustering algorithms have been proposed in the literature, including Scatter/Gather (Cutting et al., 1992), Suffix Tree Clustering (Zamir & Etzioni, 1998), and bisecting k-means (Steinbach et al., 2000). A recent comparison (Steinbach et al., 2000) demonstrates that bisecting k-means outperforms the other well-known techniques, in particular hierarchical clustering algorithms, with respect to clustering quality. Furthermore, this algorithm is efficient. However, bisecting k-means, like most of the other algorithms, does not really address the above-mentioned problems of text clustering: it clusters the full high-dimensional vector space of term frequency vectors, and the discovered means of the clusters do not provide an understandable description of the documents grouped in a cluster.
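To make the last point concrete, the following is a minimal sketch of bisecting k-means over term frequency vectors. It is not the implementation evaluated by Steinbach et al.: the squared Euclidean distance, the deterministic seeding, the "split the largest cluster" rule, and all function names (`tf_vectors`, `two_means`, `bisecting_kmeans`) are assumptions chosen to keep the example short and reproducible. The four toy documents are likewise made up for illustration.

```python
from collections import Counter


def tf_vectors(docs):
    """Turn whitespace-tokenized documents into term frequency vectors."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[t]) for t in vocab])
    return vocab, vectors


def mean(vectors):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumption; cosine is also common)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(vectors, idx, iters=10):
    """Split the documents with indices idx into two groups via 2-means."""
    # Deterministic seeding (an assumption): the first document and the
    # document farthest away from it.
    c0 = vectors[idx[0]]
    c1 = vectors[max(idx, key=lambda i: sq_dist(vectors[i], c0))]
    left, right = list(idx), []
    for _ in range(iters):
        left = [i for i in idx
                if sq_dist(vectors[i], c0) <= sq_dist(vectors[i], c1)]
        right = [i for i in idx if i not in left]
        if not left or not right:
            break
        c0 = mean([vectors[i] for i in left])
        c1 = mean([vectors[i] for i in right])
    return left, right


def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        left, right = two_means(vectors, biggest)
        if not left or not right:  # cluster cannot be split further
            clusters.append(biggest)
            break
        clusters += [left, right]
    return clusters


# Hypothetical toy corpus: "cluster" occurs in two different contexts.
docs = [
    "galaxy star cluster",
    "star cluster galaxy telescope",
    "computer cluster network",
    "network computer cluster server",
]
vocab, vectors = tf_vectors(docs)
clusters = bisecting_kmeans(vectors, 2)
for members in sorted(clusters):
    centroid = mean([vectors[i] for i in members])
    print(members, dict(zip(vocab, centroid)))
```

Each resulting cluster is summarized only by its mean, a numeric weight vector over the entire vocabulary. In this toy run the term "cluster" receives the same weight in both means, so the means by themselves say little about what distinguishes the two groups, and with a realistic vocabulary the vectors would have thousands of dimensions.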


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2002 FrequentTermbasedTextClustering	Martin Ester Xiaowei Xu Florian Beil			Frequent Term-based Text Clustering		KDD-2002		10.1145/775047.775110		2002