2000 ImpactOfSimMeasOnWebPClustering

(Strehl et al., 2000) ⇒ Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. (2000). “Impact of Similarity Measures on Web-page Clustering.” In: Workshop at AAAI 2000 on Artificial Intelligence for Web Search.

Subject Headings: Webpage Clustering Algorithm.

Notes

Cited By

~291 http://scholar.google.com/scholar?cites=6295192295633406766&as_sdt=2000

2001

(Dhillon, 2001) ⇒ Inderjit S. Dhillon. (2001). “Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning.” In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001) doi:10.1145/502512.502550

Quotes

Abstract

Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as Yahoo that provide a categorization by human experts, a useful criteria for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning), on high dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
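The four similarity measures named in the abstract can be illustrated with a minimal Python sketch. The formulas below are commonly used definitions and are assumptions here, not taken from the abstract: Euclidean distance is mapped to a similarity with exp(-d^2), Pearson correlation is rescaled from [-1, 1] to [0, 1], and extended Jaccard is the dot product divided by the sum of squared norms minus the dot product. The function names are hypothetical, and the paper's exact formulations may differ.

```python
import math


def dot(x, y):
    """Inner product of two equal-length term vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_similarity(x, y):
    # Assumption: squared Euclidean distance is converted to a
    # similarity in (0, 1] via exp(-d^2).
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2)


def cosine_similarity(x, y):
    # Cosine of the angle between x and y; undefined for zero vectors.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson_similarity(x, y):
    # Assumption: Pearson correlation (cosine of the mean-centered
    # vectors) rescaled to [0, 1]; undefined for constant vectors.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]
    return 0.5 * (cosine_similarity(xc, yc) + 1.0)


def extended_jaccard_similarity(x, y):
    # Extended (Tanimoto-style) Jaccard for real-valued vectors;
    # undefined when both vectors are zero.
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


if __name__ == "__main__":
    # Two toy term-frequency vectors; doc_b is doc_a scaled by 2.
    doc_a = [1.0, 0.0, 2.0, 0.0]
    doc_b = [2.0, 0.0, 4.0, 0.0]
    for measure in (euclidean_similarity, cosine_similarity,
                    pearson_similarity, extended_jaccard_similarity):
        print(measure.__name__, measure(doc_a, doc_b))
```

In this toy example the second vector is a scaled copy of the first, so the cosine and Pearson measures treat the two documents as identical while the Euclidean and extended Jaccard measures are sensitive to the difference in vector length.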


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2000 ImpactOfSimMeasOnWebPClustering	Joydeep Ghosh, Raymond J. Mooney, Alexander Strehl			Impact of Similarity Measures on Web-page Clustering		Workshop at AAAI 2000 on Artificial Intelligence for Web Search	https://www.aaai.org/Papers/Workshops/2000/WS-00-01/WS00-01-011.pdf			2000