2001 OntologyBasedTextClustering


Subject Headings: Text Clustering Algorithm, Document Vector, Word Vector.


Cited By



Text clustering typically involves clustering in a high dimensional space, which appears difficult with regard to virtually all practical settings. In addition, given a particular clustering result it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. In this paper, we propose a new approach for applying background knowledge during preprocessing in order to improve clustering results and allow for selection between results. We built various views basing our selection of text features on a heterarchy of concepts. Based on these aggregations, we compute multiple clustering results using K-Means. The results may be distinguished and explained by the corresponding selection of concepts in the ontology. Our results compare favourably with a sophisticated baseline preprocessing strategy.
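The pipeline outlined in the abstract (aggregate text features along a concept heterarchy, build several views, run K-Means on each) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy heterarchy `PARENTS`, the concept names, the two views, and the helpers `aggregate`, `sq_dist`, and `kmeans` are all assumptions made for illustration, and documents are assumed to be already mapped to concept occurrences.

```python
import random
from collections import Counter

# Hypothetical toy heterarchy: each concept maps to its parent concepts.
PARENTS = {
    "hotel": ["accommodation"],
    "hostel": ["accommodation"],
    "beach": ["area"],
    "harbour": ["area"],
    "accommodation": ["root"],
    "area": ["root"],
    "root": [],
}


def aggregate(doc_concepts, view):
    """Count each occurrence at the nearest concept(s) of `view` at or above it."""
    counts = Counter()
    for concept in doc_concepts:
        frontier, seen = [concept], set()
        while frontier:
            c = frontier.pop()
            if c in seen:
                continue
            seen.add(c)
            if c in view:
                counts[c] += 1  # stop climbing once a selected concept is hit
            else:
                frontier.extend(PARENTS.get(c, []))
    # Fixed feature order so vectors from one view are comparable.
    return [counts[c] for c in sorted(view)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-Means with randomly sampled initial centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Documents as lists of concept occurrences (assumed output of preprocessing).
docs = [
    ["hotel", "hotel", "beach"],
    ["hostel", "hotel"],
    ["beach", "harbour", "harbour"],
    ["harbour", "beach", "hostel"],
]

# Two views: different selections of concepts from the heterarchy.
views = {
    "coarse": {"accommodation", "area"},
    "fine": {"hotel", "hostel", "area"},
}

# One clustering result per view; each result is explained by its concepts.
results = {
    name: kmeans([aggregate(d, view) for d in docs], k=2)
    for name, view in views.items()
}
```

Each entry in `results` is a cluster assignment tied to one selection of concepts, so differing results can be read against the view that produced them. The outcome of K-Means depends on the initial centroids, so the assignments shown by this sketch may change with `seed`.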


  • Rakesh Agrawal, J. Gehrke, D. Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD Int’l Conference on Management of Data, Seattle, Washington, June 1998. ACM Press, 1998.
  • K. Beyer, Jonathan Goldstein, R. Ramakrishnan, and U. Shaft. When is ‘nearest neighbor’ meaningful? In: Proceedings of ICDT-1999, Jerusalem, Israel, 1999, pages 217–235, 1999.
  • P. Bradley, Usama M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In: Proceedings of KDD-1998, New York, NY, USA, August 1998, pages 9–15, Menlo Park, CA, USA, 1998. AAAI Press.
  • J. Fuernkranz, Tom M. Mitchell, and Ellen Riloff. A Case Study in Using Linguistic Phrases for Text Categorization on the WWW. In: Proceedings of AAAI/ICML Workshop Learning for Text Categorization, Madison, WI, 1998. AAAI Press, 1998.
  • A. Hinneburg, C. Aggarwal, and D.A. Keim. What is the nearest neighbor in high dimensional spaces? In: Proceedings of VLDB-2000, Cairo, Egypt, September 2000, pages 506–515. Morgan Kaufmann, 2000.
  • A. Hinneburg and D.A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of VLDB-1999, Edinburgh, Scotland, September 1999. Morgan Kaufmann, 1999.
  • A. Hinneburg, M. Wawryniuk, and D.A. Keim. Visual mining of high-dimensional data. Computer Graphics & Applications Journal, September 1999.
  • L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.
  • S. A. Macskassy, A. Banerjee, B.D. Davison, and H. Hirsh. Human performance on clustering web pages: a preliminary study. In: Proceedings of KDD-1998, New York, NY, USA, August 1998, pages 264–268, Menlo Park, CA, USA, 1998. AAAI Press.
  • Alexander Maedche and Steffen Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 2001.
  • George A. Miller. WordNet: A lexical database for English. CACM, 38(11):39–41, 1995.
  • G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information extraction core system for real world German text processing. In: Proceedings of ANLP-1997, the Conference on Applied Natural Language Processing, pages 208–215, Washington, USA, 1997.
  • M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In: Proceedings of ICML-1997, Nashville, TN, 1998. Morgan Kaufmann, 1998.
  • H. Schuetze and C. Silverstein. Projections for efficient document clustering. In: Proceedings of SIGIR-1997, Philadelphia, PA, July 1997, pages 74–81. Morgan Kaufmann, 1997.


Author: Andreas Hotho, Alexander Maedche, Steffen Staab
Title: Ontology-based Text Clustering
Venue: Proceedings of the IJCAI-2001 Workshop on Text Learning: Beyond Supervision
URL: http://www.uni-koblenz.de/~staab/Research/Publications/hothoetal-ijcaiws2001.pdf
Year: 2001