2003 WordnetImprovesTextDocumentClustering


Subject Headings: Text Clustering Algorithm, WordNet, Bisecting k-Means Clustering Algorithm.

Quotes

Abstract

Text document clustering plays an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. The bag of words representation used for these clustering methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with the problem, we integrate background knowledge — in our application Wordnet — into the process of clustering text documents. We cluster the documents by a standard partitional algorithm. Our experimental evaluation on Reuters newsfeeds compares clustering results with pre-categorizations of news. In the experiments, improvements of results by background knowledge compared to the baseline can be shown for many interesting tasks.
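The core idea in the abstract, enriching a bag-of-words vector with background-knowledge concepts so that related terms that never co-occur literally can still match, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TOY_HYPERNYMS` table is a hypothetical stand-in for WordNet lookups, the `concept:` prefix and the function names are invented, and the paper's actual term weighting, word sense disambiguation, and clustering steps are not reproduced.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy stand-in for WordNet: maps a term to a more general concept.
# A real system would query WordNet instead of this hand-written table.
TOY_HYPERNYMS = {
    "beef": "meat",
    "pork": "meat",
    "wheat": "grain",
    "corn": "grain",
}


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def add_concepts(bag, hypernyms=TOY_HYPERNYMS):
    """Return a copy of the bag with one extra 'concept:' feature per mapped term."""
    enriched = Counter(bag)
    for term, count in bag.items():
        concept = hypernyms.get(term)
        if concept is not None:
            enriched["concept:" + concept] += count
    return enriched


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("beef exports rise")
doc_b = bag_of_words("pork prices fall")

# The plain bags share no literal term, so their similarity is zero.
plain_similarity = cosine(doc_a, doc_b)

# After adding concepts, both documents share the feature "concept:meat".
enriched_similarity = cosine(add_concepts(doc_a), add_concepts(doc_b))

print(plain_similarity, enriched_similarity)
```

In this toy setting the two documents are unrelated under the plain representation and become partially similar once the shared concept is added. A partitional algorithm such as k-means or bisecting k-means would then operate on the enriched vectors instead of the plain ones.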


References

  • Eneko Agirre and G. Rigau. Word sense disambiguation using conceptual density. In: Proceedings of COLING’96, 1996.
  • G. Amati, C. Carpineto, and G. Romano. FUB at TREC-10 web track: A probabilistic framework for topic relevance term weighting. In The Tenth Text Retrieval Conference (TREC 2001). National Institute of Standards and Technology (NIST), online publication, 2001.
  • E. Bozsak et al. Kaon - towards a large scale semantic web. In: Proceedings of EC-Web, pages 304–313, Aix-en-Provence, France, 2002. LNCS 2455, Springer.
  • S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
  • M. de Buenaga Rodríguez, J. M. G. Hidalgo, and B. Díaz-Agudo. Using WordNet to complement training information in text categorization. In Recent Advances in Natural Language Processing II, volume 189. John Benjamins, 2000.
  • B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin – Heidelberg, 1999.
  • J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarrán. Indexing with WordNet synsets can improve text retrieval. In: Proceedings ACL/COLING Workshop on Usage of WordNet for Natural Language Processing, 1998.
  • S. J. Green. Building hypertext links in newspaper articles using semantic similarity. In: Proceedings of third Workshop on Applications of Natural Language to Information Systems (NLDB ’97), 1997.
  • S. J. Green. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(5):713–730, 1999.
  • (Hofmann, 1999) ⇒ Thomas Hofmann. (1999). “Probabilistic Latent Semantic Indexing.” In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999) doi:10.1145/312624.312649
  • Andreas Hotho, Steffen Staab, and G. Stumme. Explaining text clustering results using semantic structures. In Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, Dubrovnik, Croatia, September 22-26, 2003, LNCS. Springer, 2003.
  • Andreas Hotho, Steffen Staab, and G. Stumme. Text clustering based on background knowledge. Technical report, University of Karlsruhe, Institute AIFB, 2003. 36 pages.
  • N. Ide and J. Véronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40, 1998.
  • G. Karypis and E. Han. Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. In: Proceedings of CIKM-00, pages 12–19. ACM Press, 2000.
  • Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the Twelfth International World Wide Web Conference, WWW2003. ACM, 2003.
  • D. Lewis. Reuters-21578 text categorization test collection, 1997.
  • George A. Miller. WordNet: A lexical database for English. CACM, 38(11):39–41, 1995.
  • Dan Moldovan and Rada Mihalcea. Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1):34–43, 2000.
  • Patrick Pantel and Dekang Lin. Document clustering with committees. In: Proceedings of SIGIR’02, Tampere, Finland, 2002.
  • M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
  • Gerard M. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
  • F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.


Author: Steffen Staab, Andreas Hotho, Gerd Stumme
Title: Wordnet Improves Text Document Clustering
URL: http://www.uni-koblenz.de/~staab/Research/Publications/sw sigir2003 submit.pdf