2007 ANewUnsupMethForDocClust


Subject Headings: Text Clustering Algorithm, Keyword Clustering, Text Document, Bisecting K-Means Algorithm, Multipole, Antipole, WordNet, ANNIE, Document Vector.





Text document clustering provides an effective and intuitive navigation mechanism for organizing a large number of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonyms or antonyms. Our unsupervised method addresses both problems by using ANNIE, WordNet lexical categories, and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.
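As a rough illustration of the pipeline the abstract describes, the following minimal sketch maps each document to a low-dimensional vector of lexical-category counts and then applies bisecting k-means. It is not the author's implementation: the `TOY_CATEGORIES` lexicon is a hypothetical stand-in for the ANNIE and WordNet lookups, the farthest-point seeding of the 2-means split is an assumption made to keep the example deterministic, and the Multipole tree is not covered.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet lexical categories
# (an assumption for illustration; the paper uses ANNIE and WordNet).
TOY_CATEGORIES = {
    "dog": "noun.animal", "cat": "noun.animal", "horse": "noun.animal",
    "bread": "noun.food", "cheese": "noun.food", "soup": "noun.food",
}
CATEGORY_ORDER = sorted(set(TOY_CATEGORIES.values()))


def document_vector(text):
    """Count words per lexical category: one dimension per category."""
    words = text.lower().split()
    counts = Counter(TOY_CATEGORIES[w] for w in words if w in TOY_CATEGORIES)
    return [float(counts[c]) for c in CATEGORY_ORDER]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, iters=20):
    """Split points in two with 2-means, seeded by two far-apart points."""
    c = centroid(points)
    first = max(points, key=lambda p: sq_dist(p, c))
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    for _ in range(iters):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: sse(clusters[j]))
        if len(clusters[i]) < 2 or sse(clusters[i]) == 0:
            break  # nothing left that can be split further
        left, right = two_means(clusters[i])
        if not left or not right:
            break
        clusters.pop(i)
        clusters += [left, right]
    return clusters


docs = [
    "the dog chased the cat",
    "a horse and a dog",
    "bread and cheese",
    "soup with bread and cheese",
]
vectors = [document_vector(d) for d in docs]
clusters = bisecting_kmeans(vectors, 2)
```

In this toy setting the vector space has only two dimensions regardless of vocabulary size, which is the dimensionality reduction the abstract argues for. For real data, an existing implementation such as scikit-learn's `BisectingKMeans` is likely a better starting point than this sketch.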


  • Allan, J. (2002). Introduction to topic detection and tracking. In Topic detection and tracking: Event-based information organization (pp. 1–16). Kluwer Academic Publishers.
  • ANNIE. Annie — a robust cross-domain ie system. http://www.gate.ac.uk/ie/annie.html
  • Barbara, D., Li, Y., & Couto, J. (2002). Coolcat: An entropy-based algorithm for categorical clustering. In: Proceedings of the 11th International Conference on Information and knowledge management (pp. 582–589).
  • (Beil et al., 2002) ⇒ Florian Beil, Martin Ester, and Xiaowei Xu. (2002). “Frequent Term-based Text Clustering.” In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002). doi:10.1145/775047.775110
  • Boley, D. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325–344.
  • Bolshakova, N., & Azuaje, F. (2003). Improving expression data mining through cluster validation. Information Technology Applications in Biomedicine, 19–22.
  • Borgelt, C. (2000). Apriori: association rule induction/frequent item set mining. http://www.fuzzy.cs.unimagdeburg.de/borgelt/apriori.html
  • Canas, A. J., Valerio, A., Lalinde-Pulido, J., Carvalho, M., & Arguedas, M. (2003). Using wordnet for word sense disambiguation to support concept map construction. SPIRE, 2857, 350–359.
  • Cantone, D., Ferro, A., Pulvirenti, A., Reforgiato, D., & Shasha, D. (2005). Antipole tree indexing to support range search and k-nearest-neighbor search in metric spaces. IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(4), 535–550.
  • Chua, S., & Kulathuramaiyer, N. (2004). Semantic feature selection using wordnet. In: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (pp. 166–172).
  • Crowe, M. (2000). Wordnet.net library. http://www.opensvn.csie.org/WordNetDotNet/
  • Hamish Cunningham, Diana Maynard, Bontcheva, K., & Tablan, V. (2002). Gate: A framework and graphical development environment for robust nlp tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), July 2002.
  • Cutting, D. R., Karger, D. R., Pedersen, J. O., & John W. Tukey (1992). Scatter/gather: A cluster-based approach to browsing large document collection. In: Proceedings of ACM SIGIR 92 (pp. 318–329).
  • Dave, D. M. P. K., & Lawrence, S. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. WWW 03 ACM (pp. 519–528).
  • de Buenaga Rodriguez, M., Gomez Hidalgo, J. M., & Diaz Agudo, B. (2000). Using wordnet to complement training information in text categorization. In N. Nicolov & Ruslan Mitkov (Eds.), Recent advances in natural language processing II: Selected papers from RANLP’97, current issues in linguistic theory (CILT) (pp. 353–364). Amsterdam/Philadelphia: John Benjamins.
  • Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 11th International Conference on knowledge discovery and data mining (pp. 269–274).
  • I. K. Fodor. (2002). “A Survey of Dimension Reduction Techniques.” LLNL technical report, UCRL ID-148494 URL: http://www.llnl.gov/CASC/sapphire/pubs.html
  • Jerome H. Friedman (1994). An overview of predictive learning and function approximation. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks, Proceedings of NATO/ASI Workshop (pp. 1–61).
  • Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 293–306.
  • Green, S. J. (1997). Building hypertext links in newspaper articles using semantic similarity. NLDB 97 (pp. 178–190).
  • Green, S. J. (1999). Building hypertext links by computing semantic similarity. TKDE, 11(5), 50–57.
  • (Hotho et al., 2003) ⇒ Andreas Hotho, Steffen Staab, and Gerd Stumme. (2003). “Wordnet improves text document clustering.” In: Proceedings of the Semantic Web Workshop at SIGIR-2003, 26th ACM SIGIR Conference.
  • Jing, L., Zhou, L., Ng, M. K., & Huang, J. Z. (2006). Ontology-based distance measure for text clustering. SIAM conference on data mining.
  • Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. In: Proceedings of the 5th ACM SIGKDD International Conference on knowledge discovery and data mining (pp. 16–22).
  • Urena Lopez, L. A., Gomez de Buenaga Rodriguez, M., & Gomez Hidalgo, J. M. (2001). Integrating linguistic resources in tc through wsd. Computers and the Humanities, 35(2), 215–230.
  • George A. Miller (1995). Wordnet: A lexical database for English. CACM, 38(11), 39–41.
  • Dan Moldovan, & Rada Mihalcea (2000). Using wordnet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1), 34–43.
  • Nickerson, A., Japkowicz, N., & Milios, E. (2001). Using unsupervised learning to guide re-sampling in imbalanced data sets. In: Proceedings of the 8th international workshop on AI and statistics (pp. 261–265).
  • Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter, 6(1), 90–105.
  • Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
  • Reforgiato, D. (2007). Hierarchical clustering data structure comparisons. Technical Report.
  • Van Rijsbergen, C. J. (1979). Information retrieval, 2nd ed. Dept. of Computer Science, University of Glasgow.
  • (Sedding and Kazakov, 2004) ⇒ Julian Sedding and Dimitar Kazakov. (2004). “Wordnet-based Text Document Clustering.” In: COLING-2004 Workshop on Robust Methods in Analysis of Natural Language Data (ROMAND).
  • Padhraic Smyth (1996). Clustering using monte carlo cross-validation. Knowledge Discovery and Data Mining, 126–133.
  • Steinbach, M., Karypis, G., & Vipin Kumar (2000). A comparison of document clustering techniques. In: Proceedings of TextMining Workshop, KDD 2000.
  • Ellen Voorhees. (1994). Query expansion using lexical-semantic relations. In: Proceedings of ACM-SIGIR (pp. 61–69).
  • Zamir, O., & Oren Etzioni (1998). Web document clustering: A feasibility demonstration. In: Proceedings of ACM SIGIR 98 (pp. 46–54).
  • Zamir, O., Oren Etzioni, Madani, O., & Karp, R. M. (1997). Fast and intuitive clustering of web documents. KDD 97, 287–290.
  • Zervas, G., & Ruger, S. M. (1999). The curse of dimensionality and document clustering. In: Proceedings of the IEEE Searching for Information: AI and IR Approaches (pp. 19/1–19/3).


2007 ANewUnsupMethForDocClust
Author: Diego R. Recupero
Title: A New Unsupervised Method for Document Clustering by using WordNet Lexical and Conceptual Relations
Journal: Information Retrieval (IR) Task
Title URL: http://dx.doi.org/10.1007/s10791-007-9035-7
DOI: 10.1007/s10791-007-9035-7
Year: 2007