2008 TextDocClusteringBasedOnFreqWordMeaningSeq

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Text Clustering Algorithm, Frequent Word Sequence, Frequent Word Meaning Sequence, WordNet

Quotes

Author Keywords

Text documents; Clustering; Frequent word sequences; Frequent word meaning sequences; Web search; WordNet

Abstract

Most of existing text clustering algorithms use the vector space model, which treats documents as bags of words. Thus, word sequences in the documents are ignored, while the meaning of natural languages strongly depends on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form showing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, Vipin Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) Andreas Hotho, Steffen Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544] and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, Martin Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms.

1 Introduction

Nowadays in every industry, almost all the documents on paper have their electronic copies. This is because the electronic format provides safer storage and occupies much smaller space. Also, the electronic files provide a quick access to these documents. The text database which consists of documents is usually very large. The Word Wide Web is such a database, and how to explore and utilize this kind of text database is a major question in the areas of information retrieval and text mining. With the development of the World Wide Web, it is getting more and more popular to use web search engines to get information. When a user submits a query, the search result is usually a long list of ranked documents. The users may not find what they want from the top 10 documents on the list. It is time-consuming and annoying to browse the result documents one by one. Thus, when the users cannot find matching one after 10-20 clicks, they may give up. This is why the precision of the retrieval for a given query is an important for the search engine.

References

  • 1. Rakesh Agrawal, Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, p.487-499, September 12-15, 1994
  • 2. J. Allan, HARD track overview in TREC 2003 high accuracy retrieval from documents, in: Proceedings of the 12th Text Retrieval Conference, 2003, pp. 24-37.
  • 3. Florian Beil, Martin Ester, and Xiaowei Xu. (2002). “Frequent Term-based Text Clustering.” In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002). doi:10.1145/775047.775110
  • 4. Classic data set. .
  • 5. Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th ACM SIGIR Conference retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark  doi:10.1145/133160.133214
  • 6. B. Choudhary, P. Bhattacharyya, Text clustering using semantics, in: Proceedings of the 11th International World Wide Web Conference, 2002.
  • 7. A. Doucet, H. Ahonen-Myka, Non-contiguous word sequences for information retrieval, in: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-2004) Workshop on Multiword Expressions and Integrating Processing, 2004, pp. 88-95.
  • 8. Anil K. Jain, Richard C. Dubes, Algorithms for clustering data, Prentice-Hall, Inc., Upper Saddle River, NJ, 1988
  • 9. M. Farach, Optimal suffix tree construction with large alphabets, Proceedings of the 38th Annual Symposium on Foundations of Computer Science (FOCS '97), p.137, October 19-22, 1997
  • 10. Martin Farach, Paolo Ferragina, S. Muthukrishnan, Overcoming the Memory Bottleneck in Suffix Tree Construction, Proceedings of the 39th Annual Symposium on Foundations of Computer Science, p.174, November 08-11, 1998
  • 11. In: Christiane Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press.
  • 12. B.C.M. Fung, K. Wang, Martin Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003.
  • 13. B.C.M. Fung, K. Wang, Martin Ester, Hierarchical document clustering, in: John Wang (Ed.), The Encyclopedia of Data Warehousing and Mining, Idea Group, 2005.
  • 14. Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: a unifying view of linear-time suffix tree construction. Algorithmica. v19 i3. 331-353.
  • 15. D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational

Biology, Cambridge University Press, 1997.

  • 16. High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004.
  • 17. Frequent Itemset-based Hierarchical Clustering (FIHC) Software, .
  • 18. Holt, J.D. and Chung, S.M., Multipass algorithms for mining association rules in text databases. (2001). Springer-Verlag.
  • 19. Andreas Hotho, Steffen Staab, Gerd Stumme, Ontologies Improve Text Document Clustering, Proceedings of the Third IEEE International Conference on Data Mining, p.541, November 19-22, 2003
  • 20. G. Karypis, CLUTO-A Clustering Toolkit, Release 2.1.1, Department of Computer Science, University of Minnesota. .
  • 21. Stefan Kurtz, Reducing the space requirement of suffix trees, Software — Practice & Experience, v.29 n.13, p.1149-1171, Nov. 1999  <1149::AID-SPE274>3.0.CO;2-O doi:10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O
  • 22. Kaufman, L. and Rousseeuw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis. (1990). John Wiley & Sons.
  • 23. G. M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, Journal of Algorithms, v.10 n.2, p.157-169, June 1989  doi:10.1016/0196-6774(89)90010-2
  • 24. Bjornar Larsen, Chinatsu Aone, Fast and effective text mining using linear-time document clustering, Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.16-22, August 15-18, 1999, San Diego, California, United States  doi:10.1145/312129.312186
  • 25. The Lemur Toolkit for Language Modeling and Information Retrieval. .
  • 26. Bing Liu, C. W. Chin, and H. T. Ng, “Mining Topic-Specific Concepts and Definitions on the Web,”

Proc. of the 12th Int’l conf. on World Wide Web, 2003, pp. 251–260.

  • 27. Edward M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM (JACM), v.23 n.2, p.262-272, April 1976  doi:10.1145/321941.321946
  • 28. H. Ahonen-Myka, Finding all maximal frequent sequences in text, in: Proceedings of ICML-99 Workshop on Marchine Learning in Text Data Analysis, 1999, pp. 11-17.
  • 29. Helena Ahonen-Myka, Discovery of Frequent Word Sequences in Text, Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery, p.180-189, September 16-19, 2002
  • 30. C. J. van Rijsbergen, Information Retrieval, Butterworth-Heinemann, Newton, MA, 1979
  • 31. J. Sedding, D. Kazakov, WordNet-based text document clustering, in: Proceedings of COLING-2004 Workshop on Robust Methods in Analysis of Natural Language Data, 2004.
  • 32. M. Steinbach, G. Karypis, Vipin Kumar, A comparison of document clustering techniques, KDD-2000 Workshop on Text Mining, 2000.
  • 33. Ukkonen, E., On-line construction of suffix trees. Algorithmica. v14. 249-260.
  • 34. Ke Wang, Chu Xu, Bing Liu, Clustering transactions using large items, Proceedings of the eighth International Conference on Information and knowledge management, p.483-490, November 02-06, 1999, Kansas City, Missouri, United States  doi:10.1145/319950.320054
  • 35. P. Weiner, Linear pattern matching algorithms, in: Proceedings of the 14th Annual Symposium on Foundation of Computer Science, 1973, pp. 1-11.
  • 36. O. Zamir, Oren Etzioni, O. Madani, R.M. Karp, Fast and intuitive clustering of web documents, in: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997, pp. 287-290.
  • 37. Oren Zamir, Oren Etzioni, Web document clustering: a feasibility demonstration, Proceedings of the 21st ACM SIGIR Conference retrieval, p.46-54, August 24-28, 1998, Melbourne, Australia  doi:10.1145/290941.290956
  • 38. Ying Zhao, George Karypis, Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering, Machine Learning, v.55 n.3, p.311-331, June 2004  doi:10.1023/B:MACH.0000027785.44527.d6,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2008 TextDocClusteringBasedOnFreqWordMeaningSeqYanjun Li
Soon M. Chung
John D. Holt
Text Document Clustering Based on Frequent Word Meaning SequencesData & Knowledge Engineeringhttp://storm.cis.fordham.edu/~yli/documents/publications/CFWMS.pdf10.1016/j.datak.2007.08.0012008