2002 DiscoveringWordSensesFromText


See: Word Sense-based Word Form Clustering, Clustering By Committee Algorithm, Hybrid Clustering Algorithm

Notes

Cited by

Quotes

Author Keywords

Word sense discovery, clustering, evaluation, machine learning.

Abstract

Inventories of manually compiled dictionaries usually serve as a source for word senses. However, they often include many rare senses while missing corpus/domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers word senses from text. It initially discovers a set of tight clusters called committees that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We proceed by assigning words to their most similar clusters. After assigning an element to a cluster, we remove their overlapping features from the element. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also present an evaluation methodology for automatically measuring the precision and recall of discovered senses.
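The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the raw feature weights, the `threshold` value, and all names (`centroid`, `assign_senses`, the toy words and clusters) are assumptions made for illustration, and the committee-discovery phase is skipped by supplying committees by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Average the feature vectors of a committee's members."""
    total = {}
    for vec in vectors:
        for f, w in vec.items():
            total[f] = total.get(f, 0.0) + w
    return {f: w / len(vectors) for f, w in total.items()}


def assign_senses(word_vec, clusters, threshold=0.2):
    """Assign a word to its most similar clusters, one at a time.

    After each assignment, the features shared with the chosen cluster are
    removed from the word, so later rounds can surface less frequent senses.
    The threshold is a hypothetical cut-off chosen for this toy example.
    """
    remaining = dict(word_vec)
    candidates = dict(clusters)
    senses = []
    while remaining and candidates:
        best = max(candidates, key=lambda c: cosine(remaining, candidates[c]))
        if cosine(remaining, candidates[best]) < threshold:
            break
        senses.append(best)
        for f in candidates[best]:
            remaining.pop(f, None)  # drop features overlapping with the cluster
        del candidates[best]
    return senses


# Hand-made committees (hypothetical data); each cluster is represented
# by the centroid of its committee members.
clusters = {
    "flora": centroid([
        {"grow": 1, "water": 1, "leaf": 1},   # tree
        {"grow": 1, "water": 1, "petal": 1},  # flower
    ]),
    "factory": centroid([
        {"build": 1, "worker": 1, "grain": 1},  # mill
        {"build": 1, "worker": 1, "oil": 1},    # refinery
    ]),
}

plant = {"grow": 3, "water": 2, "leaf": 1, "build": 1, "worker": 1}
oak = {"grow": 1, "leaf": 1}

plant_senses = assign_senses(plant, clusters)
oak_senses = assign_senses(oak, clusters)
```

In this toy data, "plant" is first assigned to the "flora" cluster; once the overlapping features are removed, only "build" and "worker" remain, which lets the weaker "factory" sense come through in the second round. "oak" receives a single sense because no features are left after its first assignment.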


References

  • 1. Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark doi:10.1145/133160.133214
  • 2. Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, Proceedings of the 15th International Conference on Data Engineering, p.512, March 23-26, 1999
  • 3. Harris, Z. 1985. Distributional structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26--47.
  • 4. Donald Hindle, Noun classification from predicate-argument structures, Proceedings of the 28th annual meeting on Association for Computational Linguistics, p.268-275, June 06-09, 1990, Pittsburgh, Pennsylvania doi:10.3115/981823.981857
  • 5. Hutchins, J. and Somers, H. (1992). Introduction to Machine Translation. Academic Press.
  • 6. A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM Computing Surveys (CSUR), v.31 n.3, p.264-323, Sept. 1999 doi:10.1145/331499.331504
  • 7. George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, Chameleon: Hierarchical Clustering Using Dynamic Modeling, Computer, v.32 n.8, p.68-75, August 1999 doi:10.1109/2.781637
  • 8. Landauer, T. K., and Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104:211--240.
  • 9. Landes, S.; Leacock, C.; and Tengi, R. I. (1998). Building semantic concordances. In WordNet: An Electronic Lexical Database, edited by C. Fellbaum. pp. 199--216. MIT Press.
  • 10. Dekang Lin. (1994). Principar - an efficient, broad-coverage, principle-based parser. Proceedings of COLING-94. pp. 42--48. Kyoto, Japan.
  • 11. Dekang Lin. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL-97. pp. 64--71. Madrid, Spain.
  • 12. Dekang Lin. (1998). Automatic retrieval and clustering of similar words. In: Proceedings of COLING/ACL-98. pp. 768--774. Montreal, Canada.
  • 13. Dekang Lin, Patrick Pantel, Induction of semantic classes from natural language text, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.317-322, August 26-29, 2001, San Francisco, California doi:10.1145/502512.502558
  • 14. Christopher D. Manning, Hinrich Schütze, Foundations of statistical natural language processing, MIT Press, Cambridge, MA, 1999
  • 15. George A. Miller 1990. WordNet: An online lexical database. International Journal of Lexicography, 1990.
  • 16. Marius A. Paşca, Sandra M. Harabagiu, High performance question/answering, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.366-374, September 2001, New Orleans, Louisiana, United States doi:10.1145/383952.384025
  • 17. Gerard Salton, Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986
  • 18. W. M. Shaw, Jr., Robert Burgin, Patrick Howell, Performance standards and evaluations in IR test collections: cluster-based retrieval models, Information Processing and Management: an International Journal, v.33 n.1, p.1-14, Jan 1, 1997 doi:10.1016/S0306-4573(96)00043-X
  • 19. Steinbach, M.; Karypis, G.; and Vipin Kumar 2000. A comparison of document clustering techniques, Technical Report 00-0. Department of Computer Science and Engineering, University of Minnesota.
  • 20. Ellen Voorhees. (1998). Using WordNet for text retrieval. In WordNet: An Electronic Lexical Database, edited by C. Fellbaum. pp. 285--303. MIT Press.


2002 DiscoveringWordSensesFromText
Author: Dekang Lin, Patrick Pantel
Title: Discovering Word Senses from Text
Url: http://www.patrickpantel.com/cgi-bin/Web/Tools/getfile.pl?type=paper&id=2002/kdd02.pdf
Doi: 10.1145/775047.775138