1997 ExpansionOfMultiWordTerms

From GM-RKB

Subject Headings: Controlled Terms Index Production Task, Terminology Extraction Task.

Notes

Quotes

Abstract

1 Motivation

  • Terms are known to be excellent descriptors of the informational content of textual documents (Srinivasan, 1996), but they are subject to numerous linguistic variations. Terms cannot be retrieved properly with coarse text simplification techniques (e.g. stemming); their identification requires precise and efficient NLP techniques. We have developed a domain-independent system for automatic term recognition from unrestricted text. The system presented in this paper takes as input a list of controlled terms and a corpus; it detects and marks occurrences of term variants within the corpus. The term list, precompiled automatically or manually, is transformed dynamically into a more complete term list by adding automatically generated variants. This method extends the limits of term extraction as currently practiced in the IR community: it takes into account the multiple morphological and syntactic ways in which linguistic concepts are expressed within language. Our approach is a unique hybrid in allowing the use of manually produced precompiled data as input, combined with fully automatic computational methods for generating term expansions. Our results indicate that we can expand term variations by at least 30% within a scientific corpus.

2 Background and Introduction

  • NLP techniques have been applied to extraction of information from corpora for tasks such as free indexing, i.e. extraction of descriptors from corpora (Metzler and Haas, 1989; Schwarz, 1990; Sheridan and Smeaton, 1992; Strzalkowski, 1996), term acquisition (Smadja and McKeown, 1991; Bourigault, 1993; Justeson and Katz, 1995; Daille, 1996), or extraction of linguistic information such as support verbs (Grefenstette and Teufel, 1995) and event structure of verbs (Klavans and Chodorow, 1992).
  • Although useful, these approaches suffer from two weaknesses which we address. First is the issue of filtering term lists; this has been dealt with by constraints on processing and by post-processing overgenerated lists. Second is the difficulty of identifying related terms across parts of speech. We address these limitations through the use of controlled indexing, that is, indexing with reference to previously available authoritative term lists, such as (NLM, 1995). Our approach is fully automatic, but permits effective combination of available resources (such as thesauri) with language processing technology, i.e., morphology, part-of-speech tagging, and syntactic analysis.
  • Automatic controlled indexing is a more difficult task than it may seem at first glance:
    • controlled indexing on single words must account for polysemy and word disambiguation (Krovetz and Croft, 1992; Klavans, 1995).
    • controlled indexing on multi-word terms must consider the numerous forms of term variations (Dunham, Pacak, and Pratt, 1978; Sparck Jones and Tait, 1984; Jacquemin, 1996).
  • We focus here on the multi-word task. Our system exploits a morphological processor and a transformation-based parser for the extraction of multi-word controlled indexes.
  • The action of the system is twofold. First, a corpus is enriched by tagging each word unambiguously, and then expanded by linking each word with all its possible derivatives. For example, for English, the word genes is tagged as a plural noun and morphologically connected to genic, genetic, genome, genotoxic, genetically, etc. Second, the term list is dynamically expanded through syntactic transformations which allow the retrieval of term variants. For example, genic expressions, genes were expressed, expression of this gene, etc. are extracted as variants of gene expression.
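The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' system: the tagger and morphological processor are replaced by a small hand-built table of morphological families (`FAMILIES`, a hypothetical name, filled only with the word forms quoted in the text plus a few assumed inflections of "express"), and the transformation-based parser is replaced by a naive word window (`max_gap`, also an assumption). The function `find_variants` is likewise illustrative.

```python
import re

# Hypothetical, hand-built morphological families (root -> related forms).
# A real system would derive these with a morphological processor.
FAMILIES = {
    "gene": {"gene", "genes", "genic", "genetic", "genome",
             "genotoxic", "genetically"},
    "expression": {"expression", "expressions", "express",
                   "expressed", "expresses"},
}

# Reverse index: surface form -> family root.
ROOT_OF = {form: root for root, forms in FAMILIES.items() for form in forms}


def find_variants(term, text, max_gap=2):
    """Return spans of `text` that contain both roots of a two-word `term`,
    in either order, separated by at most `max_gap` other words."""
    target = {ROOT_OF.get(word) for word in term.lower().split()}
    if None in target or len(target) != 2:
        raise ValueError("term must consist of two words with known roots")

    words = re.findall(r"[a-z]+", text.lower())
    word_roots = [ROOT_OF.get(word) for word in words]

    hits = []
    for i, first in enumerate(word_roots):
        if first not in target:
            continue
        # Look ahead for the other root within the allowed window.
        last = min(i + max_gap + 2, len(words))
        for j in range(i + 1, last):
            if word_roots[j] in target and word_roots[j] != first:
                hits.append(" ".join(words[i:j + 1]))
                break
    return hits


if __name__ == "__main__":
    for phrase in ("genic expressions",
                   "genes were expressed",
                   "expression of this gene"):
        print(phrase, "->", find_variants("gene expression", phrase))
```

Each of the three phrases from the text is matched as a variant of "gene expression" under these assumptions. A plain window like this overgenerates on longer passages, since it ignores part of speech and syntactic structure; that is the kind of noise the tagging and syntactic transformations described above are meant to filter.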

References

  • 1. AGR, Institut National de l'Information Scientifique et Technique, Vandœuvre, France, 1995. Corpus de l'Agriculture, first edition.
  • 2. Aronoff, Mark. 1976. Word Formation in Generative Grammar. Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.
  • 3. Didier Bourigault, An endogeneous corpus-based method for structural noun phrase disambiguation, Proceedings of the sixth conference on European chapter of the Association for Computational Linguistics, April 21-23, 1993, Utrecht, The Netherlands doi:10.3115/976744.976755
  • 4. Boyer, Martin. (1993). Dictionnaire du français. Hydro-Quebec, GNU General Public License, Québec, Canada.
  • 5. Daille, Béatrice. (1996). Study and implementation of combined techniques for automatic extraction of terminology. In Judith L. Klavans and Philip Resnik, editors, The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, MA.
  • 6. Dunham, George S., Milos G. Pacak, and Arnold W. Pratt. 1978. Automatic indexing of pathology data. Journal of the American Society for Information Science, 29(2):81--90.
  • 7. ECI, European Corpus Initiative, 1989 and 1990. "Le Monde" Newspaper.
  • 8. Gregory Grefenstette, Simone Teufel, Corpus-based method for automatic identification of support verbs for nominalizations, Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics, March 27-31, 1995, Dublin, Ireland doi:10.3115/976973.976988
  • 9. Harris, Zellig S., Michael Gottfried, Thomas Ryckman, Paul Mattick Jr, Anne Daladier, T. N. Harris, and S. Harris. 1989. The Form of Information in Science, Analysis of Immunology Sublanguage, volume 104 of Boston Studies in the Philosophy of Science. Kluwer, Boston, MA.
  • 10. Christian Jacquemin, Recycling terms into a partial parser, Proceedings of the fourth Conference on Applied Natural Language Processing, October 13-15, 1994, Stuttgart, Germany doi:10.3115/974358.974384
  • 11. Christian Jacquemin, What is the tree that we see through the window: a linguistic approach to windowing and term variation, Information Processing and Management: an International Journal, v.32 n.4, p.445-458, July 1996 doi:10.1016/0306-4573(95)00078-X.
  • 12. Christian Jacquemin, Guessing morphology from terms and corpora, Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, p.156-165, July 27-31, 1997, Philadelphia, Pennsylvania, United States
  • 13. Justeson, John S. and Slava M. Katz. (1995). Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9--27.
  • 14. Klavans, Judith L., editor. (1995). AAAI Symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity. American Association for Artificial Intelligence, March.
  • 15. Judith L. Klavans, Martin Chodorow, Degrees of stativity: the lexical representation of verb aspect, Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France doi:10.3115/992424.992443.
  • 16. Robert Krovetz, W. Bruce Croft, Lexical ambiguity and information retrieval, ACM Transactions on Information Systems (TOIS), v.10 n.2, p.115-141, April 1992 doi:10.1145/146802.146810
  • 17. Lewis, David D., W. Bruce Croft, and Nehru Bhandaru. 1989. Language-oriented information retrieval. International Journal of Intelligent Systems, 4:285--318.
  • 18. Martin, W. J. F., B. P. F. Al, and P. J. G. Van Sterkenburg. 1983. On the processing of a text corpus: From textual data to lexicographical information. In R. R. K. Hartman, editor, Lexicography, Principles and Practice. Academic Press, London, pages 77--87.
  • 19. Douglas P. Metzler, Stephanie W. Haas, The constituent object parser: syntactic structure matching for information retrieval, ACM Transactions on Information Systems (TOIS), v.7 n.3, p.292-316, July 1989 doi:10.1145/65943.65949
  • 20. NLM, National Library of Medicine, Bethesda, MD, 1995. Unified Medical Language System, sixth experimental edition.
  • 21. Popovic, Mirko and Peter Willett. (1992). The effectiveness of stemming for Natural-Language access to Slovene textual data. Journal of the American Society for Information Science, 43(5):384--390.
  • 22. Schwarz, Christoph. (1990). Automatic syntactic analysis of free text. Journal of the American Society for Information Science, 41(6):408--417.
  • 23. Selkirk, Elisabeth O. 1982. The Syntax of Words. MIT Press, Cambridge, MA.
  • 24. Paraic Sheridan, Alan F. Smeaton, The application of morpho-syntactic language processing to effective phrase matching, Information Processing and Management: an International Journal, v.28 n.3, p.349-369, 1992 doi:10.1016/0306-4573(92)90080-J
  • 25. Smadja, Frank and Kathleen R. McKeown. (1991). Using collocations for language generation. Computational Intelligence, 7(4), December.
  • 26. Alan F. Smeaton, Progress in the application of natural language processing to information retrieval tasks, The Computer Journal, v.35 n.3, p.268-278, June 1992 doi:10.1093/comjnl/35.3.268
  • 27. Sparck Jones, Karen and Joel I. Tait. 1984. Automatic search term variant generation. Journal of Documentation, 40(1):50--66.
  • 28. Padmini Srinivasan, Optimal document-indexing vocabulary for MEDLINE, Information Processing and Management: an International Journal, v.32 n.5, p.503-514, Sept. 1996 doi:10.1016/0306-4573(96)00025-8
  • 29. Tomek Strzalkowski, Natural language information retrieval, Information Processing and Management: an International Journal, v.31 n.3, p.397-417, May-June 1995
  • 30. Tzoukermann, Evelyne and Christian Jacquemin. (1997). Analyse automatique de la morphologie dérivationnelle et filtrage de mots possibles. Silexicales, 1:251--260. Colloque Mots possibles et mots existants, SILEX, University of Lille III.
  • 31. Evelyne Tzoukermann, Judith L. Klavans, Christian Jacquemin, Effective use of natural language processing techniques for automatic conflation of multi-word terms: the role of derivational morphology, part of speech tagging, and shallow parsing, Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, p.148-155, July 27-31, 1997, Philadelphia, Pennsylvania, United States
  • 32. Evelyne Tzoukermann, Mark Y. Liberman, A finite-state morphological processor for Spanish, Proceedings of the 13th conference on Computational linguistics, p.277-282, August 20-25, 1990, Helsinki, Finland doi:10.3115/991146.991195
  • 33. Tzoukermann, Evelyne and Dragomir Radev. (1996). Using word class for part-of-speech disambiguation. In SIGDAT Workshop, pages 1--13, Copenhagen, Denmark.
  • 34. Tzoukermann, Evelyne, Dragomir Radev, and William A. Gale. (1995). Combining linguistic knowledge and statistical learning in French part-of-speech tagging. In EACL SIGDAT Workshop, pages 51--57, Dublin, Ireland.
  • 35. C. J. van Rijsbergen, Information Retrieval, Butterworth-Heinemann, Newton, MA, 1979
  • 36. Viegas, Evelyne, Margarita Gonzalez, and Jeff Longwell. (1996). Morpho-semantics and constructive derivational morphology: A transcategorial approach. Technical Report MCCS-96-295, Computing Research Laboratory, New Mexico State University, Las Cruces, NM.

  • Author: Christian Jacquemin, Judith Klavans, Evelyne Tzoukermann
  • Title: Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax
  • URL: http://www.aclweb.org/anthology-new/P/P97/P97-1004.pdf
  • DOI: 10.3115/976909.979621
  • Year: 1997