2000 AutomaticRecogOfMultiWordTerms

From GM-RKB

Subject Headings: Term Recognition Task.

Notes

Quotes

  • Key words: Terms – Automatic extraction – Domain independence – Automatic Term Recognition (ATR) – Linguistic and statistical information

Abstract

  • Technical terms (henceforth called terms) are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word term, the nested term. The second part, NC-value, gives: 1) a method for the extraction of term context words (words that tend to appear with terms); 2) the incorporation of information from term context words into the extraction of terms.

1. Introduction

  • Terms, the linguistic representation of concepts [28], are important elements for digital libraries. Rapid changes in many specialised knowledge domains (particularly in areas like computer science, engineering, medicine etc.) mean that new terms are being created all the time, which makes automating their retrieval important.
  • Many techniques for multi-word automatic term recognition (ATR) have recently moved from using only linguistic information [1-3] to incorporating statistical information as well. Dagan and Church [6], Daille et al. [8], Justeson and Katz [18], and Enguehard and Pantera [11] use frequency of occurrence. Daille et al. and Lauriston [21] propose the likelihood ratio for terms consisting of two words. For the same type of terms, Damerau [9] proposed a measure based on mutual information (MI). Those of the above methods that aim at multi-word terms which may consist of more than two words use the frequency of occurrence of the candidate term in the corpus as their only statistical parameter. A detailed description and evaluation of previous work on multi-word ATR can be found in [13].
  • The method we present and evaluate in this paper extracts multi-word terms from English corpora by combining linguistic and statistical information. It is divided into two parts: 1) the C-value, which aims to improve the extraction of nested multi-word terms [15], and 2) the NC-value, which incorporates context information into the C-value method, aiming to improve multi-word term extraction in general [12,16]. The first part, C-value, has also been used for collocation extraction [14]. The second part incorporates a method for the extraction of term context words, which will also be presented and evaluated in this paper.
  • Since ATR methods are mostly empirical [19], we evaluate the results of the method in terms of precision and recall [29]. The results are compared with those produced by the most common statistical technique used for ATR to date, the frequency of occurrence of the candidate term, which was applied to the same corpus.

2.1 The linguistic part

In such a case the linguistic filter would not be needed. This approach has not yet been followed by us or by any other researchers in ATR. The reason is that the statistical information that is available, without any linguistic filtering, is not enough to produce useful results. Without any linguistic information, undesirable strings such as of the, is a, etc., would also be extracted.

  • Since most terms consist of nouns and adjectives, [27], and sometimes prepositions, [18], we use a linguistic filter that accepts these types of terms.
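The filtering idea above can be sketched as a pattern over part-of-speech tags. This is a minimal sketch, not the authors' filter: the single-letter tag codes (`A`, `N`, `P`, `O`), the pattern `[AN]+N` (adjectives and nouns ending in a noun), the `max_len` cutoff, and the function name `candidate_strings` are all assumptions made for illustration, and the input is assumed to be already tagged.

```python
import re

# Hypothetical one-letter tag codes for this sketch:
# A = adjective, N = noun, P = preposition, O = anything else.
# Assumed filter: a run of adjectives/nouns that ends in a noun.
FILTER = re.compile(r"[AN]+N")

def candidate_strings(tagged, max_len=5):
    """Return every multi-word span whose tag sequence matches FILTER."""
    # One character per token, so string indices line up with token indices.
    tags = "".join(tag for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        for end in range(start + 2, min(start + max_len, len(tagged)) + 1):
            if FILTER.fullmatch(tags[start:end]):
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found

sentence = [("the", "O"), ("soft", "A"), ("contact", "N"), ("lens", "N"),
            ("is", "O"), ("of", "P"), ("the", "O"), ("eye", "N")]
print(candidate_strings(sentence))
```

Under these assumptions, strings such as of the never reach the statistical step, because their tag sequences do not match the pattern. Admitting prepositions would need a wider pattern, which this sketch leaves out.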

2.2 The statistical part

  • Consider the string soft contact lens. This is a term in ophthalmology. A method that uses frequency of occurrence would extract it, provided that it appears frequently enough in the corpus. Its substrings, soft contact and contact lens, would also be extracted, since they would have frequencies at least as high as soft contact lens (and they satisfy the linguistic filter used for the extraction of soft contact lens). However, soft contact is not a term in ophthalmology.
  • Consider the following two sets of terms from computer science.

| real time clock | floating point arithmetic |
| real time expert system | floating point constant |
| real time image generation | floating point operation |
| real time output | floating point routine |
| real time systems | |

  • Both of these sets contain nested terms. We call nested terms those that appear within other longer terms, and may or may not appear by themselves in the corpus. The first set contains the term real time and the second the term floating point. Except for expert system, none of the other substrings (time clock, time expert system, time image generation, image generation, time output, time systems, point arithmetic, point constant, point operation, point routine) are terms. So substrings of terms may or may not be terms themselves.
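The frequency problem described above can be reproduced on a toy example. The sketch below is illustrative only: the "corpus" is one invented occurrence of each term from the first set, and the helper names (`ngrams`, `substring_frequencies`, `nested_in`) are hypothetical. It counts every word n-gram inside each occurrence and lists which longer strings contain a given candidate.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous word sequences of length n, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def substring_frequencies(occurrences, min_len=2):
    """Count every word n-gram (n >= min_len) inside each occurrence."""
    counts = Counter()
    for text in occurrences:
        words = text.split()
        for n in range(min_len, len(words) + 1):
            counts.update(ngrams(words, n))
    return counts

def nested_in(candidate, longer_terms):
    """Longer terms that contain `candidate` as a contiguous word sequence."""
    n = len(candidate.split())
    return [t for t in longer_terms
            if t != candidate and candidate in ngrams(t.split(), n)]

# Toy data (an assumption): one occurrence of each term from the first set.
occurrences = ["real time clock", "real time expert system",
               "real time image generation", "real time output",
               "real time systems"]
freq = substring_frequencies(occurrences)
print(freq["real time"], freq["time clock"], freq["expert system"])
print(nested_in("real time", occurrences))
```

On this toy data, time clock gets the same raw count as real time clock even though it is not a term, so frequency alone cannot separate it from a genuine nested term such as expert system.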

References

  • [1] Sophia Ananiadou. (1988). “Towards a Methodology for Automatic Term Recognition.” PhD thesis, University of Manchester Institute of Science and Technology.
  • [2] Sophia Ananiadou. (1994). “A Methodology for Automatic Term Recognition.” In: Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, pages 1034–1038.
  • [6] Ido Dagan, and Kenneth W. Church. (1995). “Termight: Identifying and translating technical terminology.” In: Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1995).
  • [8] Beatrice Daille, Eric Gaussier, and Jean-Marc Lange. (1994). “Towards Automatic Extraction of Monolingual and Bilingual Terminology.” In: Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, pages 515–521.
  • [21] Andy Lauriston. (1996). “Automatic Term Recognition: performance of Linguistic and Statistical Techniques.” PhD thesis, University of Manchester Institute of Science and Technology.
  • [27] Juan C. Sager. (1990). “A Practical Course in Terminology Processing.” John Benjamins Publishing Company.
  • [28] Juan C. Sager, David Dungworth, and Peter F. McDonald. (1980). “English Special Languages: principles and practice in science and technology.” Oscar Brandstetter Verlag.


| Page | Author | Title | Journal | DOI | Year |
| 2000 AutomaticRecogOfMultiWordTerms | Sophia Ananiadou, Katerina Frantzi, Hideki Mima | Automatic Recognition of Multi-Word Terms: The C-value/NC-value method | International Journal on Digital Libraries | 10.1007/s007999900023 | 2000 |