2009 AProbabilisticFrmwrkForAutTermRecog

(Wong et al., 2009) ⇒ Wilson Wong, Wei Liu, and Mohammed Bennamoun. (2009). “A Probabilistic Framework for Automatic Term Recognition.” In: Intelligent Data Analysis, 13(4). doi:10.3233/IDA-2009-0379

Subject Headings: Term Recognition Task.

Notes

Quotes

Author Keywords

Term recognition, termhood, term characteristic, Bayes' theorem, word distribution models, term extraction

Abstract

Term recognition identifies domain-relevant terms which are essential for discovering domain concepts and for the construction of terminologies required by a wide range of natural language applications. Many techniques have been developed in an attempt to numerically determine or quantify termhood based on term characteristics. Some of the apparent shortcomings of existing techniques are the ad-hoc combination of termhood evidence, mathematically-unfounded derivation of scores and implicit assumptions concerning term characteristics. We propose a probabilistic framework for formalising and combining qualitative evidence based on explicitly defined term characteristics to produce a new termhood measure. Our qualitative and quantitative evaluations demonstrate consistently better precision, recall and accuracy compared to three other existing ad-hoc measures.

1. Introduction

Technical terms, more commonly referred to as terms, are content-bearing lexical units which describe the various aspects of a particular domain. There are two types of terms, namely, simple terms (i.e. single-word terms) and complex terms (i.e. multi-word terms). In general, the task of identifying domain-relevant terms is referred to as automatic term recognition, term extraction or terminology mining. The broader scope of term recognition can also be viewed in terms of the computational problem of measuring termhood, which is the extent of a term's relevance to a particular domain [25]. Terms are particularly important for labeling or designating domain-specific concepts, and for contributing to the construction of terminologies, which are essentially enumerations of technical terms in a domain. Manual efforts in term recognition are no longer viable as more new terms come into use and new meaning is added to existing terms as a result of information explosion. Coupled with the significance of terminologies to a wide range of applications such as ontology learning, machine translation and thesaurus construction, automatic term recognition is the next logical solution.

Very often, term recognition is considered as similar or equivalent to named-entity recognition, information retrieval and term relatedness measurement. An obvious dissimilarity between named-entity recognition and term recognition is that the former is a deterministic problem of classification whereas the latter involves the subjective measurement of relevance and ranking. Hence, unlike the evaluation of named-entity recognition where various platforms such as the BioCreAtIvE Task 1 [19] and the Message Understanding Conference (MUC) [9] are readily available, determining the performance of term recognition remains an extremely subjective problem. Having closer resemblance to information retrieval in that both involve relevance ranking, term recognition does have its unique requirements [25]. Unlike information retrieval where information relevance can be evaluated based on user information needs, term recognition does not have user queries as evidence for deciding on the domain relevance of terms. In general, term recognition can be performed with or without initial seedterms as evidence. The seedterms enable term recognition to be conducted in a controlled environment and offer more predictable outcomes. This approach of term recognition using seedterms, also referred to as guided term recognition, is in some aspects similar to measuring term relatedness. The relevance of terms to a domain in guided term recognition is determined in terms of their semantic relatedness with the domain seedterms. Therefore, existing semantic similarity or relatedness measures based on lexical information (e.g. WordNet [41], Wikipedia [54]), corpus statistics (e.g. Web corpus [12]), or the combination of both [23] are available for use. Without using seedterms, term recognition relies on term characteristics as evidence. This term recognition approach is far more difficult and faces numerous challenges. The focus of this paper is on term recognition without seedterms.

…


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2009 AProbabilisticFrmwrkForAutTermRecog	Wilson Wong Wei Liu Mohammed Bennamoun			A Probabilistic Framework for Automatic Term Recognition		Intelligent Data Analysis	http://goanna.cs.rmit.edu.au/~e87368/paper/233281941.pdf	10.3233/IDA-2009-0379		2009