1999 LanguageIndepNERCombMorphAndCntxtEvid

Jump to navigation Jump to search

Subject Headings: Semi-Supervised Named Entity Recognition Algorithm, Named Entity Recognizer.


Cited By



Identifying and classifying personal, geographic, institutional or other names in a text is an important task for numerous applications. This paper describes and evaluates a language-independent bootstrapping algorithm based on iterative learning and re-estimation of contextual and morphological patterns captured in hierarchically smoothed trie models. The algorithm learns from unannotated text and achieves competitive performance when trained on a very short labelled name list with no other required language-specific information, tokenizers or tools.

1 Introduction

The ability to determine the named entities in a text has been established as an important task for several natural language processing areas, including information retrieval, machine translation, information extraction and language understanding. For the 1995 Message Understanding Conference (MUC-6), a separate named entity recognition task was developed and the best systems achieved impressive accuracy (with an F-measure approaching 95%). What should be underlined here is that these systems were trained for a specific domain and a particular langnage (English), typically making use of hand-coded rules, taggers, parsers and semantic lexicons. Indeed, most named entity recognizers that have been published either use tagged text, perform syntactical and morphological analysis or use semantic information for contextual clues. Even the systems that do not make use of extensive knowledge about a particular language, such as Nominator (Choi et al., 1997), still typically use large data files containing lists of names, exceptions, personal and organizational identifiers.

Our aim has been to build a maximally language independent system for both named-entity identification and classification, using minimal information about the source language. The applicability of AI-style algorithms and supervised methods is limited in the multilingual case because of the cost of knowledge databases and manually annotated corpora. Therefore, a much more suitable approach is to consider an EM-style bootstrapping algorithm. In terms of world knowledge, the simplest and most relevant resource for this task is a database of known names. For each entity class to be recognized and tagged, it is assumed that the user can provide a short list (order of one hundred) of unambiguous examples (seeds). Of course the more examples provided, the better the results, but what we try to prove is that even with minimal knowledge good results can be achieved. Additionally some basic particularities of the language should be known: capitalization (if it exists and is relevant - some languages do not make use of capitalization; in others, such as German, the capitalization is not of great help), allowable word separators (if they exist), and a few frequent exceptions (like the pronoun "/" in English). Although such information can be utilised if present, it is not required, and no other assumptions are made in the general model.

6 Conclusion

This paper has presented an algorithm for the minimally supervised learning of named entity recognizers given short name lists as seed data (typically 40-100 example words per entity class). The algorithm uses hierarchically smoothed trie structures for modeling morphological and contextual probabilities effectively in a language independent framework, overcoming the need for fixed token boundaries or history lengths. Th e combination of relatively independent morphological and contextual evidence sources in an iterative bootstrapping framework converges upon a successful named entity recognizer, achieving a competitive 70.5%-75.4% F-measure (measuring both named entity identification and classification) when applied to Romanian text. Fixed k-way classification accuracy on given entities ranges between 73%-79% on 5 diverse languages for a difficult firstname/lastname/place partition, and approaches 92% accuracy for the simpler person/place discrimination. These results were achieved using only unannotated training texts, with absolutely no required language-specific information, tokenizers or other tools, and requiring no more than 15 minutes total human effort in training (for short wordlist creation) The observed robust and consistent performance and very rapid, low cost rampup across 5 quite different languages shows the potential for further successful and diverse applications of this work to new languages and domains.



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
1999 LanguageIndepNERCombMorphAndCntxtEvidDavid YarowskyLanguage Independent Named Entity Recognition Combining Morphological and Contextual Evidencehttp://acl.ldc.upenn.edu/W/W99/W99-0612.pdf