2006 BootstrappingNERwAutoGenGazLists


Subject Headings: Bootstrapped Named Entity Recognition Algorithm, Term Gazetteer.



Current Named Entity Recognition systems suffer from a lack of hand-tagged data as well as from degradation when moving to other domains. This paper explores two aspects: the [[automatic generation of gazetteer lists from unlabeled data]], and the building of a Named Entity Recognition system with labeled and unlabeled data.

1 Introduction

Automatic information extraction and information retrieval concerning a particular person, location, organization, or title of a movie or book relate closely to the Named Entity Recognition (NER) task. NER consists of detecting the most salient and informative elements in a text, such as names of people, company names, locations, monetary currencies, and dates. Early NER systems (Fisher et al., 1997), (Black et al., 1998), which participated in the Message Understanding Conferences (MUC), used linguistic tools and gazetteer lists. However, these resources are difficult to develop and are domain sensitive.

To surmount these obstacles, the application of machine learning approaches to NER became a research subject. Various state-of-the-art machine learning algorithms, such as Maximum Entropy (Borthwick, 1999), AdaBoost (Carreras et al., 2002), Hidden Markov Models (Bikel et al., ), and Memory-based learning (Tjong Kim Sang, 2002b), have been used. (Klein et al., 2003), (Mayfield et al., 2003), (Wu et al., 2003), and (Kozareva et al., 2005c), among others, combined several classifiers to obtain a better named entity coverage rate.

Nevertheless, all these machine learning algorithms rely on previously hand-labeled training data. Obtaining such data is labor-intensive and time-consuming, and such data might not exist at all for languages with limited funding. These resource limitations directed NER research (Collins and Singer, 1999), (Carreras et al., 2003), (Kozareva et al., 2005a) toward the use of semi-supervised techniques. These techniques are needed because we live in a multilingual society in which access to information from sources in various languages is a reality. As a result, the development of NER systems for languages other than English commenced.

This paper presents the development of a Spanish Named Entity Recognition system based on a machine learning approach. No morphological or syntactic information was used for it. However, we propose and incorporate a very simple method for automatic gazetteer construction. This method can be easily adapted to other languages and is cheap to obtain, as it relies on n-gram extraction from unlabeled data. We compare the performance of our NER system when labeled and unlabeled training data are available.
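The idea of building a gazetteer from n-grams in unlabeled text can be illustrated with a minimal sketch. The excerpt above does not specify the extraction patterns, so everything below is an assumption made for illustration: the function name `candidate_gazetteer`, the trigger word `en` (a Spanish preposition that often precedes locations), the capitalization test, the frequency threshold `min_count`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def candidate_gazetteer(sentences, trigger="en", min_count=2):
    """Collect capitalized words that follow a trigger word in raw text.

    Hypothetical sketch: scans bigrams (trigger, word) in unlabeled
    sentences and keeps words seen at least `min_count` times.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Walk over adjacent token pairs, i.e. bigrams.
        for left, right in zip(tokens, tokens[1:]):
            word = right.strip(".,;:!?")
            # Assumed heuristic: trigger word followed by a capitalized token.
            if left.lower() == trigger and word[:1].isupper():
                counts[word] += 1
    # Frequency filtering stands in for a validation step.
    return {word for word, count in counts.items() if count >= min_count}


# Toy unlabeled corpus (invented for this example).
corpus = [
    "Vive en Madrid desde 1990.",
    "Trabaja en Madrid y estudia en Alicante.",
    "Nació en Alicante.",
    "Lo dijo en serio.",
]
locations = candidate_gazetteer(corpus)
```

With this toy corpus, "Madrid" and "Alicante" each follow the trigger twice and are kept, while "serio" is rejected because it is not capitalized. A real system would likely need richer patterns and a stronger validation step than a simple count.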

6 Conclusions and future work

In this paper we proposed and implemented a pattern validation search in an unlabeled corpus, through which gazetteer lists were automatically generated. The gazetteers were used as features by a Named Entity Recognition system. The performance of this NER system was measured both when labeled and when unlabeled training data were available. A comparative study of the information contributed by the gazetteers to the entity classification process was presented.
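Using gazetteers as features can be sketched as a per-token membership lookup. This is only an illustrative assumption about how such features might be encoded: the function `gazetteer_features`, the feature names, and the example gazetteer entries are hypothetical, and the excerpt does not describe the actual feature set or classifier used in the paper.

```python
def gazetteer_features(tokens, index, gazetteers):
    """Build a feature dictionary for the token at `index`.

    Hypothetical sketch: combines two simple surface features with one
    boolean membership feature per gazetteer list.
    """
    token = tokens[index]
    features = {
        "word": token.lower(),
        "is_capitalized": token[:1].isupper(),
    }
    # One boolean feature per gazetteer, e.g. "in_location_gazetteer".
    for name, entries in gazetteers.items():
        features[f"in_{name}_gazetteer"] = token in entries
    return features


# Toy gazetteers (invented for this example).
gazetteers = {
    "location": {"Madrid", "Alicante"},
    "person": {"María", "José"},
}
tokens = ["María", "vive", "en", "Madrid"]
features = [gazetteer_features(tokens, i, gazetteers) for i in range(len(tokens))]
```

Each feature dictionary could then be passed to any classifier that accepts sparse symbolic features. Removing the `in_*_gazetteer` entries and retraining is one simple way to measure how much the gazetteers contribute.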

In the future we intend to develop automatic gazetteers for organization and product names. It is also of interest to divide location gazetteers into subcategories such as countries, cities, rivers, and mountains, as they are useful for Geographic Information Retrieval systems. To study the behavior of named entity bootstrapping, other domains such as bioinformatics will be explored.


Author: Zornitsa Kozareva
Title: Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists
Journal: Proceedings of EACL
Title URL: http://acl.ldc.upenn.edu/eacl2006/companion/srw/02a sw 10 kozareva.pdf
Year: 2006