2005 AutomaticEntityDisambiguation

Subject Headings: Entity Mention Resolution Task, Information Extraction, Link Analysis, HUMINT, OSINT.

Quotes

Abstract

  • Entity disambiguation resolves the many-to-many correspondence between mentions of entities in text and unique real-world entities. Entity disambiguation can bring to bear global (corpus level) statistics to improve the performance of named entity recognition systems. More importantly, intelligence analysts are keenly interested in relationships between real-world entities. Entity disambiguation makes possible additional types of relation assertions and affects relation extraction performance assessment. Finally, link analysis and inference inherently operate at the level of entities, not text strings. Thus, entity disambiguation is a prerequisite to carrying out these higher-level operations on information extracted from plain text. This paper describes Fair Isaac’s automatic entity disambiguation capability and its performance.

Introduction

  • Spoken and written text consists of characters, words, names, sentences, documents, conversations, and so on, but the world that the text describes consists of distinct objects and events. Intelligence analysts have access to an enormous amount of text, but are ultimately interested in actual persons and organizations that interact in the real world. Entity disambiguation (sometimes also referred to as entity tracking) is the process of determining which names, words, or phrases in text correspond to distinct persons, organizations, locations, or other entities. This determination is absolutely essential for reasoning, inference, and the examination of social network structures based on information derived from text.
  • We use the term entity to mean an object or set of objects in the world. A mention is a reference to an entity such as a word or phrase in a document. Entities may be referenced by their name, indicated by a common noun or noun phrase, or represented by a pronoun. Mentions may aggregate with other mentions that refer to the same specific real-world object, and, taken together, the aggregated mentions model an entity. These corpus-wide aggregated models of entities are of primary importance to the analyst, while the individual mentions of an entity are still of secondary importance (Mitchell et al. 2004).
  • Entity disambiguation inherently involves resolving many-to-many relationships. Multiple distinct strings, such as “Abdul Khan”, “Dr. Khan”, and “’Abd al-Qadir Khan”, may refer to the same entity. Conversely, multiple identical mentions may refer to distinct entities.
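
  • The many-to-many relationship described above can be sketched as a pair of indexes. This is only an illustration, not the paper's implementation: the document ids, the entity labels (`person_A`, `person_B`), and the helper `build_indexes` are hypothetical names introduced here.

```python
from collections import defaultdict

# Hypothetical, already-resolved mentions: (document id, surface string, entity id).
# The entity ids are invented labels for illustration only.
resolved_mentions = [
    ("doc1", "Abdul Khan", "person_A"),
    ("doc1", "Dr. Khan", "person_A"),
    ("doc2", "'Abd al-Qadir Khan", "person_A"),
    ("doc3", "Abdul Khan", "person_B"),  # same string, different person
]


def build_indexes(mentions):
    """Return (string -> entity ids, entity id -> strings) as dicts of sets."""
    entities_by_string = defaultdict(set)
    strings_by_entity = defaultdict(set)
    for _doc_id, surface, entity_id in mentions:
        entities_by_string[surface].add(entity_id)
        strings_by_entity[entity_id].add(surface)
    return entities_by_string, strings_by_entity


entities_by_string, strings_by_entity = build_indexes(resolved_mentions)

# Strings that map to more than one entity (one name, many entities).
ambiguous_strings = sorted(
    s for s, ents in entities_by_string.items() if len(ents) > 1
)
# Entities that are reached through more than one string (one entity, many names).
multi_name_entities = sorted(
    e for e, names in strings_by_entity.items() if len(names) > 1
)
```

  • Both lists being non-empty is what makes the correspondence many-to-many rather than a simple lookup from string to entity.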

Methodology

  • In unstructured text, each document provides a natural context for entity disambiguation. Within a document, two mentions of “Abdul Khan” probably do refer to the same person unless there is evidence to the contrary. Similarly, “NFC” and “National Fertilisers Corporation” probably refer to the same organization if they occur in the same document, barring other evidence. Thus, we first carry out within-document co-reference resolution, aggregating information about each entity mentioned in each document. We then use these entity attributes as features in determining which documents deal with the same entity.
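
  • The two-stage approach above (within-document aggregation first, then cross-document comparison) can be sketched as follows. This is a simplified illustration under stated assumptions: the acronym rule, the Jaccard overlap of context words, the `threshold` value, and all names (`DocEntity`, `within_document`, `same_across_docs`) are hypothetical stand-ins, since the excerpt does not specify the features or scoring actually used.

```python
from dataclasses import dataclass, field


@dataclass
class DocEntity:
    """One entity as seen inside a single document."""
    doc_id: str
    names: set = field(default_factory=set)
    context: set = field(default_factory=set)  # assumed feature: co-occurring words


def acronym(name):
    """Initials of the capitalized words, e.g. 'National Fertilisers Corporation' -> 'NFC'."""
    return "".join(word[0] for word in name.split() if word[0].isupper())


def same_within_doc(a, b):
    # Assumed rule: identical strings or an acronym match corefer inside one document.
    return a == b or acronym(a) == b or acronym(b) == a


def within_document(doc_id, mentions, context):
    """Stage 1: aggregate the mentions of one document into document-level entities."""
    entities = []
    for mention in mentions:
        for entity in entities:
            if any(same_within_doc(mention, name) for name in entity.names):
                entity.names.add(mention)
                break
        else:
            entities.append(DocEntity(doc_id, {mention}, set(context)))
    return entities


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def same_across_docs(e1, e2, threshold=0.3):
    """Stage 2: a shared name is not enough; the contexts must also agree."""
    return bool(e1.names & e2.names) and jaccard(e1.context, e2.context) >= threshold


e1 = within_document("doc1", ["NFC", "National Fertilisers Corporation", "NFC"],
                     {"fertiliser", "plant", "pakistan"})[0]
e2 = within_document("doc2", ["NFC"], {"fertiliser", "pakistan", "export"})[0]
e3 = within_document("doc3", ["NFC"], {"football", "playoffs", "conference"})[0]
```

  • In this toy data, `e1` and `e2` share a name and similar context and are treated as one entity, while `e3` shares only the string “NFC” and is kept separate.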

Within-Document Disambiguation

  • When dealing with unstructured text, a named entity recognition (NER) system provides the input to the entity disambiguation. We currently use two NER systems in parallel. One is based on supervised training of hidden Markov models (Bikel 1997) followed by name list and rule-based postprocessing. The other utilizes a set of engineered regular expressions. Both provide mention start and stop boundaries, entity type assertions, and confidence values. The two systems complement one another in that the former uses local context and provides higher coverage, whereas the latter is more accurate, especially for numeric types such as dates and monetary amounts.
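
  • A rough sketch of how two NER outputs of this shape (boundaries, type, confidence) might be combined. The patterns, the confidence values, the hand-written `hmm_output`, and the "highest confidence wins on overlap" policy in `merge` are all assumptions made for illustration; the excerpt does not state how the two systems' results are reconciled, and the paper's engineered expressions are not given.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int
    stop: int
    etype: str
    confidence: float


# Illustrative patterns only, standing in for the engineered regular expressions.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
    "DATE": re.compile(
        r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December),? \d{4}\b"
    ),
}


def regex_ner(text, confidence=0.95):
    """Pattern-based recognizer for numeric types such as dates and money."""
    return [
        Mention(m.start(), m.end(), etype, confidence)
        for etype, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def merge(*outputs):
    """Assumed policy: among overlapping spans keep the most confident one."""
    kept = []
    candidates = [m for output in outputs for m in output]
    for m in sorted(candidates, key=lambda m: -m.confidence):
        if all(m.stop <= k.start or m.start >= k.stop for k in kept):
            kept.append(m)
    return sorted(kept)


text = "NFC paid $5 million on 21 April, 2005."
# Hand-written stand-in for the statistical tagger's output on this sentence.
hmm_output = [Mention(0, 3, "ORG", 0.8), Mention(23, 31, "DATE", 0.6)]
merged = merge(hmm_output, regex_ner(text))
```

  • Here the statistical tagger contributes the organization name, while the pattern-based recognizer supplies the full date span and the monetary amount.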

Cross-Document Disambiguation

  • Key characteristics of this cross-document entity disambiguation algorithm, especially relative to other such methods (Bagga and Baldwin 1998; Gooi and Allan 2004; Huang et al. 2003; Kalashnikov and Mehrotra 2005; Mann and Yarowsky 2003; Mihalcea 2003; Ravin and Kazi 1999) are:
    • Recognizes when identical names correspond to distinct entities.
    • Recognizes when different names (including spelling and transliteration variations) correspond to a single entity.
    • Uses many different sources of context as evidence.
    • Achieves high disambiguation performance.
    • Achieves high computational throughput.
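
  • As an illustration of matching spelling and transliteration variants, the sketch below scores name strings after light normalization. It uses Python's standard `difflib.SequenceMatcher` as a generic stand-in; this is not the similarity measure used in the paper, and `normalize` and `name_similarity` are hypothetical helpers.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks, apostrophes, and hyphens are neither letters nor spaces.
    letters = (c for c in decomposed if c.isalpha() or c.isspace())
    return " ".join("".join(letters).lower().split())


def name_similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


variant_score = name_similarity("Mohammed", "Muhammad")
unrelated_score = name_similarity("Mohammed", "Khan")
```

  • String similarity alone cannot separate identical names that belong to distinct entities, which is why the algorithm also draws on other sources of context as evidence.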

Benefits to NER and Subsequent Processing

Named Entity Recognition

  • Most NER systems utilize only locally-available information for determining entity boundaries and entity types. Information from outside the current sentence or document is ignored. Entity disambiguation makes it possible to utilize this information to refine the NER results.
  • For example, if the entity disambiguation system determines that two mentions, “Dean Smith” and “Michael Smith” (whose title is “Dean”), correspond to the same entity, it is possible to correct the NER output to recognize the first “Dean” as a title rather than a given name.
  • Our system explicitly carries out this process at the stage of within-document entity disambiguation and in some cases (e.g. Table 4) also implicitly achieves this effect via cross-document entity disambiguation.
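
  • The “Dean Smith” correction can be sketched as a small relabeling step that consults entity-level attributes. The dictionary layout and the function `refine_name_parts` are hypothetical and only illustrate the idea; the excerpt does not describe the system's actual data structures.

```python
def refine_name_parts(mention, entity):
    """Relabel the leading token as a title when entity-level evidence supports it."""
    tokens = mention.split()
    first, rest = tokens[0], tokens[1:]
    family = " ".join(rest)
    # Evidence from the aggregated entity: a known title that is not a known given name.
    if first in entity["titles"] and first not in entity["given_names"]:
        return {"title": first, "given": None, "family": family}
    return {"title": None, "given": first, "family": family}


# Attributes aggregated from the coreferent mention "Michael Smith" (title: Dean).
entity = {"given_names": {"Michael"}, "family": "Smith", "titles": {"Dean"}}

corrected = refine_name_parts("Dean Smith", entity)
unchanged = refine_name_parts("Michael Smith", entity)
```

  • Without the entity-level attributes, a purely local tagger would have no basis for preferring the title reading of “Dean” over the given-name reading.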

References

  • Bagga, A. and Baldwin, B. (1998). Entity-based Cross-document Coreferencing Using the Vector Space Model. 17th International Conference on Computational Linguistics (CoLing-ACL). Montreal, Canada. 10-14 August, 1998, 79-85.
  • Bikel, D. M.; Miller, S.; Schwartz, R. and Weischedel, R. (1997). Nymble: a High-Performance Learning Name-finder. Fifth Conference on Applied Natural Language Processing. Washington, D.C. 31 March - 3 April, 1997, 194-201.
  • Caid, W. and Oing, P. (1997). System and Method of Context Vector Generation and Retrieval. U.S. Patent No. 5,619,709.
  • Gooi, C. H. and Allan, J. (2004). Cross-Document Coreference on a Large Scale Corpus. Human Language Technology Conference (HLT-NAACL). Boston, Massachusetts. 2-7 May, 2004, 9-16.
  • Hobbs, J. (1978). Resolving Pronoun References. Lingua 44(4): 311-338.
  • Huang, F.; Vogel, S. and Waibel, A. (2003). Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization. ACL-03 Workshop on Multilingual and Mixed-language Named Entity Recognition. Sapporo, Japan. 12 July, 2003, 9-16.
  • Kalashnikov, D. V. and Mehrotra, S. (2005). A Probabilistic Model for Entity Disambiguation Using Relationships. SIAM International Conference on Data Mining (SDM). Newport Beach, California. 21-23 April, 2005.
  • Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Professional.
  • Mann, G. S. and Yarowsky, D. (2003). Unsupervised Personal Name Disambiguation. Conference on Computational Natural Language Learning (CoNLL). Edmonton, Canada. 31 May - 1 June, 2003, 33-40.
  • Mihalcea, R. (2003). The Role of Non-Ambiguous Words in Natural Language Disambiguation. Conference on Recent Advances in Natural Language Processing (RANLP). Borovetz, Bulgaria. 10-12 September, 2003.
  • Mitchell, A.; Strassel, S.; Przybocki, P.; Davis, J. K.; Doddington, G.; Grishman, R.; Meyers, A.; Brunstein, A.; Ferro, L. and Sundheim, B. (2004). Annotation Guidelines for Entity Detection and Tracking (EDT), Version 4.2.6. http://www.ldc.upenn.edu/Projects/ACE/.
  • Ravin, Y. and Kazi, Z. (1999). Is Hillary Rodham Clinton the President? Disambiguating Names across Documents. ACL 1999 Workshop on Coreference and Its Applications. College Park, Maryland. 22 June, 1999, 9-1,


  • Page: 2005 AutomaticEntityDisambiguation
  • Author: Matthias Blume
  • Title: Automatic Entity Disambiguation: Benefits to NER, Relation Extraction, Link Analysis, and Inference
  • Type or journal: Int
  • URL: https://analysis.mitre.org/proceedings/Final Papers Files/12 Camera Ready Paper.pdf
  • Year: 2005