2004 IdentificationandTracingofAmbig


Subject Headings: Entity Mention Coreference Resolution, Ontology-based Semantic Annotation Task.





A given entity – representing a person, a location or an organization – may be mentioned in text in multiple, ambiguous ways. Understanding natural language requires identifying whether different mentions of a name, within and across documents, represent the same entity.

We present two machine learning approaches to this problem, which we call the “Robust Reading” problem. Our first approach is a discriminative approach, trained in a supervised way. Our second approach is a generative model, at the heart of which is a view on how documents are generated and how names (of different entity types) are “sprinkled” into them. In its most general form, our model assumes: (1) a joint distribution over entities (e.g., a document that mentions “President Kennedy” is more likely to mention “Oswald” or “White House” than “Roger Clemens”), (2) an “author” model, that assumes that at least one mention of an entity in a document is easily identifiable, and then generates other mentions via (3) an appearance model, governing how mentions are transformed from the “representative” mention.

We show that both approaches perform very accurately, in the range of 90% − 95% F1 measure for different entity types, much better than previous approaches to (some aspects of) this problem. Our extensive experiments exhibit the contribution of relational and structural features and, somewhat surprisingly, that the assumptions made within our generative model are strong enough to yield a very powerful approach, that performs better than a supervised approach with limited supervised information.
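The F1 measure cited above is the harmonic mean of precision and recall. The excerpt does not say over which units it is computed, so the following is only a minimal sketch under the assumption that it is scored over pairs of mentions judged to refer to the same entity; the function names and the toy clusters are hypothetical.

```python
from itertools import combinations


def coreferent_pairs(clusters):
    """Return the set of unordered mention pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each unordered pair one canonical representation.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(gold_clusters, predicted_clusters):
    """Pairwise F1 (an assumed evaluation unit, not confirmed by the source)."""
    gold = coreferent_pairs(gold_clusters)
    pred = coreferent_pairs(predicted_clusters)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example: five mentions, two true entities.
gold = [{"m1", "m2", "m3"}, {"m4", "m5"}]
pred = [{"m1", "m2"}, {"m3", "m4", "m5"}]
score = pairwise_f1(gold, pred)
```

In this toy case two of the four predicted pairs are correct and two of the four gold pairs are recovered, so precision, recall, and F1 all come out to 0.5.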


Reading and understanding text requires the ability to disambiguate at several levels, abstracting away details and using background knowledge in a variety of ways. One of the difficulties that humans resolve instantaneously and unconsciously is that of reading names. Most names of people, locations, organizations and others, have multiple writings that are being used freely within and across documents.

The variability in writing a given concept, along with the fact that different concepts may have very similar writings, poses a significant challenge to progress in natural language processing. Consider, for example, an open domain question answering system (Voorhees 2002) that attempts, given a question like: “When was President Kennedy born?” to search a large collection of articles in order to pinpoint the concise answer: “on May 29, 1917.” The sentence, and even the document that contains the answer, may not contain the name “President Kennedy”; it may refer to this entity as “Kennedy”, “JFK” or “John Fitzgerald Kennedy”. Other documents may state that “John F. Kennedy, Jr. was born on November 25, 1960”, but this fact refers to our target entity’s son. Other mentions, such as “Senator Kennedy” or “Mrs. Kennedy” are even “closer” to the writing of the target entity, but clearly refer to different entities. Even the statement “John Kennedy, born 5-29-1941” turns out to refer to a different entity, as one can tell observing that the document discusses Kennedy’s batting statistics. A similar problem exists for other entity types, such as locations and organizations. Ad hoc solutions to this problem, as we show, fail to provide a reliable and accurate solution.

This paper presents two learning approaches to the problem of Robust Reading, compares them to existing approaches and shows that an unsupervised learning approach can be applied very successfully to this problem, provided that it is used along with strong but realistic assumptions on the nature of the use of names in documents.

Our first model is a discriminative approach that models the problem as that of deciding whether any two names mentioned in a collection of documents represent the same entity. This straightforward modelling of the problem results in a classification problem – as has been done by several other authors (Cohen, Ravikumar, & Fienberg 2003; Bilenko & Mooney 2003) – allowing us to compare our results with these. This is a standard pairwise classification task, under a supervised learning protocol; our main contribution in this part is to show how relational (string and token-level) features and structural features, representing transformations between names, can improve the performance of this classifier.
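To make the idea of relational and structural features for a name pair concrete, here is a minimal sketch. The specific features (token overlap, last-token match, initials, acronyms) and all function names are illustrative assumptions, not the feature set used in the paper; a supervised pairwise classifier would consume a vector like this for each pair of names.

```python
def normalize(name):
    """Lowercase a name and split it into tokens, ignoring periods and commas."""
    return name.lower().replace(".", " ").replace(",", " ").split()


def acronym(tokens):
    """Concatenate the first letter of every token."""
    return "".join(token[0] for token in tokens)


def count_initials(tokens, other_tokens):
    """Count single-letter tokens that abbreviate a longer token in the other name."""
    return sum(
        1
        for token in tokens
        if len(token) == 1
        and any(len(o) > 1 and o.startswith(token) for o in other_tokens)
    )


def is_acronym_of(short_tokens, long_tokens):
    """True if a one-token name spells the initials of a multi-token name."""
    return (
        len(short_tokens) == 1
        and len(long_tokens) > 1
        and short_tokens[0] == acronym(long_tokens)
    )


def pair_features(name_a, name_b):
    """Hypothetical feature dictionary describing how two names relate."""
    ta, tb = normalize(name_a), normalize(name_b)
    sa, sb = set(ta), set(tb)
    union = sa | sb
    return {
        # Relational (token-level) features
        "exact_match": ta == tb,
        "token_jaccard": len(sa & sb) / len(union) if union else 0.0,
        "last_token_match": bool(ta) and bool(tb) and ta[-1] == tb[-1],
        # Structural features: transformations from one name to the other
        "initial_matches": count_initials(ta, tb) + count_initials(tb, ta),
        "is_acronym": is_acronym_of(ta, tb) or is_acronym_of(tb, ta),
    }


full_vs_initial = pair_features("John F. Kennedy", "John Fitzgerald Kennedy")
acronym_vs_full = pair_features("JFK", "John Fitzgerald Kennedy")
```

Note that surface features alone cannot separate “John F. Kennedy” from “John F. Kennedy, Jr.”, which is one reason the text argues that ad hoc string matching is unreliable.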

Several attempts have been made in the literature to improve the results by performing some global optimization, with the above mentioned pairwise classifier as the similarity metric. The results of these attempts were not conclusive and we provide some explanation for why that is. We prove that, assuming optimal clustering, this approach reduces the error of a pairwise classifier in the case of two classes (corresponding to two entities); however, in the more general case, when the number of entities (classes) is greater than 2, global optimization mechanisms could be worse than pairwise classification. Our experiments concur with this. However, as we show, if one first splits the data in some coherent way – e.g., to groups of documents originated at about the same time period – this can aid clustering significantly. This observation motivates our second model.

We developed a global probabilistic model for Robust Reading, detailed in (Li, Morie, & Roth 2004). Here we briefly illustrate one of its instantiations and concentrate on its experimental study and a comparison to the discriminative models. At the heart of this approach is a view on how documents are generated and how names (of different types) are “sprinkled” into them. In its most general form, our model assumes: (1) a joint distribution over entities, so that a document that mentions “President Kennedy” is more likely to mention “Oswald” or “White House” than “Roger Clemens”; (2) an “author” model, that makes sure that at least one mention of a name in a document is easily identifiable (after all, that’s the author’s goal), and then generates other mentions via (3) an appearance model, governing how mentions are transformed from the “representative” mention.
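The three assumptions of the generative view can be illustrated with a toy sampler. This is only a sketch under strong simplifying assumptions, not the model from the paper: the joint distribution is reduced to picking one group of related entities, the representative mention is taken to be the full canonical name, and the appearance model is a uniform choice from a hand-written variant list. All data and names below are hypothetical.

```python
import random

# Hypothetical groups of entities that tend to co-occur in a document.
ENTITY_GROUPS = [
    ["John F. Kennedy", "Lee Harvey Oswald", "White House"],
    ["Roger Clemens", "New York Yankees"],
]

# Hypothetical surface variants each entity's representative name can turn into.
VARIANTS = {
    "John F. Kennedy": ["President Kennedy", "Kennedy", "JFK"],
    "Lee Harvey Oswald": ["Oswald", "Lee Oswald"],
    "White House": ["the White House"],
    "Roger Clemens": ["Clemens", "R. Clemens"],
    "New York Yankees": ["Yankees", "NY Yankees"],
}


def generate_document(rng, n_entities=2, mentions_per_entity=3):
    """Sample (entity, mentions) pairs following the three-step story."""
    # (1) Joint distribution over entities, crudely approximated by
    #     choosing one co-occurrence group and sampling entities from it.
    group = rng.choice(ENTITY_GROUPS)
    entities = rng.sample(group, k=min(n_entities, len(group)))

    document = []
    for entity in entities:
        # (2) Author model: one mention is the easily identifiable
        #     representative (assumed here to be the canonical name).
        mentions = [entity]
        # (3) Appearance model: the remaining mentions are variants
        #     derived from the representative mention.
        for _ in range(mentions_per_entity - 1):
            mentions.append(rng.choice(VARIANTS[entity]))
        document.append((entity, mentions))
    return document


doc = generate_document(random.Random(0))
```

Learning such a model without supervision amounts to inferring, from the observed mentions alone, which entities and representatives most plausibly produced them; the sampler only shows the direction in which the generative story runs.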

Our goal is to learn the model from a large corpus and use it to support Robust Reading. Given a collection of documents we learn the model in an unsupervised way; that is, the system is not told during training whether two mentions represent the same entity. We only assume, as in the discriminative model above, the ability to recognize names, using a named entity recognizer run as a preprocessor.


The paper presents two learning approaches to the “robust reading” problem – cross-document identification of names despite ambiguous writings. In addition to a standard modelling of the problem as a classification task, we developed a model that describes the natural generation process of a document and the process of how names are “sprinkled” into them, taking into account dependencies between entities across types and an “author” model. We have shown that both models gain a lot from incorporating structural and relational information – features in the classification model; coherent data splits for clustering and the natural generation process in the case of the probabilistic model.

The robust reading problem is a very significant barrier on our way toward supporting robust natural language processing, and the reliable and accurate results we show are thus very encouraging. Also important is the fact that our unsupervised model performs so well, since the availability of annotated data is, in many cases, a significant obstacle to good performance in NLP tasks.

In addition to further studies of the discriminative model, including going beyond the current noisy supervision (given at a global annotation level, although learning is done locally), exploring how much data is needed for a supervised model to perform as well as the unsupervised model, and whether the initialization of the unsupervised model can gain from supervision, there are several other critical issues we would like to address from the robust reading perspective. These include (1) incorporating more contextual information (like time and place) related to the target entities, both to support a better model and to allow temporal tracing of entities; (2) studying an incremental approach to learning the model; and (3) integrating this work with other aspects of coreference resolution (e.g., pronouns).



Author: Xin Li, Paul Morie, Dan Roth
Title: Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches
Year: 2004