2007 LargeScaleNEDisambigBasedOnWikiped

Subject Headings: Named Entity Disambiguation, Wikipedia-based Word Mention Normalization Task, Cucerzan Large-Scale Named Entity Disambiguation System.


Cited By


  • (Kulkarni et al., 2009) ⇒ Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, Soumen Chakrabarti. (2009). “Collective Annotation of Wikipedia Entities in Web Text.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557073.
    • QUOTE: To our knowledge, Cucerzan [4] was the first to recognize general interdependence between entity labels in the context of Wikipedia annotations.
  • (Medelyan et al., 2009) ⇒ Olena Medelyan, David Milne, Catherine Legg, and Ian H. Witten. (2009). “Mining Meaning from Wikipedia.” In: International Journal of Human-Computer Studies, 67(9). doi:10.1016/j.ijhcs.2009.05.004
    • QUOTE: Cucerzan [2007] identifies and disambiguates named entities in text. Like Bunescu and Paşca [2006], he first extracts a vocabulary from Wikipedia. It is divided into two parts, the first containing surface forms and the second the associated entities along with contextual information. The surface forms are titles of articles, redirects, and disambiguation pages, and anchor text used in links. This yields 1.4M entities, with an average of 2.4 surface forms each. Further <named entity, tag> pairs are extracted from Wikipedia list pages — e.g., Texas (band) receives a tag LIST_band name etymologies, because it appears in the list with this title — yielding a further 540,000 entries. Categories assigned to Wikipedia articles describing named entities serve as tags too, yielding 2.6M entries. Finally a context for each named entity is collected — e.g., parenthetical expressions in its title, phrases that appear as link anchors in the article’s first paragraph of the article, etc. — yielding 38M <named entity, context> pairs.

      To identify named entities in text, capitalization rules indicate which phrases are surface forms of named entities. Co-occurrence statistics generated from the web by a search engine help to identify boundaries between them (e.g. Whitney Museum of American Art is a single entity, whereas Whitney Museum in New York contains two). Lexical analysis is used to collate identical entities (e.g., Mr. Brown and Brown), and entities are tagged with their type (e.g., location, person) based on statistics collected from manually annotated data. Disambiguation is performed by comparing the similarity of the document in which the surface form appears with Wikipedia articles that represent all named entities that have been identified in it, and their context terms, and choosing the best match. Cucerzan [2007] achieves 88% accuracy on 5,000 entities appearing in Wikipedia articles, and 91% on 750 entities appearing in news stories.


  • (Vercoustre et al., 2008) ⇒ Anne-Marie Vercoustre, James A. Thom, and Jovan Pehcevski. (2008). “Entity ranking in Wikipedia.” In: Proceedings of the 2008 ACM Symposium on Applied Computing.
    • Cucerzan [8] uses Wikipedia data for named entity disambiguation. He first pre-processed a version of the Wikipedia collection (September 2006), and extracted more than 1.4 millions entities with an average of 2.4 surface forms by entities. He also extracted more than one million (entities, category) pairs that were further filtered down to 540 thousand pairs. Lexico-syntactic patterns, such as titles, links, paragraphs and lists, are used to build coreferences of entities in limited contexts. The knowledge extracted from Wikipedia is then used for improving entity disambiguation in the context of web and news search.
  • (Jijkoun et al., 2008) ⇒ Valentin Jijkoun, Mahboob Alam Khalid, Maarten Marx, and Maarten de Rijke. (2008). “Named Entity Normalization in User Generated Content.” In: Proceedings of the second workshop on Analytics for Noisy Unstructured Text Data (AND 2008).
    • Research on named entity extraction and normalization has been carried out in both restricted and open domains. … Cucerzan [4] considers the entity normalization task for news and encyclopedia articles; they use information extracted from Wikipedia combined with machine learning for context-aware name disambiguation; the baseline that we use in this paper (taken from [11]) is a modification (and improved version) of Cucerzan [4]’s baseline. Cucerzan [4] also presents an extensive literature overview on the problem.



This paper presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from a large encyclopedic collection and Web search results. It describes in detail the disambiguation paradigm employed and the information extraction process from Wikipedia. Through a process of maximizing the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities, the implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.

1. Introduction and Related Work

The ability to identify the named entities (such as people and locations) has been established as an important task in several areas, including topic detection and tracking, machine translation, and information retrieval. Its goal is the identification of mentions of entities in text (also referred to as surface forms henceforth), and their labeling with one of several entity type labels. Note that an entity (such as George W. Bush, the current president of the U.S.) can be referred to by multiple surface forms (e.g., “George Bush” and “Bush”) and a surface form (e.g., “Bush”) can refer to multiple entities (e.g., two U.S. presidents, the football player Reggie Bush, and the rock band called Bush).

When it was introduced, in the 6th Message Understanding Conference (Grishman and Sundheim, 1996), the named entity recognition task comprised three entity identification and labeling subtasks: ENAMEX (proper names and acronyms designating persons, locations, and organizations), TIMEX (absolute temporal terms) and NUMEX (numeric expressions, monetary expressions, and percentages). Since 1995, other similar named entity recognition tasks have been defined, among which CoNLL (e.g., Tjong Kim Sang and De Meulder, 2003) and ACE (Doddington et al., 2004). In addition to structural disambiguation (e.g., does “the Alliance for Democracy in Mali” mention one, two, or three entities?) and entity labeling (e.g., does “Washington went ahead” mention a person, a place, or an organization?), MUC and ACE also included a within document coreference task, of grouping all the mentions of an entity in a document together (Hirschman and Chinchor, 1997).

When breaking the document boundary and scaling entity tracking to a large document collection or the Web, resolving semantic ambiguity becomes of central importance, as many surface forms turn out to be ambiguous. For example, the surface form “Texas” is used to refer to more than twenty different named entities in Wikipedia. In the context “former Texas quarterback James Street”, Texas refers to the University of Texas at Austin; in the context “in 2000, Texas released a greatest hits album”, Texas refers to the British pop band; in the context “Texas borders Oklahoma on the north”, it refers to the U.S. state; while in the context “the characters in Texas include both real and fictional explorers”, the same surface form refers to the novel written by James A. Michener.

Bagga and Baldwin (1998) tackled the problem of cross-document coreference by comparing, for any pair of entities in two documents, the word vectors built from all the sentences containing mentions of the targeted entities. Ravin and Kazi (1999) further refined the method of solving coreference through measuring context similarity and integrated it into Nominator (Wacholder et al., 1997), which was one of the first successful systems for named entity recognition and co-reference resolution. However, both studies targeted the clustering of all mentions of an entity across a given document collection rather than the mapping of these mentions to a given reference list of entities.

A body of work that did employ reference entity lists targeted the resolution of geographic names in text. Woodruff and Plaunt (1994) used a list of 80k geographic entities and achieved a disambiguation precision of 75%. Kanada (1999) employed a list of 96k entities and reported 96% precision for geographic name disambiguation in Japanese text. Smith and Crane (2002) used the Cruchley’s and the Getty thesauri, in conjunction with heuristics inspired from the Nominator work, and obtained between 74% and 93% precision at recall levels of 89-99% on five different history text corpora. Overell and Rüger (2006) also employed the Getty thesaurus as reference and used Wikipedia to develop a co-occurrence model and to test their system.

In many respects, the problem of resolving ambiguous surface forms based on a reference list of entities is similar to the lexical sample task in word sense disambiguation (WSD). This task, which has supported large-scale evaluations – SENSEVAL 1-3 (Kilgarriff and Rosenzweig, 2000; Edmonds and Cotton, 2001; Mihalcea et al., 2004) – aims to assign dictionary meanings to all the instances of a predetermined set of polysemous words in a corpus (for example, choose whether the word “church” refers to a building or an institution in a given context). However, these evaluations did not include proper noun disambiguation and omitted named entity meanings from the targeted semantic labels and the development and test contexts (e.g., “Church and Gale showed that the frequency [..]”).

The problem of resolving ambiguous names also arises naturally in Web search. For queries such as “Jim Clark” or “Michael Jordan”, search engines return blended sets of results referring to many different people. Mann and Yarowsky (2003) addressed the task of clustering the Web search results for a set of ambiguous personal names by employing a rich feature space of biographic facts obtained via bootstrapped extraction patterns. They reported 88% precision and 73% recall in a three-way classification (most common, secondary, and other uses). Raghavan et al. (2004) explored the use of entity language models for tasks such as clustering entities by profession and classifying politicians as liberal or conservative. To build the models, they recognized the named entities in the TREC-8 corpus and computed the probability distributions over words occurring within a certain distance of any instance labeled as Person of the canonical surface form of 162 famous people.

Our aim has been to build a named entity recognition and disambiguation system that employs a comprehensive list of entities and a vast amount of world knowledge. Thus, we turned our attention to the Wikipedia collection, the largest organized knowledge repository on the Web (Remy, 2002).

Wikipedia was successfully employed previously by Strube and Ponzetto (2006) and Gabrilovich and Markovitch (2007) to devise methods for computing semantic relatedness of documents, WikiRelate! and Explicit Semantic Analysis (ESA), respectively. For any pair of words, WikiRelate! attempts to find a pair of articles with titles that contain those words and then computes their relatedness from the word-based similarity of the articles and the distance between the articles’ categories in the Wikipedia category tree. ESA works by first building an inverted index from words to all Wikipedia articles that contain them. Then, it estimates a relatedness score for any two documents by using the inverted index to build a vector over Wikipedia articles for each document and by computing the cosine similarity between the two vectors.
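The ESA procedure described above (inverted index, vector over articles, cosine similarity) can be illustrated with a minimal sketch. It is not Gabrilovich and Markovitch's implementation: the three-article collection `ARTICLES` is hypothetical toy data, all function names are invented for illustration, and raw term counts are assumed as weights where a real system would likely use a more refined weighting scheme.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy stand-in for Wikipedia: article title -> article text.
ARTICLES = {
    "Texas (band)": "pop band album hits singer",
    "Texas": "state united states oklahoma border austin",
    "Oklahoma": "state united states texas border plains",
}

def build_inverted_index(articles):
    """Map each word to {article title: term count in that article}."""
    index = defaultdict(dict)
    for title, text in articles.items():
        for word, count in Counter(text.split()).items():
            index[word][title] = count
    return index

def concept_vector(document, index):
    """Sum the index rows of the document's words into a vector over articles."""
    vector = Counter()
    for word in document.lower().split():
        for title, weight in index.get(word, {}).items():
            vector[title] += weight
    return vector

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relatedness(doc_a, doc_b, index):
    """Relatedness of two documents via their vectors over articles."""
    return cosine(concept_vector(doc_a, index), concept_vector(doc_b, index))

INDEX = build_inverted_index(ARTICLES)
```

In this toy setting, `relatedness("album hits", "pop band", INDEX)` is high even though the two texts share no words, because both map onto the same article; that indirect matching is the point of representing documents over articles rather than over words.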

The most similar work to date was published by Bunescu and Paşca (2006). They employed several of the disambiguation resources discussed in this paper (Wikipedia entity pages, redirection pages, categories, and hyperlinks) and built a context article cosine similarity model and an SVM based on a taxonomy kernel. They evaluated their models for person name disambiguation over 110, 540, and 2,847 categories, reporting accuracies between 55.4% and 84.8% on (55-word context, entity) pairs extracted from Wikipedia, depending on the model and the development/test data employed.

The system discussed in this paper performs both named entity identification and disambiguation. The entity identification and in-document coreference components resemble the Nominator system (Wacholder et al., 1997). However, while Nominator made heavy use of heuristics and lexical clues to solve the structural ambiguity of entity mentions, we employ statistics extracted from Wikipedia and Web search results. The disambiguation component, which constitutes the main focus of the paper, employs a vast amount of contextual and category information automatically extracted from Wikipedia over a space of 1.4 million distinct entities/concepts, making extensive use of the highly interlinked structure of this collection. We augment the Wikipedia category information with information automatically extracted from Wikipedia list pages and use it in conjunction with the context information in a vectorial model that employs a novel disambiguation method.

2. The Disambiguation Paradigm

We present in this section an overview of the proposed disambiguation model and the world knowledge data employed in the instantiation of the model discussed in this paper. The formal model is discussed in detail in Section 5. The world knowledge used includes the known entities (most articles in Wikipedia are associated with an entity/concept), their entity class when available (Person, Location, Organization, and Miscellaneous), their known surface forms (terms that are used to mention the entities in text), contextual evidence (words or other entities that describe or co-occur with an entity), and category tags (which describe topics to which an entity belongs).
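As a rough illustration of how contextual evidence and category tags could be combined, the sketch below scores every joint assignment of entities to surface forms by (a) the overlap between each entity's context terms and the document and (b) the category tags shared by each pair of chosen entities, then keeps the highest-scoring assignment. This is not the formal model of Section 5, which is not reproduced here: the knowledge base `KB`, the candidate lists, the unweighted set-overlap scoring, and the brute-force search are all assumptions made for illustration.

```python
from itertools import combinations, product

# Hypothetical toy knowledge base: entity -> (context terms, category tags).
KB = {
    "Texas (U.S. state)": ({"oklahoma", "border", "state"}, {"U.S. states"}),
    "Texas (band)": ({"album", "hits", "released"}, {"Musical groups"}),
    "Oklahoma (U.S. state)": ({"texas", "border", "state"}, {"U.S. states"}),
    "Oklahoma! (musical)": ({"broadway", "musical", "song"}, {"Musicals"}),
}

# Hypothetical surface form -> candidate entities.
CANDIDATES = {
    "Texas": ["Texas (U.S. state)", "Texas (band)"],
    "Oklahoma": ["Oklahoma (U.S. state)", "Oklahoma! (musical)"],
}

def agreement(assignment, doc_words):
    """Context overlap with the document plus pairwise category-tag overlap."""
    context = sum(len(KB[e][0] & doc_words) for e in assignment)
    tags = sum(len(KB[a][1] & KB[b][1]) for a, b in combinations(assignment, 2))
    return context + tags

def disambiguate(surface_forms, document):
    """Pick the joint assignment with maximal agreement (exhaustive search)."""
    doc_words = set(document.lower().split())
    options = [CANDIDATES[s] for s in surface_forms]
    best = max(product(*options), key=lambda a: agreement(a, doc_words))
    return dict(zip(surface_forms, best))
```

Exhaustive search over all joint assignments grows exponentially with the number of surface forms, so it is only workable for toy inputs like the one here.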

3. Information Extraction from Wikipedia

When processing the Wikipedia collection, we distinguish among four types of articles: entity pages, redirecting pages, disambiguation pages, and list pages.

6. Evaluation

In both settings, we computed a disambiguation baseline in the following manner: for each surface form, if there was an entity page or redirect page whose title matches exactly the surface form then we chose the corresponding entity as the baseline disambiguation; otherwise, we chose the entity most frequently mentioned in Wikipedia using that surface form.
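The baseline rule above can be sketched as a short lookup function. The data structures (`page_titles`, `redirects`, `anchor_counts`), the sample entries, and the mention counts are hypothetical; checking entity pages before redirect pages is also an assumption, since the text does not state an order.

```python
from collections import Counter

def baseline_disambiguation(surface_form, page_titles, redirects, anchor_counts):
    """
    page_titles:   set of entity page titles
    redirects:     dict mapping redirect page title -> target entity
    anchor_counts: dict mapping surface form -> Counter(entity -> mention count)
    Returns the baseline entity, or None if the surface form is unknown.
    """
    # Exact title match against an entity page.
    if surface_form in page_titles:
        return surface_form
    # Exact title match against a redirect page.
    if surface_form in redirects:
        return redirects[surface_form]
    # Otherwise, the entity most frequently mentioned with this surface form.
    counts = anchor_counts.get(surface_form)
    if counts:
        return counts.most_common(1)[0][0]
    return None

# Hypothetical sample data for illustration only.
PAGE_TITLES = {"Texas", "Texas (band)", "George W. Bush", "Bush (band)"}
REDIRECTS = {"Dubya": "George W. Bush"}
ANCHOR_COUNTS = {"Bush": Counter({"George W. Bush": 120, "Bush (band)": 15})}
```

With this sample data, "Texas" resolves through the exact title match, "Dubya" through the redirect, and "Bush" through the mention counts.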

7. Conclusions and Potential Impact

We presented a large-scale named entity disambiguation system that employs a huge amount of information automatically extracted from Wikipedia over a space of more than 1.4 million entities. In tests on both real news data and Wikipedia text, the system obtained accuracies exceeding 91% and 88%, respectively. Because the entity recognition and disambiguation processes employed use very few language-dependent resources additional to Wikipedia, the system can be easily adapted to languages other than English.

The system described in this paper has been fully implemented as a Web browser (Figure 3), which can analyze any Web page or client text document. The application on a large scale of such an entity extraction and disambiguation system could result in a move from the current space of words to a space of concepts. Such a move enables several paradigm shifts and opens new research directions, which we are currently investigating, from entity-based indexing and searching of document collections to personalized views of the Web through entity-based user bookmarks.



  • Bagga, A. and B. Baldwin. (1998). Entity-based cross-document coreferencing using the vector space model. In: Proceedings of COLING-ACL, 79-85.
  • (Bunescu and Paşca, 2006) ⇒ Razvan C. Bunescu and Marius Paşca. (2006). “Using Encyclopedic Knowledge for Named Entity Disambiguation.” In: Proceedings of EACL-2006.
  • Cederberg, S. and D. Widdows. (2003). Using LSA and noun coordination information to improve the precision and recall of hyponymy extraction. In: Proceedings of CoNLL, 111-118.
  • George Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. (2004). ACE program – task definitions and performance measures. In: Proceedings of LREC, 837-840.
  • Edmonds, P. and S. Cotton. (2001). Senseval-2 overview. In: Proceedings of SENSEVAL-2, 1-6.
  • Evgeniy Gabrilovich and S. Markovitch. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. Proceedings of IJCAI, 1606-1611.
  • Gale, W., Kenneth W. Church, and David Yarowsky. (1992). One sense per discourse. In: Proceedings of the 4th DARPA SNL Workshop, 233-237.
  • Grishman, R. and B. Sundheim. (1996). Message Understanding Conference - 6: A brief history. In: Proceedings of COLING, 466-471.
  • Hearst, M. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of COLING, 539-545.
  • Lynette Hirschman and N. Chinchor. (1997). MUC-7 Coreference Task Definition. In: Proceedings of MUC-7.
  • Kanada, Y. (1999). A method of geographical name extraction from Japanese text. In: Proceedings of CIKM, 46-54.
  • Kilgarriff, A. and J. Rosenzweig. (2000). Framework and results for English Senseval. Computers and Humanities, Special Issue on SENSEVAL, 15-48.
  • Lapata, M. and F. Keller. (2004). The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks. In: Proceedings of HLT, 121-128.
  • Mann, G. S. and David Yarowsky. (2003). Unsupervised Personal Name Disambiguation. In: Proceedings of CoNLL, 33-40.
  • Rada Mihalcea, T. Chklovski, and A. Kilgarriff. (2004). The Senseval-3 English lexical sample task. In: Proceedings of SENSEVAL-3, 25-28.
  • Overell, S., and S. Rüger. (2006). Identifying and grounding descriptions of places. In: SIGIR Workshop on Geographic Information Retrieval.
  • Raghavan, H., J. Allan, and Andrew McCallum. (2004). An exploration of entity models, collective classification and relation description. In KDD Workshop on Link Analysis and Group Detection.
  • Ravin, Y. and Z. Kazi. (1999). Is Hillary Rodham Clinton the President? In: ACL Workshop on Coreference and its Applications.
  • Remy, M. (2002). Wikipedia: The free encyclopedia. In Online Information Review, 26(6): 434.
  • Roark, B. and Eugene Charniak. (1998). Noun-phrase cooccurrence statistics for semi-automatic semantic lexicon construction. In: Proceedings of COLING-ACL, 1110-1116.
  • Gerard M. Salton. (1989). Automatic Text Processing. Addison-Wesley.
  • Smith, D. A. and G. Crane. (2002). Disambiguating geographic names in a historic digital library. In: Proceedings of ECDL, 127-136.
  • Michael Strube and S. P. Ponzetto. (2006). WikiRelate! Computing semantic relatedness using Wikipedia. In: Proceedings of AAAI, 1419-1424.
  • Tjong Kim Sang, E. F. and F. De Meulder. (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: Proceedings of CoNLL, 142-147.
  • Wacholder, N., Y. Ravin, and M. Choi. (1997). Disambiguation of proper names in text. In: Proceedings of ANLP, 202-208.
  • Woodruff, A. G. and C. Plaunt. (1994). GIPSY: Automatic geographic indexing of documents. Journal of the American Society for Information Science and Technology, 45(9):645-655.


  • Page: 2007 LargeScaleNEDisambigBasedOnWikiped
  • Author: Silviu Cucerzan
  • Title: Large-Scale Named Entity Disambiguation Based on Wikipedia Data
  • Published in: Proceedings of Empirical Methods in Natural Language Processing Conference
  • URL: http://acl.ldc.upenn.edu/D/D07/D07-1074.pdf
  • Year: 2007