2006 UsingEncyclKnowForNEDisambig


Subject Headings: Entity Mention Resolution Algorithm

Notes

  • Its Presentation Deck can be found at http://www.cs.utexas.edu/~razvan/papers/eacl2006.ppt
  • It performs disambiguation into one of the known possible classes for a NE (determined from Wikipedia disambiguation pages).
  • Its contexts for training and testing are acquired from Wikipedia pages (as opposed to general text).
  • It uses vectors of co-occurring terms for disambiguation.
  • It uses a taxonomy-based kernel that integrates word-category correlations (an illustrative feature sketch follows this list).
  • It evaluates whether the correct class (from among a NE's known classes) is predicted for a given NE in a Wikipedia page context.
  • It includes one experiment in which 10% of the entities are out-of-Wikipedia entities.
  • Its category space is restricted to Person Occupation, with 8,202 subclasses.
  • Its experiments consider:
    • 110 broad classes
    • 540 highly populated classes (w/o out-of-Wikipedia entities)
    • 2,847 classes including less populated ones.
  • Classification is performed in context.
  • It does not evaluate recognition.
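
The two notes above about co-occurring terms and word-category correlations can be made concrete with a small, hedged Python sketch. The function names, example words, and example categories below are illustrative assumptions rather than the paper's own code, and the actual system trains an SVM kernel over such features rather than using them directly.

    from collections import Counter

    def context_vector(context_words):
        # Vector of co-occurring terms: term -> count in the mention's context.
        return Counter(w.lower() for w in context_words)

    def word_category_features(context_words, candidate_categories):
        # Pair each context word with each Wikipedia category of the candidate
        # entity; a taxonomy-based kernel can weight such word-category pairs.
        return {(w.lower(), c) for w in context_words for c in candidate_categories}

    # Illustrative (made-up) usage:
    ctx = ["conducted", "film", "score"]
    cats = ["Film_score_composers", "People_by_occupation"]
    print(context_vector(ctx))
    print(sorted(word_category_features(ctx, cats)))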

Cited By

2014

2009

Quotes

Abstract

We present a new method for detecting and disambiguating named entities in open domain text. A disambiguation SVM kernel is trained to exploit the high coverage and rich structure of the knowledge encoded in an online encyclopedia. The resulting model significantly outperforms a less informed baseline.

Introduction

Whenever the queries search for pinpointed, factual information, the burden of filling the gap between the output granularity (whole documents) and the targeted information (a set of sentences or relevant phrases) stays with the users, by browsing the returned documents in order to find the actually relevant bits of information.

We organize all named entities from Wikipedia into a dictionary structure [math]\displaystyle{ D }[/math], where each string entry [math]\displaystyle{ d }[/math] in [math]\displaystyle{ D }[/math] is mapped to the set of entities d.E that can be denoted by [math]\displaystyle{ d }[/math] in Wikipedia.
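As a minimal illustration of this dictionary structure, consider the following Python sketch; the particular string entries and entity sets are invented examples, and the helper name candidate_entities is not from the paper.

    # D maps each surface string d to the set d.E of Wikipedia entities it can denote.
    D = {
        "John Williams": {
            "John Williams (composer)",
            "John Williams (wrestler)",
            "John Williams (archbishop of York)",
        },
        "Python": {"Python (programming language)", "Python (genus)"},
    }

    def candidate_entities(d):
        # Return the set d.E for a string entry d, or the empty set if d is not in D.
        return D.get(d, set())

    print(candidate_entities("John Williams"))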

The first step is to identify named entities, i.e. entities with a proper name title. Because every title in Wikipedia must begin with a capital letter, the decision whether a title is a proper name relies on the following sequence of heuristic steps (a code sketch follows the list):

  • 1. If the title is a multiword title, check the capitalization of all content words, i.e. words other than prepositions, determiners, conjunctions, relative pronouns or negations. Consider it a named entity if and only if all of its content words are capitalized.
  • 2. If the title is a one-word title that contains at least two capital letters, then it is a named entity. Otherwise, go to step 3.
  • 3. Count how many times the title occurs in the text of the article, in positions other than at the beginning of sentences. If at least … of these occurrences are capitalized, then the title is a named entity.
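
A minimal Python sketch of these heuristic steps, under the assumption that the title and the article body are available as plain strings. The closed-class word list is a rough approximation of the paper's description, and the step-3 threshold is exposed as a parameter because its exact value is elided in the quote above.

    import re

    # Rough stand-in for "prepositions, determiners, conjunctions,
    # relative pronouns or negations" (an approximation, not the paper's list).
    NON_CONTENT = {"of", "in", "on", "at", "to", "for", "the", "a", "an",
                   "and", "or", "but", "that", "which", "who", "not", "no"}

    def is_named_entity_title(title, article_text, threshold=0.75):  # threshold is an assumption
        words = title.split()
        # Step 1: multiword title -> named entity iff every content word is capitalized.
        if len(words) > 1:
            content = [w for w in words if w.lower() not in NON_CONTENT]
            return bool(content) and all(w[:1].isupper() for w in content)
        # Step 2: one-word title with at least two capital letters -> named entity.
        if sum(ch.isupper() for ch in title) >= 2:
            return True
        # Step 3: among non-sentence-initial occurrences of the title in the article,
        # require at least `threshold` of them to be capitalized.
        inside = []
        for sentence in re.split(r"(?<=[.!?])\s+", article_text):
            for m in re.finditer(re.escape(title), sentence, re.IGNORECASE):
                if m.start() > 0:
                    inside.append(sentence[m.start()].isupper())
        return bool(inside) and sum(inside) / len(inside) >= threshold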

Named Entity Disambiguation

We use the term query to denote the occurrence of a proper name inside a Wikipedia article. If there is a dictionary entry matching the proper name in the query such that the set of …

Presentation

1) Classification:

  • Train a classifier for each proper name in the dictionary [math]\displaystyle{ D }[/math].
  • Not feasible: 500K proper names ⇒ need 500K classifiers!

2) Ranking:

  • Design a scoring function [math]\displaystyle{ score(q, e_k) }[/math] that computes the compatibility between the context of the proper name occurring in a query [math]\displaystyle{ q }[/math] and any of the entities [math]\displaystyle{ e_k \in q.E }[/math] that may be referred to by that proper name.
  • For a given named entity query [math]\displaystyle{ q }[/math], select the highest-ranking entity: [math]\displaystyle{ \hat{e} = \underset{e_k \in q.E}{\operatorname{arg\,max}}\; score(q, e_k) }[/math]
  • Use cosine similarity between the query context and the candidate entity's article, based on the tf×idf formulation: [math]\displaystyle{ score(q, e_k) = \cos(q, e_k) }[/math], where [math]\displaystyle{ q }[/math] and [math]\displaystyle{ e_k }[/math] are tf×idf term vectors (a code sketch of this baseline follows).
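
A minimal sketch of this cosine-similarity ranking baseline, assuming the query context and each candidate entity's article are available as raw strings. The tokenization and the simple tf×idf weighting below are stand-ins for the paper's exact formulation, and the function names (tokenize, tfidf_vectors, rank_candidates) are illustrative.

    import math
    from collections import Counter

    def tokenize(text):
        return [w.lower() for w in text.split()]

    def tfidf_vectors(docs):
        # Simple tf x idf weighting computed over the given small collection
        # (a stand-in; any standard tf.idf scheme could be substituted).
        counts = [Counter(tokenize(d)) for d in docs]
        df = Counter(t for c in counts for t in c)
        n = len(docs)
        return [{t: tf * math.log(n / df[t]) for t, tf in c.items()} for c in counts]

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def rank_candidates(query_context, candidate_articles):
        # candidate_articles: list of (entity_name, article_text) pairs, one per entity in q.E.
        # Returns candidates sorted by descending score(q, e_k); the top-ranked one
        # corresponds to the argmax selection described above.
        vecs = tfidf_vectors([query_context] + [a for _, a in candidate_articles])
        q_vec, cand_vecs = vecs[0], vecs[1:]
        scored = [(name, cosine(q_vec, v))
                  for (name, _), v in zip(candidate_articles, cand_vecs)]
        return sorted(scored, key=lambda s: -s[1])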

References

  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto. (1999). Modern Information Retrieval. ACM Press, New York.
  • Massimiliano Ciaramita, Thomas Hofmann, and Mark Johnson. (2003). Hierarchical semantic classification: Word sense disambiguation with world knowledge. In The 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico.
  • Robert Dale. (2003). Computational linguistics. Special Issue on the Web as a Corpus, 29(3), September.
  • Gottlob Frege. (1999). On sense and reference. In Maria Baghramian, editor, Modern Philosophy of Language, pages 3–25. Counterpoint Press.
  • Chung Heong Gooi and James Allan. (2004). Cross-document coreference on a large scale corpus. In: Proceedings of Human Language Technology Conference / North American Association for Computational Linguistics Annual Meeting, Boston, MA.
  • Thorsten Joachims. (1999). Making large-scale SVM learning practical. In Bernhard Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press.
  • Thorsten Joachims. (2002). Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142.
  • Andrew McCallum, R. Rosenfeld, Tom M. Mitchell, and A. Y. Ng. (1998). Improving text classification by shrinkage in a hierarchy of classes. In: Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), pages 359–367, Madison, WI.
  • M. Remy. (2002). Wikipedia: The free encyclopedia. Online Information Review, 26(6):434. www.wikipedia.org.
  • Vladimir N. Vapnik. (1998). Statistical Learning Theory. John Wiley & Sons.

 Author: Razvan C. Bunescu, Marius Paşca
 Title: Using Encyclopedic Knowledge for Named Entity Disambiguation
 Journal: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics
 Title URL: http://www.cs.utexas.edu/~razvan/papers/eacl2006.pdf
 Year: 2006