2008 EntityRankingInWikipedia


Subject Headings: Entity Ranking Algorithm, Entity Ranking Task, Wikipedia, XML Retrieval, Test Collection, INEX Wikipedia Corpus, Entity Mention Normalization Algorithm.



The traditional entity extraction problem lies in the ability of extracting named entities from plain text using natural language processing techniques and intensive training from large document collections. Examples of named entities include organisations, people, locations, or dates. There are many research activities involving named entities; we are interested in entity ranking in the field of information retrieval. In this paper, we describe our approach to identifying and ranking entities from the INEX Wikipedia document collection. Wikipedia offers a number of interesting features for entity identification and ranking that we first introduce. We then describe the principles and the architecture of our entity ranking system, and introduce our methodology for evaluation. Our preliminary results show that the use of categories and the link structure of Wikipedia, together with entity examples, can significantly improve retrieval effectiveness.
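The abstract names three signals for ranking (categories, link structure, and entity examples) but does not give a scoring formula. The sketch below is only an illustration of how such signals could be combined. The `Page` class, the two scoring functions, the linear weight `alpha`, and the toy corpus are all assumptions made here, not the authors' system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical, simplified view of a Wikipedia page (not the INEX XML format)."""
    title: str
    categories: set[str] = field(default_factory=set)
    links_to: set[str] = field(default_factory=set)  # titles this page links to


def category_score(page: Page, target_categories: set[str]) -> float:
    """Fraction of the target categories that the page belongs to."""
    if not target_categories:
        return 0.0
    return len(page.categories & target_categories) / len(target_categories)


def link_score(page: Page, corpus: list[Page], example_titles: set[str]) -> float:
    """Fraction of example entities that some page links to together with the candidate."""
    if not example_titles:
        return 0.0
    hits = 0
    for example in example_titles:
        if any(page.title in p.links_to and example in p.links_to for p in corpus):
            hits += 1
    return hits / len(example_titles)


def rank_entities(corpus, target_categories, example_titles, alpha=0.5):
    """Rank candidates by an assumed linear mix of link and category evidence."""
    by_title = {p.title: p for p in corpus}
    # Assumption: categories of the example entities extend the target categories.
    categories = set(target_categories)
    for example in example_titles:
        if example in by_title:
            categories |= by_title[example].categories
    scored = []
    for page in corpus:
        if page.title in example_titles:
            continue  # the examples themselves are not returned as answers
        score = (alpha * link_score(page, corpus, example_titles)
                 + (1 - alpha) * category_score(page, categories))
        scored.append((page.title, score))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy corpus, invented for illustration only.
corpus = [
    Page("List of painters", {"Lists"}, {"Van Gogh", "Monet", "Rembrandt"}),
    Page("Van Gogh", {"Dutch painters"}),
    Page("Monet", {"French painters"}),
    Page("Rembrandt", {"Dutch painters"}),
    Page("Paris", {"Cities"}),
]
ranked = rank_entities(corpus, {"Dutch painters"}, {"Van Gogh"})
for title, score in ranked:
    print(f"{title}: {score:.2f}")
```

In this toy setup, a candidate that shares a category with the example and also appears alongside it in a list page ranks above one that only appears alongside it. The equal weighting is arbitrary and would need tuning against a test collection.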


  • 1 Brad Adelberg, Matthew Denny, Nodose version 2.0, Proceedings of the 1999 ACM SIGMOD Conference, p.559-561, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
  • 2 D. Awang Iskandar, J. Pehcevski, J. A. Thom, and S. M. M. Tahaghoghi. Social media retrieval using image features and structured text. In Comparative Evaluation of XML Information Retrieval Systems: 5th Workshop of the INitiative for the Evaluation of XML Retrieval, INEX 2006, volume 4518 of LNCS, pages 358--372, 2007.
  • 3 E. Blanchard, M. Harzallah, and P. K. Henri Briand. A typology of ontology-based semantic measures. In EMOI-INTEROP'05, Proceedings of Open Interop Workshop on Enterprise Modelling and Ontologies for Interoperability, Porto, Portugal, 2005.
  • 4 E. Blanchard, P. Kuntz, M. Harzallah, and H. Briand. A tree-based similarity for evaluating concept proximities in an ontology. In: Proceedings of 10th conference of the International Federation of Classification Societies, pages 3--11, Ljubljana, Slovenia, 2006.
  • 5 Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual Web search engine, Proceedings of the seventh International Conference on World Wide Web 7, p.107-117, April 1998, Brisbane, Australia
  • 6 Jamie Callan, Teruko Mitamura, Knowledge-based extraction of named entities, Proceedings of the eleventh International Conference on Information and knowledge management, November 04-09, 2002, McLean, Virginia, USA doi:10.1145/584792.584880
  • 7 (Cohen & Sarawagi, 2004) ⇒ William W. Cohen, and Sunita Sarawagi. (2004). “Exploiting Dictionaries in Named Entity Extraction: Combining semi-Markov extraction processes and data integration methods.” In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004) doi:10.1145/1014052.1014065
  • 8 (Cucerzan, 2007) ⇒ Silviu Cucerzan. (2007). “Large-Scale Named Entity Disambiguation Based on Wikipedia Data.” In: Proceedings of EMNLP-CoNLL-2007
  • 9 S. Cucerzan and David Yarowsky. Language independent named entity recognition combining morphological and contextual evidence. In: Proceedings of 1999 Joint SIGDAT Conference on EMNLP and VLC, pages 90--99, Maryland, MD, 1999.
  • 10 Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, GATE: an architecture for development of robust HLT applications, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania doi:10.3115/1073083.1073112
  • 11 A. P. de Vries and N. Craswell. Entity ranking -- guidelines. In INEX 2006 Workshop Pre-Proceedings, pages 413--414, 2006.
  • 12 A. P. de Vries, J. A. Thom, A.-M. Vercoustre, N. Craswell, and M. Lalmas. INEX 2007 Entity ranking track guidelines. In INEX 2007 Workshop Pre-Proceedings, 2007 (to appear).
  • 13 Ludovic Denoyer, Patrick Gallinari, The Wikipedia XML corpus, ACM SIGIR Forum, v.40 n.1, June 2006 doi:10.1145/1147197.1147210
  • 14 (HassellAA, 2006) ⇒ Joseph Hassell, Boanerges Aleman-Meza, and I. Budak Arpinar. (2006). “Ontology-driven automatic entity disambiguation in unstructured text.” In: Proceedings of the 5th International Semantic Web Conference (ISWC).
  • 15 Jon Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM (JACM), v.46 n.5, p.604-632, Sept. 1999 doi:10.1145/324133.324140
  • 16 Nicholas Kushmerick, Wrapper induction: efficiency and expressiveness, Artificial Intelligence, v.118 n.1-2, p.15-68, April 2000 doi:10.1016/S0004-3702(99)00100-9
  • 17 K. Lerman, S. N. Minton, and C. A. Knoblock. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18:149--181, 2003.
  • 18 Bing Liu, Robert L. Grossman, Yanhong Zhai, Mining data records in Web pages, Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, D.C. doi:10.1145/956750.956826
  • 19 S. Malik, A. Trotman, and M. Lalmas. Overview of INEX 2006. In Comparative Evaluation of XML Information Retrieval Systems: 5th Workshop of the INitiative for the Evaluation of XML Retrieval, INEX 2006, volume 4518 of LNCS, pages 1--11, 2007.
  • 20 Paul McNamee, James Mayfield, Entity extraction without language-specific resources, Proceedings of the 6th conference on Natural language learning, p.1-4, August 31, 2002 doi:10.3115/1118853.1118873
  • 21 NIST Speech Group. The ACE 2006 evaluation plan: Evaluation of the detection and recognition of ACE entities, values, temporal expressions, relations, and events, 2006. http://www.nist.gov/speech/tests/ace/ace06/doc/ace06-evalplan.pdf.
  • 22 Jovan Pehcevski, James A. Thom, Anne-Marie Vercoustre, Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database, Information Retrieval, v.8 n.4, p.571-600, December 2005 doi:10.1007/s10791-005-0748-1
  • 23 Borislav Popov, A. Kiryakov, D. Manov, A. Kirilov, D. Ognyanoff, and M. Goranov. Towards semantic web information extraction. In 2nd International Semantic Web Conference: Workshop on Human Language Technology for the Semantic Web and Web Services, 2003. http://gate.ac.uk/conferences/iswc2003/proceedings/popov.pdf.
  • 24 Arnaud Sahuguet, Fabien Azavant, Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F, Proceedings of the 25th International Conference on Very Large Data Bases, p.738-741, September 07-10, 1999
  • 25 Satoshi Sekine. Named entity: History and future. Technical report, Proteus Project Report, 2004. http://cs.nyu.edu/sekine/papers/NEsurvey200402.pdf.
  • 26 B. Sundheim, editor. Proceedings of 3rd Message Understanding Conference (MUC), Los Altos, CA, 1991. Morgan Kaufmann.
  • 27 S. Tenier, A. Napoli, X. Polanco, and Y. Toussaint. Annotation sémantique de pages web. In 6èmes journées francophones "Extraction et Gestion de Connaissances" - EGC 2006, 2006.
  • 28 A.-M. Vercoustre and F. Paradis. A descriptive language for information object reuse through virtual documents. In 4th International Conference on Object-Oriented Information Systems (OOIS'97), pages 299--311, Brisbane, Australia, 1997.
  • 29 Ellen M. Voorhees, Donna K. Harman, TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing), The MIT Press, 2005
  • 30 Jonathan Yu, James A. Thom, Audrey Tam, Ontology evaluation using wikipedia categories for browsing, Proceedings of the sixteenth ACM Conference on Information and Knowledge Management, November 06-10, 2007, Lisbon, Portugal doi:10.1145/1321440.1321474

2008 EntityRankingInWikipedia

  • Author: Anne-Marie Vercoustre, James A. Thom, Jovan Pehcevski
  • Title: Entity Ranking in Wikipedia
  • Journal: Proceedings of the 2008 ACM Symposium on Applied Computing
  • URL: http://arxiv.org/pdf/0711.3128
  • DOI: 10.1145/1363686.1363943
  • Year: 2008