2014 RobustEntityLinkingviaRandomWal

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Data-Driven Entity Mention Linking.

Notes

Cited By

Quotes

Author Keywords

Entity linking, relatedness measure, random walk

Abstract

Entity Linking is the task of assigning entities from a Knowledge Base to textual mentions of such entities in a document. State-of-the-art approaches rely on lexical and statistical features which are abundant for popular entities but sparse for unpopular ones, resulting in a clear bias towards popular entities and poor accuracy for less popular ones. In this work, we present a novel approach that is guided by a natural notion of semantic similarity which is less amenable to such bias. We adopt a unified semantic representation for entities and documents - the probability distribution obtained from a random walk on a subgraph of the knowledge base - which can overcome the feature sparsity issue that affects previous work. Our algorithm continuously updates the semantic signature of the document as mentions are disambiguated, thus focusing the search based on context. Our experimental evaluation uses well-known benchmarks and different samples of a Wikipedia-based benchmark with varying entity popularity; the results illustrate well the bias of previous methods and the superiority of our approach, especially for the less popular entities.

1. Introduction

Entity Linking (EL) is the task of assigning unique identifiers of entities in a Knowledge Base (KB) to mentions of named entities in a text document. EL is key for Information Extraction (IE) and many other applications. For instance, it enables expanding or correcting a KB with facts extracted from documents - a task called Knowledge Base Population [15]. Another application is Semantic Search, the emerging paradigm of Web search that combines Information Retrieval approaches over document corpora with KB-style query answering and reasoning to offer more accurate and concise answers to Web searches.

EL is challenging due to the inherent ambiguity of natural language: most entities can be referred to in different ways, and the same mention may refer to multiple real-world entities, as illustrated in the following examples:

Example 1. Saban, previously a head coach of NFL's Miami, is now coaching Crimson Tide. His achievements include leading LSU to the BCS National Championship once and Alabama three times.
Example 2. After his departure from Buffalo, Saban returned to coach college football teams including Miami, Army and UCF.

In Example 1, both Crimson Tide and Alabama refer to the same football team, the Alabama Crimson Tide of the University of Alabama. On the other hand, Saban refers to two different coaches: Nick Saban in Example 1 and Lou Saban in Example 2. Similarly, Miami in Example 1 refers to the NFL football team Miami Dolphins, while in Example 2 it refers to the college football team Miami Hurricanes.

The EL task can be cast as an all-against-all matching problem: given m mentions in a document and n entities in a KB, perform the m \times n comparisons and pick the ones with the highest similarity (one for each mention, of course). This is prohibitively expensive and unnecessary, however. Most methods, including ours, operate in two stages: the first is to select a suitable set of candidate entities for each mention, and the second is to perform the actual mention disambiguation. Selecting candidate entities is done, primarily, by consulting alias dictionaries (e.g., produced from a Wikipedia corpus). In the examples above, both coaches Lou Saban and Nick Saban would be picked as candidates for the mention Saban, as would the companies Saban Capital Group and Saban Entertainment. As for the disambiguation phase, most methods in the literature can be divided into two main groups, discussed next.

7. CONCLUSION

Entity linking mainly involves measuring the compatibility and semantic relatedness between mentions and entities, for which the semantic representation plays a critical role. The lexical and statistical features used in traditional local approaches are limited by the feature sparsity issue and cannot capture the semantics of entities. Recently proposed keyphrase and link-based representations provide richer feature sets for popular entities, but are still challenged for less popular ones. In this work, we proposed the use of the probability distribution resulting from a random walk with restart over a suitable entity graph to represent the semantics of entities and documents in a unified way. Our semantic representation uses all relevant entities from the knowledge base as features, thus reducing the effect of feature sparsity. Also the random walk with restart helps capture the semantics of unpopular entities from the entity graph.

Our experimental evaluation compared our method to 6 leading competitor systems and 2 very strong baselines, revealing the superiority and robustness of our entity linking system in a variety of settings, including 4 public benchmarks. Our method was particularly stronger when disambiguating unpopular entities, making it a good candidate to address the "long tail" in Information Extraction.

Future work.

A number of opportunities for future work exist. First, several disambiguation mistakes are still possible. For instance, when we have two Miamis in the same sentence, one referring to the city in Florida and the other to Miami Dolphins, our method might mistakenly link both to the same entity. In this case, simple type information like location and sports team, which are provided by named entity recognition systems can help with the disambiguation. Combining our current approach with typing information or other kinds of constraints would help advance the field.

Our system performs multiple random walk computations, making it time consuming if implemented naively, as in our current implementation. On a standard entry-level server, the average time to disambiguate a document in our benchmarks (in the order of tens of mentions) is in the order of a few minutes. Therefore, designing proper system infrastructure with the appropriate indexes and/or parallel computing infrastructure to optimize these computations would be interesting. Moreover, other state-of-the-art systems perform other expensive operations as well, such as accessing the Web. Thus, designing objective and fair benchmarks for comparing these different approaches in a more holistic way would be of great value.

Finally, our approach, like most other systems, has many application-specific parameters (recall Section 2) and depends on specific similarity measures (e.g., to filter candidate entities). Further studies are needed to understand how the accuracy of our approach is affected by the choice of similarity measures and configuration of parameters.

References

  • 1. E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa. A Study on Similarity and Relatedness Using Distributional and Wordnet-based Approaches. In HLT-NAACL, Pages 19--27, 2009.
  • 2. A. Bagga and B. Baldwin. Entity-based Cross-document Coreferencing Using the Vector Space Model. In COLING-ACL, Pages 79--85, 1998.
  • 3. R. C. Bunescu and M. Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation. In EACL, 2006.
  • 4. Z. Cai, K. Zhao, K. Q. Zhu, and H. Wang. Wikification via Link Co-occurrence. In CIKM, Pages 1087--1096, 2013.
  • 5. X. Cheng and D. Roth. Relational Inference for Wikification. In EMNLP, Pages 1787--1796, 2013.
  • 6. S. Cucerzan. Large-scale Named Entity Disambiguation based on Wikipedia Data. In EMNLP-CoNLL, Pages 708--716, 2007.
  • 7. M. Dredze, P. McNamee, D. Rao, A. Gerber, and T. Finin. Entity Disambiguation for Knowledge Base Population. In COLING, Pages 277--285, 2010.
  • 8. Z. Guo and D. Barbosa. Entity Linking with a Unified Semantic Representation. In WWW (Companion Volume), Pages 1305--1310, 2014.
  • 9. X. Han and L. Sun. An Entity-topic Model for Entity Linking. In EMNLP-CoNLL, Pages 105--115, 2012.
  • 10. X. Han, L. Sun, and J. Zhao. Collective Entity Linking in Web Text: A Graph-based Method. In SIGIR, Pages 765--774, 2011.
  • 11. T. H. Haveliwala. Topic-sensitive Pagerank: A Context-sensitive Ranking Algorithm for Web Search. IEEE Trans. Knowl. Data Eng., 15(4):784--796, 2003.
  • 12. J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. Kore: Keyphrase Overlap Relatedness for Entity Disambiguation. In CIKM, Pages 545--554, 2012.
  • 13. J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust Disambiguation of Named Entities in Text. In EMNLP, Pages 782--792, 2011.
  • 14. T. Hughes and D. Ramage. Lexical Semantic Relatedness with Random Graph Walks. In EMNLP-CoNLL, Pages 581--589, 2007.
  • 15. H. Ji and R. Grishman. Knowledge Base Population: Successful Approaches and Challenges. In ACL, Pages 1148--1158, 2011.
  • 16. S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective Annotation of Wikipedia Entities in Web Text. In KDD, Pages 457--466, 2009.
  • 17. Y. Li, C. Wang, F. Han, J. Han, D. Roth, and X. Yan. Mining Evidences for Named Entity Disambiguation. In KDD, Pages 1070--1078, 2013.
  • 18. R. Mihalcea and A. Csomai. Wikify!: Linking Documents to Encyclopedic Knowledge. In CIKM, Pages 233--242, 2007.
  • 19. G. A. Miller. Wordnet: A Lexical Database for English. Commun. ACM, 38(11):39--41, 1995.
  • 20. D. Milne and I. H. Witten. An Effective, Low-cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In: Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI, 2008), 2008.
  • 21. D. Milne and I. H. Witten. Learning to Link with Wikipedia. In CIKM, Pages 509--518, 2008.
  • 22. M. T. Pilehvar, D. Jurgens, and R. Navigli. Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity. In ACL, Pages 1341--1351, 2013.
  • 23. L.-A. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and Global Algorithms for Disambiguation to Wikipedia. In ACL, Pages 1375--1384, 2011.
  • 24. H. Tong, C. Faloutsos, and J.-Y. Pan. Fast Random Walk with Restart and Its Applications. In ICDM, Pages 613--622, 2006.
  • 25. W. Zhang, Y. C. Sim, J. Su, and C. L. Tan. Entity Linking with Effective Acronym Expansion, Instance Selection, and Topic Modeling. In IJCAI, Pages 1909--1914, 2011.
  • 26. W. Zhang, J. Su, C. L. Tan, and W. Wang. Entity Linking Leveraging Automatically Generated Annotation. In COLING, Pages 1290--1298, 2010.
  • 27. Y. Zhou, L. Nie, O. Rouhani-Kalleh, F. Vasile, and [[S. Gaffney]. Resolving Surface Forms to Wikipedia Topics. In COLING, Pages 1335--1343, 2010.

;

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2014 RobustEntityLinkingviaRandomWalZhaochen Guo
Denilson Barbosa
Robust Entity Linking via Random Walks10.1145/2661829.26618872014