2008 NamedEntityNormInUserGenContent

From GM-RKB

Subject Headings: Entity Mention Normalization Algorithm, Wikipedia, Dutch Language.

Notes

Cited By

Quotes

Abstract

Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task is one that has recently attracted considerable attention. We consider the task in the challenging context of user generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems.

A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity and ambiguous references.

To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements.

1. Introduction

The task of record linkage (RL) is to find entries that refer to the same entity in different data sources. This task has been investigated since the 1950s; usually, entries are considered together with their attributes (e.g., a person with a phone number and an address) [21]. The task proved important because data sources have varying ways of referring to the same real-world entity due to, e.g., different naming conventions, misspellings or the use of abbreviations. The task of reference normalization is to analyze and detect these different references [6, 7]. When we consider the special case of this problem for natural language texts, we have to recognize entities in a text and resolve these references either to entities that exist within the document or to real-world entities. These two steps constitute the named entity normalization (NEN) problem.

We consider the NEN task within the setting of user generated content (UGC), such as blogs, discussion forums, or comments left behind by readers of online documents. For this type of textual data, the NEN task is particularly important within the settings of media and reputation analysis (which motivated the work reported here) and of intelligence gathering. Many strategies deployed in these areas revolve around the idea of determining and tracking the impact of an event, i.e., determining the number, intensity and orientation of responses and identifying the stakeholders and other actors and entities involved.

The specific scenario on which we focus concerns the analysis of data that is increasingly common: online texts decorated with unedited comments left behind by web users. Examples include news sites (such as BBC news), and discussion and collaboration forums (such as linuxforum.com). These comments contain valuable information that complements the original text that triggered them, but the sheer volume and their (usually) flat organization makes them hard to comprehend. Hence, tools are needed that help organize the list of comments, by clustering them, summarizing them, computing aggregate information, creating hyperlinks between them, etc.

Let us consider an example. In the data set that we use for evaluation purposes in this paper (see Section 4 for details), one of the news stories is about racing driver Michael Schumacher (Figure 1). To normalize the named entities in this news item and the comments it triggers, we need to resolve them to real-world entities. We notice that there are two types of reference in the data. The first is within-document reference: e.g., in the comments in Figure 1, Michael and Schumi refer to Michael Schumacher as mentioned earlier. The second is reference to real-world entities: e.g., in the second comment, DC is used to refer to David Coulthard. Notice that resolving references of the latter type involves named entity disambiguation: in the context of Figure 1, DC is not used to refer to “Daimler-Chrysler,” “direct current” or the number 600 (see http://en.wikipedia.org/wiki/DC).

Figure 1: An excerpt from a BBC news article, with excerpts of three comments (out of a total of 39). Named entities are underlined.

  • News item: Michael Schumacher wins his sixth victory in eight races — and tightens his grip on another Championship title. Do you think the title race is over? Have Your Say. Michael Schumacher extended his lead to 43 points after Juan Pablo Montoya’s Williams broke down with 12 laps to go. (. . . )
  • Comments:
    • 1. Ferrrari and Schumacher are now beyond the point where anyone can stop them (...)
    • 2. (. . . ) Ralf, Montoya or DC need to win all the remaining seven races without Michael getting any points (...)
    • 3. (. . . ) Both Williams' drivers could be giving Schumi more of a challenge if their cars were reliable, as could Coulthard at McLaren (...)
  • Reference: http://news.bbc.co.uk/sport2/low/sports_talk/2031010.stm

The main challenge in normalizing named entities (NEs) occurring in the comments on a news story is that commentators often do not use the full name of an already mentioned NE, use nicknames, misspell words or creatively pun with them. For example, in one of the examples in our Dutch data set (a news article with 90 comments), singer Anneke Grönloh is referred to in 11 different ways, including variants such as Anneke Grohnloh, anneke gr ?hnloh, Mw. Gronloh, Anneke Kreunlo, Mevrouw G., etc. Other examples of creative language use include G@@Gle and Bu$h. Besides, commentators often introduce additional NEs not even mentioned in the triggering news story, and some of the NEs used may actually refer to earlier comments. All in all, this turns NEN on UGC into a challenging problem.
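To make the variant problem concrete, the following sketch shows how such spelling variants might be matched to a canonical name with an edit-distance threshold. The paper's improved algorithm does use edit distance for approximate name matching (see Section 6), but the particular threshold, the length normalization, and the helper names below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: approximate name matching with Levenshtein edit distance.
# The relative threshold (0.3) and lowercasing are assumptions for demonstration;
# the paper only states that edit distance is used for approximate matching.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def matches_variant(surface: str, canonical: str, max_ratio: float = 0.3) -> bool:
    """Accept a surface form as a variant of a canonical name when the edit
    distance is small relative to the length of the canonical name."""
    s, c = surface.lower(), canonical.lower()
    return edit_distance(s, c) <= max_ratio * max(len(c), 1)

# A few of the variants from the Dutch example above (diacritics removed).
canonical = "Anneke Gronloh"
for variant in ["Anneke Grohnloh", "Anneke Kreunlo", "Mw. Gronloh"]:
    print(variant, "->", matches_variant(variant, canonical))
```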

NEN has been considered before, on structured data and on edited content. Of particular relevance to us is recent work by Cucerzan [4], comparing methods for NEN on edited content. In the present paper we apply similar methods to user generated content. We find several main sources of errors: NE recognition errors (incorrect boundaries of named entities as well as missing NEs), multiple ways of referring to the same entity, and ambiguous (out of context) references. We present five improvements to the baseline NEN algorithm to address these error types, namely: trimming, joining and n-gramming NEs, approximate name matching, identification of missing references, and name disambiguation. We assess the overall performance of the improved system as well as the individual contributions of the improvements. In this paper, we aim to create a named entity normalization algorithm for use in Dutch/English language media and reputation analysis settings that performs well on user generated content (UGC). We use Wikipedia, the largest encyclopedia to date, to assign unique identifiers to real-world entities in the entity normalization process. For NEs not found in Wikipedia, we use the most complete variant of the name found in the text as the identifier.
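As a concrete illustration of what trimming and n-gramming a recognized NE could involve, the sketch below generates word n-grams of a surface form, longest first, so that a normalization lookup can be retried on sub-phrases when the full (possibly noisy) NE string finds no match. The generation order and the minimum length are assumptions; the paper's exact trimming and n-gramming rules are not reproduced in the excerpts quoted on this page.

```python
# Illustrative sketch of n-gramming a recognized NE: produce word n-grams of
# the surface form, longest first, so a lookup (e.g., against Wikipedia titles)
# can be retried on sub-phrases when the full, possibly noisy, string fails.
# The ordering and the minimum length are assumptions, not the paper's rules.

from typing import List

def ne_ngrams(surface: str, min_words: int = 1) -> List[str]:
    words = surface.split()
    ngrams = []
    for n in range(len(words), min_words - 1, -1):   # longest n-grams first
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams

# Example: an NE whose boundaries were recognized too widely.
print(ne_ngrams("Michael Schumacher Ferrari"))
# ['Michael Schumacher Ferrari', 'Michael Schumacher', 'Schumacher Ferrari',
#  'Michael', 'Schumacher', 'Ferrari']
```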

The main contributions of the paper are: a presentation and analysis of the problem of NEN in UGC, an algorithm for addressing the problem, and an evaluation and analysis of the algorithm. Our algorithm was developed for Dutch and using Dutch data, but experiments with an English data set indicate that it is readily applicable to other languages. Moreover, the algorithm is computationally efficient.

The remainder of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we present a baseline algorithm for named entity normalization in user generated content, and describe an improved version based on an error analysis. In Section 4 we present our experimental setup and in Section 5 we present and analyze the results of our evaluation. A set of conclusions in Section 6 completes the paper.

2. Related Work

Name matching and disambiguation have been recognized as important problems in various domains. Borgman and Siegfried [2] present an overview of motivations, applications, common problems and techniques for name matching in the art domain; see [17] for recent experiments with classification for name matching. A dual problem, personal name disambiguation, has also attracted a lot of attention, and a number of unsupervised methods have been proposed [14, 18]. A similar problem has been known in the database community for over five decades as the record linkage or record matching problem [8, 21]. However, there the task is more general: matching arbitrary types of records, not just person names or other types of named entities. Another line of research focuses on the identification, disambiguation and matching of text objects other than named entities, specifically temporal expressions [1].

Related problems occur in a different task: discovering links in text. Like NEN, this task involves identifying and disambiguating references to entities, and has also attracted attention of the research community [10, 15].

Research on named entity extraction and normalization has been carried out in both restricted and open domains. For example, for the case of scientific articles on genomics, where gene and protein names can be both synonymous and ambiguous, Cohen [3] normalizes entities using dictionaries automatically extracted from gene databases. For the news domain, Magdy et al. [13] address cross-document Arabic person name normalization using a machine learning approach, a dictionary of person names and frequency information for names in a collection. Cucerzan [4] considers the entity normalization task for news and encyclopedia articles; they use information extracted from Wikipedia combined with machine learning for context-aware name disambiguation; the baseline that we use in this paper (taken from [11]) is a modification (and improved version) of Cucerzan [4]’s baseline. Cucerzan [4] also presents an extensive literature overview on the problem.

Recent research has also examined the impact of normalizing entities in text on specific information access tasks. Zhou et al. [22] show that appropriate use of a domain-specific knowledge base (i.e., synonyms, hypernyms, etc.) yields significant improvements in passage retrieval in the biomedical domain. Similarly, Khalid et al. [11] demonstrate that NEN based on Wikipedia helps text retrieval in the context of Question Answering in the news domain.

Finally, in recent years, there has been a steady increase in the development or adaptation of language technology for UGC. Most of the attention has gone to blogs (see [16] for a recent survey on text analytics for blogs). Online discussion fora are more closely related to the data with which we work; recent research includes work on finding authoritative answers in forum threads [9, 12], as well as attempts to assess the quality of forum posts [20]. To the best of our knowledge, discussion threads as triggered by news stories of the kind considered here have not been studied before.

3. An Algorithm for Named Entity Normalization in User Generated Content

In this section we present a baseline algorithm for NE normalization based on [4, 11], perform an error analysis, and describe five improvements to the baseline, each accounting for a specific type of error identified.

3.1 Baseline algorithm

Algorithm 1, our baseline algorithm, takes as input a pair ⟨A, R⟩, where A is the triggering news article and R is the list of comments on A in reverse chronological order. It returns an entity model, i.e., a list of triples ⟨s, n, p⟩, where s is a surface form (i.e., a named entity as it occurs in the text), n is the normalized form of s (e.g., the title of the corresponding Wikipedia article), and p is the character position of s in the document. For example, one of the entity triples from the text in Figure 1 is ⟨Schumi, Michael Schumacher, 57⟩. Line 1 of Algorithm 1 performs the NE recognition, i.e., it identifies NEs of types PERSON, LOCATION, ORGANIZATION or MISC (miscellaneous). Lines 2 and 4 do the preprocessing: we remove all noisy NEs, i.e., short (at most 2 characters) or stopword-only NEs, with the exception of (capitalized) abbreviations, and remove diacritics (e.g., replacing ö with o).
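For concreteness, here is a minimal sketch of the entity triple and of the preprocessing on lines 2 and 4 of Algorithm 1 (noisy-NE removal and diacritic removal). The stopword list and the exact noise criteria are assumptions used for illustration.

```python
# Minimal sketch of the entity triple and of the preprocessing steps of
# Algorithm 1: a noisy-NE filter (line 2) and diacritic removal (line 4).
# The stopword list and the "noisy" criteria below are illustrative assumptions.

import unicodedata
from dataclasses import dataclass

@dataclass
class EntityTriple:
    surface: str      # the NE as it occurs in the text, e.g. "Schumi"
    normalized: str   # e.g. the Wikipedia title "Michael Schumacher"
    position: int     # character offset of the surface form in the document

STOPWORDS = {"the", "a", "an", "of", "de", "van"}   # assumed, tiny example list

def is_noisy(ne: str) -> bool:
    """Drop very short and stopword-only NEs, but keep capitalized
    abbreviations such as 'DC' or 'BBC'."""
    if ne.isupper() and ne.isalpha():
        return False                                 # keep abbreviations
    if len(ne) <= 2:
        return True
    return all(w.lower() in STOPWORDS for w in ne.split())

def remove_diacritics(text: str) -> str:
    """Replace accented characters by their base form, e.g. 'ö' -> 'o'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_diacritics("Anneke Grönloh"))   # Anneke Gronloh
print(is_noisy("DC"), is_noisy("of the"))    # False True
```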

Next, on line 5 we normalize each found NE using the function shown as Algorithm 2. This normalization algorithm treats NEs that are person names differently from other NE types (lines 2 and 3). Specifically, for persons we further remove common titles (such as Mr, Mrs) and perform within-document reference resolution, as detailed in Algorithm 3.
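A minimal sketch of the title-removal step for person names follows; the title list is an assumed sample, since the paper does not enumerate the titles it strips.

```python
# Assumed sample of common titles; the paper's actual list is not given here.
PERSON_TITLES = {"mr", "mr.", "mrs", "mrs.", "dr", "dr.", "dhr.", "mw.", "mevrouw"}

def remove_titles(name: str) -> str:
    """Strip leading title tokens from a person name, e.g. 'Mw. Gronloh' -> 'Gronloh'."""
    tokens = name.split()
    while tokens and tokens[0].lower() in PERSON_TITLES:
        tokens.pop(0)
    return " ".join(tokens)

print(remove_titles("Mw. Gronloh"))   # Gronloh
```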

Algorithm 2 continues (line 5) by trying to link the NE to a Wikipedia article, calling the function findWikiEntity shown in Algorithm 4. If even after this step the NE is not normalized, we take the string itself as its normalized form (lines 6–7 of Algorithm 1).

The function ResolveRefInDoc, described in Algorithm 3, examines the list of entities already found and normalized earlier in the document, and finds matches based on first or last names.
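The sketch below illustrates the idea behind ResolveRefInDoc under stated assumptions: a new person mention is matched against persons already normalized earlier in the document by first or last name, trying the most recently seen entity first. Since Algorithm 3 itself is not reproduced here, the matching and ordering details are assumptions.

```python
# Sketch of within-document reference resolution (the role of Algorithm 3):
# a new person NE such as "Michael" or "Schumacher" is matched against persons
# already normalized earlier in the document by first or last name.
# Trying the most recently seen entity first is an assumption.

from typing import List, Optional

def resolve_ref_in_doc(name: str, normalized_so_far: List[str]) -> Optional[str]:
    tokens = name.lower().split()
    for candidate in reversed(normalized_so_far):      # most recent first
        cand_tokens = candidate.lower().split()
        first, last = cand_tokens[0], cand_tokens[-1]
        if tokens and (tokens[0] == first or tokens[-1] == last):
            return candidate
    return None

seen = ["Michael Schumacher", "Juan Pablo Montoya"]
print(resolve_ref_in_doc("Michael", seen))    # Michael Schumacher
print(resolve_ref_in_doc("Montoya", seen))    # Juan Pablo Montoya
```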

The function findWikiEntity, described in Algorithm 4, first tries to match the input reference string with a Wikipedia article title (either an exact match or a case-insensitive match, in this order). If we find a matching Wikipedia page title, WT, we check whether the page is a Wikipedia redirect page (line 2). In case of a redirect, we take the title of the target Wikipedia page instead. Then we check whether WT refers to a Wikipedia disambiguation page, i.e., a page that lists a number of possible candidate pages for a given term. If this is the case, we select one of them, disambiguating between the candidates using a heuristic from [11]: we select the candidate that has the highest number of incoming links in Wikipedia.
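The following sketch mirrors the three lookup steps just described (title match, redirect following, link-count disambiguation) against a tiny hand-built stand-in for a Wikipedia index. All data structures and numbers are illustrative assumptions; the actual system works against a Wikipedia dump.

```python
# Sketch of the findWikiEntity lookup (Algorithm 4) against an assumed local
# stand-in for Wikipedia: a title set, a redirect map, disambiguation pages
# with their candidates, and toy incoming-link counts (illustrative only).

from typing import Dict, List, Optional

WIKI_TITLES = {"Michael Schumacher", "David Coulthard", "DC (disambiguation)"}
LOWER_TITLES = {t.lower(): t for t in WIKI_TITLES}
REDIRECTS: Dict[str, str] = {"Schumi": "Michael Schumacher"}
DISAMBIG: Dict[str, List[str]] = {
    "DC (disambiguation)": ["David Coulthard", "Direct current", "Washington, D.C."],
}
INLINKS: Dict[str, int] = {                      # toy link counts, illustrative only
    "David Coulthard": 950, "Direct current": 700, "Washington, D.C.": 12000,
}

def find_wiki_entity(ref: str) -> Optional[str]:
    # 1. Exact title (or redirect) match, then case-insensitive title match.
    if ref in WIKI_TITLES or ref in REDIRECTS:
        title = ref
    else:
        title = LOWER_TITLES.get(ref.lower())
    if title is None:
        return None
    # 2. If the matched page is a redirect, follow it to the target page.
    title = REDIRECTS.get(title, title)
    # 3. If it is a disambiguation page, pick the candidate with the most
    #    incoming links in Wikipedia (the heuristic from [11]).
    if title in DISAMBIG:
        title = max(DISAMBIG[title], key=lambda c: INLINKS.get(c, 0))
    return title

print(find_wiki_entity("Schumi"))              # Michael Schumacher
print(find_wiki_entity("michael schumacher"))  # Michael Schumacher
```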

Algorithm 1 Compute the entity model of a document
Require: a text document DOC
1: REFSdoc ← NE-Recognition(DOC) {REFSdoc: list of ⟨NE, type, position⟩ triples}
2: REFSdoc ← Remove-NoisyNEs(REFSdoc)
3: for each ⟨NE, type, position⟩ ∈ REFSdoc do
4:   REF ← RemoveDiacritics(NE)
5:   REF-norm ← NormalizeNE(⟨REF, type⟩, REF-NORMSdoc) {see Algorithm 2}
6:   if REF-norm = NULL then
7:     REF-norm ← NE
8:   end if
9:   add ⟨REF, REF-norm, position⟩ to REF-NORMSdoc
10: end for
11: return REF-NORMSdoc

Algorithm 2 NormalizeNE: named entity normalization
Require: a pair ⟨NAME, Type⟩, a list REF-NORMS
1: if Type = PERSON then
2:   NAME ← RemoveTitles(NAME)
3:   REF-norm ← ResolveRefInDoc(NAME, REF-NORMS) {see Algorithm 3}
4: end if
5: if REF-norm = NULL then
6:   REF-norm ← findWikiEntity(NAME) {see Algorithm 4}
7: end if
8: return REF-norm

6. Conclusion

Our aim in this paper was to create a named entity normalization algorithm for use in Dutch language media and reputation analysis settings that performs well on user generated content (UGC). For this purpose we started with a baseline NEN system from the literature, and found that it performed much worse on UGC than on edited news: 65% vs. 80% accuracy on a Dutch language data set and 77% vs. 94% accuracy on an English language data set.

We identified the following main sources of errors of the baseline system when applied to UGC: NE recognition errors (incorrect boundaries of named entities or missing NEs), multiple ways of referring to the same entity, and ambiguous (out of context) references. We addressed these issues by proposing five improvements to the baseline NEN algorithm. Our experimental results showed that all improvements are important in increasing recall, precision and accuracy of the algorithm. While helpful in increasing the recall, the improvement that we introduced to cover missing NEs is expensive in terms of running time. The overall system can run on multiple languages, and the main source of differences in performance between languages seems to be the size of the underlying corpus against which named entities are normalized, Wikipedia.

In future work we will attempt to further improve the performance of our NEN algorithm by using context-aware named entity disambiguation, creating small entity-specific language models. In addition, we want to improve the underlying NER tools we use and to consider measures of string similarity other than the one used so far (edit distance), so as to better handle misspellings of person names.

References

  • 1 D. Ahn, J. van Rantwijk, and M. de Rijke. A cascaded machine learning approach to interpreting temporal expressions. In Human Language Technologies 2007: Proceedings of ACL 2007, pages 420--427, 2007.
  • 2 C. L. Borgman and S. L. Siegfried. Getty's synoname and its cousins: A survey of applications of personal name-matching algorithms. Journal of the American Society for Information Science, 43(7):459--476, 1992.
  • 3 A. M. Cohen. Unsupervised gene/protein named entity normalization using automatically extracted dictionaries. In ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases, pages 17--24, 2005.
  • 4 (Cucerzan, 2007) ⇒ Silviu Cucerzan. (2007). “Large-Scale Named Entity Disambiguation Based on Wikipedia Data.” In: Proceedings of EMNLP-CoNLL-2007.
  • 5 Fien De Meulder, Walter Daelemans, Memory-based named entity recognition using unannotated data, Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, p.208-211, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119211
  • 6 AnHai Doan, Alon Y. Halevy, Semantic-integration research in the database community, AI Magazine, v.26 n.1, p.83-94, March 2005
  • 7 Xin Dong, Alon Halevy, Jayant Madhavan, Reference reconciliation in complex information spaces, Proceedings of the 2005 ACM SIGMOD Conference, June 14-16, 2005, Baltimore, Maryland doi:10.1145/1066157.1066168
  • 8 Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering, v.19 n.1, p.1-16, January 2007 doi:10.1109/TKDE.2007.9
  • 9 Donghui Feng, Erin Shaw, Jihie Kim, Eduard Hovy, Learning to detect conversation focus of threaded discussions, Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, p.208-215, June 04-09, 2006, New York, New York doi:10.3115/1220835.1220862
  • 10 Sisay Fissaha Adafre, Maarten de Rijke, Discovering missing links in Wikipedia, Proceedings of the 3rd international workshop on Link discovery, p.90-97, August 21-25, 2005, Chicago, Illinois doi:10.1145/1134271.1134284
  • 11 M. Khalid, Valentin Jijkoun, and M. de Rijke. The impact of named entity normalization on information retrieval for question answering. In: Proceedings of ECIR 2008, 2008.
  • 12 J. Kim, G. Chern, D. Feng, E. Shaw, and Eduard Hovy. Mining and assessing discussions on the web through speech act analysis. In: Proceedings of the Workshop on Web Content Mining with Human Language Technologies at the 5th International Semantic Web Conference, 2006.
  • 13 W. Magdy, K. Darwish, O. Emam, and H. Hassan. Arabic cross-document person name normalization. In CASL Workshop '07, pages 25--32, 2007.
  • 14 Gideon S. Mann, David Yarowsky, Unsupervised personal name disambiguation, Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, p.33-40, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119181
  • 15 Rada Mihalcea, Andras Csomai, Wikify!: linking documents to encyclopedic knowledge, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal doi:10.1145/1321440.1321475
  • 16 G. Mishne. Applied Text Analytics for Blogs. PhD thesis, University of Amsterdam, Amsterdam, 2007.
  • 17 C. Phua, V. Lee, and K. Smith. The personal name problem and a recommended data mining solution. In Encyclopedia of Data Warehousing and Mining (2nd Edition). 2006.
  • 18 Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles, Efficient topic-based unsupervised name disambiguation, Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada doi:10.1145/1255175.1255243
  • 19 Erik F. Tjong Kim Sang, Memory-based named entity recognition, proceedings of the 6th conference on Natural language learning, p.1-4, August 31, 2002 doi:10.3115/1118853.1118878
  • 20 M. Weimer, I. Gurevych, and M. Mühlhäuser. Automatically assessing the post quality in online discussions on software. In: Proceedings of the ACL 2007 Demo and Poster Sessions, pages 125--128, 2007.
  • 21 W. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC., 1999.
  • 22 Wei Zhou, Clement Yu, Neil Smalheiser, Vetle Torvik, Jie Hong, Knowledge-intensive conceptual retrieval and passage extraction of biomedical literature, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 23-27, 2007, Amsterdam, The Netherlands doi:10.1145/1277741.1277853


Author(s): Mahboob Alam Khalid, Valentin Jijkoun, Maarten Marx, Maarten de Rijke
Title: Named Entity Normalization in User Generated Content
Venue: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data
URL: http://staff.science.uva.nl/~mdr/Publications/Files/sigir2008-and-nen.pdf
DOI: 10.1145/1390749.1390755
Year: 2008