2006 OntolDrivenAutoEntDisambigInUnstrText


Subject Headings: Entity Mention Normalization Algorithm, Person Normalization Task, Ontology, Semantic Web, DBLP, DBWorld.






  • Precisely identifying entities in web documents is essential for document indexing, web search and data integration. Entity disambiguation is the challenge of determining the correct entity out of various candidate entities. Our novel method utilizes background knowledge in the form of a populated ontology. Additionally, it does not rely on the existence of any structure in a document or the appearance of data items that can provide strong evidence, such as email addresses, for disambiguating person names. The originality of our method lies in the way it uses different relationships found in a document as well as in the ontology to provide clues for determining the correct entity. We demonstrate the applicability of our method by disambiguating names of researchers appearing in a collection of DBWorld posts using a large-scale, real-world ontology extracted from the DBLP bibliography website. The precision and recall measurements provide encouraging results.


  • A significant problem with the World Wide Web today is that there is no explicit semantic information about the data and objects being presented in the web pages. Most of the content encoded in HTML format serves its purpose of describing the presentation of the information to be displayed to human users. HTML lacks the ability to semantically express or indicate that specific pieces of content refer to real-world named entities or concepts. For instance, if “George Bush” is mentioned on a web page, there is no way for a computer to identify which “George Bush” the document is referring to or even if “George Bush” is the name of a person.
  • The Semantic Web aims at solving this problem by providing an underlying mechanism to add semantic metadata to any content, such as web pages. However, an issue that the Semantic Web currently faces is that there is not enough semantically annotated web content available. The addition of semantic metadata can be in the form of an explicit relationship from each appearance of a named entity within a document to some identifier or reference to the entity itself. The architecture of the Semantic Web relies upon URIs [4] for this purpose. Examples of this would be the entity “UGA” pointing to http://www.uga.edu and “George Bush” pointing to the URL of his official web page at the White House. However, more benefit can be obtained by referring to actual entities of an ontology, where such entities would be related to concepts and/or other entities. The problem that arises is that of entity disambiguation, which is concerned with determining the right entity within a document out of various possibilities that match the same name syntactically. For example, “A. Joshi” is ambiguous because various real-world entities (e.g., computer scientists) have the same name.
  • Entity disambiguation is an important research area within Computer Science. The more information that is gathered and merged, the more important it is for this information to accurately reflect the objects it refers to. It is a challenge in part due to the difficulty of exploiting, or the lack of, background knowledge about the entities involved. If a human is asked to determine the correct entities mentioned within a document, s/he would have to rely upon some background knowledge accumulated over time from other documents, experiences, etc. The research problem that we are addressing is how to exploit background knowledge for entity disambiguation, which is quite complicated, particularly when the only available information is the initial and last name of a person. In fact, this type of background knowledge is already available on the World Wide Web in databases, ontologies or other forms of knowledge bases. Our method utilizes background knowledge stored in the form of an ontology to pinpoint, with high accuracy, the correct object in the ontology that a document refers to. Consider a web page with a “Call for Papers” announcement where various researchers are listed as part of the Program Committee. The name of each of them can be linked to their respective homepage or other known identifiers maintained elsewhere, such as the DBLP bibliography server. Our approach for entity disambiguation is targeted at solving this type of problem, as opposed to entity disambiguation in databases, which aims at determining the similarity of attributes from different database schemas to be merged and identifying which record instances refer to the same entity (e.g., [7]).
  • The contributions of our work are two-fold: (1) a novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from an ontology; (2) an implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers. To the best of our knowledge, our method is the first work of its type to exploit an ontology and use relations within this ontology to recognize entities without relying on the structure of the document. We show that our method can determine the correct entities mentioned in a document with high accuracy by comparing its output to a manually created and disambiguated dataset.
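The idea described above, using ontology relationships as clues to pick among same-name candidates, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy ontology, the `Entity` fields, the name-variant rule and the clue-counting score are all assumptions made for illustration, and the entities are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    uri: str            # hypothetical identifier, not a real DBLP URI
    name: str
    affiliation: str = ""
    topics: tuple = ()

# Invented toy ontology: two researchers who share the short form "A. Joshi".
ONTOLOGY = [
    Entity("ex:joshi_1", "Alice Joshi", "Example University", ("semantic web",)),
    Entity("ex:joshi_2", "Arun Joshi", "Sample Institute", ("databases",)),
]

def name_variants(full_name):
    """Return the lowercase full name and its 'F. Last' short form."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {full_name.lower(), f"{first[0]}. {last}".lower()}

def candidates(mention, ontology):
    """All ontology entities whose name variants match the mention."""
    return [e for e in ontology if mention.lower() in name_variants(e.name)]

def score(entity, text):
    """Count how many of the entity's related values appear in the text."""
    lowered = text.lower()
    clues = [entity.affiliation, *entity.topics]
    return sum(1 for clue in clues if clue and clue.lower() in lowered)

def disambiguate(mention, text, ontology):
    """Return the best-supported candidate, or None if there is no candidate."""
    ranked = sorted(candidates(mention, ontology),
                    key=lambda e: score(e, text), reverse=True)
    return ranked[0] if ranked else None

text = "Program Committee: A. Joshi (Example University). Topics include semantic web."
best = disambiguate("A. Joshi", text, ONTOLOGY)
print(best.uri if best else "unresolved")
```

In this sketch a tie between candidates is broken arbitrarily by list order; a real system would more plausibly report such mentions as unresolved or attach a low score to them.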

Related Work

  • Research on the problem of entity disambiguation has taken place using a variety of techniques. Some techniques only work on structured parts of a document. The applicability of disambiguating people's names is evident when finding citations within documents. Han et al. [13] provide an assessment of several techniques used to disambiguate citations within a document. These methods use string similarity techniques and do not consider various candidate entities that may have the same name.
  • Our method differs from other approaches by a few important features. First, our method performs well on unstructured text. Second, by exploiting background knowledge in the form of a populated ontology, the process of spotting entities within the text is more focused and reduces the need for string similarity computations. Third, our method does not require any training data, as all of the data that is necessary for disambiguation is straightforward and provided in the ontology. Last but not least, our method exploits the capability provided by relationships among entities in the ontology to go beyond techniques traditionally based on syntactical matches.
  • The iterative step in our work is similar in spirit to a recent work on entity reconciliation [8]. In such an approach, the results of disambiguated entities are propagated to other ambiguous entities, which could then be reconciled based on recently reconciled entities. That method is part of a Personal Information Management system that works with a user’s desktop environment to facilitate access and querying of a user’s email address book, personal word documents, spreadsheets, etc. Thus, it makes use of predictable structures such as fields that contain known types of data (e.g., emails, dates and person names), whereas in our method we do not make any assumptions about the structure of the text. This is a key difference, as the characteristics of the data to be disambiguated pose different challenges. Our method uses an ontology and runs on unstructured text, an approach that theirs does not consider.
  • Citation matching is a related problem aiming at deciding the right citation referring to a publication [11]. In our work, we do not assume the existence of citation information such as publication venue and date. However, we believe that our method is a significant step toward addressing the Identity Uncertainty problem [16] by automatically determining unique identifiers for person names with respect to a populated ontology.
  • The SCORE system for management of semantic metadata (and data extraction) also contains a component for resolving ambiguities [18]. SCORE uses associations from a knowledge base to determine the best match from candidate entities, but implementation details of this commercial system are not available.
  • In ESpotter, named entities are recognized using a lexicon and/or patterns [20]. Ambiguities are resolved by using the URI of the webpage to determine the most likely domain of the term (probabilities are computed using hit count of search-engine results). The main difference with our work is that our method uses only named entities within the domain of a specific populated ontology.
  • Finally, our approach differs from that of disambiguating word senses [2, 12, 15]. Instead, our focus is to disambiguate named entities such as people's names, which has recently gained attention for its applicability in Social Networks [3, 1]. Thus, instead of exploiting homonymy, synonymy, etc., our method works on relationships that real-world entities have, such as the affiliation of a researcher and his/her topics.
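The propagation idea mentioned above, where entities that are already disambiguated help resolve the remaining ambiguous ones, can be sketched as a simple fixed-point loop. This is only an illustration under stated assumptions: the function name `propagate`, the `related` map (standing in for ontology relationships such as co-authorship) and the "unique best-supported candidate" rule are hypothetical and are not taken from the paper or from [8].

```python
def propagate(candidates, related):
    """Resolve mentions iteratively.

    candidates: dict mapping each mention to a set of candidate entity ids.
    related:    dict mapping an entity id to the set of entity ids it is
                linked to in the (hypothetical) ontology.
    Returns a dict mapping resolved mentions to a single entity id.
    """
    # Mentions with exactly one candidate are resolved immediately.
    resolved = {m: next(iter(c)) for m, c in candidates.items() if len(c) == 1}

    changed = True
    while changed:
        changed = False
        known = set(resolved.values())  # evidence available in this pass
        for mention, cands in candidates.items():
            if mention in resolved:
                continue
            # Support = number of links from a candidate to resolved entities.
            support = {c: len(related.get(c, set()) & known) for c in cands}
            best = max(support.values(), default=0)
            winners = [c for c, s in support.items() if s == best]
            # Accept only a unique candidate with at least one supporting link.
            if best > 0 and len(winners) == 1:
                resolved[mention] = winners[0]
                changed = True
    return resolved

# Invented example data.
mentions = {
    "B. Smith": {"smith_1"},
    "A. Joshi": {"joshi_1", "joshi_2"},
    "C. Lee": {"lee_1", "lee_2"},
}
links = {"joshi_1": {"smith_1"}, "lee_2": {"joshi_1"}}
result = propagate(mentions, links)
print(result)
```

Here "C. Lee" can only be resolved in a second pass, after "A. Joshi" has been resolved through its link to the unambiguous "B. Smith". Mentions without any supporting link stay unresolved rather than being guessed.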


  • We proposed a new ontology-driven solution to the entity disambiguation problem in unstructured text. In particular, our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques. The output of our method consists of a list of spotted entity names, each with an entity disambiguation score CS. We demonstrated the effectiveness of our approach through evaluations against a manually disambiguated document set containing over 700 entities. This evaluation was performed over DBWorld announcements using an ontology created from DBLP (consisting of over one million entities). The results of this evaluation lead us to claim that our method has successfully demonstrated its applicability to scenarios involving real-world data. To the best of our knowledge, this work is among the first which successfully uses a large, populated ontology for identifying entities in text without relying on the structure of the text.
  • In future work, we plan to integrate the results of entity disambiguation into a more robust platform such as UIMA [10]. The work we presented can be combined with other existing work so that the results may be more useful in certain scenarios. For example, the results of entity-disambiguation can be included within a document using initiatives such as Microformats (microformats.org) and RDFa (w3.org/TR/xhtmlrdfa-primer/).
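As a rough illustration of how a disambiguation result could be embedded in a document with RDFa, the sketch below wraps the first occurrence of a mention in a `span` carrying the entity's URI. The function name, the example URI and the choice of `foaf:name` as the property are assumptions for illustration; the enclosing page would also need to declare the `foaf` prefix for the markup to be meaningful.

```python
from html import escape

def annotate(text, mention, uri):
    """Wrap the first occurrence of `mention` in an RDFa-annotated span.

    The text is HTML-escaped first, so the escaped form of the mention is
    what gets replaced.
    """
    span = (f'<span about="{escape(uri, quote=True)}" '
            f'property="foaf:name">{escape(mention)}</span>')
    return escape(text).replace(escape(mention), span, 1)

# Hypothetical entity URI; a real deployment would use the ontology's identifier.
html_out = annotate("PC member: A. Joshi",
                    "A. Joshi",
                    "http://example.org/person/joshi_1")
print(html_out)
```

A fuller version might only annotate mentions whose disambiguation score exceeds a chosen threshold, leaving uncertain mentions as plain text.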


  • 1. Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A., Arpinar, I. B., Joshi, A., Finin, T.: Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 15th International World Wide
  • 2. Basili, R., Rocca, M. D., Pazienza, M. T.: Contextual Word Sense Tuning and Disambiguation. Applied Artificial Intelligence, 11(3) (1997) 235-262
  • 3. Bekkerman, R., McCallum, A.: Disambiguating Web Appearances of People in a Social Network. Proceedings of the 14th International World Wide Web Conference (WWW 2005) (2005)
  • 4. Berners-Lee, T., Fielding R., Masinter, L.: Uniform Resource Identifier (URI): Generic Syntax. RFC 3986, IETF, (2005)
  • 5. Bilenko, M., Mooney, R., Cohen, W. W., Ravikumar, P., Fienberg, S.: Adaptive Name Matching in Information Integration. IEEE Intelligent Systems, 18(5) (2003) 16-23
  • 6. DBWorld. http://www.cs.wisc.edu/dbworld/ April 9, 2006
  • 7. Dey, D., Sarkar, S., De, P.: A Distance-based Approach to Entity Reconciliation in Heterogeneous Databases. IEEE Transactions on Knowledge and Data Engineering, 14(3) (May 2002) 567-582
  • 8. Dong, X. L., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. Proceedings of SIGMOD, Baltimore, MD. (2005)
  • 9. Embley, D. W., Jiang, Y. S., Ng, Y.: Record-Boundary Discovery in Web Documents. Proceedings of SIGMOD, Philadelphia, Pennsylvania (1999) 467-478
  • 10. Ferrucci, D., Lally, A.: UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10(3-4) (2004) 327-348
  • 11. Giles, C.L., Bollacker, K.D., Lawrence, S.: CiteSeer: An Automatic Citation Indexing System. Proceedings of the 3rd ACM International Conference on Digital Libraries, Pittsburgh, PA, (June 23-26, 1998) 89-98
  • 12. Gomes, P., Pereira, F., Paiva, P., Seco, N., Carreiro, P., Ferreira, J. L., Bento, C.: Noun Sense Disambiguation with WordNet for Software Design Retrieval. Proceedings of the 16th Conference of the Canadian Society for Computational Studies of Intelligence (AI 2003), Halifax, Canada (June 11-13, 2003) 537-543
  • 13. Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two Supervised Learning Approaches for Name Disambiguation in Author Citations. Proceedings of ACM/IEEE Joint Conf on Digital Libraries, Tucson, Arizona (2004)
  • 14. Ley, M.: The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. Proceedings of the 9th International Symposium on String Processing and Information Retrieval, Lisbon, Portugal (Sept. 2002) 1-10
  • 15. Navigli, R., Velardi, P.: Structural Semantic Interconnections: A Knowledge-based Approach to Word Sense Disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7) (2005) 1075-1086
  • 16. Pasula, H., Marthi, B., Milch, B., Russell, S. J., Shpitser, I.: Identity Uncertainty and Citation Matching, Neural Information Processing Systems. Vancouver, British Columbia (2002) 1401-1408
  • 17. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM - Semantic Annotation Platform. Proceedings of the 2nd International Semantic Web Conference, Sanibel Island, Florida (2003)
  • 18. Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing Semantic Content for the Web, IEEE Internet Computing, 6(4), (2002) 80-87
  • 19. Torvik, V. I., Weeber, M., Swanson, D. R., Smalheiser, N. R.: A Probabilistic Similarity Metric for Medline Records:



   author = {Joseph Hassell and Boanerges Aleman-Meza and I. Budak Arpinar},
   title = {Ontology-Driven Automatic Entity Disambiguation in Unstructured Text},
   booktitle = {In International Semantic Web Conference},
   year = {2006},
   pages = {44--57}


2006 OntolDrivenAutoEntDisambigInUnstrText
   Author: Joseph Hassell, Boanerges Aleman-Meza, I. Budak Arpinar
   title: Ontology-driven Automatic Entity Disambiguation in Unstructured Text
   journal: Proceedings of the 5th International Semantic Web Conference
   titleUrl: http://webster.cs.uga.edu/~budak/papers/Arpinar2006pb.pdf
   doi: 10.1007/11926078
   year: 2006