2006 ALatentDirichletModelForUnsupEntRes

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Entity Record Coreference Resolution Algorithm, Latent Dirichlet Allocation Model, Unsupervised Learning Algorithm.

Notes

Cited By

Quotes

Abstract

  • Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it is a) generative, b) does not make pair-wise decisions and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.

1 Introduction

  • In many applications, there are a variety of ways of referring to the same underlying entity. Given a collection of entity references, or references for short, we would like to a) determine the collection of ‘true’ underlying entities and b) correctly map the references in the collection to these entities. This problem comes up in many guises throughout computer science. Examples include computer vision, where we need to figure out when regions in two different images refer to the same underlying object (the correspondence problem); natural language processing where we would like to determine which noun phrases refer to the same underlying entity (co-reference resolution); and databases, where, when merging two databases or cleaning a database, we need to determine when two records are referring to the same underlying individual (deduplication).
  • We are interested in resolving references when they are connected to each other via relational links, as in the bibliographic domain where author names in papers are connected by co-author links. Now entity resolution becomes collective in that resolution decisions depend on each other through the relational links. We show that collective entity resolution improves performance over independent pair-wise resolution.
  • There is a long history of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not assume the underlying entities to be known, does not make pairwise decisions and explicitly models relations between entities using group membership.
  • We introduce a generative probabilistic model for entity resolution that builds on the recently proposed Latent Dirichlet Allocation model (LDA) [6]. Unlike most existing models, we do not introduce a decision variable for each potential duplicate pair of references, but instead have an entity label for each reference. To model collaborative relations between entities, we introduce a group label for each reference, so that entities coming from the same collaborative group are more likely to be observed in a relation. For author resolution, this means that we model collaborative groups to explain co-authorship relations. The generative process in our model may be viewed as an extension of the Dirichlet Process mixture model: the group labels in our model influence the choice of entities for each author reference in a paper.
  • Another contribution of this paper is an unsupervised Gibbs sampling algorithm for collective entity resolution. It is unsupervised because we do not make use of a labeled training set and it is collective because the resolution decisions depend on each other through the group labels. Further, the number of entities is not fixed in our model, and we propose a novel sampling strategy to estimate the most likely number of entities given the references.

References

  • [1] R. Ananthakrishna, S. Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of The International Conference on Very Large Databases, (2002).
  • [2] C. Antoniak. Mixtures of dirichlet processes with applications to bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174, 1974.
  • [3] I. Bhattacharya and Lise Getoor. Iterative record linkage for cleaning and integration. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, (2004).
  • [4] M. Bilenko and Raymond Mooney. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of The International Conference on Knowledge Discovery and Data Mining, (2003).
  • [5]David M. Blei and Michael I. Jordan. Variational methods for the dirichlet process. In: Proceedings of The International Conference on Machine Learning, (2004).
  • [6]David M. Blei, A. Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:951– 991, Jan (2003).
  • [7] S. Chaudhuri, K. Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of The International Conference on Management of Data, (2003).
  • [8] William W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IJCAI Workshop on Information Integration on the Web, (2003).
  • [9] William W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of The International Conference on Knowledge Discovery and Data Mining, (2002).
  • [10] Aron Culottaand Andrew McCallum. A conditional model of deduplication for multi-type relational data. Technical Report IR-443, University of Massachusetts, (2005).
  • [11] Anhai Doan, Ying Lu, Yoonkyong Lee, and Jiawei Han. (2003). “Object Matching for Data Integration: A profile-based approach.” In: IJCAI Workshop on Information Integration on the Web.
  • [12] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In: Proceedings of The International Conference on Management of Data, (2005).
  • [13] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183– 1210, 1969.
  • [14] T. Ferguson. A bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.
  • [15] C. L. Giles, K. Bollacker, and S. Lawrence. CiteSeer: An automatic citation indexing system. In Conference on Digital Libraries, (1998).
  • [16] T. Griffiths and Mark Steyvers. Finding scientific topics. In: Proceedings of the National Academy of Sciences, (2004).
  • [17] M. A. Hern´andez and S. J. Stolfo. The merge/purge problem for large databases. In: Proceedings of The International Conference on Management of Data, (1995).
  • [18] T. Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, (1999).
  • [19] H. D. III and D. Marcu. A bayesian model for supervised clustering with the dirichlet process prior. Journal of Machine Learning Research, 6:1551–1577, Sep (2005).
  • [20] D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SIAM International Conference on Data Mining, (2005).
  • [21] J. Kubica, A. Moore, J. Schneider, and Y. Yang. Stochastic link and group detection. In National Conference on Artificial Intelligence, (2002).
  • [22] X. Li, P. Morie, and Dan Roth. Robust reading: Identification and tracing of ambiguous names. In Human Language Technology Conference / NAACL, (2004).
  • [23] Andrew McCallum, K. Nigam, and L. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In: Proceedings of The International Conference On Knowledge Discovery and Data Mining, (2000).
  • [24] Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In Neural Information Processing Systems, (2004).
  • [25] T. P. Minka. Expectation propagation for approximate bayesian inference. In Uncertainty in Artificial Intelligence, (2001).
  • [26] Alvaro E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In: Proceedings of The International Conference on Knowledge Discovery and Data Mining, (1996).
  • [27] R. M. Neal. Markov chain sampling methods for dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, (2000).
  • [28] Parag and Pedro Domingos. Multi-relational record linkage. In ACM SIGKDD Workshop on Multi-Relational Data Mining, (2004).
  • [29] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Neural Information Processing Systems, (2003).
  • [30] P. Ravikumar and William W. Cohen. A hierarchical graphical model for record linkage. In Uncertainty in Artificial Intelligence, (2004).
  • [31] M. Rosen-Zvi, T. Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Uncertainty in Artificial Intelligence, (2004).
  • [32] Sunita Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In: Proceedings of The International Conference on Knowledge Discovery and Data Mining, (2002).
  • [33] S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems Journal, 26(8):635–656, (2001).
  • [34] W. E. Winkler. Methods for record linkage and Bayesian networks. Technical report, Statistical Research Division, U.S. Census Bureau, Washington, DC, (2002).

,

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2006 ALatentDirichletModelForUnsupEntResA Latent Dirichlet Model for Unsupervised Entity Resolutionhttp://www.cs.umd.edu/~getoor/Publications/bhattacharya-sdm06.pdf