- (Hall et al., 2008) ⇒ Rob Hall, Charles Sutton, and Andrew McCallum. (2008). “Unsupervised Deduplication Using Cross-field Dependencies.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008). doi:10.1145/1401890.1401931
- It is based on a technical report: R. Hall, C. Sutton, and A. McCallum. “Unsupervised Coreference of Publication Venues.” University of Massachusetts Amherst, Amherst, MA, Tech. Rep., 2007.
- (Wick et al., 2009) ⇒ Michael Wick, Aron Culotta, Khashayar Rohanimanesh, and Andrew McCallum. (2009). “An Entity Based Model for Coreference Resolution.” In: Proceedings of the SIAM International Conference on Data Mining (SDM 2009).
- Hall et al. [30, 31] also propose a directed model, but for the task of venue coreference. Their approach incorporates distortion models between strings that discover patterns of heterogeneous representations, in a spirit similar to canonicalization. In contrast to previous unsupervised methods, they explicitly model dependencies between their two attributes of interest: venues and titles. Their results reveal that modeling this dependency is important; however, in a directed framework, adding further dependencies between attributes requires blowing up the model. In the citation matching task, we can have up to a dozen attributes, and modeling all these cross-attribute dependencies becomes prohibitively expensive. In contrast, because discriminative training methods model the conditional distribution, the complexity of our model stays the same when adding additional cross-field dependencies.
Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent -- because venues tend to focus on a few research areas -- but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a non-exchangeable string-edit model over venues. We show that modeling cross-field dependence yields a substantial improvement in performance -- a 58% reduction in error over a standard Dirichlet process mixture.
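To make the idea of a cross-field dependence concrete, the following is a minimal sketch, not the authors' model or inference procedure. It assumes that each cluster keeps title word counts and one canonical venue string, scores titles with a Dirichlet-multinomial predictive probability, and stands in for the paper's string-edit model with a crude unnormalized penalty on Levenshtein distance. All names (`title_log_prob`, `venue_log_score`, `joint_log_score`), the hyperparameters `beta` and `lam`, and the toy clusters are hypothetical.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def title_log_prob(words, cluster_counts, vocab_size, beta=0.1):
    """Dirichlet-multinomial predictive log-probability of the title words,
    given the word counts already assigned to the cluster (symmetric prior beta)."""
    counts = Counter(cluster_counts)  # copy, so the cluster is not mutated
    total = sum(counts.values())
    logp = 0.0
    for w in words:
        logp += math.log((counts[w] + beta) / (total + beta * vocab_size))
        counts[w] += 1
        total += 1
    return logp


def venue_log_score(venue, canonical_venue, lam=1.0):
    """Assumed stand-in for a string-edit model: an unnormalized log-score
    that decays linearly with edit distance to the canonical venue."""
    return -lam * edit_distance(venue.lower(), canonical_venue.lower())


def joint_log_score(record, cluster, vocab_size, beta=0.1, lam=1.0):
    """One cluster assignment explains both fields, so the scores add."""
    return (title_log_prob(record["title"], cluster["title_counts"], vocab_size, beta)
            + venue_log_score(record["venue"], cluster["canonical_venue"], lam))


# Toy data (invented for illustration only).
clusters = {
    "ml": {"title_counts": {"learning": 5, "inference": 3}, "canonical_venue": "ICML"},
    "db": {"title_counts": {"query": 4, "database": 4}, "canonical_venue": "VLDB"},
}
record = {"title": ["learning", "inference"], "venue": "ICLM"}  # misspelled venue

best = max(clusters, key=lambda k: joint_log_score(record, clusters[k], vocab_size=1000))
print(best)
```

Because both field scores hang off the same cluster assignment, title evidence can pull a record with a noisy venue string toward the right cluster, and vice versa. A full treatment would also need a prior over assignments (for example, a Dirichlet process) and a properly normalized venue model.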
|title||Unsupervised Deduplication Using Cross-field Dependencies|
|Author||Rob Hall, Charles Sutton and Andrew McCallum|
|journal||Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008)|
|url||http://www.cs.umass.edu/~mccallum/papers/kdd289-hall.pdf|
|doi||10.1145/1401890.1401931|
|year||2008|