2005 DisambiguatingWebAppearancesOfPeople


Subject Headings: Entity Mention Clustering Algorithm, Person NER


Cited By




Author Keywords

Web appearance, name disambiguation, social network, document clustering, link structure, information bottleneck.

Abstract


Say you are looking for information about a particular person. A search engine returns many pages for that person's name, but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on the link structure of the Web pages, the other using Agglomerative/Conglomerative Double Clustering (A/CDC), an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someone's email folder. On this dataset our methods outperform traditional agglomerative clustering by more than 20%, achieving over 80% F-measure.
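The abstract reports results as an F-measure. As a rough illustration only, the sketch below computes the standard balanced F-measure (the harmonic mean of precision and recall) for a set of pages predicted to be about the target person against a hand-labeled set. The function name `f_measure`, the page identifiers, and the toy data are hypothetical, and the exact evaluation protocol used in the paper is not stated in this entry.

```python
def f_measure(predicted, relevant):
    """Return (precision, recall, F1) for two collections of page identifiers.

    predicted: pages a method marked as being about the target person
    relevant:  pages hand-labeled as being about the target person
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, not taken from the paper's dataset.
predicted_pages = {"p1", "p2", "p3", "p4"}
relevant_pages = {"p2", "p3", "p4", "p5", "p6"}

precision, recall, f1 = f_measure(predicted_pages, relevant_pages)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case three of the four predicted pages are correct and three of the five relevant pages are found, so precision is higher than recall and the F-measure falls between the two.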



2005 DisambiguatingWebAppearancesOfPeople
Author: Ron Bekkerman, Andrew McCallum
title: Disambiguating Web Appearances of People in a Social Network
journal: Proceedings of the 14th International World Wide Web Conference
titleUrl: http://dx.doi.org/10.1145/1060745.1060813
doi: 10.1145/1060745.1060813
year: 2005