2005 EffectiveScaleableCitationProblemsInDLs

From GM-RKB

Subject Headings: Record Deduplication Task, Citation Deduplication

Notes

Cited By

Quotes

Abstract

In this paper, we consider two important problems that commonly occur in bibliographic digital libraries, which seriously degrade their data qualities: Mixed Citation (MC) problem (i.e., citations of different scholars with their names being homonyms are mixed together) and Split Citation (SC) problem (i.e., citations of the same author appear under different name variants). In particular, we investigate an effective yet scalable solution since citations in such digital libraries tend to be large-scale. After formally defining the problems and accompanying challenges, we present an effective solution that is based on the state-of-the-art sampling-based approximate join algorithm. Our claim is verified through preliminary experimental results.
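As an illustration of the Split Citation problem described in the abstract, the following is a minimal sketch, not the paper's sampling-based approximate join algorithm. It assumes each author name string is represented by the set of coauthors seen in its citations, groups names with a cheap blocking key (last name plus first initial), and flags pairs of name variants whose coauthor sets overlap strongly. The toy data, the helper names (`block_key`, `jaccard`, `split_citation_candidates`), and the 0.5 threshold are all hypothetical choices made for this example.

```python
from itertools import combinations

# Toy data (made up for illustration): each author name string maps to the
# set of coauthor names seen across the citations listed under that string.
coauthors_by_name = {
    "Dongwon Lee": {"Byung-Won On", "Jaewoo Kang", "Sanghyun Park"},
    "D. Lee": {"Byung-Won On", "Jaewoo Kang"},
    "Daniel Lee": {"Alice Smith"},
    "Jaewoo Kang": {"Dongwon Lee", "Byung-Won On"},
}


def block_key(name):
    """Cheap blocking key: (last token, first initial), lowercased."""
    tokens = name.replace(".", "").split()
    return (tokens[-1].lower(), tokens[0][0].lower())


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_citation_candidates(coauthors, threshold=0.5):
    """Return (name_a, name_b, score) for likely variants of one author."""
    # Step 1: blocking, so that only names sharing a key are compared.
    blocks = {}
    for name in coauthors:
        blocks.setdefault(block_key(name), []).append(name)

    # Step 2: pairwise comparison of coauthor sets within each block.
    pairs = []
    for names in blocks.values():
        for a, b in combinations(sorted(names), 2):
            score = jaccard(coauthors[a], coauthors[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


candidates = split_citation_candidates(coauthors_by_name)
for a, b, score in candidates:
    print(f"{a!r} and {b!r} may be the same author (score {score:.2f})")
```

In this toy data, "D. Lee" and "Dongwon Lee" share two of three coauthors and are flagged, while "Daniel Lee" falls in the same block but has no coauthor overlap. The Mixed Citation problem is the converse case: one name string whose citations would need to be separated into groups belonging to different people.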



2005 EffectiveScaleableCitationProblemsInDLs

  • Author: Dongwon Lee, Byung-Won On, Jaewoo Kang, Sanghyun Park
  • title: Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries
  • journal: Proceedings of the 2nd International Workshop on Information Quality in Information Systems
  • titleUrl: http://infos.korea.ac.kr/pubs/iqis05a.pdf
  • doi: 10.1145/1077501.1077514
  • year: 2005