- (Cohen & Sarawagi, 2004) ⇒ William W. Cohen, Sunita Sarawagi. (2004). “Exploiting Dictionaries in Named Entity Extraction: Combining semi-Markov extraction processes and data integration methods.” In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004) doi:10.1145/1014052.1014065
Subject Headings: Named Entity Extraction
- ~135 http://scholar.google.com/scholar?q=%22Exploiting+Dictionaries+in+Named+Entity+Extraction%3A+Combining+semi-Markov+extraction+processes+and+data+integration+methods%22+2004
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries; more specifically, the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process, which is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling high-performance NER methods and high-performance similarity functions, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance over previous methods for using external dictionaries in NER.
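The segment-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity function (`difflib.SequenceMatcher`), the fixed `ENTITY_THRESHOLD` of 0.6, the label names, and the function names are all assumptions chosen for illustration, whereas the paper learns feature weights from training data. The sketch scores whole candidate segments against a dictionary and uses a Viterbi-style dynamic program over segment boundaries instead of labeling one word at a time.

```python
from difflib import SequenceMatcher

# Hypothetical constant: an ENTITY segment scores positive only when its
# best dictionary similarity exceeds this value. A real system would use
# learned weights instead.
ENTITY_THRESHOLD = 0.6


def dict_similarity(segment_words, dictionary):
    """Best similarity in [0, 1] between the whole segment and any entry."""
    text = " ".join(segment_words).lower()
    return max(
        (SequenceMatcher(None, text, entry.lower()).ratio() for entry in dictionary),
        default=0.0,
    )


def segment_score(segment_words, label, dictionary):
    """Score a whole segment; the similarity is an entity-level feature."""
    if label == "ENTITY":
        return dict_similarity(segment_words, dictionary) - ENTITY_THRESHOLD
    return 0.0  # OTHER segments are neutral


def semi_markov_viterbi(tokens, dictionary, max_len=4):
    """Return the best-scoring list of (start, end, label) segments."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score of tokens[:i]
    back = [None] * (n + 1)           # back[i]: (start, label) of last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for label in ("ENTITY", "OTHER"):
                if label == "OTHER" and length > 1:
                    continue  # non-entity words form single-word segments
                score = best[start] + segment_score(tokens[start:end], label, dictionary)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)
    # Follow back-pointers from the end of the sentence to recover segments.
    segments = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return segments


if __name__ == "__main__":
    tokens = "talk by Wiliam Cohen today".split()
    dictionary = ["William Cohen", "Sunita Sarawagi"]
    for start, end, label in semi_markov_viterbi(tokens, dictionary):
        print(label, " ".join(tokens[start:end]))
```

With the misspelled input above, the two-word segment "Wiliam Cohen" is scored as a unit against the dictionary entry "William Cohen" and is labeled as a single entity, which a word-by-word classifier could not do with a whole-name similarity measure.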
|2004 ExploitingDictionariesInNamedEntityExtraction||William W. Cohen|
|Exploiting Dictionaries in Named Entity Extraction: Combining semi-Markov extraction processes and data integration methods||KDD-2004 Conference||http://www.it.iitb.ac.in/~sunita/papers/kdd04.pdf||10.1145/1014052.1014065||2004|