2005 MiningKnowledgeFromTextUsingIE

Subject Headings: Information Extraction Task

Notes

Cited By

Quotes

Abstract

An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
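The two steps named in the abstract (identifying entity mentions, then the relationships stated between them) can be illustrated with a minimal sketch. It is not the method of any system discussed in the paper: the entity list, the relation phrases, and the function names below are all hypothetical, and a real extractor would learn them from annotated text rather than hard-code them.

```python
import re
from typing import List, NamedTuple


class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical toy gazetteer and relation phrases (assumptions for illustration).
ENTITIES = ["RAD51", "BRCA2", "TP53", "MDM2"]
PREDICATES = ["interacts with", "binds to", "inhibits"]

_ENTITY_ALT = "|".join(re.escape(e) for e in ENTITIES)
_PRED_ALT = "|".join(re.escape(p) for p in PREDICATES)

# Matches any single entity mention.
ENTITY_RE = re.compile(r"\b(" + _ENTITY_ALT + r")\b")

# Matches "<entity> <relation phrase> <entity>" within running text.
RELATION_RE = re.compile(
    r"\b(?P<subj>" + _ENTITY_ALT + r")\s+"
    r"(?P<pred>" + _PRED_ALT + r")\s+"
    r"(?P<obj>" + _ENTITY_ALT + r")\b"
)


def extract_entities(text: str) -> List[str]:
    """Return entity mentions in order of appearance."""
    return ENTITY_RE.findall(text)


def extract_relations(text: str) -> List[Relation]:
    """Return (subject, predicate, object) records stated in the text."""
    return [
        Relation(m.group("subj"), m.group("pred"), m.group("obj"))
        for m in RELATION_RE.finditer(text)
    ]
```

The resulting `Relation` records are structured rows, so they could be handed to a conventional data-mining step (for example, counting or rule mining over the extracted tuples), which is the second usage the abstract describes.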

References

  • Rakesh Agrawal and R. Srikant. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Databases (VLDB-94), pages 487–499, Santiago, Chile, Sept. 1994.
  • Ricardo Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
  • S. W. Bennett, C. Aone, and C. Lovell. Learning to tag multilingual texts through observation. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-97), pages 109–116, Providence, RI, 1997.
  • D. M. Bikel, R. Schwartz, and R. M. Weischedel. An algorithm that learns what’s in a name. Machine Learning, 34:211–232, 1999.
  • Mikhail Bilenko, Raymond Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, 2003.
  • M. Bilenko and Raymond Mooney. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pages 39–48, Washington, DC, Aug. 2003.
  • Christian Blaschke and A. Valencia. Can bibliographic pointers for known biological data be found automatically? protein interactions as a case study. Comparative and Functional Genomics, 2:196–206, 2001.
  • Christian Blaschke and A. Valencia. The frame-based module of the Suiseki information extraction system. IEEE Intelligent Systems, 17:14–20, 2002.
  • Eric D. Brill. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565, 1995.
  • Razvan C. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, Raymond Mooney, A. K. Ramani, and Y. W. Wong. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine (special issue on Summarization and Information Extraction from Medical Documents), 33(2):139–155, 2005.
  • Razvan C. Bunescu and R. J. Mooney. Collective information extraction with relational Markov networks. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 439–446, Barcelona, Spain, July 2004.
  • M. E. Califf and Raymond Mooney. Relational learning of pattern-match rules for information extraction. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), pages 328–334, Orlando, FL, July 1999.
  • M. E. Califf and Raymond Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177–210, 2003.
  • C. Cardie. Empirical methods in information extraction. AI Magazine, 18(4):65–79, 1997.
  • X. Carreras, L. Màrquez, and L. Padró. A simple named entity extractor using AdaBoost. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.
  • Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann.
  • H. L. Chieu and H. T. Ng. Named entity recognition with a maximum entropy approach. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), pages 160–163, Edmonton, Canada, 2003.
  • Kenneth W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In: Proceedings of the Second Conference on Applied Natural Language Processing, pages 136–143, Austin, TX, 1988. Association for Computational Linguistics.
  • Fabio Ciravegna, Alexiei Dingli, D. Guthrie, and Y. Wilks. Mining web sites using unsupervised adaptive information extraction. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, Apr. 2003.
  • William W. Cohen. Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pages 115–123, San Francisco, CA, 1995.
  • Michael Collins. Three generative, lexicalised models for statistical parsing. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), pages 16–23, 1997.
  • M. Craven, D. DiPasquo, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 509–516, Madison, WI, July 1998.
  • M. Craven and J. Kumlien. Constructing biological knowledge bases by extracting information from text sources. In: Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-1999), pages 77–86, Heidelberg, Germany, 1999.
  • Aron Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, July 2004.
  • DARPA, editor. Proceedings of the Seventh Message Understanding Evaluation and Conference (MUC-98), Fairfax, VA, Apr. 1998. Morgan Kaufmann.
  • Pedro Domingos. Unifying instance-based and rule-based induction. Machine Learning, 24:141–168, 1996.
  • R. B. Doorenbos, Oren Etzioni, and D. S. Weld. A scalable comparison-shopping agent for the World-Wide Web. In: Proceedings of the First International Conference on Autonomous Agents (Agents-97), pages 39–48, Marina del Rey, CA, Feb. 1997.
  • C. D. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
  • Dayne Freitag. Toward general-purpose learning for information extraction. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and COLING-98 (ACL/COLING-98), pages 404–408, Montreal, Quebec, 1998.
  • Dayne Freitag and N. Kushmerick. Boosted wrapper induction. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), pages 577–583, Austin, TX, July 2000. AAAI Press / The MIT Press.
  • Dayne Freitag and Andrew McCallum. Information extraction with HMM structures learned by stochastic optimization. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Austin, TX, 2000. AAAI Press / The MIT Press.
  • C. Friedman, P. Kra, H. Yu, Michael Krauthammer, and A. Rzhetsky. GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17:S74–S82, 2001. Supplement 1.
  • K. Fukuda, T. Tsunoda, A. Tamura, and T. Takagi. Information extraction: Identifying protein names from biological papers. In: Proceedings of the 3rd Pacific Symposium on Biocomputing, pages 707–718, 1998.
  • R. Ghani, R. Jones, D. Mladenić, K. Nigam, and S. Slattery. Data mining on symbolic knowledge extracted from the Web. In D. Mladenić, editor, Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, pages 29–36, Boston, MA, Aug. 2000.
  • D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.
  • T. Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among entities from large corpora. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 416–423, Barcelona, Spain, July 2004.
  • N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pages 729–735, Nagoya, Japan, 1997.
  • J. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of 18th International Conference on Machine Learning (ICML-2001), pages 282–289, Williamstown, MA, 2001.
  • E. Marcotte, I. Xenarios, and D. Eisenberg. Mining literature for protein-protein interactions. Bioinformatics, 17(4):359–363, Apr. 2001.
  • J. Mayfield, P. McNamee, and C. Piatko. Named entity recognition using hundreds of thousands of features. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.
  • Andrew McCallum and D. Jensen. A note on the unification of information extraction and data mining using conditional probability, relational models. In: Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data, Acapulco, Mexico, Aug. 2003.
  • Andrew McCallum, S. Tejada, and D. Quass, editors. Proceedings of the KDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, DC, Aug. 2003.
  • F. D. Meulder and W. Daelemans. Memory-based named entity recognition using unannotated data. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.
  • Raymond Mooney and L. Roy. Content-based book recommending using learning for text categorization. In: Proceedings of the Fifth ACM Conference on Digital Libraries, pages 195–204, San Antonio, TX, June 2000.
  • S. H. Muggleton, editor. Inductive Logic Programming. Academic Press, New York, NY, 1992.
  • U. Y. Nahm. Text Mining with Information Extraction. PhD thesis, Department of Computer Sciences, University of Texas, Austin, TX, Aug. 2004.
  • U. Y. Nahm, Mikhail Bilenko, and Raymond Mooney. Two approaches to handling noisy variation in text mining. In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, pages 18–27, Sydney, Australia, July 2002.
  • U. Y. Nahm and Raymond Mooney. A mutually beneficial integration of data mining and information extraction. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), pages 627–632, Austin, TX, July 2000.
  • U. Y. Nahm and Raymond Mooney. Using information extraction to aid the discovery of prediction rules from texts. In: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, pages 51–58, Boston, MA, Aug. 2000.
  • U. Y. Nahm and Raymond Mooney. Mining soft-matching rules from textual data. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001), pages 979–984, Seattle, WA, July 2001.
  • U. Y. Nahm and Raymond Mooney. Mining soft-matching association rules. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM-2002), pages 681–683, McLean, VA, Nov. 2002.
  • U. Y. Nahm and Raymond Mooney. Using soft-matching mined rules to improve information extraction. In: Proceedings of the AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM-2004), pages 27–32, San Jose, CA, July 2004.
  • National Institute of Standards and Technology. ACE - Automatic Content Extraction. http://www.nist.gov/speech/tests/ace/.
  • F. Peng and Andrew McCallum. Accurate information extraction from research papers using conditional random fields. In: Proceedings of Human Language Technology Conference / North American Association for Computational Linguistics Annual Meeting (HLT-NAACL 2004), Boston, MA, 2004.
  • C. Perez-Iratxeta, P. Bork, and M. A. Andrade. Association of genes to genetically inherited diseases using data mining. Nature Genetics, 31(3):316–319, July 2002.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
  • Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • A. K. Ramani, Razvan C. Bunescu, Raymond Mooney, and E. M. Marcotte. Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biology, 6(5):r40, 2005.
  • L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In: Proceedings of the Third Workshop on Very Large Corpora, 1995.
  • E. M. Rasmussen. Clustering algorithms. In W. B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval. Prentice Hall, Englewood Cliffs, NJ, 1992.
  • S. Ray and M. Craven. Representing sentence structure in hidden Markov models for information extraction. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001), pages 1273–1279, Seattle, WA, 2001.
  • E. Riloff. Automatically generating extraction patterns from untagged text. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, Portland, OR, 1996.
  • Gerard M. Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
  • E. F. T. K. Sang and F. D. Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.
  • Sunita Sarawagi and William W. Cohen. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems 17, Vancouver, Canada, 2005.
  • F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
  • S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34:233–272, 1999.
  • L. Tanabe and W. J. Wilbur. Tagging gene and protein names in biomedical text. Bioinformatics, 18(8):1124–1132, 2002.
  • Ben Taskar, P. Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In: Proceedings of 18th Conference on Uncertainty in Artificial Intelligence (UAI-2002), pages 485–492, Edmonton, Canada, 2002.
  • C. A. Thompson, M. E. Califf, and Raymond Mooney. Active learning for natural language parsing and information extraction. In: Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), pages 406–414, Bled, Slovenia, June 1999.
  • A. J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
  • L. Wall, T. Christiansen, and R. L. Schwartz. Programming Perl. O’Reilly and Associates, Sebastopol, CA, 1996.
  • D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083–1106, 2003.

Author: Raymond J. Mooney, Razvan C. Bunescu
Title: Mining knowledge from text using information extraction
Journal: SIGKDD Explorations Newsletter
Volume: 7, Issue 1
URL: http://www.acm.org/sigs/sigkdd/explorations/issues/7-1-2005-06/2-Mooney.pdf
Year: 2005