2008 InformationExtraction


Subject Headings: Information Extraction Task, Named Entity Mention Recognition Task, Named Entity Mention Recognition Algorithm, Binary Relation Mention Recognition Task, Binary Relation Mention Recognition Algorithm.

Notes

Cited By

Quotes

Abstract

The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community where the primary impetus came from competitions centered around the recognition of named entities like people names and organization from news articles. As society became more data oriented with easy online access to both structured and unstructured data, new applications of structure extraction came around. Now, there is interest in converting our personal desktops to structured databases, the knowledge in scientific publications to structured records, and harnessing the Internet for structured fact finding queries. Consequently, there are many different communities of researchers bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem.

This review is a survey of information extraction research of over two decades from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.

1 Introduction

Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources. This enables much richer forms of queries on the abundant unstructured sources than possible with keyword searches alone. When structured and unstructured data co-exist, information extraction makes it possible to integrate the two types of sources and pose queries spanning them.
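
As a rough illustration of the definition above, the following Python sketch turns unstructured sentences into entity and relationship records and then answers a structured query over them. It is a toy, hand-written pattern under stated assumptions and not a method taken from the survey: the sentences, the names (Alice Smith, Acme Corp, Globex), the `works_at` relation, and the helpers `extract` and `employees_of` are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # e.g. "PERSON" or "ORGANIZATION"


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


# Hypothetical hand-written rule: "<First Last> works at <Capitalized Name>".
# Real extractors use far richer rules or statistical models.
WORKS_AT = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) works at "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract(text):
    """Return (entities, relations) found in the text by the toy rule."""
    entities, relations = [], []
    for match in WORKS_AT.finditer(text):
        person, org = match.group("person"), match.group("org")
        entities.append(Entity(person, "PERSON"))
        entities.append(Entity(org, "ORGANIZATION"))
        relations.append(Relation(person, "works_at", org))
    return entities, relations


def employees_of(relations, org):
    """Structured query over extracted records: who works at `org`?"""
    return [r.subject for r in relations
            if r.predicate == "works_at" and r.obj == org]


if __name__ == "__main__":
    # Made-up input text, used only for illustration.
    text = "Alice Smith works at Acme Corp. Bob Jones works at Globex."
    entities, relations = extract(text)
    print(employees_of(relations, "Acme Corp"))  # ['Alice Smith']
```

A keyword search for "Acme Corp" would only return the sentences that mention it; the extracted `Relation` records are what make the question "who works at Acme Corp?" answerable as a query.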

...


References

  • [1] (2004). ACE. Annotation guidelines for entity detection and tracking.
  • [2] Eugene Agichtein, “Extracting relations from large text collections,” PhD thesis, Columbia University, (2005).
  • [3] Eugene Agichtein and Venkatesh Ganti, “Mining reference tables for automatic text segmentation,” In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA, (2004).
  • [4] Eugene Agichtein and L. Gravano, “Snowball: Extracting relations from large plaintext collections,” In: Proceedings of the 5th ACM International Conference on Digital Libraries, (2000).
  • [5] Eugene Agichtein and L. Gravano, “Querying text databases for efficient information extraction,” in ICDE, (2003).
  • [6] Rakesh Agrawal, Heikki Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining, (Usama M. Fayyad, G. Piatetsky-Shapiro, Padhraic Smyth, and R. Uthurusamy, eds.), ch. 12, pp. 307–328, AAAI/MIT Press, (1996).
  • [7] J. Aitken, “Learning information extraction rules: An inductive logic programming approach,” In: Proceedings of the 15th European Conference on Artificial Intelligence, pp. 355–359, (2002).
  • [8] R. Ananthakrishna, S. chaudhuri, and Venkatesh Ganti, “Eliminating fuzzy duplicates in data warehouses,” in International Conference on Very Large Databases (VLDB), (2002).
  • [9] R. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” Journal of Machine Learning Research, vol. 6, pp. 1817–1853, (2005).
  • [10] Douglas E. Appelt, J. R. Hobbs, J. Bear, D. J. Israel, and M. Tyson, “Fastus: A finite-state processor for information extraction from real-world text,” in IJCAI, pp. 1172–1178, (1993).
  • [11] A. Arasu, H. Garcia-Molina, and S. University, “Extracting structured data from web pages,” in SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD Conference, pp. 337–348, (2003).
  • [12] S. Argamon-Engelson and I. Dagan, “Committee-based sample selection for probabilistic classifiers,” Journal of Artificial Intelligence Research, vol. 11, pp. 335–360, (1999).
  • [13] M.-F. Balcan, A. Beygelzimer, and J. Langford, “Agnostic active learning,” in ICML, pp. 65–72, (2006).
  • [14] Michele Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and Oren Etzioni, “Open information extraction from the web,” in IJCAI, pp. 2670–2676, (2007).
  • [15] N. Bansal, A. Blum, and S. Chawla, “Correlation clustering,” in FOCS ’02: Proceedings of the 43rd Symposium on Foundations of Computer Science, USA, Washington, DC: IEEE Computer Society, (2002).
  • [16] G. Barish, Y.-S. Chen, D. DiPasquo, C. A. Knoblock, S. Minton, I. Muslea, and C. Shahabi, “Theaterloc: Using information integration technology to rapidly build virtual applications,” in International Conference on Data Engineering (ICDE), pp. 681–682, (2000).
  • [17] R. Baumgartner, S. Flesca, and G. Gottlob, “Visual web information extraction with lixto,” in VLDB ’01: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 119–128, USA, San Francisco, CA: Morgan Kaufmann Publishers Inc, (2001).
  • [18] M. Berland and Eugene Charniak, “Finding parts in very large corpora,” In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 57–64, (1999).
  • [19] M. Bhide, A. Gupta, R. Gupta, P. Roy, M. K. Mohania, and Z. Ichhaporia, “Liptus: Associating structured and unstructured information in a banking environment,” in SIGMOD Conference, pp. 915–924, (2007).
  • [20] D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel, “Nymble: A highperformance learning name-finder,” In: Proceedings of ANLP-97, pp. 194–201, (1997).
  • [21] Mikhail Bilenko, Raymond Mooney W. Cohen, P. Ravikumar, and S. Fienberg, “Adaptive name-matching in information integration,” IEEE Intelligent Systems, (2003).
  • [22] (2006). Biocreative — critical assessment for information extraction in biology. http://biocreative.sourceforge.net/.
  • [23] J. Blitzer, R. McDonald, and Fernando Pereira, “Domain adaptation with structural correspondence learning,” In: Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), (2006).
  • [24] A. Bordes, L. Bottou, P. Gallinari, and J. Weston, “Solving multiclass support vector machines with larank,” in ICML, pp. 89–96, (2007).
  • [25] V. R. Borkar, K. Deshmukh, and Sunita Sarawagi, “Automatic text segmentation for extracting structured records,” In: Proceedings of ACM SIGMOD Conference, Santa Barabara, USA, (2001).
  • [26] A. Borthwick, J. Sterling, Eugene Agichtein, and R. Grishman, “Exploiting diverse knowledge sources via maximum entropy in named entity recognition,” in Sixth Workshop on Very Large Corpora New Brunswick, New Jersey, Association for Computational Linguistics, (1998).
  • [27] L. Bottou, “Stochastic learning,” in Advanced Lectures on Machine Learning, number LNAI 3176 in Lecture Notes in Artificial Intelligence, (O. Bousquet and U. von Luxburg, eds.), pp. 146–168, Springer Verlag, (2004).
  • [28] J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu, “Mystiq: A system for finding more answers by using probabilities,” in ACM SIGMOD, (2005).
  • [29] A. Z. Broder, M. Fontoura, V. Josifovski, and L. Riedel, “A semantic approach to contextual advertising,” in SIGIR, pp. 559–566, (2007).
  • [30] Razvan C. Bunescu and Raymond Mooney “Learning to extract relations from the web using minimal supervision,” In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 576–583, June (2007).
  • [31] Razvan C. Bunescu and Raymond Mooney, “Collective information extraction with relational markov networks,” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 439–446, (2004).
  • [32] Razvan C. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, Raymond Mooney, A. K. Ramani, and Y. W. Wong, “Comparative experiments on learning information extractors for proteins and their interactions,” Artificial Intelligence in Medicine, vol. 33, pp. 139–155, (2005).
  • [33] Razvan C. Bunescu and Raymond Mooney, “A shortest path dependency kernel for relation extraction,” in HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724–731, USA, Morristown, NJ: Association for Computational Linguistics, (2005).
  • [34] D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan, “OLAP over uncertain and imprecise data,” In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 970–981, VLDB Endowment, (2005).
  • [35] D. Burdick, A. Doan, R. Ramakrishnan, and S. Vaithyanathan, “Olap over imprecise data with domain constraints,” in VLDB, pp. 39–50, (2007).
  • [36] Michael J. Cafarella, N. Khoussainova, D. Wang, E. Wu, Y. Zhang, and A. Halevy, “Uncovering the relational web,” in WebDB, (2008).
  • [37] M. J. Cafarella, D. Downey, S. Soderland, and Oren Etzioni, “KnowItNow: Fast, scalable information extraction from the web,” in Conference on Human Language Technologies (HLT/EMNLP), (2005).
  • [38] M. J. Cafarella and Oren Etzioni, “A search engine for natural language applications,” in WWW, pp. 442–452, (2005).
  • [39] M. J. Cafarella, C. Re, D. Suciu, and Oren Etzioni, “Structured querying of web text data: A technical challenge,” in CIDR, pp. 225–234, (2007).
  • [40] D. Cai, ShipengYu, Ji-RongWen, and W.-Y. Ma, “Vips: A vision based page segmentation algorithm,” Technical Report MSR-TR-2003-79, Microsoft, (2004).
  • [41] Y. Cai, X. L. Dong, A. Y. Halevy, J. M. Liu, and J. Madhavan, “Personal information management with semex,” in SIGMOD Conference, pp. 921–923, (2005).
  • [42] M. Califf and Raymond Mooney Bottom-up Relational Learning of Pattern Matching Rules for Information Extraction, (2003).
  • [43] M. E. Califf and Raymond Mooney, “Relational learning of pattern-match rules for information extraction,” In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), pp. 328–334, July (1999).
  • [44] Venkatesan T. Chakaravarthy, H. Gupta, P. Roy, and M. K. Mohania, “Efficiently linking text documents with relevant structured information,” in VLDB, pp. 667– 678, (2006).
  • [45] Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Morgan-Kauffman, (2002).
  • [46] Soumen Chakrabarti, J. Mirchandani, and A. Nandi. (2005). “Spin: searching personal information networks.” In: Proceedings of SIGIR 2005.
  • [47] Soumen Chakrabarti, Kunal Punera, and M. Subramanyam. (2002). “Accelerated Focused Crawling Through Online Relevance Feedback.” In: Proceedings of WWW 2002.
  • [48] Soumen Chakrabarti, K. Puniyani, and S. Das. (2006). “Optimizing Scoring Functions and Indexes for Proximity

Search in Type-Annotated Corpora.” In: Proceedings of WWW 2006.

  • [49] A. Chandel, P. Nagesh, and Sunita Sarawagi, “Efficient batch top-k search for dictionary-based entity recognition,” In: Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), (2006).
  • [50] M. Charikar, V. Guruswami, and A. Wirth, “Clustering with qualitative information,” Journal of Computer and Systems Sciences, vol. 71, pp. 360–383, (2005).
  • [51] S. Chaudhuri, K. Ganjam, Venkatesh Ganti, and Rajeev Motwani, “Robust and efficient fuzzy match for online data cleaning,” in SIGMOD, (2003).
  • [52] Chelba and Acero, “Adaptation of maximum entropy capitalizer: Little data can help a lot,” in EMNLP, (2004).
  • [53] F. Chen, A. Doan, J. Yang, and R. Ramakrishnan, “Efficient information extraction over evolving text data,” in ICDE, (2008).
  • [54] D. Cheng, R. Kannan, S. Vempala, and G. Wang, “A divide-and-merge methodology for clustering,” ACM Transactions on Database Systems, vol. 31, pp. 1499–1525, (2006).
  • [55] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, “Evaluating probabilistic queries over imprecise data,” in SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD Conference, pp. 551–562, USA, New York, NY: ACM Press, (2003).
  • [56] B. Chidlovskii, B. Roustant, and M. Brette, “Documentum eci self-repairing wrappers: Performance analysis,” in SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD Conference, pp. 708– 717, USA, New York, NY: ACM, (2006).
  • [57] (1998). N. A. Chinchor, Overview of MUC-7/MET-2.
  • [58] J. Cho and S. Rajagopalan, “A fast regular expression indexing engine,” in ICDE, pp. 419–430, (2002).
  • [59] Y. Choi, C. Cardie, Ellen Riloff, and S. Patwardhan, “Identifying sources of opinions with conditional random fields and extraction patterns,” in HLT/EMNLP, (2005).
  • [60] Fabio Ciravegna, “Adaptive information extraction from text by rule induction and generalisation,” In: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI2001), (2001).
  • [61] W. Cohen and J. Richman, “Learning to match and cluster entity names,” in ACM SIGIR’ 01 Workshop on Mathematical/Formal Methods in Information Retrieval, (2001).
  • [62] William W. Cohen, M. Hurst, and L. S. Jensen, “A flexible learning system for wrapping tables and lists in html documents,” In: Proceedings of the 11th World Wide Web Conference (WWW2002), (2002).
  • [63] William W. Cohen, E. Minkov, and A. Tomasic, “Learning to understand web site update requests,” in IJCAI, pp. 1028–1033, (2005).
  • [64] William W. Cohen, P. Ravikumar, and S. E. Fienberg, “A comparison of string distance metrics for name-matching tasks,” In: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), (2003). (To appear).
  • [65] William W. Cohen and Sunita Sarawagi, “Exploiting dictionaries in named entity extraction: Combining semi-markov extraction processes and data integration methods,” In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2004).
  • [66] D. A. Cohn, Zoubin Ghahramani, and Michael I. Jordan, “Active learning with statistical models,” in Advances in Neural Information Processing Systems, (G. Tesauro, D. Touretzky, and T. Leen, eds.), pp. 705–712, The MIT Press, (1995).
  • [67] V. Crescenzi, G. Mecca, P. Merialdo, and P. Missier, “An automatic data grabber for large web sites,” in vldb’2004: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 1321–1324, (2004).
  • [68] A. Culotta, T. T. Kristjansson, Andrew McCallum, and P. A. Viola, “Corrective feedback and persistent learning for information extraction,” Artificial Intelligence, vol. 170, nos. 14–15, pp. 1101–1122, (2006).
  • [69] A. Culotta and J. Sorensen, “Dependency tree kernels for relation extraction,” In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pp. 423–429, Barcelona, Spain, July (2004).
  • [70] C. Cumby and Dan Roth, “Feature extraction languages for propositionalzed relational learning,” in Working Notes of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data (SRL-2003), (L. Getoor and D. Jensen, eds.), pp. 24–31, Acapulco, Mexico, August 11, (2003).
  • [71] H. Cunningham, “Information extraction, automatic,” Encyclopedia of Language and Linguistics, (2005). second ed.
  • [72] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “Gate: A framework and graphical development environment for robust nlp tools and applications,” In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, (2002).
  • [73] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: A framework and graphical development environment for robust nlp tools and applications,” In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, (2002).
  • [74] E. Cutrell and S. T. Dumais, “Exploring personal information,” Communications on ACM, vol. 49, pp. 50–51, (2006).
  • [75] N. N. Dalvi and D. Suciu, “Efficient query evaluation on probabilistic databases,” in VLDB, pp. 864–875, (2004).
  • [76] S. Dasgupta, “Coarse sample complexity bounds for active learning,” in NIPS, (2005).
  • [77] H. Daum´e III, “Frustratingly easy domain adaptation,” in Conference of the Association for Computational Linguistics (ACL), Prague, Czech Republic, (2007).
  • [78] P. DeRose, W. Shen, F. C. 0002, Y. Lee, D. Burdick, A. Doan, and R. Ramakrishnan, “Dblife: A community information management platform for the database research community (demo),” in CIDR, pp. 169–172, (2007).
  • [79] T. Dietterich, “Machine learning for sequential data: A review,” in Structural, Syntactic and Statistical Pattern Recognition; Lecture Notes in Computer Science, (T. Caelli, ed.), Vol. 2396, pp. 15–30, Springer-Verlag, (2002).
  • [80] Pedro Domingos, “Metacost: A general method for making classifiers costsensitive,” In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), (1999).
  • [81] D. Downey, M. Broadhead, and Oren Etzioni, “Locating complex named entities in web text,” in IJCAI, pp. 2733–2739, (2007).
  • [82] D. Downey, Oren Etzioni, and S. Soderland, “A probabilistic model of redundancy in information extraction,” in IJCAI, (2005).
  • [83] D. Downey, S. Schoenmackers, and Oren Etzioni, “Sparse information extraction: Unsupervised language models to the rescue,” in ACL, (2007).
  • [84] D. W. Embley, M. Hurst, D. P. Lopresti, and G. Nagy, “Table-processing paradigms: A research survey,” IJDAR, vol. 8, nos. 2–3, pp. 66–86, (2006).
  • [85] D. W. Embley, Y. S. Jiang, and Y.-K. Ng, “Record-boundary discovery in web documents,” in SIGMOD 1999, Proceedings of ACM SIGMOD Conference, June 1–3, 1999, pp. 467–478, Philadephia, Pennsylvania, USA, (1999).
  • [86] Oren Etzioni, Michael J. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, “Web-scale information extraction in KnowItAll: (preliminary results),” in WWW, pp. 100–110, (2004).
  • [87] Oren Etzioni, B. Doorenbos, and D. Weld, “A scalable comparison shopping agent for the world-wide web,” In: Proceedings of the International Conference on Autonomous Agents, (1997).
  • [88] R. Fagin, A. Lotem, and M. Naor, “Optimal aggregation algorithms for middleware,” Journal of Computer and System Sciences, vol. 66, nos. 614, 656, September (2001).
  • [89] Ronen Feldman, B. Rosenfeld, and M. Fresko, “Teg-a hybrid approach to information extraction,” Journal of Knowledge and Information Systems, vol. 9, pp. 1–18, (2006). References 369
  • [90] I. P. Fellegi and A. B. Sunter, “A theory for record linkage,” Journal of the American Statistical Society, vol. 64, pp. 1183–1210, 1969.
  • [91] D. Ferrucci and A. Lally, “Uima: An architectural approach to unstructured information processing in the corporate research environment,” Natural Language Engineering, vol. 10, nos. 3–4, pp. 327–348, (2004).
  • [92] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by gibbs sampling,” In: Proceedings of the 43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005), (2005).
  • [93] J. R. Finkel, T. Grenager, and Christopher D. Manning, “Incorporating non-local information into information extraction systems by gibbs sampling,” in ACL, (2005).
  • [94] G. W. Flake, E. J. Glover, S. Lawrence, and C. L. Giles, “Extracting query modifications from nonlinear svms,” in WWW, pp. 317–324, (2002).
  • [95] Yoav Freund, H. S. Seung, E. Shamir, and N. Tishby, “Selective sampling using the query by committee algorithm,” Machine Learning, vol. 28, nos. 2–3, pp. 133–168, (1997).
  • [96] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak, “Towards domain-independent information extraction from web tables,” in WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pp. 71– 80, ACM, (2007).
  • [97] R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano, “Text mining for product attribute extraction,” SIGKDD Explorations Newsletter, vol. 8, pp. 41–48, (2006).
  • [98] R. Grishman, “Information extraction: Techniques and challenges,” in SCIE, (1997).
  • [99] R. Grishman, S. Huttunen, and R. Yangarber, “Information extraction for enhanced access to disease outbreak reports,” Journal of Biomedical Informatics, vol. 35, pp. 236–246, (2002).
  • [100] R. Grishman and B. Sundheim, “Message understanding conference-6: A brief history,” In: Proceedings of the 16th Conference on Computational Linguistics, pp. 466–471, USA, Morristown, NJ: Association for Computational Linguistics, (1996).
  • [101] R. Gupta, A. A. Diwan, and Sunita Sarawagi, “Efficient inference with cardinalitybased clique potentials,” In: Proceedings of the 24th International Conference on Machine Learning (ICML), USA, (2007).
  • [102] R. Gupta and Sunita Sarawagi, “Curating probabilistic databases from information extraction models,” In: Proceedings of the 32nd International Conference on Very Large Databases (VLDB), (2006).
  • [103] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo, “Extracting semistructure information from the web,” in Workshop on Mangement of Semistructured Data, (1997).
  • [104] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, “Accessing the deep web,” Communications on ACM, vol. 50, pp. 94–101, (2007).
  • [105] Marti Hearst, “Automatic acquisition of hyponyms from large text corpora,” In: Proceedings of the 14th Conference on Computational Linguistics, pp. 539– 545, (1992).
  • [106] C.-N. Hsu and M.-T. Dung, “Generating finite-state transducers for semistructured data extraction from the web,” Information Systems Special Issue on Semistructured Data, vol. 23, (1998).
  • [107] J. Huang, T. Chen, A. Doan, and J. Naughton, On the Provenance of Nonanswers to Queries Over Extracted Data.
  • [108] J. Huang, Alexander J. Smola, A. Gretton, K. Borgwardt, and Bernhard Schölkopf, “Correcting sample selection bias by unlabeled data,” in Advances in Neural Information Processing Systems 20, Cambridge, MA: MIT Press, (2007).
  • [109] M. Hurst, “The interpretation of tables in texts,” PhD thesis, University of Edinburgh, School of Cognitive Science, Informatics, University of Edinburgh, (2000).
  • [110] Panagiotis G. Ipeirotis, Eugene Agichtein, P. Jain, and L. Gravano, “Towards a query optimizer for text-centric tasks,” ACM Transactions on Database Systems, vol. 32, (2007).
  • [111] N. Ireson, Fabio Ciravegna, M. E. Califf, Dayne Freitag, N. Kushmerick, and A. Lavelli, “Evaluating machine learning for information extraction,” in ICML, pp. 345–352, (2005).
  • [112] M. Jansche and Steven P. Abney, “Information extraction from voicemail transcripts,” in EMNLP ’02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 320–327, USA, Morristown, NJ: Association for Computational Linguistics, (2002).
  • [113] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu, “Avatar information extraction system,” IEEE Data Engineering Bulletin, vol. 29, pp. 40–48, (2006).
  • [114] J. Jiang and C. Zhai, “A systematic exploration of the feature space for relation extraction,” in Human Language Technologies 2007]]: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 113–120, (2007).
  • [115] Nanda Kambhatla, “Combining lexical, syntactic and semantic features with maximum entropy models for information extraction,” in The Companion Volume to the Proceedings of 42st Annual Meeting of the Association for Computational Linguistics, pp. 178–181, Barcelona, Spain: Association for Computational Linguistics, July (2004).
  • [116] S. Khaitan, G. Ramakrishnan, S. Joshi, and A. Chalamalla, “Rad: A scalable framework for annotator development,” in ICDE, pp. 1624–1627, (2008).
  • [117] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, “n-gram/2l: A space and time efficient two-level n-gram inverted index structure,” in VLDB ’05: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 325–336, (2005).
  • [118] D. Klein and Christopher D. Manning, “Conditional structure versus conditional estimation in NLP models,” in Workshop on Empirical Methods in Natural Language Processing (EMNLP), (2002).
  • [119] Daphne Koller and Nir Friedman, “Structured probabilistic models,” Under preparation, (2007).
  • [120] V. Krishnan and Christopher D. Manning, “An effective two-stage model for exploiting non-local dependencies in named entity recognition,” in ACL-COLING, (2006).
  • [121] N. Kushmerick, “Wrapper induction for information extraction,” PhD thesis, University of Washington, (1997).
  • [122] N. Kushmerick, “Regression testing for wrapper maintenance,” in AAAI/IAAI, pp. 74–79, (1999).
  • [123] N. Kushmerick, D. Weld, and R. Doorenbos, “Wrapper induction for information extraction,” In: Proceedings of IJCAI, (1997).
  • [124] S. R. Labeling (2008). http://www.lsi.upc.es/ srlconll/refs.html.
  • [125] John D. Lafferty, Andrew McCallum, and Fernando Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” In: Proceedings of the International Conference on Machine Learning (ICML-2001), Williams, MA, (2001).
  • [126] S. Lawrence, C. L. Giles, and K. Bollacker, “Digital libraries and autonomous citation indexing,” IEEE Computer, vol. 32, pp. 67–71, (1999).
  • [127] W. Lehnert, J. McCarthy, S. Soderland, Ellen Riloff, C. Cardie, J. Peterson, F. Feng, C. Dolan, and S. Goldman, “Umass/hughes: Description of the circus system used for tipster text,” In: Proceedings of a Workshop on Held at Fredericksburg, Virginia, pp. 241–256, USA, Morristown, NJ: Association for Computational Linguistics, (1993).
  • [128] K. Lerman, S. Minton, and C. A. Knoblock, “Wrapper maintenance: A machine learning approach,” Journal of Artificial Intellgence Research (JAIR), vol. 18, pp. 149–181, (2003).
  • [129] X. Li and J. Bilmes, “A bayesian divergence prior for classifier adaptation,” Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-2007), (2007).
  • [130] Y. Li and K. Bontcheva, “Hierarchical, perceptron-like learning for ontologybased information extraction,” in WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pp. 777–786, ACM, (2007).
  • [131] Bing Liu, M. Hu, and J. Cheng, “Opinion observer: Analyzing and comparing opinions on the web,” in WWW ’05: Proceedings of the 14th International Conference on World Wide Web, pp. 342–351, (2005).
  • [132] D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large-scale optimization,” Mathematic Programming, vol. 45, pp. 503–528, (1989).
  • [133] L. Liu, C. Pu, and W. Han, “Xwrap: An xml-enabled wrapper construction system for web information sources,” in International Conference on Data Engineering (ICDE), pp. 611–621, (2000).
  • [134] Y. Liu, K. Bai, P. Mitra, and C. L. Giles, “Tableseer: Automatic table Metadata Extraction and Searching in Digital Libraries,” in JCDL ’07: Proceedings of the 2007 Conference on Digital Libraries, pp. 91–100, USA, New York, NY: ACM, (2007).
  • [135] R. Malouf, “Markov models for language-independent named entity recognition,” In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), (2002).
  • [136] R. Malouf, “A comparison of algorithms for maximum entropy parameter estimation,” In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pp. 49–55, (2002).
  • [137] Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, (1999). 372 References
  • [138] I. Mansuri and Sunita Sarawagi, “A system for integrating unstructured data into relational databases,” In: Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), (2006).
  • [139] S. Mao, A. Rosenfeld, and T. Kanungo, “Document structure analysis algorithms: A literature survey,” Document Recognition and Retrieval X, vol. 5010, pp. 197–207, (2003).
  • [140] B. Marthi, B. Milch, and S. Russell, “First-order probabilistic models for information extraction,” in Working Notes of the IJCAI-2003 Workshop on Learning Statistical Models from Relational Data (SRL-2003), (L. Getoor and D. Jensen, eds.), pp. 71–78, Acapulco, Mexico, August 11 (2003).
  • [141] D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks, “Named entity recognition from diverse text types,” Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark, Bulgaria, (2001).
  • [142] Andrew McCallum, “Information extraction: Distilling structured data from unstructured text,” ACM Queue, vol. 3, pp. 48–57, (2005).
  • [143] Andrew McCallum, Dayne Freitag, and Fernando Pereira, “Maximum entropy markov models for information extraction and segmentation,” In: Proceedings of the International Conference on Machine Learning (ICML-2000), pp. 591–598, Palo Alto, CA, (2000).
  • [144] Andrew McCallum, K. Nigam, J. Reed, J. Rennie, and K. Seymore, Cora: Computer Science Research Paper Search Engine, http://cora.whizbang.com/, (2000).
  • [145] Andrew McCallum and Ben Wellner, “Toward conditional models of identity uncertainty with application to proper noun coreference,” In: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pp. 79–86, Acapulco, Mexico, August (2003).
  • [146] A. K. McCallum, Mallet: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu, (2002).
  • [147] D. McDonald, H. Chen, H. Su, and B. Marshall, “Extracting gene pathway relations using a hybrid grammar: The arizona relation parser,” Bioinformatics, vol. 20, pp. 3370–3378, (2004).
  • [148] R. McDonald, K. Crammer, and Fernando Pereira, “Flexible text segmentation with structured multilabel classification,” in HLT/EMNLP, (2005).
  • [149] G. Mecca, P. Merialdo, and P. Atzeni, “Araneus in the era of xml,” in IEEE Data Engineering Bullettin, Special Issue on XML, IEEE, September (1999).
  • [150] M. Michelson and C. A. Knoblock, “Semantic annotation of unstructured and ungrammatical text,” In: Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1091–1098, (2005).
  • [151] M. Michelson and C. A. Knoblock, “Creating relational data from unstructured and ungrammatical data sources,” Journal of Artificial Intelligence Research (JAIR), vol. 31, pp. 543–590, (2008).
  • [152] E. Minkov, R. C. Wang, and William W. Cohen, “Extracting personal names from email: Applying named entity recognition to informal text,” in HLT/EMNLP, (2005).
  • [153] Raymond Mooney and Razvan C. Bunescu, “Mining knowledge from text using information extraction,” SIGKDD Explorations, vol. 7, pp. 3–10, (2005). References 373
  • [154] I. Muslea, “Extraction patterns for information extraction tasks: A survey,” in The AAAI-99 Workshop on Machine Learning for Information Extraction, (1999).
  • [155] I. Muslea, S. Minton, and C. Knoblock, “Selective sampling with redundant views,” In: Proceedings of the Fifteenth National Conference on Artificial Intelligence, AAAI-2000, pp. 621–626, (2000).
  • [156] I. Muslea, S. Minton, and C. A. Knoblock, “A hierarchical approach to wrapper induction,” In: Proceedings of the Third International Conference on Autonomous Agents, Seattle, WA, (1999).
  • [157] I. Muslea, S. Minton, and C. A. Knoblock, “Hierarchical wrapper induction for semistructured information sources,” Autonomous Agents and Multi-Agent Systems, vol. 4, nos. 1/2, pp. 93–114, (2001).
  • [158] A. Niculescu-Mizil and Rich Caruana, “Predicting good probabilities with supervised learning,” in ICML, (2005).
  • [159] NIST. Automatic content extraction (ACE) program. 1998–present.
  • [160] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, vol. 2, nos. 1–2, pp. 1–135, (2008).
  • [161] Parag and Pedro Domingos, “Multi-relational record linkage,” In: Proceedings of 3rd Workshop on Multi-Relational Data Mining at ACM SIGKDD, Seattle, WA, August (2004).
  • [162] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser, “Identity uncertainty and citation matching,” in Advances in Neural Information Processing Systems 15, Vancouver, British Columbia: MIT Press, (2002).
  • [163] F. Peng and Andrew McCallum, “Accurate information extraction from research papers using conditional random fields,” in HLT-NAACL, pp. 329–336, (2004).
  • [164] D. Pinto, Andrew McCallum, X. Wei, and W. B. Croft, “Table extraction using conditional random fields,” in SIGIR ’03: Proceedings of the 26th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 235–242, USA, New York, NY: ACM, (2003).
  • [165] A. Pivk, Philipp Cimiano, Y. Sure, M. Gams, V. Rajkovič, and R. Studer, “Transforming arbitrary tables into logical form with tartar,” Data & Knowledge Engineering, vol. 60, pp. 567–595, (2007).
  • [166] C. Plake, T. Schiemann, M. Pankalla, Jörg Hakenberg, and U. Leser, “Alibaba: Pubmed as a graph,” Bioinformatics, vol. 22, pp. 2444–2445, (2006).
  • [167] (Popescu and Etzioni, 2005) ⇒ Ana-Maria Popescu, and Oren Etzioni. (2005). “Extracting Product Features and Opinions from Reviews.” In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT 2005).
  • [168] F. Popowich, “Using text mining and natural language processing for health care claims processing,” SIGKDD Explorations Newsletter, vol. 7, pp. 59–66, (2005).
  • [169] K. Probst and R. Ghani, “Towards ‘interactive’ active learning in multi-view feature sets for information extraction,” in ECML, pp. 683–690, (2007).
  • [170] J. Ross Quinlan, “Learning logical definitions from examples,” Machine Learning, vol. 5, (1990).
  • [171] L. Rabiner, “A tutorial on Hidden Markov Models and selected applications in speech recognition,” In: Proceedings of the IEEE, vol. 77, (1989).
  • [172] G. Ramakrishnan, S. Balakrishnan, and S. Joshi, “Entity annotation based on inverse index operations,” in EMNLP, (2006).
  • [173] G. Ramakrishnan, S. Joshi, S. Balakrishnan, and A. Srinivasan, “Using ilp to construct features for information extraction from semi-structured text,” in ILP, (2007).
  • [174] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis, “Automatic fragment detection in dynamic web pages and its impact on caching,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 859–874, (2005).
  • [175] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis, “Automatic detection of fragments in dynamically generated web pages,” in WWW, pp. 443–454, (2004).
  • [176] J. Raposo, A. Pan, M. Álvarez, and N. Ángel Vira, “Automatic wrapper maintenance for semi-structured web sources using results from previous queries,” in SAC ’05: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 654–659, ACM, (2005).
  • [177] Adwait Ratnaparkhi, “Learning to parse natural language with maximum entropy models,” Machine Learning, vol. 34, (1999).
  • [178] (Reeve & Han, 2005) ⇒ Lawrence Reeve, and Hyoil Han. (2005). “Survey of Semantic Annotation Platforms.” In: Proceedings of the 2005 ACM symposium on Applied computing [doi:10.1145/1066677.1067049].
  • [179] F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan, “An algebraic approach to rule-based information extraction,” in ICDE, (2008).
  • [180] Philip Resnik and A. Elkiss, “The linguist’s search engine: An overview (demonstration),” in ACL, (2005).
  • [181] Ellen Riloff, “Automatically constructing a dictionary for information extraction tasks,” in AAAI, pp. 811–816, (1993).
  • [182] B. Rosenfeld and Ronen Feldman, “Using corpus statistics on entities to improve semi-supervised relation extraction from the web,” In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 600–607, June (2007).
  • [183] R. Ross, V. S. Subrahmanian, and J. Grant, “Aggregate operators in probabilistic databases,” Journal of ACM, vol. 52, pp. 54–101, (2005).
  • [184] A. Sahuguet and F. Azavant, “Building light-weight wrappers for legacy web data-sources using w4f,” in International Conference on Very Large Databases (VLDB), (1999).
  • [185] Sunita Sarawagi, The CRF Project: A Java Implementation. http://crf.sourceforge.net, (2004).
  • [186] Sunita Sarawagi, “Efficient inference on sequence segmentation models,” In: Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, (2006).
  • [187] Sunita Sarawagi and A. Bhamidipaty, “Interactive deduplication using active learning,” In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Canada, July (2002).
  • [188] S. Satpal and Sunita Sarawagi, “Domain adaptation of conditional probability models via feature subsetting,” in ECML/PKDD, (2007).
  • [189] K. Seymore, Andrew McCallum, and R. Rosenfeld, “Learning Hidden Markov Model structure for information extraction,” in Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 37–42, (1999).
  • [190] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan, “Declarative information extraction using datalog with embedded extraction predicates,” in VLDB, pp. 1033–1044, (2007).
  • [191] M. Shilman, P. Liang, and P. Viola, “Learning non-generative grammatical models for document analysis,” ICCV, vol. 2, pp. 962–969, (2005).
  • [192] Y. Shinyama and Satoshi Sekine, “Preemptive information extraction using unrestricted relation discovery,” in HLT-NAACL, (2006).
  • [193] J. F. Silva, Z. Kozareva, V. Noncheva, and G. P. Lopes, “Extracting named entities. A statistical approach,” In: Proceedings of the XIème Conférence sur le Traitement des Langues Naturelles (TALN), 19–22 Avril, Fez, Morocco, (B. Bel and I. Merlien, eds.), pp. 347–351, ATALA (Association pour le Traitement Automatique des Langues), 04, (2004).
  • [194] P. Singla and Pedro Domingos, “Entity resolution with markov logic,” in ICDM, pp. 572–582, (2006).
  • [195] S. Soderland, “Learning information extraction rules for semi-structured and free text,” Machine Learning, vol. 34, (1999).
  • [196] F. M. Suchanek, G. Ifrim, and G. Weikum, “Combining linguistic and statistical analysis to extract relations from web documents,” in KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 712–717, (2006).
  • [197] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of semantic knowledge,” in WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pp. 697–706, (2007).
  • [198] B. M. Sundheim, “Overview of the third message understanding evaluation and conference,” In: Proceedings of the Third Message Understanding Conference (MUC-3), pp. 3–16, San Diego, CA, (1991).
  • [199] C. Sutton and Andrew McCallum, “Collective segmentation and labeling of distant entities in information extraction,” Technical Report TR # 04-49, University of Massachusetts Presented at ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields, July (2004).
  • [200] K. Takeuchi and N. Collier, “Use of support vector machines in extended named entity recognition,” in The 6th Conference on Natural Language Learning (CoNLL), (2002).
  • [201] Ben Taskar, “Learning structured prediction models: A large margin approach,” PhD Thesis, Stanford University, (2004).
  • [202] Ben Taskar, D. Klein, Michael Collins, Daphne Koller, and C. Manning, “Max-margin parsing,” in EMNLP, July (2004).
  • [203] Ben Taskar, S. Lacoste-Julien, and Michael I. Jordan, “Structured prediction, dual extragradient and bregman projections,” Journal of Machine Learning Research, vol. 7, pp. 1627–1653, (2006).
  • [204] M. Theobald, G. Weikum, and R. Schenkel, “Top-k query evaluation with probabilistic guarantees,” in VLDB, pp. 648–659, (2004).
  • [205] C. A. Thompson, M. E. Califf, and Raymond Mooney, “Active learning for natural language parsing and information extraction,” In: Proceedings of 16th International Conference on Machine Learning, pp. 406–414, Morgan Kaufmann, San Francisco, CA, (1999).
  • [206] E. F. Tjong Kim Sang and F. D. Meulder, “Introduction to the conll-2003 shared task: Language-independent named entity recognition,” in Seventh Conference on Natural Language Learning (CoNLL-03), (W. Daelemans and M. Osborne, eds.), pp. 142–147, Edmonton, Alberta, Canada: Association for Computational Linguistics, May 31–June 1, (2003). (In association with HLT-NAACL, 2003).
  • [207] A. Troussov, B. O’Donovan, S. Koskenniemi, and N. Glushnev, “Per-node optimization of finite-state mechanisms for natural language processing,” in CICLing, pp. 221–224, (2003).
  • [208] I. Tsochantaridis, Thorsten Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” Journal of Machine Learning Research (JMLR), vol. 6, pp. 1453–1484, September (2005).
  • [209] J. Turmo, A. Ageno, and N. Català, “Adaptive information extraction,” ACM Computing Surveys, vol. 38, p. 4, (2006).
  • [210] P. D. Turney, “Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm,” Journal of Artificial Intelligence Research, pp. 369–409, (1995).
  • [211] P. D. Turney, “Expressing implicit semantic relations without supervision,” in ACL, (2006).
  • [212] V. S. Uren, Philipp Cimiano, J. Iria, S. Handschuh, M. Vargas-Vera, E. Motta, and Fabio Ciravegna, “Semantic annotation for knowledge management: Requirements and a survey of the state of the art,” Journal of Web Semantics, vol. 4, pp. 14–28, (2006).
  • [213] P. Viola and M. Narasimhan, “Learning to extract information from semistructured text using a discriminative context free grammar,” in SIGIR ’05: Proceedings of the 28th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 330–337, USA, New York, NY: ACM, (2005).
  • [214] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy, “Accelerated training of conditional random fields with stochastic gradient methods,” in ICML, pp. 969–976, (2006).
  • [215] M. Wang, “A re-examination of dependency path kernels for relation extraction,” In: Proceedings of IJCNLP, (2008).
  • [216] (Wang & Hu, 2002) ⇒ Yalin Wang, and Jianying Hu. (2002). “A Machine Learning Based Approach for Table Detection on the Web.” In: Proceedings of the Eleventh International World Wide Web Conference (WWW 2002). doi:10.1145/511446.511478
  • [217] Ben Wellner, Andrew McCallum, F. Peng, and M. Hay, “An integrated, conditional model of information extraction and coreference with application to citation matching,” in Conference on Uncertainty in Artificial Intelligence (UAI), (2004).
  • [218] M. Wick, A. Culotta, and Andrew McCallum, “Learning field compatibilities to extract database records from unstructured text,” In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 603–611, Sydney, Australia: Association for Computational Linguistics, July (2006).
  • [219] Ian H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco, (1999).
  • [220] F. Wu and D. S. Weld, “Autonomously semantifying wikipedia,” in CIKM, pp. 41–50, (2007).
  • [221] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown,” In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (KDD), (2001).
  • [222] R. Zanibbi, D. Blostein, and R. Cordy, “A survey of table recognition: Models, observations, transformations, and inferences,” International Journal on Document Analysis and Recognition, vol. 7, pp. 1–16, (2004).
  • [223] D. Zelenko, C. Aone, and A. Richardella, “Kernel methods for relation extraction,” Journal of Machine Learning Research, vol. 3, pp. 1083–1106, (2003).
  • [224] M. Zhang, J. Zhang, J. Su, and G. Zhou, “A composite kernel to extract relations between entities with both flat and structured features,” In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 825–832, Sydney, Australia: Association for Computational Linguistics, July (2006).
  • [225] S. Zhao and R. Grishman, “Extracting relations with integrated information using kernel methods,” in ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 419–426, (2005).
  • [226] G. Zhu, T. J. Bethea, and V. Krishna, “Extracting relevant named entities for automated expense reimbursement,” in KDD, pp. 1004–1012, (2007).


 | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
2008 InformationExtraction | Sunita Sarawagi | | | Information Extraction | | Foundations and Trends in Databases | http://www.it.iitb.ac.in/~sunita/papers/ieSurvey.pdf | 10.1561/1900000003 | | 2008