2005 ScalingInformationExtraction

Subject Headings: Information Extraction Task

Notes

Cited By

~20 http://scholar.google.com/scholar?cites=18397343179966760392

Quotes

Abstract

  • Information extraction and text mining applications are just beginning to tap the immense amounts of valuable textual information available online. In order to extract information from millions, and in some cases, billions of documents, different solutions to scalability emerged. We review key approaches for scaling up information extraction, including using general-purpose search engines as well as indexing techniques specialized for information extraction applications. Scalable information extraction is an active area of research, and we highlight some of the opportunities and challenges in this area that are relevant to the database community.

Background: Information Extraction

  • The general information extraction process is outlined in Figure 1 (adapted from [15]).
  • Figure 1
    • Text Document ⇒ Local Text Analysis (Lexical Analysis ⇒ Named Entity Recognition ⇒ Syntactic Analysis ⇒ Extraction Pattern Matching (RE)) ⇒ Discourse and Collection Analysis (Coreference Resolution ⇒ Deduplication/Disambiguation ⇒ Merging and Conflict Resolution) ⇒ Structured Object
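The flow in Figure 1 can be illustrated with a minimal Python sketch. It is not the paper's system: the regular expression, the `normalize` helper, and the majority-vote conflict resolution are hypothetical stand-ins chosen for brevity. Named entity recognition and syntactic analysis are collapsed into a single pattern, and coreference resolution is omitted.

```python
import re
from collections import defaultdict

# Hypothetical extraction pattern for a HeadquarteredIn(org, location) relation.
# A capitalized-word heuristic stands in for named entity recognition.
PATTERN = re.compile(
    r"(?P<org>[A-Z][\w&]*(?: [A-Z][\w&]*)*) (?:is|was) (?:headquartered|based) in "
    r"(?P<loc>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)


def split_sentences(text):
    """Lexical analysis stand-in: naive split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_patterns(sentence):
    """Extraction pattern matching: return (org, location) pairs."""
    return [(m.group("org"), m.group("loc")) for m in PATTERN.finditer(sentence)]


def normalize(name):
    """Deduplication stand-in: lowercase and drop a trailing company suffix."""
    return re.sub(r"\s+(inc|corp|corporation)\.?$", "", name.strip().lower())


def extract(documents):
    """Run local analysis per document, then merge across the collection."""
    votes = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        for sentence in split_sentences(doc):
            for org, loc in match_patterns(sentence):
                votes[normalize(org)][loc] += 1
    # Conflict resolution: keep the most frequently extracted location.
    return {org: max(locs, key=locs.get) for org, locs in votes.items()}


docs = [
    "Microsoft is headquartered in Redmond. It makes software.",
    "Microsoft Corporation is based in Redmond.",
    "Microsoft was based in Albuquerque.",
]
result = extract(docs)
print(result)
```

Here the local stages run independently on each document, while deduplication and merging need a view of the whole collection, which is why the second half of the pipeline is the harder part to scale.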

Exploiting General-Purpose Search Engines

Using Specialized Indexes and Search Engines

  • General-purpose search engines are designed for short keyword queries and for retrieving relatively few results per query. In contrast, information extraction systems can submit sophisticated and specific queries and request many or all query results.

Conclusions

  • A dimension of information extraction scalability not addressed in this survey is a trade-off between domain independence and extraction accuracy. While named entity extraction technology is relatively mature and is generally accurate for common entity types (e.g., person and location names), domain-independent relation and event extraction techniques are still error-prone, and are an active area of natural language processing and text mining research. One interesting research direction is to apply probabilistic query processing techniques (reviewed in [30]) to derive usable query answers from the noisy information extracted from text.
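As a rough illustration of deriving usable answers from noisy extractions, the sketch below ranks candidate answers by the probability that at least one supporting extraction is correct. It assumes each extraction carries a confidence score and that extractions are independent. Both are simplifying assumptions made here, not claims from the survey or from [30], and the function name and sample scores are hypothetical.

```python
from collections import defaultdict


def answer_probabilities(extractions):
    """Rank answers by P(at least one supporting extraction is correct).

    extractions: iterable of (answer, confidence) pairs, confidence in [0, 1].
    Assumes extractions are independent (a simplification).
    """
    miss = defaultdict(lambda: 1.0)  # probability that every extraction is wrong
    for answer, confidence in extractions:
        miss[answer] *= 1.0 - confidence
    scored = [(answer, 1.0 - m) for answer, m in miss.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical extractor output for the query "Where is Microsoft headquartered?"
ranked = answer_probabilities(
    [("Redmond", 0.6), ("Redmond", 0.5), ("Albuquerque", 0.7)]
)
for answer, probability in ranked:
    print(f"{answer}: {probability:.2f}")
```

Two moderately confident extractions of the same answer can outrank one more confident extraction of a conflicting answer, which is the kind of aggregation that redundancy in large collections makes possible.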

References

  • Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In: Proceedings of the Fifth ACM Conference on Digital Libraries (DL 2000), 2000.
  • Eugene Agichtein and Luis Gravano. Querying text databases for efficient information extraction. In: Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE 2003), 2003.
  • Abdullah Al-Hamdani and Gultekin Ozsoyoglu. Selecting topics for web resource discovery: Efficiency issues in a database approach. In: Proceedings of the DEXA Conference, 2003.
  • BrightPlanet.com LLC. The Deep Web: Surfacing hidden value. Available at http://www.completeplanet.com/Tutorials/DeepWeb/index.asp, July 2000.
  • Sergey Brin. Extracting patterns and relations from the world wide web. In: Proceedings of the First International Workshop on the Web and Databases, WebDB 1998, 1998.
  • Michael J. Cafarella, Doug Downey, Stephen Soderland, and Oren Etzioni. KnowItNow: Fast, scalable information extraction from the web. In Conference on Human Language Technologies (HLT/EMNLP), 2005.
  • Michael J. Cafarella and Oren Etzioni. A search engine for natural language applications. In: Proceedings of the World Wide Web Conference (WWW), 2005.
  • Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 31(11-16), 1999.
  • Surajit Chaudhuri, Raghu Ramakrishnan, and Gerhard Weikum. Integrating db and ir technologies: What is the sound of one hand clapping? In Second Biennial Conference on Innovative Data Systems Research, 2005.
  • Jennifer Chu-Carroll, Krzysztof Czuba, John Prager, Abraham Ittycheria, and Sasha Blair-Goldensohn. IBM’s PIQUANT II in TREC 2004. In 13th Text Retrieval Conference (TREC), 2004.
  • William W Cohen, Matthew Hurst, and Lee S Jensen. A flexible learning system for wrapping tables and lists in html documents. In: Proceedings of the World Wide Web Conference (WWW), 2002.
  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Sixth Symposium on Operating System Design and Implementation (OSDI), 2004.
  • Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, Ramanathan V. Guha, Anant Jhingran, Sridhar Rajagopalan, Tapas Kanungo, Andrew Tomkins, John A. Tomlin, and Jason Y. Zien. SemTag and SemSeeker: Bootstrapping the semantic web via automated semantic annotation. In: Proceedings of the World Wide Web Conference (WWW), 2003.
  • Oren Etzioni, Michael J. Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 2005.
  • Ralph Grishman. Information extraction: Techniques and challenges. In Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, (SCIE-97), pages 10–27, 1997.
  • Ralph Grishman, Silja Huttunen, and Roman Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35(4):236–246, August 2002.
  • D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. How to build a WebFountain: An architecture for very large-scale text analytics. IBM Systems Journal, 2004.
  • Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In: Proceedings of the 28th International Conference on Very Large Databases (VLDB), 2002.
  • Panagiotis G. Ipeirotis, Luis Gravano, and Mehran Sahami. Probe, count, and classify: Categorizing hidden-web databases. In: Proceedings of the ACM SIGMOD Conference, 2001.
  • Quanzhong Li and Bongki Moon. Indexing and querying xml data for regular path expressions. In: Proceedings of the 27th International Conference on Very Large Databases (VLDB), 2001.
  • Ken C. Litkowski. Question answering using xml-tagged documents. In The Eleventh Text REtrieval Conference (TREC), 2002.
  • Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
  • Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy markov models for information extraction and segmentation. In: Proceedings of the International Conference on Machine Learning, 2000.
  • M. Mettler. TREC-II routing experiments with the TRW/Paracel Fast Data Finder. In: Proceedings of the Second Text REtrieval Conference (TREC-2), 1993.
  • Patrick Pantel, Deepak Ravichandran, and Eduard Hovy. Towards terascale knowledge acquisition. In Conference on Computational Linguistics (COLING), 2004.
  • John Prager, Eric Brown, and Anni Coden. Question-answering by predictive annotation. In: Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2000.
  • Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. The terascale challenge. In KDD Workshop on Mining for and from the Semantic Web, 2004.
  • Philip Resnik and Aaron Elkiss. The linguist’s search engine: An overview (demonstration). In ACL, 2005.
  • Sunita Sarawagi and William W. Cohen. Semi-markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems 17, 2005.
  • Dan Suciu and Nilesh Dalvi. Foundations of probabilistic query answering. Tutorial at the ACM SIGMOD Conference, 2005.
  • Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24(4):35–43, December 2001.
  • Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning (ECML), 2001.

  • Page: 2005 ScalingInformationExtraction
  • Author: Eugene Agichtein
  • Title: Scaling Information Extraction to Large Document Collections
  • URL: http://www.mathcs.emory.edu/~eugene/papers/DEB05-agichtein.pdf