Information Extraction (IE) Task
An Information Extraction (IE) Task is a data processing task that populates data structures with information extracted from non-fully-structured data sources.
- AKA: IE Task, Template/Slot/Semantic Frame Filling Task, Template Filling Task, Slot Filling Task.
- Context:
- Input: Information Artifacts.
- Output: Populated IE Data Structure.
- Measure(s): IE Performance Measures, such as:
- It can typically extract Named Entity Mentions through IE entity recognition.
- It can typically identify Semantic Relation Mentions between IE extracted entities.
- It can typically populate Database Records with IE extracted values.
- It can typically process Natural Language Documents using IE NLP techniques.
- It can typically handle Multiple Domains through IE domain adaptation.
- ...
- It can often extract Event Mentions from IE source documents.
- It can often recognize Temporal Expressions in IE text content.
- It can often identify Entity Attributes for IE target entities.
- It can often process Web Tables using IE table extraction.
- ...
- It can range from being a Human-Performed IE Task to being an Automated IE Task, depending on its IE automation level.
- It can range from being an Unstructured IE Task to being a Semi-Structured IE Task to being a Structured IE Task, depending on its IE source data structure.
- It can range from being a Manual IE Task to being an Automated IE Task, depending on its IE processing approach.
- It can range from being a Heuristic IE Task to being a Data-Driven IE Task, depending on its IE methodology type.
- It can range from being an Open IE Task to being a Closed IE Task, depending on its IE schema specification.
- It can range from being a Knowledge-Weak IE Task to being a Knowledge-Rich IE Task, depending on its IE knowledge base usage.
- ...
- It can be supported by Information Retrieval Tasks for IE document selection.
- It can be supported by Entity Mention Recognition Tasks for IE entity identification.
- It can be supported by Entity Mention Coreference Resolution Tasks for IE entity linking.
- It can be supported by Entity Mention Normalization Tasks for IE entity standardization.
- It can be supported by Semantic Relation Mention Recognition Tasks for IE relationship extraction.
- It can be supported by Duplicate Record Detection Tasks for IE data deduplication.
- It can be supported by Record Canonicalization Tasks for IE data normalization.
- ...
- It can support Question Answering Tasks through IE fact extraction.
- It can support Knowledge Extraction Tasks through IE knowledge harvesting.
- It can support Knowledge Graph Construction Tasks through IE triple extraction.
- It can be the topic of an Information Extraction Discipline.
- ...
- Example(s):
- Source-Type-Based IE Tasks, such as:
- Unstructured IE Tasks, such as:
- Semi-Structured IE Tasks, such as:
- Structured IE Tasks, such as:
- Domain-Specific IE Tasks, such as:
- Biomedical IE Tasks, such as:
- Financial IE Tasks, such as:
- Legal IE Tasks, such as:
- Application-Specific IE Tasks, such as:
- Web-Based IE Tasks, such as:
- Enterprise IE Tasks, such as:
- Methodology-Based IE Tasks, such as:
- Benchmark IE Tasks, such as:
- ...
- Counter-Example(s):
- Data Curation Task, which organizes existing data rather than extracting IE structured information.
- Semantic Annotation Task, which adds metadata rather than populating IE data structures.
- Record Linkage Task, which connects existing records rather than extracting IE new information.
- Information Retrieval Task, which finds relevant documents rather than extracting IE structured data.
- Text Summarization Task, which creates condensed text rather than populating IE templates.
- See: Information Extraction System, Information Extraction Algorithm, ACE Project, Message Understanding Conference, Named Entity Recognition Task, Relation Extraction Task, Event Extraction Task.
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Information_extraction Retrieved:2023-6-1.
- Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). [1] Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction. Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted domains. An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation $\mathrm{MergerBetween}(company_1, company_2, date)$, from an online news sentence such as: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp." A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
Information extraction is part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of natural language processing (NLP), which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between both IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details. As an example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or more terroristic acts. We also define for any given IE task a template, which is a (or a set of) case frame(s) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template.
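The template idea described above can be sketched in Python using the merger sentence from the excerpt. This is a minimal illustration, not a method from any cited source: the `MergerBetween` dataclass, the `COMPANY` and `PATTERN` regular expressions, and the `fill_merger_template` helper are all hypothetical names, and the pattern only covers sentences shaped like the one quoted. The date slot is left as the raw expression ("Yesterday"); resolving it to a calendar date is assumed to be a separate normalization step.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergerBetween:
    """Template with one slot per argument of MergerBetween(company_1, company_2, date)."""
    company_1: Optional[str] = None
    company_2: Optional[str] = None
    date: Optional[str] = None


# Assumption: a company mention is a run of capitalized words ending in "Inc." or "Corp."
COMPANY = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* (?:Inc|Corp)\."

# Assumption: the sentence starts with a one-word date expression followed by a comma.
PATTERN = re.compile(
    rf"^(?P<date>\w+), .*?(?P<c1>{COMPANY}) announced their acquisition of (?P<c2>{COMPANY})"
)


def fill_merger_template(sentence: str) -> Optional[MergerBetween]:
    """Return a filled template, or None when the sentence does not match the pattern."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(
        company_1=match.group("c1"),
        company_2=match.group("c2"),
        date=match.group("date"),
    )


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
print(fill_merger_template(sentence))
# MergerBetween(company_1='Foo Inc.', company_2='Bar Corp.', date='Yesterday')
```

A hand-written pattern like this "understands" the sentence only enough to fill the slots, which mirrors the narrow-domain focus the excerpt describes; data-driven extractors would replace the regular expression with a learned model.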
2009
- http://en.wikipedia.org/wiki/Information_extraction#Tasks_and_subtasks
- Applying information extraction to text is linked to the problem of text simplification, in order to create a structured view of the information present in free text. The overall goal is to create more easily machine-readable text for processing the sentences. Typical subtasks of IE include:
- Named entity extraction which could include:
- Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, employing existing knowledge of the domain or information extracted from other sentences. Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims to detect entities without having any existing knowledge about the entity instances. For example, in processing the sentence "M. Smith likes fishing", named entity detection would mean detecting that the phrase "M. Smith" does refer to a person, but without necessarily having (or using) any knowledge about a certain M. Smith who is (or "might be") the specific person whom that sentence is talking about.
- Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously-extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" is referring to the previously detected person "M. Smith".
- Relationship extraction: identification of relations between entities, such as:
- PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
- PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
- Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
- Table extraction: finding and extracting tables from documents.
- Comments extraction: extracting comments from the actual content of an article in order to restore the link between each sentence and its author.
- Language and vocabulary analysis
- Terminology extraction: finding the relevant terms for a given corpus.
- Audio extraction
- Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance, [2] time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.
- Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly accepted, and that many approaches combine multiple sub-tasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.
IE on non-text documents is becoming an increasingly active research topic, and information extracted from multimedia documents can now be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.
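The relationship extraction subtask listed above can be illustrated with a small Python sketch over the two example sentences. It is a deliberately naive, pattern-based illustration under stated assumptions: the `RELATION_PATTERNS` table and the `extract_relations` function are hypothetical, arguments are assumed to be single capitalized tokens, and no entity typing or coreference resolution is performed.

```python
import re
from typing import List, Tuple

# Hypothetical surface patterns, one per relation type from the examples above.
# Each entry: (relation name, argument-1 type, argument-2 type, compiled pattern).
RELATION_PATTERNS = [
    ("works for", "PERSON", "ORGANIZATION",
     re.compile(r"\b([A-Z]\w*) works for ([A-Z]\w*)")),
    ("located in", "PERSON", "LOCATION",
     re.compile(r"\b([A-Z]\w*) is in ([A-Z]\w*)")),
]

Relation = Tuple[Tuple[str, str], str, Tuple[str, str]]


def extract_relations(text: str) -> List[Relation]:
    """Return ((type, mention), relation, (type, mention)) triples found in the text.

    Results are grouped by pattern order, not by position in the text.
    """
    relations: List[Relation] = []
    for name, type_1, type_2, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            relations.append(((type_1, match.group(1)), name, (type_2, match.group(2))))
    return relations


for relation in extract_relations("Bill works for IBM. Bill is in France."):
    print(relation)
```

Because the argument types are asserted by the pattern rather than recognized, a practical pipeline would normally run named entity recognition first and apply relation patterns or classifiers only to the recognized mentions.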
- ↑ name=Kariampuzha2023
- ↑ A.Zils, F.Pachet, O.Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from Polyphonic Music Signals, Proceedings of WedelMusic, Darmstadt, Germany, 2002.
2008
- (Sarawagi, 2008) ⇒ Sunita Sarawagi. (2008). “Information extraction." FnT Databases, 1(3), 2008.
- Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources. This enables much richer forms of queries on the abundant unstructured sources than possible with keyword searches alone. When structured and unstructured data co-exist, information extraction makes it possible to integrate the two types of sources and pose queries spanning them.
2007
- (Banko et al., 2007) ⇒ Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. (2007). “Open Information Extraction from the Web.” In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2007).
2006
- (Moens, 2006) ⇒ Marie-Francine Moens. (2006). “Information Extraction: Algorithms and Prospects in a Retrieval Context." Springer. ISBN:140204987
- QUOTE: … the process of selectively structuring and combining data that are explicitly stated or implied in one or more natural language documents …
2005
- (McCallum, 2005) ⇒ Andrew McCallum. (2005). “Information Extraction: Distilling Structured Data from Unstructured Text.” In: ACM Queue, 3(9).
- NOTES: It describes the Employment Posting Extraction Task.
2003
- (Grishman, 2003) ⇒ Ralph Grishman. (2003). “Information Extraction.” In: (Mitkov, 2003).
1999
- (Appelt and Israel, 1999) ⇒ Douglas E. Appelt and David Israel. (1999). “Introduction to Information Extraction Technology." Tutorial at IJCAI 1999.
1997
- (Grishman, 1997) ⇒ Ralph Grishman. (1997). “Information extraction: Techniques and challenges.” In: Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, (SCIE-97), pages 10–27, 1997.
1993
- (Riloff, 1993) ⇒ Ellen Riloff. (1993). “Automatically Constructing a Dictionary for Information Extraction Tasks.” In: Proceedings of AAAI-93.
- (Cardie, 1993) ⇒ Claire Cardie. (1993). “A Case-based Approach to Knowledge Acquisition for Domain-Specific Sentence Analysis.” In: Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93).