
An information extraction (IE) task is a data processing task that requires the population of a data structure with the information contained in non-fully-structured data.

## References

### 2009

• (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Information_extraction
• Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction of corporate mergers from news wire reports, as denoted by the formal relation $\text{MergerBetween}(\text{company}_1, \text{company}_2, \text{date})$, from an online news sentence such as:

"Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
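As a rough illustration of the MergerBetween example above, the following Python sketch populates such a relation from the sample sentence with a hand-written regular expression. The `MergerBetween` dataclass, the `extract_merger` function, and the company-name pattern are assumptions made for illustration only; they do not come from the cited source, and practical IE systems usually rely on trained models rather than a single pattern. The date is kept as the raw temporal expression ("Yesterday") instead of being resolved to a calendar date.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergerBetween:
    """Target structure to populate (hypothetical field layout)."""
    company_1: str
    company_2: str
    date: str  # raw temporal expression, e.g. "Yesterday"


# Assumption: a company name is one or more capitalized words followed by
# a legal suffix such as "Inc." or "Corp.".
COMPANY = r"(?:[A-Z][\w-]*\s)+(?:Inc|Corp|Ltd)\."

# Assumption: the sentence starts with a one-word temporal expression and
# uses the exact phrasing "announced their acquisition of".
PATTERN = re.compile(
    rf"^(?P<date>\w+),.*?(?P<c1>{COMPANY}) announced their acquisition of (?P<c2>{COMPANY})"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return a MergerBetween record, or None if the pattern does not apply."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(match["c1"], match["c2"], match["date"])


if __name__ == "__main__":
    text = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
    print(extract_merger(text))
```

Once the sentence has been reduced to a record like this, ordinary computation (filtering by company, joining with other tables) becomes possible, which is the broad goal described above.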

• Applying information extraction to text is linked to the problem of text simplification: both aim to create a structured view of the information present in free text. The overall goal is to produce text that is easier for a machine to read and process. Typical subtasks of IE include:
• Named entity extraction which could include:
• Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, employing existing knowledge of the domain or information extracted from other sentences. Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims to detect entities without any existing knowledge about the entity instances. For example, in processing the sentence "M. Smith likes fishing", named entity detection would mean detecting that the phrase "M. Smith" refers to a person, without necessarily having (or using) any knowledge about a specific M. Smith who is (or might be) the person the sentence is talking about.
• Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously-extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" is referring to the previously detected person "M. Smith".
• Relationship extraction: identification of relations between entities, such as:
• PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
• PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
• Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
• Table extraction: finding and extracting tables from documents.
• Comments extraction: extracting comments from the actual content of an article in order to restore the link between each sentence and its author.
• Language and vocabulary analysis
• Audio extraction
• Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance [1], time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.
• Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly agreed upon, and that many approaches combine multiple subtasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.
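Two of the subtasks listed above, coreference resolution and relationship extraction, can be sketched in a few lines of Python. This is a minimal, pattern-based sketch under strong assumptions: entity mentions are capitalized word sequences, the person list is already available from an earlier named entity step, and only the pronoun "he" is resolved. The names `resolve_he`, `extract_relations`, `works_for`, and `located_in` are hypothetical and chosen for illustration.

```python
import re
from typing import List, Tuple

Relation = Tuple[str, str, str]  # (subject, relation label, object)

# Assumption: an entity mention is a run of capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# One hand-written pattern per relation type from the list above.
PATTERNS = [
    ("works_for", re.compile(rf"\b({NAME}) works for ({NAME})")),
    ("located_in", re.compile(rf"\b({NAME}) is in ({NAME})")),
]


def resolve_he(text: str, persons: List[str]) -> str:
    """Replace each standalone "he"/"He" with the nearest preceding person.

    `persons` holds previously extracted person mentions, mirroring the
    restriction of coreference to already-extracted named entities.
    """
    def replace(match: "re.Match[str]") -> str:
        before = text[: match.start()]
        # Choose the person whose last occurrence is closest to the pronoun.
        best = max(persons, key=lambda p: before.rfind(p), default=None)
        if best is None or before.rfind(best) < 0:
            return match.group(0)  # no antecedent found; leave the pronoun
        return best

    return re.sub(r"\b[Hh]e\b", replace, text)


def extract_relations(text: str) -> List[Relation]:
    """Collect (subject, label, object) triples for every pattern match."""
    relations: List[Relation] = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            relations.append((match.group(1), label, match.group(2)))
    return relations


if __name__ == "__main__":
    print(resolve_he("M. Smith likes fishing. But he doesn't like biking", ["M. Smith"]))
    print(extract_relations("Bill works for IBM."))
    print(extract_relations("Bill is in France."))
```

Resolving pronouns before applying the relation patterns lets a later sentence contribute relations as well: for a made-up input such as "Bill works for IBM. He is in France.", calling `resolve_he` with `["Bill"]` first allows `extract_relations` to find both the `works_for` and the `located_in` triple.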

IE on non-text documents is an increasingly active research topic, and information extracted from multimedia documents can now be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

1. A. Zils, F. Pachet, O. Delerue, and F. Gouyon. Automatic Extraction of Drum Tracks from Polyphonic Music Signals. Proceedings of WedelMusic, Darmstadt, Germany, 2002.

### 2008

• (Sarawagi, 2008) ⇒ Sunita Sarawagi. (2008). “Information Extraction.” FnT Databases, 1(3).
• Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources. This enables much richer forms of queries on the abundant unstructured sources than possible with keyword searches alone. When structured and unstructured data co-exist, information extraction makes it possible to integrate the two types of sources and pose queries spanning them.