- (McCallum, 2005) ⇒ Andrew McCallum. (2005). “Information Extraction: Distilling Structured Data from Unstructured Text.” In: [[journal::ACM Queue]], 3(9).
- It defines Information Extraction as: the process of filling the fields and records of a database from unstructured or loosely formatted text.
- It divides IE into five tasks:
- In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. This was a major data-integration project, aiming to automatically gather detailed, structured information from tens of thousands of individual institutions every three months.
- Information extraction aims to do just this — it is the process of filling the fields and records of a database from unstructured or loosely formatted text. Thus (as shown in figure 1), it can be seen as a precursor to data mining: Information extraction populates a database from unstructured or loosely structured text; data mining then discovers patterns in that database. Information extraction involves five major subtasks (which are also illustrated in figure 2):
- Segmentation finds the starting and ending boundaries of the text snippets that will fill a database field. For example, in the U.S. Department of Labor’s continuing education extraction problem, the course title must be extracted, and segmentation must find the first and last words of the title, being careful not to include extra words (“Intro to Linguistics is taught”) or to chop off too many words (“Intro to”).
- Classification determines which database field is the correct destination for each text segment. For example, “Introduction to Bookkeeping” belongs in the course title field, “Dr. Dallan Quass” in the course instructor field, and “This course covers...” in the course description field. Often segmentation and classification are performed at the same time (using a finite-state machine, as described in a later section).
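The joint segmentation-and-classification step described above can be illustrated with a toy sketch. This is not the finite-state-machine approach the article refers to; it is a minimal heuristic router over an assumed line-per-field course listing, with invented field names (`title`, `instructor`, `description`) following the article's example:

```python
# Hypothetical course listing; one field per line is an assumption
# made for illustration, not part of the article.
listing = """Introduction to Bookkeeping
Dr. Dallan Quass
This course covers the basics of double-entry accounting."""

def segment_and_classify(listing):
    """Toy joint segmentation/classification: segmentation is trivial
    (line boundaries), and each segment is routed to a database field
    with simple keyword heuristics. Real systems use sequence models
    such as the finite-state machines the article describes."""
    record = {}
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith(("Dr.", "Prof.")):
            record["instructor"] = line
        elif line.lower().startswith("this course"):
            record["description"] = line
        else:
            record.setdefault("title", line)
    return record

print(segment_and_classify(listing))
```

The heuristics are deliberately brittle; the point is only that segmentation (finding boundaries) and classification (assigning a field) can happen in a single pass over the text.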
- Association determines which fields belong together in the same record. For example, some courses may be described by multiple paragraphs of text, and other courses by just one; extraction must determine which field values from which paragraphs are referring to the same course. In the course extraction example, association is a fairly coarse-grained operation, but, if you are extracting records about trade negotiation meetings from news articles, then determining which governmental minister met with which other representative to talk about trade between which two countries can involve fairly subtle linguistic cues about relations and associations. This step is sometimes referred to as relation extraction for the case in which two entities are being associated. Commercial products that do relation extraction are rarer than those that do only segmentation and classification.
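For the coarse-grained course case, association can be sketched as grouping classified segments into records. The rule assumed here (fields between two titles belong to the preceding title's course) is an illustration, not the article's method, and the segment data is invented:

```python
# Classified segments in document order; values are hypothetical.
segments = [
    ("title", "Intro to Linguistics"),
    ("instructor", "Dr. A. Smith"),
    ("description", "Covers phonology and syntax."),
    ("title", "Introduction to Bookkeeping"),
    ("instructor", "Dr. Dallan Quass"),
]

def associate(segments):
    """Toy association: start a new record at each title and attach
    subsequent fields to it until the next title appears."""
    records, current = [], None
    for field, value in segments:
        if field == "title":
            current = {"title": value}
            records.append(current)
        elif current is not None:
            current[field] = value
    return records

print(associate(segments))
```

The news-article case the text mentions (which minister met which representative) needs linguistic relation extraction rather than this positional grouping.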
- Normalization puts information in a standard format in which it can be reliably compared. For example, the times for one course may be given as “2-3pm”, another as “3pm-4:30pm”, and another as “1500-1630”, but we would like a search to be able to detect any overlap. Obviously, simple string comparisons will not do the job here; the data should be converted to a standard (likely numeric) representation. Normalization is relevant to string values also; for example, given the name “Wei Li” and “Li, Wei,” a standard ordering of first and last names should be chosen. Issues of normalization may often be intertwined with deduplication, the last subtask, described next.
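The time-normalization example above can be made concrete: convert each of the formats “2-3pm”, “3pm-4:30pm”, and “1500-1630” to minutes past midnight, after which overlap detection is a numeric comparison. This sketch covers only these formats and assumes (as the “2-3pm” example implies) that a bare start hour inherits the end's am/pm:

```python
import re

def to_minutes(token, pm_hint=False):
    """Convert '3pm', '4:30pm', or military '1500' to minutes past
    midnight. pm_hint resolves bare hours like the '2' in '2-3pm'."""
    token = token.strip().lower()
    if re.fullmatch(r"\d{4}", token):          # military time, e.g. "1500"
        return int(token[:2]) * 60 + int(token[2:])
    m = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?(am|pm)?", token)
    if not m:
        raise ValueError(token)
    hour, minute = int(m.group(1)) % 12, int(m.group(2) or 0)
    meridiem = m.group(3) or ("pm" if pm_hint else "am")
    return (hour + (12 if meridiem == "pm" else 0)) * 60 + minute

def parse_range(span):
    """Parse 'start-end'; the end's meridiem propagates to the start."""
    start, end = span.split("-")
    return to_minutes(start, pm_hint="pm" in end.lower()), to_minutes(end)

def overlaps(a, b):
    (s1, e1), (s2, e2) = parse_range(a), parse_range(b)
    return s1 < e2 and s2 < e1

print(overlaps("3pm-4:30pm", "1500-1630"))
```

Once normalized, “3pm-4:30pm” and “1500-1630” both become the interval (900, 990), so the overlap the article asks for falls out of a simple interval test rather than string comparison.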
- Deduplication collapses redundant information so you don’t get duplicate records in your database. For example, a course may be cross-listed in more than one department, and thus appear on more than one Web page; it will then be extracted multiple times, but we want only one record for it in our database. In news articles this may also involve determining that “Condoleezza Rice,” “the U.S. Secretary of State,” and “Rice” are all referring to the same person, but that “Secretary of State Powell” and “Rice, Wheat, and Beans” are referring to something else. Usually, commercial products for deduplication are offered separately from segmentation, classification, and association, although later I will argue that they should be integrated. It is somewhat of a joke in the community that this process of collapsing alternative names itself has so many different names. In the database community it is known as record linkage or record deduplication; in natural language processing it is known as co-reference or anaphora resolution; elsewhere it is known as identity uncertainty or object correspondence. In these different contexts the problem has different subtleties, but fundamentally they are all the same problem.
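A minimal sketch of the name side of this, tying normalization to deduplication: canonicalize “Last, First” orderings (the article's “Wei Li” / “Li, Wei” example), then collapse records whose canonical keys collide. Exact-match dedup on a normalized key is an assumption for illustration; real record linkage uses fuzzier similarity measures:

```python
def canonical_name(name):
    """Normalize 'Last, First' to lowercase 'first last' so that
    'Li, Wei' and 'Wei Li' compare equal (a toy convention)."""
    if "," in name:
        last, first = (p.strip() for p in name.split(",", 1))
        name = f"{first} {last}"
    return " ".join(name.lower().split())

def deduplicate(records, key="instructor"):
    """Keep the first record for each canonicalized key value."""
    seen, unique = set(), []
    for rec in records:
        k = canonical_name(rec[key])
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

courses = [{"instructor": "Li, Wei"}, {"instructor": "Wei Li"}]
print(deduplicate(courses))
```

The harder cases in the text ("Condoleezza Rice" vs. "the U.S. Secretary of State") require co-reference resolution in context, which no string normalization can capture.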