Subject Headings: Information Extraction from Text Algorithm, AutoSlog, Message Understanding Conference, Dictionary Construction Task.
- It uses heuristic Syntactic Patterns to associate an Entity with a property.
- e.g. … <victim> was murdered.
- e.g. … <perpetrator> bombed ...
- e.g. … <victim> was victim ...
- It proposes the manual removal of 'bad patterns'.
- Knowledge-based natural language processing systems have achieved good success with certain tasks but they are often criticized because they depend on a domain-specific dictionary that requires a great deal of manual knowledge engineering. This knowledge engineering bottleneck makes knowledge-based NLP systems impractical for real-world applications because they cannot be easily scaled up or ported to new domains. In response to this problem, we developed a system called AutoSlog that automatically builds a domain-specific dictionary of concepts for extracting information from text. Using AutoSlog, we constructed a dictionary for the domain of terrorist event descriptions in only 5 person-hours. We then compared the AutoSlog dictionary with a hand-crafted dictionary that was built by two highly skilled graduate students and required approximately 1500 person-hours of effort. We evaluated the two dictionaries using two blind test sets of 100 texts each. Overall, the AutoSlog dictionary achieved 98% of the performance of the hand-crafted dictionary. On the first test set, the AutoSlog dictionary obtained 96.3% of the performance of the hand-crafted dictionary. On the second test set, the overall scores were virtually indistinguishable with the AutoSlog dictionary achieving 99.7% of the performance of the handcrafted dictionary.
Automated Dictionary Construction
- Given a set of training texts and their associated answer keys, AutoSlog proposes a set of concept node definitions that are capable of extracting the information in the answer keys from the texts. Since the concept node definitions are general in nature, we expect that many of them will be useful for extracting information from novel texts as well.
- A set of heuristics are applied to the clause to suggest a good conceptual anchor point for a concept node definition. If none of the heuristics is satisfied then AutoSlog searches for the next sentence in the text that contains the targeted information and the process is repeated. The conceptual anchor point heuristics are the most important part of AutoSlog. A conceptual anchor point is a word that should activate a concept.
- The current version of AutoSlog contains 13 heuristics, each designed to recognize a specific linguistic pattern. These patterns are shown below, along with examples that illustrate how they might be found in a text. The bracketed item shows the syntactic constituent where the string was found which is used for the slot expectation (<dobj> is the direct object and <np> is the noun phrase following a preposition). In the examples on the right, the bracketed item is a slot name that might be associated with the filler (e.g., the subject is a victim). The underlined word is the conceptual anchor point that is used as the triggering word.
|Linguistic Pattern ||Example
|[subject] passive-verb ||[victim] was murdered
|[subject] active-verb ||[perpetrator] bombed
|[subject] verb infinitive ||[perpetrator] attempted to kill
|[subject] auxiliary noun ||[victim] was victim
|passive-verb [dobj] ||killed [victim]
|active-verb [dobj] ||bombed [target]
|infinitive [dobj] ||to kill [victim]
|verb infinitive [dobj] ||threatened to attack [target]
|gerund [dobj] ||killing [victim]
|noun auxiliary [dobj] ||fatality was [victim]
|noun prep [np] ||bomb against [target]
|active-verb prep [np] ||killed with [instrument]
|passive-verb prep [np] ||was aimed at [target]
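The pattern heuristics in the table above can be sketched in code. The following is a minimal illustration, not the paper's implementation: the `Clause` and `ConceptNode` classes, their field names, and the `propose_concept_node` function are all assumptions made for this example, and it presumes a clause has already been parsed into subject, verb, direct object, and prepositional phrase. It covers only the six verb-anchored patterns in which the targeted string is the subject, the direct object, or a prepositional noun phrase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    """Hypothetical pre-parsed clause (a real system would get this from a sentence analyzer)."""
    subject: Optional[str]
    verb: str
    voice: str  # "active" or "passive"
    dobj: Optional[str] = None
    prep: Optional[str] = None
    np: Optional[str] = None  # noun phrase following the preposition


@dataclass(frozen=True)
class ConceptNode:
    """Hypothetical concept node definition proposed from one training instance."""
    trigger: str      # conceptual anchor point (triggering word)
    pattern: str      # linguistic pattern that matched
    constituent: str  # syntactic constituent expected to hold the filler
    slot: str         # slot name taken from the answer key


def propose_concept_node(clause: Clause, target: str, slot: str) -> Optional[ConceptNode]:
    """Return a concept node if the targeted string sits in a known constituent."""
    if clause.subject and target in clause.subject:
        return ConceptNode(clause.verb, f"[subject] {clause.voice}-verb", "subject", slot)
    if clause.dobj and target in clause.dobj:
        return ConceptNode(clause.verb, f"{clause.voice}-verb [dobj]", "dobj", slot)
    if clause.prep and clause.np and target in clause.np:
        return ConceptNode(clause.verb, f"{clause.voice}-verb prep [np]", "np", slot)
    # No heuristic fired; the caller would move on to the next sentence.
    return None


clause = Clause(subject="the mayor", verb="murdered", voice="passive")
node = propose_concept_node(clause, "the mayor", "victim")
```

Here the answer-key string "the mayor" is found in the subject of a passive clause, so the sketch proposes a node triggered by "murdered" whose subject fills the victim slot. Returning `None` mirrors the case where no heuristic is satisfied.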
The MUC-4 Task and Corpus
- In 1992, the natural language processing group at the University of Massachusetts participated in the Fourth Message Understanding Conference (MUC-4). MUC-4 was a competitive performance evaluation sponsored by DARPA to evaluate the state-of-the-art in text analysis systems. Seventeen sites from both industry and academia participated in MUC-4. The task was to extract information about terrorist events in Latin America from newswire articles. Given a text, each system was required to fill out a template for each terrorist event described in the text. If the text described multiple terrorist events, then one template had to be completed for each event. If the text did not mention any terrorist events, then no templates needed to be filled out.
- A template is essentially a large case frame with a set of pre-defined slots for each piece of information that should be extracted from the text. For example, the MUC-4 templates contained slots for perpetrators, human targets, physical targets, etc. A training corpus of 1500 texts and instantiated templates (answer keys) for each text were made available to the participants for development purposes. The texts were selected by keyword search from a database of newswire articles. Although each text contained a keyword associated with terrorism, only about half of the texts contained a specific reference to a relevant terrorist incident.
- The word “kidnapped” specifies the roles of the people in the kidnapping and is therefore the most appropriate word to trigger a concept node.
- AutoSlog relies on a small set of heuristics to determine which words and phrases are likely to activate useful concept nodes. In the next section, we will describe these heuristics and explain how AutoSlog generates complete concept node definitions.
- Since AutoSlog creates dictionary entries from scratch, our approach is related to one-shot learning. For example, explanation-based learning (EBL) systems [DeJong and Mooney, 1986; Mitchell et al., 1986] create complete concept representations in response to a single training instance. This is in contrast to learning techniques that incrementally build a concept representation in response to multiple training instances (e.g., [Cardie, 1992; Fisher, 1987; Utgoff, 1988]). However, explanation-based learning systems require an explicit domain theory which may not be available or practical to obtain. AutoSlog does not need any such domain theory, although it does require a few simple domain specifications to build domain-dependent concept nodes.
- On the other hand, AutoSlog is critically dependent on a training corpus of texts and targeted information.
- Our knowledge engineering demands can be met by anyone familiar with the domain. Knowledge-based NLP systems will be practical for real-world applications only when their domain-dependent dictionaries can be constructed automatically.
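The MUC-4 template described above can be pictured as a simple record with one list per slot. This is only a rough sketch: the `Template` class, the `fill_templates` function, and the event representation (a list of slot/filler pairs) are assumptions for illustration, and only the three slots named in the text are modeled; the actual MUC-4 templates contained more.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Template:
    """Illustrative stand-in for a MUC-4 template (real templates had more slots)."""
    perpetrators: List[str] = field(default_factory=list)
    human_targets: List[str] = field(default_factory=list)
    physical_targets: List[str] = field(default_factory=list)


# An event is modeled here as a list of (slot name, extracted filler) pairs.
Event = List[Tuple[str, str]]


def fill_templates(events: List[Event]) -> List[Template]:
    """Build one template per event; a text with no events yields no templates."""
    templates = []
    for event in events:
        template = Template()
        for slot, filler in event:
            getattr(template, slot).append(filler)
        templates.append(template)
    return templates


kidnapping = [("perpetrators", "armed men"), ("human_targets", "the mayor")]
templates = fill_templates([kidnapping])
```

A text describing two incidents would pass two events and get two templates back, and a text with no relevant incident would pass an empty list and get none.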
- Carbonell, J. G. (1979). Towards a Self-Extending Parser. In: Proceedings of the 17th Meeting of the Association for Computational Linguistics. 3–7.
- Cardie, C. (1992). Learning to Disambiguate Relative Pronouns. In: Proceedings of the Tenth National Conference on Artificial Intelligence. 38–43.
- DeJong, G. and Mooney, R. (1986). Explanation-based Learning: An Alternative View. Machine Learning 1:145–176.
- Fisher, D. H. (1987). Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning 2:139–172.
- Francis, W. and Kucera, H. (1982). Frequency Analysis of English Usage. Houghton Mifflin, Boston, MA.
- Granger, R. H. (1977). FOUL-UP: A Program that Figures Out Meanings of Words from Context. In: Proceedings of the Fifth International Joint Conference on Artificial Intelligence. 172–178.
- Jacobs, P. and Zernik, U. (1988). Acquiring Lexical Knowledge from Text: A Case Study. In: Proceedings of the Seventh National Conference on Artificial Intelligence. 739–744.
- Lehnert, W. (1990). Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In: Barnden, J. and Pollack, J., editors, Advances in Connectionist and Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. 135–164.
- Lehnert, W.; Cardie, C.; Fisher, D.; McCarthy, J.; Riloff, E.; and Soderland, S. (1992a). University of Massachusetts: Description of the CIRCUS System as Used for MUC-4. In: Proceedings of the Fourth Message Understanding Conference (MUC-4). 282–288.
- Lehnert, W.; Cardie, C.; Fisher, D.; McCarthy, J.; Riloff, E.; and Soderland, S. (1992b). University of Massachusetts: MUC-4 Test Results and Analysis. In: Proceedings of the Fourth Message Understanding Conference (MUC-4). 151–158.
- Lehnert, W. G. and Sundheim, B. (1991). A Performance Evaluation of Text Analysis Technologies. AI Magazine 12(3):81–94.
- Marcus, M.; Santorini, B.; and Marcinkiewicz, M. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics. Forthcoming.
- Mitchell, T. M.; Keller, R.; and Kedar-Cabelli, S. (1986). Explanation-based Generalization: A Unifying View. Machine Learning 1:47–80.
- Proceedings of the Fourth Message Understanding Conference (MUC-4). (1992). Morgan Kaufmann, San Mateo, CA.
- Riloff, E. and Lehnert, W. (1993). Automated Dictionary Construction for Information Extraction from Text. In: Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications. IEEE Computer Society Press. 93–99.
- Utgoff, P. (1988). ID5: An Incremental ID3. In: Proceedings of the Fifth International Conference on Machine Learning. 107–120.