1998 AlgsThatLearnToExtrInfBBN


Subject Headings: Relation Recognition from Text Algorithm, Supervised Machine Learning Algorithm


Cited By



  • For MUC-7, BBN has for the first time fielded a fully-trained system for NE, TE, and TR; results are all the output of statistical language models trained on annotated data, rather than programs executing handwritten rules. Such trained systems have some significant advantages:

    • They can be easily ported to new domains by simply annotating data with semantic answers.
    • The complex interactions that make rule-based systems difficult to develop and maintain can here be learned automatically from the training data.

  • We believe that the results in this evaluation are evidence that such trained systems, even at their current level of development, can perform roughly on a par with rules hand-tailored by experts.

  • Since MUC-3, BBN has been steadily increasing the proportion of the information extraction process that is statistically trained. Already in MET-1, our name-finding results were the output of a fully statistical, HMM-based model, and that statistical Identifinder™ model was also used for the NE task in MUC-7. For the MUC-7 TE and TR tasks, BBN developed SIFT, a new model that represents a significant further step along this path, replacing PLUM, a system requiring handwritten patterns, with SIFT, a single integrated trained model.

Training Data

  • Our source for syntactically annotated training data was the Penn Treebank (Marcus et al., 1993).
  • To train SIFT for MUC-7, we annotated approximately 500,000 words of New York Times newswire text, covering the domains of air disasters and space technology. (We have not yet run experiments to see how performance varies with more/less training data.)

Statistical Model

  • In SIFT’s statistical model, augmented parse trees are generated according to a process similar to that described in Collins (1996, 1997).

Training the Model

  • Maximum likelihood estimates for all model probabilities are obtained by observing frequencies in the training corpus. However, because these estimates are too sparse to be relied upon, they must be smoothed by mixing in lower-dimensional estimates. We determine the mixture weights using the Witten-Bell smoothing method.
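
The mixing step described above can be illustrated with a minimal sketch of Witten-Bell interpolation. This is not SIFT's code: the function name `witten_bell`, the `(context, outcome)` event format, and the single-level `backoff` table are assumptions made for the example. The weight given to the maximum likelihood estimate is n / (n + t), where n is the number of training events seen in a context and t is the number of distinct outcomes seen in it.

```python
from collections import Counter, defaultdict

def witten_bell(events, backoff):
    """Build a smoothed estimator from (context, outcome) training events.

    backoff maps each outcome to a lower-order probability (assumed given).
    """
    context_count = Counter()      # n: events observed per context
    pair_count = Counter()         # events observed per (context, outcome)
    distinct = defaultdict(set)    # distinct outcomes observed per context
    for ctx, out in events:
        context_count[ctx] += 1
        pair_count[ctx, out] += 1
        distinct[ctx].add(out)

    def prob(out, ctx):
        n = context_count[ctx]
        if n == 0:
            # Unseen context: rely entirely on the lower-order estimate.
            return backoff.get(out, 0.0)
        lam = n / (n + len(distinct[ctx]))   # Witten-Bell mixture weight
        ml = pair_count[ctx, out] / n        # maximum likelihood estimate
        return lam * ml + (1 - lam) * backoff.get(out, 0.0)

    return prob
```

Contexts that were followed by many different outcomes in training receive a smaller weight on their own counts, so sparse and unpredictable contexts lean more heavily on the lower-order estimate.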

Searching the Model

  • Given a sentence to be analyzed, the search program must find the most likely semantic and syntactic interpretation. More concretely, it must find the most likely augmented parse tree. Although mathematically the model predicts tree elements in a top-down fashion, we search the space bottom-up using a chart based search. The search is kept tractable through a combination of CKY-style dynamic programming and pruning of low probability elements.
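
As a rough illustration of bottom-up chart search with pruning, the sketch below runs a Viterbi CKY pass over a toy binary-branching probabilistic grammar. It is a deliberate simplification: SIFT's actual model is lexicalized and generates augmented parse trees, whereas the `lexicon` and `rules` tables, the `beam` threshold, and the function name here are all assumptions made for the example.

```python
import math
from collections import defaultdict

def cky_best_parse(words, lexicon, rules, root="S", beam=1e-4):
    """Return (log_prob, tree) for the best parse rooted in `root`, or None.

    lexicon: {(tag, word): prob}; rules: {(parent, left, right): prob}.
    """
    n = len(words)
    chart = defaultdict(dict)  # (start, end) -> {label: (log_prob, tree)}
    for i, word in enumerate(words):
        for (tag, lex_word), p in lexicon.items():
            if lex_word == word:
                chart[i, i + 1][tag] = (math.log(p), (tag, word))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = chart[i, j]
            for k in range(i + 1, j):
                for (parent, left, right), p in rules.items():
                    if left in chart[i, k] and right in chart[k, j]:
                        lp_left, t_left = chart[i, k][left]
                        lp_right, t_right = chart[k, j][right]
                        score = math.log(p) + lp_left + lp_right
                        # Dynamic programming: keep only the best tree per label.
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, t_left, t_right))
            if cell:
                # Pruning: drop labels far less likely than the best in this cell.
                threshold = max(s for s, _ in cell.values()) + math.log(beam)
                for label in [l for l, (s, _) in cell.items() if s < threshold]:
                    del cell[label]
    return chart[0, n].get(root)
```

Keeping one best entry per label in each cell is what makes the search polynomial, and the beam trades a small risk of losing the best tree for a smaller chart.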

Cross-Sentence Model

  • The cross-sentence model uses structural and contextual clues to hypothesize template relations between two elements that are not mentioned within the same sentence. Since 80-90% of the relations found in the answer keys connect two elements that are mentioned in the same sentence, the cross-sentence model has a narrow target to shoot for. Very few of the pairs of entities seen in different sentences turn out to be actually related. This model uses features extracted from related pairs in training data to try to identify those cases.
  • It is a classifier model that considers all pairs of entities in a message whose types are compatible with a given relation; for example, a Person and an Organization would suggest a possible Employee_Of. For the three MUC-7 relations, it turned out to be somewhat advantageous to build in a functional constraint, so that the model would not consider, for example, a possible Employee_Of relation for a person already known from the sentence-level model to be employed elsewhere.
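
Candidate generation with the functional constraint might look like the following sketch. The entity and relation tuple formats, the `RELATION_ARG_TYPES` table, and the function name are assumptions for illustration; only the Person/Organization pairing for Employee_Of comes from the text above.

```python
# Hypothetical table of argument types per relation; only the
# Employee_Of entry is taken from the description above.
RELATION_ARG_TYPES = {"EMPLOYEE_OF": ("PERSON", "ORGANIZATION")}

def candidate_pairs(entities, sentence_level_relations, relation):
    """List type-compatible entity pairs for `relation` in one message.

    entities: list of (entity_id, entity_type)
    sentence_level_relations: list of (relation, first_id, second_id)
        already found by the sentence-level model.
    """
    first_type, second_type = RELATION_ARG_TYPES[relation]
    # Functional constraint: a first argument that already has this
    # relation from the sentence-level model is not considered again.
    already_assigned = {
        first for rel, first, _ in sentence_level_relations if rel == relation
    }
    return [
        (a_id, b_id)
        for a_id, a_type in entities
        if a_type == first_type and a_id not in already_assigned
        for b_id, b_type in entities
        if b_type == second_type
    ]
```

Each surviving pair would then be scored by the classifier using the features described in the following sections.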

Model Features

  • Two classes of features were used in this model: structural features that reflect properties of the text surrounding references to the entities involved in the suggested relation, and content features based on the actual entities and relations encountered in the training data.

Structural Features

  • The structural features exploit simple characteristics of the text surrounding references to the possibly-related entities. The most powerful structural feature, not surprisingly, was distance, reflecting the fact that related elements tend to be mentioned in close proximity, even when they are not mentioned in the same sentence. Given a pair of entity references in the text, the distance between them was quantized into one of three possible values: 0=Within the same sentence; 1=Neighboring sentences; 2=More remote than neighboring sentences.

  • For each pair of possibly-related elements, the distance feature value was defined as the minimum distance between some reference in the text to the first element and some reference to the second.
  • A second structural feature grew out of the intuition that entities mentioned in the first sentence of an article often play a special topical role throughout the article. The “Topic Sentence” feature was defined to be true if some reference to one of the two entities involved in the suggested relation occurred in the first sentence of the text-field body of the article.
  • Other structural features that were considered but not implemented included the count of the number of references to each entity.
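
The two implemented structural features can be sketched as follows, assuming each entity is represented by the list of 0-based sentence indices in which it is referenced, with index 0 standing for the first sentence of the article body. The function names and this representation are assumptions for the example.

```python
def quantize_distance(sentence_a, sentence_b):
    """0 = same sentence, 1 = neighboring sentences, 2 = more remote."""
    return min(abs(sentence_a - sentence_b), 2)

def distance_feature(mentions_a, mentions_b):
    """Minimum quantized distance over all pairs of references."""
    return min(
        quantize_distance(a, b) for a in mentions_a for b in mentions_b
    )

def topic_sentence_feature(mentions_a, mentions_b):
    """True if either entity is referenced in the first sentence."""
    return 0 in mentions_a or 0 in mentions_b
```

For an entity referenced in sentences 0 and 5 and another referenced in sentences 3 and 6, the closest pair of references is in sentences 5 and 6, so the distance feature takes the value 1 and the Topic Sentence feature is true.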

Content Features

  • While the structural features learn general facts about the patterns in which related references occur and the text that surrounds them, the content features learn about the actual names and descriptors of entities seen to be related in the training data. The three content features in current use test for a similar relationship in training by name or by descriptor or for a conflicting relationship in training by name.
  • The simplest content feature tests using names whether the entities in the proposed relationship have ever been seen before to be related. To test this feature, the model maintains a database of all the entities seen to be related in training, and of the names used to refer to them. The “by name” content feature is true if, for example, a person in some training message who shared at least one name string with the person in the proposed relationship was employed in that training message by an organization that shared at least one name string with the organization in the proposed relationship.
  • A somewhat weaker feature makes the same kind of test for a previously seen relationship using descriptor strings. This feature fires when an entity that shares a descriptor string with the first argument of the suggested relation was related in training to an entity that shares a name with the second argument. Since titles like “General” count as descriptor strings, one effect of this feature is to increase the likelihood of generals being employed by armies. Because the training data did not include all the reasonable combinations of titles and organizations, the training for this feature was seeded by adding a virtual message constructed from a list of such titles and organizations, so that any reasonable such pair would turn up in training.
  • The third content feature was a kind of inverse of the first “by name” feature: it was true if some entity sharing a name with the first argument of the proposed relation was related in training to an entity that did not share a name with the second argument. Using Employee_Of again as an example, it is less likely (though still possible) that a person who was known in another message to be employed by a different organization should be reported here as employed by the suggested one.
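
The three content features can be sketched against a small in-memory database of related pairs from training. The dictionary layout with `names` and `descriptors` keys, the function names, and the feature labels in the result are assumptions for illustration, not the representation used in SIFT.

```python
def shares(a, b):
    """True if two collections of strings have at least one in common."""
    return bool(set(a) & set(b))

def content_features(arg1, arg2, training_pairs):
    """Compute the three content features for a proposed relation.

    arg1, arg2: dicts with "names" and "descriptors" string lists.
    training_pairs: (first, second) dicts seen related in training.
    """
    by_name = by_descriptor = conflict = False
    for seen1, seen2 in training_pairs:
        name1 = shares(arg1["names"], seen1["names"])
        name2 = shares(arg2["names"], seen2["names"])
        desc1 = shares(arg1["descriptors"], seen1["descriptors"])
        by_name |= name1 and name2        # same pair seen related by name
        by_descriptor |= desc1 and name2  # descriptor match on the first argument
        conflict |= name1 and not name2   # first argument seen related elsewhere
    return {
        "by_name": by_name,
        "by_descriptor": by_descriptor,
        "conflict_by_name": conflict,
    }
```

Under this layout, seeding the descriptor feature with a virtual message would amount to appending extra title and organization pairs to `training_pairs` before the features are computed.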


  • Given enough fully annotated data, with both sentence-level semantic annotation and message-level answer keys recorded along with the connections between them, training the features would be quite straightforward. For each possibly-related pair of entities mentioned in a document, one would just count up the 2x2 table showing how many of them exhibited the given structural feature and how many of them were actually related. The training issues that did arise stemmed from the limited supply of answer keys and from the fact that the keys were not connected to the sentence-level annotations.
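
The counting described above can be sketched as follows for a single binary feature. The `(fired, related)` example format and the function names are assumptions, and the relative frequency computed at the end is only one simple way such counts could be turned into an estimate; the excerpt does not say how the counts were actually used.

```python
from collections import Counter

def count_table(examples):
    """Build the 2x2 table from (feature_fired, actually_related) pairs."""
    table = Counter()
    for fired, related in examples:
        table[bool(fired), bool(related)] += 1
    return table

def related_rate(table, fired):
    """Fraction of candidate pairs with this feature value that were related."""
    total = table[fired, True] + table[fired, False]
    return table[fired, True] / total if total else 0.0
```

With five candidate pairs of which three exhibit the feature and two of those three are related, `related_rate(table, True)` is 2/3.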


  • Bikel, Dan; S. Miller; R. Schwartz; and R. Weischedel. (1997) “NYMBLE: A High-Performance Learning Name-finder.” In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Association for Computational Linguistics, pp. 194-201.
  • Collins, Michael. (1996) “A New Statistical Parser Based on Bigram Lexical Dependencies.” In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 184-191.
  • Collins, Michael. (1997) “Three Generative, Lexicalised Models for Statistical Parsing.” In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pp. 16-23.
  • Marcus, M.; B. Santorini; and M. Marcinkiewicz. (1993) “Building a Large Annotated Corpus of English: the Penn Treebank.” Computational Linguistics, 19(2):313-330.
  • Goodman, Joshua. (1997) “Global Thresholding and Multiple-Pass Parsing.” In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 11-25.
  • Weischedel, Ralph; Marie Meteer; Richard Schwartz; Lance Ramshaw; and Jeff Palmucci. (1993) “Coping with Ambiguity and Unknown Words through Probabilistic Models.” Computational Linguistics, 19(2):359-382.

1998 AlgsThatLearnToExtrInfBBN

  • Author: Scott Miller; Michael Crystal; Heidi Fox; Lance A. Ramshaw; Richard Schwartz; Rebecca Stone; Ralph Weischedel; the Annotation Group
  • Title: Algorithms That Learn to Extract Information BBN: Description of the SIFT system as used for MUC-7
  • Journal: Proceedings of MUC-7
  • Url: http://www-nlpir.nist.gov/related projects/muc/proceedings/muc 7 proceedings/bbn muc7.pdf
  • Year: 1998