2001 RepSentStructInHidMarkovModelsForIE

From GM-RKB

Subject Headings: Relation Detection from Text Algorithm

Notes

  • Data: They extracted relations from Swiss-Prot (like we did with PPLRE). However, they did not manually review the generated entries to discard false positives. A random sample of the data suggested that between 10% and 15% of the cases are mislabeled, in both directions (True=>False and False=>True).

Cited By

  • ~114

Quotes

Abstract

We study the application of Hidden Markov Models (HMMs) to learning information extractors for n-ary relations from free text. We propose an approach to representing the grammatical structure of sentences in the states of the model. We also investigate using an objective function during HMM training which maximizes the ability of the learned models to identify the phrases of interest. We evaluate our methods by deriving extractors for two binary relations in biomedical domains. Our experiments indicate that our approach learns more accurate models than several baseline approaches.

Introduction

... the data we are processing include many sentences that are not relevant to the relations of interest. Even in relevant sentences, only certain phrases contain information to be extracted. Whereas previous approaches to applying HMMs for IE have focused the training process on maximizing the likelihood of the training sentences, we adopt a training method that is designed to maximize the probability of assigning the correct labels to various parts of the sentences being processed [Krogh, 1994]. Our approach involves coupling the algorithm devised by Krogh with the use of null models which are intended to represent data not directly relevant to the task at hand.
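The contrast between the two training objectives can be written out explicitly. As a hedged sketch (notation mine, not the paper's): let x⁽ⁱ⁾ be the token sequence of the i-th training sentence and y⁽ⁱ⁾ its label sequence. Standard Baum-Welch training maximizes the likelihood of the sentences alone, whereas Krogh's labeled-sequence objective maximizes the conditional probability of the correct labels:

```latex
% Standard maximum-likelihood HMM training:
\theta^{*} = \arg\max_{\theta} \sum_{i} \log P\!\left(x^{(i)} \mid \theta\right)

% Krogh's [1994] objective, as adopted here:
\theta^{*} = \arg\max_{\theta} \sum_{i} \log P\!\left(y^{(i)} \mid x^{(i)}, \theta\right)
           = \arg\max_{\theta} \sum_{i} \left[ \log P\!\left(x^{(i)}, y^{(i)} \mid \theta\right)
                                             - \log P\!\left(x^{(i)} \mid \theta\right) \right]
```

The second form makes the difference concrete: the conditional objective rewards parameters that make the correct labeling probable relative to all other labelings of the same sentence, rather than merely making the sentence itself probable.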

Problem Domain

Our work is focused on extracting instances of specific relations of interest from abstracts in the MEDLINE database [National Library of Medicine, 2001]. MEDLINE contains bibliographic information and abstracts from more than 4,000 biomedical journals.

An example of a binary relation that we consider in our experiments is the subcellular-localization relation, which represents the location of a particular protein within a cell. We refer to the domains of this relation as PROTEIN and LOCATION. We refer to an instance of a relation as a tuple. Figure 1 provides an illustration of our extraction task. The top of the figure shows two sentences in a MEDLINE abstract. The bottom of the figure shows the instance of the target relation subcellular-localization that we would like to extract from the second sentence. This tuple asserts that the protein UBC6 is found in the subcellular compartment called the endoplasmic reticulum. In order to learn models to perform this task, training examples consisting of passages of text, annotated with the tuples that should be extracted from them, are needed. In our ...

Figure 1: "An example of the information extraction task. The top shows part of a document from which we wish to extract instances of the subcellular-localization relation. The bottom shows the extracted tuple."

“... Here we report the identification of an integral membrane ubiquitin-conjugating enzyme. This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol...”

  • subcellular-localization(UBC6,endoplasmic reticulum)

Representing Phrase Structure

Hidden Markov Models (HMMs) are the stochastic analogs of finite state automata. An HMM is defined by a set of states and a set of transitions between them. Each state has an associated emission distribution, which defines how likely that state is to emit each token. The transitions from a given state have an associated transition distribution, which defines the likelihood of the next state given the current state.
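The two distributions can be made concrete with a minimal sketch. The states, tokens, and all probabilities below are illustrative placeholders, not values from the paper; the point is only that the joint probability of a state path and its emitted tokens is a product of alternating transition and emission terms:

```python
# Minimal HMM sketch: transition and emission distributions as nested dicts.
# State names and probabilities are invented for illustration.
transitions = {
    "START": {"NP": 0.7, "VP": 0.3},
    "NP":    {"VP": 0.6, "NP": 0.2, "END": 0.2},
    "VP":    {"NP": 0.5, "END": 0.5},
}
emissions = {
    "NP": {"protein": 0.5, "membrane": 0.5},
    "VP": {"localizes": 0.8, "binds": 0.2},
}

def path_probability(states, tokens):
    """Joint probability of a state path and the token sequence it emits."""
    prob = 1.0
    prev = "START"
    for state, token in zip(states, tokens):
        prob *= transitions[prev].get(state, 0.0)  # transition term
        prob *= emissions[state].get(token, 0.0)   # emission term
        prev = state
    return prob * transitions[prev].get("END", 0.0)

p = path_probability(["NP", "VP"], ["protein", "localizes"])
# 0.7 * 0.5 * 0.6 * 0.8 * 0.5 = 0.084
```

Summing this quantity over all possible state paths (via the forward algorithm) gives the sentence likelihood used in training.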

In previous HMM approaches to information extraction, sentences have been represented as sequences of tokens. We hypothesize that incorporating sentence structure into the models we build results in better extraction accuracy.

Our approach is based on using syntactic parses of all sentences we process. In particular, we use the Sundance system [Riloff, 1998] to obtain a shallow parse of each given sentence. Our representation does not incorporate all of the information provided by a Sundance parse, but instead “flattens” it into a sequence of phrase segments. Each phrase segment consists of a type describing the grammatical nature of the phrase, and the words that are part of the phrase.

In positive training examples, if a segment contains a word or words that belong to a domain in a target tuple, it is annotated with the corresponding domain. We refer to these annotations as labels. Labels are absent in the test instances. Figure 2a shows a sentence containing an instance of the subcellular-localization relation and its annotated segments (we shall discuss the other panels of this figure later). The second phrase segment in this example is a noun phrase segment (NP SEGMENT) that contains the protein name UBC6 (hence the PROTEIN label). Note that the types are constants that are pre-defined by our representation of Sundance parses, while the labels are defined with respect to the domains of the relation we are trying to extract. Also note that the parsing is not always accurate; for instance, the third segment in Figure 2a should really be a VP SEGMENT, but has been typed as an NP SEGMENT by Sundance.
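A flattened, labeled sentence in this representation is just a sequence of (type, words, label) segments. The sketch below encodes the UBC6 sentence this way; the exact segment boundaries and type names are my illustration of the scheme, not the paper's actual Figure 2a output:

```python
# Sketch of the flattened phrase-segment representation: each segment has a
# grammatical type (from the fixed parse vocabulary), its words, and an
# optional label drawn from the relation's domains. Boundaries illustrative.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PhraseSegment:
    seg_type: str                # e.g. "NP_SEGMENT", "VP_SEGMENT"
    words: List[str]
    label: Optional[str] = None  # "PROTEIN", "LOCATION", or None

sentence = [
    PhraseSegment("NP_SEGMENT", ["this", "enzyme"]),
    PhraseSegment("NP_SEGMENT", ["ubc6"], label="PROTEIN"),
    PhraseSegment("VP_SEGMENT", ["localizes"]),
    PhraseSegment("PP_SEGMENT", ["to"]),
    PhraseSegment("NP_SEGMENT", ["the", "endoplasmic", "reticulum"],
                  label="LOCATION"),
]

# Recovering the target tuple amounts to pairing the labeled segments:
protein = next(" ".join(s.words) for s in sentence if s.label == "PROTEIN")
location = next(" ".join(s.words) for s in sentence if s.label == "LOCATION")
```

At test time the label field is unknown, and it is exactly this field that the trained HMM must predict for each segment.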

The subcellular localization data set includes 545 sentences that represent positive instances (labeled with tuples that should be extracted from them) and 6,700 sentences that represent negative instances (not labeled with any tuples). The 545 positive instances are labeled with 645 tuples in all; there are 335 unique tuples.

The target tuples for the subcellular-localization relation were collected from the Yeast Protein Database (YPD) [Hodges et al., 1998],

Relevant MEDLINE abstracts were also gathered from entries in these databases. To label the sentences in these abstracts, we matched the target tuples to the words in the sentence. A sentence which contained words that matched a tuple was taken to be a positive instance. Every other sentence was considered to be a negative instance. It is clear that while this process is automatic, it will result in a noisy labeling. A sentence may contain the words of a target tuple even though its semantics do not refer to the relation. Conversely, an instance of the relation may be described by synonymous words that do not appear in any target tuple; sentences in which tuples occur only as synonyms are therefore labeled incorrectly as well. We used a random sample of 200 positive and 200 negative sentences to estimate the amount of noise introduced by the labeling process. We estimate with 95% confidence that approximately 10% to 15% of the sentences are labeled incorrectly (either falsely labeled or unlabeled when they should have been) in the subcellular-localization data set.
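The automatic labeling heuristic described above can be sketched as a simple containment test: a sentence is a positive instance if and only if every word of some target tuple appears in it. The matching details (case-folding, tokenization) are my assumptions; the paper does not spell them out:

```python
# Sketch of the distant labeling heuristic: match target tuples against
# sentence words. Matching details (lowercasing, whitespace tokenization)
# are assumptions, not the paper's exact procedure.
def label_sentence(sentence_words, target_tuples):
    """Return the target tuples whose words all occur in the sentence.

    A non-empty result marks the sentence as a positive instance."""
    words = {w.lower() for w in sentence_words}
    matched = []
    for tup in target_tuples:
        tuple_words = {w.lower() for field in tup for w in field.split()}
        if tuple_words <= words:
            matched.append(tup)
    return matched

tuples = [("UBC6", "endoplasmic reticulum")]
pos = label_sentence(
    "This enzyme , UBC6 , localizes to the endoplasmic reticulum".split(),
    tuples)
```

Both noise sources discussed above are visible here: the test fires on any sentence containing the tuple's words regardless of meaning (false positives), and it misses sentences that use synonyms absent from the tuple list (false negatives).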

Figure 3 is a schematic of the architecture of our phrase-based hidden Markov models. The top of the figure shows the positive model, which is trained to represent positive instances in the training set. The bottom of the figure shows the null model, which is trained to represent negative instances in the training set. Since our phrase representation includes 14 phrase types, both models have 14 states without labels, and the positive model also has five to six additional labeled states (one for each (type, label) combination that occurs in the training set). We assume a fully connected model, that is, the model may emit a segment of any type at any position within a sentence.
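The state inventory of the positive model follows directly from that description: one unlabeled state per phrase type, plus one labeled state per (type, label) pair observed in training. A hedged sketch (the placeholder phrase-type names and the example pairs are mine):

```python
# Sketch of the positive model's state inventory: 14 unlabeled states (one
# per phrase type) plus one labeled state per observed (type, label) pair.
# Phrase-type names are placeholders, not the Sundance inventory.
PHRASE_TYPES = [f"TYPE_{i}" for i in range(14)]  # paper: 14 phrase types

def build_states(observed_type_label_pairs):
    """Return the state set as (phrase_type, label-or-None) pairs."""
    states = [(t, None) for t in PHRASE_TYPES]        # unlabeled states
    states += sorted(set(observed_type_label_pairs))  # labeled states
    return states

# e.g. two labeled states seen in subcellular-localization training data:
seen = {("NP_SEGMENT", "PROTEIN"), ("NP_SEGMENT", "LOCATION")}
states = build_states(seen)
```

The null model would use only the 14 unlabeled states, and full connectivity means the transition distribution of every state assigns non-zero probability to every other state.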

Conclusion

We have presented two contributions to learning Hidden Markov Models for information extraction, and evaluated these contributions on two challenging biomedical domains. We have presented an approach to representing the grammatical structure of sentences in an HMM. Comparative experiments with other models lacking such information show that this approach learns extractors with increased precision and recall. We have also investigated the application of a training algorithm developed by Krogh to our models. This algorithm consistently provides an accuracy gain over our original models. We believe that these are promising approaches to the task of deriving information extractors for free text domains.

References

  • [Baum, 1972] L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.
  • [Center for Medical Genetics, 2001] Center for Medical Genetics, Johns Hopkins University and National Center for Biotechnology Information. Online Mendelian inheritance in man, OMIM (TM), (2001). http://www.ncbi.nlm.nih.gov/omim/.
  • [Cestnik, 1990] B. Cestnik. Estimating probabilities: A crucial task in machine learning. In: Proceedings of the Ninth European Conference on Artificial Intelligence, pages 147–150, Stockholm, Sweden, (1990). Pitman.
  • [Freitag and McCallum, 1999] Dayne Freitag and Andrew McCallum. Information extraction with HMMs and shrinkage. In Working Notes of the AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, FL, (1999). AAAI Press.
  • [Freitag and McCallum, 2000] Dayne Freitag and Andrew McCallum. Information extraction with HMM structures learned by stochastic optimization. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, TX, (2000). AAAI Press.
  • [Hodges et al., 1998] P. E. Hodges, W. E. Payne, and J. I. Garrels. Yeast protein database (YPD): A database for the complete proteome of Saccharomyces cerevisiae. Nucleic Acids Research, 26:68–72, 1998.
  • [Krogh, 1994] A. Krogh. Hidden Markov models for labeled sequences. In: Proceedings of the Twelfth International Conference on Pattern Recognition, pages 140–144, Jerusalem, Israel, (1994). IEEE Computer Society Press.
  • [Leek, 1997] T. Leek. Information extraction using hidden Markov models. Master’s thesis, Department of Computer Science and Engineering, University of California, San Diego, CA, 1997.
  • [McCallum et al., 2000] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of the Seventeenth International Conference on Machine Learning, pages 591–598, Stanford, CA, (2000). Morgan Kaufmann.
  • [National Library of Medicine, 2001] National Library of Medicine. The MEDLINE database, (2001). http://www.ncbi.nlm.nih.gov/PubMed/.
  • [Porter, 1980] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):127–130, 1980.
  • [Rabiner, 1989] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • [Riloff, 1996] Ellen Riloff. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85:101–134, 1996.
  • [Riloff, 1998] Ellen Riloff. The Sundance sentence analyzer, (1998). http://www.cs.utah.edu/projects/nlp/.
  • [Seymore et al., 1999] K. Seymore, Andrew McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In Working Notes of the AAAI Workshop on Machine Learning for Information Extraction, pages 37–42. AAAI Press, 1999.

Mark Craven, and Soumya Ray. (2001). "Representing Sentence Structure in Hidden Markov Models for Information Extraction." In: Proceedings of the IJCAI Conference. http://www.biostat.wisc.edu/~craven/papers/ijcai01-hmm.pdf