2007 SimulIdentOfBiomedNamedEntsAndFuncRels

Subject Headings: Relation Recognition from Text Algorithm

Notes

Cited By

  • ~7

Quotes

Abstract

In this paper we propose a statistical parsing technique that simultaneously identifies biomedical named-entities (NEs) and extracts subcellular localization relations for bacterial proteins from the text in MEDLINE articles. We build a parser that derives both syntactic and domain-dependent semantic information and achieves an F-score of 48.4% for the relation extraction task. We then propose a semi-supervised approach that incorporates noisy automatically labeled data to improve the F-score of our parser to 83.2%. Our key contributions are: learning from noisy data, and building an annotated corpus that can benefit relation extraction research.

2. Related work in SCL Extraction

There has been some recent research on SCL extraction from text. …

(Nair and Rost, 2002) used the text taken from Swiss-Prot annotations of proteins to represent these proteins, and trained a subcellular classifier using this representation. They focused on a few specific subcellular localizations and reported results that are comparable to the state-of-the-art at that time. Their work was elaborated upon by Eskin and Agichtein, who added subsequences from the protein’s amino acid sequence as part of the terms considered in the text representation as described in (Eskin and Agichtein, 2004). Due to the nature of this problem, the system was not tested against existing systems or data sets.

(Stapley et al., 2002) represented yeast proteins as vectors of weighted terms from all the PubMed articles mentioning their respective genes. A support vector machine (SVM) was then trained on protein sequences and text vectors to distinguish among subcellular localizations, but the reported results did not suggest an improvement over earlier systems. Moreover, the combination of protein sequence and text data did not significantly outperform the text-based classifier alone.

(Lu and Hunter, 2005) explored the relationship between GO (Gene Ontology) function annotations and localization information, identifying both highly predictive single terms and terms with large information gain with respect to location. A hierarchical architecture of SVMs was applied to predict subcellular localization by incorporating a semantic hierarchy of localization classes modeled with biological processing pathways in (Nair and Rost, 2005).

(Hoglund et al., 2006) predicted subcellular localizations from both text and protein sequence data. They first applied SVMs to make predictions from protein sequence data, and then they weighted the terms from the text that co-occur with the localization name for each organism and assigned each protein name a vector based on these co-occurrence terms. Finally an SVM was trained on all such protein vectors generated from the sequence data and text.

3. Statistical Syntactic and Semantic Parser

Similar to the approach in (Miller et al., 2000) and (Kulick et al., 2004), our parser integrates both syntactic and semantic annotations into a single annotation as shown in Figure 2. A lexicalized statistical parser (Bikel, 2004) is applied to the parsing task. The parse tree is decorated with two types of semantic annotations:

  1. Annotations on relevant PROTEIN, BACTERIUM and LOCATION NEs. Tags are PROTEIN_R, BACTERIUM_R and LOCATION_R respectively.
  2. Annotations on paths between relevant NEs. The lower-most node that spans both NEs is tagged as LNK and all nodes along the path to the NEs are tagged as PTR.
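
The path annotation in item 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `sem` field, the `decorate` function and the toy sentence are all hypothetical, and the sketch assumes the two NE nodes are disjoint subtrees of the same parse tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                   # syntactic label or word
    children: List["Node"] = field(default_factory=list)
    sem: Optional[str] = None                    # semantic tag (hypothetical field)


def path_to(root: Node, target: Node) -> Optional[List[Node]]:
    """Return the nodes from root down to target (identity match), or None."""
    if root is target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub is not None:
            return [root] + sub
    return None


def decorate(root: Node, ne_a: Node, ne_b: Node) -> Node:
    """Tag the lowest node spanning both NEs as LNK and the nodes
    strictly between it and each NE as PTR. Returns the LNK node."""
    pa, pb = path_to(root, ne_a), path_to(root, ne_b)
    if pa is None or pb is None:
        raise ValueError("both NE nodes must be in the tree")
    i = 0
    while i < min(len(pa), len(pb)) and pa[i] is pb[i]:
        i += 1                                   # walk down the shared prefix
    lnk = pa[i - 1]                              # lowest common ancestor
    lnk.sem = "LNK"
    for node in pa[i:-1] + pb[i:-1]:             # exclude the NE nodes themselves
        node.sem = "PTR"
    return lnk


# Toy parse of a made-up sentence: "protein_1 resides in location_1"
prot = Node("NP", [Node("protein_1")], sem="PROTEIN_R")
loc = Node("NP", [Node("location_1")], sem="LOCATION_R")
pp = Node("PP", [Node("IN", [Node("in")]), loc])
vp = Node("VP", [Node("VBZ", [Node("resides")]), pp])
tree = Node("S", [prot, vp])
lnk = decorate(tree, prot, loc)
```

In this toy tree the S node is the lowest node spanning both NEs, so it receives LNK, while the VP and PP nodes on the way down to the LOCATION_R node receive PTR. How the paper combines these tags with the syntactic labels in a single node label is not shown in this excerpt, so the sketch keeps them in a separate field.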

Binary relations are apparently much easier to represent on the parse tree, therefore we split the BPL (BACTERIUM, PROTEIN and LOCATION) ternary relation into two binary relations: BP (BACTERIUM and PROTEIN) and PL (PROTEIN and LOCATION). After capturing BP and PL relations, we will predict BPL as a fusion of BP and PL (see §4.1). In contrast to the global inference done using our generative model, heavily pipelined discriminative approaches usually have problems with error propagation. A more serious problem in a pipelined system when using syntactic parses for relation extraction is the alignment between the named entities produced by a separate system and the syntactic parses produced by the statistical parser. This alignment issue is non-trivial and we could not produce a pipelined system that dealt with this issue satisfactorily for our dataset. As a result, we did not directly compare our generative approach to a pipelined strategy.
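
The fusion of BP and PL into BPL is detailed in §4.1 of the paper, which is not part of this excerpt. One plausible reading, shown here only as a sketch under that assumption, is a join on the shared PROTEIN: a BP pair and a PL pair that mention the same protein yield one BPL triple. The function name `fuse_bp_pl` and the placeholder entity names are hypothetical.

```python
from typing import Dict, Set, Tuple

BP = Tuple[str, str]        # (bacterium, protein)
PL = Tuple[str, str]        # (protein, location)
BPL = Tuple[str, str, str]  # (bacterium, protein, location)


def fuse_bp_pl(bp: Set[BP], pl: Set[PL]) -> Set[BPL]:
    """Join BP and PL pairs on the shared protein (assumed fusion rule)."""
    locations_by_protein: Dict[str, Set[str]] = {}
    for protein, location in pl:
        locations_by_protein.setdefault(protein, set()).add(location)
    return {
        (bacterium, protein, location)
        for bacterium, protein in bp
        for location in locations_by_protein.get(protein, ())
    }


# Placeholder relations, as if extracted from one sentence
bp_pairs = {("bacterium_1", "protein_1")}
pl_pairs = {("protein_1", "location_1"), ("protein_2", "location_1")}
bpl_triples = fuse_bp_pl(bp_pairs, pl_pairs)
```

Here only `protein_1` appears in both relation sets, so the single fused triple pairs `bacterium_1` with `location_1` through it; the PL pair for `protein_2` has no matching BP pair and produces nothing.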

References

  • D. Bikel. (2004). A distributional analysis of a lexicalized statistical parsing model. In: Proceedings of EMNLP ’04, pages 182–189.
  • Eugene Charniak and M. Johnson. (2005). Coarse-to-fine n-best parsing and maxent discriminative reranking. In: Proceedings of ACL ’05, pages 173–180.
  • A. Hoglund, T. Blum, S. Brady, P. Donnes, J. Miguel, M. Rocheford, O. Kohlbacher, and H. Shatkay. (2006). Significantly improved prediction of subcellular localization by integrating text and protein sequence data. In: Proceedings of PSB ’06, volume 11, pages 16–27.
  • S. Kulick, A. Bies, M. Liberman, M. Mandel, R. McDonald, M. Palmer, A. Schein, and L. Ungar. (2004). Integrated annotation for biomedical information extraction. In: Proceedings of HLT/NAACL ’04, pages 61–68, Boston, May.
  • Y. Liu, Z. Shi, and Anoop Sarkar. (2007). Exploiting rich syntactic information for relation extraction from biomedical articles. In: Proceedings of NAACL-HLT ’07, poster track, Rochester, NY, April.
  • Z. Lu and L. Hunter. (2005). GO molecular function terms are predictive of subcellular localization. In: Proceedings of PSB ’05, volume 10, pages 151–161.
  • S. Miller, H. Fox, L. Ramshaw, and R. Weischedel. (2000). A novel use of statistical parsing to extract information from text. In: Proceedings of NAACL ’00, pages 226–233.
  • R. Nair and B. Rost. (2002). Inferring subcellular localization through automated lexical
  • S. Rey, M. Acab, J. L. Gardy, M. R. Laird, K. deFays, C. Lambert, and F. S. L. Brinkman. (2005). PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Res. 2005 January 1.

2007 SimulIdentOfBiomedNamedEntsAndFuncRels

  • Author: Fred Popowich, Zhongmin Shi, Anoop Sarkar
  • Title: Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques
  • Journal: Proceedings of NAACL/HLT Conference
  • Title URL: http://www.cs.sfu.ca/~anoop/papers/pdf/bio-scl-short.pdf
  • Year: 2007