2008 IntroducingMetaServicesForBioMedExtraction

(Leitner et al., 2008) ⇒ Florian Leitner, Martin Krallinger, Carlos Rodriguez-Penagos, Jörg Hakenberg, Conrad Plake, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hsi-Chuan Hung, William W Lau, Calvin A Johnson, Rune Sætre, Kazuhiro Yoshida, Yan Hua Chen, Sun Kim, Soo-Yong Shin, Byoung-Tak Zhang, William A. Baumgartner Jr, Lawrence Hunter, Barry Haddow, Michael Matthews, Xinglong Wang, Patrick Ruch, Frédéric Ehrler, Arzucan Özgür, Güneş Erkan, Dragomir Radev, Michael Krauthammer, ThaiBinh Luong, Robert Hoffmann, Chris Sander, Alfonso Valencia. (2008). “Introducing Meta-Services for Biomedical Information Extraction.” In: Genome Biology, 9(Suppl 2):S6. doi:10.1186/gb-2008-9-s2-s6

Subject Headings: Gene Mention Recognition System, Gene Mention Normalization System, PubMed, UniProt, BCMS, BioCreative Meta-Services, Entity Mention Normalization Task.

Notes

http://bcms.bioinfo.cnio.es/

Cited By

~40 http://scholar.google.com/scholar?q=%22Introducing+Meta-Services+for+Biomedical+Information+Extraction.%22+2008

Quotes

Abstract

We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; http://bcms.bioinfo.cnio.es/). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The annotations are distributed by the meta-server in both human and machine readable formats (HTML/XML). This service is intended to be used by biomedical researchers and database annotators, and in biomedical language processing. The platform allows direct comparison, unified access, and result aggregation of the annotations.

Background

Information retrieval (IR), information extraction (IE), and text mining have become integral parts of computational biology over the past decade [1]. However, these services are dispersed, integrated in specific packages, and include proprietary software. Therefore, progress in the field requires offering better access to the tools, methods, and their results [2]. Other areas, such as sequence analysis, genome analysis, or protein structure prediction, have benefited greatly from enhanced access to services and tools for the community of biologists, bioinformaticians (through web servers and portals), and developers (by providing free, open source academic software) [3].

Web services, widely used throughout the internet to provide the functionality for distributed systems, are becoming a common part of bioinformatics tools; for example, one of the most used text mining applications, namely iHOP (Information Hyperlinked Over Proteins), provides such an infrastructure to access its data [4]. Meta-services, too, are a ubiquitous component of the world wide web, found as meta-search engines, in business-to-business and business-to-consumer transactions (for example, for flight booking systems), and are used in scientific research (for example, for protein structure prediction) [5]. Another example of a distributed meta-service is BioDAS (Distributed Annotation System), a platform to exchange biologic sequence annotations between independent resources [6].

This publication describes the development of the BioCreative MetaServer (BCMS) prototype. The Results section (below) provides an overview of the system design and introduces the basic components, followed by short descriptions of the IE systems currently available through the platform prototype. The Discussion section (below) reviews what problems are solved and what issues need further investigation. The Conclusions section (below) closes with current and future utilities of this platform for the biomedical community. Technical details on the platform and implementation aspects can be found in the Materials and methods section (below).

Results

The fundamental aim of the BCMS platform is to provide users with annotations on biomedical texts from different systems. At the current prototype level, the dataset is restricted to a fixed number of approximately 22,800 PubMed/Medline abstracts. The available annotations consist of marking passages that are detected as gene or protein name mentions, annotating the articles with the gene/protein and taxonomic IDs (providing hyperlinks to the corresponding database entries), and a confidence score for whether the text contains protein-protein interaction information. Expanding on stand-alone IE systems, this platform gathers the results of several systems developed by various research groups, unifies them, and allows the user to access abstracts and annotations in a combined view. It is conceivable that collating classification results will often enhance performance, simply because multiple equal classifications for a given annotation are more likely to be correct. The gathered data are accessible to the user both as human-readable hypertext and as machine-processable XML in the form of XML-RPC requests.
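The idea that agreeing classifications are more likely to be correct can be illustrated with a simple vote count. The following is a minimal sketch only, not the BCMS implementation: the function name `aggregate_annotations`, the `min_votes` threshold, the system names, and the annotation tuples are all assumptions made up for illustration.

```python
from collections import Counter


def aggregate_annotations(system_annotations, min_votes=2):
    """Keep annotations that at least `min_votes` systems agree on.

    `system_annotations` maps a (hypothetical) system name to the
    annotations it produced for one abstract, given here as
    (annotation_type, value) tuples.
    """
    votes = Counter()
    for annotations in system_annotations.values():
        # Each system contributes at most one vote per annotation.
        votes.update(set(annotations))
    return {ann: n for ann, n in votes.items() if n >= min_votes}


# Made-up results from three hypothetical systems for a single abstract.
example = {
    "system_a": [("gene_id", "G1"), ("taxon_id", "T1")],
    "system_b": [("gene_id", "G1")],
    "system_c": [("gene_id", "G2"), ("taxon_id", "T1")],
}
consensus = aggregate_annotations(example)
print(consensus)
```

Here only the annotations proposed by two or more systems survive; the lone `("gene_id", "G2")` proposal is dropped. A weighted vote using per-system confidence scores would be a natural variation, but the excerpt does not say how BCMS combines results.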

Annotation Systems

Gene/protein normalization (GN): detect which genes or proteins are mentioned, assigning sequence database identifiers to the text.

Biotec TU Dresden and Humboldt-Universität zu Berlin (JH, CP) [Hakenberg]

The annotations we currently provide are gene mention normalization (32,795 human genes from EntrezGene)

Entity mention normalization is based on a large lexicon of known names and synonyms, which are kept in main memory at all times for efficiency. Once a potential named entity has been found, we further identify it using context profiles in case multiple entities share the same name [15].
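The two steps described above, lexicon lookup followed by context-based disambiguation, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' system: the lexicon entries, the placeholder identifiers, the context profiles, and the word-overlap scoring are all hypothetical, and the real system presumably also matches multi-word names and spelling variants.

```python
import re

# Hypothetical in-memory lexicon: lower-cased name or synonym -> candidate IDs.
# The identifiers are placeholders, not real database entries.
LEXICON = {
    "brca1": ["ID_BRCA1"],
    "cat": ["ID_CATALASE", "ID_CRAT"],  # one name shared by two entities
}

# Hypothetical context profiles: words expected near mentions of each entity.
CONTEXT_PROFILES = {
    "ID_CATALASE": {"catalase", "peroxide", "oxidative"},
    "ID_CRAT": {"carnitine", "acetyltransferase", "mitochondrial"},
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize_mentions(text):
    """Return (mention, identifier) pairs for single-token lexicon hits."""
    tokens = tokenize(text)
    context = set(tokens)
    results = []
    for token in tokens:
        candidates = LEXICON.get(token)
        if not candidates:
            continue
        if len(candidates) == 1:
            best = candidates[0]
        else:
            # Assumed scoring: pick the candidate whose profile shares the
            # most words with the text (ties go to the first candidate).
            best = max(
                candidates,
                key=lambda cid: len(CONTEXT_PROFILES.get(cid, set()) & context),
            )
        results.append((token, best))
    return results


mentions = normalize_mentions(
    "CAT activity protects cells from peroxide and oxidative stress."
)
print(mentions)
```

In this example the ambiguous name "CAT" resolves to the catalase placeholder because two of its profile words occur in the sentence and none of the other candidate's do.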

References

Krallinger M, Valencia A: Text-mining and information-retrieval services for molecular biology. Genome Biol 2005, 6:224.

Cohen A, Hersh W: A survey of current work in biomedical text mining. Brief Bioinform 2005, 6:57-71.

Labarga A, Valentin F, Anderson M, Lopez R: Web Services at the European Bioinformatics Institute. Nucleic Acids Res 2007, 35(Web Server issue):W6-W11.

Fernández J, Hoffmann R, Valencia A: iHOP web services. Nucleic Acids Res 2007, 35(Web Server issue):W21-W26.

Bujnicki JM, Elofsson A, Fischer D, Rychlewski L: Structure prediction meta server. Bioinformatics 2001, 17:750-751.

Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The distributed annotation system. BMC Bioinformatics 2001, 2:7.

BioCreative Homepage

XML-RPC Specification

BioCreative MetaServer

BioCreative XML-RPC MetaService

Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol 2008, 9(Suppl 2):S1.

Smith L, Tanabe LK, Johnson nee Ando R, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA Jr, Hunter L, Carpenter B, Tsai RT-H, Dai H-J, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, et al.: Overview of BioCreative II gene mention recognition. Genome Biol 2008, 9(Suppl 2):S2.

Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu H-h, Torres R, Krauthammer M, Lau WW, Liu H, Hsu C-N, Schuemie M, Cohen KB, Hirschman L: Overview of BioCreative II gene normalization. Genome Biol 2008, 9(Suppl 2):S3.

Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol 2008, 9(Suppl 2):S4.

Hakenberg J, Plake C, Royer L, Strobelt H, Leser U, Schroeder M: Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biol 2008, 9(Suppl 2):S14.

Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U: AliBaba: PubMed as a graph. Bioinformatics 2006, 22:2444-2445.

Kuo CJ, Chang YM, Huang HS, Lin KT, Yang BH, Lin YS, Hsu CN, Chung IF: Rich feature set, unification of bidirectional parsing and dictionary filtering for high F-score gene mention tagging. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007.

Mallet: A machine learning for language toolkit

Tsuruoka Y, Tateishi Y, Kim JD, Ohta T, McNaught J, Ananiadou S, Tsujii J: Developing a robust part-of-speech tagger for biomedical text. In: Advances in Informatics, 10th Panhellenic Conference on Informatics; 11-13 November 2005; Volos, Greece. Springer; 2005:382-392.

Dai HJ, Hung HC, Tsai RTH, Hsu WL: IASL systems in the gene mention tagging task and protein interaction article subtask. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007.

Tsai RTH, Sung CL, Dai HJ, Hung HC, Sung TY, Hsu WL: NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 2006, 7(Suppl 5):S11.

Tsai RTH, Hung HC, Dai HJ, Hsu WL: Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles. Proceedings of the 6th International Conference on Bioinformatics; HongKong-Hanoi-Nansha; 27-31 August 2007.

Sinica Annotation Server - Web Service

Lau WW, Johnson CA: Rule-based human gene normalization in biomedical text with confidence estimation. Comput Syst Bioinformatics Conf 2007, 6:371-379.

Nelder J, Mead R: A simplex method for function minimization. Computer J 1965, 7:308-313.

Sætre R, Sagae K, Tsujii J: Syntactic features for protein-protein interaction extraction. Short Paper Proceedings of the 2nd International Symposium on Languages in Biology and Medicine (LBM-2007); 6-7 December 2007; Singapore.

Sætre R, Yoshida K, Yakushiji A, Miyao Y, Matsubayashi Y, Ohta T: AKANE system: protein-protein interaction pairs in BioCreAtIvE2 challenge, PPI-IPS subtask. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007:209-212.

Chen YH, Ramampiaro H, Lægreid A, Sætre R: ProtIR prototype: abstract relevance for protein-protein interaction in BioCreAtIvE2 challenge, PPI-IAS subtask. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007:179-182.

Jang H, Lim J, Lim JH, Park SJ, Lee KC, Park SH: Finding the evidence for protein-protein interactions from PubMed abstracts. Bioinformatics 2006, 22:e220-e226.

Fan W, Stolfo S, Zhang J, Chan P: AdaCost: misclassification cost-sensitive boosting. Proceedings of the 16th International Conference on Machine Learning; 27-30 June 1999; Bled, Slovenia 1999, 97-105.

PIE: Protein Interaction Information Extraction

Kinoshita S, Cohen KB, Ogren PV, Hunter L: BioCreAtIvE Task1A: entity identification with a stochastic tagger. BMC Bioinformatics 2005, 6(Suppl 1):S4.

Baumgartner WA Jr, Lu Z, Johnson HL, Caporaso JG, Paquette J, Lindemann A, White EK, Medvedeva O, Cohen KB, Hunter L: Concept recognition for extracting protein interaction relations from biomedical text. Genome Biol 2008, 9(Suppl 2):S9.

Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M, Tobin R, Wang X: Automating curation using a natural language processing pipeline. Genome Biol 2008, 9(Suppl 2):S10.

Grover C, Haddow B, Klein E, Matthews M, Nielsen LA, Tobin R, Wang X: Adapting a relation extraction pipeline for the BioCreAtIvE II task. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007.

Alex B, Haddow B, Grover C: Recognising nested named entities in biomedical text. Proceedings of BioNLP; June 2007; Prague, Czech Republic 2007, 65-72.

Wang X: Rule-based protein term identification with help from automatic species tagging. Proceedings of CICLING; Mexico City, Mexico 2007, 288-298.

Nielsen LA: Extracting protein-protein interactions using simple contextual features. Proceedings of BioNLP; New York 2006, 120-121.

Matthews M: Improving biomedical text categorization with nlp. Proceedings of the SIGs, The Joint BioLINK-Bio-Ontologies Meeting 2006, 93-96.

Ehrler F, Geissbuhler A, Jimeno A, Ruch P: Data-poor categorization and passage retrieval for gene ontology annotation in Swiss-Prot. BMC Bioinformatics 2005, 6(Suppl 1):S23.

Ruch P: Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 2006, 22:658-664.

Pillet V, Zehnder M, Seewald AK, Veuthey AL, Petrak J: GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 2005, 21:1743-1744.

Genia Tagger

de Marneffe MC, MacCartney B, Manning CD: Generating typed dependency parses from phrase structure parses. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).

Erkan G, Özgür A, Radev DR: Semi-supervised classification for extracting protein interaction sentences using dependency parsing. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL); Prague, Czech Republic 2007, 1:228-237.

Erkan G, Özgür A, Radev DR: Extracting interacting protein pairs and evidence sentences by using dependency parsing and machine learning techniques. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007.

Krauthammer M, Nenadic G: Term identification in the biomedical literature. J Biomed Inform 2004, 37:512-526.

Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21:3191-3192.

Luong T, Tran N, Krauthammer M: Context-aware mapping of gene names using trigrams. In: Proceedings of the Second BioCreative Challenge Workshop. Madrid, Spain. CNIO; 2007:145-148.

Hoffmann R, Valencia A: A gene network for navigating the literature. Nat Genet 2004, 36:664.

Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 2005, 21(Suppl 2):ii252-ii258.

MEDLINE/PubMed update charts

Valencia A: Meta, Meta(N) and cyber servers. Bioinformatics 2003, 19:795.

eUtils SOAP API

PostgreSQL Open Source Database

Django Web Development Framework

jQuery JavaScript and AJAX library

LingPipe - Java Text Mining Library and Medline Importer

Python Programming Language

ITI Life Sciences Homepage

Cognia EU Homepage

Instituto Nacional de Bioinformática
