2008 AutomatingCurationUsingANLPpipeline


Subject Headings: Closed IE Task, Text Annotation, Biomedical Literature, BioCreative II Task.


Cited By



The tasks in BioCreative II were designed to approximate some of the laborious work involved in curating biomedical research papers. The approach to these tasks taken by the University of Edinburgh team was to adapt and extend the existing natural language processing (NLP) system that we have developed as part of a commercial curation assistant. Although this paper concentrates on using NLP to assist with curation, the system can equally be employed to extract types of information from the literature that are immediately relevant to biologists in general.
Our system was among the highest performing on the interaction subtasks, and competitive performance on the gene mention task was achieved with minimal development effort. For the gene normalization task, a string matching technique that can be quickly applied to new domains was shown to perform close to average.
The technologies being developed were shown to be readily adapted to the BioCreative II tasks. Although high performance may be obtained on individual tasks such as gene mention recognition and normalization, and document classification, tasks in which a number of components must be combined, such as detection and normalization of interacting protein pairs, are still challenging for NLP systems.


Curating biomedical literature into relational databases is a laborious task, in view of the quantity of biomedical research papers that are published on a daily basis. It is widely argued that text mining could simplify and speed up this task [1-3]. In this report we describe how a text mining system developed for a commercial curation project was adapted for the BioCreative II competition. Our submission (team 6) to this competition is based on research carried out as part of the Text Mining (TXM) program, a 3-year project aimed at producing natural language processing (NLP) tools to assist in the curation of biomedical papers. The principal product of this project is an information extraction (IE) pipeline, designed to extract named entities (NEs) and relations relevant to the biomedical domain, and to normalize the NEs to appropriate ontologies (Figure 1). Although the TXM pipeline is designed to assist specialized users, such as curators, it can equally be employed to extract information from the literature that is immediately relevant to biologists in general. For example, it can be used to automatically create large-scale databases or to generate protein-protein interaction networks.
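The three stages named above (entity recognition, normalization, relation extraction) can be illustrated with a minimal sketch. This is not the TXM implementation, which relies on trained components; the toy lexicon, the regular expression, the trigger-word list, the similarity threshold, and all function names here are assumptions made purely for illustration, and the standard-library `difflib` similarity stands in for whatever string matching the real system uses.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from typing import Optional

# Hypothetical toy lexicon (identifier -> known names); illustrative only.
LEXICON = {
    "P04637": ["p53", "TP53", "tumor protein p53"],
    "Q00987": ["MDM2", "E3 ubiquitin-protein ligase Mdm2"],
}

# Hypothetical interaction trigger words.
TRIGGERS = ("binds", "interacts", "phosphorylates")


@dataclass
class Entity:
    text: str
    start: int
    end: int
    identifier: Optional[str] = None


def recognize(sentence):
    """Stage 1: crude pattern-based stand-in for a trained NE recognizer.

    Matches tokens made of letters followed by digits, e.g. p53 or Mdm-2.
    """
    pattern = re.compile(r"\b[A-Za-z]+-?\d+\b")
    return [Entity(m.group(), m.start(), m.end()) for m in pattern.finditer(sentence)]


def normalize(entity, threshold=0.8):
    """Stage 2: map a mention to the lexicon identifier with the closest name."""
    best_id, best_score = None, 0.0
    for identifier, names in LEXICON.items():
        for name in names:
            score = SequenceMatcher(None, entity.text.lower(), name.lower()).ratio()
            if score > best_score:
                best_id, best_score = identifier, score
    # Leave the mention unnormalized if nothing is similar enough.
    entity.identifier = best_id if best_score >= threshold else None
    return entity


def extract_relations(sentence, entities):
    """Stage 3: pair normalized entities that co-occur with a trigger word."""
    if not any(trigger in sentence.lower() for trigger in TRIGGERS):
        return []
    ids = sorted({e.identifier for e in entities if e.identifier})
    return list(combinations(ids, 2))


def run_pipeline(sentence):
    """Run the three stages in order on one sentence."""
    entities = [normalize(e) for e in recognize(sentence)]
    return entities, extract_relations(sentence, entities)


if __name__ == "__main__":
    ents, rels = run_pipeline("Mdm-2 binds p53 in vitro.")
    for e in ents:
        print(e.text, "->", e.identifier)
    print(rels)
```

The sketch also shows why combined tasks are harder than individual ones: a relation is only reported when recognition and normalization have both succeeded for each participant, so errors in the earlier stages compound.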


  • 1. Yeh AS, Hirschman L, Morgan A: Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 2003, 19(suppl 1):i331-i339.
  • 2. Rebholz-Schuhmann D, Kirsch H, Couto F: Facts from text: is text mining ready to deliver? PLoS Biology 2005, 3:e65.
  • 3. Xu H, Krupke D, Blake J, Friedman C: A natural language processing (NLP) tool to assist in the curation of the laboratory mouse tumor biology database. AMIA Annu Symp Proc 2006:1150.
  • 4. Alex B, Haddow B, Grover C: Recognising nested named entities in biomedical text. Proceedings of BioNLP; Prague, Czech Republic 2007.
  • 5. Haddow B, Matthews M: The extraction of enriched protein-protein interactions from biomedical text. Proceedings of BioNLP; Prague, Czech Republic 2007.
  • 6. Smith L, Tanabe LK, Ando R, Kuo CJ, Chung IF, Hsu CN, Lin YS, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA Jr, Hunter L, Carpenter B, Tsai RTH, Dai HJ, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, Divoli A, Maña-López M, Mata-Vázquez J, Wilbur WJ: Overview of BioCreative II gene mention recognition. Genome Biol 2008, 9(Suppl 2):S2.
  • 7. Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu H, Torres R, Krauthammer M, Lau WW, Liu H, Hsu CN, Schuemie M, Cohen KB, Hirschman L: Overview of BioCreative II gene normalization. Genome Biol 2008, 9(Suppl 2):S3.
  • 8. Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol 2008, 9(Suppl 2):S4.
  • 9. Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of ICML 2001.
  • 10. Tsuruoka Y, Tsujii J: Bidirectional inference with the easiest-first strategy for tagging sequence data. Proceedings of HLT/EMNLP 2005.
  • 11. Wilbur J, Smith L, Tanabe L: BioCreative 2 gene mention task. Proceedings of the BioCreAtIvE II Workshop; Madrid, Spain 2007, 7-16.
  • 12. Stevenson M: Fact distribution in information extraction. Lang Resources Eval 2006, 40:183-201.
  • 13. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Res 2000, 28:45-48.
  • 14. Language Technology Group Software [6]
  • 15. Curran J, Clark S: Language independent NER using a maximum entropy tagger. Proceedings of CoNLL03; Edmonton, Canada 2003.
  • 16. Smith L, Rindflesch T, Wilbur WJ: MedPost: a part-of-speech tagger for biomedical text. Bioinformatics 2004, 20:2320-2321.
  • 17. Schwartz A, Hearst M: A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of PSB 2003.
  • 18. Minnen G, Carroll J, Pearce D: Robust, applied morphological generation. Proceedings of INLG 2000.
  • 19. Nielsen LA: Extracting protein-protein interactions using simple contextual features. Proceedings of BioNLP; New York, USA 2006.
  • 20. Tjong Kim Sang EF, De Meulder F: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. Proceedings of CoNLL 2003.
  • 21. McCallum A, Li W: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proceedings of CoNLL 2003.
  • 22. McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 2005, 6(suppl 1):S6.
  • 23. Sha F, Pereira F: Shallow parsing with conditional random fields. Proceedings of HLT-NAACL 2003.
  • 24. [14]
  • 25. Maximum Entropy Modeling Toolkit for Python and C++ [15]
  • 26. Collier N, Takeuchi K: Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Informatics 2004, 37:423-435.
  • 27. Jaro MA: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc 1989, 84:414-420.
  • 28. Jaro MA: Probabilistic linkage of large public health data files. Stat Med 1995, 14:491-498.
  • 29. Winkler WE: The state of record linkage and current research problems. Tech rep, Statistics of Income Division, Internal Revenue Service Publication R99/04 1999.
  • 30. Joachims T: Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines. Cambridge, MA: MIT Press; 1999.
  • 31. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader G, Michalickova K, Pawson T, Hogue C: PreBIND and Textomy: mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4:11.
  • 32. Polavarapu N, Navathe SB, Ramnarayanan R, ul Haque A, Sahay S, Liu Y: Investigation into biomedical literature classification using support vector machines. Proc IEEE Comput Syst Bioinform Conf 2005, 366-374.
  • 33. Cognia [17]
  • 34. ITI Life Sciences [18]

2008 AutomatingCurationUsingANLPpipeline
  • Author: Beatrice Alex, Claire Grover, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Richard Tobin, Xinglong Wang
  • Title: Automating Curation Using a Natural Language Processing Pipeline
  • Journal: Genome Biology
  • URL: http://genomebiology.com/2008/9/S2/S10
  • DOI: 10.1186/gb-2008-9-s2-s10
  • Year: 2008