

2004 MutationInformationExtraction

Cited By

Quotes


Motivation: The amount of genomic and proteomic data that is published daily in the scientific literature is outstripping the ability of experimental scientists to stay current. Reviews, the traditional medium for collating published observations, are also unable to keep pace. For some specific classes of information (e.g. sequences and protein structures), obligatory data deposition policies have helped. However, a great deal of other valuable information is spread throughout the literature, hindering coherent access. We are involved in the Molecular Class-Specific Information System (MCSIS) project, a collaborative effort to design and automate the maintenance of protein family databases. The first two databases, the GPCRDB and NucleaRDB, are focused on G protein-coupled receptors (GPCRs) and nuclear hormone receptors (NRs), respectively. The main aim of the MCSIS project is to gather heterogeneous data from a variety of electronic and literature sources in order to draw new inferences about the target protein families.

  • Results: We present a computational method that identifies and extracts mutation data from the scientific literature. We focused on the extraction of single point mutations for the GPCR and NR superfamilies. After validation by plausibility filters, the mutation data is integrated into the corresponding MCSIS where it is combined with structural and sequence information already stored in these databases. We extracted and validated 2736 true point mutations from 914 articles on GPCRs and 785 true point mutations from 1094 articles on NRs. The current version of our automated extraction algorithm identifies 49.3% of the GPCR point mutations with a specificity of 87.9%, and 64.5% of the NR point mutations with a specificity of 85.8%. MuteXt routinely analyzes 100 electronic articles in approximately 1 h.
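The excerpt above does not say what the plausibility filters check. One check that would fit the description, assumed here purely for illustration and not taken from the paper, is to compare the wild-type residue of an extracted mutation with the residue stored at that position in a reference sequence. The function name `wild_type_matches` and the toy sequence are hypothetical.

```python
def wild_type_matches(sequence, wild_type, position):
    """Return True if the 1-based position in `sequence` holds `wild_type`.

    Hypothetical plausibility check: a mutation such as A4T is only
    plausible for a protein whose residue 4 is actually A.
    """
    if position < 1 or position > len(sequence):
        return False  # position lies outside the sequence
    return sequence[position - 1] == wild_type


# Made-up ten-residue sequence, used only to exercise the check.
toy_sequence = "MKTAYIAKQR"

plausible = wild_type_matches(toy_sequence, "A", 4)    # residue 4 is A
implausible = wild_type_matches(toy_sequence, "Y", 4)  # residue 4 is not Y
```

A check of this kind needs sequence data, which the abstract says is already stored in the MCSIS databases. Whether MuteXt uses it in this form is not stated in the excerpt.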

Systems and Methods: Information Extraction

  • The pattern must start with one amino acid in the one- or three-letter code followed by a number, and optionally by another amino acid encoded with the same letter code format as the first one.
  • The regular expression we use is: ([A-Z][1-9][0-9]+$)|([A-Z][1-9][0-9]*[A-Z]$)|([A-Z][a-z][a-z][1-9][0-9]*$)|([A-Z][a-z][a-z][1-9][0-9]*[A-Z][a-z][a-z]$)
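As a minimal sketch of how the quoted pattern behaves, the snippet below transcribes it into ASCII and applies it to whitespace-delimited tokens. The tokenization, the punctuation stripping and the helper name `candidate_mutations` are assumptions made for illustration; the excerpt does not describe how MuteXt splits text.

```python
import re

# ASCII transcription of the quoted pattern, one alternative per line:
# one-letter residue + number, one-letter + number + one-letter,
# three-letter residue + number, three-letter + number + three-letter.
MUTATION_TOKEN = re.compile(
    r"([A-Z][1-9][0-9]+$)"
    r"|([A-Z][1-9][0-9]*[A-Z]$)"
    r"|([A-Z][a-z][a-z][1-9][0-9]*$)"
    r"|([A-Z][a-z][a-z][1-9][0-9]*[A-Z][a-z][a-z]$)"
)


def candidate_mutations(text):
    """Return the tokens in `text` that match the pattern."""
    hits = []
    for raw in text.split():
        # Assumption: trim punctuation that clings to a token in running text.
        token = raw.strip(".,;:()[]")
        # re.match anchors at the start; the '$' in each alternative
        # anchors at the end, so the whole token must fit the pattern.
        if MUTATION_TOKEN.match(token):
            hits.append(token)
    return hits


sentence = "The A123T and Ala123Thr mutants, as well as Gly56, were tested (see R38)."
print(candidate_mutations(sentence))
```

The pattern constrains only the shape of a token, not whether the letters are valid amino acid codes, so a token such as H2O also matches. The first alternative requires at least two digits, so a bare one-letter residue at a single-digit position (for example R3) is not matched.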




  • Author: Florence Horn, Anthony L. Lau, Fred E. Cohen
  • Title: Automated Extraction of Mutation Data from the Literature: Application of MuteXt to G protein-coupled receptors and nuclear hormone receptors
  • Journal: Bioinformatics
  • Year: 2004