2005 WhatMakesAGeneName


Subject Headings: Protein NER Task, Protein NER System.


  • Contains a useful list of publicly available protein NER taggers.
  • Suggests that performance has plateaued and that future progress may require specialization into subdomains.

Cited By



The recognition of biomedical concepts in natural text (named entity recognition, NER) is a key technology for automatic or semi-automatic analysis of textual resources. Precise NER tools are a prerequisite for many applications working on text, such as information retrieval, information extraction or document classification. Over the past years, the problem has achieved considerable attention in the bioinformatics community and experience has shown that NER in the life sciences is a rather difficult problem. Several systems and algorithms have been devised and implemented. In this paper, the problems and resources in NER research are described, the principal algorithms underlying most systems sketched, and the current state-of-the-art in the field surveyed.
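To make the task concrete, the following is a minimal sketch of NER by lexicon lookup, assuming a tiny hand-made list of gene and protein names. It is not taken from the paper and is not one of the surveyed systems; the lexicon entries, the function name `tag_gene_mentions`, and the example sentence are all hypothetical. It only illustrates what "recognising" a name in text means: returning the character span of each mention.

```python
import re

# Hypothetical toy lexicon; a real system would rely on a curated resource.
GENE_LEXICON = {"BRCA1", "TP53", "p53", "interleukin-2"}


def tag_gene_mentions(text, lexicon=GENE_LEXICON):
    """Return (start, end, mention) spans for lexicon entries found in text."""
    # Try longer names first so they take priority in the alternation.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in names)
    # Lookarounds stop a match from starting or ending inside a longer token.
    regex = re.compile(r"(?<![\w-])(?:" + pattern + r")(?![\w-])")
    return [(m.start(), m.end(), m.group(0)) for m in regex.finditer(text)]


sentence = "Mutations in BRCA1 and p53 alter the response to interleukin-2."
for start, end, mention in tag_gene_mentions(sentence):
    print(start, end, mention)
```

Exact lookup of this kind is case-sensitive and misses spelling variants, abbreviations, and names absent from the lexicon, which hints at why the abstract calls NER in the life sciences a difficult problem.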

| Tool | Recognised entities | Available as | Web page |
|---|---|---|---|
| GAPSCORE | Genes and proteins | Online form and web service | http://bionlp.stanford.edu/gapscore |
| ABNER | Protein, DNA, RNA, cell line, cell type | Java application and API | http://www.cs.wisc.edu/~bsettles/abner/ |
| KeX | Proteins | Shell and Perl scripts | http://www.hgc.jp/service/tooldoc/KeX/intro.html |
| AbGene | Genes | Binaries | ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene |
| LingPipe | Genes | Online form and Java API | http://www.alias-i.com/lingpipe/ |



As discussed in the section ‘Evaluation of NER systems’, it is unlikely that much further improvement is possible on the NER problem on general classes, but progress is likely in specialised areas. In particular, species-specific NER is a promising direction, but currently still hindered by the lack of sufficiently large, species-specific corpora.




| | Author | volume | title | journal | titleUrl | year |
|---|---|---|---|---|---|---|
| 2005 WhatMakesAGeneName | Ulf Leser, Jörg Hakenberg | 6 | What Makes a Gene Name? Named entity recognition in the biomedical literature | Briefings in Bioinformatics | http://bib.oxfordjournals.org/cgi/reprint/6/4/357.pdf | 2005 |