2007 EvaluatingTheAutoMapOfGeneMentionsToIds

Subject Headings: BioCreAtIvE II - Gene Normalization Task, Protein Mention, Entrez Gene Database, MEDLINE.


Cited By

  • ~14 …



We have developed a challenge task for the second BioCreAtIvE (Critical Assessment of Information Extraction in Biology) that requires participating systems to provide lists of the EntrezGene (formerly LocusLink) identifiers for all human genes and proteins mentioned in a MEDLINE abstract. We are distributing 281 annotated abstracts and another 5,000 noisily annotated abstracts along with a gene name lexicon to participants. We have performed a series of baseline experiments to better characterize this dataset and form a foundation for participant exploration.

1. Background

Task 1B of the first Critical Assessment of Information Extraction in Biology (BioCreAtIvE) involved linking mentions of model organism genes and proteins in MEDLINE abstracts to their corresponding identifiers in three model organism databases (MGD, SGD, and FlyBase). The task is described in detail in [1], and the same journal issue describes many different approaches to it. Considerable past work has associated text mentions of human genes and proteins with unique identifiers, including the early work by Cohen et al. [2] and the AZuRE system [3]. Very recently, Fang et al. [4] reported excellent results on a data set they created from one hundred MEDLINE abstracts. This widespread community interest and our experience with the first BioCreAtIvE motivated us to prepare another evaluation task for the second BioCreAtIvE [5]. This task will require systems to link mentions of human genes and proteins with their corresponding EntrezGene (LocusLink) identifiers. We hope that researchers in this area can use this data set to compare techniques and gauge performance gains. It can also be used to address the general portability of normalization techniques and to investigate the relationships between co-mentioned genes and proteins.
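To make the shape of the task concrete, the following is a minimal sketch of a lexicon lookup that turns an abstract into a list of unique identifiers. It is not the authors' baseline: the function name `normalize_abstract`, the tiny `TOY_LEXICON`, and the single-token matching rule are all assumptions made for illustration, and the lexicon distributed to participants is far larger.

```python
import re

# Toy lexicon (an assumption for illustration): lowercase gene/protein
# names mapped to EntrezGene identifiers. Several names may share one ID.
TOY_LEXICON = {
    "tp53": "7157",
    "p53": "7157",
    "brca1": "672",
}


def normalize_abstract(text, lexicon):
    """Return the sorted list of unique identifiers whose names occur in text.

    Matching is exact and single-token only, so multi-word names,
    spelling variants, and ambiguous names are not handled.
    """
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = set(re.findall(r"[A-Za-z0-9]+", text.lower()))
    # Collect each identifier once, however many of its names appear.
    ids = {gene_id for name, gene_id in lexicon.items() if name in tokens}
    return sorted(ids, key=int)


abstract = "Mutations in p53 and BRCA1 were observed; TP53 loss was frequent."
print(normalize_abstract(abstract, TOY_LEXICON))
```

Note that the output is a list of identifiers per abstract, not a list of mention positions: two different names for the same gene contribute a single identifier.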

5. Discussion

It is interesting to compare this new corpus with Task 1B of BioCreAtIvE 1 for insights into portability of normalization techniques. …

As Vlachos et al. observed [19], biomedical text frequently refers to whole families of genes and proteins with a single term, as in: "Mxi1 belongs to the Mad (Mxi1) family of proteins, which function as potent antagonists of Myc oncoproteins". For future work in biomedical entity normalization, we suggest that normalizing such mentions to family identifiers may be an effective way to support other biomedical text mining tasks. The protein families in InterPro [6] could possibly serve as normalization targets for mentions of families. For example, the mention of "Myc oncoproteins" could link to InterPro:IPR002418. This would enable information extraction systems that extract facts (relations, attributes) about gene families to attach those properties to all family members.
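The idea of attaching family-level facts to every family member can be sketched as follows. This is only an illustration under stated assumptions: the function `propagate_family_facts`, the `FAMILY_LEXICON` and `FAMILY_MEMBERS` tables, the relation name, and the member list are hypothetical and not drawn from InterPro; only the accession InterPro:IPR002418 and the example mention come from the text.

```python
from collections import defaultdict

# Hypothetical lexicon: lowercase family mention -> family identifier.
FAMILY_LEXICON = {"myc oncoproteins": "InterPro:IPR002418"}

# Hypothetical membership table; the member list is illustrative only.
FAMILY_MEMBERS = {"InterPro:IPR002418": ["MYC", "MYCN", "MYCL"]}


def propagate_family_facts(facts, family_lexicon, family_members):
    """Attach each fact about a family mention to all members of that family.

    facts is a list of (mention, relation, value) triples. The result maps
    each member to the list of (relation, value) pairs it inherits.
    Facts whose mention is not a known family are skipped.
    """
    by_member = defaultdict(list)
    for mention, relation, value in facts:
        family_id = family_lexicon.get(mention.lower())
        if family_id is None:
            continue
        for member in family_members.get(family_id, []):
            by_member[member].append((relation, value))
    return dict(by_member)


facts = [("Myc oncoproteins", "antagonized_by", "Mad family proteins")]
result = propagate_family_facts(facts, FAMILY_LEXICON, FAMILY_MEMBERS)
for member, inherited in result.items():
    print(member, inherited)
```

Keeping the family identifier as an intermediate step means the membership table can be updated independently of the extracted facts.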

6. Conclusion

In summary, we have described the motivation and development of a dataset for evaluating the automatic mapping of mentions of human genes and proteins to unique identifiers, which will be used as part of the second BioCreAtIvE. We have elucidated some of the properties of this data set and suggested how it may be used in conjunction with biological knowledge to investigate the properties of co-mentioned genes and proteins. Anonymized submissions by evaluation participants, along with the evaluation set gold standard annotations, will be made publicly available [5] after the workshop, tentatively scheduled for the spring of 2007.


References

  • Hirschman, L., et al., Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, (2005). 6 Suppl 1: p. S11.
  • Cohen, K.B., et al. Contrast and variability in gene names. In: Proceedings of the workshop on natural language processing in the biomedical domain, pp. 14-20. Association for Computational Linguistics. 2002.
  • Podowski, R.M., et al., AZuRE, a scalable system for automated term disambiguation of gene and protein names. Proc IEEE Comput Syst Bioinform Conf, 2004: p. 415-24.
  • Fang, H., et al., Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries, In: Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. 2006, Association for Computational Linguistics: New York, New York. p. 41-48.
  • http://biocreative.sourceforge.net/, BioCreAtIvE 2 Homepage.
  • Wu, C.H., et al., The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, (2006). 34(Database issue): p. D187-91.
  • Blaschke, C., et al., Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, (2005). 6 Suppl 1: p. S16.
  • Colosimo, M.E., et al., Data preparation and interannotator agreement: BioCreAtIvE Task 1B. BMC Bioinformatics, (2005). 6 Suppl 1: p. S12.
  • Tsai, R.T., et al., Various criteria in the evaluation of biomedical named entity recognition. BMC Bioinformatics, (2006). 7: p. 92.
  • Tuason, O., et al., Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput, 2004: p. 238-49.
  • http://www.geneontology.org/, The Gene Ontology.
  • ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, NCBI Gene FTP site.
  • Wain, H.M., et al., Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res, (2004). 32(Database issue): p. D255-7.
  • Wellner, B., Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data, In: Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. 2005, Association for Computational Linguistics: Detroit. p. 1-8.
  • (Morgan et al., 2004) ⇒ Alexander A. Morgan, Lynette Hirschman, Marc E. Colosimo, Alexander S. Yeh, and Jeff B. Colombe. (2004). “Gene Name Identification and Normalization Using a Model Organism Database.” In: Journal of Biomedical Informatics 37(6).
  • Aronson, A.R., The effect of textual variation on concept based information retrieval. Proc AMIA Annu Fall Symp, 1996: p. 373-7.
  • http://biopython.org, BioPython Website.
  • Hanisch, D., et al., Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput, 2003: p. 403-14.
  • Vlachos, A., et al., Bootstrapping the Recognition and Anaphoric Linking of Named Entities in Drosophila Articles. Pac Symp Biocomput, (2006). 11: p. 100-111.


  • Author: Alexander A. Morgan, Benjamin Wellner, Jeffrey B. Colombe, Robert Arens, Marc E. Colosimo, Lynette Hirschman
  • Title: Evaluating the Automatic Mapping of Human Gene and Protein Mentions to Unique Identifiers
  • Journal: Pacific Symposium Biocomputing
  • Title URL: http://psb.stanford.edu/psb-online/proceedings/psb07/morgan.pdf
  • Year: 2007