BioCreative II Gene Normalization Task

A BioCreative II Gene Normalization Task is a benchmark Gene Mention Normalization Task created in conjunction with BioCreative II.




  • (MorganWCACH, 2007) ⇒ Morgan AA, Ben Wellner, Colombe JB, Arens R, Marc E. Colosimo, and Lynette Hirschman. (2007). “Evaluating the automatic mapping of human gene and protein mentions to unique identifiers.” In: Pac Symp Biocomput 12 (2007) Maui, Hawaii. 281–91.
    • Human gene/protein normalization
    • Premise: Systems will be required to return the EntrezGene (formerly Locus Link) identifiers corresponding to the human genes and direct gene products appearing in a given MEDLINE abstract. This has relevance to improving document indexing and retrieval, and to linking text mentions to database identifiers in support of more sophisticated information extraction tasks. It is similar to Task 1B of BioCreAtIvE I [1].
    • System Input: Participating groups will be given a master list of human EntrezGene identifiers with some common gene and protein names (synonyms) for each identifier in the master list. For the evaluation task, the input is a collection of plain text abstracts.
    • System Output: For each abstract, the system will return a list of the EntrezGene identifiers and corresponding text excerpts for each human gene or gene product mentioned in the abstract. The excerpt required is a single mention of the gene's 'name' found in the abstract. Even if a gene is mentioned in several different places in an abstract under alternate names, only a single excerpt/mention is to be returned by the system. The return format is a single file with one entry per line and the fields delimited by tabs. The columns should be: PUBMED ID, EntrezGene (LocusLink) ID, Mention Text, and optionally Confidence. There should be no column headers or line numbers. The optional fourth column, Confidence, contains a confidence measure that ranges from 0 (no confidence) to 1 (absolute confidence). It is not a part of the main evaluation, and is included as an option for interested groups at the request of some participants. Although the hand-annotated training file contains multiple text excerpts for each identifier, that is just meant to aid in training, and only one would be expected from a participating system (any one of the set would be 'correct', although getting the right text is not the main part of the evaluation). An example line with made-up identifiers (fields separated by tabs) follows: 123456 987 foobar
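
The output format described above can be illustrated with a minimal Python sketch. It is not part of the task materials: the function name `format_submission`, the tuple layout of the predictions, and the three-decimal rendering of the confidence value are all assumptions made for illustration, and the identifiers are made up in the same spirit as the example line above. The sketch keeps only the first mention seen for each (PUBMED ID, EntrezGene ID) pair, emits no header or line numbers, and appends the confidence column only when a value is supplied.

```python
def format_submission(predictions):
    """Render predictions as tab-delimited lines, one per (abstract, gene) pair.

    `predictions` is an iterable of (pmid, gene_id, mention, confidence) tuples,
    where confidence is a float in [0, 1] or None. This layout is a hypothetical
    convention for the sketch, not something the task description prescribes.
    """
    seen = set()
    lines = []
    for pmid, gene_id, mention, confidence in predictions:
        key = (str(pmid), str(gene_id))
        if key in seen:
            # Only a single excerpt per gene per abstract is to be returned.
            continue
        seen.add(key)
        # Collapse internal whitespace so the mention cannot contain a tab.
        fields = [key[0], key[1], " ".join(mention.split())]
        if confidence is not None:
            if not 0.0 <= confidence <= 1.0:
                raise ValueError(f"confidence out of range: {confidence}")
            # Three decimals is an arbitrary choice for this sketch.
            fields.append(f"{confidence:.3f}")
        lines.append("\t".join(fields))
    # No column headers or line numbers, just the entries themselves.
    return "".join(line + "\n" for line in lines)


# Made-up identifiers, for illustration only.
example = format_submission([
    ("123456", "987", "foobar", None),
    ("123456", "987", "foo bar protein", None),  # duplicate gene, dropped
    ("123457", "654", "bazzin", 0.82),
])
print(example, end="")
```

A real submission would presumably use the confidence column either for every entry or for none; the mixed rows here only show both line shapes side by side.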