Gene Mention Normalization Task

A Gene Mention Normalization Task is a domain-specific entity mention normalization task that is restricted to mapping gene mentions to canonical gene records.




  • (Cusick et al., 2009) ⇒ Michael E Cusick, Haiyuan Yu, Alex Smolyar, Kavitha Venkatesan, Anne-Ruxandra Carvunis, Nicolas Simonis, Jean-François Rual, Heather Borick, Pascal Braun, Matija Dreze, Jean Vandenhaute, Mary Galli, Junshi Yazaki, David E Hill, Joseph R Ecker, Frederick P Roth, and Marc Vidal. (2009). “Literature-Curated Protein Interaction Datasets.” In: Nature Methods 6, 39 - 46 (2009)
    • Why is reliability of literature curation so low? Our findings of large error rates in curated protein interaction databases, at least for yeast and human, are consistent with recent hints that the quality of literature-curated datasets may not be as high as widely perceived [23,29,43–45]. Perhaps occasionally curator error is responsible. However, we suggest that the errors are due not so much to curators but to the simple reality that extracting accurate information from a long free-text document can be extremely difficult. Gene name confusion is particularly thorny [30,46]. An example from our curated yeast sample illustrates the difficulties. A purification with a tandem affinity purification tag with Vps71/Swc6 (slash separates synonymous approved names) as bait [47] pulls down a protein named Swc3, but double-checking this finds that the corresponding open reading frame is actually SWC3 (locus name YAL011w), and not the ALR1/SWC3 (locus name YOL130w) open reading frame curated in the database. A shared synonym thoroughly muddled the curation.
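
The shared-synonym problem in the passage above can be made concrete with a small dictionary lookup. The following is a minimal sketch, not a description of any curation tool: the two SWC3 entries are taken from the quoted example, while the lexicon layout and the names `LEXICON`, `build_index` and `candidates` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

# Toy lexicon: locus name -> names used for that locus.
# The entries mirror the yeast example quoted above; the structure itself
# is an illustrative assumption, not a real database schema.
LEXICON = {
    "YAL011w": ["SWC3"],
    "YOL130w": ["ALR1", "SWC3"],
}


def build_index(lexicon):
    """Invert the lexicon into a case-insensitive name -> {locus} index."""
    index = defaultdict(set)
    for locus, names in lexicon.items():
        for name in names:
            index[name.lower()].add(locus)
    return index


def candidates(mention, index):
    """Return every locus whose names match the mention, sorted for stability."""
    return sorted(index.get(mention.strip().lower(), set()))


INDEX = build_index(LEXICON)
print(candidates("Swc3", INDEX))  # ['YAL011w', 'YOL130w']
print(candidates("ALR1", INDEX))  # ['YOL130w']
```

A mention that returns more than one candidate cannot be normalized by string matching alone; picking the right record needs further evidence from the document, which is where the curation error described above arose.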


  • (Morgan et al., 2008) ⇒ Alexander A Morgan, Zhiyong Lu, Xinglong Wang, Aaron M Cohen, Juliane Fluck, Patrick Ruch, Anna Divoli, Katrin Fundel, Robert Leaman, Jörg Hakenberg, Chengjie Sun, Heng-hui Liu, Rafael Torres, Michael Krauthammer, William W Lau, Hongfang Liu, Chun-Nan Hsu, Martijn Schuemie, K Bretonnel Cohen, and Lynette Hirschman. (2008). “Overview of BioCreative II gene normalization.” In: Genome Biology 2008, 9(Suppl 2):S3. doi:10.1186/gb-2008-9-s2-s3.
    • QUOTE: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. … Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. … Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
  • (Farkas, 2008) ⇒ Richárd Farkas. (2008). “The strength of co-authorship in gene name disambiguation.” In: BMC Bioinformatics 2008, 9:69. doi:10.1186/1471-2105-9-69
    • QUOTE: Taken one step further, the goal of Gene Name Normalisation (GN) [2] is to assign a unique identifier to each gene name found in a text.
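
The precision, recall and F-measure figures in the Morgan et al. quote above compare the set of identifiers a system lists for each abstract against the gold-standard set. The sketch below shows that kind of set-based scoring under stated assumptions: the function name `precision_recall_f1`, the document labels and the numeric identifiers are all made up for illustration, and this is not the official BioCreative II scoring script.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted (document, gene identifier) pairs against a gold set."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold standard and system output; the identifiers are invented.
gold = {("doc1", 101), ("doc1", 202), ("doc2", 303)}
predicted = {("doc1", 101), ("doc2", 303), ("doc2", 404)}

p, r, f = precision_recall_f1(gold, predicted)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67

# The pooled 'maximum recall' figure from the quote: 763 of 785 identifiers.
print(round(763 / 785, 2))  # 0.97
```

Scoring pairs rather than bare identifiers keeps a correct identifier in one abstract from offsetting a missed one in another.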


  • ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics