2004 GeneNameIdAndNormlUsingAModOrgDB


Subject Headings: Gene Mention Normalization Task, BioCreAtIvE Task, FlyBase.



  • Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
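The balanced F-measure figures quoted in the abstract can be checked from the reported precision and recall values. The following is a minimal sketch, assuming the standard definition of the F-measure as the weighted harmonic mean of precision and recall; the function name `f_measure` and the dictionary labels are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F-measure (assumed here to be the
    definition used in the abstract).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported in the abstract.
# The labels are illustrative names, not terms from the paper.
reported = {
    "synonym matching, unfiltered": (0.02, 0.95),
    "synonym matching, filtered": (0.50, 0.72),
    "HMM tagger with filters": (0.88, 0.61),
}

scores = {name: round(f_measure(p, r), 2) for name, (p, r) in reported.items()}

for name, score in scores.items():
    print(f"{name}: F = {score:.2f}")
```

Under this definition, the filtered pattern-matching system and the HMM-based system come out at 0.59 and 0.72, matching the abstract. The abstract does not quote an F-measure for the unfiltered matcher; the value computed here is only a derived illustration of how much the 2% precision dominates the score.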

1. Introduction

2. Applying text mining to biological database curation
2.1. The curation process
2.2. Entity tagging: current approaches
2.3. Entity tagging and extraction systems in biology

3. Resources
3.1. FlyBase
3.2. MEDLINE abstracts
3.3. Evaluating gene name mentions
3.4. BioCreAtIvE evaluation datasets

4. Gene extraction and normalization experiments
4.1. Experiment A: using the lexicon to create a normalized list of fly gene mentions
4.1.1. Methodology
4.1.2. Analysis
4.2. Experiment B: machine learning to find gene names using noisy training data
4.2.1. Methodology
4.2.2. Analysis
4.3. Experiment C: using the HMM based tagger and lexicon to create a normalized list of gene mentions
4.3.1. Methodology
4.3.2. Analysis

Author: Alexander A. Morgan, Lynette Hirschman, Marc E. Colosimo, Alexander S. Yeh, and Jeff B. Colombe
Title: Gene Name Identification and Normalization Using a Model Organism Database
Journal: Journal of Biomedical Informatics
DOI: 10.1016/j.jbi.2004.08.010
Year: 2004