2004 GeneNameIdAndNormlUsingAModOrgDB

(Morgan et al., 2004) ⇒ Alexander A. Morgan, Lynette Hirschman, Marc E. Colosimo, Alexander S. Yeh, Jeff B. Colombe. (2004). “Gene Name Identification and Normalization Using a Model Organism Database.” In: Journal of Biomedical Informatics, 37(6). doi:10.1016/j.jbi.2004.08.010

Subject Headings: Gene Mention Normalization Task, BioCreAtIvE Task, FlyBase.

Quotes

Abstract

Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
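The balanced F-measure figures quoted in the abstract can be checked from the reported precision and recall values. The following is a minimal sketch of that arithmetic, assuming the standard harmonic-mean definition of the F-measure; the function name `f_measure` and the dictionary labels are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 is the balanced case)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as stated in the abstract.
reported = {
    "synonym-list matching, unfiltered": (0.02, 0.95),
    "synonym-list matching, filtered": (0.50, 0.72),
    "HMM tagger plus filters": (0.88, 0.61),
}

for name, (p, r) in reported.items():
    # The abstract reports F only for the last two rows; the first is derived here.
    print(f"{name}: P={p:.2f} R={r:.2f} F={f_measure(p, r):.2f}")
```

The filtered and HMM-based rows come out at 0.59 and 0.72, matching the abstract. The unfiltered row's F-measure is not stated in the abstract and is computed here only to illustrate how strongly the very low precision dominates the harmonic mean.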

1. Introduction

2. Applying text mining to biological database curation

2.1. The curation process
2.2. Entity tagging: current approaches
2.3. Entity tagging and extraction systems in biology

3. Resources

3.1. FlyBase
3.2. MEDLINE abstracts
3.3. Evaluating gene name mentions
3.4. BioCreAtIvE evaluation datasets

4. Gene extraction and normalization experiments

4.1. Experiment A: using the lexicon to create a normalized list of fly gene mentions
4.1.1. Methodology
4.1.2. Analysis

4.2. Experiment B: machine learning to find gene names using noisy training data

4.2.1. Methodology
4.2.2. Analysis

4.3. Experiment C: using the HMM-based tagger and lexicon to create a normalized list of gene mentions
4.3.1. Methodology
4.3.2. Analysis
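The lexicon-based approach named in Experiment A, and summarized in the abstract as pattern matching against synonym lists followed by filtering, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the identifiers in `LEXICON` are placeholders rather than real FlyBase IDs, the `ID:004` entry is invented to create a shared synonym, and the two filters (a minimum synonym length and dropping synonyms shared by several genes) are generic examples, not necessarily the filters used in the paper.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: placeholder gene ID -> synonym list.
LEXICON = {
    "ID:001": ["wingless", "wg"],
    "ID:002": ["hedgehog", "hh"],
    "ID:003": ["white", "w"],
    "ID:004": ["placeholder gene", "wg"],  # invented entry sharing the synonym "wg"
}


def build_index(lexicon):
    """Invert the lexicon: lowercase synonym -> set of gene IDs that use it."""
    index = defaultdict(set)
    for gene_id, synonyms in lexicon.items():
        for syn in synonyms:
            index[syn.lower()].add(gene_id)
    return index


def match_genes(text, index, min_len=1, drop_ambiguous=False):
    """Return the set of gene IDs whose synonyms occur as whole words in text."""
    found = set()
    lowered = text.lower()
    for syn, ids in index.items():
        if len(syn) < min_len:
            continue  # illustrative filter: skip very short synonyms
        if drop_ambiguous and len(ids) > 1:
            continue  # illustrative filter: skip synonyms shared by several genes
        if re.search(r"\b" + re.escape(syn) + r"\b", lowered):
            found.update(ids)
    return found


index = build_index(LEXICON)
text = "Expression of wg and hedgehog was reduced in white mutant clones."

unfiltered = match_genes(text, index)
filtered = match_genes(text, index, min_len=3, drop_ambiguous=True)
print(sorted(unfiltered))
print(sorted(filtered))
```

In this toy run the unfiltered pass returns all four placeholder IDs, including the invented `ID:004`, while the filtered pass keeps only `ID:002` and `ID:003`. The filters remove the spurious match but also lose `ID:001`, which was mentioned only by its short symbol, a small-scale picture of the recall-for-precision trade-off the abstract describes.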

Author: Alexander A. Morgan, Lynette Hirschman, Marc E. Colosimo, Alexander S. Yeh, Jeff B. Colombe
Title: Gene Name Identification and Normalization Using a Model Organism Database
DOI: 10.1016/j.jbi.2004.08.010