2005 Suregene


Subject Headings: Document Classification, Supervised Learning, Named Entity Resolution Task, Entity Mention Grounding Task.





Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system that is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts is described. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure >0.7, nearly 60% of which were >0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system’s internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.
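The quality bands quoted in the abstract rest on the F-measure. The sketch below shows how such a score could be computed from raw counts and bucketed at the 0.7 and 0.9 cut-offs. It assumes the balanced (F1) form, which the abstract does not state explicitly, and the function names and band labels are illustrative rather than taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from raw counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def quality_band(f: float) -> str:
    """Bucket a model using the cut-offs quoted in the abstract (labels are hypothetical)."""
    if f > 0.9:
        return "high"
    if f > 0.7:
        return "good"
    return "poor"
```

For instance, a model with 8 true positives, 2 false positives, and 2 false negatives has precision and recall of 0.8 each, so it would fall in the "good" band but not the "high" one.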


Biological researchers are constantly hindered in their work by a lack of standard naming conventions for genes and proteins. Near-frivolous choices of gene synonyms result in gene names like “IT”, “midget”, or “ER”. These inherently ambiguous names cannot be effectively filtered by current search tools, nearly all of which are based on keyword queries. As a result, researchers must endure long, and sometimes fruitless, searches for literature about genes or proteins. Automated disambiguation of gene and protein names could significantly improve access to biological literature and increase the efficiency of text analytics in the biomedical domain.

We present a system, called SureGene, for performing automated term disambiguation that can easily scale to tens of thousands of unique gene and protein names. SureGene uses a combination of machine learning and natural language processing technologies to identify abstracts relevant to specific genes and return these results as a ranked list.

Over 20,000 human genes have been identified in LocusLink and over 100,000 different names have been used to refer to them. A gene disambiguation system that is truly useful to a wide range of researchers must address some key, heretofore unsolved, challenges:

  • It must scale to cover tens of thousands of genes and proteins per organism;
  • It must be able to automatically generate training data with minimal human intervention;
  • It must be able to make use of quantities of training data ranging from a few paragraphs to tens of thousands of paragraphs;
  • It must be able to make use of low-quality training data that hasn’t been annotated or enhanced with meta-data; and
  • It mustn’t rely on a comprehensive list of all possible gene and protein synonyms, since creating such a list is impractical.

SureGene addresses these gaps using a supervised learning system. This article presents test results that show SureGene is capable of accurately distinguishing between highly ambiguous gene terms, as well as between synonymous gene and non-gene terms.

Previous Work

Disambiguation tasks fall into two basic categories: determining if a term refers to a gene or gene product (does “PI” refer to “glutathione transferase” or “Permeability Index”); and identifying the true meaning of a synonymous gene name or abbreviation (does “PI” refer to “glutathione transferase” or “alpha-1-antitrypsin”). Both of these problems often elude keyword searches.

Data collection

Individual genes included in SureGene are defined by the LocusLink (LL) [12, 13] human gene set. Gene names, symbols and synonyms were collected from LL and SwissProt (SP) [14] databases. The system is designed to query and recognize gene or gene product context in MEDLINE abstracts.

The contextual information for the “AllGenes vs NotGene” (“AG-vs-NG”) model was collected from MEDLINE as discussed in Sec. 3.4. The contextual information for the “Gene vs OtherGene” (“G-vs-OG”) models was collected from the MEDLINE references in LocusLink and SwissProt, as well as from the gene/protein descriptions in those databases.

Training set selection

A pool of training documents for the AG-category was obtained by searching MEDLINE for articles containing one or more terms suggestive of gene or gene product context. Using a query composed of the terms “gene”, “genes”, “cDNA” and “mRNA”, we obtained 672,675 documents containing one or more of the terms in the title or abstract.
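As a rough illustration of the selection rule just described, the sketch below flags a document for the AG pool when any of the four query terms appears in its title or abstract. Whole-word, case-insensitive matching is an assumption (the text does not say how the MEDLINE query matched terms), and the names `AG_TERMS` and `in_ag_pool` are hypothetical.

```python
import re

# Query terms quoted in the text.
AG_TERMS = ("gene", "genes", "cDNA", "mRNA")

# Assumption: whole-word, case-insensitive matching.
AG_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in AG_TERMS) + r")\b",
    re.IGNORECASE,
)


def in_ag_pool(title: str, abstract: str) -> bool:
    """True if the title or abstract contains at least one AG query term."""
    return bool(AG_PATTERN.search(title) or AG_PATTERN.search(abstract))
```

The word boundaries keep a title such as "General anesthesia outcomes" out of the pool even though it contains the letters "gene".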

The NG-category training set document pool comprised MEDLINE documents that had at least 500 characters of text and did not contain terms from a stoplist. This stoplist included terms such as: “gene”, “protein”, “cDNA”, “mRNA”, “kinase”, “receptor”, “amino acid”, “encode”, “subunit”, “express”, “pathway”, “repress”, “inhibit”, “transcript”, “oncogene” and “oncoprotein” as well as plurals and other variants. This gave a final pool of 4.5 million documents. The AG and NG category document pools were then used to obtain random subsets for the final classifier training sets (see Secs. 3.4.3 and 3.5).
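The NG pool filter described above can be sketched in the same way: keep a document only if it is long enough and contains none of the stoplist terms. The prefix match used here to approximate "plurals and other variants" is an assumption, as are the names `STOP_STEMS` and `in_ng_pool`; the paper's actual stoplist and matching rules may differ.

```python
import re

MIN_CHARS = 500  # minimum text length quoted in the text

# Stoplist terms quoted in the text, lower-cased.
STOP_STEMS = (
    "gene", "protein", "cdna", "mrna", "kinase", "receptor", "amino acid",
    "encode", "subunit", "express", "pathway", "repress", "inhibit",
    "transcript", "oncogene", "oncoprotein",
)

# Assumption: matching each term as a word prefix stands in for "plurals and
# other variants". This over-excludes (e.g. "general" matches "gene"), which
# shrinks the NG pool but does not let gene-related text into it.
STOP_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in STOP_STEMS) + r")",
    re.IGNORECASE,
)


def in_ng_pool(text: str) -> bool:
    """True if the text is long enough and contains no stoplist term."""
    return len(text) >= MIN_CHARS and STOP_PATTERN.search(text) is None
```

Erring toward exclusion seems a reasonable design choice for a negative pool drawn from millions of documents, since discarding a few harmless abstracts costs little.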

The AG document set was hand-curated to clean it further, focusing on the abstracts that the initial AG-vs-NG model scored as most like the NG category. The NG model was reviewed and found to be sufficiently accurate that no further steps were necessary. Because of the bias between the general nature of the NG training set and the focused nature of the AG training set, the FN rate for the AG documents was 0.996 (at the point where FN = FP), which is an exceedingly accurate model, especially given that the ratio of documents in MEDLINE is estimated to be 1:2.5, as seen in Sec. 3.4.3.
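The operating point "FN = FP" mentioned above can be located by sweeping a decision threshold over classifier scores. The sketch below is one way to do that, assuming each document has a real-valued score and that documents scoring at or above the threshold are predicted positive. The function name and the scoring convention are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple


def break_even_threshold(
    scores: Sequence[float], labels: Sequence[bool]
) -> Tuple[float, int, int]:
    """Return (threshold, fn, fp) for the threshold that minimises |FN - FP|.

    Candidate thresholds are the observed scores; ties keep the lowest one.
    """
    best = None
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        if best is None or abs(fn - fp) < best[0]:
            best = (abs(fn - fp), t, fn, fp)
    if best is None:
        raise ValueError("no scores given")
    _, threshold, fn, fp = best
    return threshold, fn, fp
```

With discrete scores an exact tie between FN and FP may not exist, which is why the sketch returns the closest point rather than requiring equality.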


The problems resulting from ambiguous gene and protein names have caused enormous difficulties in biomedical text mining as well as in simple text searches for gene-related information. The algorithms presented here provide a scalable system for disambiguating gene and protein names for a variety of purposes. One can use it to improve text searches against the literature by tagging all potential gene names in the literature with their canonical forms. Much more accurate NLP systems for gene and protein relation extraction will be possible given accurate disambiguation. The results show that when more than 20 abstracts per gene are available for training, accuracy of the system is mostly over 90%. Even with 5 documents we usually get significant enrichment of the search results. The system can easily be tuned dynamically to provide greater precision or recall by altering the thresholds associated with the gene disambiguation models.
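The precision/recall trade-off mentioned at the end of this paragraph can be illustrated with a small sketch. It assumes each model emits a real-valued score per abstract and that an abstract is accepted when its score meets the threshold; the function name and the toy data in the usage note are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when abstracts with score >= threshold are accepted."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    # Convention: an empty denominator yields 1.0 (nothing to get wrong).
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

On a toy set with scores 0.9, 0.8, 0.6, 0.4, 0.3, 0.1 and relevance labels True, True, False, True, False, False, a threshold of 0.35 gives precision 0.75 and recall 1.0, while raising it to 0.7 gives precision 1.0 at the cost of missing one relevant abstract.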

The next step (in process) is developing a central, publicly available web service to allow researchers to access this system when searching the literature for specific genes or proteins. The users of the public system will be able to provide performance feedback and additional training data if the gene of interest has too little training data to yield accurate disambiguation results, or if the existing training data displays an unexpected bias (such as that found in LLID 796 as documented in Sec. 4.2). Although models will initially have small numbers of training documents, training can be quickly bootstrapped as users submit feedback on initial predictions. In this way, additional training data can be collected in a scalable manner based on distributed feedback/annotation. Further, genes of specific interest, such as pharmaceutically relevant genes (GPCRs, NHRs, kinases, etc.), can be enhanced in an organized way based on their family membership or if the gene shows a low F-measure.


  • Liu H, Johnson SB, Friedman C, Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS, J Am Med Inform Assoc 9:621–636, 2003.
  • MacNeil JS, What big pharma wants, Genome Tech 29:31–38, 2003.
  • Resnik P, David Yarowsky, Distinguishing systems and distinguishing senses: new evaluation methods for word sense disambiguation, Nat Lang Eng 5(3):113–133, 2000.
  • Aronson AR, Ambiguity in the UMLS Metathesaurus, National Library of Medicine, 2001.
  • Hatzivassiloglou V, Duboue PA, Rzhetsky A, Disambiguating proteins, genes, and RNA in text: A machine learning approach, Bioinformatics 1:1–10, 2001.
  • David Yarowsky, Word-sense disambiguation using statistical models of Roget’s categories trained on large corpora, In: Proceedings of the 14th International Conference on Computational Linguistics, 454–460, 2000.
  • Gale WA, Kenneth W. Church, David Yarowsky, A method for disambiguating word senses in a large corpus, Comp Human 26:415–439, 1993.
  • David Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, pp. 189–196, 1995.
  • Aronson AR, Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap program, Proc AMIA Symp, pp. 17–21, 2001.
  • Rindflesch T, Tanabe L, Weinstein J, Hunter L, EDGAR: extraction of drugs, genes and relations from the biomedical literature, In: Proceedings of the Pacific Symposium on Biocomputing 5:514–525, 2000, World Scientific, Singapore.
  • Yu H, Agichtein E, Extracting synonymous gene and protein terms from biological literature, Bioinformatics 19(Suppl. 1):i340–i349, 2003.
  • LocusLink Database, 2003, ftp://ftp.ncbi.nlm.nih.gov/refseq/LocusLink/LL tmpl.gz, accessed December 2004.
  • Pruitt KD, Maglott DR, RefSeq and LocusLink: NCBI gene-centered resources, Nucleic Acids Res 29(1):137–140, 2001.
  • O’Donovan C, Martin MJ, Gattiker A, Gasteiger E, Bairoch A, Apweiler R, High-quality protein knowledge resource: SWISS-PROT and TrEMBL, Brief Bioinform 3:275–284, 2002.
  • Reel Two Classification System, Reel Two Inc. San Francisco, CA, 2001–2004, http://www.reeltwo.com.
  • Mitchell T, Machine Learning, McGraw-Hill, 1997.
  • Thorsten Joachims, Learning to classify text using support vector machines. Dissertation, Kluwer, 2002.
  • Keerthi SS, DeCoste DM, A modified finite Newton method for fast solution of large scale linear SVMs, Yahoo! Research Labs Tech Report YRL-2004-037, 2004.
  • Abramowitz M, Stegun IA (eds.). Psi (digamma) function, in Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, Chap. 63, 9th printing, Dover, New York, pp. 258–259, 1972.
  • Weiss SM, Maximizing text-mining performance, IEEE Int Syst July/August, pp. 2–8, 1999.
  • Winkler WE, Machine learning, information retrieval, and record linkage, American Statistical Association, In: Proceedings of the Section on Survey Research Methods, pp. 20–29, 2000.
  • Aas K, Eikvil L, Text categorisation: a survey, Technical report: Norwegian Computing Center, June, 1999.
  • Hearst M et al., Support vector machines, IEEE Intelligent Systems, 13(4), July-August 1998.
  • Medical Subject Headings, National Library of Medicine, 1998, http://www.nlm.nih.gov/mesh/meshhome.html.



 author    = {Raf M. Podowski and
              John G. Cleary and
              Nicholas T. Goncharoff and
              Gregory Amoutzias and
              William S. Hayes},
 title     = {Suregene, a Scalable System for Automated Term Disambiguation
              of Gene and Protein Names.},
 journal   = {J. Bioinformatics and Computational Biology},
 volume    = {3},
 number    = {3},
 year      = {2005},
 pages     = {743-770},
 ee        = {http://dx.doi.org/10.1142/S0219720005001223},
 bibsource = {DBLP, http://dblp.uni-trier.de}

} ,

2005 Suregene
Author: Raf M. Podowski, John G. Cleary, Nicholas T. Goncharoff, Gregory Amoutzias, William S. Hayes
title: Suregene, a scalable system for automated term disambiguation of gene and protein names
titleUrl: http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=PubMed&list uids=16108092
doi: 10.1142/S0219720005001223