2008 TheStrengthOfCoAuthInGeneNameDisambig

From GM-RKB

Subject Headings: Gene Normalization Algorithm

Notes

Cited By

Quotes

Abstract

Background

A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of the gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine how a special feature of biological articles, namely that the authors of the documents are known, can be exploited for the GSD task through graph-based semi-supervised methods.

Results

Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias and this holds for the co-authors as well. To make use of the co-authorship information we decided to build the inverse co-author graph on MedLine abstracts. The nodes of the inverse co-author graph are articles and there is an edge between two nodes if and only if the two articles share at least one author. We introduce here two methods for the GSD task that use graph-based distances between abstracts. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate just by using information obtained from the inverse co-author graph. We incorporated the co-authorship information into two GSD systems in order to attain full coverage, and in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.

Conclusion

Based on the promising results obtained so far we suggest that the co-authorship information and the circumstances of the articles' release (like the title of the journal, the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles and hence the methods introduced here should be useful for other biomedical natural language processing tasks (like organism or target disease detection) as well.

Background

Biological articles provide a huge amount of information about genes, proteins, their behaviour under different conditions, and their interactions. Interest in handling huge amounts of unstructured data (free text) has grown along with the application of automatic Natural Language Processing (NLP) techniques to biomedical articles. Named Entity (NE) recognition is the first and crucial step of an Information Extraction (IE) system and a major building block of an Information Retrieval (IR) system as well.

The task of biological entity recognition is to identify and classify gene, protein and chemical names in biological articles [1]. Taken one step further, the goal of Gene Name Normalisation (GN) [2] is to assign a unique identifier to each gene name found in a text. The GN task is challenging for two main reasons. First, although synonym (alias) lists which map gene name variants to gene identifiers exist, like that given in [3], they are incomplete and do not contain all the spelling variants [4]. Second, one name can refer to different entities (for example, IL-21 can refer to the genes with EntrezGeneID 27189, 50616 or 59067). Chen et al. [5] investigated gene name ambiguity in a comprehensive empirical study and reported an average of 5% overlap on intra-species synonyms, and ambiguity rates of 13.4% and 1.1% on inter-species synonyms and against English words, respectively. In general, the Word Sense Disambiguation (WSD) approaches (for a comprehensive study, see [6]) are concerned with this crucial problem. Their goal is to select the correct sense of a term, from a well-defined sense inventory, according to its context. A special case of the WSD task is the Gene Symbol Disambiguation (GSD) [7] task, where the terms are gene names, the senses are genes referred to by unique identifiers and the contexts are biological articles.

There are several earlier studies on general biomedical disambiguation tasks like [8-10], to name but a few. Weeber et al. [8] manually annotated a UMLS-WSD corpus for supervised learning purposes. Savova et al. [9] introduced the utility of unlabeled data in general biomedical entity disambiguation. Their unsupervised approach looked for clusters among MedLine abstracts containing the target word, based on single word and bigram, first- and second-order co-occurrence information. Liu et al. [10] built a training set automatically for each target term based on the co-occurrences of unambiguous synonyms in other documents. They also noted that disambiguation in this domain has several features which distinguish it from the general English WSD task, mainly the granularity and nature of sense distinctions. In this paper we will examine the potential utilisation of another particular fact, namely that the authors of the documents are known.

When handling the GSD task, the AZuRE system [11] automatically assigns gene names to their LocusLink IDs based on the Naive Bayes model and contextual similarity. It extracted the training sets automatically from MedLine references in the LocusLink and SwissProt databases. Schijvenaars et al. [12] also generate the training set automatically from several existing databases. They build up their vector space from MeSH terms and gene names identified by string-matching, then apply disambiguation based on a cosine similarity metric. The ProMiner GN system [13] contains a disambiguation module as well. It utilises the synonyms of the target gene name which are present in the document of the test gene. In this study we present experimental results on the GSD datasets built by Xu et al. [14,15]. In [14] Xu and his colleagues took the words of the abstracts, the MeSH codes provided along with the MedLine articles, the words of the texts and some computer-tagged information (UMLS CUIs and biomedical entities) as features, while in [15] they experimented with the use of combinations of these features. They used these to obtain manually disambiguated instances (training data) and applied a vector space model with a cosine similarity measure between the abstracts in question and the gene profiles, which were in fact the centroids of the training instances. As they pointed out, there was no significant information gain from using the texts themselves along with the manually added MeSH codes, so we decided to use just these codes along with some novel features like author information and the year of publication.
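The profile-based approach described above (a vector space model with cosine similarity between an abstract and per-gene centroids) can be illustrated with a minimal Python sketch. It assumes each abstract is reduced to a bag of MeSH-like codes with raw counts as weights; the function names, the gene labels and the weighting scheme are hypothetical choices for illustration and are not taken from [14,15].

```python
import math
from collections import Counter


def centroid(docs):
    """Average the bag-of-codes count vectors of one gene's training abstracts."""
    total = Counter()
    for codes in docs:
        total.update(codes)
    return {code: count / len(docs) for code, count in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(test_codes, training):
    """Return the gene whose centroid profile is most similar to the test abstract.

    training maps a gene label to a list of abstracts, each given as a
    list of codes (hypothetical toy representation).
    """
    test_vector = Counter(test_codes)
    profiles = {gene: centroid(docs) for gene, docs in training.items()}
    return max(profiles, key=lambda gene: cosine(test_vector, profiles[gene]))


# Toy, made-up training data for one ambiguous alias with two candidate genes.
toy_training = {
    "geneA": [["Cytokines", "T-Lymphocytes"], ["Cytokines", "Interleukins"]],
    "geneB": [["Neurons", "Brain"]],
}
print(disambiguate(["Cytokines", "T-Lymphocytes"], toy_training))
```

A real system would likely weight the codes (for example with tf-idf) instead of using raw counts; the sketch only shows the centroid and cosine steps.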

The GSD datasets for yeast, fly and mouse were generated using MedLine abstracts and the Entrez 'gene2pubmed' file [3], which is manually disambiguated [14]. The dataset for human genes was derived [15] from the training and evaluation sets of the BioCreative II GN task [16].

Our main idea here is that an author uses gene names consistently, that is, they employ a gene name to refer exclusively to one gene in their publications; hence the co-authorship between articles may contain very useful information. In this study we built an inverse co-author graph on MedLine abstracts and introduced two methods based on the graph for the GSD task. Our methods utilise unlabelled instances (which are not manually tagged with gene meanings) by looking for paths in the graph; thus they can be regarded as a semi-supervised approach lying between supervised techniques (e.g. vector space based similarity models) and fully unsupervised techniques.

Results

The inverse co-author graph

Generalising the hypothesis that an author habitually uses a gene name to refer exclusively to one gene, we can assume that the same holds true for the co-authors of the biologist in question. But what is the situation for the co-authors of the co-authors? To answer this question, and to utilise the information obtained from co-authorship in the GSD problem, we decided to use the so-called co-author graph [17]. The co-author graph represents the relationship between authors. The nodes of the graph are authors, while the edges represent joint publications. In the GSD task we basically look for an appropriate distance (or similarity) metric between pairs of abstracts, hence we define the inverse co-author graph as a graph whose nodes are abstracts from MedLine (we usually just used their PMID and not their actual text) and there is an undirected edge between two nodes if and only if the intersection of their author sets is not empty.
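The definition above translates directly into code. The following Python sketch builds the inverse co-author graph from a mapping of PMIDs to author lists and then, as one possible way of "looking for paths in the graph", runs a breadth-first search from a test abstract to the nearest labelled abstracts. The function names, the `max_hops` limit and the toy data are assumptions made for illustration; the search is not claimed to be either of the two methods of the paper. The sketch also assumes that identical name strings denote the same person, which ignores author name ambiguity.

```python
from collections import defaultdict
from itertools import combinations


def build_inverse_coauthor_graph(authors_by_pmid):
    """Nodes are PMIDs; two PMIDs are adjacent iff their author sets intersect."""
    pmids_by_author = defaultdict(set)
    for pmid, authors in authors_by_pmid.items():
        for author in authors:
            pmids_by_author[author].add(pmid)
    graph = {pmid: set() for pmid in authors_by_pmid}
    # Every pair of abstracts sharing an author gets an undirected edge.
    for pmids in pmids_by_author.values():
        for a, b in combinations(sorted(pmids), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def nearest_labelled_genes(graph, start, gene_by_pmid, max_hops=3):
    """Breadth-first search outward from `start`.

    Returns (labels, hops): the set of gene labels found at the smallest
    hop distance, or (set(), None) if no labelled abstract is reachable
    within max_hops (hypothetical cut-off).
    """
    seen = {start}
    frontier = [start]
    for hops in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        labels = {gene_by_pmid[p] for p in next_frontier if p in gene_by_pmid}
        if labels:
            return labels, hops
        if not next_frontier:
            break
        frontier = next_frontier
    return set(), None


# Toy, made-up data: p1 is the unlabelled test abstract.
toy_authors = {
    "p1": ["Smith J"],
    "p2": ["Smith J", "Lee K"],
    "p3": ["Lee K"],
    "p4": ["Chen L"],
}
toy_graph = build_inverse_coauthor_graph(toy_authors)
print(nearest_labelled_genes(toy_graph, "p1", {"p3": "geneA", "p4": "geneB"}))
```

When the returned label set contains exactly one gene, a decision can be made; an empty or mixed set would correspond to abstaining and leaving the case to another system.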

Conclusion

In this paper we examined and experimentally demonstrated the utility of co-authorship analysis for the GSD task. Our hypothesis was that a biologist refers to exactly one gene by a fixed gene alias, and in experiments we found evidence for this. Moreover, we found that a disambiguation decision can be made in 85% of the cases with an extremely high precision rate (99.5%) by just using information obtained from the inverse co-author graph. If we need to build a GSD system with full coverage, we can incorporate the co-authorship information into the system and by doing so eliminate about half of the errors of the original system.
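The figures quoted above distinguish coverage (the share of cases where a decision is made) from precision (the share of made decisions that are correct). The short Python sketch below shows how these two quantities can be computed when a graph-based component may abstain, and one simple way to reach full coverage by falling back to another system. The fallback rule and all names are assumptions for illustration; the source does not specify that the co-authorship information was incorporated in this way.

```python
def combine(graph_decision, fallback_decision):
    """Use the graph-based decision when there is one, else the fallback system's."""
    return graph_decision if graph_decision is not None else fallback_decision


def coverage_and_precision(predictions, gold):
    """predictions may contain None for cases where no decision was made."""
    decided = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(decided) / len(gold)
    correct = sum(1 for p, g in decided if p == g)
    precision = correct / len(decided) if decided else 0.0
    return coverage, precision


# Toy, made-up example: the graph component abstains on the second case.
graph_predictions = ["geneA", None, "geneB", "geneA"]
fallback_predictions = ["geneA", "geneB", "geneA", "geneB"]
gold_labels = ["geneA", "geneB", "geneB", "geneB"]

combined = [combine(g, f) for g, f in zip(graph_predictions, fallback_predictions)]
print(coverage_and_precision(graph_predictions, gold_labels))
print(coverage_and_precision(combined, gold_labels))
```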

Based on the promising results obtained so far from our study, we suppose that for abstracts the co-authorship information, the circumstances of the article's release (the journal, the year of publication) and the graph constructed above can all be crucial building blocks for a sophisticated similarity measure among biological articles, and therefore the methods introduced here ought to be useful for other biomedical natural language processing tasks as well. For example, we can reasonably assume that a biologist or a group of biologist co-authors usually works on the same particular species. Hence a co-author graph-based method could be a powerful tool for identifying the organism that an article deals with. In addition, all text classification and clustering tasks can achieve better results with a sophisticated similarity measure. Besides the biological named entity disambiguation tasks (which are also document classification tasks), other candidate tasks include target disease identification and protocol detection.

References

  • 1. Yeh AS, Hirschman L, Morgan AA: Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. CoRR 2003, cs.CL/0308032.
  • 2. Hirschman L, Colosimo ME, Morgan A, Yeh A: Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics 2005, 6(Suppl 1):S11.
  • 3. Maglott DR, Ostell J, Pruitt KD, Tatusova TA: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research 2007, (35 Database):26-31.
  • 4. Hakenberg J: What's in a gene name? Automated refinement of gene name dictionaries. Biological, translational, and clinical language processing. Prague, Czech Republic: Association for Computational Linguistics; 2007, 153-160.
  • 5. Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 2005, 21(2):248-256. http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics21.html#ChenLF05
  • 6. Agirre E, Edmonds P (Eds): Word Sense Disambiguation: Algorithms and Applications, Volume 33 of Text, Speech and Language Technology. Springer; 2006.
  • 7. Xu H, Markatou M, Dimova R, Liu H, Friedman C: Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues. BMC Bioinformatics 2006, 7:334.
  • 8. Weeber M, Mork J, Aronson A: Developing a test collection for biomedical word sense disambiguation. Proc AMIA Symp 2001, 746-750.
  • 9. Savova G, Pedersen T, Purandare A, Kulkarni A: Resolving Ambiguities in Biomedical Text with Unsupervised Clustering Approaches. Research Report UMSI 2005/80 and CB Number 2005/21, University of Minnesota Supercomputing Institute; 2005.
  • 10. Liu H, Lussier YA, Friedman C: Disambiguating Ambiguous Biomedical Terms in Biomedical Narrative Text: An Unsupervised Method. Journal of Biomedical Informatics 2001, 34(4):249-261.
  • 11. Podowski RM, Cleary JG, Goncharoff NT, Amoutzias G, Hayes WS: AZuRE, a Scalable System for Automated Term Disambiguation of Gene and Protein Names. In CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04). Washington, DC, USA: IEEE Computer Society; 2004:415-424.
  • 12. Schijvenaars B, Mons B, Weeber M, Schuemie M, van Mulligen E, Wain H, Kors J: Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics 2005, 6.
  • 13. Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1).
  • 14. Xu H, Fan JW, Hripcsak G, Mendonca EA, Markatou M, Friedman C: Gene symbol disambiguation using knowledge-based profiles. Bioinformatics 2007, 23(8):1015-1022.
  • 15. Xu H, Fan JW, Friedman C: Combining multiple evidence for gene symbol disambiguation. Biological, translational, and clinical language processing. Prague, Czech Republic: Association for Computational Linguistics; 2007, 41-48.
  • 16. Morgan A, Wellner B, Colombe J, Arens R, Colosimo ME, Hirschman L: Evaluating the automatic mapping of human gene and protein mentions to unique identifiers. Pac Symp Biocomput 2007.
  • 17. Barabasi AL, Jeong H, Neda Z, Ravasz E, Schubert A, Vicsek T: Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications 2002, 311(3–4):590-614.
  • 18. Quinlan JR: C4.5: Programs for Machine Learning. Morgan Kaufmann; 1993.
  • 19. Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann; 1999.


Page: 2008 TheStrengthOfCoAuthInGeneNameDisambig
Author: Richárd Farkas
Title: The strength of co-authorship in gene name disambiguation
Journal: BMC Bioinformatics
Title URL: http://www.biomedcentral.com/1471-2105/9/69/abstract/
DOI: 10.1186/1471-2105-9-69
Year: 2008