2011 TheGeneNormTaskInBioCreIII

(Lu, Kao et al., 2011) ⇒ Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Minlie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard Tzong-Han Tsai, Hong-Jie Dai, Naoaki Okazaki, Han-Cheol Cho, Martin Gerner, Illes Solt, Shashank Agarwal, Feifan Liu, Dina Vishnyakova, Patrick Ruch, Martin Romacker, Fabio Rinaldi, Sanmitra Bhattacharya, Padmini Srinivasan, Hongfang Liu, Manabu Torii, Sergio Matos, David Campos, Karin Verspoor, Kevin M Livingston and W. John Wilbur. (2011). “The Gene Normalization Task in BioCreative III.” In: BMC Bioinformatics, 12(Suppl 8). doi:10.1186/1471-2105-12-S8-S2

Subject Headings: BioCreative III Gene Normalization Task, Summary Paper.

Notes

Cited By

Quotes

Abstract

Background

We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
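
The abstract does not describe the EM procedure in detail. As a rough illustration only, the sketch below assumes a simple binary observer-error model in the spirit of Dawid and Skene (reference 13): each team has an unknown sensitivity and specificity, and each candidate gene identifier has an unknown true label. The function name `infer_truth`, the helper `clip`, and the toy vote matrix are hypothetical and are not taken from the paper. This is not the task organizers' implementation.

```python
import math


def clip(value, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so that log() stays finite."""
    return min(max(value, eps), 1.0 - eps)


def infer_truth(votes, iterations=50):
    """votes[i][j] is 1 if hypothetical team j returned candidate i, else 0.

    Returns (posteriors, sensitivities, specificities).
    """
    n_items, n_teams = len(votes), len(votes[0])
    # Initialise each posterior with the fraction of teams voting "yes".
    post = [sum(row) / n_teams for row in votes]
    sens, spec = [], []
    for _ in range(iterations):
        # M-step: re-estimate the prior and each team's error rates.
        total_pos = sum(post)
        total_neg = n_items - total_pos
        prior = clip(total_pos / n_items)
        sens, spec = [], []
        for j in range(n_teams):
            tp = sum(p * row[j] for p, row in zip(post, votes))
            tn = sum((1 - p) * (1 - row[j]) for p, row in zip(post, votes))
            sens.append(clip(tp / max(total_pos, 1e-6)))
            spec.append(clip(tn / max(total_neg, 1e-6)))
        # E-step: posterior that each candidate is a true annotation.
        post = []
        for row in votes:
            log_pos, log_neg = math.log(prior), math.log(1 - prior)
            for j, x in enumerate(row):
                log_pos += math.log(sens[j] if x else 1 - sens[j])
                log_neg += math.log(1 - spec[j] if x else spec[j])
            diff = log_neg - log_pos
            post.append(0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff)))
    return post, sens, spec


# Toy data (assumed): teams 0 and 1 agree with each other, team 2 is noisy.
demo_votes = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 1],
]
demo_post, demo_sens, demo_spec = infer_truth(demo_votes)
```

In this toy setting the two mutually consistent teams end up with higher estimated sensitivity and specificity than the noisy one, and the posteriors act as an inferred ground truth against which every team can be scored.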

Results

We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
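
The reported improvements are relative gains of the composite system over the best single team at each k. The following short check recomputes them from the TAP-k scores quoted above; the variable and function names are illustrative only.

```python
# TAP-k scores on the 50 gold-standard articles, as quoted in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new, old):
    """Percentage gain of `new` over `old`."""
    return 100.0 * (new - old) / old


# Relative gain per k, rounded to one decimal place.
improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
```

Rounded to one decimal place, the recomputed values agree with the 12.4%, 21.8%, and 26.6% stated in the abstract.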

Conclusions

By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.

List of abbreviations used

GN: Gene Normalization; EM: Expectation Maximization; TAP: Threshold Average Precision; MOD: Model Organism Database; BMC: BioMed Central; PLoS: Public Library of Science; PMC: PubMed Central; NCBI: National Center for Biotechnology Information; NLM: National Library of Medicine; UAG: User Advisory Group; AP: Average Precision; SVM: Support Vector Machine; CRF: Conditional Random Fields; GNR: Gene Name Recognition; GOCat: Gene Ontology Categorizer; CLKB: Cell Line Knowledge Base; NER: Named Entity Recognition; KNoGM: Knowledge-based Normalization of Gene Mentions; WSD: Word Sense Disambiguation.

References

1. Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, et al.: Overview of BioCreative II gene normalization. Genome Biol 2008, 9(Suppl 2):S3.
2. Hirschman L, Colosimo M, Morgan A, Yeh A: Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics 2005, 6(Suppl 1):S11.
3. Colosimo ME, Morgan AA, Yeh AS, Colombe JB, Hirschman L: Data preparation and interannotator agreement: BioCreAtIvE task 1B. BMC Bioinformatics 2005, 6(Suppl 1):S12.
4. Dowell KG, McAndrews-Hill MS, Hill DP, Drabkin HJ, Blake JA: Integrating text mining into the MGI biocuration workflow. Database (Oxford) 2009, 2009:bap019.
5. Carroll HD, Kann MG, Sheetlin SL, Spouge JL: Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics. Bioinformatics 2010, 26(14):1708-1713.
6. Snow R, O'Connor B, Jurafsky D, Ng AY: Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii: Association for Computational Linguistics; 2008.
7. Sheng VS, Provost F, Ipeirotis PG: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, Nevada, USA: ACM; 2008.
8. Donmez P, Carbonell JG, Schneider J: Efficiently learning the accuracy of labelling sources for selective sampling. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France: ACM; 2009.
9. Whitehill J, Ruvolo P, Wu T, Bergsma J, Movellan J: Whose vote should count more: optimal integration of labels from labelers of unknown expertise. Advances in Neural Information Processing Systems 2009, 2035-2043.
10. Welinder P, Perona P: Online crowdsourcing: rating annotators and obtaining cost-effective labels. Workshop on Advancing Computer Vision with Humans in the Loop at CVPR'10 2010.
11. Smyth P, Fayyad U, Burl M, Perona P, Baldi P: Inferring ground truth from subjective labelling of Venus images. Advances in Neural Information Processing Systems 1995, 7:1085-1092.
12. Raykar VC, Yu S, Zhao LH, Valadez GH, Florin C, Bogoni L, Moy L: Learning From Crowds. Journal of Machine Learning Research 2010, 11:1297-1322.
13. Dawid AP, Skene AM: Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society Series C (Applied Statistics) 1979, 28(1):20-28.
14. Rebholz-Schuhmann D, Yepes AJ, Van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Buyko E, Beisswanger E, Hahn U: CALBC silver standard corpus. J Bioinform Comput Biol 2010, 8(1):163-179.
15. Kappeler T, Kaljurand K, Rinaldi F: TX task: automatic detection of focus organisms in biomedical publications. In: Proceedings of the Workshop on BioNLP. Boulder, Colorado: Association for Computational Linguistics; 2009.
16. Wang X, Tsujii J, Ananiadou S: Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics 2010, 26(5):661-667.
17. Lewis DD: Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In: Proceedings of the 10th European Conference on Machine Learning. Springer-Verlag; 1998:4-15.
18. McCallum A, Nigam K: A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization 1998, 41-48.
19. Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A: An Overview of BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform 2010, 7(3):385-399.
20. Smith L, Tanabe LK, Ando RJ, Kuo CJ, Chung IF, Hsu CN, Lin YS, Klinger R, Friedrich CM, Ganchev K, et al.: Overview of BioCreative II gene mention recognition. Genome Biol 2008, 9(Suppl 2):S2.
21. Zhang T: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning. Banff, Alberta, Canada: ACM; 2004.
22. Classias: A collection of machine-learning algorithms for classification [1]
23. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl 2009, 11(1):10-18.
24. MALLET: MAchine Learning for LanguagE Toolkit [2]
25. Gerner M, Nenadic G, Bergman CM: LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 2010, 11:85.
26. Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21(14):3191-3192.
27. Hsu CN, Chang YM, Kuo CJ, Lin YS, Huang HS, Chung IF: Integrating high dimensional bi-directional parsing models for gene mention tagging. Bioinformatics 2008, 24(13):i286-294.
28. Leaman R, Gonzalez G: BANNER: an executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput 2008, 652-663.
29. Hakenberg J, Plake C, Royer L, Strobelt H, Leser U, Schroeder M: Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biol 2008, 9(Suppl 2):S14.
30. NERsuite: A Named Entity Recognition toolkit [3]
31. LingPipe 4.0.0 [4]
32. Entrez Gene [5]
33. Gene and Protein Synonym DataBase [6]
34. Lindberg C: The Unified Medical Language System (UMLS) of the National Library of Medicine. J Am Med Rec Assoc 1990, 61(5):40-42.
35. Gene Ontology Annotation (UniProtKB-GOA) Database [7]
36. Cell Line Knowledge Base [8]
37. Sarntivijai S, Ade AS, Athey BD, States DJ: A bioinformatics analysis of the cell line nomenclature. Bioinformatics 2008, 24(23):2760-2766.
38. Apache Lucene [9]
39. Liu H, Hu ZZ, Zhang J, Wu C: BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics 2006, 22(1):103-105.
40. GOCat – Gene Ontology Categorizer [10]
41. GenNorm [11]
42. Huang M, Liu J, Zhu X: GeneTUKit: a software for document-level gene normalization. Bioinformatics 2011, 27(7):1032-1033.
43. IASL-IISR Gene Mention/Normalization Tool [12]
44. Dai HJ, Lai PT, Tsai RTH: Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles. IEEE/ACM Trans Comput Biol Bioinform 2010, 7(3):412-420.
45. Lu Z, Wilbur WJ: Overview of BioCreative III Gene Normalization. In: Proceedings of the BioCreative III workshop. Bethesda, MD, USA; 2010:24-45.
46. Ruch P: Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 2006, 22(6):658-664.
47. Rinaldi F, Kappeler T, Kaljurand K, Schneider G, Klenner M, Clematide S, Hess M, von Allmen JM, Parisot P, Romacker M, et al.: OntoGene in BioCreative II. Genome Biol 2008, 9(Suppl 2):S13.
48. Rinaldi F, Schneider G, Kaljurand K, Clematide S, Vachon T, Romacker M: OntoGene in BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform 2010, 7(3):472-480.
49. Bhattacharya S, Sehgal AK, Srinivasan P: Cross-species Gene Normalization at the University of Iowa. In: Proceedings of the BioCreative III workshop. Bethesda, MD, USA; 2010:55-59.
50. Matos S, Campos D, Oliveira JL: Vector-space models and terminologies in gene normalization and document classification. In: Proceedings of the BioCreative III Workshop. Bethesda, MD, USA; 2010:119-124.
51. Agirre E, Soroa A: Personalizing PageRank for word sense disambiguation. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Athens, Greece: Association for Computational Linguistics; 2009:33-41.
52. Haveliwala TH: Topic-sensitive PageRank. In: Proceedings of the 11th International Conference on World Wide Web. Honolulu, Hawaii, USA: ACM; 2002.
53. Brin S, Page L: The anatomy of a large-scale hypertextual Web search engine. In: Proceedings of the seventh International Conference on World Wide Web 7. Brisbane, Australia: Elsevier Science Publishers B. V.; 1998.
54. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
55. Turner B, Razick S, Turinsky AL, Vlasblom J, Crowdy EK, Cho E, Morrison K, Donaldson IM, Wodak SJ: iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database (Oxford) 2010, 2010:baq023.
56. HomoloGene [13]
57. Liu H, Hu ZZ, Torii M, Wu C, Friedman C: Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc 2006, 13(5):497-507.
58. Schwartz AS, Hearst MA: A simple algorithm for identifying abbreviation definitions in biomedical text. Pac Symp Biocomput 2003, 451-462.

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2011 TheGeneNormTaskInBioCreIII	W. John Wilbur; Patrick Ruch; Fabio Rinaldi; Hongfang Liu; Tzong-han Tsai; Zhiyong Lu; Karin Verspoor; Chun-Nan Hsu; Cheng-Ju Kuo; Hung-Yu Kao; Chih-Hsuan Wei; Minlie Huang; Jingchen Liu; Hong-Jie Dai; Naoaki Okazaki; Han-Cheol Cho; Martin Gerner; Illes Solt; Shashank Agarwal; Feifan Liu; Dina Vishnyakova; Martin Romacker; Sanmitra Bhattacharya; Padmini Srinivasan; Manabu Torii; Sergio Matos; David Campos; Kevin M Livingston			The Gene Normalization Task in BioCreative III		BMC Bioinformatics	http://www.biomedcentral.com/1471-2105/12/S8/S2	10.1186/1471-2105-12-S8-S2		2011