2005 ConsolidHumanProteinProteinInteract

From GM-RKB

Subject Headings: Semantic Relation Extraction Algorithm, Protein-Protein Interaction, Log Odds Ratio, PPLRE Project

Notes

Cited By

Quotes

Abstract

Background
Extensive protein interaction maps are being constructed for yeast, worm, and fly to ask how the proteins organize into pathways and systems, but no such genome-wide interaction map yet exists for the set of human proteins. To prepare for studies in humans, we wished to establish tests for the accuracy of future interaction assays and to consolidate the known interactions among human proteins.
Results
We established two tests of the accuracy of human protein interaction datasets and measured the relative accuracy of the available data. We then developed and applied natural language processing and literature-mining algorithms to recover from Medline abstracts 6,580 interactions among 3,737 human proteins. A three-part algorithm was used: first, human protein names were identified in Medline abstracts using a discriminator based on conditional random fields, then interactions were identified by the co-occurrence of protein names across the set of Medline abstracts, filtering the interactions with a Bayesian classifier to enrich for legitimate physical interactions. These mined interactions were combined with existing interaction data to obtain a network of 31,609 interactions among 7,748 human proteins, accurate to the same degree as the existing datasets.
Conclusion
These interactions and the accuracy benchmarks will aid interpretation of current functional genomics data and provide a basis for determining the quality of future large-scale human protein interaction assays. Projecting from the approximately 15 interactions per protein in the best-sampled interaction set to the estimated 25,000 human genes implies more than 375,000 interactions in the complete human protein interaction network. This set therefore represents no more than 10% of the complete network.
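The projection in the conclusion can be checked with a few lines of arithmetic. This sketch only reproduces the multiplication stated in the abstract; the variable names are illustrative and not from the paper.

```python
# Figures taken from the abstract; variable names are illustrative only.
interactions_per_protein = 15        # approximate, from the best-sampled interaction set
estimated_human_genes = 25_000
consolidated_interactions = 31_609   # size of the combined network

# Projected size of the complete network, as stated in the abstract.
projected_total = interactions_per_protein * estimated_human_genes

# Fraction of the projected network covered by the consolidated set.
coverage = consolidated_interactions / projected_total

print(projected_total)
print(f"{coverage:.1%}")
```

The resulting coverage falls below the 10% ceiling quoted above.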

Results

Assembling existing public protein interaction data

Benchmarking of protein interaction data

To measure the relative accuracy of each protein interaction dataset, we established two benchmarks of interaction accuracy, one based on shared protein function and the other based on previously known interactions. First, we constructed a benchmark in which we tested the extent to which interaction partners in a dataset shared annotation, a measure previously shown to correlate with the accuracy of functional genomics datasets [13,14,21]. We used the functional annotations listed in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [30] and Gene Ontology (GO) [31] annotation databases. These databases provide specific pathway and biological process annotations for approximately 7,500 human genes, assigning human genes into 155 KEGG pathways (at the lowest level of KEGG) and 1,356 GO pathways (at level 8 of the GO biological process annotation). KEGG and GO annotations were combined into a single composite functional annotation set, which was then split into independent testing and training sets by randomly assigning annotated genes into the two categories (3,792 and 3,809 annotated genes respectively). For the second benchmark based on known physical interactions, we assembled the human protein interactions from Reactome and BIND, a set of 11,425 interactions between 1,710 proteins. Each benchmark therefore consists of a set of binary relations between proteins, either based on proteins sharing annotation or physically interacting. Generally speaking, we expect more accurate protein interaction datasets to be more enriched in these protein pairs. More specifically, we expect true physical interactions to score highly on both tests, while non-physical or indirect associations, such as genetic associations, should score highly on the functional, but not the physical interaction, test.
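A minimal sketch of how the functional benchmark could be assembled, assuming each gene maps to a set of annotation terms. The helper names (`split_genes`, `shared_annotation_pairs`) and the toy identifiers are hypothetical and do not come from the paper, which does not describe its implementation at this level of detail.

```python
import random
from itertools import combinations

def split_genes(genes, seed=0):
    """Randomly assign genes to two roughly equal groups (hypothetical helper)."""
    shuffled = sorted(genes)               # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])

def shared_annotation_pairs(annotations, genes):
    """Return unordered gene pairs within `genes` that share at least one term."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(genes), 2)
        if annotations[a] & annotations[b]
    }

# Toy composite annotation set (invented identifiers, not real KEGG or GO data).
annotations = {
    "geneA": {"pathway1"},
    "geneB": {"pathway1", "process7"},
    "geneC": {"process7"},
    "geneD": {"pathway2"},
}

group_one, group_two = split_genes(annotations)
all_pairs = shared_annotation_pairs(annotations, annotations)
print(sorted(sorted(pair) for pair in all_pairs))
```

Each resulting pair is one binary relation of the benchmark. The same pair representation works for the physical interaction benchmark, where the pairs come directly from the interaction databases.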

For both benchmarks, the scoring scheme for measuring interaction set accuracy is in the form of a log odds ratio of gene pairs either sharing annotations or physically interacting. To evaluate a dataset, we calculate a log likelihood ratio (LLR) as:

[math]\displaystyle{ LLR = \ln \left( \frac{P(D \vert I)}{P(D \vert \sim I)} \right) }[/math]

where [math]\displaystyle{ P(D\vert I) }[/math] and [math]\displaystyle{ P(D\vert \sim I) }[/math] are the probabilities of observing the data (D) conditioned on the genes sharing benchmark associations (I) and not sharing benchmark associations (~I), respectively. By Bayes' theorem, this equation can be rewritten as:

[math]\displaystyle{ LLR = \ln \left( \frac{P(I \vert D)/P(\sim I \vert D)}{P(I)/P(\sim I)} \right) }[/math]

where [math]\displaystyle{ P(I\vert D) }[/math] and [math]\displaystyle{ P(\sim I\vert D) }[/math] are the frequencies of interactions observed in the given dataset (D) between annotated genes sharing benchmark associations (I) and not sharing associations (~I), respectively, while P(I) and P(~I) represent the prior expectations (the total frequencies of all benchmark genes sharing the same associations and not sharing associations, respectively). This latter version of the equation is simpler to compute. A score of zero indicates interaction partners in the dataset being tested are no more likely than random to belong to the same pathway or to interact; higher scores indicate a more accurate dataset.
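The second form of the equation reduces to counting pairs, which the following sketch illustrates. It assumes interactions are unordered gene pairs and that the prior is taken over all possible pairs of benchmark genes. The function names and toy data are hypothetical, and this is not the authors' code.

```python
import math

def count_pairs(dataset_pairs, benchmark_pairs, benchmark_genes):
    """Count dataset pairs between benchmark genes that do / do not share an association."""
    benchmark = {frozenset(p) for p in benchmark_pairs}
    shared = not_shared = 0
    for pair in {frozenset(p) for p in dataset_pairs}:   # de-duplicate (A,B) and (B,A)
        if len(pair) == 2 and pair <= benchmark_genes:   # skip self-pairs and unannotated genes
            if pair in benchmark:
                shared += 1
            else:
                not_shared += 1
    return shared, not_shared

def llr_from_counts(d_shared, d_not_shared, all_shared, all_not_shared):
    """LLR = ln( [P(I|D)/P(~I|D)] / [P(I)/P(~I)] ), with each odds formed from raw counts."""
    posterior_odds = d_shared / d_not_shared
    prior_odds = all_shared / all_not_shared
    return math.log(posterior_odds / prior_odds)

# Toy benchmark: 5 genes, 2 of the 10 possible pairs share an association.
benchmark_genes = {"A", "B", "C", "D", "E"}
benchmark_pairs = [("A", "B"), ("C", "D")]
total_pairs = math.comb(len(benchmark_genes), 2)
prior_shared = len(benchmark_pairs)
prior_not_shared = total_pairs - prior_shared

# Toy dataset under test; ("A", "Z") is ignored because Z is not a benchmark gene.
dataset_pairs = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("A", "Z")]
d_shared, d_not_shared = count_pairs(dataset_pairs, benchmark_pairs, benchmark_genes)

llr = llr_from_counts(d_shared, d_not_shared, prior_shared, prior_not_shared)
print(d_shared, d_not_shared, round(llr, 3))
```

Because each odds is a ratio of two frequencies with the same denominator, the denominators cancel and raw counts suffice. The sketch does not handle zero counts, which would need smoothing or an explicit guard in practice.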

Discussion

Shortcomings and strengths of literature mining via the co-citation/Bayesian classifier approach

The co-citation approach [14,26,40] calculates the probability that two protein names co-occur by chance. The assumption is that if the co-citation is statistically unlikely under the random model, then there is a true underlying reason for the proteins to be co-cited: they interact at the functional or pathway level, are co-localized, or physically interact. The method has both advantages and disadvantages. It does not extract all interactions, but only those with statistically significant co-citations.

Intuition: proteins co-occurring in a large number of abstracts tend to be interacting proteins. (from paper presentation)

Compute the probability of co-citation under a random model (hypergeometric distribution): [math]\displaystyle{ P(k \vert N,n,m) = \frac{\binom{n}{k}\binom{N-n}{m-k}}{\binom{N}{m}} }[/math] where [math]\displaystyle{ N }[/math] is the total number of abstracts (750K), [math]\displaystyle{ n }[/math] is the number of abstracts citing the first protein, [math]\displaystyle{ m }[/math] is the number of abstracts citing the second protein, and [math]\displaystyle{ k }[/math] is the number of abstracts citing both proteins. (from paper presentation)
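The hypergeometric probability above can be evaluated directly with exact integer binomial coefficients. This is a minimal sketch: the function name `cocitation_probability` and the example counts are assumptions for illustration, and only the formula and the 750K corpus size come from the text.

```python
from math import comb

def cocitation_probability(k, N, n, m):
    """P(k | N, n, m): chance that exactly k abstracts cite both proteins
    when n of N abstracts cite the first and m cite the second."""
    # Outside this range the count k is impossible, so the probability is zero.
    if k < max(0, n + m - N) or k > min(n, m):
        return 0.0
    # Exact integer binomials; Python converts the final ratio to a float.
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)

# Small check: 10 abstracts, 4 cite the first protein, 3 cite the second.
p_small = cocitation_probability(2, N=10, n=4, m=3)
total = sum(cocitation_probability(k, 10, 4, 3) for k in range(0, 4))
print(p_small, total)

# Corpus-scale illustration with invented citation counts (not from the paper).
p_large = cocitation_probability(20, N=750_000, n=200, m=150)
print(p_large)
```

A very small probability for the observed k means the co-citation is unlikely under the random model, which is the signal the approach relies on. How the probability is thresholded or turned into a score is not specified here.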

References

  • 1. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98:4569-4574.
  • 2. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403:623-627.
  • 3. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415:141-147.
  • 4. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415:180-183.
  • 5. Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, et al.: Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 2001, 294:2364-2368.
  • 6. Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al.: Global mapping of the yeast genetic interaction network. Science 2004, 303:808-813.
  • 7. Gabaldon T, Huynen MA: Prediction of protein function and pathways in the genome era. Cell Mol Life Sci 2004, 61:930-944.
  • 8. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature 2000, 405:823-826.
  • 9. Huynen MA, Snel B, von Mering C, Bork P: Function prediction and protein networks. Curr Opin Cell Biol 2003, 15:191-198.
  • 10. Mellor JC, Yanai I, Clodfelter KH, Mintseris J, DeLisi C: Predictome: a database of putative functional links between proteins. Nucleic Acids Res 2002, 30:306-309.
  • 11. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302:449-453.
  • 12. Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1:349-356.
  • 13. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale datasets of protein-protein interactions. Nature 2002, 417:399-403.
  • 14. Lee I, Date SV, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes. Science 2004, 306:1555-1558.
  • 15. Mrowka R, Patzak A, Herzel H: Is there a bias in proteome research? Genome Res 2001, 11:1971-1973.
  • 16. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al.: A protein interaction map of Drosophila melanogaster. Science 2003, 302:1727-1736.
  • 17. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al.: A map of the interactome network of the metazoan C. elegans. Science 2004, 303:540-543.
  • 18. Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 2003, 31:248-250.
  • 19. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30:303-305.
  • 20. Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Muthusamy B, Gandhi TK, Chandrika KN, Deshpande N, Suresh S, et al.: Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 2004, 32(Database):D497-501.
  • 21. Lehner B, Fraser AG: A first-draft human protein-interaction map. Genome Biol 2004, 5:R63.
  • 22. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33(Database):D428-432.
  • 23. Reactome database [1] R40.12 Genome Biology 2005, Volume 6, Issue 5, Article R40 Ramani et al. http://genomebiology.com/2005/6/5/R40
  • 24. Bouwmeester T, Bauch A, Ruffner H, Angrand PO, Bergamini G, Croughton K, Cruciat C, Eberhard D, Gagneur J, Ghidelli S, et al.: A physical and functional map of the human TNF-alpha/NFkappa B signal transduction pathway. Nat Cell Biol 2004, 6:97-105.
  • 25. Colland F, Jacq X, Trouplin V, Mougin C, Groizeleau C, Hamburger A, Meil A, Wojcik J, Legrain P, Gauthier JM: Functional proteomics mapping of a human signaling pathway. Genome Res 2004, 14:1324-1332.
  • 26. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001, 28:21-28.
  • 27. Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, Morris M, Yu H, Duboue PA, Weng W, Wilbur WJ, et al.: GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 2004, 37:43-53.
  • 28. Liu H, Wong L: Data mining tools for biological sequences. J Bioinform Comput Biol 2003, 1:139-167.
  • 29. Hirschman L, Park JC, Tsujii J, Wong L, Wu CH: Accomplishments and challenges in literature data mining for biology. Bioinformatics 2002, 18:1553-1561.
  • 30. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database):D277-280.
  • 31. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
  • 32. Bunescu R, Ge R, Kate R, Marcotte EM, Mooney RJ, Ramani AK, Wong YW: Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intell Med 2005 in press. doi:10.1016/j.artmed.2004.07.016
  • 33. Franzen K, Eriksson G, Olsson F, Asker L, Liden P, Coster J: Protein names and how to find them. Int J Med Inform 2002, 67:49-61.
  • 34. Fukuda K, Tamura A, Tsunoda T, Takagi T: Toward information extraction: identifying protein names from biological papers. Pac Symp Biocomput 1998:707-718.
  • 35. Tanabe L, Wilbur WJ: Tagging gene and protein names in biomedical text. Bioinformatics 2002, 18:1124-1132.
  • 36. Marcotte EM, Xenarios I, Eisenberg D: Mining literature for protein-protein interactions. Bioinformatics 2001, 17:359-363.
  • 37. ID-Serve [2]
  • 38. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5:101-113.
  • 39. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 2004, 431:931-945.
  • 40. Stapley BJ, Benoit G: Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput 2000:529-540.
  • 41. Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings 18th Int Conf Machine Learning (ICML 2001) Edited by: Danyluk A. San Francisco: Morgan Kaufmann; 2001.
  • 42. Brill E: Transformation-based error driven learning and natural language processing: A case study in parts of speech tagging. Comput Linguistics 1995, 21:543-565.
  • 43. McCallum A: MALLET: A Machine Learning for Language Toolkit 2002 [3].
  • 44. Gene Ontology database [4]
  • 45. KEGG Encyclopedia [5]
  • 46. Adai AT, Date SV, Wieland S, Marcotte EM: LGL: creating a map of protein function with an algorithm for visualizing very large biological networks. J Mol Biol 2004, 340:179-190.


2005 ConsolidHumanProteinProteinInteract
Author: Arun K. Ramani, Razvan C. Bunescu, Edward M. Marcotte, Raymond J. Mooney
Title: Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome
Journal: Genome Biology
URL: http://genomebiology.com/content/pdf/gb-2005-6-5-r40.pdf
DOI: 10.1186/gb-2005-6-5-r40
Year: 2005