2006 SigImprovPredOfSCLByIntegTextAndSeqData

(Hoglund et al., 2006) ⇒ Annette Hoglund, Torsten Blum, Scott Brady, Pierre Donnes, John San Miguel, Matthew Rocheford, Oliver Kohlbacher, Hagit Shatkay. (2006). “Significantly Improved Prediction of Subcellular Localization by Integrating Text and Protein Sequence Data.” In: Pacific Symposium on Biocomputing, vol.11.

Subject Headings: Subcellular Protein Localization.

Notes

Cited By

Quotes

Abstract

Computational prediction of protein subcellular localization is a challenging problem. Several approaches have been presented during the past few years; some attempt to cover a wide variety of localizations, while others focus on a small number of localizations and on specific organisms. We present a comprehensive system, integrating protein sequence-derived data and text-based information. It is tested on three large data sets, previously used by leading prediction methods. The results demonstrate that our system performs significantly better than previously reported results, for a wide range of eukaryotic subcellular localizations.

Introduction

Knowing a protein’s localization helps elucidate its function, its role in both healthy processes and the onset of disease, and its potential use as a drug target. Experimental methods for protein localization range from immunolocalization to tagging of proteins using green fluorescent protein (GFP) and isotopes. Such methods are accurate but, even at their best, are slow and labor-intensive compared with large-scale computational methods. Computational tools for predicting localization are useful for a large-scale initial “triage”, especially for proteins whose amino acid sequence may be determined from the genomic sequence but that are hard to produce, isolate, or locate experimentally.
Several recent publications have examined the possibility of using text to support subcellular localization. Specifically, Stapley et al. represented yeast proteins as vectors of weighted terms from all the PubMed articles mentioning their respective genes. They then trained a support vector machine (SVM) on the protein text-vectors to distinguish among subcellular localizations. The performance was favorable when compared to a classifier trained on amino acid composition alone, but it was not compared against any state-of-the-art localization system, and the reported results do not suggest an improvement over earlier systems. Moreover, while their text-based classifier performed better than an amino acid composition classifier, combining the two forms of data did not significantly improve performance with respect to the text-based classifier alone.

Text-based method

The idea underlying the text-based classifier is the representation of each protein as a vector of weighted text features. While text-based localization has been presented before, the key differences between the current work and previous ones are in the text source used, the feature selection, and the term weighting scheme.
First, for each protein the text comes from the abstracts curated for the protein in its Swiss-Prot entry. We used a script that scanned each protein in Swiss-Prot for all the PubMed identifiers occurring in its Swiss-Prot entry, and obtained the respective title and abstract from PubMed. Each protein is thus assigned a set of PubMed abstracts, based on Swiss-Prot. This choice of abstracts is different from that of Stapley et al., who used all the PubMed abstracts mentioning the gene’s name, and from that of Nair and Rost, who used Swiss-Prot annotation text rather than PubMed abstracts. The assigned abstracts are then tokenized into a set of terms, consisting of single words and pairs of consecutive words, with a list of standard stop words excluded from consideration. The results reported here also include the application of Porter stemming to all the words in the terms.
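The tokenization step described above can be illustrated with a minimal Python sketch. It is not the authors' script: the stop-word list is a tiny hypothetical stand-in, the function names `extract_terms` and `protein_terms` are invented for illustration, and Porter stemming is left out to keep the example dependency-free. It also assumes that a word pair is kept only when the two words are adjacent in the abstract and neither is a stop word, which the text does not specify.

```python
import re

# Tiny illustrative stop-word list (an assumption; the actual list is not given).
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "to", "by", "for", "with"}


def extract_terms(abstract):
    """Return the set of single-word and consecutive-word-pair terms of one abstract."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    terms = set()
    for i, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        terms.add(word)
        # Keep a pair only if the next word exists and is not a stop word.
        if i + 1 < len(words) and words[i + 1] not in STOP_WORDS:
            terms.add(word + " " + words[i + 1])
    return terms


def protein_terms(abstracts):
    """Union of the terms from all abstracts assigned to one protein."""
    terms = set()
    for abstract in abstracts:
        terms |= extract_terms(abstract)
    return terms
```

In a fuller pipeline, each word would also be passed through a Porter stemmer before the terms are assembled, as the paragraph above notes.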
Second, from all the extracted terms, we select a subset of distinguishing terms. This is done by scoring each term with respect to each subcellular localization, where the score reflects the probability that the term occurs in abstracts associated with proteins of that localization. Intuitively, a term is distinguishing for a localization if it is much more likely to occur in abstracts associated with that localization than in abstracts associated with any other localization.
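The selection criterion can be sketched as follows. This is one possible reading, not the paper's scoring procedure: the probability of a term for a localization is estimated as the fraction of that localization's abstracts containing the term, and "much more likely" is modeled by a hypothetical `ratio` threshold that must hold against every other localization. The function names and the default `ratio=2.0` are assumptions made for illustration.

```python
from collections import Counter


def term_probabilities(abstract_terms_by_loc):
    """Estimate P(term | localization) as the fraction of abstracts containing the term.

    abstract_terms_by_loc maps a localization name to a list of term sets,
    one set per abstract associated with proteins of that localization.
    """
    probs = {}
    for loc, abstracts in abstract_terms_by_loc.items():
        if not abstracts:
            probs[loc] = {}
            continue
        counts = Counter(term for terms in abstracts for term in terms)
        probs[loc] = {term: n / len(abstracts) for term, n in counts.items()}
    return probs


def distinguishing_terms(abstract_terms_by_loc, ratio=2.0):
    """Select, per localization, the terms at least `ratio` times more probable
    there than in every other localization (ratio is a hypothetical threshold)."""
    probs = term_probabilities(abstract_terms_by_loc)
    selected = {loc: set() for loc in probs}
    for loc, loc_probs in probs.items():
        for term, p in loc_probs.items():
            others = [probs[o].get(term, 0.0) for o in probs if o != loc]
            if all(p >= ratio * q for q in others):
                selected[loc].add(term)
    return selected
```

With this reading, a term that occurs about equally often across localizations (for example a generic word such as "protein") is never selected, while a term concentrated in one localization is. A practical version would likely also require a minimum number of occurrences, since a term seen in a single abstract trivially passes the ratio test.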

References

1. Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G.: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 300 (2000) 1005–1016
2. Nair, R., Rost, B.: Inferring sub-cellular localization through automated lexical analysis. Bioinformatics 18 (2002) S78–S86
3. Gardy, J.L., Spencer, C., Wang, K. et al.: PSORT-B: Improving protein subcellular localization prediction for gram-negative bacteria. Nucleic Acids Research 31 (2003) 137–140

4. Cai, Y.D., Chou, K.C.: Predicting 22 protein localizations in budding yeast. Biochem Biophys Res Commun. 323 (2004) 425–428
5. Schneider, G., Fechner, U.: Advances in the prediction of protein targeting signals. Proteomics 4 (2004) 1571–1580
6. Dönnes, P., Höglund, A.: Predicting Protein Subcellular Localization: Past, Present, and Future. Genomics, Proteomics, and Bioinformatics 2 (2004)
7. Burns, N., Grimwade, B., Ross-Macdonald, P., Choi, E., Finberg, K., Roeder, G.S., Snyder, M.: Large-scale analysis of gene expression, protein localization and gene disruption in Saccharomyces cerevisiae. Genes and Development 8 (1994) 1087–1105
8. Hanson, M.R., Köhler, R.H.: GFP imaging: Methodology and application to investigate cellular compartmentation in plants. Journal of Experimental Botany 52 (2001)
9. Dunkley, T., Watson, R., Griffin, J., Dupree, P., Lilley, K.: Localization of organelle proteins by isotope tagging (LOPIT). Molecular and Cellular Proteomics 3 (2004)
10. Nakai, K., Kanehisa, M.: Expert system for predicting protein localization sites in gram-negative bacteria. Proteins: Structure, Function and Genetics 11 (1991) 95–110
11. Nakai, K., Kanehisa, M.: A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 14 (1992) 897–911
12. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization of proteins. In: Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB). (1996)
13. Horton, P., Nakai, K.: Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In: Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB). (1997)

14. Emanuelsson, O., Nielsen, H., von Heijne, G.: Chlorop, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science 8 (1999) 978–984
15. * (Bannai et al., 2002) ⇒ Hideo Bannai, Yoshinori Tamada, Osamu Maruyama, Kenta Nakai and Satoru Miyano. (2002). “Extensive feature detection of N-terminal protein sorting signals.” In: Bioinformatics, 18(2).
16. Nair, R., Rost, B.: Mimicking cellular sorting improves prediction of subcellular localization. J Mol Biol. 348 (2005) 85–100

17. Stapley, B.J., Kelley, L.A., Sternberg, M.J.E.: Predicting the subcellular location of proteins from text using support vector machines. In: Proceedings of the Pacific Symposium on Biocomputing (PSB). (2002) 374–385
18. Eskin, E., Agichtein, E.: Combining text mining and sequence analysis to discover protein functional regions. In: Proceedings of the 9th Pacific Symposium on Biocomputing (PSB). (2004) 288–299
19. Park, K.J., Kanehisa, M.: Prediction of protein subcellular location by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 19 (2003) 1656–1663
20. Höglund, A., Dönnes, P., Blum, T., Adolph, H., Kohlbacher, O.: Using N-terminal targeting sequences, amino acid composition, and sequence motifs for predicting protein subcellular localization. German Conference on Bioinformatics (GCB) 2005.
21. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines (2003) http://www.csie.ntu.edu.tw/ clin/libsvm/.
22. Wu, T.F., Lin, C.J., Weng, R.C.: Probability Estimates for Multi-class Classification by Pairwise Coupling. Journal of Machine Learning Research 5 (2004) 975–1005
23. Bairoch, A., Bucher, P.: PROSITE: recent developments. Nucleic Acids Res. 22 (1994) 3583–3589
24. Cokol, M., Nair, R., Rost, B.: Finding nuclear localization signals. EMBO Rep. 1 (2000) 411–415
25. Nair, R., Carter, P., Rost, B.: NLSdb: database of nuclear localization signals. Nucleic Acids Res. 31 (2003) 397–399
26. Porter, M.F.: An Algorithm for Suffix Stripping (Reprint). In: Readings in Information Retrieval. Morgan Kaufmann (1997) http://www.tartarus.org/ martin/PorterStemmer/.
27. Walpole, R.E., Myers, R.H., Myers, S.L. In: One- and Two-Sample Tests of Hypotheses. (1998) 235–335
28. Bairoch, A., Apweiler, R.: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28 (2000) 45–48

29. Matthews, B.W.: Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 405 (1975) 442–451

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2006 SigImprovPredOfSCLByIntegTextAndSeqData	Annette Hoglund Torsten Blum Scott Brady Pierre Donnes John San Miguel Matthew Rocheford Oliver Kohlbacher Hagit Shatkay			Significantly Improved Prediction of Subcellular Localization by Integrating Text and Protein Sequence Data		Pacific Symposium on Biocomputing	http://helix-web.stanford.edu/psb06/hoglund.pdf			2006