2002 InferringSCLThroughAutoLexAnalysis


Subject Headings: Protein Localization Information Extraction Task.

Notes

Cited By

Quotes

Abstract

Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is available for only a few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.

Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for fewer than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

Availability: Annotations of localization for eukaryotes at: http://cubic.bioc.columbia.edu/services/LOCkey
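
The abstract describes inferring localization from functional keywords but does not spell out the algorithm. The following is only a minimal sketch of the general idea (keyword-to-localization voting), not the LOCkey method itself; the training pairs, the label names, and the functions `build_index` and `infer_localization` are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (keyword set, known localization).
# These pairs are illustrative assumptions, not taken from SWISS-PROT.
TRAINING = [
    ({"Transcription regulation", "DNA-binding"}, "nuclear"),
    ({"DNA-binding", "Zinc-finger"}, "nuclear"),
    ({"Glycolysis", "Kinase"}, "cytoplasmic"),
    ({"Signal", "Hormone"}, "extracellular"),
]


def build_index(examples):
    """Count how often each keyword co-occurs with each localization."""
    index = defaultdict(Counter)
    for keywords, localization in examples:
        for keyword in keywords:
            index[keyword][localization] += 1
    return index


def infer_localization(keywords, index):
    """Sum per-keyword counts and return the top-scoring localization.

    Returns None when no keyword is known or when the top two
    localizations tie, mirroring the idea that some proteins lack
    enough annotation to support an inference.
    """
    votes = Counter()
    for keyword in keywords:
        votes.update(index.get(keyword, {}))
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


INDEX = build_index(TRAINING)
```

In this sketch a protein annotated only with "DNA-binding" would be assigned "nuclear", while a protein whose keywords never appear in the training data gets no assignment, which loosely parallels the coverage limit the abstract reports.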



2002 InferringSCLThroughAutoLexAnalysis
Author: Rajesh Nair, Burkhard Rost
Title: Inferring Sub-cellular Localization Through Automated Lexical Analysis
Journal: Bioinformatics Subject Area
Title URL: http://cubic.bioc.columbia.edu/papers/2002 loci text/paper.pdf
Year: 2002