2006 DiscoveringSemanticFeaturesInTheLiterature

= Background

  • Experimental techniques such as DNA microarrays, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As with any other experimental approach, these data must be analyzed in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

= Results

  • We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pairwise similarities among genes, to classify genes or proteins, or even in combination with experimental measurements. Semantic features can help researchers understand the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis that can identify local patterns characterizing a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide a coherent justification for clustering them into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.
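
The factorization step described above can be illustrated with a minimal sketch. It assumes a small, made-up gene-by-term count matrix and uses the standard multiplicative update rules for the squared-error objective; the gene names, terms, counts, number of features and iteration count are all hypothetical, and the authors' actual pipeline (term weighting, NMF variant, choice of rank) may differ.

```python
import random

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (genes x terms) into
    w (genes x k) and h (k x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def reconstruction_error(v, w, h):
    """Squared Frobenius distance between v and the product w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb))

# Hypothetical toy data: rows are genes, columns are term counts.
genes = ["geneA", "geneB", "geneC", "geneD"]
terms = ["ribosome", "translation", "kinase", "phosphorylation"]
v = [
    [4, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 1, 2],
]

w, h = nmf(v, k=2)

# Each row of h is a semantic feature (weights over terms);
# each row of w is a gene's profile (weights over features).
for f, weights in enumerate(h):
    top_term = terms[max(range(len(terms)), key=weights.__getitem__)]
    print(f"feature {f}: top term = {top_term}")
for gene, profile in zip(genes, w):
    print(gene, [round(x, 3) for x in profile])
```

Because both factors are constrained to be non-negative, each gene is described as an additive combination of features, which is what makes the features readable as groups of co-occurring terms.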

= Conclusion

  • The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.


  • We have presented a method that is able to discover semantic features from the analysis of literature relevant to sets of genes. The representation of genes as additive linear combinations of basis semantic features allows for the exploration of functional associations as well as clustering. We anticipate the potential use of our method for the validation and interpretation of high-throughput experimental data, as well as for the analysis of any genome-wide information.
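
As a rough illustration of how such additive profiles support pairwise comparison and grouping, the sketch below computes cosine similarity between gene profiles and groups genes by their strongest feature. The profile values are invented for the example, and grouping by dominant feature is a simplifying assumption made here, not necessarily the clustering procedure used in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-negative profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dominant_feature(profile):
    """Index of the semantic feature with the largest coefficient."""
    return max(range(len(profile)), key=profile.__getitem__)

# Hypothetical gene profiles: coefficients over two semantic features.
profiles = {
    "geneA": [0.9, 0.1],
    "geneB": [0.8, 0.2],
    "geneC": [0.1, 0.7],
}

# Pairwise similarity between genes.
names = list(profiles)
for i, g1 in enumerate(names):
    for g2 in names[i + 1:]:
        print(g1, g2, round(cosine(profiles[g1], profiles[g2]), 3))

# Simple grouping: assign each gene to its strongest feature.
clusters = {}
for gene, profile in profiles.items():
    clusters.setdefault(dominant_feature(profile), []).append(gene)
print(clusters)
```

Cosine similarity ignores the overall magnitude of a profile, so genes with many and few associated documents can still be compared on the mix of features they express.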




  • Page: 2006 DiscoveringSemanticFeaturesInTheLiterature
  • Author: Monica Chagoyen, Pedro Carmona-Saez, Hagit Shatkay, Jose M Carazo, Alberto Pascual-Montano
  • Title: Discovering semantic features in the literature: a foundation for building functional associations
  • Journal: BMC Bioinformatics
  • URL: http://www.biomedcentral.com/1471-2105/7/41
  • Year: 2006