Subject Headings: Document Vector, PUBMED Abstract, Gene Function, Co-occurrence Relation Extraction, Gene Mention.
- Research in bioinformatics in the past decade has generated a large volume of textual biological data stored in databases such as MEDLINE. It takes a copious amount of effort and time, even for expert users, to manually extract useful information embedded in such a large volume of retrieved data, and automated intelligent text analysis tools are increasingly becoming essential. In this article, we present a simple analysis and knowledge discovery method that can identify related genes as well as their shared functionality (if any) based on a collection of relevant retrieved MEDLINE documents. The relative computational simplicity of the proposed method makes it possible to process and analyze large volumes of data in a short time. Hence, it significantly contributes to and enhances a user’s ability to discover such embedded information. Two case studies are presented that indicate the usefulness of the proposed method.
Text Document Representation
- The document representation step converts text documents into structures that can be efficiently processed without the loss of vital content. At the core of this process is a thesaurus, an array T of atomic tokens (e.g., a single term) each identified by a unique numeric identifier culled from authoritative sources or automatically
- The purpose of the document representation step is to convert each document into a weight vector whose dimension equals the number of terms in the thesaurus.
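- The conversion of documents into thesaurus-sized weight vectors can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text only says "weight vector", so the tf-idf weighting, the regex tokenizer, and the names `tokenize` and `document_vectors` are all assumptions, and the thesaurus is assumed to hold lowercase single-token terms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vectors(documents, thesaurus):
    """Return one weight vector per document, one entry per thesaurus term.

    Assumption: weights are tf-idf (term count times log inverse document
    frequency). The source does not specify the weighting scheme.
    """
    n_docs = len(documents)
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Document frequency: how many documents contain each thesaurus term.
    doc_freq = [sum(1 for c in counts if c[term] > 0) for term in thesaurus]

    vectors = []
    for c in counts:
        vector = []
        for term, df in zip(thesaurus, doc_freq):
            idf = math.log(n_docs / df) if df else 0.0
            vector.append(c[term] * idf)
        vectors.append(vector)
    return vectors


# Hypothetical gene names and abstracts, for illustration only.
thesaurus = ["gene1", "gene2", "binds"]
abstracts = ["GENE1 binds GENE2.", "GENE2 is induced.", "No genes mentioned."]
vectors = document_vectors(abstracts, thesaurus)
```

Every vector has the same length as the thesaurus, so documents can be compared position by position regardless of how long the original text was.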
- In this section, we describe how the document vectors can be used to identify gene-pair relationships. The goal is to discover pairs of genes from a collection of retrieved text documents such that the genes in each pair are related to one another in some manner. This is similar, in spirit, to the problem of association rule discovery, extensively studied in the database mining literature. However, there are differences between gene-association discovery and association rule discovery in databases:
- (i) Association rule discovery is frequently based on transaction records, stored in specific formats; whereas the gene relationships are discovered from natural language text.
- (ii) Database association rule discovery is commonly based on the frequencies of individual items and the joint frequencies of item pairs. In the context of a text document, these parameters are insufficient.
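- One way to go beyond raw item and pair frequencies is to score gene pairs directly from the weighted document vectors. The sketch below sums, over all documents, the product of the two genes' weights and keeps pairs above a threshold. The scoring formula, the threshold, and the names `gene_pair_scores` and `related_pairs` are assumptions made for illustration; the excerpt above does not give the paper's exact measure.

```python
from itertools import combinations


def gene_pair_scores(vectors, gene_names):
    """Score each gene pair by summing the product of their weights per document.

    vectors[i][k] is the weight of gene_names[k] in document i.
    Assumption: this product-sum score is illustrative only.
    """
    scores = {}
    for k, l in combinations(range(len(gene_names)), 2):
        pair = (gene_names[k], gene_names[l])
        scores[pair] = sum(v[k] * v[l] for v in vectors)
    return scores


def related_pairs(scores, threshold):
    """Return pairs scoring at least `threshold`, strongest first."""
    kept = [pair for pair, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: -scores[pair])


# Hypothetical weights for three genes across three documents.
genes = ["A", "B", "C"]
weights = [
    [1.0, 2.0, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
pair_scores = gene_pair_scores(weights, genes)
strong_pairs = related_pairs(pair_scores, threshold=1.0)
```

A pair scores highly only when both genes carry weight in the same documents, which is why a gene that is frequent on its own does not automatically appear related to every other gene.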
- Figure 5-1: The Thesaurus of Relationships
- "activates, activator" "inhibits, inhibitor" "phosphorylates" "binds, binding, complexes" "catalyst, catalyses" "hydrolysis, hydrolyzes" "cleaves" "adhesion" "donates" "regulates" "induces" "creates" "becomes" "transports" "exports" "releases" "suppresses, suppressors"
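- A thesaurus like the one in Figure 5-1 can be used to label the kind of relationship between two associated genes. The sketch below looks for relationship terms in sentences that mention both genes. It is a hedged illustration: the sentence splitter, the co-mention rule, and the names `RELATIONSHIPS` and `find_relationships` are assumptions, and only a subset of the figure's terms is included.

```python
import re
from collections import Counter

# Subset of the relationship thesaurus in Figure 5-1: canonical term -> variants.
RELATIONSHIPS = {
    "activates": ["activates", "activator"],
    "inhibits": ["inhibits", "inhibitor"],
    "phosphorylates": ["phosphorylates"],
    "binds": ["binds", "binding", "complexes"],
    "regulates": ["regulates"],
    "induces": ["induces"],
    "suppresses": ["suppresses", "suppressors"],
}


def find_relationships(text, gene_a, gene_b):
    """Count relationship terms in sentences that mention both genes."""
    found = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if gene_a.lower() in tokens and gene_b.lower() in tokens:
            for relation, variants in RELATIONSHIPS.items():
                if tokens.intersection(variants):
                    found[relation] += 1
    return found


# Hypothetical abstract text with placeholder gene names.
abstract = (
    "GENE1 binds GENE2 in vitro. "
    "GENE2 is an inhibitor of an unrelated enzyme. "
    "Later, GENE1 activates GENE2."
)
relations = find_relationships(abstract, "GENE1", "GENE2")
```

Restricting the search to sentences containing both gene names keeps relationship terms that refer to other entities, such as the second sentence here, from being attributed to the pair.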
Future Directions and Enhancements
- The association and functional relationship discovery algorithms described in this paper are based on the information contained in the documents retrieved from MEDLINE. As pointed out in section 6, the abstracts extracted from MEDLINE in some cases lacked specific information concerning gene functions. One way to remedy this problem is to access and analyze full documents, rather than only abstracts. The retrieved collection can also be augmented by accessing other text-based collections, such as the Online Mendelian Inheritance in Man (OMIM) collection, in addition to MEDLINE. The sequence databases (e.g., GenBank) also contain functional information about genes, which can be utilized. Finally, predicting functions of genes from sequences using computational models (e.g., Hidden Markov Models or Neural Networks) is an important and ongoing international effort. Accessing sequence databases with the gene names to retrieve their sequence data and then applying the prediction models to that sequence data would help in finding new relationships through accurate computational models. The intersection of the sets of functions (either accessed from sequence databases or predicted from sequences) of associated genes can be a useful pointer for identifying the nature of the relationship.
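- The function-set intersection mentioned above amounts to a simple set operation once each gene has a set of function labels. The sketch below assumes those labels have already been collected (from a sequence database or a prediction model); the function name `shared_functions`, the gene names, and the labels are hypothetical.

```python
def shared_functions(functions_by_gene, gene_a, gene_b):
    """Return the function labels common to both genes (empty set if none)."""
    functions_a = functions_by_gene.get(gene_a, set())
    functions_b = functions_by_gene.get(gene_b, set())
    return functions_a & functions_b


# Hypothetical function annotations, for illustration only.
functions_by_gene = {
    "GENE1": {"protein transport", "ATP binding"},
    "GENE2": {"protein transport", "protein folding"},
    "GENE3": {"DNA repair"},
}
common = shared_functions(functions_by_gene, "GENE1", "GENE2")
```

An empty intersection does not rule out a relationship; it only means the available annotations give no hint about its nature.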
- Another useful future direction is to integrate association discovery tools with profile-based information filtering (IF) engines. Such biological IF systems retrieve documents on the basis of stable, long-term, automatically learned profiles of user interests, rather than specific user queries. Such an integrated filtering and analysis system will help the user keep up to date with evolving document and information collections.
- Pacific Symposium on Biocomputing 6:483-496 (2001).