2002 PredictingSCLfromTextUsingSVMs

(Stapley et al., 2002) ⇒ Ben J. Stapley, Lawrence A. Kelley, and Michael J. E. Sternberg. (2002). “Predicting the Sub-Cellular Location of Proteins from Text Using Support Vector Machines.” In: Pacific Symposium on Biocomputing, 7.

Subject Headings: Subcellular Localization Mention, Protein Localization Information Extraction Task.

Notes

Task: Given the members of the concept types, identify one-to-many relationships between their instances from text, based on a set of training examples. Specifically, the task is to report the implied sub-cellular localization of proteins from yeast (S. cerevisiae). Each protein has one localization, and each localization is targeted by many proteins.
Approach: For each instance of the entity in the one-to-many relationship (i.e. for each protein), create a vector whose elements mark the presence of a term (word) in the text. For some of these vectors, the single localization label is known; these serve as the training examples.
Input Data: Abstracts (as vector of words?)
Training Data: Positive examples
Experiments: Does not compare against another system (though there may be published numbers for this data set).
This approach avoids NLP, which suggests that NLP may be a good direction for future work.
Text relevant to these proteins is obtained from Medline by keyword matching of the 'gene naming terms'.
Any document that contained an occurrence of the gene name or aliases of that gene was considered relevant.
Used a variant of inverse document frequency (IDF) weighting, though they suggest, based on someone else's work, that the choice of transformation is not critical.
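The retrieval and weighting steps in these notes can be illustrated with a minimal Python sketch. It is not the authors' code: the toy abstracts, the function names, the whole-token case-insensitive matching, and the plain log(N/df) weighting are all assumptions made for illustration, since the notes only say that a variant of IDF was used.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased alphanumeric tokens; an assumed, deliberately simple tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

def relevant_docs(names, abstracts):
    # Keep every abstract that mentions the gene name or one of its aliases.
    wanted = {n.lower() for n in names}
    return [a for a in abstracts if wanted & set(tokenize(a))]

def idf_weights(abstracts):
    # Standard IDF, log(N / df); the paper's exact variant is not given in the notes.
    n = len(abstracts)
    df = Counter()
    for a in abstracts:
        df.update(set(tokenize(a)))
    return {term: math.log(n / count) for term, count in df.items()}

def term_vector(docs, vocabulary, idf):
    # One element per vocabulary term: its IDF weight if present in the docs, else 0.
    present = set()
    for d in docs:
        present.update(tokenize(d))
    return [idf[t] if t in present else 0.0 for t in vocabulary]

# Made-up abstracts, for illustration only.
abstracts = [
    "CDC28 kinase localizes to the nucleus",
    "HXT1 is a glucose transporter in the plasma membrane",
    "The nucleus imports Cdc28 during mitosis",
]
idf = idf_weights(abstracts)
vocabulary = sorted(idf)
docs = relevant_docs({"CDC28", "YBR160W"}, abstracts)
vector = term_vector(docs, vocabulary, idf)
print(len(docs), vector[vocabulary.index("nucleus")])
```

Terms that occur in every abstract (here "the") receive weight zero, and terms that never co-occur with the protein's names stay at zero in its vector, so only discriminative words associated with the protein contribute.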
They measured the recall and precision for each of the 11 localizations. Because each localization was supported by a different
number and ratio of positive to negative examples, they investigated a few other types of measures. (It is unfortunate that they
nonetheless did not report an overall F-score.)
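For reference, per-localization precision and recall, and the overall F-score that the notes find missing, could be computed as in the sketch below. The labels and predictions are invented, the function names are hypothetical, and macro-averaging is only one of several reasonable ways to define an overall score; none of this comes from the paper.

```python
from collections import Counter

def per_class_scores(gold, predicted):
    # Count true positives, false positives and false negatives per label.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_f1(scores):
    # Unweighted mean of per-class F1, so rare localizations count as much as common ones.
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Invented example: one localization label per protein.
gold = ["nucleus", "nucleus", "cytoplasm", "mitochondrion"]
predicted = ["nucleus", "cytoplasm", "cytoplasm", "mitochondrion"]
scores = per_class_scores(gold, predicted)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
print(f"macro F1 = {macro_f1(scores):.3f}")
```

A macro average treats each localization equally regardless of how many examples support it, which matters here because the classes differ in their number and ratio of positive to negative examples.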

Cited By

Quotes

Abstract

We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant medline abstracts. For each protein, a vector of terms is generated from medline abstracts in which the protein/gene's name or synonym occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and to thus discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. No prior knowledge of the problem domain nor any natural language processing is used at any stage. The method out-performs support vector machines trained on amino acid composition and has comparable performance to rule-based classifiers. Combining text with protein amino-acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.

References

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2002 PredictingSCLfromTextUsingSVMs	Ben J. Stapley Lawrence A. Kelley Michael J. E. Sternberg			Predicting the Sub-Cellular Location of Proteins from Text Using Support Vector Machines		Pacific Symposium on Biocomputing	http://psb.stanford.edu/psb-online/proceedings/psb02/stapley.pdf			2002