2009 WeightedRandomSubspaceMethodfor

From GM-RKB

Subject Headings: Random Subspace Method Algorithm, Weighted Random Subspace Method Algorithm.

Notes

Cited By

Quotes

Author Keywords

Abstract

High dimensional data, especially those emerging from genomics and proteomics studies, pose significant challenges to traditional classification algorithms because the performance of these algorithms may substantially deteriorate due to high dimensionality and the existence of many noisy features in these data. To address these problems, pre-classification feature selection and aggregating algorithms have been proposed. However, most feature selection procedures either fail to consider potential interactions among the features or tend to over-fit the data. The aggregating algorithms, e.g. the bagging predictor, the boosting algorithm, the random subspace method, and the Random Forests algorithm, are promising in handling high dimensional data. However, a lack of attention to optimal weight assignments for individual classifiers has prevented these algorithms from achieving better classification accuracy. In this article, we formulate the weight assignment problem and propose a heuristic optimization solution. We have applied the proposed weight assignment procedures to the random subspace method to develop a weighted random subspace method. Several public gene expression and mass spectrometry data sets from the Kent Ridge biomedical data repository have been analyzed by this novel method. We have found that significant improvement over the common equal weight assignment scheme may be achieved by our method.

1. INTRODUCTION

With rapid improvement of computing power and innovations in biological technology, huge amounts of biological data can be produced, recorded, stored and shared rather easily today. These data contain information that can be used to gain biological insight and to make further deductions or predictions based on the knowledge acquired. For the bioinformatics community, the microarray and mass spectrometry technologies have drawn great attention because of their abilities to measure the expression levels of thousands of genes or proteins simultaneously. Such information can be analyzed by statistical and data mining methods to identify disease related biomarkers and to understand the mechanism of the underlying biological processes, e.g. 5, 15. However, the high dimensionality of genomics and proteomics data and the existence of many noisy features can substantially deteriorate the performance of many traditional classification algorithms. For example, the linear discriminant analysis (LDA) and the quadratic discriminant analysis (QDA) are not directly applicable because they require the number of features be less than the number of samples which is usually not the case for gene expression and mass spectrometry data. The performances of the support vector machine (SVM) and the k-nearest-neighbor classifier (knn) also decay quickly with an increasing number of noisy features despite their ability to handle a large number of features. Therefore, various procedures have been proposed to tackle these problems posed by high dimensional data, including feature selection and individual classifier aggregation.

Pre-classification feature selection is commonly used to directly reduce the number of features: only those features likely to be differentially expressed among the different classes are selected for classification analysis. Feature selection methods fall into two categories, namely filters and wrappers, e.g. 8, 12. The filter approach evaluates the discriminant power of each feature individually and neglects potential interactions among the features. The wrapper approach selects the combination of features that has the best performance on the training set, which may cause over-fitting when the sample size is small relative to the number of features. In practice, although the feature selection step helps to eliminate the noisy features, it is also likely to eliminate potentially informative features from the data. Therefore, the information in the original data may not be fully utilized due to the removal of informative features. In addition, feature selection increases the chance of over-fitting by constructing a classifier on a small fraction of features that may only perform well on the training set. Therefore, the feature selection approach does not provide a satisfactory solution to the problems in high dimensional data classification.

Aggregating algorithms have been proposed in recent years for high dimensional data classification to overcome over-fitting and improve prediction accuracy. These methods have enjoyed great popularity because of their good performance. They include the bagging predictor 1, the boosting algorithm 4, and the Random Forests algorithm 3. They share some common features, such as the perturbation of the original training set to reduce over-fitting and the aggregation of a large number of individual classifiers in the hope that the majority of the classifiers will make correct predictions. Random Forests introduces a random feature selection step into the traditional tree classifier to further reduce over-fitting and achieves better performance than the bagging predictor. The boosting algorithm adaptively changes the sampling probabilities in bootstrapping and the voting weights in aggregating according to the performance of the individual classifiers on the training set. Since these aggregating algorithms all use the tree classifier, which can handle high dimensional data, as the base learner, they can circumvent the feature selection step and make full use of the information contained in all the features. The implicit weight adjustment among individual classifiers through the random feature selection mechanism in Random Forests and the explicit weight assignment in the boosting algorithm help to reduce the effects of the noisy features. However, the bootstrap distorts the original distribution of the training samples and achieves improved performance only with unstable algorithms like the tree classifier 2. Furthermore, the weight adjustment procedures introduced in Random Forests and the boosting algorithm do not consider the correlations among individual classifiers, leaving room for an improved weight assignment scheme.

The random subspace method introduced by Ho (1998) represents a distinct aggregating method with a fundamentally different philosophy from the above aggregating methods. The random subspace method originated from stochastic discriminant analysis 10, 11, which uses weak classifiers that are not necessarily very accurate but generalize well as base learners and achieves good prediction accuracy by aggregating many different such individual classifiers. The random subspace method generates multiple individual classifiers by projecting the original feature space into different subspaces. The projection from the high dimensional feature space into a low dimensional space avoids the problems caused by high dimensionality, and all the information in the original data is maintained by aggregating many individual classifiers based on different subspaces. However, the original random subspace method did not emphasize the importance of selecting a base learner that generalizes well. Also, all individual classifiers have equal weights, so the existence of uninformative subspaces can still deteriorate the performance of the aggregated classifier.

To address the limitations of the current methods, we consider alternative approaches for weight assignment in this article. More specifically, we incorporate the correlations among individual classifiers. As for the individual classifiers, we adopt the random subspace idea to generate individual classifiers on low dimensional subspaces. In contrast to perturbing the original samples, we believe that the random projection into subspaces preserves the complete information in the original data. At the same time, it substantially reduces the number of features, so that many good traditional classifiers can be used as the base learner. In this article, we focus on SVM as our base learner because it has good performance on low dimensional data and generalizes well, so that the weights derived from the training set carry over well to the test data.

The layout of this article is as follows: In the Methods section, we discuss our proposed weight assignment schemes. In the Experiments section we evaluate the performance of our weight assignment procedures based on a number of gene expression and mass spectrometry data sets. We conclude this article by discussing the base learner selection and weight assignment issues.

2. METHOD

Individual classifiers are built by randomly projecting the original data into subspaces and training a proper base learner on these subspaces to capture possible patterns that are informative for classification. Because SVM with appropriate kernels can classify samples with nonlinear boundaries and generalizes well, we use SVM here as the base learner.
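To make the construction concrete, the following R sketch (our illustration, not the authors' code) trains a single SVM base learner on one randomly chosen 10-dimensional subspace using the “e1071” package; the synthetic data, sample sizes, and variable names are placeholders.

```r
# Illustrative sketch: one SVM base learner on a random 10-feature subspace.
# Synthetic data are used as placeholders for a real expression matrix.
library(e1071)

set.seed(1)
p <- 5000                                  # total number of features (placeholder)
n <- 60                                    # number of training samples (placeholder)
x_train <- matrix(rnorm(n * p), nrow = n)  # synthetic "expression" matrix
y_train <- factor(sample(c("case", "control"), n, replace = TRUE))

subspace_dim <- 10
subspace <- sample(seq_len(p), subspace_dim)   # random projection: keep 10 features

# Radial-kernel SVM trained on the selected subspace only.
base_learner <- svm(x_train[, subspace, drop = FALSE], y_train, kernel = "radial")

# Misclassification indicator of this learner on the training set
# (0 = correct, 1 = wrong); these indicators feed the weighting step later.
train_pred <- predict(base_learner, x_train[, subspace, drop = FALSE])
error_indicator <- as.numeric(train_pred != y_train)
```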

For high dimensional data, most subspaces may contain only noisy features, and individual classifiers developed from these subspaces are not informative, so an equal voting weight scheme may not be ideal. Intuitively, classifiers that capture patterns with better classification accuracy should be assigned more weight. In the following, we discuss possible weight assignment schemes. We first consider a simplified scenario where all the individual classifiers are independent of each other, and derive the optimal weights under this assumption, as shown in the following theorem.

2.1 Weight assignment for the independent case

Theorem 2.1 (Optimal weight assignment for the independent case)

Suppose we have n independent individual classifiers to classify the samples, with classification error rates [math]\displaystyle{ e_1, e_2, \cdots, e_n }[/math] respectively. If these classifiers have voting weights [math]\displaystyle{ q_1, q_2, \cdots, q_n }[/math] respectively, where [math]\displaystyle{ q_1 + q_2 + \cdots + q_n = 1 }[/math], the error rate for the aggregated classifier is

[math]\displaystyle{ \sum_Q \left[\prod_{q_i\in Q} e_i\; \prod_{q_j \in Q^c} (1-e_j)\right], }[/math]

where [math]\displaystyle{ Q }[/math] is any subset of [math]\displaystyle{ \{q_1, q_2, \cdots, q_n\} }[/math] that satisfies [math]\displaystyle{ \sum_{q \in Q} q \gt \frac{1}{2} }[/math], and [math]\displaystyle{ Q^{c} }[/math] is the complement of [math]\displaystyle{ Q }[/math]. The weight assignments [math]\displaystyle{ q_k=\frac{1}{Const}\log\frac{1-e_k}{e_k} }[/math], where [math]\displaystyle{ Const=\sum_{k=1}^n \log\frac{1-e_k}{e_k} }[/math] is a normalizing constant that makes the [math]\displaystyle{ q_k }[/math] sum up to [math]\displaystyle{ 1 }[/math], achieve the minimal error rate.

Proof

Without loss of generality, we suppose [math]\displaystyle{ e_1 \leq e_2 \leq \cdots \leq e_n }[/math]. Note that for any subset of [math]\displaystyle{ \{q_{1}, q_{2}, \cdots, q_n\} }[/math] whose sum is greater than [math]\displaystyle{ \frac{1}{2} }[/math], its complement has sum less than [math]\displaystyle{ \frac{1}{2} }[/math]. Therefore, there are a total of [math]\displaystyle{ 2^{n-1} }[/math] subsets with sum greater than [math]\displaystyle{ \frac{1}{2} }[/math]. So to minimize the error rate of the aggregated classifier, we only need to assign the weights such that, out of the [math]\displaystyle{ 2^{n} }[/math] possible subsets, the [math]\displaystyle{ 2^{n-1} }[/math] subsets [math]\displaystyle{ Q }[/math] with the smallest values of [math]\displaystyle{ \prod_{q_i \in Q} e_i \prod_{q_j \in Q^c} (1 - e_{j}) }[/math] are exactly those satisfying [math]\displaystyle{ \sum_{q\in Q} q \gt \frac{1}{2} }[/math]. This is equivalent to the following condition: given any two subsets [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], if

[math]\displaystyle{ \prod_{q_i\in P} e_i \prod_{q_j\in P^c}(1-e_j) \geq \prod_{q_i\in Q} e_i \prod_{q_j\in Q^c} (1 -e_j), }[/math]

then [math]\displaystyle{ \sum_{q \in P} q \leq \sum_{q \in Q} q }[/math]. Taking the logarithm of the last inequality and cancelling the common terms, we obtain:

[math]\displaystyle{ \sum_{q_i \in P\setminus Q}\log\frac{1-e_i}{e_i} \leq \sum_{q_j \in Q\setminus P}\log\frac{1-e_j}{e_j}. }[/math]

Thus, if we let [math]\displaystyle{ q_k \propto \log\frac{1-e_k}{e_k} }[/math] for [math]\displaystyle{ k = 1, 2, \cdots, n }[/math], we obtain [math]\displaystyle{ \sum_{q \in P\setminus Q} q \leq \sum_{q \in Q\setminus P} q }[/math], which can be used to deduce that [math]\displaystyle{ \sum_{q \in P} q \leq \sum_{q \in Q} q }[/math]. Therefore this weight assignment achieves the minimal error rate.

The above theorem gives the exact weight for each individual classifier according to its performance, so as to achieve the best aggregated accuracy. An individual weak classifier with error rate [math]\displaystyle{ \frac{1}{2} }[/math] is assigned zero weight, and a classifier with error rate 0 is assigned all of the weight, which agrees with intuition.

We observe that the weights given in the above theorem are exactly the same as the heuristic weights used in the boosting algorithm. Thus, our results suggest that if we assume that all the individual classifiers are independent, the boosting algorithm gives the optimal aggregating accuracy.
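As a small numerical illustration of Theorem 2.1 (with made-up error rates, not taken from the paper), the weights can be computed directly in R:

```r
# Independent-case weights of Theorem 2.1 for hypothetical error rates.
e <- c(0.05, 0.20, 0.35, 0.45)       # made-up error rates of four classifiers
log_odds <- log((1 - e) / e)         # log((1 - e_k) / e_k)
q <- log_odds / sum(log_odds)        # normalize so that the weights sum to 1
round(q, 3)
# c(0.572, 0.269, 0.120, 0.039): the most accurate classifier dominates,
# while the near-random classifier (e = 0.45) gets a weight close to 0.
```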

2.2 Weight assignment for the dependent case

Under our current random subspace sampling scheme, especially when the subspace dimension is large, it is very possible that the features used in constructing individual classifiers on different subspaces overlap. Furthermore, the original features may already have strong correlations among them. Thus, some of the individual classifiers based on the selected subspaces can show correlated classification results. Therefore, there are likely complicated correlations among the individual classifiers, and the independence assumption discussed in the previous subsection usually does not hold. A weight assignment mechanism should incorporate information about these potentially complicated correlations among the individual classifiers in the ensemble.

We first formally formulate the problem. Suppose there are [math]\displaystyle{ k }[/math] weak classifiers [math]\displaystyle{ X_{i} }[/math], where [math]\displaystyle{ i= 1, \cdots, k }[/math]. Each [math]\displaystyle{ X_{i} }[/math] takes on two possible values: 1 with probability [math]\displaystyle{ e_{i} }[/math], which represents misclassification, and 0 otherwise. Our objective is to find a weight assignment [math]\displaystyle{ w_{i} }[/math] for each classifier, satisfying [math]\displaystyle{ w_{i} \geq 0 }[/math] and [math]\displaystyle{ \sum_{i=1}^k w_i=1 }[/math], such that the aggregated error rate [math]\displaystyle{ e=P \left\{\sum_{i=1}^{k} w_i X_i \gt \frac{1}{2} \right\} }[/math] is minimized. If the joint distribution of ([math]\displaystyle{ X_{1}, \cdots, X_{k} }[/math]) were known, the distribution of [math]\displaystyle{ \sum_{i=1}^k w_i X_i }[/math] in terms of the [math]\displaystyle{ w_i }[/math] could in principle be derived and the optimal weight assignment that minimizes the probability that a sample is misclassified could be identified. However, the joint distribution is often unknown, and this optimization problem does not lead to easy numerical solutions.

We propose to formulate this problem from an alternative perspective that can be solved easily and has some statistical justification. As formulated above, the aggregated vote [math]\displaystyle{ \sum_{i=1}^k w_i X_i }[/math] can be viewed as a random variable, and we can cast our goal as minimizing both its bias (from 0) and its variance. The expectation of the square loss [math]\displaystyle{ E \left(\sum_{i=1}^k w_i X_i\right)^2 }[/math] can be written as [math]\displaystyle{ \left(E \left(\sum_{i=1}^k w_i X_i\right)\right)^2 +Var\left(\sum_{i=1}^k w_i X_i \right) }[/math], which is the sum of the squared bias and the variance. By minimizing this objective function, we simultaneously control the bias and the variance, and the aggregated classifier can be expected to achieve good performance. The objective function is a quadratic form in the [math]\displaystyle{ w_i }[/math]s and can be easily minimized. To summarize, our goal is to find the [math]\displaystyle{ w_{i} }[/math] that solve:

[math]\displaystyle{ \min_{w}\left\{\sum_{i=1}^k \left(E X_i^2\right) w_i^2+2 \sum_{1\leq i \lt j\leq k}\left(E X_i X_j\right) w_i w_j\right\} }[/math]

subject to:

[math]\displaystyle{ \sum_{i=1}^k w_i= 1,\quad w_i\geq 0 \; \text{ for } i=1,\cdots,k. }[/math]

In this simplified problem, only pair-wise correlations between individual classifiers are needed. The joint distribution of each pair of weak classifiers can be estimated using the classification results on the training set. Specifically, if we let [math]\displaystyle{ a }[/math] denote the number of samples for which both classifiers [math]\displaystyle{ X_{i} }[/math] and [math]\displaystyle{ X_{j} }[/math] make correct classifications, [math]\displaystyle{ b }[/math] denote the number of samples for which both classifiers are wrong, and [math]\displaystyle{ c }[/math] and [math]\displaystyle{ d }[/math] denote the numbers of samples for which the two classifiers give opposite results, we have [math]\displaystyle{ \hat{P}\{X_i=0,X_j=0\}=\frac{a}{n} }[/math], [math]\displaystyle{ \hat{P}\{X_i=1, X_j=1\}=\frac{b}{n} }[/math], [math]\displaystyle{ \hat{P}\{X_i=0, X_j=1\}=\frac{c}{n} }[/math], and [math]\displaystyle{ \hat{P}\{X_i=1, X_j=0\}=\frac{d}{n} }[/math], where [math]\displaystyle{ n }[/math] is the total number of samples. Therefore [math]\displaystyle{ E X_{i}X_{j} }[/math] can be estimated as [math]\displaystyle{ \frac{b}{n} }[/math]. By solving the above quadratic programming problem, we may obtain a better weight assignment than the equal weight assignment. In the following experiments, we used the “quadprog” package in the R software to solve the above quadratic programming problem. Its "solve.QP" function implements the dual method of Goldfarb and Idnani for solving quadratic programming problems of the form [math]\displaystyle{ \min\left(-d^T x + \frac{1}{2}x^T D x\right) }[/math] with the constraints [math]\displaystyle{ A^{T} x \geq b_{0} }[/math]. We can easily rewrite our problem in this matrix notation and use “quadprog” to obtain better weights for our aggregated classifier.
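The following R sketch shows one way the quadratic program above can be set up with the “quadprog” package; the function name solve_weights, the small ridge term added for numerical positive definiteness, and the toy indicator matrix are our own illustrative choices rather than the authors' implementation.

```r
# Sketch of the dependent-case weight assignment as a quadratic program.
# X is an n x k matrix of misclassification indicators on the training set:
# X[s, i] = 1 if classifier i misclassifies sample s, 0 otherwise.
library(quadprog)

solve_weights <- function(X, ridge = 1e-6) {
  n <- nrow(X)
  k <- ncol(X)
  S <- crossprod(X) / n                # S[i, j] estimates E[X_i X_j] (b/n off the diagonal)
  Dmat <- 2 * (S + ridge * diag(k))    # small ridge keeps the matrix positive definite
  dvec <- rep(0, k)                    # no linear term in the objective
  # Constraint columns of Amat: sum(w) = 1 (equality), then w_i >= 0 for each i.
  Amat <- cbind(rep(1, k), diag(k))
  bvec <- c(1, rep(0, k))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}

# Toy usage with a random indicator matrix standing in for real training results:
set.seed(2)
X_toy <- matrix(rbinom(40 * 5, 1, prob = c(0.10, 0.20, 0.30, 0.40, 0.45)),
                nrow = 40, byrow = TRUE)
round(solve_weights(X_toy), 3)
```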

3. EXPERIMENTS AND RESULTS

To assess the performance of the proposed weighting schemes, we have conducted several experiments on two-class classification problems using data sets obtained from the Kent Ridge Bio-medical Data Set Repository. Four data sets are used here: the leukemia gene expression data, the breast cancer gene expression data, the lung cancer gene expression data, and the ovarian cancer protein mass spectrometry data. The number of features in these four data sets ranges from several thousand to tens of thousands. These four data sets were selected from the repository because they are all two-class classification problems with a relatively large number of training samples and a reasonably balanced distribution between the two classes. The details of these data sets are as follows.

The leukemia data set was originally used in 6 as an example of a generic approach to cancer classification based on gene expression profiles measured from DNA microarrays. The training set consists of 38 bone marrow samples (27 ALL and 11 AML) with 7,129 probes from 6,817 human genes. A test set is also provided, with 20 ALL and 14 AML cases. The breast cancer data come from 14, which identified a gene expression signature to accurately predict breast cancer status. In this data set, 78 (34/44)[1] training samples and 19 (12/7) test samples were collected, and 24,481 genes were measured in the microarray. The lung cancer data used in 7 have 32 (16/16) training samples and 149 (15/134) test samples, with each sample measured on 12,533 genes. The ovarian cancer data in 13 are the only protein expression data set. There are 253 (162/91) samples without an explicit separation into training and test sets. Each sample contains the relative amplitudes of the intensities at 15,154 detected mass/charge (m/z) identities. The characteristics of these data sets are summarized in Table 1.

Table 1

Characteristics of the high dimensional data sets

Data set         Training set (Case/Control)   Test set (Case/Control)   Features
Leukemia         27/11                         20/14                     7,129
Breast Cancer    34/44                         12/7                      24,481
Lung Cancer      16/16                         15/134                    12,533
Ovarian Cancer   253 (162/91), no predefined training/test split         15,154

In the following experiments, SVM is selected as the base learner of the random subspace method. We used the “svm” function in the R package “e1071” to train our classifiers. The dimension of the randomly selected subspace is set to 10. We generate 1,500 individual learners in total and plot the error rate as we vary the number of individual learners used for aggregation. The choices of the number of individual learners and of the subspace dimension are heuristic, and there are several considerations behind them. First, we want enough base learners so that the error rates converge to a stable level. Second, it is very time consuming to generate SVM base learners and to solve the weight assignment quadratic programming problem. We tried several choices for the number of base learners and found 1,500 to 2,000 to be a reasonable range. As for the subspace dimension, the error rate is not very sensitive to it; we selected a small value, 10, in the hope that the selected subspaces do not overlap in features. We discuss this topic further in the Discussion section. The prediction error rate is based on the test set error rate when an independent test set exists, and on the mean error rate of 10 iterations of randomly splitting the original data set into training and test sets when no independent test set exists.
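The overall experimental loop might look like the following R sketch on synthetic data; this is our reconstruction under the stated settings, with a reduced number of base learners for brevity, whereas the real experiments use the Kent Ridge data sets and 1,500 learners.

```r
# Sketch of the weighted random subspace experiment on synthetic data.
# (The real experiments use the Kent Ridge data sets, 1,500 base learners,
#  and subspace dimension 10; B is reduced here to keep the example fast.)
library(e1071)

set.seed(3)
p <- 2000; n_train <- 60; n_test <- 40
x_train <- matrix(rnorm(n_train * p), nrow = n_train)
y_train <- factor(sample(c("case", "control"), n_train, replace = TRUE))
x_test  <- matrix(rnorm(n_test * p), nrow = n_test)
y_test  <- factor(sample(c("case", "control"), n_test, replace = TRUE))

B <- 200
subspace_dim <- 10
train_err <- matrix(0, n_train, B)   # misclassification indicators on the training set
test_vote <- matrix(0, n_test, B)    # 1 = base learner votes "case", 0 = votes "control"

for (b in seq_len(B)) {
  s <- sample(seq_len(p), subspace_dim)
  fit <- svm(x_train[, s, drop = FALSE], y_train, kernel = "radial")
  train_err[, b] <- as.numeric(predict(fit, x_train[, s, drop = FALSE]) != y_train)
  test_vote[, b] <- as.numeric(predict(fit, x_test[, s, drop = FALSE]) == "case")
}

# Aggregate with a weight vector w summing to 1.  Equal weights are shown here;
# the weighted method replaces w with the quadratic-programming solution
# computed from train_err (see the solve_weights() sketch in the Method section).
w <- rep(1 / B, B)
agg_pred <- ifelse(test_vote %*% w > 0.5, "case", "control")
mean(agg_pred != as.character(y_test))   # aggregated error rate on the test set
```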

In the experiments, we compare the performances of the three weight assignment procedures discussed above: equal weight assignment, the optimal assignment for the independent case, and the weight assignment that takes into account correlations among the individual classifiers.

As observed from Figure 1, our proposed weight assignment procedure incorporating correlations provides significant improvement over equal weight assignment, except for the lung cancer data set, where all classifiers already have low error rates.

4. DISCUSSION

In this article, we have formulated the weight assignment problem in aggregating algorithms and proposed weighting schemes under both the independent and the dependent situations. Our experiments confirmed the improvement in prediction accuracy from our weight assignment procedures.

To explore the outcome of applying the proposed weight assignment procedures to over-fitting base learners, we apply the weight assignment procedures to the Random Forests algorithm to check whether we can improve its performance. We extracted the individual trees from the Random Forests algorithm and assigned weights to them according to our developed procedure, to compare with equal weight assignment. Thus, we used exactly the same set of underlying trees but with different weights. The results are shown in Figure 2. We observe that the weighted Random Forests method outperforms Random Forests with equal weights on the leukemia data and the ovarian cancer data. However, the results on the breast cancer data and the lung cancer data show a similar, or even slightly worse, error rate for the weighted method. We investigated the cause of the degradation and found that it was caused by the over-fitting of the tree classifiers. The tree classifier over-fits too severely to have consistent performance between the training set and the test set. The trees that were assigned large weights because of good performance on the training set tended to have only ordinary, and sometimes poor, performance on the test set. The effect of wrongly assigning large weights to poorly performing classifiers is substantial, which can result in the degraded performance of the weighted Random Forests method.
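A sketch of how individual trees can be extracted from a Random Forest and reweighted is given below; this is our illustrative reconstruction using the “randomForest” package and synthetic data, not the authors' original code, and the weight vector w would come from the weight assignment procedure described in the Method section.

```r
# Illustrative reconstruction: extract the individual trees of a Random Forest
# and apply a weighted vote instead of the usual equal-weight vote.
library(randomForest)

set.seed(4)
p <- 500; n_train <- 60; n_test <- 40
x_train <- matrix(rnorm(n_train * p), nrow = n_train)
y_train <- factor(sample(c("case", "control"), n_train, replace = TRUE))
x_test  <- matrix(rnorm(n_test * p), nrow = n_test)

rf <- randomForest(x_train, y_train, ntree = 500, mtry = 10)

# Per-tree predictions: columns of $individual correspond to individual trees.
tree_train <- predict(rf, x_train, predict.all = TRUE)$individual  # n_train x ntree
tree_test  <- predict(rf, x_test,  predict.all = TRUE)$individual  # n_test  x ntree

# Misclassification indicators of each tree on the training set; these play the
# role of the X matrix in the weight assignment procedure (note the
# re-substitution caveat discussed below).
train_err <- (tree_train != as.character(y_train)) * 1

# Weighted vote across trees; equal weights reproduce ordinary Random Forests,
# while w would instead come from the quadratic-programming weight assignment.
w <- rep(1 / ncol(tree_test), ncol(tree_test))
case_vote <- (tree_test == "case") %*% w
weighted_pred <- ifelse(case_vote > 0.5, "case", "control")
```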

The problem exposed by inconsistent individual classifiers reminds us of the risk of using the re-substitution error estimate on the training set to determine the weights of individual classifiers. We would suggest introducing an independent validation set when enough samples are available, and using the performance of the weak classifiers on the validation set to determine the weights. Most of the time, due to the limited number of samples, a cross-validation error estimate on the training set can be used instead.

We also compare the performance of the Random Forests algorithm and the weighted random subspace method. The subspace dimension in the random subspace method and the size of the randomly selected feature subset (“mtry”) in the Random Forests algorithm are both set to 10. In total, 2,000 weak classifiers are generated. From the results shown in Figure 3, we observe that the weighted random subspace method is competitive with the Random Forests algorithm. It has improved performance over Random Forests on the leukemia data and the ovarian data but is slightly worse on the breast cancer data and the lung cancer data. It appears that the weighted random subspace method does not uniformly outperform the Random Forests algorithm. On one hand, data characteristics, for example the boundary shape between different classes, may affect method performance. A good choice of kernel for SVM and careful tuning of the gamma parameter may adapt it to various data characteristics. We used the radial kernel and the default value of gamma here and did not fully explore all the possible choices, because parameter tuning is not our main focus. The R package “caret” provides a uniform front end to many classification and regression models, and parameters can easily be tuned with this package to improve performance. On the other hand, note that there are about 10,000 features in each data set and hence roughly [math]\displaystyle{ \binom{10{,}000}{10} \approx 10^{33} }[/math] possible subspaces of dimension 10, of which we only randomly selected 2,000 to construct our classifier. The Random Forests method possibly uses more features in constructing the final classifier with the same number of weak classifiers, because it searches a new feature subset for the best split at each node and a fully developed tree can be very deep.

From our past experience, the performance of the random subspace method improves significantly whenever a new pattern is identified in a new weak classifier, and the error rate usually does not change even when some poorly performing weak classifiers are incorporated, because of the weight assignment procedure. This can be explained as follows: assigning zero weight to a newly generated weak classifier can at least retain the aggregated error rate of the original ensemble of weak classifiers, and the optimal weights determined by our weight assignment procedure should perform at least as well as the assignment giving zero weight to the new weak classifier. Occasionally, the random subspace method may show a slightly increased error rate when certain new weak classifiers are incorporated. This is due to inconsistent performance of the new weak classifier between the training set and the test set, resulting in too much weight being incorrectly assigned to it. Since SVM generalizes well, the chance of this happening is small and the effect is limited. Therefore, the random subspace method may not have reached its best performance with a small number of weak classifiers, but the Random Forests algorithm cannot guarantee that its performance will not be deteriorated by newly generated noisy trees.

In the above experiments, we used a fixed subspace dimension for the random subspace method. However, the size of the informative feature subspaces is unknown in practice and needs to be tuned to obtain good performance. In addition, the underlying patterns may consist of different numbers of features. Ideally, we hope that all the weak classifiers are built on selected subspaces that exactly capture those patterns. In practice, however, we may fail to capture a pattern if the chosen subspace dimension is smaller than the true one, or we may include noisy features if it is larger than the true one. We may therefore want to vary the subspace dimension in deriving the optimal classifier, for example by setting the subspace dimension to be random so that subspaces of different dimensions are included. But the introduction of this randomness may also induce a large number of uninformative weak classifiers.

All the above suggestions place heavy demands on computation speed for high dimensional data. High performance parallel computation may be a solution, and it is especially suitable for aggregating algorithms. The R package “nws” provides coordination and parallel execution facilities, which may help further improve the speed of the weighted random subspace method.
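As one possible sketch of such parallelization (using the base “parallel” package as an alternative to “nws”, which we do not illustrate here), the base learners can be trained concurrently:

```r
# Sketch of training the subspace base learners in parallel with the base
# "parallel" package (one alternative to the "nws" package mentioned above).
library(parallel)
library(e1071)

set.seed(5)
p <- 2000; n <- 60
x_train <- matrix(rnorm(n * p), nrow = n)
y_train <- factor(sample(c("case", "control"), n, replace = TRUE))

train_one <- function(b, subspace_dim = 10) {
  s <- sample(seq_len(p), subspace_dim)
  list(subspace = s,
       model = svm(x_train[, s, drop = FALSE], y_train, kernel = "radial"))
}

# mc.cores > 1 requires fork support (Linux/macOS); use mc.cores = 1 on Windows.
ensemble <- mclapply(seq_len(200), train_one, mc.cores = 2)
```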

5. CONCLUSION

In summary, we have proposed a weighted random subspace method for high dimensional data classification problems. We have also demonstrated the good performance of this method through the application to several widely-studied data sets.

Acknowledgments

This work was supported in part by NHLBI/NIH contract N01-HV-28186, NIDA/NIH grant P30 DA 018343-01, and NIGMS grant R01 GM 59507.

Footnotes

  1. The number before the slash denotes the number of case samples and the number after the slash denotes the number of control samples.

References

Xiaoye Li, and Hongyu Zhao (2009). “Weighted Random Subspace Method for High Dimensional Data Classification.” Statistics and Its Interface, 2(2). doi:10.4310/sii.2009.v2.n2.a5