# 2003 AnIntroToVariableAndFeatureSelection

## Quotes

### Abstract

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

### 1. Introduction

As of 1997, when a special issue on relevance including several papers on variable and feature selection was published (Blum and Langley, 1997, Kohavi and John, 1997), few domains explored used more than 40 features. The situation has changed considerably in the past few years and, in this special issue, most papers explore domains with hundreds to tens of thousands of variables or features:1 New techniques are proposed to address these challenging tasks involving many irrelevant and redundant variables and often comparably few training examples. [footnote 1. We call “variable” the “raw” input variables and “features” variables constructed for the input variables. We use without distinction the terms “variable” and “feature” when there is no impact on the selection algorithms, e.g., when features resulting from a pre-processing of input variables are explicitly computed. The distinction is necessary in the case of kernel methods for which features are not explicitly computed (see section 5.3).]

Two examples are typical of the new application domains and serve us as illustration throughout this introduction. One is gene selection from microarray data and the other is text categorization. …

There are many potential benefits of variable and feature selection: facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, defying the curse of dimensionality to improve prediction performance. Some methods put more emphasis on one aspect than another, and this is another point of distinction between this special issue and previous work. The papers in this issue focus mainly on constructing and selecting subsets of features that are useful to build a good predictor. This contrasts with the problem of finding or ranking all potentially relevant variables. Selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. Conversely, a subset of useful variables may exclude many redundant, but relevant, variables. For a discussion of relevance vs. usefulness and definitions of the various notions of relevance, see the review articles of Kohavi and John (1997) and Blum and Langley (1997).

... Because the organization of our paper does not follow the work flow of building a machine learning application, we summarize the steps that may be taken to solve a feature selection problem in a check list (We caution the reader that this check list is heuristic. The only recommendation that is almost surely valid is to try the simplest things first.)

• 1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
• 2. Are your features commensurate? If no, consider normalizing them.
• 3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you (see example of use in Section 4.4).
• 4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features (e.g. by clustering or matrix factorization, see Section 5).
• 5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method (Section 2 and Section 7.2); else, do it anyway to get baseline results.
• 6. Do you need a predictor? If no, stop.
• 7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
• 8. Do you know what to try first? If no, use a linear predictor.3 Use a forward selection method (Section 4.2) with the “probe” method as a stopping criterion (Section 6) or use the `0-norm embedded method (Section 4.3). For comparison, following the ranking of step 5, construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
• 9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods (Section 4). Use linear and non-linear predictors. Select the best approach with model selection (Section 6).
• 10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps” (Section 7.1).

### 2. Variable Ranking

Many variable selection algorithms include variable ranking as a principal or auxiliary selection mechanism because of its simplicity, scalability, and good empirical success. Several papers in this issue use variable ranking as a baseline method (see, e.g., Bekkerman et al., 2003, Caruana and de Sa, 2003, Forman, 2003, Weston et al., 2003). Variable ranking is not necessarily used to build predictors. One of its common uses in the microarray analysis domain is to discover a set of drug leads (see, e.g., et al., 1999): A ranking criterion is used to find genes that discriminate between healthy and disease patients; such genes may code for “drugable” proteins, or proteins that may themselves be used as drugs. Validating drug leads is a labor intensive problem in biology that is outside of the scope of machine learning, so we focus here on building predictors. We consider in this section ranking criteria defined for individual variables, independently of the context of others. Correlation methods belong to that category. We also limit ourselves to supervised learning criteria. We refer the reader to Section 7.2 for a discussion of other techniques.

### 4. Variable Subset Selection

In the previous section, we presented examples that illustrate the usefulness of selecting subsets of variables that together have good predictive power, as opposed to ranking variables according to their individual predictive power. We now turn to this problem and outline the main directions that have been taken to tackle it. They essentially divide into wrappers, filters, and embedded methods. Wrappers utilize the learning machine of interest as a black box to score subsets of variable according to their predictive power. Filters select subsets of variables as a pre-processing step, independently of the chosen predictor. Embedded methods perform variable selection in the process of training and are usually specific to given learning machines. …

### 5. Feature Construction and Space Dimensionality Reduction

In some applications, reducing the dimensionality of the data by selecting a subset of the original variables may be advantageous for reasons including the expense of making, storing and processing measurements. If these considerations are not of concern, other means of space dimensionality reduction should also be considered.

The art of machine learning starts with the design of appropriate data representations. Better performance is often achieved using features derived from the original input. Building a feature representation is an opportunity to incorporate domain knowledge into the data and can be very application specific. Nonetheless, there are a number of generic feature construction methods, including: clustering; basic linear transforms of the input variables (PCA/SVD, LDA); more sophisticated linear transforms like spectral transforms (Fourier, Hadamard), wavelet transforms or convolutions of kernels; and applying simple functions to subsets of variables, like products to create monomials.

Two distinct goals may be pursued for feature construction: achieving best reconstruction of the data or being most efficient for making predictions. The first problem is an unsupervised learning problem. It is closely related to that of data compression and a lot of algorithms are used across both fields. The second problem is supervised. Are there reasons to select features in an unsupervised manner when the problem is supervised? Yes, possibly several: Some problems, e.g., in text processing applications, come with more unlabelled data than labelled data. Also, unsupervised feature selection is less prone to overfitting.

In this issue, four papers address the problem of feature construction. All of them take an information theoretic approach to the problem. Two of them illustrate the use of clustering to construct features (Bekkerman et al., 2003, Dhillon et al., 2003), one provides a new matrix factorization algorithm (Globerson and Tishby, 2003), and one provides a supervised means of learning features from a variety of models (Torkkola, 2003). In addition, two papers whose main focus is directed to variable selection also address the selection of monomials of a polynomial model and the hidden units of a neural network (Rivals and Personnaz, 2003, Stoppiglia et al., 2003), and one paper addresses the implicit feature selection in non-linear kernel methods for polynomial kernels (Weston et al., 2003).

#### 5.1 Clustering

Clustering has long been used for feature construction. The idea is to replace a group of “similar” variables by a cluster centroid, which becomes a feature. The most popular algorithms include K-means and hierarchical clustering. For a review, see, e.g., the textbook of Duda et al. (2001).

Clustering is usually associated with the idea of unsupervised learning. It can be useful to introduce some supervision in the clustering procedure to obtain more discriminant features. This is the idea of distributional clustering (Pereira et al., 1993), which is developed in two papers of this issue. Distributional clustering is rooted in the information bottleneck (IB) theory of Tishby et al. (1999). If we call $\tilde{X}$ the random variable representing the constructed features, the IB method seeks to minimize the mutual information $I(X , \tilde{X} )$, while preserving the mutual information $I(\tilde{X},Y )$. A global objective function is built by introducing a Lagrange multiplier $\beta : J = I(X,\tilde{X}) − \beta I(\tilde{X} ,Y )$. So, the method searches for the solution that achieves the largest possible compression, while retaining the essential information about the target.

Text processing applications are usual targets for such techniques. Patterns are full documents and variables come from a bag-of-words representation: Each variable is associated to a word and is proportional to the fraction of documents in which that word appears. In application to feature construction, clustering methods group words, not documents. In text categorization tasks, the supervision comes from the knowledge of document categories. It is introduced by replacing variable vectors containing document frequency counts by shorter variable vectors containing document category frequency counts, i.e., the words are represented as distributions over document categories.

### 6. Validation Methods

We group in this section all the issues related to out-of-sample performance prediction (generalization prediction) and model selection. These are involved in various aspects of variable and feature selection: to determine the number of variables that are “significant”, to guide and halt the search for good variable subsets, to choose hyperparameters, and to evaluate the final performance of the system. …

### 7.2 Variable Ranking in the Context of Others

In Section 2, we limited ourselves to presenting variable ranking methods using a criterion computed from single variables, ignoring the context of others. In Section 4.2, we introduced nested subset methods that provide a useful ranking of subsets, not of individual variables: some variables may have a low rank because they are redundant and yet be highly relevant. Bootstrap and Bayesian methods presented in Section 7.1, may be instrumental in producing a good variable ranking incorporating the context of others.

The relief algorithm uses another approach based on the nearest-neighbor algorithm (Kira and Rendell, 1992). For each example, the closest example of the same class (nearest hit) and the closest example of a different class (nearest miss) are selected. The score $S(i)$ of the $i^{th}$ variable is computed as the average over all examples of magnitude of the difference between the distance to the nearest hit and the distance to the nearest miss, in projection on the $i^{th}$ variable.

### 7.3 Unsupervised Variable Selection

Sometimes, no target y is provided, but one still would want to select a set of most significant variables with respect to a defined criterion. Obviously, there are as many criteria as problems can be stated. Still, a number of variable ranking criteria are useful across applications, including saliency, entropy, smoothness, density and reliability. …

### 8. Conclusion

The recent developments in variable and feature selection have addressed the problem from the pragmatic point of view of improving the performance of predictors. They have met the challenge of operating on input spaces of several thousand variables. Sophisticated wrapper or embedded methods improve predictor performance compared to simpler variable ranking methods like correlation methods, but the improvements are not always significant: domains with large numbers of input variables suffer from the curse of dimensionality and multivariate methods may overfit the data. For some domains, applying first a method of automatic feature construction yields improved performance and a more compact set of features. The methods proposed in this special issue have been tested on a wide variety of data sets (see Table 1), which limits the possibility of making comparisons across papers. Further work includes the organization of a benchmark. The approaches are very diverse and motivated by various theoretical arguments, but a unifying theoretical framework is lacking. Because of these shortcomings, it is important when starting with a new problem to have a few baseline performance values. To that end, we recommend using a linear predictor of your choice (e.g. a linear SVM) and select variables in two alternate ways: (1) with a variable ranking method using a correlation coefficient or mutual information; (2) with a nested subset selection method performing forward or backward selection or with multiplicative updates. Further down the road, connections need to be made between the problems of variable and feature selection and those of experimental design and active learning, in an effort to move away from observational data toward experimental data, and to address problems of causality inference.

## References

,

volumeDate ValuetitletypejournaltitleUrldoinoteyear
2003 AnIntroToVariableAndFeatureSelectionAn Introduction to Variable and Feature SelectionThe Journal of Machine Learning Researchhttp://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf2003
 Author Isabelle M. Guyon + and André Elisseeff + journal The Journal of Machine Learning Research + title An Introduction to Variable and Feature Selection + titleUrl http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf + year 2003 +