2004 BoostedLasso


Keywords: Boosted Lasso Algorithm.

Notes

Quotes

1. Introduction

  • Lasso (Tibshirani, 1996) is an important idea which has received much attention recently in statistics, signal processing (under the name Basis Pursuit; Chen and Donoho, 1994) and machine learning. It regularizes or shrinks a fitted model through an L1 penalty or constraint (see the formulation sketched after this list). Its popularity can be explained in several ways. Since nonparametric models that fit the training data well often have low bias but large variance, prediction accuracy can sometimes be improved by shrinking a model or making it more sparse. The regularization resulting from the L1 penalty leads to sparse solutions, that is, few basis functions have nonzero weights (among all possible choices). This statement is proved asymptotically by Knight and Fu (2000) and for the finite case under various conditions by different authors (see e.g. Osborne et al., 2000a,b; Donoho et al., 2004; Donoho, 2004; Tropp, 2004; Rosset et al., 2004). Furthermore, the sparse models induced by Lasso are usually more interpretable and are therefore preferred in the sciences and social sciences.
  • Another vastly popular and recent idea is Boosting. Since its inception in 1990 (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996), it has become one of the most successful machine learning methods and is now understood as an iterative gradient descent method leading to an additive model (Breiman, 1998; Mason et al., 1999; Friedman et al., 2000).
  • While it is a natural idea to combine Boosting and Lasso to obtain a regularized Boosting procedure, it is also intriguing that Boosting, without any additional regularization, has its own resistance to overfitting. For specific cases such as L2Boost (Friedman, 2001), this resistance is understood to some extent (Buhlmann and Yu, 2003). However, it was not until Forward Stagewise Fitting (FSF) was introduced as a boosting-based procedure with much more cautious steps that a similarity between FSF and Lasso was observed (Rosset et al., 2004; Hastie et al., 2001; Efron et al., 2004).
  • This link between Lasso and FSF is described more formally for the linear regression case through the LARS algorithm (Least Angle Regression; Efron et al., 2004). It is also known that for special cases (such as orthogonal designs) FSF can approximate the Lasso path arbitrarily closely, but in general it is unclear what regularization criterion FSF optimizes. As can be seen in our experiments (Figure 1), FSF solutions can differ significantly from the Lasso solutions in the case of strongly correlated predictors, which are common in high-dimensional data problems. However, FSF is still used as an approximation to Lasso because it is often computationally prohibitive to solve Lasso with general loss functions for many regularization parameters through Quadratic Programming.
  • In this paper, we propose a new algorithm, Boosted Lasso (BLasso), which approximates the Lasso path in general situations. The motivation comes from a critical observation that both FSF and Boosting work only in a forward fashion (hence the name FSF). They always take steps that reduce the empirical loss the most, regardless of the impact on model complexity (or the L1 penalty in the Lasso case). This often proves to be too greedy: the algorithms are not able to correct mistakes made in early stages. Taking a coordinate (difference) descent viewpoint of the Lasso minimization with a fixed step size, we introduce an innovative “backward” step. This step uses the same minimization rule as the forward step to define each fitting stage, but with an extra rule that forces the model complexity to decrease. By combining backward and forward steps, Boosted Lasso is able to go back and forth to approximate the Lasso path correctly (a sketch of the resulting procedure is given after this list).
  • BLasso can be seen as a marriage between two families of successful methods. Computationally, BLasso works similarly to Boosting and FSF. It isolates the sub-optimization problem at each step from the whole process, i.e., in the language of the Boosting literature, each base learner is learned separately. In this way BLasso can handle different loss functions and large classes of base learners such as trees, wavelets and splines by fitting a base learner at each step and aggregating the base learners as the algorithm progresses. Moreover, BLasso can be proven to converge to the Lasso solutions, which have explicit global L1 regularization, for cases with a finite number of base learners. In contrast, FSF can be seen as local regularization in the sense that at any iteration, FSF with a fixed small step size only searches over those models which are one small step away from the current one in all possible directions corresponding to the base learners (cf. Hastie et al., 2006 for a recent interpretation of the ε → 0 case).
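
For reference, the L1-penalized criterion mentioned in the first item above can be written, for a generic empirical loss L and penalty parameter λ ≥ 0, as

  \hat{\beta}(\lambda) \;=\; \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} L\big(y_i, x_i^{\top}\beta\big) \;+\; \lambda \sum_{j=1}^{p} |\beta_j| \Big\},

with the squared-error choice L(y, f) = (y - f)^2 recovering the original Lasso of Tibshirani (1996).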
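The following is a minimal sketch of the forward/backward procedure described above, written for squared-error loss only so that it is self-contained; the step size eps, the backward-step tolerance xi, and all function and variable names are illustrative choices rather than the paper's notation.

  import numpy as np

  def blasso(X, y, eps=0.01, xi=1e-6, max_steps=500):
      """Sketch of a Boosted-Lasso-style coordinate procedure: forward steps
      that most reduce the loss, plus backward steps that shrink the L1 norm.
      Squared-error loss only; not the paper's general formulation."""
      n, p = X.shape
      loss = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
      beta = np.zeros(p)

      # Initial forward step: the +/- eps coordinate move that most reduces the loss.
      j, s = min(((j, s) for j in range(p) for s in (-eps, eps)),
                 key=lambda js: loss(beta + js[1] * np.eye(p)[js[0]]))
      lam = (loss(beta) - loss(beta + s * np.eye(p)[j])) / eps
      beta[j] += s
      path = [(lam, beta.copy())]

      for _ in range(max_steps):
          if lam <= 0:
              break
          # Backward step: among nonzero coordinates, the eps move toward zero
          # that least increases the loss (same minimization rule as the forward
          # step, but the L1 norm is forced to decrease).
          nonzero = np.flatnonzero(beta)
          backward = None
          if nonzero.size:
              cand = min(nonzero,
                         key=lambda j: loss(beta - np.sign(beta[j]) * eps * np.eye(p)[j]))
              b_back = beta - np.sign(beta[cand]) * eps * np.eye(p)[cand]
              if loss(b_back) - loss(beta) < -lam * eps + xi:
                  backward = b_back
          if backward is not None:
              beta = backward  # take the backward step; lambda is unchanged
          else:
              # Forward step: the +/- eps coordinate move that most reduces the loss.
              j, s = min(((j, s) for j in range(p) for s in (-eps, eps)),
                         key=lambda js: loss(beta + js[1] * np.eye(p)[js[0]]))
              new_beta = beta + s * np.eye(p)[j]
              lam = min(lam, (loss(beta) - loss(new_beta)) / eps)
              beta = new_beta
          path.append((lam, beta.copy()))
      return path

The returned path pairs each coefficient vector with the current implicit regularization level, which is how the forward and backward moves combine to trace an approximation of the Lasso path under these assumptions.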

2.2 Lasso

  • ... However, it remains an open question how to obtain the entire regularization path of the Lasso problem for a general convex loss function. FSF exists as a compromise since, like Boosting, it is a nonparametric learning algorithm that works with different loss functions and large numbers of base learners, but it performs only local regularization and does not converge to the Lasso path in general (see the sketch below).
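
For comparison with the BLasso sketch above, here is a minimal sketch of FSF with a fixed small step size, again for squared-error loss; the names and stopping rule are illustrative. At every iteration it only considers models one eps-step away from the current one and has no backward step, which is the local-regularization behaviour described in the quote.

  import numpy as np

  def forward_stagewise(X, y, eps=0.01, max_steps=500):
      """Sketch of Forward Stagewise Fitting: repeatedly take the +/- eps
      coordinate step that most reduces the empirical loss. There is no
      backward step, so moves made early are never explicitly undone."""
      n, p = X.shape
      loss = lambda b: 0.5 * np.sum((y - X @ b) ** 2)
      beta = np.zeros(p)
      for _ in range(max_steps):
          j, s = min(((j, s) for j in range(p) for s in (-eps, eps)),
                     key=lambda js: loss(beta + js[1] * np.eye(p)[js[0]]))
          if loss(beta + s * np.eye(p)[j]) >= loss(beta):
              break  # no single eps-step improves the fit any further
          beta[j] += s
      return beta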
References

  • Peng Zhao, and Bin Yu (2004). “Boosted Lasso.” http://www.stat.berkeley.edu/users/binyu/ps/blasso.sub.pdf