2004 TheEntireRegulPathForTheSVM


Subject Headings: SVM Classification Algorithm, Regularization Path, Algorithm Tuning Parameter, Regularization Cost Parameter, Kernel Function Parameter, Cost Parameter.

Notes

Cited By

Quotes

Author Keywords

Abstract

The support vector machine (SVM) is a widely used tool for classification. Many efficient implementations exist for fitting a two-class SVM model. The user has to supply values for the tuning parameters: the regularization cost parameter, and the kernel parameters. It seems a common practice is to use a default value for the cost parameter, often leading to the least restrictive model. In this paper we argue that the choice of the cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. We illustrate our algorithm on some examples, and use our representation to give further insight into the range of SVM solutions.

1. Introduction

In this paper we study the support vector machine (SVM) (Vapnik, 1996; Schölkopf and Smola, 2001) for two-class classification. We have a set of [math]\displaystyle{ n }[/math] training pairs [math]\displaystyle{ (x_i, y_i) }[/math], where [math]\displaystyle{ x_i \in \Re^p }[/math] is a p-vector of real-valued predictors (attributes) for the ith observation, and [math]\displaystyle{ y_i \in \{-1,+1\} }[/math] codes its binary response. We start off with the simple case of a linear classifier, where our goal is to estimate a linear decision function

[math]\displaystyle{ f(x) = \beta_0 + \beta^T x }[/math] (1),

and its associated classifier

[math]\displaystyle{ \text{Class}(x) = \text{sign}[f(x)] }[/math] (2).

There are many ways to fit such a linear classifier, including linear regression, Fisher’s linear discriminant analysis, and logistic regression (Hastie et al., 2001, Chapter 4). If the training data are linearly separable, an appealing approach is to ask for the decision boundary [math]\displaystyle{ \{x : f(x) = 0\} }[/math] that maximizes the margin between the two classes (Vapnik, 1996). Solving such a problem is an exercise in convex optimization; …
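For reference, this margin-maximization problem is usually written in a loss-plus-penalty form with the hinge loss and a regularization parameter [math]\displaystyle{ \lambda }[/math]; the cost parameter of the abstract corresponds to [math]\displaystyle{ C = 1/\lambda }[/math]. The following is the standard formulation, restated here for context rather than quoted from the excerpt:

[math]\displaystyle{ \min_{\beta_0,\, \beta}\; \sum_{i=1}^{n} \bigl[ 1 - y_i f(x_i) \bigr]_{+} \;+\; \frac{\lambda}{2} \lVert \beta \rVert^2, \qquad f(x_i) = \beta_0 + \beta^T x_i }[/math].

Large [math]\displaystyle{ \lambda }[/math] (small [math]\displaystyle{ C }[/math]) gives a heavily regularized, smoother fit; small [math]\displaystyle{ \lambda }[/math] (large [math]\displaystyle{ C }[/math]) gives the least restrictive fit mentioned in the abstract.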

Figure 3 (caption): Simulated data illustrate the need for regularization. The 200 data points are generated from a pair of mixture densities. The two SVM models used radial kernels with the scale and cost parameters as indicated at the top of the plots. …

Figure 3 shows the results of fitting two SVM models to the same simulated data set. The data are generated from a pair of mixture densities, described in detail in Hastie et al. (2001, Chapter 2). The actual training data and test distribution are available from http://www-stat.stanford.edu/ElemStatLearn.

It seems that the regularization parameter C (or [math]\displaystyle{ \lambda }[/math]) is often regarded as a genuine “nuisance” in the community of SVM users. Software packages, such as the widely used SVMlight (Joachims, 1999), provide default settings for C, which are then used without much further exploration. A recent introductory document (Hsu et al., 2003) supporting the LIBSVM package does encourage grid search for C.
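To make the point concrete, the sketch below tunes C and the RBF kernel width by cross-validated grid search instead of accepting a default. It is only an illustration: it uses scikit-learn's SVC and GridSearchCV with a toy two-class data set standing in for the paper's mixture data, none of which appear in the paper itself.

```python
# Illustrative only: cross-validated grid search over the SVM cost parameter C
# and the RBF kernel width gamma, rather than accepting a default C.
# make_moons is just a stand-in for the mixture data of Figure 3.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)  # toy 2-class data

param_grid = {
    "C": np.logspace(-2, 3, 6),      # cost parameter: 0.01 ... 1000
    "gamma": np.logspace(-2, 1, 4),  # RBF kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: %.3f" % search.best_score_)
```

The grid over C spans several orders of magnitude precisely because, as Figure 3 illustrates, the quality of the fit can change dramatically with the cost parameter.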

7. Discussion

Our work on the SVM path algorithm was inspired by earlier work on exact path algorithms in other settings. “Least Angle Regression” (Efron et al., 2002) shows that the coefficient path for the sequence of “lasso” coefficients (Tibshirani, 1996) is piecewise linear. The lasso solves the following regularized linear regression problem: … where … is the L1 norm of the coefficient vector. This L1 constraint delivers a sparse solution vector [math]\displaystyle{ \beta_\lambda }[/math]; the larger [math]\displaystyle{ \lambda }[/math], the more elements of [math]\displaystyle{ \beta_\lambda }[/math] are zero, with the remainder shrunk toward zero. In fact, any model with an L1 constraint and a quadratic, piecewise quadratic, piecewise linear, or mixed quadratic and linear loss function will have piecewise linear coefficient paths, which can be calculated exactly and efficiently for all values of [math]\displaystyle{ \lambda }[/math] (Rosset and Zhu, 2003). These models include, among others, …

The SVM model has a quadratic constraint and a piecewise linear (“hinge”) loss function. This leads to a piecewise linear path in the dual space, hence the Lagrange coefficients [math]\displaystyle{ \alpha_i }[/math] are piecewise linear.
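As a concrete illustration of such piecewise-linear paths (for the lasso rather than the SVM, and using scikit-learn's LARS routine rather than the authors' code), the entire coefficient path can be computed at roughly the cost of a single fit:

```python
# Illustrative only: the entire (piecewise-linear) lasso coefficient path via
# LARS, as in Efron et al. (2002).  Between consecutive breakpoints in
# `alphas`, every coefficient is a linear function of the regularization level.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

alphas, active, coefs = lars_path(X, y, method="lasso")  # coefs: (n_features, n_breakpoints)

print("number of breakpoints:", len(alphas))
print("nonzero coefficients at the least regularized end:",
      int(np.sum(coefs[:, -1] != 0)))
```

The SvmPath algorithm exploits the analogous structure in the SVM dual: between breakpoints, the [math]\displaystyle{ \alpha_i }[/math] of the points on the margin change linearly in the regularization parameter while the others remain constant.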

Other models that would share this property include

  • The ε-insensitive SVM regression model
  • Quadratically regularized L1 regression, including flexible models based on kernels or smoothing splines.

Of course, quadratic criterion + quadratic constraints also lead to exact path solutions, as in the classic case of ridge regression, since a closed form solution is obtained via the SVD. However, these paths are nonlinear in the regularization parameter.
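Concretely (standard ridge-regression algebra, included for illustration rather than taken from the paper): with the singular value decomposition [math]\displaystyle{ X = UDV^T }[/math], the ridge solution is available for every value of [math]\displaystyle{ \lambda }[/math] at once,

[math]\displaystyle{ \hat{\beta}(\lambda) = (X^T X + \lambda I)^{-1} X^T y = V\, \mathrm{diag}\!\left( \frac{d_j}{d_j^2 + \lambda} \right) U^T y }[/math],

so a single SVD yields the whole path, but each coordinate of [math]\displaystyle{ \hat{\beta}(\lambda) }[/math] is a smooth, nonlinear function of [math]\displaystyle{ \lambda }[/math], in contrast with the piecewise-linear paths above.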

For general non-quadratic loss functions and [math]\displaystyle{ L_1 }[/math] constraints, the solution paths are typically piecewise non-linear. Logistic regression is a leading example. In this case, approximate path-following algorithms are possible (Rosset, 2005).

The general techniques employed in this paper are known as parametric programming via active sets in the convex optimization literature (Allgower and Georg, 1992). The closest work we have seen in the literature employs similar techniques in incremental learning for SVMs (Fine and Scheinberg, 2002; Cauwenberghs and Poggio, 2001; DeCoste and Wagstaff, 2000). These authors do not, however, construct exact paths as we do, but rather focus on updating and downdating the solutions as more (or less) data arises. Diehl and Cauwenberghs (2003) allow for updating the parameters as well, but again do not construct entire solution paths. The work of Pontil and Verri (1998) recently came to our notice; they also observed that the Lagrange multipliers for the margin vectors change in a piecewise-linear fashion, while the others remain constant.

The SvmPath has been implemented in the R computing environment (contributed library svmpath at CRAN), and is available from the first author’s website.

References

  • 1. Eugene Allgower and Kurt Georg. Continuation and path following. Acta Numerica, pages 1-64, 1992.
  • 2. Francis R. Bach, Michael I. Jordan, Kernel independent component analysis, The Journal of Machine Learning Research, 3, p.1-48, 3/1/2003
  • 3. Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems (NIPS 2000), volume 13. MIT Press, Cambridge, MA, 2001.
  • 4. Dennis DeCoste, Kiri Wagstaff, Alpha seeding for support vector machines, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.345-349, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347165
  • 5. Christopher Diehl and Gert Cauwenberghs. SVM incremental learning, adaptation and optimization. In Proceedings of the 2003 International Joint Conference on Neural Networks, pages 2685-2690, 2003. Special series on Incremental Learning.
  • 6. Brad Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Technical report, Stanford University, 2002.
  • 7. Brad Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 2004. (to appear, with discussion).
  • 8. Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 2000. doi:10.1023/A:1018946025316
  • 9. Shai Fine and Katya Scheinberg. Incas: An incremental active set method for SVM. Technical report, IBM Research Labs, Haifa, 2002.
  • 10. Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
  • 11. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning; Data Mining, Inference and Prediction. Springer Verlag, New York, 2001.
  • 12. Trevor Hastie and Rob Tibshirani. Efficient quadratic regularization for expression arrays. Technical report, Stanford University, 2003.
  • 13. Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
  • 14. Thorsten Joachims. Practical Advances in Kernel Methods -- Support Vector Learning, chapter Making large scale SVM learning practical. MIT Press, 1999. See http://svmlight.joachims.org.
  • 15. Steve Marron. An overview of support vector machines and kernel methods. Talk, 2003. Available from author's website: http://www.stat.unc.edu/postscript/papers/marron/Talks/.
  • 16. Massimiliano Pontil and Alessandro Verri. Properties of support vector machines. Neural Computation, 10(4):955-974, May 15, 1998. doi:10.1162/089976698300017575
  • 17. Sridhar Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, and T. Golub. Multiclass cancer diagnosis using tumor gene expression signature. PNAS, 98:15149-15154, 2001.
  • 18. B. D. Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996.
  • 19. Saharon Rosset. Tracking curved regularized optimization solution paths. In Advances in Neural Information Processing Systems (NIPS 2004), volume 17. MIT Press, Cambridge, MA, 2005. to appear.
  • 20. Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Technical report, Stanford University, 2003. http://www-stat.stanford.edu/~saharon/papers/piecewise.ps.
  • 21. Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16. MIT Press, Cambridge, MA, 2004.
  • 22. Bernhard Schölkopf, Alexander J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2001
  • 23. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B., 58:267-288, 1996.
  • 24. Vladimir N. Vapnik, The nature of statistical learning theory, Springer-Verlag New York, Inc., New York, NY, 1995
  • 25. G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
  • 26. G. Wahba, Y. Lin, and H. Zhang. Gacv for support vector machines. In A.J. Smola, P.L. Bartlett, Bernhard Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 297- 311, Cambridge, MA, 2000. MIT Press.
  • 27. J. Weston and C. Watkins. Multi-class support vector machines, 1998. URL citeseer.nj.nec.com/8884.html.
  • 28. Christopher K. I. Williams, Matthias Seeger, The Effect of the Input Density Distribution on Kernel-based Classifiers, Proceedings of the Seventeenth International Conference on Machine Learning, p.1159-1166, June 29-July 02, 2000
  • 29. Ji Zhu and Trevor Hastie. Classification of gene microarrays by penalized logistic regression. Biostatistics, 2004. (to appear).
  • 30. Ji Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani. L1 norm support vector machines. Technical report, Stanford University, 2003.


Author(s): Trevor Hastie, Saharon Rosset, Ji Zhu, Robert Tibshirani
Title: The Entire Regularization Path for the Support Vector Machine
Journal: The Journal of Machine Learning Research, volume 5
Year: 2004
URL: http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf