# Support Vector Machine (SVM) Training Algorithm

A Support Vector Machine (SVM) Training Algorithm is a kernel-based supervised learning algorithm that can produce a support vector-based predictive model.

**Context:**- It can range from being a Linear SVM Training Algorithm to being a Non-Linear SVM Training Algorithm/Kernel-based SVM.
- It can range from being an SVM Classification Algorithm to being an SVM Ranking Algorithm to being an SVM Regression Algorithm, to being a Structured SVM Algorithm, to being a Transductive SVM Algorithm.
- It can be implemented into an SVM Training System (that solves an SVM training task).
- It can require Tunable Parameter, such as Regularization Cost Parameter, a Kernel Function Parameter, ...
- It can have an SVM Computational Complexity (for [math]\displaystyle{ n }[/math] observations and [math]\displaystyle{ d }[/math] features) of:
- Linear kernel [math]\displaystyle{ O(nd) }[/math] and [math]\displaystyle{ d }[/math] for inference (Bengio, 2011).
- Non-Linear kernel [math]\displaystyle{ O(n^2) }[/math] to train, and [math]\displaystyle{ O(nd) }[/math] for inference (Bengio, 2011)..

- It can be considered as minimizing Hinge Loss.
- …

**Example(s):****Counter-Example(s):****See:**Hyperplane, Radial Basis Function Networks.

## References

### 2018

- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Support_vector_machine Retrieved:2018-4-8.
- In machine learning,
**support vector machines**(**SVMs**, also**support vector networks**^{[1]}) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The

**support vector clustering**^{[2]}algorithm created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.

- In machine learning,

### 2017

- (Zhang, 2017) ⇒ Zhang X. (2017) Support Vector Machines. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA
- QUOTE: Support vector machines (SVMs) are a class of linear algorithms which can be used for classification, regression, density estimation, novelty detection, etc. In the simplest case of two-class classification, SVMs find a hyperplane that separates the two classes of data with as wide a margin as possible. This leads to good generalization accuracy on unseen data and supports specialized optimization methods that allow SVM to learn from a large amount of data.

### 2017a

- https://qr.ae/TWsitr
- QUOTE: ... This definition of “best” results in different loss functions. If you look at the optimization problems of linear SVM and (regularized) LR, they are very similar:
- [math]\displaystyle{ min_𝑤 𝜆‖𝑤‖^2 + ∑_𝑖 max\{0,1 − 𝑦_𝑖 𝑤^𝑇 𝑥_𝑖\} }[/math]
- [math]\displaystyle{ min_𝑤 𝜆‖𝑤‖^2 + ∑_𝑖 log(1+exp(1 − 𝑦_𝑖 𝑤^𝑇 𝑥_𝑖)) }[/math]

- That is, they only differ in the loss function — SVM minimizes hinge loss while logistic regression minimizes logistic loss. ...

- QUOTE: ... This definition of “best” results in different loss functions. If you look at the optimization problems of linear SVM and (regularized) LR, they are very similar:

### 2015

- (Tomar & Agarwal, 2015) ⇒ Divya Tomar, and Sonali Agarwal. (2015). “A Comparison on Multi-class Classification Methods based on Least Squares Twin Support Vector Machine.” In: Knowledge-Based Systems Journal, 81(C). doi:10.1016/j.knosys.2015.02.009
- QUOTE: Table 1 Comparison of computational complexity

- QUOTE: Table 1 Comparison of computational complexity

### 2013

- http://scikit-learn.org/stable/modules/svm.html
- The advantages of support vector machines are:
- Effective in high dimensional spaces.
- Still effective in cases where number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

- The disadvantages of support vector machines include:
- If the number of features is much greater than the number of samples, the method is likely to give poor performances.
- SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

- The advantages of support vector machines are:

### 2011

- Yoshua Bengio. (2011). “Why is kernelized SVM much slower than linear SVM?." In: Quora
- QUOTE: ... Basically, a kernel-based SVM requires on the order of [math]\displaystyle{ n^2 }[/math] computation for training and order of [math]\displaystyle{ nd }[/math] computation for classification, where n is the number of training examples and d the input dimension (and assuming that the number of support vectors ends up being a fraction of n, which is shown to be expected in theory and in practice). Instead, a 2-class linear SVM requires on the order of [math]\displaystyle{ nd }[/math] computation for training (times the number of training iterations, which remains small even for large n) and on the order of [math]\displaystyle{ d }[/math] computations for classification. ...

### 2009

- (Wu & Kumar, 2009) ⇒ Xindong Wu, and Vipin Kumar, editors. (2009). “The Top Ten Algorithms in Data Mining.” Chapman & Hall. ISBN:1420089641

### 2007

- (Shalev-Shwartz et al., 2007) ⇒ Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. (2007). “Pegasos: Primal Estimated Sub-GrAdient SOlver for SVM.” In: Proceedings of the 24th International Conference on Machine learning. doi:10.1145/1273496.1273598

### 2004

- (Hastie et al., 2004) ⇒ Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. (2004). “The Entire Regularization Path for the Support Vector Machine.” In: The Journal of Machine Learning Research, 5.
- The support vector machine (SVM) is a widely used tool for classification. Many efficient implementations exist for fitting a two-class SVM model. The user has to supply values for the tuning parameters: the regularization cost parameter, and the kernel parameters. It seems a common practice is to use a default value for the cost parameter, often leading to the least restrictive model. In this paper we argue that the choice of the cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model.

### 2001

- (Schölkopf & Smola, 2001) ⇒ Bernhard Schölkopf, and Alexander J. Smola. (2001). “Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond." MIT Press. ISBN:0262194759
- QUOTE: In the 1990s, a new type of learning algorithm was developed, based on results from statistical learning theory: the Support Vector Machine (SVM). This gave rise to a new class of theoretically elegant learning machines that use a central concept of SVMs -- kernels -- for a number of learning tasks. Kernel machines provide a modular framework that can be adapted to different tasks and domains by the choice of the kernel function and the base algorithm.

### 2000a

- (Cristianini & Shawe-Taylor, 2000) ⇒ Nello Cristianini, and John Shawe-Taylor. (2000). “An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods." Cambridge University Press. ISBN:0521780195

### 2000b

- (Xiao et al., 2000) ⇒ Rong Xiao, Jicheng Wang, and Fayan Zhang. (2000). “An Approach to Incremental SVM Learning Algorithm.” In: Proceedings of the 12th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2000). doi:10.1109/TAI.2000.889881

### 2000c

- (Evgeniou et al., 2000) ⇒ Theodorus Evgeniou, Massimiliano Pontil, and Tomaso Poggio. (2000). “Regularization Networks and Support Vector Machines.” In: Advances in Computational Mathematics, 13(1).
- QUOTE: Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples

### 1999

- (Joachims, 1999) ⇒ Thorsten Joachims. (1999). “Making Large-Scale SVM Learning Practical.” In: Advances in Kernel Methods - Support Vector Learning, Bernhard Schölkopf and C. Burges and Alexander J. Smola (ed.), MIT-Press.

### 1995a

- (Cortes & Vapnik, 1995) ⇒ Corinna Cortes, and Vladimir N. Vapnik. (1995). “Support Vector Networks.” In: Machine Learning, 20(3).

### 1995b

- (Vapnik, 1995) ⇒ Vladimir N. Vapnik. (1995). “The Nature of Statistical Learning Theory.” Springer. ISBN:0387945598.

### 1992

- (Boser et al., 1992) ⇒ Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. (1992). “A Training Algorithm for Optimal Margin Classifiers.” In: Proceedings of the Fifth Annual Workshop of Computational Learning Theory (COLT).

### 1971

- (Vapnik & Chervonenkis, 1971) ⇒ Vladimir N. Vapnik, and A. Chervonenkis. (1971). “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability and Its Applications.

- ↑ Cortes, Corinna; Vapnik, Vladimir N. (1995). “Support-vector networks". Machine Learning. 20 (3): 273–297. doi:10.1007/BF00994018.
- ↑ Ben-Hur, Asa; Horn, David; Siegelmann, Hava; and Vapnik, Vladimir N.; "Support vector clustering"; (2001);
*Journal of Machine Learning Research*, 2: 125–137