Stochastic Gradient Descent (SGD)-based Classification System


A Stochastic Gradient Descent (SGD)-based Classification System is a supervised classification system that implements a Stochastic Gradient Descent Algorithm to solve an SGD Classification Task.



References

2017b

and all regression losses below (...)
SGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting `average=True`. ASGD works by averaging the coefficients of the plain SGD over each iteration over a sample. When using ASGD, the learning rate can be larger and even constant, leading on some datasets to a speed-up in training time.
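For illustration, here is a minimal sketch of enabling averaged SGD in scikit-learn's SGDClassifier (the synthetic data and parameter values are our own; parameter names such as max_iter and tol assume scikit-learn 0.19 or later):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification data for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Plain SGD with a hinge loss (a linear SVM fit by SGD).
plain_sgd = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
plain_sgd.fit(X, y)

# Averaged SGD (ASGD): average=True keeps a running average of the coefficients
# over the updates; this often tolerates a larger or even constant learning rate.
avg_sgd = SGDClassifier(loss="hinge", average=True,
                        learning_rate="constant", eta0=0.01,
                        max_iter=1000, tol=1e-3, random_state=0)
avg_sgd.fit(X, y)
```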

For classification with a logistic loss, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in LogisticRegression.
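A comparable sketch using the SAG solver in LogisticRegression (data and settings are illustrative only; SAG converges fastest when the features are on a similar scale):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# SAG keeps a memory of per-sample gradients and averages them,
# so feature scaling and a sufficient iteration budget help convergence.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(solver="sag", max_iter=1000)
clf.fit(X_scaled, y)
```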

2017c

  • (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation Retrieved:2017-09-17
    • QUOTE: Given a set of training examples [math]\displaystyle{ (x_1, y_1), \ldots, (x_n, y_n) }[/math] where [math]\displaystyle{ x_i \in \mathbf{R}^m }[/math] and [math]\displaystyle{ y_i \in \{-1,1\} }[/math], our goal is to learn a linear scoring function [math]\displaystyle{ f(x) = w^T x + b }[/math] with model parameters [math]\displaystyle{ w \in \mathbf{R}^m }[/math] and intercept [math]\displaystyle{ b \in \mathbf{R} }[/math]. In order to make predictions, we simply look at the sign of [math]\displaystyle{ f(x) }[/math]. A common choice to find the model parameters is by minimizing the regularized training error given by

      [math]\displaystyle{ E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w) }[/math]

      where [math]\displaystyle{ L }[/math] is a loss function that measures model (mis)fit and [math]\displaystyle{ R }[/math] is a regularization term (aka penalty) that penalizes model complexity; [math]\displaystyle{ \alpha \gt 0 }[/math] is a non-negative hyperparameter.
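To make the objective concrete, the following NumPy sketch (function names are ours, not scikit-learn's) evaluates E(w, b) for the hinge loss with the L2 penalty R(w) = ½‖w‖² and predicts by the sign of f(x):

```python
import numpy as np

def hinge_loss(y, scores):
    # L(y, f(x)) = max(0, 1 - y * f(x)) with labels y in {-1, +1}.
    return np.maximum(0.0, 1.0 - y * scores)

def regularized_training_error(w, b, X, y, alpha):
    # E(w, b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w),
    # here with the L2 penalty R(w) = 0.5 * ||w||^2.
    scores = X @ w + b                        # f(x_i) = w^T x_i + b for every sample
    data_term = hinge_loss(y, scores).mean()
    penalty = alpha * 0.5 * np.dot(w, w)
    return data_term + penalty

def predict(w, b, X):
    # Predictions are simply the sign of the scoring function.
    return np.sign(X @ w + b)
```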

      Different choices for [math]\displaystyle{ L }[/math] entail different classifiers or regressors, such as:

      • Hinge: (soft-margin) Support Vector Machines.
      • Log: Logistic Regression.
      • Least-Squares: Ridge Regression.
      • Epsilon-Insensitive: (soft-margin) Support Vector Regression.
(...)
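As an illustration, these losses map onto scikit-learn estimators roughly as follows (a hedged sketch; the loss strings reflect current releases, whereas the 2017-era versions quoted here used "log" and "squared_loss"):

```python
from sklearn.linear_model import SGDClassifier, SGDRegressor

svm_like    = SGDClassifier(loss="hinge")                # soft-margin linear SVM
logreg_like = SGDClassifier(loss="log_loss")             # logistic regression
ridge_like  = SGDRegressor(loss="squared_error")         # least squares (ridge-style with penalty="l2")
svr_like    = SGDRegressor(loss="epsilon_insensitive")   # soft-margin SVR
```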
Popular choices for the regularization term [math]\displaystyle{ R }[/math] include:
  • L2 norm: [math]\displaystyle{ R(w) := \frac{1}{2} \sum_{i=1}^{m} w_i^2 }[/math],
  • L1 norm: [math]\displaystyle{ R(w) := \sum_{i=1}^{m} |w_i| }[/math], which leads to sparse solutions.
  • Elastic Net: [math]\displaystyle{ R(w) := \frac{\rho}{2} \sum_{i=1}^{m} w_i^2 + (1-\rho) \sum_{i=1}^{m} |w_i| }[/math], a convex combination of L2 and L1, where [math]\displaystyle{ \rho }[/math] is given by 1 - l1_ratio.
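A brief sketch of selecting these penalties in SGDClassifier (the alpha and l1_ratio values below are arbitrary examples, not recommendations):

```python
from sklearn.linear_model import SGDClassifier

# penalty selects R(w); alpha is the regularization strength.
l2_clf   = SGDClassifier(penalty="l2", alpha=1e-4)
l1_clf   = SGDClassifier(penalty="l1", alpha=1e-4)           # tends to produce sparse w
enet_clf = SGDClassifier(penalty="elasticnet", alpha=1e-4,
                         l1_ratio=0.15)                       # rho = 1 - l1_ratio
```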