Stochastic Gradient Descent (SGD)-based Regression System

From GM-RKB

A Stochastic Gradient Descent (SGD)-based Regression System is a Linear Regression System that implements a Stochastic Gradient Descent Algorithm to solve a SGD Regression Task.
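As a concrete illustration, the following is a minimal sketch of such a system built on scikit-learn's SGDRegressor; the synthetic data and hyperparameter values are assumptions chosen only for the example.

```python
# Minimal sketch of an SGD-based regression system with scikit-learn's SGDRegressor.
# The synthetic data and hyperparameter values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                                    # 1,000 samples, 5 features
y = X @ np.array([1.5, -2.0, 0.0, 3.0, 0.5]) + 0.1 * rng.randn(1000)

# SGD is sensitive to feature scaling, so standardize the inputs first.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="l2", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:3]))
```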



References

2017a

2017b

  • (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#regression Retrieved:2017-09-17
    • QUOTE: The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.

      The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:

      • loss="squared_loss": Ordinary least squares,
      • loss="huber": Huber loss for robust regression,
      • loss="epsilon_insensitive": linear Support Vector Regression.
      The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
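For illustration, a brief sketch of selecting these robust losses with SGDRegressor; the epsilon values below are assumptions chosen for the example, not recommendations.

```python
from sklearn.linear_model import SGDRegressor

# Robust regression with the Huber loss; epsilon marks where the loss switches
# from squared to linear and should reflect the scale of the target variable.
huber_reg = SGDRegressor(loss="huber", epsilon=1.0, max_iter=1000, random_state=0)

# Linear Support Vector Regression via the epsilon-insensitive loss; residuals
# smaller than epsilon incur no loss at all.
svr_like = SGDRegressor(loss="epsilon_insensitive", epsilon=0.1,
                        max_iter=1000, random_state=0)
```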

      SGDRegressor supports averaged SGD, just as SGDClassifier does. Averaging can be enabled by setting `average=True`.

      For regression with a squared loss and an l2 penalty, another variant of SGD with an averaging strategy is provided by the Stochastic Average Gradient (SAG) algorithm, which is available as a solver in Ridge.
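The two averaging strategies can be set up side by side; a hedged sketch, with alpha and the other settings as illustrative assumptions.

```python
from sklearn.linear_model import SGDRegressor, Ridge

# Averaged SGD: the final coefficients are the average of the SGD iterates.
avg_sgd = SGDRegressor(penalty="l2", alpha=1e-4, average=True,
                       max_iter=1000, random_state=0)

# Stochastic Average Gradient: an alternative averaging-style optimizer,
# exposed as a solver in Ridge for squared loss with an l2 penalty.
sag_ridge = Ridge(alpha=1.0, solver="sag", random_state=0)
```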

2017c

  • (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation Retrieved:2017-09-17
    • QUOTE: Given a set of training examples [math]\displaystyle{ (x_1, y_1), \ldots, (x_n, y_n) }[/math] where [math]\displaystyle{ x_i \in \mathbf{R}^m }[/math] and [math]\displaystyle{ y_i \in \{-1,1\} }[/math], our goal is to learn a linear scoring function [math]\displaystyle{ f(x) = w^T x + b }[/math] with model parameters [math]\displaystyle{ w \in \mathbf{R}^m }[/math] and intercept [math]\displaystyle{ b \in \mathbf{R} }[/math]. In order to make predictions, we simply look at the sign of [math]\displaystyle{ f(x) }[/math]. A common choice to find the model parameters is by minimizing the regularized training error given by

      [math]\displaystyle{ E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w) }[/math]

      where [math]\displaystyle{ L }[/math] is a loss function that measures model (mis)fit and [math]\displaystyle{ R }[/math] is a regularization term (aka penalty) that penalizes model complexity; [math]\displaystyle{ \alpha \gt 0 }[/math] is a non-negative hyperparameter.
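To make the objective concrete, here is a minimal NumPy sketch (an editorial addition, not part of the quoted documentation) of plain SGD on E(w, b) with the squared loss and an L2 penalty; the decaying step size and the absence of averaging are simplifying assumptions.

```python
import numpy as np

def sgd_fit(X, y, alpha=1e-4, eta0=0.01, n_epochs=10, seed=0):
    """Plain SGD on E(w, b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w),
    with squared loss L and L2 penalty R(w) = 0.5 * ||w||^2."""
    rng = np.random.RandomState(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for epoch in range(n_epochs):
        eta = eta0 / (1.0 + epoch)                    # simple decaying step size
        for i in rng.permutation(n):
            residual = (w @ X[i] + b) - y[i]          # f(x_i) - y_i
            w -= eta * (residual * X[i] + alpha * w)  # gradient of loss + penalty
            b -= eta * residual                       # the intercept is not penalized
    return w, b
```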

      Different choices for [math]\displaystyle{ L }[/math] entail different classifiers and regressors, such as

      • Hinge: (soft-margin) Support Vector Machines.
      • Log: Logistic Regression.
      • Least-Squares: Ridge Regression.
      • Epsilon-Insensitive: (soft-margin) Support Vector Regression.
      (...)

      Popular choices for the regularization term [math]\displaystyle{ R }[/math] include:

      • L2 norm: [math]\displaystyle{ R(w) := \frac{1}{2} \sum_{j=1}^{m} w_j^2 }[/math],
      • L1 norm: [math]\displaystyle{ R(w) := \sum_{j=1}^{m} |w_j| }[/math], which leads to sparse solutions.
      • Elastic Net: [math]\displaystyle{ R(w) := \frac{\rho}{2} \sum_{j=1}^{m} w_j^2 + (1-\rho) \sum_{j=1}^{m} |w_j| }[/math], a convex combination of L2 and L1, where [math]\displaystyle{ \rho }[/math] is given by 1 - l1_ratio.
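In SGDRegressor these penalties are selected via the penalty and l1_ratio parameters; a brief sketch, with the alpha value as an illustrative assumption.

```python
from sklearn.linear_model import SGDRegressor

# Each penalty corresponds to one of the regularization terms R(w) above.
l2_reg = SGDRegressor(penalty="l2", alpha=1e-4)
l1_reg = SGDRegressor(penalty="l1", alpha=1e-4)             # tends to produce sparse w
enet_reg = SGDRegressor(penalty="elasticnet", alpha=1e-4,
                        l1_ratio=0.15)                       # rho = 1 - l1_ratio = 0.85
```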