Stochastic Gradient Descent (SGD)-based Regression System

From GM-RKB

A Stochastic Gradient Descent (SGD)-based Regression System is a Linear Regression System that implements a Stochastic Gradient Descent Algorithm to solve a SGD Regression Task.
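As a concrete illustration, the following is a minimal sketch of such a system built on scikit-learn's SGDRegressor; the synthetic data and hyperparameter values are assumptions chosen only for the example.

```python
# Minimal sketch of an SGD-based regression system with scikit-learn's SGDRegressor.
# The synthetic data and hyperparameter values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                                    # 1,000 samples, 5 features
y = X @ np.array([1.5, -2.0, 0.0, 3.0, 0.5]) + 0.1 * rng.randn(1000)

# SGD is sensitive to feature scaling, so standardize the inputs first.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="l2", alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:3]))
```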



References

2017a

2017b

  • (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#regression Retrieved:2017-09-17
    • QUOTE: The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.

      The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:

      • loss="squared_loss": Ordinary least squares,
      • loss="huber": Huber loss for robust regression,
      • loss="epsilon_insensitive": linear Support Vector Regression.
      The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
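For illustration, a brief sketch of selecting these robust losses with SGDRegressor; the epsilon values below are assumptions chosen for the example, not recommendations.

```python
from sklearn.linear_model import SGDRegressor

# Robust regression with the Huber loss; epsilon marks where the loss switches
# from squared to linear and should reflect the scale of the target variable.
huber_reg = SGDRegressor(loss="huber", epsilon=1.0, max_iter=1000, random_state=0)

# Linear Support Vector Regression via the epsilon-insensitive loss; residuals
# smaller than epsilon incur no loss at all.
svr_like = SGDRegressor(loss="epsilon_insensitive", epsilon=0.1,
                        max_iter=1000, random_state=0)
```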

      SGDRegressor supports averaged SGD, just as SGDClassifier does. Averaging can be enabled by setting `average=True`.

      For regression with a squared loss and an l2 penalty, another variant of SGD with an averaging strategy is provided by the Stochastic Average Gradient (SAG) algorithm, which is available as a solver in Ridge.
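The two averaging strategies can be set up side by side; a hedged sketch, with alpha and the other settings as illustrative assumptions.

```python
from sklearn.linear_model import SGDRegressor, Ridge

# Averaged SGD: the final coefficients are the average of the SGD iterates.
avg_sgd = SGDRegressor(penalty="l2", alpha=1e-4, average=True,
                       max_iter=1000, random_state=0)

# Stochastic Average Gradient: an alternative averaging-style optimizer,
# exposed as a solver in Ridge for squared loss with an l2 penalty.
sag_ridge = Ridge(alpha=1.0, solver="sag", random_state=0)
```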

2017c

  • (Scikit Learn, 2017) ⇒ http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation Retrieved:2017-09-17
    • QUOTE: Given a set of training examples [math]\displaystyle{ (x_1, y_1), \ldots, (x_n, y_n) }[/math] where [math]\displaystyle{ x_i \in \mathbf{R}^m }[/math] and [math]\displaystyle{ y_i \in \{-1,1\} }[/math], our goal is to learn a linear scoring function [math]\displaystyle{ f(x) = w^T x + b }[/math] with model parameters [math]\displaystyle{ w \in \mathbf{R}^m }[/math] and intercept [math]\displaystyle{ b \in \mathbf{R} }[/math]. In order to make predictions, we simply look at the sign of [math]\displaystyle{ f(x) }[/math]. A common choice to find the model parameters is by minimizing the regularized training error given by

      [math]\displaystyle{ E(w,b) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w) }[/math]

      where [math]\displaystyle{ L }[/math] is a loss function that measures model (mis)fit and [math]\displaystyle{ R }[/math] is a regularization term (aka penalty) that penalizes model complexity; [math]\displaystyle{ \alpha \gt 0 }[/math] is a non-negative hyperparameter.
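To make the objective concrete, here is a minimal NumPy sketch (an editorial addition, not part of the quoted documentation) of plain SGD on E(w, b) with the squared loss and an L2 penalty; the decaying step size and the absence of averaging are simplifying assumptions.

```python
import numpy as np

def sgd_fit(X, y, alpha=1e-4, eta0=0.01, n_epochs=10, seed=0):
    """Plain SGD on E(w, b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w),
    with squared loss L and L2 penalty R(w) = 0.5 * ||w||^2."""
    rng = np.random.RandomState(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for epoch in range(n_epochs):
        eta = eta0 / (1.0 + epoch)                    # simple decaying step size
        for i in rng.permutation(n):
            residual = (w @ X[i] + b) - y[i]          # f(x_i) - y_i
            w -= eta * (residual * X[i] + alpha * w)  # gradient of loss + penalty
            b -= eta * residual                       # the intercept is not penalized
    return w, b
```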

      Different choices for [math]\displaystyle{ L }[/math] entail different classifiers and regressors, such as

      • Hinge: (soft-margin) Support Vector Machines.
      • Log: Logistic Regression.
      • Least-Squares: Ridge Regression.
      • Epsilon-Insensitive: (soft-margin) Support Vector Regression.
      (...)

      Popular choices for the regularization term [math]\displaystyle{ R }[/math] include:

      • L2 norm: [math]\displaystyle{ R(w) := \frac{1}{2} \sum_{j=1}^{m} w_j^2 }[/math],
      • L1 norm: [math]\displaystyle{ R(w) := \sum_{j=1}^{m} |w_j| }[/math], which leads to sparse solutions.
      • Elastic Net: [math]\displaystyle{ R(w) := \frac{\rho}{2} \sum_{j=1}^{m} w_j^2 + (1-\rho) \sum_{j=1}^{m} |w_j| }[/math], a convex combination of L2 and L1, where [math]\displaystyle{ \rho }[/math] is given by 1 - l1_ratio.
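In SGDRegressor these penalties are selected via the penalty and l1_ratio parameters; a brief sketch, with the alpha value as an illustrative assumption.

```python
from sklearn.linear_model import SGDRegressor

# Each penalty corresponds to one of the regularization terms R(w) above.
l2_reg = SGDRegressor(penalty="l2", alpha=1e-4)
l1_reg = SGDRegressor(penalty="l1", alpha=1e-4)             # tends to produce sparse w
enet_reg = SGDRegressor(penalty="elasticnet", alpha=1e-4,
                        l1_ratio=0.15)                       # rho = 1 - l1_ratio = 0.85
```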