# Stochastic Gradient Descent (SGD) Algorithm

A Stochastic Gradient Descent (SGD) Algorithm is an approximate gradient descent algorithm that is a stochastic optimization algorithm which can be implemented by an SGD System (to solve an SGD task).

**Context:**- It can pick a random training example [math](x_t, y_t)[/math] at each iteration step.
- It can update the current estimate [math]w[/math] at step [math]t+1[/math] by:
- [math]w_{t+1} \leftarrow w_t - \frac{\eta}{t} \frac{\partial}{\partial w} \ell(f_{w(t)}(x_t),y_t).[/math]
- [math]w_{t+1} \leftarrow w_t - \alpha \nabla Q(w) = w - \alpha \sum_{i=1}^n \nabla Q_i(w),[/math] where [math]Q_i(w)[/math] is the value of loss function at [math]i[/math]-th example, and [math]Q(w)[/math] is the empirical risk.

- It can range from being a First-Order Stochastic Gradient Descent Optimization Algorithm to being a Second-Order Stochastic Gradient Descent Optimization Algorithm.
- It can range from being an Offline Stochastic Gradient Descent Algorithm to being an Online Stochastic Gradient Descent Algorithm.
- It can range from being a Regularized SGD Algorithm to being a Non-Regularized SGD Algorithm.
- It can range from being a Parallelizable SGD Algorithm to being a Non-Parallelized SGD Algorithm.

**Example(s):**- SGD Backpropagation,
- SGD MCMC,
- Kalman-based Stochastic Gradient Descent (kSGD).
- Mini-Batch Stochastic Gradient Descent Algorithm (MBSGD).
- Stochastic Average Gradient (SAG),
- Semi-Stochastic Gradient Descent (SSGD),
- Stochastic Recursive Gradient Algorithm (SARAH),
- Stochastic Variance Reduced Gradient (SVRG).
- …

**Counter-Example(s):****See:**Stochastic Algorithm, Sum of Differentiable Functions, Batch Gradient Descent Algorithm.

## References

### 2018a

- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent Retrieved:2018-4-29.
**Stochastic gradient descent**(often shortened to**SGD**), also known as**incremental**gradient descent, is a stochastic approximation of the gradient descent optimization and iterative method for minimizing an objective function that is written as a sum of differentiable functions.In other words, SGD tries to find minima or maxima by iteration.

### 2018b

- (Wijaya et al., 2018) ⇒ Galih Praja Wijaya, Dendi Handian, Imam Fachmi Nasrulloh, Lala Septem Riza, Rani Megasari, Enjun Junaeti (2018), "gradDescent: Gradient Descent for Regression Tasks" (PDF).
- QUOTE: An implementation of various learning algorithms based on Gradient Descent for dealing with regression tasks. The variants of gradient descent algorithm are: Mini-Batch Gradient Descent (MBGD), which is an optimization to use training data partially to reduce the computation load. Stochastic Gradient Descent (SGD), which is an optimization to use a random data in learning to reduce the computation load drastically. Stochastic Average Gradient (SAG), which is a SGD-based algorithm to minimize stochastic step to average. Momentum Gradient Descent (MGD), which is an optimization to speed-up gradient descent learning. Accelerated Gradient Descent (AGD), which is an optimization to accelerate gradient descent learning. Adagrad, which is a gradient-descent-based algorithm that accumulate previous cost to do adaptive learning. Adadelta, which is a gradient-descent-based algorithm that use hessian approximation to do adaptive learning. RMSprop, which is a gradient-descent-based algorithm that combine Adagrad and Adadelta adaptive learning ability. Adam, which is a gradient-descent-based algorithm that mean and variance moment to do adaptive learning. Stochastic Variance Reduce Gradient (SVRG), which is an optimization SGD-based algorithm to accelerates the process toward converging by reducing the gradient. Semi Stochastic Gradient Descent (SSGD),which is a SGD-based algorithm that combine GD and SGD to accelerates the process toward converging by choosing one of the gradients at a time. Stochastic Recursive Gradient Algorithm (SARAH), which is an optimization algorithm similarly SVRG to accelerates the process toward converging by accumulated stochastic information. Stochastic Recursive Gradient Algorithm+ (SARAHPlus), which is a SARAH practical variant algorithm to accelerates the process toward converging provides a possibility of earlier termination.

### 2018c

- (ML Glossary) ⇒ stochastic gradient descent (SGD) https://developers.google.com/machine-learning/glossary/#stochastic_gradient_descent_(SGD) Retrieved: 2018-04-29
- QUOTE: A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a data set to calculate an estimate of the gradient at each step.

### 2015

- (Ge et al., 2015) ⇒ Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. (2015). “Escaping From Saddle Points-Online Stochastic Gradient for Tensor Decomposition.” In: COLT, pp. 797-842.

### 2014

- (Scikit-learn, 2014) ⇒ http://scikit-learn.org/0.11/modules/sgd.html
- QUOTE: Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.

- The advantages of Stochastic Gradient Descent are:
- Efficiency.
- Ease of implementation (lots of opportunities for code tuning).

- The disadvantages of Stochastic Gradient Descent include:
- SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
- SGD is sensitive to feature scaling.

- QUOTE: Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.

### 2012a

- (Wikipedia, 2012) ⇒ http://en.wikipedia.org/wiki/Stochastic_gradient_descent
**Stochastic gradient descent**is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.

### 2012b

- http://en.wikipedia.org/wiki/Stochastic_gradient_descent#Background
- Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: :[math]Q(w) = \sum_{i=1}^n Q_i(w),[/math] where the parameter [math]w[/math] is to be estimated and where typically each summand function [math]Q_i( )[/math] is associated with the [math]i[/math]-th observation in the data set (used for training).
In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation, as shown for example by Thomas Ferguson's example.

^{[1]}Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).The sum-minimization problem also arises for empirical risk minimization: In this case, [math]Q_i(w)[/math] is the value of loss function at [math]i[/math]-th example, and [math]Q(w)[/math] is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations: : [math]w := w - \alpha \nabla Q(w) = w - \alpha \sum_{i=1}^n \nabla Q_i(w),[/math] where [math]\alpha[/math] is a step size (sometimes called the

*learning rate*in machine learning). In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is veryveffective in the case of large-scale machine learning problems.

^{[2]}

- Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: :[math]Q(w) = \sum_{i=1}^n Q_i(w),[/math] where the parameter [math]w[/math] is to be estimated and where typically each summand function [math]Q_i( )[/math] is associated with the [math]i[/math]-th observation in the data set (used for training).

### 2010

- (Bottou, 2010) ⇒ Léon Bottou. (2010). “Large-scale Machine Learning with Stochastic Gradient Descent.” In: Proceedings of COMPSTAT'2010.
- QUOTE: Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
- Presentation Slides: http://rocq.inria.fr/axis/COMPSTAT2010/slides/slides_17.pdf

### 2007

- (Bottou & Bousquet, 2007) ⇒ Léon Bottou, and Olivier Bousquet. (2007). “The Tradeoffs of Large Scale Learning.” In: Advances in Neural Information Processing Systems (NIPS 2007)
- QUOTE: Stochastic Gradient Descent (SGD) picks a random training example [math](x_t, y_t)[/math] at each iteration and updates the parameter [math]w[/math] on the basis of this example only, [math]w(t + 1) = w(t) - \frac{\eta}{t} \frac{\partial}{\partial w} \ell(f_{w(t)}(x_t),y_t).[/math]

### 2006

- (Vishwanathan et al., 2006) ⇒ S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. (2006). “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In: Proceedings of the 23rd International Conference on Machine learning (ICML-2006). doi:10.1145/1143844.1143966
- QUOTE: We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.
... (In this paper, “training” specifically means penalized maximum likelihood parameter estimation)

... Stochastic gradient methods, on the other hand, are online and scale sub-linearly with the amount of training data, making them very attractive for large data sets;

- QUOTE: We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.

### 1992

- (Polyak et al., 1992) ⇒ Boris T. Polyak, and Anatoli B. Juditsky. (1992). “Acceleration of Stochastic Approximation by Averaging.” In: SIAM Journal on Control and Optimization, 30(4). doi:10.1137/0330046

### 1952

- (Kiefer & Wolfovitz, 1952) ⇒ E. Kiefer, and J. Wolfovitz. (1952). “Stochastic Estimation of the Maximum of a Regression Function.” In: Ann. Math. Statist., 23.

### 1951

- (Robbins & Monroe, 1951) ⇒ H. Robbins, and S. Monroe. (1951). “A Stochastic-Approximation Method.” In: Ann. Math. Statist., 22.

- ↑ Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate".
*J. Am. Stat. Assoc.***77**(380): 831–834. JSTOR 2287314. - ↑ Template:Citation