# ADAptive Moment (ADAM) Estimation Algorithm

An ADAptive Moment (ADAM) Estimation Algorithm is a gradient descent-based learning algorithm that is based on first- and second-order statistical moments (i.e. the mean and variance) of the gradients.

**AKA:** Adam Optimizer.

**Context:**
- It has the following variants:
  - Sparse Adam, e.g. `torch.optim.SparseAdam` [1];
  - Adamax, e.g. `torch.optim.Adamax` [2].

**Example(s):**
**Counter-Example(s):**
**See:** Stochastic Optimization, Convex Optimization, Learning Rate, Gradient Descent, Outer Product, Hadamard Matrix Product, Euclidean Norm, Proximal Function.

## References

### 2018a

- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam Retrieved: 2018-4-29.
- QUOTE: *Adam*^{[1]} (short for Adaptive Moment Estimation) is an update to the *RMSProp* optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters [math]w^{(t)}[/math] and a loss function [math]L^{(t)}[/math], where [math]t[/math] indexes the current training iteration (indexed at [math]1[/math]), Adam's parameter update is given by:

 [math]m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}[/math]
 [math]v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) (\nabla_w L^{(t)})^2[/math]
 [math]\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^t}[/math]
 [math]\hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^t}[/math]
 [math]w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}[/math]

 where [math]\epsilon[/math] is a small number used to prevent division by 0, and [math]\beta_1[/math] and [math]\beta_2[/math] are the forgetting factors for gradients and second moments of gradients, respectively.
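The update rule above can be sketched in plain Python for a single scalar parameter. This is a minimal illustration, not a reference implementation: the hyper-parameter defaults follow values commonly reported for Adam, and the quadratic loss [math]L(w) = w^2[/math] used to exercise it is an arbitrary example.

```python
import math

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# example: minimize L(w) = w**2 (gradient 2w), starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, 2 * w, t, eta=0.01)
```

Note that the moment estimates `m` and `v` are state carried across iterations, which is why the function returns them alongside the updated parameter.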

### 2018b

- (sklearn, 2018) ⇒ http://scikit-learn.org/stable/modules/neural_networks_supervised.html#algorithms Retrieved: 2018-04-22.
- QUOTE: MLP trains using Stochastic Gradient Descent, Adam, or L-BFGS. Stochastic Gradient Descent (SGD) updates parameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e.
[math]w \leftarrow w - \eta (\alpha \frac{\partial R(w)}{\partial w}+ \frac{\partial Loss}{\partial w})[/math]

where [math]\eta[/math] is the learning rate which controls the step-size in the parameter space search. Loss is the loss function used for the network.

More details can be found in the documentation of SGD.

Adam is similar to SGD in a sense that it is a stochastic optimizer, but it can automatically adjust the amount to update parameters based on adaptive estimates of lower-order moments.

With SGD or Adam, training supports online and mini-batch learning.

L-BFGS is a solver that approximates the Hessian matrix which represents the second-order partial derivative of a function. Further it approximates the inverse of the Hessian matrix to perform parameter updates. The implementation uses the Scipy version of L-BFGS.
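The SGD rule quoted above can be written as a one-line update. In this sketch the penalty term [math]R(w) = w^2/2[/math] (so [math]\partial R/\partial w = w[/math]) is an assumed L2 regularizer, chosen only to make the formula concrete; the usage example disables it with `alpha=0`.

```python
def sgd_step(w, grad_loss, eta=0.01, alpha=0.0001):
    """w <- w - eta * (alpha * dR(w)/dw + dLoss/dw),
    assuming an L2 penalty R(w) = w**2 / 2, so dR/dw = w."""
    return w - eta * (alpha * w + grad_loss)

# example: minimize Loss(w) = (w - 3)**2 with no penalty (alpha = 0)
w = 0.0
for _ in range(100):
    w = sgd_step(w, 2 * (w - 3), eta=0.1, alpha=0.0)
```

Unlike Adam, this update keeps no state between iterations; the step size is controlled entirely by the fixed learning rate [math]\eta[/math].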

### 2018c

- (DL4J, 2018) ⇒ https://deeplearning4j.org/updater#adam Retrieved: 2018-04-29.
- QUOTE: ADAM uses both first-order moment [math]m_t[/math] and second-order moment [math]g_t[/math], but they both decay over time. Step size is approximately [math]\pm\alpha[/math]. Step size will decrease as we approach the error minimum.
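The claim that the step size is approximately [math]\pm\alpha[/math] can be checked numerically: with a steady gradient, the bias-corrected first moment and the square root of the bias-corrected second moment cancel, so the step magnitude is roughly [math]\alpha[/math] regardless of the gradient's scale. A small sketch using the standard Adam moment updates:

```python
import math

beta1, beta2, alpha, eps = 0.9, 0.999, 0.001, 1e-8
m = v = 0.0
g = 5.0  # any constant gradient magnitude
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # step taken at iteration t, with bias correction
    step = alpha * (m / (1 - beta1 ** t)) / (math.sqrt(v / (1 - beta2 ** t)) + eps)

# with a steady gradient, m_hat / sqrt(v_hat) ≈ 1, so step ≈ alpha
```

Near a minimum the gradients shrink and change sign, [math]\hat{m}_w[/math] averages toward zero faster than [math]\sqrt{\hat{v}_w}[/math], and the effective step size decreases, matching the quote.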

### 2018d

- (Reddi et al., 2018) ⇒ Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. (2018). "On the Convergence of Adam and Beyond." In: Proceedings of the 6th International Conference on Learning Representations (ICLR-2018).
- ABSTRACT: Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients, and propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.

### 2015

- (Kingma & Ba, 2015) ⇒ Diederik P. Kingma, and Jimmy Ba. (2015). “Adam: A Method for Stochastic Optimization.” In: Proceedings of the 3rd International Conference for Learning Representations (ICLR-2015).
- ABSTRACT: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

- ↑ Kingma, Diederik; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization". arXiv:1412.6980