ADAptive Moment (ADAM) Estimation Algorithm

Context:
- It has the following variants:
  - Sparse Adam, e.g. torch.optim.SparseAdam [1];
  - Adamax, e.g. torch.optim.Adamax [2].
- …
Example(s):
- tf.train.AdamOptimizer [3],
- torch.optim.Adam [4],
- chainer.optimizers.Adam [5],
- tflearn.optimizers.Adam [6],
- ADAM from CRAN gradDescent Repository [7],
- Adam in Deeplearning4j [8].
- …
Counter-Example(s):
See: Stochastic Optimization, Convex Optimization, Learning Rate, Gradient Descent, Outer Product, Hadamard Matrix Product, Euclidean Norm, Proximal Function.

References

(Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam Retrieved: 2018-04-29.
- Adam^[1] (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters [math]\displaystyle{ w^{(t)} }[/math] and a loss function [math]\displaystyle{ L^{(t)} }[/math], where [math]\displaystyle{ t }[/math] indexes the current training iteration (indexed at [math]\displaystyle{ 1 }[/math]), Adam's parameter update is given by:
  : [math]\displaystyle{ m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)} }[/math]
  : [math]\displaystyle{ v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) (\nabla_w L^{(t)})^2 }[/math]
  : [math]\displaystyle{ \hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^t} }[/math]
  : [math]\displaystyle{ \hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^t} }[/math]
  : [math]\displaystyle{ w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon} }[/math]
  where [math]\displaystyle{ \epsilon }[/math] is a small number used to prevent division by 0, and [math]\displaystyle{ \beta_1 }[/math] and [math]\displaystyle{ \beta_2 }[/math] are the forgetting factors for gradients and second moments of gradients, respectively.
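
The update rule quoted above can be sketched in plain Python. This is a minimal illustration rather than any library's implementation: the function names `adam_step` and `minimize_quadratic`, the toy quadratic loss, and the default values (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are assumptions chosen for the example; `lr` plays the role of η.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of floats.

    t is the 1-based iteration index; m and v are the running averages of the
    gradient and of the squared gradient (both start at zero).
    """
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g_i        # first-moment average
        v_i = beta2 * v_i + (1 - beta2) * g_i * g_i  # second-moment average
        m_hat = m_i / (1 - beta1 ** t)               # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


def minimize_quadratic(steps=2000, lr=0.01):
    """Toy problem (hypothetical): minimize sum((w_i - target_i)^2) with Adam."""
    target = [3.0, -2.0]
    w = [0.0, 0.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        grad = [2 * (w_i - c_i) for w_i, c_i in zip(w, target)]
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w
```

Calling `minimize_quadratic()` should return values close to the target `[3.0, -2.0]`. The state `(m, v)` is passed in and returned explicitly so that the correspondence with the equations stays visible.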

(DL4J, 2018) ⇒ https://deeplearning4j.org/updater#adam Retrieved: 2018-04-29.
- QUOTE: ADAM uses both first-order moment [math]\displaystyle{ m_t }[/math] and second-order moment [math]\displaystyle{ g_t }[/math], but they both decay over time. Step size is approximately [math]\displaystyle{ \pm\alpha }[/math]. Step size will decrease as we approach the error minimum.
  - AdamUpdater in Deeplearning4j [9]
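
The claim that the step size is approximately ±α can be checked for the very first iteration, where the bias-corrected moments reduce to the gradient and its square, so the update is roughly −α·sign(g) whatever the gradient's scale. The sketch below assumes α is the learning rate and uses a hypothetical helper name `first_adam_step`; for later iterations the relation holds only approximately.

```python
import math


def first_adam_step(grad, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the parameter change made by Adam at iteration t = 1 (scalar case)."""
    m = (1 - beta1) * grad          # moments start from zero
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)         # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return -alpha * m_hat / (math.sqrt(v_hat) + eps)


# Gradients that differ by six orders of magnitude give nearly the same step.
small = first_adam_step(1e-3)
large = first_adam_step(1e3)
flipped = first_adam_step(-5.0)
```

Here `small` and `large` are both close to −0.001 and `flipped` is close to +0.001, which illustrates why the learning rate acts as an approximate bound on the per-parameter step.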

(Reddi et al., 2018) ⇒ Reddi, S. J., Kale, S., & Kumar, S. (2018, February). "On the Convergence of Adam and Beyond." In: International Conference on Learning Representations (PDF) [10].
- ABSTRACT: Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with “long-term memory” of past gradients, and propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
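
One of the variants proposed in this paper is commonly referred to as AMSGrad. It implements the "long-term memory" idea by dividing by the running maximum of the second-moment estimate instead of its current value. The scalar sketch below is an illustration under stated assumptions: the function name `amsgrad_step` is hypothetical, the learning rate is held constant for simplicity, no bias correction is applied, and the `eps` term is an addition for numerical safety.

```python
import math


def amsgrad_step(w, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AMSGrad-style update to a scalar parameter.

    v_max stores the largest second-moment estimate seen so far, so the
    effective step size cannot grow when recent gradients become small.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    v_max = max(v_max, v)  # long-term memory of past squared gradients
    w = w - lr * m / (math.sqrt(v_max) + eps)
    return w, m, v, v_max
```

After one large gradient followed by small ones, `v` decays while `v_max` stays at its peak, which is the behavioral difference from the plain exponential moving average used by Adam.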

(Kingma & Ba, 2015) ⇒ Diederik P. Kingma, and Jimmy Ba. (2015). “Adam: A Method for Stochastic Optimization.” In: Proceedings of the 3rd International Conference on Learning Representations (ICLR-2015).
- ABSTRACT: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
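
The AdaMax variant mentioned at the end of the abstract replaces the squared-gradient average with an exponentially weighted infinity norm. The scalar sketch below follows that idea; the function name `adamax_step` is hypothetical, the defaults (lr=0.002, beta1=0.9, beta2=0.999) are assumed for the example, and the `eps` guard against a zero denominator is an addition for numerical safety.

```python
def adamax_step(w, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdaMax-style update to a scalar parameter.

    t is the 1-based iteration index; m is the first-moment average and
    u is the exponentially weighted infinity norm of past gradients.
    """
    m = beta1 * m + (1 - beta1) * grad
    u = max(beta2 * u, abs(grad))  # infinity-norm accumulator
    w = w - (lr / (1 - beta1 ** t)) * m / (u + eps)
    return w, m, u
```

Only the first moment needs a bias correction in this form, because the max-based accumulator `u` is not biased towards zero in the way an average initialized at zero is.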

↑ Kingma, Diederik; Ba, Jimmy (2014). “Adam: A method for stochastic optimization”. arXiv:1412.6980