ADAptive Moment (ADAM) Estimation Algorithm

An ADAptive Moment (ADAM) Estimation Algorithm is a gradient descent-based learning algorithm that is based on first- and second-order statistical moments of the gradient, i.e. its mean and variance.

References

2018a

  • (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam Retrieved:2018-4-29.
    • Adam[1] (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters [math]\displaystyle{ w^{(t)} }[/math] and a loss function [math]\displaystyle{ L^{(t)} }[/math], where [math]\displaystyle{ t }[/math] indexes the current training iteration (indexed at [math]\displaystyle{ 1 }[/math]), Adam's parameter update is given by:
      [math]\displaystyle{ m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)} }[/math]
      [math]\displaystyle{ v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) (\nabla_w L^{(t)})^2 }[/math]
      [math]\displaystyle{ \hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^t} }[/math]
      [math]\displaystyle{ \hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^t} }[/math]
      [math]\displaystyle{ w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon} }[/math]
      where [math]\displaystyle{ \epsilon }[/math] is a small number used to prevent division by 0, and [math]\displaystyle{ \beta_1 }[/math] and [math]\displaystyle{ \beta_2 }[/math] are the forgetting factors for gradients and second moments of gradients, respectively.
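
As an illustration of the update rule quoted above, the following is a minimal NumPy sketch of a single Adam step. It is not taken from the cited sources; the function name adam_step is hypothetical, and the default hyperparameter values (eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly suggested defaults, assumed here for the sketch.

    import numpy as np

    def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Running averages of the gradient and of its elementwise square
        # (the first and second moments in the formulas above).
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction, with the iteration t indexed from 1.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Parameter update; eps prevents division by zero.
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    # Example usage: minimize L(w) = ||w||^2, whose gradient is 2w.
    w = np.array([5.0, -3.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, 501):
        w, m, v = adam_step(w, 2 * w, m, v, t, eta=0.05)
    # After the loop, w is close to the minimizer [0, 0].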

  1. Kingma, Diederik P.; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization". arXiv:1412.6980