An ADAptive Moment (ADAM) Estimation Algorithm is a gradient descent-based learning algorithm that is based on estimates of the first- and second-order statistical moments of the gradients, i.e. their mean and (uncentered) variance.

## References

### 2018a

• Adam[1] (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters $w^{(t)}$ and a loss function $L^{(t)}$, where $t$ indexes the current training iteration (indexed at $1$), Adam's parameter update is given by:

$$m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}$$

$$v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) \left(\nabla_w L^{(t)}\right)^2$$

$$\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^t}$$

$$\hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^t}$$

$$w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}$$

where $\epsilon$ is a small number used to prevent division by 0, and $\beta_1$ and $\beta_2$ are the forgetting factors for gradients and second moments of gradients, respectively.
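The update rule quoted above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. The function names `adam_step` and `minimize_quadratic`, the toy quadratic loss, and the default values (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`, which are commonly used defaults) are assumptions that do not appear in the quoted text. The sketch also assumes that both moment estimates start at zero and that the square and the division are applied element-wise.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters.

    w, grad, m, v are equal-length lists of floats; t is the 1-based
    iteration index. Returns the updated (w, m, v) as new lists.
    """
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, grad, m, v):
        # Running averages of the gradient and of the squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g_i
        v_i = beta2 * v_i + (1 - beta2) * g_i * g_i
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Parameter update; eps guards against division by zero.
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


def minimize_quadratic(steps=2000, lr=0.01):
    """Toy usage (hypothetical): minimise L(w) = sum((w_i - target_i)^2)."""
    target = [3.0, -1.0]
    w = [0.0, 0.0]
    m = [0.0, 0.0]  # first-moment estimates, assumed to start at zero
    v = [0.0, 0.0]  # second-moment estimates, assumed to start at zero
    for t in range(1, steps + 1):
        grad = [2.0 * (w_i - t_i) for w_i, t_i in zip(w, target)]
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w
```

With zero-initialised moments, the bias correction makes the first step move each parameter by approximately $\eta$ in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude. Calling `minimize_quadratic()` should return values close to the hypothetical target `[3.0, -1.0]`.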