Root Mean Square Propagation Algorithm (RMSprop)

References

2018a

• (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp Retrieved:2018-4-29.
• RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. [1] First, the running average of the squared gradient (the mean square) is computed:

$v(w,t):=\gamma v(w,t-1)+(1-\gamma)(\nabla Q_i(w))^2$

where $\gamma$ is the forgetting factor. The parameters are then updated, with $\eta$ the learning rate, as

$w:=w-\frac{\eta}{\sqrt{v(w,t)}}\nabla Q_i(w)$

RMSProp has shown excellent adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop that is capable of working with mini-batches as well, as opposed to only full batches.
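The two update rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `rmsprop_step`, the toy objective, the default values of `lr` and `gamma`, and the small constant `eps` (added to avoid dividing by zero; it does not appear in the formulas above) are all assumptions made for the example.

```python
import math


def rmsprop_step(w, v, grad, lr=0.01, gamma=0.9, eps=1e-8):
    """Apply one RMSProp update to a list of parameters.

    w, v and grad are equal-length lists of floats. eps is an assumption
    (not part of the quoted formulas) that guards against division by zero.
    """
    new_w, new_v = [], []
    for w_i, v_i, g_i in zip(w, v, grad):
        # v(w,t) := gamma * v(w,t-1) + (1 - gamma) * grad^2
        v_i = gamma * v_i + (1.0 - gamma) * g_i ** 2
        # w := w - eta / sqrt(v(w,t)) * grad
        w_i = w_i - lr / (math.sqrt(v_i) + eps) * g_i
        new_w.append(w_i)
        new_v.append(v_i)
    return new_w, new_v


def grad_q(w):
    """Gradient of the hypothetical toy objective Q(w) = w0^2 + 10 * w1^2."""
    return [2.0 * w[0], 20.0 * w[1]]


w = [1.0, 1.0]   # initial parameters
v = [0.0, 0.0]   # running mean squares start at zero
for _ in range(500):
    w, v = rmsprop_step(w, v, grad_q(w))
```

Because each gradient is divided by the root of its own running mean square, the two coordinates here move at nearly the same rate even though their curvatures differ by a factor of ten.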

2015

• (Misra, 2015) ⇒ Ishan Misra (2015). "Optimization for Deep Networks" (PDF)
• QUOTE: RMSProp = Rprop + SGD
• Tieleman & Hinton, 2012 (Coursera slide 29, Lecture 6)
• Scale updates similarly across mini-batches,
• Scale by decaying average of squared gradient,
$r_t=(1-\gamma)f'(\theta_t)^2+\gamma r_{t-1}$
$v_{t+1}=\frac{\alpha}{\sqrt{r_t}}f'(\theta_t)$,
$\theta_{t+1}=\theta_t-v_{t+1}$
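
As a worked illustration of the "Rprop + SGD" reading, consider the very first step, reading the update as $v_{t+1}=\frac{\alpha}{\sqrt{r_t}}f'(\theta_t)$. The starting value $r_{-1}=0$ and the choice $\gamma=0.9$ are assumptions made for this example, and $f'(\theta_0)\neq 0$ is assumed:

$r_0=(1-\gamma)f'(\theta_0)^2=0.1\,f'(\theta_0)^2$

$v_1=\frac{\alpha}{\sqrt{0.1\,f'(\theta_0)^2}}f'(\theta_0)=\frac{\alpha}{\sqrt{0.1}}\,\mathrm{sign}(f'(\theta_0))\approx 3.16\,\alpha\,\mathrm{sign}(f'(\theta_0))$

Under these assumptions the first update depends only on the sign of the gradient, as in Rprop; on later steps the decaying average $r_t$ blends in the magnitudes of earlier gradients.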