Adaptive Gradient (AdaGrad) Algorithm

From GM-RKB

An Adaptive Gradient (AdaGrad) Algorithm is a gradient descent-based learning algorithm that maintains a separate learning rate for each parameter, scaled by that parameter's accumulated squared gradients.



References

2018a

2018b

  • (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad Retrieved:2018-4-22.
    • AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, first published in 2011.[1] Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate [math]\displaystyle{ \eta }[/math], but this is multiplied with the elements of a vector [math]\displaystyle{ \{G_{j,j}\} }[/math] which is the diagonal of the outer product matrix [math]\displaystyle{ G = \sum_{\tau=1}^t g_\tau g_\tau^\mathsf{T} }[/math] where [math]\displaystyle{ g_\tau = \nabla Q_i(w) }[/math], the gradient, at iteration [math]\displaystyle{ \tau }[/math]. The diagonal is given by [math]\displaystyle{ G_{j,j} = \sum_{\tau=1}^t g_{\tau,j}^2 }[/math]. This vector is updated after every iteration. The formula for an update is now [math]\displaystyle{ w := w - \eta\, \mathrm{diag}(G)^{-\frac{1}{2}} \circ g }[/math] or, written as per-parameter updates, [math]\displaystyle{ w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j. }[/math] Each [math]\displaystyle{ G_{i,i} }[/math] gives rise to a scaling factor for the learning rate that applies to a single parameter [math]\displaystyle{ w_i }[/math]. Since the denominator in this factor, [math]\displaystyle{ \sqrt{G_{i,i}} = \sqrt{\sum_{\tau=1}^t g_{\tau,i}^2} }[/math], is the [math]\displaystyle{ \ell_2 }[/math] norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[2] While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.
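
The per-parameter update quoted above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function name `adagrad_step`, the toy objective in `grad`, and the small constant `eps` (added to avoid division by zero before any gradient has accumulated) are assumptions that do not appear in the quoted formula.

```python
import math


def adagrad_step(w, g, G_diag, eta=0.1, eps=1e-8):
    """Apply one AdaGrad update to a list of parameters.

    w      : current parameter values
    g      : gradient evaluated at w
    G_diag : running sum of squared gradients per parameter (the G_{j,j})
    eta    : base learning rate
    eps    : assumed stabiliser, not part of the quoted formula
    """
    # G_{j,j} <- G_{j,j} + g_j^2
    new_G = [G_j + g_j * g_j for G_j, g_j in zip(G_diag, g)]
    # w_j <- w_j - eta / sqrt(G_{j,j}) * g_j
    new_w = [
        w_j - eta / (math.sqrt(G_j) + eps) * g_j
        for w_j, G_j, g_j in zip(w, new_G, g)
    ]
    return new_w, new_G


def grad(w):
    # Gradient of a hypothetical objective Q(w) = 0.5 * (10 * w[0]**2 + 0.1 * w[1]**2),
    # chosen so the two coordinates have very different gradient scales.
    return [10.0 * w[0], 0.1 * w[1]]


w = [1.0, 1.0]
G_diag = [0.0, 0.0]
for _ in range(100):
    w, G_diag = adagrad_step(w, grad(w), G_diag, eta=0.5)
```

In this toy setup the first gradient is `[10.0, 0.1]`, yet both coordinates move by about `eta` on the first step, because each gradient component is divided by its own accumulated magnitude. That is the per-parameter scaling the passage describes.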

2018c

2018d

  • (DL4J) ⇒ https://deeplearning4j.org/updater#adagrad Retrieved: 2018-04-29
    • QUOTE: Adagrad scales alpha for each parameter according to the history of gradients (previous steps) for that parameter. That’s basically done by dividing the current gradient in the update rule by the sum of previous gradients. As a result, when the gradient is very large, alpha is reduced, and vice-versa.
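
To make the "large gradient, smaller alpha" behaviour concrete, the following sketch tracks the effective learning rate of a single parameter over time. It divides by the square root of the accumulated squared gradients, as in the Wikipedia formulation quoted earlier, rather than by a plain sum. The helper name `effective_rates` and the constant `eps` are hypothetical and used only for illustration.

```python
import math


def effective_rates(eta, gradient_history, eps=1e-8):
    """Return the effective learning rate of one parameter after each step.

    eta              : base learning rate (alpha in the quote)
    gradient_history : sequence of past gradient values for this parameter
    eps              : assumed stabiliser to avoid division by zero
    """
    rates = []
    sum_sq = 0.0
    for g in gradient_history:
        sum_sq += g * g  # accumulate squared gradients
        rates.append(eta / (math.sqrt(sum_sq) + eps))
    return rates


# One parameter that sees large gradients and one that sees small gradients.
large = effective_rates(0.1, [3.0, 4.0])
small = effective_rates(0.1, [0.3, 0.4])
```

After two steps the parameter with the larger gradients ends up with an effective rate of roughly 0.02, while the one with the smaller gradients keeps a rate of roughly 0.2. In both cases the rate only shrinks as more gradients accumulate.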

2017

2016

2011