Momentum Gradient Descent (MGD)


A Momentum Gradient Descent (MGD) is a Gradient Descent-based Learning Algorithm that augments each parameter update with a momentum term, as in the classical momentum method or its Nesterov variant.



References

2018a

  • (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum Retrieved:2018-4-29.
    • Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' seminal paper on backpropagation learning.[1] Stochastic gradient descent with momentum remembers the update [math]\displaystyle{ \Delta w }[/math] at each iteration, and determines the next update as a linear combination of the gradient and the previous update:[2] [3]

      [math]\displaystyle{ \Delta w := \alpha \Delta w - \eta \nabla Q_i(w) }[/math]

      [math]\displaystyle{ w := w + \Delta w }[/math]

      that leads to:

      [math]\displaystyle{ w := w - \eta \nabla Q_i(w) + \alpha \Delta w }[/math]

      where the parameter [math]\displaystyle{ w }[/math] that minimizes [math]\displaystyle{ Q(w) }[/math] is to be estimated, [math]\displaystyle{ \eta }[/math] is a step size (sometimes called the learning rate in machine learning), and [math]\displaystyle{ \alpha }[/math] is the momentum coefficient.

      The name momentum stems from an analogy to momentum in physics: the weight vector [math]\displaystyle{ w }[/math], thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully for several decades.[4]
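
The two update equations above translate directly into code. The following is a minimal Python/NumPy sketch, not taken from the cited sources: the function name sgd_momentum and the toy quadratic objective are illustrative assumptions, with momentum playing the role of [math]\displaystyle{ \alpha }[/math] and lr the role of [math]\displaystyle{ \eta }[/math].

    import numpy as np

    def sgd_momentum(grad, w0, lr=0.01, momentum=0.9, n_iters=200):
        # Classical momentum update:
        #   dw := momentum * dw - lr * grad(w)   (i.e., alpha * dw - eta * grad Q_i(w))
        #   w  := w + dw
        w = np.array(w0, dtype=float)
        dw = np.zeros_like(w)
        for _ in range(n_iters):
            g = grad(w)                    # (stochastic) gradient at the current point
            dw = momentum * dw - lr * g    # combine previous update with the new gradient
            w = w + dw                     # apply the update
        return w

    # Toy usage (hypothetical objective): minimize Q(w) = ||w||^2, whose gradient is 2w;
    # the iterates converge toward the origin.
    print(sgd_momentum(lambda w: 2.0 * w, w0=[5.0, -3.0]))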



  1. Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. doi:10.1038/323533a0.
  2. Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. (June 2013). "On the importance of initialization and momentum in deep learning" (PDF). In: Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML-13). 28. Atlanta, GA. pp. 1139–1147. Retrieved 14 January 2016.
  3. Sutskever, Ilya (2013). Training Recurrent Neural Networks (PDF) (Ph.D. thesis). University of Toronto. p. 74.
  4. Zeiler, Matthew D. (2012). "ADADELTA: An adaptive learning rate method". arXiv:1212.5701.