# Gradient-Descent Optimization Algorithm

A Gradient-Descent Optimization Algorithm is an first-order iterative optimization algorithm (to solve a gradient descent-based optimization task) that uses a differentiable gradient function.

**AKA:**Method of Steepest Descent.**Context:**- It can range from being a Batch Gradient Descent Algorithm to being an Online Gradient Descent Algorithm.
- It can range from being an Exact Gradient Descent Algorithm to being a Approximate Gradient Descent Algorithm (such as SGD).
- It can be implemented by a Gradient Descent-based Optimization System (that can solve a gradient descent-based optimization task).
- …

**Example(s):**- Newton's Method;
- Conjugate Gradient Method;
- Gradient Descent-based Learning Algorithm, such as:
- a gradient-descent boosted learning algorithm,
- an Accelerated Gradient Descent (AGD),
- an Adaptive Gradient Algorithm (AdaGrad),
- an Adaptive Learning Rate Algorithm (ADADELTA),
- an Adaptive Moment Estimation Algorithm (Adam),
- a Mini-Batch Gradient Descent Algorithm (MBGD).
- a Momentum Gradient Descent (MGD),
- a Root Mean Square Propagation Algorithm (RMSprop);

- a Stochastic Gradient Descent-based Algorithm such as:
- …

**Counter-Example(s):**- Gauss-Newton Method.
- Genetic Algorithm, which does not use the concept of gradients for optimization.
- Simulated Annealing Algorithm, which follows a probabilistic method for approximating the global optimum of a given function.

**See:**Finite Element Algorithm, Error Surface, Steepest-Descent Numerical Integration Algorithm.

## References

### 2024

- GPT-4

Procedure: Gradient-Descent Optimization Description: This algorithm finds the minimum of a differentiable function using the gradient descent method, adjusting parameters iteratively to minimize the function's output.

Initialization: - Input: - f: A differentiable function whose minimum is to be found. - grad_f: The gradient function of f. - x_0: Initial guess for the location of the minimum. - alpha: Initial learning rate. - tolerance: The tolerance level for stopping criteria. - max_iterations: Maximum number of iterations allowed. - Output: - x_min: The estimated location of the minimum. - f_min: The value of f at x_min.

Algorithm Steps: - Initialize: - x := x_0 - iteration := 0 - Repeat until convergence or maximum iterations reached: - if |grad_f(x)| < tolerance or iteration >= max_iterations: - break - Update x: - x := x - alpha * grad_f(x) - Update iteration count: - iteration := iteration + 1 - Finalize: - x_min := x - f_min := f(x_min)

Output Results: - Return: (x_min, f_min)

### 2019

- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Gradient_descent Retrieved:2019-9-20.
**Gradient descent**is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the*negative*of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the*positive*of the gradient, one approaches a local maximum of that function; the procedure is then known as**gradient ascent**. Gradient descent was originally proposed by Cauchy in 1847.^{[1]}^{[2]}Gradient descent is also known as

**steepest descent**. However, gradient descent should not be confused with the method of steepest descent for approximating integrals.

### 2018b

- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm#Related_algorithms Retrieved:2018-8-19.
- In a quasi-Newton method, such as that due to Davidon, Fletcher and Powell or Broyden–Fletcher–Goldfarb–Shanno (BFGS method) an estimate of the full Hessian [math]\displaystyle{ \frac{\partial^2 S}{\partial \beta_j \partial\beta_k} }[/math] is built up numerically using first derivatives [math]\displaystyle{ \frac{\partial r_i}{\partial\beta_j} }[/math] only so that after
*n*refinement cycles the method closely approximates to Newton's method in performance. Note that quasi-Newton methods can minimize general real-valued functions, whereas Gauss–Newton, Levenberg–Marquardt, etc. fits only to nonlinear least-squares problems.Another method for solving minimization problems using only first derivatives is gradient descent. However, this method does not take into account the second derivatives even approximately. Consequently, it is highly inefficient for many functions, especially if the parameters have strong interactions.

- In a quasi-Newton method, such as that due to Davidon, Fletcher and Powell or Broyden–Fletcher–Goldfarb–Shanno (BFGS method) an estimate of the full Hessian [math]\displaystyle{ \frac{\partial^2 S}{\partial \beta_j \partial\beta_k} }[/math] is built up numerically using first derivatives [math]\displaystyle{ \frac{\partial r_i}{\partial\beta_j} }[/math] only so that after

### 2018a

- (Wijaya et al., 2018) ⇒ Galih Praja Wijaya, Dendi Handian, Imam Fachmi Nasrulloh, Lala Septem Riza, Rani Megasari, Enjun Junaeti (2018), "gradDescent: Gradient Descent for Regression Tasks" (PDF).
- QUOTE: An implementation of various learning algorithms based on Gradient Descent for dealing with regression tasks. The variants of gradient descent algorithm are: Mini-Batch Gradient Descent (MBGD), which is an optimization to use training data partially to reduce the computation load. Stochastic Gradient Descent (SGD), which is an optimization to use a random data in learning to reduce the computation load drastically. Stochastic Average Gradient (SAG), which is a SGD-based algorithm to minimize stochastic step to average. Momentum Gradient Descent (MGD), which is an optimization to speed-up gradient descent learning. Accelerated Gradient Descent (AGD), which is an optimization to accelerate gradient descent learning. Adagrad, which is a gradient-descent-based algorithm that accumulate previous cost to do adaptive learning. Adadelta, which is a gradient-descent-based algorithm that use hessian approximation to do adaptive learning. RMSprop, which is a gradient-descent-based algorithm that combine Adagrad and Adadelta adaptive learning ability. Adam, which is a gradient-descent-based algorithm that mean and variance moment to do adaptive learning. Stochastic Variance Reduce Gradient (SVRG), which is an optimization SGD-based algorithm to accelerates the process toward converging by reducing the gradient. Semi Stochastic Gradient Descent (SSGD),which is a SGD-based algorithm that combine GD and SGD to accelerates the process toward converging by choosing one of the gradients at a time. Stochastic Recursive Gradient Algorithm (SARAH), which is an optimization algorithm similarly SVRG to accelerates the process toward converging by accumulated stochastic information. Stochastic Recursive Gradient Algorithm+ (SARAHPlus), which is a SARAH practical variant algorithm to accelerates the process toward converging provides a possibility of earlier termination.

### 2018b

- (ML Glossary) ⇒ gradient descent https://developers.google.com/machine-learning/glossary/#gradient_descent Retrieved: 2018-04-29
- QUOTE: A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.

### 2012

- http://en.wikipedia.org/wiki/Gradient_descent#Description
- QUOTE: Gradient descent is based on the observation that if the multivariable function [math]\displaystyle{ F(\mathbf{x}) }[/math] is defined and differentiable in a neighborhood of a point [math]\displaystyle{ \mathbf{a} }[/math], then [math]\displaystyle{ F(\mathbf{x}) }[/math] decreases
*fastest*if one goes from [math]\displaystyle{ \mathbf{a} }[/math] in the direction of the negative gradient of [math]\displaystyle{ F }[/math] at [math]\displaystyle{ \mathbf{a} }[/math], [math]\displaystyle{ -\nabla F(\mathbf{a}) }[/math]. It follows that, if :[math]\displaystyle{ \mathbf{b} = \mathbf{a}-\gamma\nabla F(\mathbf{a}) }[/math] for [math]\displaystyle{ \gamma \to 0 }[/math] a small enough number, then [math]\displaystyle{ F(\mathbf{a})\geq F(\mathbf{b}) }[/math]. With this observation in mind, one starts with a guess [math]\displaystyle{ \mathbf{x}_0 }[/math] for a local minimum of [math]\displaystyle{ F }[/math], and considers the sequence [math]\displaystyle{ \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \dots }[/math] such that :[math]\displaystyle{ \mathbf{x}_{n+1}=\mathbf{x}_n-\gamma_n \nabla F(\mathbf{x}_n),\ n \ge 0. }[/math] We have :[math]\displaystyle{ F(\mathbf{x}_0)\ge F(\mathbf{x}_1)\ge F(\mathbf{x}_2)\ge \cdots, }[/math] so hopefully the sequence [math]\displaystyle{ (\mathbf{x}_n) }[/math] converges to the desired local minimum. Note that the value of the*step size*[math]\displaystyle{ \gamma }[/math] is allowed to change at every iteration. With certain assumptions on the function [math]\displaystyle{ F }[/math] (for example, [math]\displaystyle{ F }[/math] convex and [math]\displaystyle{ \nabla F }[/math] Lipschitz) and particular choices of [math]\displaystyle{ \gamma }[/math] (e.g., chosen via a line search that satisfies the Wolfe conditions), convergence to a local minimum can be guaranteed. When the function [math]\displaystyle{ F }[/math] is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.This process is illustrated in the picture to the right. Here [math]\displaystyle{ F }[/math] is assumed to be defined on the plane, and that its graph has a bowl shape. The blue curves are the contour lines, that is, the regions on which the value of [math]\displaystyle{ F }[/math] is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is orthogonal to the contour line going through that point. We see that gradient

*descent*leads us to the bottom of the bowl, that is, to the point where the value of the function [math]\displaystyle{ F }[/math] is minimal.

- QUOTE: Gradient descent is based on the observation that if the multivariable function [math]\displaystyle{ F(\mathbf{x}) }[/math] is defined and differentiable in a neighborhood of a point [math]\displaystyle{ \mathbf{a} }[/math], then [math]\displaystyle{ F(\mathbf{x}) }[/math] decreases

### 1999

- (Orr, 1999a) ⇒ Genevieve Orr. (1999). “Linear Neural Networks.” In: "CS-449: Neural Networks." Fall 99
- To find the gradient G for the entire data set, we sum at each weight the contribution given by equation 6 over all the data points. We can then subtract a small proportion µ (called the learning rate) of G from the weights to perform gradient descent.
- 1. Initialize all weights to small random values.
- 2. REPEAT until done
- 1. For each weight wij set
- 2. For each data point (x, t)p
- 1. set input units to x
- 2. compute value of output units
- 3. For each weight wij set

- 3. For each weight wij set
- An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.