# Backpropagation of Errors (BP)-based Training Algorithm

A Backpropagation of Errors (BP)-based Training Algorithm is a feed-forward supervised neural network training algorithm that applies a delta rule to feed the intermediate NNets loss function gradient (which is used to update the neuron weights to minimize the loss function).

**Context:**- It can range from being a Batch Backpropagation Algorithm to being an Iterative Backpropagation Algorithm.
- It can (typically) be guaranteed to convergence.
- It cab (often) be a Slow Algorithm (because it can converge slowly to a solution).
- It can converge to a local minimum on the Error Surface.
- It can make use of the Chain Rule.
- …

**Example(s):****Counter-Example(s):****See:**Gradient Descent Algorithm; Artificial Neural Networks, Chain Rule, Activation Function.

## References

### 2018

- (Google ML Glossary, 2018) ⇒ (2018). Backpropagation. In: Machine Learning Glossary https://developers.google.com/machine-learning/glossary/ Retrieved: 2018-04-22.
- QUOTE: The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.

### 2017a

- (Munro, 2017) ⇒ Munro P. (2017) "Backpropagation". In: Sammut, C., Webb, G.I. (eds) "Encyclopedia of Machine Learning and Data Mining". Springer, Boston, MA
- QUOTE: Backpropagation of error (henceforth BP) is a method for training feed-forward neural networks see Artificial Neural Networks. A specific implementation of BP is an iterative procedure that adjusts network weight parameters according to the gradient of an error measure. The procedure is implemented by computing an error value for each output unit, and by backpropagating the error values through the network.

### 2017b

- (DL Stanford, 2017) ⇒ http://deeplearning.stanford.edu/wiki/index.php/Backpropagation_Algorithm Retrieved:2017-12-17
- QUOTE: Suppose we have a fixed training set [math]\displaystyle{ \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \} }[/math] of [math]\displaystyle{ m }[/math] training examples. We can train our neural network using batch gradient descent. In detail, for a single training example [math]\displaystyle{ (x,y) }[/math], we define the cost function with respect to that single example to be:

- [math]\displaystyle{ \begin{align}J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.\end{align} }[/math]
This is a (one-half) squared-error cost function. Given a training set of [math]\displaystyle{ m }[/math] examples, we then define the overall cost function to be:

- [math]\displaystyle{ \begin{align}J(W,b)&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right]
+ \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
\\
&= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right]
+ \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
\end{align} }[/math]
The first term in the definition of [math]\displaystyle{ J(W,b) }[/math] is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.

- [math]\displaystyle{ \begin{align}J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.\end{align} }[/math]

### 2014

- (Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/backpropagation Retrieved:2014-9-23.
**Backpropagation**, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respects to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.

This article provides details on how backpropagation works. For programmers attempting to implement a system, the tutorials linked in the External Links may be more useful.

### 2014b

- (IEEE Spectrum, 2014) ⇒ http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts/
- QUOTE: Michael I. Jordan: … The algorithm that has proved the most successful for deep learning is based on a technique called back propagation. You have these layers of processing units, and you get an output from the end of the layers, and you propagate a signal backwards through the layers to change all the parameters. It’s pretty clear the brain doesn’t do something like that. …

### 1986

- (Rumlhart et al., 1986) ⇒ David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. (1986). “internal representations by error propagation.” In: D. E. Rumelhart (editor) & James L. McClelland (editor). “Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations.” MIT Press. ISBN:026268053X
- ABSTRACT: We describe a new learning procedure, back-propagation, for networks of neuron-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ’hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions or these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.