Backpropagation of Errors (BP)-based Training Algorithm

A Backpropagation of Errors (BP)-based Training Algorithm is a feed-forward supervised neural network training algorithm that applies a delta rule to feed the intermediate NNets loss function gradient (which is used to update the neuron weights to minimize the loss function).

Context:
- It can (typically) be guaranteed to convergence to a solution (although this may be a local minimum on the error surface).
- It can (typically) be used with Differentiable Activation Functions like ReLU, Sigmoid, and Tanh (to ensure the gradients can be propagated through the network).
- It can (often) be a Slow Algorithm (because it can converge slowly to a solution).
- It can (often) use Normalization Techniques such as Batch Normalization (to improve training stability and speed).
- It can (often) use Regularization Techniques like dropout technique (to prevent overfitting in large networks).
- ...
- It can range from being a Batch Backpropagation Algorithm to being an Iterative Backpropagation Algorithm.
- ...
- It can make use of the Chain Rule to compute gradients for each layer in the network.
- It can use Gradient Clipping Techniques to mitigate exploding gradients.
- It can benefit from using Gradient Descent Momentum in order to help escape local minima on the error surface.
- It can face issues with saddle points, where gradients are near zero, slowing down convergence significantly.
- It can be enhanced by using Adaptive Learning Rate Algorithms like Adam or RMSprop to dynamically adjust learning rates during training.
- ...
Example(s):
- Stochastic Gradient Descent Backpropagation Algorithm.
- Back-Propagation Through Time (BPTT) Algorithm.
- BRNN Backpropagation Algorithm.
- as applied to Convolutional Neural Networks (CNNs) trained for image classification using backpropagation, which iteratively adjusts filters to minimize the classification error.
- as applied to Recurrent Neural Networks (RNNs) that apply Back-Propagation Through Time (BPTT) for sequence prediction tasks, adjusting weights based on errors propagated through time steps.
- ...
Counter-Example(s):
- Perceptron Training Algorithm.
- Human Brain Learning Algorithm.
- Boltzmann Machine.
- Evolutionary Algorithms, which do not rely on gradient-based optimization but rather on genetic operations like mutation and crossover.
- Reinforcement Learning Algorithms, where learning is driven by reward signals rather than error backpropagation.
- Support Vector Machines (SVMs), which use a different optimization approach based on maximizing the margin between classes rather than minimizing a loss function via backpropagation.
See: Gradient Descent Algorithm, Chain Rule, Activation Function, Gradient Vanishing Problem, Learning Rate Scheduling, Second-Order Optimization Methods

References

2018

(Google ML Glossary, 2018) ⇒ (2018). Backpropagation. In: Machine Learning Glossary https://developers.google.com/machine-learning/glossary/ Retrieved: 2018-04-22.
- QUOTE: The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.

2017a

(Munro, 2017) ⇒ Munro P. (2017) "Backpropagation". In: Sammut, C., Webb, G.I. (eds) "Encyclopedia of Machine Learning and Data Mining". Springer, Boston, MA
- QUOTE: Backpropagation of error (henceforth BP) is a method for training feed-forward neural networks see Artificial Neural Networks. A specific implementation of BP is an iterative procedure that adjusts network weight parameters according to the gradient of an error measure. The procedure is implemented by computing an error value for each output unit, and by backpropagating the error values through the network.

2017b

(DL Stanford, 2017) ⇒ http://deeplearning.stanford.edu/wiki/index.php/Backpropagation_Algorithm Retrieved:2017-12-17
- QUOTE: Suppose we have a fixed training set [math]\displaystyle{ \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \} }[/math] of [math]\displaystyle{ m }[/math] training examples. We can train our neural network using batch gradient descent. In detail, for a single training example [math]\displaystyle{ (x,y) }[/math], we define the cost function with respect to that single example to be:

[math]\displaystyle{ \begin{align}J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.\end{align} }[/math]

This is a (one-half) squared-error cost function. Given a training set of [math]\displaystyle{ m }[/math] examples, we then define the overall cost function to be:

[math]\displaystyle{ \begin{align}J(W,b)&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \\ &= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \end{align} }[/math]

The first term in the definition of [math]\displaystyle{ J(W,b) }[/math] is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.

2014

(Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/backpropagation Retrieved:2014-9-23.
- Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respects to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.
  Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.
  This article provides details on how backpropagation works. For programmers attempting to implement a system, the tutorials linked in the External Links may be more useful.

2014b

(IEEE Spectrum, 2014) ⇒ http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts/
- QUOTE: Michael I. Jordan: … The algorithm that has proved the most successful for deep learning is based on a technique called back propagation. You have these layers of processing units, and you get an output from the end of the layers, and you propagate a signal backwards through the layers to change all the parameters. It’s pretty clear the brain doesn’t do something like that. …

1986

(Rumlhart et al., 1986) ⇒ David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. (1986). “internal representations by error propagation.” In: D. E. Rumelhart (editor) & James L. McClelland (editor). “Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations.” MIT Press. ISBN:026268053X
- ABSTRACT: We describe a new learning procedure, back-propagation, for networks of neuron-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ’hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions or these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.