# Recurrent Neural Network (RNN) Training Algorithm

A Recurrent Neural Network (RNN) Training Algorithm is a deep neural network training algorithm that can be applied by a recurrent neural network training system (for recurrent neural network training of a recurrent neural network model).

**Context:**- It can range from being a Short-Memory RNN Algorithm to being a Long-Memory RNN Algorithm.
- It can handle data with different dimensions (such as text items).
- It can support an RNN-based Algorithm.

**Example(s):****Counter-Example(s):****See:**Gated Recurrent Neural Network, Vanishing Gradient Problem, First Order Method, Backpropagation Through Time, Particle Swarm Optimization, Simulated Annealing.

## References

### 2018

- (Weiss et al., 2018) ⇒ Gail Weiss, Yoav Goldberg, and Eran Yahav. (2018). “On the Practical Computational Power of Finite Precision RNNs for Language Recognition.” In: arXiv:1805.04908 Journal.
- QUOTE: While Recurrent Neural Networks (RNNs) are famously known to be Turing complete, this relies on infinite precision in the states and unbounded computation time. We consider the case of RNNs with finite precision whose computation time is linear in the input length. Under these limitations, we show that different RNN variants have different computational power. In particular, we show that the LSTM and the Elman-RNN with ReLU activation are strictly stronger than the RNN with a squashing activation and the GRU. ...

### 2017

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Recurrent_neural_network#Gradient_descent Retrieved:2017-10-7.
- Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. In neural networks, it can be used to minimize the error term by changing each weight in proportion to the derivative of the error with respect to that weight, provided the non-cvxlinear activation functions are differentiable. Various methods for doing so were developed in the 1980s and early 1990s by Werbos, Williams, Robinson, Schmidhuber, Hochreiter, Pearlmutter and others.
The standard method is called “backpropagation through time” or BPTT, and is a generalization of back-propagation for feed-forward networks. which is an instance of automatic differentiation in the forward accumulation mode with stacked tangent vectors. Unlike BPTT this algorithm is local in time but not local in space. In this context, local in space means that a unit's weight vector can be updated using only information stored in the connected units and the unit itself such that update complexity of a single unit is linear in the dimensionality of the weight vector. Local in time means that the updates take place continually (on-line) and depend only on the most recent time step rather than on multiple time steps within a given time horizon as in BPTT. Biological neural networks appear to be local with respect to both time and space.

^{[1]}For recursively computing the partial derivatives, RTRL has a time-complexity of O(number of hidden x number of weights) per time step for computing the Jacobian matrices, while BPTT only takes O(number of weights) per time step, at the cost of storing all forward activations within the given time horizon.^{[2]}An online hybrid between BPTT and RTRL with intermediate complexity exists, along with variants for continuous time. A major problem with gradient descent for standard RNN architectures is that error gradients vanish exponentially quickly with the size of the time lag between important events.^{[3]}^{[4]}LSTM combined with a BPTT/RTRL hybrid learning method attempts to overcome these problems.^{[5]}The on-line algorithm called causal recursive backpropagation (CRBP), implements and combines BPTT and RTRL paradigms for locally recurrent networks. It works with the most general locally recurrent networks. The CRBP algorithm can minimize the global error term. This fact improves stability of the algorithm, providing a unifying view on gradient calculation techniques for recurrent networks with local feedback. One approach to the computation of gradient information in RNNs with arbitrary architectures is based on signal-flow graphs diagrammatic derivation. It uses the BPTT batch algorithm, based on Lee's theorem for network sensitivity calculations.^{[6]}It was proposed by Wan and Beaufays, while its fast online version was proposed by Campolucci, Uncini and Piazza.^{[6]}

- Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. In neural networks, it can be used to minimize the error term by changing each weight in proportion to the derivative of the error with respect to that weight, provided the non-cvxlinear activation functions are differentiable. Various methods for doing so were developed in the 1980s and early 1990s by Werbos, Williams, Robinson, Schmidhuber, Hochreiter, Pearlmutter and others.

### 2017b

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Recurrent_neural_network#Global_optimization_methods Retrieved:2017-10-8.
- Training the weights in a neural network can be modeled as a non-linear global optimization problem. A target function can be formed to evaluate the fitness or error of a particular weight vector as follows: First, the weights in the network are set according to the weight vector. Next, the network is evaluated against the training sequence. Typically, the sum-squared-difference between the predictions and the target values specified in the training sequence is used to represent the error of the current weight vector. Arbitrary global optimization techniques may then be used to minimize this target function.
The most common global optimization method for training RNNs is genetic algorithms, especially in unstructured networks.

Initially, the genetic algorithm is encoded with the neural network weights in a predefined manner where one gene in the chromosome represents one weight link. The whole network is represented as a single chromosome. The fitness function is evaluated as follows:

- Each weight encoded in the chromosome is assigned to the respective weight link of the network.
- The training set is presented to the network which propagates the input signals forward.
- The mean-squared-error is returned to the fitness function.
- This function drives the genetic selection process.

- Many chromosomes make up the population; therefore, many different neural networks are evolved until a stopping criterion is satisfied. A common stopping scheme is:
- When the neural network has learnt a certain percentage of the training data or
- When the minimum value of the mean-squared-error is satisfied or
- When the maximum number of training generations has been reached.

- The stopping criterion is evaluated by the fitness function as it gets the reciprocal of the mean-squared-error from each network during training. Therefore, the goal of the genetic algorithm is to maximize the fitness function, reducing the mean-squared-error.
Other global (and/or evolutionary) optimization techniques may be used to seek a good set of weights, such as simulated annealing or particle swarm optimization.

- Training the weights in a neural network can be modeled as a non-linear global optimization problem. A target function can be formed to evaluate the fitness or error of a particular weight vector as follows: First, the weights in the network are set according to the weight vector. Next, the network is evaluated against the training sequence. Typically, the sum-squared-difference between the predictions and the target values specified in the training sequence is used to represent the error of the current weight vector. Arbitrary global optimization techniques may then be used to minimize this target function.

### 2012

- (Graves, 2012) ⇒ Alex Graves. (2012). “Supervised Sequence Labelling with Recurrent Neural Networks.” Springer Berlin Heidelberg. ISBN:9783642247965
- QUOTE: ... Recurrent neural networks are powerful sequence learning tools ― robust to input noise and distortion, able to exploit long-range contextual information―that would seem ideally suited to such problems. However their role in large-scale sequence labelling systems has so far been auxiliary. ...

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`PríncipeEuliano2000`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`Ollivier2015`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`hochreiter1991`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`HOCH2001`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`lstm`

- ↑
^{6.0}^{6.1}Cite error: Invalid`<ref>`

tag; no text was provided for refs named`ReferenceA`