Recurrent Neural Network (RNN) Training Algorithm

(Redirected from RNNs)

References

2017

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Recurrent_neural_network#Gradient_descent Retrieved:2017-10-7.
• Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. In neural networks, it can be used to minimize the error term by changing each weight in proportion to the derivative of the error with respect to that weight, provided the non-cvxlinear activation functions are differentiable. Various methods for doing so were developed in the 1980s and early 1990s by Werbos, Williams, Robinson, Schmidhuber, Hochreiter, Pearlmutter and others.

The standard method is called “backpropagation through time” or BPTT, and is a generalization of back-propagation for feed-forward networks. which is an instance of automatic differentiation in the forward accumulation mode with stacked tangent vectors. Unlike BPTT this algorithm is local in time but not local in space. In this context, local in space means that a unit's weight vector can be updated using only information stored in the connected units and the unit itself such that update complexity of a single unit is linear in the dimensionality of the weight vector. Local in time means that the updates take place continually (on-line) and depend only on the most recent time step rather than on multiple time steps within a given time horizon as in BPTT. Biological neural networks appear to be local with respect to both time and space. [1] For recursively computing the partial derivatives, RTRL has a time-complexity of O(number of hidden x number of weights) per time step for computing the Jacobian matrices, while BPTT only takes O(number of weights) per time step, at the cost of storing all forward activations within the given time horizon.[2] An online hybrid between BPTT and RTRL with intermediate complexity exists, along with variants for continuous time. A major problem with gradient descent for standard RNN architectures is that error gradients vanish exponentially quickly with the size of the time lag between important events.[3][4] LSTM combined with a BPTT/RTRL hybrid learning method attempts to overcome these problems.[5] The on-line algorithm called causal recursive backpropagation (CRBP), implements and combines BPTT and RTRL paradigms for locally recurrent networks. It works with the most general locally recurrent networks. The CRBP algorithm can minimize the global error term. This fact improves stability of the algorithm, providing a unifying view on gradient calculation techniques for recurrent networks with local feedback. One approach to the computation of gradient information in RNNs with arbitrary architectures is based on signal-flow graphs diagrammatic derivation. It uses the BPTT batch algorithm, based on Lee's theorem for network sensitivity calculations.[6] It was proposed by Wan and Beaufays, while its fast online version was proposed by Campolucci, Uncini and Piazza.[6]

2017b

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Recurrent_neural_network#Global_optimization_methods Retrieved:2017-10-8.
• Training the weights in a neural network can be modeled as a non-linear global optimization problem. A target function can be formed to evaluate the fitness or error of a particular weight vector as follows: First, the weights in the network are set according to the weight vector. Next, the network is evaluated against the training sequence. Typically, the sum-squared-difference between the predictions and the target values specified in the training sequence is used to represent the error of the current weight vector. Arbitrary global optimization techniques may then be used to minimize this target function.

The most common global optimization method for training RNNs is genetic algorithms, especially in unstructured networks.

Initially, the genetic algorithm is encoded with the neural network weights in a predefined manner where one gene in the chromosome represents one weight link. The whole network is represented as a single chromosome. The fitness function is evaluated as follows:

• Each weight encoded in the chromosome is assigned to the respective weight link of the network.
• The training set is presented to the network which propagates the input signals forward.
• The mean-squared-error is returned to the fitness function.
• This function drives the genetic selection process.
• Many chromosomes make up the population; therefore, many different neural networks are evolved until a stopping criterion is satisfied. A common stopping scheme is:
• When the neural network has learnt a certain percentage of the training data or
• When the minimum value of the mean-squared-error is satisfied or
• When the maximum number of training generations has been reached.
• The stopping criterion is evaluated by the fitness function as it gets the reciprocal of the mean-squared-error from each network during training. Therefore, the goal of the genetic algorithm is to maximize the fitness function, reducing the mean-squared-error.

Other global (and/or evolutionary) optimization techniques may be used to seek a good set of weights, such as simulated annealing or particle swarm optimization.

2012

1. Cite error: Invalid `<ref>` tag; no text was provided for refs named `PríncipeEuliano2000`
2. Cite error: Invalid `<ref>` tag; no text was provided for refs named `Ollivier2015`
3. Cite error: Invalid `<ref>` tag; no text was provided for refs named `hochreiter1991`
4. Cite error: Invalid `<ref>` tag; no text was provided for refs named `HOCH2001`
5. Cite error: Invalid `<ref>` tag; no text was provided for refs named `lstm`
6. Cite error: Invalid `<ref>` tag; no text was provided for refs named `ReferenceA`