Back-Propagation Through Time (BPTT) Algorithm

From GM-RKB
(Redirected from BPTT)
Jump to navigation Jump to search

A Back-Propagation Through Time (BPTT) Algorithm is a Gradient Descent Algorithm that can be used to train some recurrent neural networks.



References

2021

  • (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Backpropagation_through_time Retrieved:2021-3-21.
    • Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. The algorithm was independently derived by numerous researchers.

      (...)

      The training data for a recurrent neural network is an ordered sequence of [math]\displaystyle{ k }[/math] input-output pairs, [math]\displaystyle{ \langle \mathbf{a}_0,\mathbf{y}_0 \rangle, \langle\mathbf{a}_1,\mathbf{y}_1 \rangle,\langle\mathbf{a}_2,\mathbf{y}_2\rangle,...,\langle\mathbf{a}_{k-1},\mathbf{y}_{k-1}\rangle }[/math]. An initial value must be specified for the hidden state [math]\displaystyle{ \mathbf{x}_0 }[/math]. Typically, a vector of all zeros is used for this purpose.

      BPTT begins by unfolding a recurrent neural network in time. The unfolded network contains [math]\displaystyle{ k }[/math] inputs and outputs, but every copy of the network shares the same parameters. Then the backpropagation algorithm is used to find the gradient of the cost with respect to all the network parameters.

      Consider an example of a neural network that contains a recurrent layer [math]\displaystyle{ f }[/math] and a feedforward layer [math]\displaystyle{ g }[/math]. There are different ways to define the training cost, but the total cost is always the average of the costs of each of the time steps. The cost of each time step can be computed separately. The figure above shows how the cost at time [math]\displaystyle{ t + 3 }[/math] can be computed, by unfolding the recurrent layer [math]\displaystyle{ f }[/math] for three time steps and adding the feedforward layer [math]\displaystyle{ g }[/math]. Each instance of [math]\displaystyle{ f }[/math] in the unfolded network shares the same parameters. Thus the weight updates in each instance ([math]\displaystyle{ f_1, f_2, f_3 }[/math]) are summed together.

2020

2016a

2016b

1990a

1990b

1985