- (Lipton, 2015) ⇒ Zachary C Lipton. (2015). “A Critical Review of Recurrent Neural Networks for Sequence Learning.” In: arXiv preprint arXiv:1506.00019.
Countless learning tasks require dealing with sequential data. Image captioning, speech synthesis, music generation, and video game playing all require that a model generate sequences of outputs. In other domains, such as time series prediction, video analysis, and music information retrieval, a model must learn from sequences of inputs. Significantly more interactive tasks, such as natural language translation, engaging in dialogue, and robotic control, often demand both. Recurrent neural networks (RNNs) are a powerful family of connectionist models that capture time dynamics via cycles in the graph. Unlike feedforward neural networks, recurrent networks can process examples one at a time, retaining a state, or memory, that reflects an arbitrarily long context window. While these networks have long been difficult to train and often contain millions of parameters, recent advances in network architectures, optimization techniques, and parallel computation have enabled large-scale learning with recurrent nets. Over the past few years, systems based on state of the art long short-term memory (LSTM) and bidirectional recurrent neural network (BRNN) architectures have demonstrated record-setting performance on tasks as varied as image captioning, language translation, and handwriting recognition. In this review of the literature we synthesize the body of research that over the past three decades has yielded and reduced to practice these powerful models. When appropriate, we reconcile conflicting notation and nomenclature. Our goal is to provide a mostly self-contained explication of state of the art systems, together with a historical perspective and ample references to the primary research.
Recurrent neural networks (RNNs) are a superset of feedforward neural networks, augmented with the ability to pass information across time steps. They are a rich family of models capable of nearly arbitrary computation. A well-known result by Siegelman and Sontag from 1991 demonstrated that a finite sized recurrent neural network with sigmoidal activation functions can simulate a universal Turing machine (Siegelmann and Sontag, 1991). In practice, the ability to model temporal dependencies makes recurrent neural networks especially suited to tasks where input and/or output consist of sequences of points that are not independent.
1.1 Why Recurrent Nets ?
In this section, we address the fundamental reasons why recurrent neural networks warrant serious study for modeling sequential input and output. To be clear, we are motivated by a desire to achieve empirical results. This warrants clarification because recurrent nets have roots in both cognitive modeling and supervised machine learning, and owing to this difference of perspectives, many of these papers have different aims and priorities. In the foundational papers, generally published in cognitive science and computational neuroscience journals (Hopeld, 1982, Jordan, 1997, Elman, 1990), biologically plausible mechanisms are emphasized. In other papers (Schuster and Paliwal, 1997, Socher et al., 2014, Karpathy and Fei-Fei, 2014), biological inspiration is downplayed in favor of achieving empirical results on important tasks and datasets. Given the empirical aim, we now address three significant questions that one might reasonably want answered before reading further.
- Why Model Time Explicitly?
In light of the practical success and economic value of time-agnostic models, this is a fair question. Support vector machines, logistic regression, and feedforward networks have proved immensely useful without explicitly modeling time. Arguably, it is precisely the assumption of independence that has led to much recent progress in machine learning. Further, many models implicitly capture time by concatenating each input with some number of its immediate predecessors and successors, presenting the machine learning model with a sliding window of context about each point of interest. This approach has been used with deep belief nets for speech modeling by Maas et al. . Unfortunately, despite the usefulness of the independence assumption, it precludes modeling long-range time-dependencies. For example, a model trained using a finite-length context window of length 5 could never be trained to answer the simple question, \ what was the data point seen 6 time steps ago ? " For a practical application such as call center automation, such a limited system might learn to route calls, but could never participate in an extended dialogue. Since the earliest conception of artificial intelligence, we have sought to build systems that interact with humans in time. In Alan Turing's groundbreaking paper Computing Machinery and Intelligence, he proposes an \ Imitation Game " which judges a machine's intelligence by its ability to convincingly engage in dialogue [ Turing, 1950]. Besides dialogue systems, modern interactive systems of economic importance include self-driving cars and robotic surgery, among others. Ignoring an explicit model of time, it seems unlikely that any combination of classifiers or regressors can be cobbled together to provide this functionality.
- Why Neural Networks and Not Markov Models?
Recurrent neural networks are not the first models to capture time dependencies. Markov chains, which model transitions between observed sequences of states (s (1), s (2),..., s (T)), were first described by mathematician Andrey Markov in 1906. Hidden Markov models (HMMs), which model observed data (o (1); o (2);:::; o (T)) as probabilistically dependent upon unobserved states, were described in the 1950s and have been widely studied since the 1960s. However, traditional Markov model approaches are limited because their states must be drawn from a modestly sized discrete state space sj 2 S. The Viterbi algorithm, which is used to perform efficient inference on hidden Markov models, scales in time O (jSj2). Further, the transition table capturing the probability of moving between any two adjacent states is of size jSj2. Thus, standard operations are infeasible with an HMM when the set of possible hidden states is larger than roughly 106 states. Further, each hidden state s (t) can depend only on the previous state s (t ? ?1). While it is possible to extend any Markov model to account for a larger context window by creating a new state-space equal to the cross product of the possible states at each time in the window, this procedure grows the state space exponentially with the size of the window, rendering Markov models computationally impractical for modeling long-range dependencies. Given the limitations of Markov models, we ought to explain why it is sensible that connectionist models, i.e., artificial neural networks, should fare better. First, recurrent neural networks can capture long-range time dependencies, overcoming the chief limitation of Markov models. This point requires a careful explanation. As in Markov models, any state in a traditional RNN depends only on the current input as well as the state of the network at the previous time step 1. However, the hidden state at any time step can contain information from an arbitrarily long context window. This is possible because the number of distinct states that can be represented in a hidden layer of nodes grows exponentially with the number of nodes in the layer. Even if each node took only binary values, the network could represent 2N states where N is the number of nodes in the hidden layer. Given real-valued outputs, even assuming the limited precision of 64 bit numbers, a single hidden layer of nodes can represent 264N distinct states. While the potential expressive power grows exponentially with the number of nodes in the hidden representation, the complexity of both inference and training grows only quadratically.
|2015 ACriticalReviewofRecurrentNeura||Zachary C Lipton||A Critical Review of Recurrent Neural Networks for Sequence Learning||2015|