2018 OnthePracticalComputationalPowe

(Weiss et al., 2018) ⇒ Gail Weiss, Yoav Goldberg, and Eran Yahav. (2018). “On the Practical Computational Power of Finite Precision RNNs for Language Recognition.” In: arXiv:1805.04908 Journal.

Subject Headings:

Notes

Cited By

http://scholar.google.com/scholar?q=%222018%22+On+the+Practical+Computational+Power+of+Finite+Precision+RNNs+for+Language+Recognition

Quotes

Abstract

While Recurrent Neural Networks (RNNs) are famously known to be Turing complete, this relies on infinite precision in the states and unbounded computation time. We consider the case of RNNs with finite precision whose computation time is linear in the input length. Under these limitations, we show that different RNN variants have different computational power. In particular, we show that the LSTM and the Elman-RNN with ReLU activation are strictly stronger than the RNN with a squashing activation and the GRU. This is achieved because LSTMs and ReLU-RNNs can easily implement counting behavior. We show empirically that the LSTM does indeed learn to effectively use the counting mechanism.

1 Introduction

Recurrent Neural Network (RNNs) emerge as very strong learners of sequential data. A famous result by Siegelmann and Sontag (1992; 1994), and its extension in (Siegelmann, 1999), demonstrates that an Elman-RNN (Elman, 1990) with a sigmoid activation function, rational weights and infinite precision states can simulate a Turingmachine in real-time, making RNNs Turingcomplete. Recently, Chen et al (2017) extended the result to the ReLU activation function. However, these constructions (a) assume reading the entire input into the RNN state and only then performing the computation, using unbounded time; and (b) rely on having infinite precision in the network states. As argued by Chen et al (2017), this is not the model of RNN computation used in NLP applications. Instead, RNNs are often used by feeding an input sequence into the RNN one item at a time, each immediately returning a statevector that corresponds to a prefix of the sequence and which can be passed as input for a subsequent feed-forward prediction network operating in constant time. The amount of tape used by a Turing machine under this restriction is linear in the input length, reducing its power to recognition of context-sensitive language. More importantly, computation is often performed on GPUs with 32bit floating point computation, and there is increasing evidence that competitive performance can be achieved also for quantized networks with 4-bit weights or fixed-point arithmetics (Hubara et al., 2016). The construction of (Siegelmann, 1999) implements pushing 0 into a binary stack by the operation g g=4 + 1=4. This allows pushing roughly 15 zeros before reaching the limit of the 32bit floating point precision. Finally, RNN solutions that rely on carefully orchestrated mathematical constructions are unlikely to be found using backpropagation-based training.

In this work we restrict ourselves to inputbound recurrent neural networks with finiteprecision states (IBFP-RNN), trained using backpropagation. This class of networks is likely to coincide with the networks one can expect to obtain when training RNNs for NLP applications. An IBFP Elman-RNN is finite state. But what about other RNN variants? In particular, we consider the Elman RNN (SRNN) (Elman, 1990) with squashing and with ReLU activations, the Long Short- Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014; Chung et al., 2014).

The common wisdom is that the LSTM and GRU introduce additional gating components that handle the vanishing gradients problem of training SRNNs, thus stabilizing training and making it more robust. The LSTM and GRU are often considered as almost equivalent variants of each other.

(a) anbn-LSTM on a1000b1000 (b) anbncn-LSTM on a100b100c100
(c) anbn-GRU on a1000b1000 (d) anbncn-GRU on a100b100c100

Figure 1: Activations — c for LSTM and h for GRU — for networks trained on anbn and anbncn. The LSTM has clearly learned to use an explicit counting mechanism, in contrast with the GRU.

We show that in the input-bound, finiteprecision case, there is a real difference between the computational capacities of the LSTM and the GRU: the LSTM can easily perform unbounded counting, while the GRU (and the SRNN) cannot. This makes the LSTM a variant of a k-counter machine (Fischer et al., 1968), while the GRU remains finite-state. Interestingly, the SRNN with ReLU activation followed by an MLP classifier also has power similar to a k-counter machine. These results suggest there is a class of formal languages that can be recognized by LSTMs but not by GRUs. In section 5, we demonstrate that for at least two such languages, the LSTM manages to learn the desired concept classes using back-propagation, while using the hypothesized control structure. Figure 1 shows the activations of 10-d LSTM and GRU trained to recognize the languages anbn and anbncn. It is clear that the LSTM learned to dedicate specific dimensions for counting, in contrast to the GRU.1

1Is the ability to perform unbounded counting relevant to “real world” NLP tasks? In some cases it might be. For example, processing linearized parse trees (Vinyals et al., 2015; Choe and Charniak, 2016; Aharoni and Goldberg, 2017) requires counting brackets and nesting levels. Indeed, previous works that process linearized parse trees report using LSTMs and not GRUs for this purpose. Our work here suggests that this may not be a coincidence.

2 The RNN Models

An RNN is a parameterized function R that takes as input an input vector xt and a state vector ht􀀀1 and returns a state vector ht: ht = R(xt; ht􀀀1) (1)

The RNN is applied to a sequence x1; :::; xn by starting with an initial vector h0 (often the 0 vector) and applying R repeatedly according to equation (1). Let � be an input vocabulary (alphabet), and assume a mapping E from every vocabulary item to a vector x (achieved through a 1- hot encoding, an embedding layer, or some other means). Let RNN(x1; :::; xn) denote the state vector h resulting from the application of R to the sequence E(x1); :::;E(xn). An RNN recognizer (or RNN acceptor) has an additional function f mapping states h to 0; 1. Typically, f is a log-linear classifier or multi-layer perceptron. We say that an RNN recognizes a language L� �� if f(RNN(w)) returns 1 for all and only words w = x1; :::; xn 2 L.

Elman-RNN (SRNN) In the Elman-RNN (Elman, 1990), also called the Simple RNN (SRNN), the function R takes the form of an affine transform followed by a tanh nonlinearity:

ht = tanh(Wxt + Uht􀀀1 + b) (2)

Elman-RNNs are known to be at-least finitestate. Siegelmann (1996) proved that the tanh can be replaced by any other squashing function without sacrificing computational power. IRNN The IRNN model, explored by (Le et al., 2015), replaces the tanh activation with a nonsquashing ReLU:

ht = max(0; (Wxt + Uht􀀀1 + b)) (3)

The computational power of such RNNs (given infinite precision) is explored in (Chen et al., 2017). Gated Recurrent Unit (GRU) In the GRU (Cho et al., 2014), the function R incorporates a gating mechanism, taking the form:

zt = �(Wzxt + Uzht􀀀1 + bz) (4)

rt = �(Wrxt + Urht􀀀1 + br) (5)

~h t = tanh(Whxt + Uh(rt � ht􀀀1) + bh)(6) ht = zt � ht􀀀1 + (1 􀀀 zt) �~ht (7) Where � is the sigmoid function and � is the Hadamard product (element-wise product). Long Short Term Memory (LSTM) In the LSTM (Hochreiter and Schmidhuber, 1997), R uses a different gating component configuration: ft = �(Wfxt + Ufht􀀀1 + bf ) (8) it = �(Wixt + Uiht􀀀1 + bi) (9) ot = �(Woxt + Uoht􀀀1 + bo) (10) ~ct = tanh(Wcxt + Ucht􀀀1 + bc) (11) ct = ft � ct􀀀1 + it � ~ct (12) ht = ot � g(ct) (13)

where g can be either tanh or the identity. Equivalences The GRU and LSTM are at least as strong as the SRNN: by setting the gates of the GRU to zt = 0 and rt = 1 we obtain the SRNN computation. Similarly by setting the LSTM gates to it = 1,ot = 1, and ft = 0. This is easily achieved by setting the matrices W and U to 0, and the biases b to the (constant) desired gate values. Thus, all the above RNNs can recognize finitestate languages.

3 Power of Counting

Power beyond finite state can be obtained by introducing counters. Counting languages and kcounter machines are discussed in depth in (Fischer et al., 1968). When unbounded computation is allowed, a 2-counter machine has Turing power. However, for computation bound by input length (real-time) there is a more interesting hierarchy. In particular, real-time counting languages cut across the traditional Chomsky hierarchy: real-time k-counter machines can recognize at least one context-free language (anbn), and at least one context-sensitive one (anbncn). However, they cannot recognize the context free language given by the grammar S ! xjaSajbSb (palindromes).

SKCM For our purposes, we consider a simplified variant of k-counter machines (SKCM). A counter is a device which can be incremented by a fixed amount (INC), decremented by a fixed amount (DEC) or compared to 0 (COMP0). Informally, 2 an SKCM is a finite-state automaton extended with k counters, where at each step of the computation each counter can be incremented, decremented or ignored in an input-dependent way, and state-transitions and accept/reject decisions can inspect the counters’ states using COMP0. The results for the three languages discussed above hold for the SKCM variant as well, with proofs provided in Appendix A.

4 RNNs as SKCMs

In what follows, we consider the effect on the state-update equations on a single dimension, ht[j]. We omit the index [j] for readability. LSTM The LSTM acts as an SKCM by designating k dimensions of the memory cell ct as counters. In non-counting steps, set it = 0; ft = 1 through equations (8-9). In counting steps, the counter direction (+1 or -1) is set in ~ ct (equation 11) based on the input xt and state ht􀀀1. The counting itself is performed in equation (12), after setting it = ft = 1. The counter can be reset to 0 by setting it = ft = 0.

Finally, the counter values are exposed through ht = otg(ct), making it trivial to compare the counter’s value to 0.3

2Formal definition is given in Appendix A.

3Some further remarks on the LSTM: LSTM supports both increment and decrement in a single dimension. The counting dimensions in ct are exposed through a function Accepted as a short paper in ACL 2018 We note that this implementation of the SKCM operations is achieved by saturating the activations to their boundaries, making it relatively easy to reach and maintain in practice.

SRNN The finite-precision SRNN cannot designate unbounded counting dimensions.

The SRNN update equation is: ht = tanh(Wx + Uht􀀀1 + b) ht[i] = tanh( Xdx j=1 Wijx[j]+ Xdh j=1 Uijht􀀀1[j]+b[i])

By properly setting U and W, one can get certain dimensions of h to update according to the value of x, by ht[i] = tanh(ht􀀀1[i]+wix+b[i]). However, this counting behavior is within a tanh activation. Theoretically, this means unbounded counting cannot be achieved without infinite precision. Practically, this makes the counting behavior inherently unstable, and bounded to a relatively narrow region. While the network could adapt to set w to be small enough such that counting works for the needed range seen in training without overflowing the tanh, attempting to count to larger n will quickly leave this safe region and diverge.

IRNN Finite-precision IRNNs can perform unbounded counting conditioned on input symbols. This requires representing each counter as two dimensions, and implementing INC as incrementing one dimension, DEC as incrementing the other, and COMP0 as comparing their difference to 0. Indeed, Appendix A in (Chen et al., 2017) provides concrete IRNNs for recognizing the languages anbn and anbncn. This makes IBFP-RNN with ReLU activation more powerful than IBFP-RNN with a squashing activation. Practically, ReLUactivated RNNs are known to be notoriously h5ard g. For both g(x) = x and g(x) = tanh(x), it is trivial to do compare 0. Another operation of interest is comparing two counters (for example, checking the difference between them). This cannot be reliably achieved with g(x) = tanh(x), due to the non-linearity and saturation properties of the tanh function, but is possible in the g(x) = x case. LSTM can also easily set the value of a counter to 0 in one step. The ability to set the counter to 0 gives slightly more power for real-time recognition, as discussed by Fischer et al. (1968).

Relation to known architectural variants: Adding peephole connections (Gers and Schmidhuber, 2000) essentially sets g(x) = x and allows comparing counters in a stable way. Coupling the input and the forget gates (it = 1 􀀀 ft) (Greff et al., 2017) removes the single-dimension unbounded counting ability, as discussed for the GRU. to train because of the exploding gradient problem. GRU Finite-precision GRUs cannot implement unbounded counting on a given dimension. The tanh in equation (6) combined with the interpolation (tying zt and 1 􀀀 zt) in equation (7) restricts the range of values in h to between -1 and 1, precluding unbounded counting with finite precision. Practically, the GRU can learn to count up to some bound m seen in training, but will not generalize well beyond that.4 Moreover, simulating forms of counting behavior in equation (7) require consistently setting the gates zt; rt and the proposal ~ht to precise, non-saturated values, making it much harder to find and maintain stable solutions. Summary We show that LSTM and IRNN can implement unbounded counting in dedicated counting dimensions, while the GRU and SRNN cannot. This makes the LSTM and IRNN at least as strong as SKCMs, and strictly stronger than the SRNN and the GRU.5

5 Experimental Results

Can the LSTM indeed learn to behave as a kcounter machine when trained using backpropagation? We show empirically that:

1. LSTMs can be trained to recognize anbn and anbncn.

2. These LSTMs generalize to much higher n than seen in the training set (though not infinitely so).

3. The trained LSTM learn to use the perdimension counting mechanism.

4. The GRU can also be trained to recognize anbn and anbncn, but they do not have clear counting dimensions, and they generalize to much smaller n than the LSTMs, often failing to generalize correctly even for n within their training domain.

4One such mechanism could be to divide a given dimension by k > 1 at each symbol encounter, by setting zt = 1=k and ~ ht = 0. Note that the inverse operation would not be implementable, and counting down would have to be realized with a second counter.

5One can argue that other counting mechanisms — involving several dimensions — are also possible. Intuitively, such mechanisms cannot be trained to perform unbounded counting based on a finite sample as the model has no means of generalizing the counting behavior to dimensions beyond those seen in training. We discuss this more in depth in Appendix B, where we also prove that an SRNN cannot represent a binary counter.

5. Trained LSTM networks outperform trained GRU networks on random test sets for the languages anbn and anbncn. Similar empirical observations regarding the ability of the LSTM to learn to recognize anbn and anbncn are described also in (Gers and Schmidhuber, 2001).

We train 10-dimension, 1-layer LSTM and GRU networks to recognize anbn and anbncn. For anbn the training samples went up to n = 100 and for anbncn up to n = 50.6 Results On anbn, the LSTM generalizes well up to n = 256, after which it accumulates a deviation making it reject anbn but recognize anbn+1 for a while, until the deviation grows.7 The GRU does not capture the desired concept even within its training domain: accepting anbn+1 for n > 38, and also accepting anbn+2 for n > 97. It stops accepting anbn for n > 198.

On anbncn the LSTM recognizes well until n = 100. It then starts accepting also anbn+1cn. At n > 120 it stops accepting anbncn and switches to accepting anbn+1cn, until at some point the deviation grows. The GRU accepts already a9b10c12, and stops accepting anbncn for n > 63. Figure 1a plots the activations of the 10 dimensions of the anbn-LSTM for the input a1000b1000. While the LSTM misclassifies this example, the use of the counting mechanism is clear. Figure 1b plots the activation for the anbncn LSTM on a100b100c100. Here, again, the two counting dimensions are clearly identified — indicating the LSTM learned the canonical 2-counter solution — although the slightly-imprecise counting also starts to show. In contrast, Figures 1c and 1d show the state values of the GRU-networks. The GRU behavior is much less interpretable than the LSTM. In the anbn case, some dimensions may be performing counting within a bounded range, but move to erratic behavior at around t = 1750 (the 6Implementation in DyNet, using the SGD Optimizer. Positive examples are generated by sampling n in the desired range. For negative examples we sample 2 or 3 n values independently, and ensuring at least one of them differs from the others. We dedicate a portion of the examples as the dev set, and train up to 100% dev set accuracy. 7These fluctuations occur as the networks do not fully saturate their gates, meaning the LSTM implements an imperfect counter that accumulates small deviations during computation, e.g.: increasing the counting dimension by 0.99 but decreasing only by 0.98. Despite this, we see that the its solution remains much more robust than that found by the GRU

— the LSTM has learned the essence of the counting based

solution, but its implementation is imprecise. network starts to misclassify on sequences much shorter than that). The anbncn state dynamics are even less interpretable.

Finally, we created 1000-sample test sets for each of the languages. For anbn we used words with the form an+ibn+j where n 2 rand(0; 200) and i; j 2 rand(􀀀2; 2), and for anbncn we use words of the form an+ibn+jcn+k where n 2 rand(0; 150) and i; j; k 2 rand(􀀀2; 2). The LSTM’s accuracy was 100% and 98.6% on anbn and anbncn respectively, as opposed to the GRU’s 87.0% and 86.9%, also respectively. All of this empirically supports our result, showing that IBFP-LSTMs can not only theoretically implement “unbounded” counters, but also learn to do so in practice (although not perfectly), while IBFP-GRUs do not manage to learn proper counting behavior, even when allowing floating point computations.

6 Conclusions

We show that the IBFP-LSTM can model a realtime SKCM, both in theory and in practice. This makes it more powerful than the IBFP-SRNN and the IBFP-GRU, which cannot implement unbounded counting and are hence restricted to recognizing regular languages. The IBFP-IRNN can also perform input-dependent counting, and is thus more powerful than the IBFP-SRNN. We note that in addition to theoretical distinctions between architectures, it is important to consider also the practicality of different solutions: how easy it is for a given architecture to discover and maintain a stable behavior in practice. We leave further exploration of this question for future work.

Appendix

A Simplified K-Counter Machines

We use a simplified variant of the k-counter machines (SKCM) defined in (Fischer et al., 1968), which has no autonomous states and makes classification decisions based on a combination of its current state and counter values. This variant consumes input sequences on a symbol by symbol basis, updating at each step its state and its counters, the latter of which may be manipulated by increment, decrement, zero, or no-ops alone, and observed only by checking equivalence to zero. To define the transitions of this model its accepting configurations, we will introduce the following notations:

Notations We define z : Zk ! f0; 1gk as follows: for every n 2 Zk, for every 1 � i � k, z(n)i = 0 iff ni = 0 (this function masks a set of integers such that only their zero-ness is observed). For a vector of operations, o 2 f􀀀1; +1;�0;�1gk, we denote by o(n) the pointwise application of the operations to the vector n 2 Zk, e.g. for o = (+1;�0;�1), o((5; 2; 3)) = (6; 0; 3).

We now define the model. An SKCM is a tuple M = h�; Q; qo; k; �; u; Fi containing: 1. A finite input alphabet � 2. A finite state set Q 3. An initial state q0 2 Q 4. k 2 N, the number of counters 5. A state transition function � : Q � � � f0; 1gk ! Q 6. A counter update function8 u : � ! f􀀀1; +1;�0;�1gk 7. A set of accepting masked9 configurations F � Q � f0; 1gk The set of configurations of an SKCM is the set C = Q � Zk, and the initial configuration is c0 = (q0; �0) (i.e., the counters are initiated to zero). The 8 We note that in this definition, the counter update function depends only on the input symbol. In practice we see that the LSTM is not limited in this way, and can also update according to some state-input combinations — as can be seen when it it is taught, for instance, the language anban We do not explore this here however, leaving a more complete characterization of the learnable models to future work. 9i.e., counters are observed only by zero-ness. transitions of an SKCM are as follows: given a configuration ct = (q; n) (n 2 Zk) and input wt 2 �, the next configuration of the SKCM is ct+1 = (�(q;wt; z(n)); u(wt)(n)). The language recognized by a k-counter machine is the set of words w for which the machine reaches an accepting configuration — a configuration c = (q; n) for which (q; z(n)) 2 F. Note that while the counters can and are increased to various non-zero values, the transition function � and the accept/reject classification of the configurations observe only their zero-ness.

A.1 Computational Power of SKCMs

We show that the SKCM model can recognize the context-free and context-sensitive languages anbn and anbncn, but not the context free language of palindromes, meaning its computational power differs from the language classes defined in the Chomsky hierarchy. Similar proofs appear in (Fischer et al., 1968) for their variant of the kcounter machine. anbn: We define the following SKCM over the alphabet fa; bg: 1. Q = fqa; qb; qrg 2. q0 = qa 3. k = 1 4. u(a) = +1; u(b) = 􀀀1 5. for any z 2 f0; 1g: �(qa; a; z) = qa; �(qa; b; z) = qb; �(qb; a; z) = qr; �(qb; b; z) = qb �(qr; a; z) = qr; �(qr; b; z) = qr 6. C = f(qb; 0)g The state qr is a rejecting sink state, and the states qa and qb keep track of whether the sequence is currently in the “a” or “b” phase. If an a is seen after moving to the b phase, the machine moves to (and stays in) the rejecting state. The counter is increased on input a and decreased on input b, and the machine accepts only sequences that reach the state qb with counter value zero, i.e., that have increased and decreased the counter an equal number of times, without switching from b to a. It follows easily that this machine recognizes exactly the language anbn. anbncn: We define the following SKCM over the alphabet fa; bg. As its state transition function ignores the counter values, we use the shorthand �(q; �) for �(q; �; z), for all z 2 f0; 1g2. 1. Q = fqa; qb; qc; qrg Accepted as a short paper in ACL 2018 2. q0 = qa 3. k = 2 4. u(a) = (+1; ); u(b) = (􀀀1; +1); u(c) = (􀀀1) 5. for any z 2 f0; 1g: �(qa; a) = qa; �(qa; b) = qb; �(qa; c) = qr; �(qb; a) = qr; �(qb; b) = qb; �(qb; c) = qc; �(qc; a) = qr; �(qc; b) = qr; �(qc; c) = qc; �(qr; a) = qr; �(qr; b) = qr; �(qr; c) = qr 6. C = f(qc; 0; 0)g By similar reasoning as that for anbn, we see that this machine recognizes exactly the language anbncn. We note that this construction can be extended to build an SKCM for any language of the sort an1 an2

an

m, using k = m 􀀀 1 counters and k + 1 states.

Palindromes: We prove that no SKCM can recognize the language of palindromes defined over the alphabet fa; b; xg by the grammar S ! xjaSajbSb. The intuition is that in order to correctly recognize this language in an one-way setting, one must be able to reach a unique configuration for every possible input sequence over fa; bg (requiring an exponential number of reachable configurations), whereas for any SKCM, the number of reachable configurations is always polynomial in the input length.10

Let M be an SKCM with k counters. As its counters are only manipulated by steps of 1 or resets, the maximum and minimum values that each counter can attain on any input w 2 �� are +jwj and 􀀀jwj, and in particular the total number of possible values a counter could reach at the end of input w is 2jwj + 1. This means that the total number of possible configurations M could reach on input of length n is c(n) = jQj � (2n + 1)k. c(n) is polynomial in n, and so there exists a value m for which the number of input sequences of length m over fa; bg — 2m — is greater than c(m). It follows by the pigeonhole principle that there exist two input sequences w1 6= w2 2 fa; bgm for which M reaches the same configuration. This means that for any suffix w 2 ��, and in particular for w = x � w􀀀1 1 where w􀀀1 1 is the reverse of w1, M classifies w1 � w and w2 � w

10This will hold even if the counter update function can rely on any state-input combination.

identically — despite the fact that w1 � x �w􀀀1 1 is in the language and w2 � x � w􀀀1 1 is not. This means that M necessarily does not recognize this palindrome language, and ultimately that no such M exists. Note that this proof can be easily generalized to any palindrome grammar over 2 or more characters, with or without a clear ‘midpoint’ marker. B Impossibility of Counting in Binary While we have seen that the SRNN and GRU cannot allocate individual counting dimensions, the question remains whether they can count using a more elaborate mechanism, perhaps over several dimensions. We show here that one such mechanism

— a binary counter — is not implementable

in the SRNN.

For the purposes of this discussion, we first define a binary counter in an RNN. Binary Interpretation In an RNN with hidden state values in the range (􀀀1; 1), the binary interpretation of a sequence of dimensions d1; :::; dn of its hidden state is the binary number obtained by replacing each positive hidden value in the sequence with a ‘1’ and each negative value with a ‘0’. For instance: the binary interpretation of the dimensions 3,0,1 in the hidden state vector (0:5;􀀀0:1; 0:3; 0:8) is 110, i.e., 6. Binary Counting We say that the dimensions d1; d2; :::; dn in an RNN’s hidden state implement a binary counter in the RNN if, in every transition, their binary interpretation either increases, decreases, resets to 0, or doesn’t change.11 A similar pair of definitions can be made for state values in the range (0; 1). We first note intuitively that an SRNN would not generalize binary counting to a counter with dimensions beyond those seen in training — as it would have no reason to learn the ‘carry’ behavior between the untrained dimensions. We prove further that we cannot reasonably implement such counters regardless.

We now present a proof sketch that a singlelayer SRNN with hidden size n � 3 cannot implement an n-dimensional binary counter that will consistently increase on one of its input symbols. After this, we will prove that even with helper 11We note that the SKCMs presented here are more restricted in their relation between counter action and transition, but prefer here to give a general definition. Our proof will be relevant even within the restrictions. Accepted as a short paper in ACL 2018 dimensions, we cannot implement a counter that will consistently increase on one input token and decrease on another — as we might want in order to classify the language of all words w for which

a(w) = #b(w).12

Consistently Increasing Counter: The proof relies on the linearity of the affine transform Wx + Uh + b, and the fact that ‘carry’ is a non-linear operation. We work with state values in the range (􀀀1; 1), but the proof can easily be adapted to (0; 1) by rewriting h as h0 + 0:5, where h0 = h 􀀀 0:5 is a vector with values in the range (􀀀0:5; 0:5).

Suppose we have a single-layer SRNN with hidden size n = 3, such that its entire hidden state represents a binary counter that increases every time it receives the input symbol a. We denote by xa the embedding of a, and assume w.l.o.g. that the hidden state dimensions are ordered from MSB to LSB, e.g. the hidden state vector (1; 1;􀀀1) represents the number 110=6. Recall that the binary interpretation of the hidden state relies only on the signs of its values. We use p and n to denote ‘some’ positive or negative value, respectively. Then the number 6 can be represented by any state vector (p; p; n).

Recall also that the SRNN state transition is ht = tanh(Wxt + Uht􀀀1 + b) and consider the state vectors (􀀀1; 1; 1) and (1;􀀀1;􀀀1), which represent 3 and 4 respectively. Denoting~b = Wxa +b, we find that the constants U and~b must satisfy: tanh(U(􀀀1; 1; 1) +~ b) = (p; n; n) tanh(U(1;􀀀1;􀀀1) +~ b) = (p; n; p) As tanh is sign-preserving, this simplifies to: U(􀀀1; 1; 1) = (p; n; n) 􀀀~b U(1;􀀀1;􀀀1) = (p; n; p) 􀀀~b Noting the linearity of matrix multiplication and that (1;􀀀1;􀀀1) = 􀀀(􀀀1; 1; 1), we obtain: U(􀀀1; 1; 1) = U(􀀀(1;􀀀1;􀀀1)) = 􀀀U(1;􀀀1;􀀀1) 12Of course a counter could also be ‘decreased’ by incrementing a parallel, ‘negative’ counter, and implementing compare-to-zero as a comparison between these two. As intuitively no RNN could generalize binary counting behavior to dimensions not used in training, this approach could quickly find both counters outside of their learned range even on a sequence where the difference between them is never larger than in training. (p; n; n) 􀀀~b =~b 􀀀 (p; n; p) i.e. for some assignment to each p and n, 2~b = (p; n; n) + (p; n; p), and in particular~ b[1] < 0. Similarly, for (􀀀1;􀀀1; 1) and (1; 1;􀀀1), we obtain U(􀀀1;􀀀1; 1) = (n; p; n) 􀀀~b U(1; 1;􀀀1) = (p; p; p) 􀀀~b i.e. (n; p; n) 􀀀~b =~b 􀀀 (p; p; p) or 2~b = (p; p; p) + (n; p; n), and in particular that ~ b [ 1] > 0, leading to a contradiction and proving that such an SRNN cannot exist. The argument trivially extends to n > 3 (by padding from the MSB).

We note that this proof does not extend to the case where additional, non counting dimensions are added to the RNN — at least not without further assumptions, such as the assumption that the counter behave correctly for all values of these dimensions, reachable and unreachable. One may argue then that, with enough dimensions, it could be possible to implement a consistently increasing binary counter on a subset of the SRNN’s state.13 We now show a counting mechanism that cannot be implemented even with such ‘helper’ dimensions. Bi-Directional Counter: We show that for n � 3, no SRNN can implement an n-dimensional binary counter that increases for one token, �up, and decreases for another, �down. As before, we show the proof explicitly for n = 3, and note that it can be simply expanded to any n > 3 by padding. Assume by contradiction we have such an SRNN, with m � 3 dimensions, and assume w.l.o.g. that a counter is encoded along the first 3 of these. We use the shorthand (v1; v2; v3)c to show the values of the counter dimensions explicitly while abstracting the remaining state dimensions, e.g. we write the hidden state (􀀀0:5; 0:1; 1; 1; 1) as (􀀀0:5; 0:1; 1)c where c = (1; 1).

Let xup and xdown be the embeddings of �up and �down, and as before denote bup = Wxup + b and bdown = Wxdown + b. Then for some reachable state h1 2 R where the counter value is 1 (e.g., the state reached on the input sequence �up 14)), we find that the constants U; bdown, and 13(By storing processing information on the additional, ‘helper’ dimensions) 14(Or whichever appropriate sequence if the counter is not initiated to zero.) Accepted as a short paper in ACL 2018 bup must satisfy: tanh(Uh1 + bup) = (n; p; n)c1 tanh(Uh1 + bdown) = (n; n; n)c2 (i.e., �up increases the counter and updates the additional dimensions to the values c1, while �down decreases and updates to c2.) Removing the signpreserving function tanh we obtain the constraints Uh1 + bup = (n; p; n)sign(c1) Uh1 + bdown = (n; n; n)sign(c2) i.e. (bup 􀀀 bdown)[0 : 2] = (n; p; n) 􀀀 (n; n; n), and in particular (bup 􀀀 bdown)[1] > 0. Now consider a reachable state h3 for which the counter value is 3. Similarly to before, we now obtain Uh3 + bup = (p; n; n)sign(c3) Uh3 + bdown = (n; p; n)sign(c4) from which we get (bup 􀀀 bdown)[0 : 2] = (p; n; n) 􀀀 (n; p; n), and in particular (bup 􀀀 bdown)[1] < 0, a contradiction to the previous statement. Again we conclude that no such SRNN can exist.

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 132–140, Vancouver, Canada. Association for Computational Linguistics. Yining Chen, Sorcha Gilroy, Kevin Knight, and Jonathan May. 2017. Recurrent neural networks as weighted language recognizers. CoRR, abs/1711.05408. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder– Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics. Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas. Association for Computational Linguistics. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555 [cs]. Jeffrey L. Elman. 1990. Finding Structure in Time. Cognitive Science, 14(2):179–211. Patrick C. Fischer, Albert R. Meyer, and Arnold L. Rosenberg. 1968. Counter machines and counter languages. Mathematical systems theory, 2(3):265– 283. F. A. Gers and E. Schmidhuber. 2001. Lstm recurrent networks learn simple context-free and contextsensitive languages. Transactions on Neural Networks, 12(6):1333–1340. F. A. Gers and J. Schmidhuber. 2000. Recurrent nets that time and count. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, volume 3, pages 189–194 vol.3. K. Greff, R. K. Srivastava, J. Koutnk, B. R. Steunebrink, and J. Schmidhuber. 2017. Lstm: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232. Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4107–4115. Curran Associates, Inc. Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv:1504.00941 [cs]. Hava Siegelmann. 1999. Neural Networks and Analog Computation: Beyond the Turing Limit, 1 edition. Birkh¨auser Basel. Hava T. Siegelmann. 1996. Recurrent neural networks and finite automata. Computational Intelligence, 12:567–574. Hava T. Siegelmann and Eduardo D. Sontag. 1992. On the computational power of neural nets. In: Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT 1992, Pittsburgh, PA, USA, July 27-29, 1992., pages 440–449. Hava T. Siegelmann and Eduardo D. Sontag. 1994. Analog computation via neural networks. Theor. Comput. Sci., 131(2):331–360. Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2773–2781, Cambridge, MA, USA. MIT Press.

;

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2018 OnthePracticalComputationalPowe	Yoav Goldberg Gail Weiss Eran Yahav			On the Practical Computational Power of Finite Precision RNNs for Language Recognition						2018