2013 Hybrid Speech Recognition with Deep Bidirectional LSTM


Subject Headings: Bidirectional Language Model.

Notes

Cited By

Quotes

Author Keywords

DBLSTM, HMM-RNN hybrid

Abstract

Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.

1. INTRODUCTION

Deep Bidirectional LSTM was recently introduced to speech recognition, giving the lowest recorded error rates on the TIMIT database (Graves et al., 2013). In that work the networks were trained with two end-to-end training methods designed for discriminative sequence transcription with recurrent neural networks, namely Connectionist Temporal Classification [2] and Sequence Transduction [3]. These methods have several advantages: they do not require forced alignments to pre-segment the acoustic data, they directly optimise the probability of the target sequence conditioned on the input sequence, and (especially in the case of Sequence Transduction) they are able to learn an implicit language model from the acoustic training data. However neither method can readily be integrated into existing large vocabulary speech recognition systems, which were designed around the GMM-HMM paradigm. In particular, it is not straightforward to combine them with word-level language models, which play a vital role in real-world tasks.

The standard solution to the problem of training neural networks for speech recognition is to merge them with HMMs in the so-called hybrid [4] or tandem [5] models. The hybrid approach, in particular, has gained prominence in recent years with the performance improvements yielded by deep networks [6, 7]. In this framework a forced alignment given by a GMM-HMM system is used to create frame-level acoustic targets which the network attempts to classify. Using a forced alignment has the advantage of providing a straightforward objective function (cross-entropy error) for network training. Recognition is performed by combining the acoustic probabilities yielded by the network with the state transition probabilities from the HMM and the word transition probabilities from the language model [1], which can be done efficiently for large vocabulary speech using weighted finite state transducers.
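As a rough NumPy sketch of how a hybrid system turns network outputs into decoder scores (not the paper's implementation; the arrays, the prior estimate, and the acoustic scale below are illustrative assumptions), frame-level state posteriors are typically divided by the state priors to give scaled likelihoods before being combined with the HMM transition and language-model scores:

```python
import numpy as np

# Illustrative stand-ins: per-frame state posteriors p(s | x_t) from the
# network (T frames x S states) and state priors p(s), which in practice
# are estimated from the forced alignment used for training.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(100), size=200)   # shape (T=200, S=100)
state_priors = posteriors.mean(axis=0)               # stand-in for counted priors

# The usual hybrid recipe rescales posteriors by priors to obtain scaled
# likelihoods p(x_t | s) up to a constant, then combines them in the log
# domain (with an acoustic scale) with HMM transition and LM scores.
acoustic_scale = 0.1                                  # illustrative value
log_acoustic = np.log(posteriors + 1e-10) - np.log(state_priors + 1e-10)
decoder_scores = acoustic_scale * log_acoustic        # passed to the WFST decoder
```

Dividing by the priors converts p(state | frame) into a quantity proportional to p(frame | state), which is the form the HMM framework expects.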

One of the original motivations for the hybrid approach was to increase the amount of context used by the acoustic model. In modern hybrid systems, acoustic context windows of 11 to 21 frames are typically provided to the network. Recurrent Neural Networks (RNNs) can learn how much context to use for each prediction, and are in principle able to access information from anywhere in the acoustic sequence. It is therefore unsurprising that HMM-RNN hybrids have been considered for almost twenty years [8, 9, 10]. So far, however, they have not become a staple of large vocabulary speech recognition.
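For concreteness, such fixed context windows are usually built by splicing each frame with its neighbours; a minimal sketch (assuming a NumPy feature matrix, with edge frames repeated at utterance boundaries) is:

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Stack each frame with `left` and `right` neighbouring frames
    (an 11-frame window by default), repeating the edge frames."""
    T, D = features.shape
    padded = np.concatenate([np.repeat(features[:1], left, axis=0),
                             features,
                             np.repeat(features[-1:], right, axis=0)])
    # Each output row concatenates frames t-left .. t+right of the utterance.
    return np.stack([padded[t:t + left + right + 1].reshape(-1)
                     for t in range(T)])

# e.g. 300 frames of 40-dim filterbank features -> (300, 440) spliced inputs
spliced = splice_frames(np.random.randn(300, 40))
```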

The two main goals of this paper are to compare the performance of DBLSTM-HMM hybrids with the end-to-end sequence training defined in (Graves et al., 2013), and to assess the potential of DBLSTM-HMM hybrids for large vocabulary speech recognition. The network architecture is described in Section 2 and the training method in Section 3. An experimental comparison with end-to-end training on the TIMIT database is given in Section 4, and a comparison with deep networks and GMMs on the Wall Street Journal corpus is provided in Section 5. Section 6 contains a discussion of the results and their implications for DBLSTM training, and concluding remarks are given in Section 7.

2. NETWORK ARCHITECTURE

Given an input sequence [math]\displaystyle{ x = (x_1, ..., x_T) }[/math], a standard recurrent neural network (RNN) computes the hidden vector sequence [math]\displaystyle{ h = (h_1, ..., h_T) }[/math] and output vector sequence [math]\displaystyle{ y = (y_1, ..., y_T) }[/math] by iterating the following equations from t = 1 to T:

[math]\displaystyle{ h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \quad (1) }[/math]
[math]\displaystyle{ y_t = W_{hy} h_t + b_y \quad (2) }[/math]

where the W terms denote weight matrices (e.g. [math]\displaystyle{ W_{xh} }[/math] is the input-hidden weight matrix), the b terms denote bias vectors (e.g. [math]\displaystyle{ b_h }[/math] is the hidden bias vector) and [math]\displaystyle{ \mathcal{H} }[/math] is the hidden layer function. [math]\displaystyle{ \mathcal{H} }[/math] is usually an elementwise application of a sigmoid function. However, we have found that the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long-range context. Fig. 1 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [12], [math]\displaystyle{ \mathcal{H} }[/math] is implemented by the following composite function:

[math]\displaystyle{ i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \quad (3) }[/math]
[math]\displaystyle{ f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \quad (4) }[/math]
[math]\displaystyle{ c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \quad (5) }[/math]
[math]\displaystyle{ o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \quad (6) }[/math]
[math]\displaystyle{ h_t = o_t \tanh(c_t) \quad (7) }[/math]

where [math]\displaystyle{ \sigma }[/math] is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrices from the cell to gate vectors (e.g. [math]\displaystyle{ W_{ci} }[/math]) are diagonal, so element m in each gate vector only receives input from element m of the cell vector. One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, where whole utterances are transcribed at once, there is no reason not to exploit future context as well. Bidirectional RNNs (BRNNs) [13] do this by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. As illustrated in Fig. 2, a BRNN computes the forward hidden sequence [math]\displaystyle{ \overrightarrow{h} }[/math], the backward hidden sequence [math]\displaystyle{ \overleftarrow{h} }[/math] and the output sequence y by iterating the backward layer from t = T to 1, the forward layer from t = 1 to T, and then updating the output layer:

[math]\displaystyle{ \overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}) \quad (8) }[/math]
[math]\displaystyle{ \overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}) \quad (9) }[/math]
[math]\displaystyle{ y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y \quad (10) }[/math]

Fig. 1. Long Short-term Memory Cell

Fig. 2. Bidirectional Recurrent Neural Network
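To make the recurrences above concrete, the NumPy sketch below implements one step of the LSTM cell in equations (3)-(7) and the bidirectional combination of equations (8)-(10). It is only an illustration: the weight shapes, random initialisation, and the 61-class output layer are assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of Eqs. (3)-(7). The cell-to-gate ('peephole') weights are
    stored as vectors, i.e. the diagonals of the diagonal matrices W_ci, W_cf, W_co."""
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])   # (3)
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])   # (4)
    c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])     # (5)
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c + b['o'])        # (6)
    h = o * np.tanh(c)                                                          # (7)
    return h, c

def lstm_layer(xs, W, b, hidden):
    """Run the cell over a sequence of input vectors, returning all hidden states."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    out = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        out.append(h)
    return np.array(out)

# Illustrative sizes: T frames, D-dim inputs, H hidden units, K output classes.
T, D, H, K = 50, 40, 128, 61
rng = np.random.default_rng(0)

def make_params():
    W = {k: rng.normal(0, 0.1, (H, D)) for k in ('xi', 'xf', 'xc', 'xo')}
    W.update({k: rng.normal(0, 0.1, (H, H)) for k in ('hi', 'hf', 'hc', 'ho')})
    W.update({k: rng.normal(0, 0.1, H) for k in ('ci', 'cf', 'co')})  # diagonal peepholes
    b = {k: np.zeros(H) for k in ('i', 'f', 'c', 'o')}
    return W, b

xs = rng.normal(size=(T, D))
(W_f, b_f), (W_b, b_b) = make_params(), make_params()
h_fwd = lstm_layer(xs, W_f, b_f, H)               # forward layer, as in Eq. (8)
h_bwd = lstm_layer(xs[::-1], W_b, b_b, H)[::-1]   # backward layer, as in Eq. (9)
W_fy, W_by, b_y = rng.normal(0, 0.1, (K, H)), rng.normal(0, 0.1, (K, H)), np.zeros(K)
y = h_fwd @ W_fy.T + h_bwd @ W_by.T + b_y         # output layer, as in Eq. (10)
```

Note that the backward layer simply runs the same recurrence over the reversed input sequence, and its outputs are reversed again before both directions are combined at the output layer.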

Combining BRNNs with LSTM gives bidirectional LSTM [14], which can access long-range context in both input directions. A crucial element of the recent success of hybrid systems is the use of deep architectures, which are able to build up progressively higher-level representations of acoustic data. Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next, as shown in Fig. 3. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vector sequences [math]\displaystyle{ h^n }[/math] are computed iteratively from n = 1 to N and t = 1 to T:

[math]\displaystyle{ h^n_t = \mathcal{H}(W_{h^{n-1}h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h) \quad (11) }[/math]

where we define [math]\displaystyle{ h^0 = x }[/math]. The network outputs [math]\displaystyle{ y_t }[/math] are

[math]\displaystyle{ y_t = W_{h^N y} h^N_t + b_y \quad (12) }[/math]

Deep bidirectional RNNs can be implemented by replacing each hidden sequence [math]\displaystyle{ h^n }[/math] with the forward and backward sequences [math]\displaystyle{ \overrightarrow{h}^n }[/math] and [math]\displaystyle{ \overleftarrow{h}^n }[/math], and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. If LSTM is used for the hidden layers we get deep bidirectional LSTM, as illustrated in Fig. 4.
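A minimal sketch of this deep bidirectional stacking (equations (11)-(12) and Fig. 4) is given below. For brevity each layer uses a plain tanh recurrence standing in for the LSTM cell of equations (3)-(7), and all sizes, initialisations, and the 61-class output are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
new = lambda *shape: rng.normal(0, 0.1, shape)   # illustrative initialisation

def recurrent_layer(xs, W_in, W_rec, b, reverse=False):
    """One recurrent layer (a tanh recurrence standing in for the LSTM cell),
    optionally run backwards over the sequence."""
    if reverse:
        xs = xs[::-1]
    h = np.zeros(W_rec.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_in @ x_t + W_rec @ h + b)   # per-layer form of Eq. (11)
        out.append(h)
    out = np.array(out)
    return out[::-1] if reverse else out

def deep_bidirectional(xs, n_layers=3, hidden=128, n_out=61):
    """Stack n_layers bidirectional layers: layer n receives the concatenated
    forward and backward outputs of layer n-1 (Fig. 4), with Eq. (12) on top."""
    inp = xs
    for _ in range(n_layers):
        d = inp.shape[1]
        fwd = recurrent_layer(inp, new(hidden, d), new(hidden, hidden), np.zeros(hidden))
        bwd = recurrent_layer(inp, new(hidden, d), new(hidden, hidden), np.zeros(hidden),
                              reverse=True)
        inp = np.concatenate([fwd, bwd], axis=1)   # input to the next layer up
    W_out, b_out = new(n_out, inp.shape[1]), np.zeros(n_out)
    return inp @ W_out.T + b_out                   # Eq. (12)

features = rng.normal(size=(200, 40))        # e.g. 200 frames of 40-dim features
frame_scores = deep_bidirectional(features)  # (200, 61) per-frame output scores
```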

References

[1] (Graves et al., 2013) ⇒ A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP 2013, Vancouver, Canada, May 2013.

[2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in ICML, Pittsburgh, USA, 2006.

[3] A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.

[4] H.A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.

[5] Qifeng Zhu, Barry Chen, Nelson Morgan, and Andreas Stolcke, “Tandem connectionist feature extraction for conversational speech recognition,” in International Conference on Machine Learning for Multimodal Interaction, Berlin, Heidelberg, 2005, MLMI’04, pp. 223– 231, Springer-Verlag.

[6] A. Mohamed, G.E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, Jan. 2012.

[7] G. Hinton, Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.

[8] A. J. Robinson, “An Application of Recurrent Nets to Phone Probability Estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.

[9] Oriol Vinyals, Suman Ravuri, and Daniel Povey, “Revisiting Recurrent Neural Networks for Robust ASR,” in ICASSP, 2012.

[10] A. Maas, Q. Le, T. O’Neil, O. Vinyals, P. Nguyen, and A. Ng, “Recurrent neural networks for noise reduction in robust ASR,” in INTERSPEECH, 2012.

[11] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735– 1780, 1997.

[12] F. Gers, N. Schraudolph, and J. Schmidhuber, “Learning Precise Timing with LSTM Recurrent Networks,” Journal of Machine Learning Research, vol. 3, pp. 115–143, 2002.

[13] M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.

[14] A. Graves and J. Schmidhuber, “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, June/July 2005.

[15] Kam-Chuen Jim, C.L. Giles, and B.G. Horne, “An analysis of noise in recurrent neural networks: convergence and generalization,” IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1424–1438, Nov. 1996.

[16] Geoffrey E. Hinton and Drew van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” in COLT, 1993, pp. 5–13.

[17] A. Graves, “Practical variational inference for neural networks,” in NIPS, pp. 2348–2356. 2011.

[18] DARPA-ISTO, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), speech disc CD1-1.1 edition, 1990.

[19] Kai-Fu Lee and Hsiao-Wuen Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.

[20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011, IEEE Signal Processing Society.


Author: Alex Graves, Navdeep Jaitly, Abdel-rahman Mohamed
Title: Hybrid Speech Recognition with Deep Bidirectional LSTM
Year: 2013
Footnote 1: In practice the HMM state transitions have become less significant as linguistic and acoustic models have improved, and many current systems ignore them altogether.