2015 Attention-based Models for Speech Recognition


Subject Headings: Attention Mechanism, Speech Recognition.

Notes

Cited By

Quotes

Abstract

Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis [1, 2] and image caption generation [3]. We extend the attention mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation in [2] reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER on single utterances and 20% on 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to the 17.6% level.
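The mechanism summarized in the abstract can be sketched compactly: a content-based attention score extended with location features obtained by convolving the previous alignment over time, followed by a normalization that avoids overly peaked weights. The NumPy sketch below is an illustration under stated assumptions, not the authors' implementation: the scoring form e_i = w^T tanh(W s + V h_i + U f_i + b) follows the hybrid attention described above; the sigmoid-based normalization stands in for the anti-concentration change the abstract mentions; and all parameter names and shapes (W, V, U, w, b, F) are illustrative.

    import numpy as np

    def location_aware_attention(s_prev, h, alpha_prev, W, V, U, w, b, F):
        """One attention step over T encoder frames (illustrative sketch).

        s_prev:     previous decoder state, shape (d_s,)
        h:          encoder states, shape (T, d_h)
        alpha_prev: previous alignment weights, shape (T,)
        F:          k one-dimensional location filters, shape (k, width)
        """
        # Location features: convolve the previous alignment with each
        # filter in F, giving a (T, k) matrix of location features.
        f = np.stack(
            [np.convolve(alpha_prev, F[j], mode="same") for j in range(F.shape[0])],
            axis=1,
        )
        # Hybrid score per frame: e_i = w^T tanh(W s_prev + V h_i + U f_i + b).
        e = np.tanh(s_prev @ W.T + h @ V.T + f @ U.T + b) @ w
        # Smoothed normalization (assumption: sigmoid in place of the
        # softmax exponential) keeps weights from collapsing onto single frames.
        sig = 1.0 / (1.0 + np.exp(-e))
        alpha = sig / sig.sum()
        # Context vector passed to the sequence generator.
        c = alpha @ h
        return alpha, c

    # Hypothetical usage with random parameters, starting from a uniform alignment.
    T, d_h, d_s, n, k, width = 50, 8, 8, 16, 4, 11
    rng = np.random.default_rng(0)
    alpha, c = location_aware_attention(
        rng.normal(size=d_s), rng.normal(size=(T, d_h)), np.full(T, 1.0 / T),
        rng.normal(size=(n, d_s)), rng.normal(size=(n, d_h)),
        rng.normal(size=(n, k)), rng.normal(size=n), rng.normal(size=n),
        rng.normal(size=(k, width)),
    )

The key design point is that the score for frame i depends not only on content (h_i and the decoder state) but also on where the model attended at the previous step, which is what makes the model robust to utterances much longer than those seen in training.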

References

  • 1. A. Graves. Generating Sequences with Recurrent Neural Networks. arXiv:1308.0850, August 2013.
  • 2. D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. of the 3rd ICLR, 2015. arXiv:1409.0473.
  • 3. K. Xu, J. Ba, R. Kiros, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proc. of the 32nd ICML, 2015. arXiv:1502.03044.
  • 4. V. Mnih, N. Heess, A. Graves, et al. Recurrent Models of Visual Attention. In Proc. of the 27th NIPS, 2014. arXiv:1406.6247.
  • 5. J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end Continuous Speech Recognition Using Attention-based Recurrent NN: First Results. arXiv:1412.1602 [cs, stat], December 2014.
  • 6. A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. arXiv:1410.5401, 2014.
  • 7. J. Weston, S. Chopra, and A. Bordes. Memory Networks. arXiv:1410.3916, 2014.
  • 8. M. Gales and S. Young. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 1(3):195-304, January 2008.
  • 9. G. Hinton, L. Deng, D. Yu, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6):82-97, November 2012.
  • 10. A. Hannun, C. Case, J. Casper, et al. Deep Speech: Scaling Up End-to-end Speech Recognition. arXiv:1412.5567, 2014.
  • 11. S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, November 1997.
  • 12. K. Cho, B. van Merriënboer, C. Gulcehre, et al. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. EMNLP, October 2014.
  • 13. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proc. of the 23rd ICML, pages 369-376, 2006.
  • 14. A. Graves. Sequence Transduction with Recurrent Neural Networks. In Proc. of the 29th ICML, 2012.
  • 15. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proc. of the IEEE, 1998.
  • 16. A. Graves, A.-r. Mohamed, and G. Hinton. Speech Recognition with Deep Recurrent Neural Networks. In Proc. ICASSP, pages 6645-6649, 2013.
  • 17. A. Graves and N. Jaitly. Towards End-to-end Speech Recognition with Recurrent Neural Networks. In Proc. of the 31st ICML, 2014.
  • 18. S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Weakly Supervised Memory Networks. arXiv:1503.08895, 2015.
  • 19. J. S. Garofolo, L. F. Lamel, W. M. Fisher, et al. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
  • 20. D. Povey, A. Ghoshal, G. Boulianne, et al. The Kaldi Speech Recognition Toolkit. In Proc. ASRU, 2011.
  • 21. M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701, 2012.
  • 22. A. Graves. Practical Variational Inference for Neural Networks. In Proc. of the 24th NIPS, 2011.
  • 23. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv:1207.0580, 2012.
  • 24. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Proc. of the 27th NIPS, 2014. arXiv:1409.3215.
  • 25. L. Tóth. Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition. In Proc. ICASSP, 2014.
  • 26. C. Gulcehre, O. Firat, K. Xu, et al. On Using Monolingual Corpora in Neural Machine Translation. arXiv:1503.03535, 2015.
  • 27. J. Bergstra, O. Breuleux, F. Bastien, et al. Theano: A CPU and GPU Math Expression Compiler. In Proc. SciPy, 2010.
  • 28. F. Bastien, P. Lamblin, R. Pascanu, et al. Theano: New Features and Speed Improvements. In Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
  • 29. I. J. Goodfellow, D. Warde-Farley, P. Lamblin, et al. Pylearn2: A Machine Learning Research Library. arXiv:1308.4214, 2013.
  • 30. B. van Merriënboer, D. Bahdanau, V. Dumoulin, et al. Blocks and Fuel: Frameworks for Deep Learning. arXiv:1506.00619 [cs, stat], June 2015.



Author: Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio
Title: Attention-based Models for Speech Recognition
Year: 2015