2018 AnalysingDropoutandCompoundingErrorsinNeuralLanguageModels

From GM-RKB

Subject Headings: Natural Language Model; Text Wikification System; Compounding Error; Dropout Regularization Task.

Notes

Cited By

Quotes

Abstract

This paper carries out an empirical analysis of various dropout techniques for language modelling, such as Bernoulli dropout, Gaussian dropout, Curriculum Dropout, Variational Dropout and Concrete Dropout. Moreover, we propose an extension of variational dropout to concrete dropout and curriculum dropout with varying schedules. We find these extensions to perform well when compared to standard dropout approaches, particularly variational curriculum dropout with a linear schedule. The largest performance increases are obtained when applying dropout on the decoder layer. Lastly, we analyze where most of the errors occur at test time as a post-analysis step, to determine whether the well-known problem of compounding errors is apparent and to what extent the proposed methods mitigate this issue for each dataset. We report results on a 2-hidden layer LSTM, GRU and Highway network with embedding dropout, dropout on the gated hidden layers and the output projection layer for each model. We report our results on the Penn-TreeBank and WikiText-2 word-level language modelling datasets, where the former reduces the long-tail distribution through preprocessing and the latter preserves rare words in the training and test sets.
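The combination described in the abstract (variational dropout with a curriculum schedule, applied at the embedding, hidden and output-projection layers of a 2-hidden layer LSTM) can be illustrated with a short sketch. The following is a minimal, hedged example rather than the authors' implementation: it assumes a PyTorch-style model, and the module names, layer sizes and the linear ramp used for the schedule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of two ideas mentioned in the abstract:
# (1) variational ("locked") dropout, which samples one Bernoulli mask per sequence
#     and reuses it at every time step, and (2) a linear curriculum schedule that
#     ramps the dropout rate up towards its target value over training.
# Layer sizes, names and the schedule form are illustrative assumptions.
import torch
import torch.nn as nn


class LockedDropout(nn.Module):
    """Variational dropout: one mask per (batch, feature), shared across time steps."""

    def forward(self, x, p):
        # x has shape (seq_len, batch, features)
        if not self.training or p == 0.0:
            return x
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask  # the same mask is broadcast over every time step


def linear_curriculum(step, total_steps, target_p):
    """Linearly increase the dropout rate from 0 to target_p over training."""
    return target_p * min(1.0, step / total_steps)


class LSTMLanguageModel(nn.Module):
    """2-hidden-layer LSTM LM with dropout on the embedding, the hidden
    outputs between layers, and the input to the output projection."""

    def __init__(self, vocab_size, emb_dim=400, hid_dim=400):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hid_dim)
        self.lstm2 = nn.LSTM(hid_dim, hid_dim)
        self.decoder = nn.Linear(hid_dim, vocab_size)
        self.lockdrop = LockedDropout()

    def forward(self, tokens, p_emb, p_hid, p_out):
        emb = self.lockdrop(self.embedding(tokens), p_emb)  # embedding dropout
        h1, _ = self.lstm1(emb)
        h1 = self.lockdrop(h1, p_hid)                       # dropout between hidden layers
        h2, _ = self.lstm2(h1)
        out = self.lockdrop(h2, p_out)                      # dropout before the decoder
        return self.decoder(out)


if __name__ == "__main__":
    model = LSTMLanguageModel(vocab_size=10000)
    tokens = torch.randint(0, 10000, (35, 20))              # (seq_len, batch)
    p = linear_curriculum(step=500, total_steps=10000, target_p=0.5)
    logits = model(tokens, p_emb=p, p_hid=p, p_out=p)
    print(logits.shape)                                     # torch.Size([35, 20, 10000])
```

At evaluation time (model.eval()) the masks are disabled, so test-time predictions use the full network, as with standard dropout.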

References

  1. Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conference on Learning Theory, pp. 81-102, 2016.
  2. Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 3084-3092, 2013.
  3. Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
  4. James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.
  5. Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873-3923, 2014.
  6. Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019-1027, 2016.
  7. Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3581-3590, 2017.
  8. Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487-1495. ACM, 2017.
  9. Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
  10. Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  11. Jiri Hron, Alexander G de G Matthews, and Zoubin Ghahramani. Variational Gaussian dropout is not Bayesian. arXiv preprint arXiv:1711.02989, 2017.
  12. Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
  13. Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575-2583, 2015.
  14. Akshay Krishnamurthy, Hal Daumé III, et al. Learning to search better than your teacher. arXiv preprint arXiv:1502.02206, 2015.
  15. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  16. Gabor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.
  17. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  18. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
  19. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.
  20. Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
  21. Pietro Morerio, Jacopo Cavazza, Riccardo Volpi, René Vidal, and Vittorio Murino. Curriculum dropout. arXiv preprint arXiv:1703.06229, 2017.
  22. James O'Neill and Danushka Bollegala. Curriculum-based neighborhood sampling for sequence prediction. arXiv preprint arXiv:1809.05916, 2018.
  23. Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pp. 1723-1731, 2016.
  24. Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Goldstein, and Gavin Taylor. Layer-specific adaptive learning rates for deep networks. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on, pp. 364-368. IEEE, 2015.
  25. Leslie N Smith. Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 464-472. IEEE, 2017.
  26. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
  27. Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.


Author: James O'Neill, Danushka Bollegala
Title: Analysing Dropout and Compounding Errors in Neural Language Models
Year: 2018