2019 ErrorCorrectingNeuralSequencePr

From GM-RKB

Subject Headings: ECOC-NLM System; Text Error Correction System; Artificial Neural Network; Neural Language Model; Latent Mixture Sampling.

Notes

Cited By

Quotes

Abstract

We propose a novel neural language modelling (NLM) method based on error-correcting output codes (ECOC), abbreviated as ECOC-NLM. This latent variable based approach provides a principled way to choose a varying amount of latent output codes and avoids exact softmax normalization. Instead of minimizing measures between the predicted probability distribution and the true distribution, we use error-correcting codes to represent both predictions and outputs. Secondly, we propose multiple ways to improve accuracy and convergence rates by maximizing the separability between codes that correspond to classes proportional to word embedding similarities. Lastly, we introduce a novel method called Latent Mixture Sampling, a technique that is used to mitigate exposure bias and can be integrated into training latent-based neural language models. This involves mixing the latent codes (i.e., latent variables) of past predictions and past targets in one of two ways: (1) according to a predefined sampling schedule or (2) a differentiable sampling procedure whereby the mixing probability is learned throughout training by replacing the greedy argmax operation with a smooth approximation. In evaluating Codeword Mixture Sampling (CWMS) for ECOC-NLM, we also baseline it against CWMS in a closely related Hierarchical Softmax-based NLM.
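The latent-code mixing described in the abstract can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration (not the authors' implementation) of mixing the target codeword with the model's predicted code at one timestep, either with a hard, schedule-driven mixture or with a smooth sigmoid relaxation in place of the hard bit threshold so that the mixing step stays differentiable; all names, shapes and the toy schedule are assumptions.

```python
import torch

def mix_codes(target_code, pred_logits, mix_prob, differentiable=False):
    """Hypothetical sketch of latent-code mixing for one timestep.

    target_code: (batch, c) binary codeword of the ground-truth word.
    pred_logits: (batch, c) model logits for the c latent code bits.
    mix_prob:    probability of feeding back the model's own code.
    """
    if differentiable:
        # Smooth relaxation: keep soft bit probabilities instead of a hard
        # threshold, so gradients can flow through the mixing step.
        pred_code = torch.sigmoid(pred_logits)
    else:
        # Greedy (hard) decoding of the predicted codeword into 0/1 bits.
        pred_code = (pred_logits > 0).float()
    # Per-example Bernoulli mask: use the model's own code with probability
    # mix_prob, otherwise feed the ground-truth codeword (teacher forcing).
    use_pred = torch.bernoulli(torch.full((target_code.size(0), 1), mix_prob))
    return use_pred * pred_code + (1.0 - use_pred) * target_code

# Toy usage with a 16-bit code and a schedule that raises mix_prob over epochs.
target = torch.randint(0, 2, (4, 16)).float()
logits = torch.randn(4, 16)
for epoch in range(3):
    mixed = mix_codes(target, logits, mix_prob=min(0.5, 0.1 * epoch))
```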

1. Introduction

Language modelling (LM) is a fundamental task in natural language processing that requires a parametric model to generate tokens given past tokens. LM underlies all other types of structured modelling tasks in natural language, such as Named Entity Recognition, Constituency/Dependency Parsing, Coreference Resolution, Machine Translation (Sutskever et al., 2014) and Question Answering (Mikolov et al., 2010). The goal is to learn a joint probability distribution for a sequence of length [math]\displaystyle{ T }[/math] containing words from a vocabulary [math]\displaystyle{ V }[/math]. This distribution can be decomposed into the conditional distributions of current tokens given past tokens using the chain rule, as shown in Equation 1. In Neural Language Modelling (NLM), a Recurrent Neural Network (RNN) [math]\displaystyle{ f_\theta(\cdot) }[/math] parameterized by [math]\displaystyle{ \theta }[/math] encodes the information at each timestep [math]\displaystyle{ t }[/math] into a hidden state vector [math]\displaystyle{ h^l_t }[/math], which is followed by a decoder [math]\displaystyle{ z_t^l = h^l_t W^l + b^l }[/math] and a normalization function [math]\displaystyle{ \phi(z_t^l) }[/math] that forms a probability distribution [math]\displaystyle{ \hat{p}_{\theta}\left(y_t|x_t,h_{t-1}\right), \forall t \in \{1,\ldots,T\} }[/math].

[math]\displaystyle{ P(w_1, \cdots, w_T)=\prod_{t=1}^T P\left(w_t|w_{t-1}, \cdots, w_1\right) \quad }[/math] (1)
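For concreteness, the following is a minimal sketch of the standard NLM pipeline described above (embedding, recurrent encoder, linear decoder, softmax normalization) and of how Equation 1 sums per-timestep conditional log-probabilities; it assumes PyTorch, and the layer sizes and class name are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal RNN language model producing p(w_t | w_{t-1}, ..., w_1)."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)    # z_t = h_t W + b

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.embed(tokens), state)      # hidden state h_t for every t
        logits = self.decoder(h)                            # unnormalized scores z_t
        return torch.log_softmax(logits, dim=-1), state     # normalization phi(z_t)

# Chain rule (Equation 1): the sequence log-probability is the sum of the
# per-timestep conditional log-probabilities of each next token.
model = RNNLanguageModel(vocab_size=1000)
seq = torch.randint(0, 1000, (1, 12))
log_probs, _ = model(seq[:, :-1])
seq_log_prob = log_probs.gather(-1, seq[:, 1:].unsqueeze(-1)).sum()
```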

However, training can be slow when [math]\displaystyle{ |V| }[/math] is large, while the corresponding input embedding matrices also leave a large memory footprint. Conversely, in cases where the decoder is limited by an information bottleneck (Yang et al., 2017), the opposite is required: more degrees of freedom are necessary to alleviate information loss in the decoder bottleneck. Both scenarios correspond to a trade-off between computational complexity and out-of-sample performance. Hence, we require that a newly proposed model has the property that the decoder can be easily configured to deal with this trade-off in a principled way.

Lastly, standard supervised learning (self-supervised for language modelling) assumes inputs are i.i.d. However, in sequence prediction, the model has to rely on its own predictions at test time, instead of the past targets that are used as inputs at training time. This difference is known as exposure bias and can lead to errors compounding along a generated sequence. The standard training approach is also known as teacher forcing, where the teacher provides the targets that are used as inputs at training time. We also require that exposure bias is addressed in our approach while dealing with the aforementioned challenges related to computation and performance trade-offs in the decoder.
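To make the teacher-forcing and exposure-bias distinction concrete, the sketch below shows token-level Scheduled Sampling (Bengio et al., 2015), the baseline the paper compares against: at each step the next input is the ground-truth token with probability epsilon and the model's own greedy prediction otherwise. It is an illustrative sketch, not the paper's code; the model interface reuses the hypothetical RNN LM above, and the decay schedule is an assumption.

```python
import random
import torch

def scheduled_sampling_pass(model, targets, epsilon, state=None):
    """One pass over a target sequence with token-level Scheduled Sampling.

    targets: (batch, T) ground-truth token ids.
    epsilon: probability of feeding the ground-truth token (teacher forcing);
             with probability 1 - epsilon the model's own argmax is fed back.
    """
    inputs = targets[:, :1]                      # start from the first token
    step_log_probs = []
    for t in range(1, targets.size(1)):
        log_probs, state = model(inputs, state)  # e.g. the RNN LM sketched above
        step_log_probs.append(log_probs[:, -1])
        if random.random() < epsilon:
            inputs = targets[:, t:t + 1]                           # teacher forcing
        else:
            inputs = log_probs[:, -1].argmax(-1, keepdim=True)     # own prediction
    return torch.stack(step_log_probs, dim=1)    # (batch, T-1, |V|) for the loss

# A typical schedule decays epsilon from pure teacher forcing toward free running,
# e.g. epsilon = max(0.1, 1.0 - epoch / num_epochs).
```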

We propose an error-correcting output code (ECOC) based NLM (ECOC-NLM) that addresses these desiderata. In the approximate case where codeword dimensionality [math]\displaystyle{ c \lt |V| }[/math], we show that given sufficient error codes ([math]\displaystyle{ \log_2(|V|) \ll c \ll |V| }[/math]), we maintain accuracy compared to traditional NLMs that use the full softmax and other approximate methods. Lastly, we show that this latent-based NLM approach can be extended to mitigate the aforementioned problem of compounding errors by using Latent Mixture Sampling (LMS). LMS in an ECOC-NLM model also outperforms an equivalent Hierarchical Softmax-based NLM that uses Scheduled Sampling (Bengio et al., 2015) and other closely related baselines. To our knowledge, this is the first latent-based technique for mitigating compounding errors in recurrent neural networks (RNNs).
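The core mechanics of an ECOC output layer can be sketched as follows: each word in V is assigned a c-bit codeword with c far smaller than |V|, the decoder predicts the c bits with independent sigmoids (avoiding softmax normalization over |V|), and a word is recovered by nearest-codeword decoding under Hamming distance. This is an illustrative reconstruction under assumed names; in particular, the random codebook is a placeholder, whereas the paper orders codes to maximize separability and reflect embedding similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECOCDecoder(nn.Module):
    """Illustrative ECOC output layer with c-bit codewords, c << |V|."""

    def __init__(self, vocab_size, hidden_dim, code_bits):
        super().__init__()
        self.bit_scorer = nn.Linear(hidden_dim, code_bits)
        # Random codebook as a placeholder for a properly designed one.
        self.register_buffer(
            "codebook", torch.randint(0, 2, (vocab_size, code_bits)).float()
        )

    def loss(self, hidden, target_ids):
        # Predict each bit of the target word's codeword independently,
        # so no normalization over the full vocabulary is required.
        logits = self.bit_scorer(hidden)
        return F.binary_cross_entropy_with_logits(logits, self.codebook[target_ids])

    def decode(self, hidden):
        # Nearest-codeword decoding: choose the word whose codeword has the
        # smallest Hamming distance to the thresholded bit predictions.
        bits = (self.bit_scorer(hidden) > 0).float()
        hamming = (bits.unsqueeze(-2) != self.codebook).float().sum(-1)
        return hamming.argmin(-1)

# Toy usage: 10k-word vocabulary compressed to 100 code bits.
decoder = ECOCDecoder(vocab_size=10000, hidden_dim=256, code_bits=100)
h = torch.randn(8, 256)
loss = decoder.loss(h, torch.randint(0, 10000, (8,)))
predicted_words = decoder.decode(h)
```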

Our main contributions are summarized as follows:

2. Background

3. Methodology

4. Experimental Setup

5. Results

6. Conclusion

This work proposed an error-correcting neural language model and a novel Latent Variable Mixture Sampling method for latent variable models. We find that performance is maintained compared to using the full conditional and related approximate methods, given a sufficient codeword size to account for correlations among classes. This corresponds to 40 bits for PTB and 100 bits for WikiText-2 and WikiText-103. Furthermore, we find that performance is improved when rank-ordering the codebook via embedding similarity, where the query is the embedding of the most frequent word.
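A minimal sketch of the codebook rank-ordering mentioned above, under the assumption that pretrained word embeddings are available: words are ranked by cosine similarity to the embedding of the most frequent word, and an existing list of candidate codewords is then assigned in that order so that similar words receive nearby codes. The function name and the assignment scheme are assumptions for illustration, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def rank_order_codebook(embeddings, codewords, most_frequent_id=0):
    """Assign codewords to words ranked by embedding similarity (sketch).

    embeddings: (|V|, d) pretrained word embeddings.
    codewords:  (|V|, c) candidate codewords, e.g. ordered by Hamming weight.
    The query is the embedding of the most frequent word; words most similar
    to it receive the earliest codewords in the list.
    """
    query = embeddings[most_frequent_id].unsqueeze(0)
    similarity = F.cosine_similarity(embeddings, query, dim=-1)   # (|V|,)
    order = similarity.argsort(descending=True)                   # word ranking
    codebook = torch.empty_like(codewords)
    codebook[order] = codewords            # i-th ranked word gets the i-th codeword
    return codebook

# Toy usage: 1,000 words, 50-dim embeddings, 40-bit codes (the size the paper
# reports as sufficient for PTB).
emb = torch.randn(1000, 50)
codes = torch.randint(0, 2, (1000, 40)).float()
codebook = rank_order_codebook(emb, codes)
```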

Lastly, we introduced Latent Variable Mixture Sampling to mitigate exposure bias. It can be easily integrated into training latent variable-based language models, such as the ECOC-based language model. We show that this method outperforms the well-known scheduled sampling method with a full softmax, a hierarchical softmax and an adaptive softmax on an image captioning task, with fewer decoder parameters than the full softmax: only 200 bits, i.e., 2% of the original number of output dimensions.

References

  • 1. Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171-1179, 2015.
  • 2. Bengio, Y. and Senécal, J.-S. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713-722, 2008.
  • 3. Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003a.
  • 4. Bengio, Y., Senécal, J.-S., et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, pp. 1-9, 2003b.
  • 5. Berger, A. Error-correcting output coding for text classification. In IJCAI-99: Workshop on Machine Learning for Information Filtering, 1999.
  • 6. Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019-1027, 2016.
  • 7. Goodman, J. T. A bit of progress in language modeling. Computer Speech & Language, 15(4):403-434, 2001.
  • 8. Goyal, K., Dyer, C., and Berg-Kirkpatrick, T. Differentiable scheduled sampling for credit assignment. arXiv preprint arXiv:1704.06970, 2017.
  • 9. Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529-536, 2005.
  • 10. Grave, E., Joulin, A., Cissé, M., Grangier, D., and Jégou, H. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309, 2016.
  • 11. Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297-304, 2010.
  • 12. Hamming, R. W. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147-160, 1950.
  • 13. Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  • 14. Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
  • 15. Kong, E. B. and Dietterich, T. G. Error-correcting output coding corrects bias and variance. In Machine Learning Proceedings 1995, pp. 313-321. Elsevier, 1995.
  • 16. Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • 17. (Mikolov et al., 2010) ⇒ Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • 18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
  • 19. Mnih, A. and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
  • 20. Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246-252, 2005.
  • 21. Sejnowski, T. J. and Rosenberg, C. R. Parallel networks that learn to pronounce English text. Complex Systems, 1(1):145-168, 1987.
  • 22. Shi, K. and Yu, K. Structured word embedding for low memory neural network language model. Proc. Interspeech 2018, pp. 1254-1258, 2018.
  • 23. Shu, R. and Nakayama, H. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068, 2017.
  • 24. (Sutskever et al., 2014) ⇒ Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
  • 25. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.
  • 26. Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
  • 27. Zhang, L., Zhang, Y., Tang, J., Lu, K., and Tian, Q. Binary code ranking with weighted Hamming distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586-1593, 2013.


Author: James O'Neill, Danushka Bollegala
Title: Error-Correcting Neural Sequence Prediction
Year: 2019