2014 OneBillionWordBenchmarkforMeasu

(Chelba et al., 2014) ⇒ Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. (2014). “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.” In: Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014).

Subject Headings: One Billion Word Language Modeling Benchmark Task, One Billion Word Benchmark Dataset, GPT-2 Benchmark Task, Text Corpus, Language Model, Language Modeling Algorithm.

Notes

Other Version(s):
- (Chelba et al., 2013) ⇒ Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. (2013). “One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.” Technical Report, Google Research.

Cited By

Google Scholar: ~ 762 Citations, Retrieved:2021-01-03.
Semantic Scholar: ~ 721 Citations, Retrieved:2021-01-03.

Quotes

Author Keywords

Language modeling; benchmark; reproducible research.

Abstract

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a this http://code.google.com/ project; besides the scripts needed to rebuild the training / held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.

1 Introduction

Statistical language modeling has been applied to a wide range of applications and domains with great success. To name a few, automatic speech recognition, machine translation, spelling correction, touch-screen “soft" keyboards and many natural language processing applications depend on the quality of language models (LMs).

The performance of LMs is determined mostly by several factors: the amount of training data, quality and match of the training data to the test data, and choice of modeling technique for estimation from the data. It is widely accepted that the amount of data, and the ability of a given estimation algorithm to accomodate large amounts of training are very important in providing a solution that competes successfully with the entrenched n-gram LMs. At the same time, scaling up a novel algorithm to a large amount of data involves a large amount of work, and provides a significant barrier to entry for new modeling techniques. By choosing one billion words as the amount of training data we hope to strike a balance between the relevance of the benchmark in the world of abundant data, and the ease with which any researcher can evaluate a given modeling approach. This follows the work of Goodman (2001a), who explored performance of various language modeling techniques when applied to large data sets. One of the key contributions of our work is that the experiments presented in this paper can be reproduced by virtually anybody with an interest in LM, as we use a data set that is freely available on the web.

Another contribution is that we provide strong baseline results with the currently very popular neural network LM (Bengio et al., 2003). This should allow researchers who work on competitive techniques to quickly compare their results to the current state of the art.

The paper is organized as follows: Section 2 describes how the training data was obtained; Section 3 provides a short overview of the language modeling techniques evaluated; finally, Section 4 presents results obtained and Section 5 concludes the paper.

2 Description of the Benchmark Data

In the following experiments, we used text data obtained from the WMT11 website^[1]. The data preparation process was performed as follows:

All training monolingual/English corpora were selected
Normalization and tokenization was performed using scripts distributed from the WMT11 site, slightly augmented to normalize various UTF-8 variants for common punctuation, e.g. ’
Duplicate sentences were removed, dropping the number of words from about 2.9 billion to about 0.8 billion (829250940, more exactly, counting sentence boundary markers < s >, < \ s >)
Vocabulary (793471 words including sentence boundary markers ＜S＞, <\ S>) was constructed by discarding all words with count below 3
Words outside of the vocabulary were mapped to <UNK> token, also part of the vocabulary.
Sentence order was randomized, and the data was split into 100 disjoint partitions
One such partition (1%) of the d[ata]] was chosen as the held-out set.
The held-out set was then randomly shuffled and split again into 50 disjoint partitions to be used as development/test data.
One such resulting partition (2%, amounting to 159658 words without counting the sentence beginning marker ＜S＞ which is never predicted by the language model) of the held-out data were used as test data in our experiments; the remaining partitions are reserved for future experiments
The out-of-vocabulary (OoV) rate on the test set was 0.28%.

The benchmark is available as a code.google.com project ^[2]. A copy of the corpus was reproduced externally, and is now hosted at www.statmt.org^[3]. Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline N-gram models.

Because the original data had already randomized sentence order, the benchmark is not useful for experiments with models that capture long context dependencies across sentence boundaries.

3 Baseline Language Models

As baselines, we chose to use (Katz, 1995), and Interpolated (Kneser and Ney, 1995) (KN) 5-gram LMs, as they are the most prevalent. Since in practice these models are pruned, often quite aggressively, we also illustrate the negative effect of (Stolcke, 1998) entropy pruning on both models, similar to (Chelba et al., 2010). In particular, KN smoothing degrades much more rapidly than Katz, calling for a discerning choice in a given application.

4 Advanced Language Modeling Techniques

The number of advanced techniques for statistical language modeling is very large. It is out of scope of this paper to provide their detailed description, but we mention some of the most popular ones:

Below, we provide a short description of models that we used in our comparison using the benchmark data.

4.1 Normalized Stupid Backoff

The Stupid Backoff LM proposed in (Brants et al., 2007) as a simplified version of backoff LM, suited to client-server architectures in a distributed computing environment does not apply any discounting to relative frequencies, and it uses a single backoff weight instead of context-dependent backoff weights. As a result, the Stupid Backoff model does not generate normalized probabilities. For the purpose of computing perplexity as reported in Table 1, values output by the model were normalized over the entire LM vocabulary.

**Table 1:** Results on the 1B Word Benchmark test set with various types of language models.
Model	Num. Params	Training Time		Perplexity
	[billions]	[hours]	[CPUs]
Interpolated KN 5-gram, 1.1B n-grams (KN)	1.76	3	100	67.6
Katz 5-gram, 1.1B n-grams	1.74	2	100	79.9
Stupid Backoff 5-gram (SBO)	1.13	0.4	200	87.9
Interpolated KN 5-gram, 15M n—grams	0.03	3	100	243.2
Katz 5-gram, 15M n-grams	0.03	2	100	127.5
Binary MaXEnt 5-gram (n-gram features)	1.13	1	5000	115.4
Binary MaXEnt 5-gram (n-gram + skip-1 features)	1.8	1.25	5000	107.1
Hierarchical Softmax MaXEnt 4-gram (HME)	6	3	1	101.3
Recurrent NN-256 + MaXEnt 9-gram	20	60	24	58.3
Recurrent NN-512 + MaXEnt 9-gram	20	120	24	54.5
Recurrent NN-1024 + MaXEnt 9-gram	20	240	24	51.3

4.2 Binary Maximum Entropy Language Model

The Binary MaXEnt proposed in (Xu et al., 2011) and aims to avoid the expensive probability normalization during training by using independent binary predictors. Each predictor is trained using all the positive examples, but the negative examples are dramatically down-sampled. This type of model is attractive for parallel training, thus we explored its performance further.

We trained two models with a sampling rate of 0.001 for negative examples, one uses n-gram features only and the other uses n-gram and skip-1 n-gram features. We separated the phases of generating negative examples and tuning model parameters such that the output of the first phase can be shared by two models. The generation of the negative examples took 7.25 hours using 500 machines, while tuning the parameters using 5000 machines took 50 minutes, and 70 minutes for the two models, respectively.

4.3 Maximum Entropy Language Model with Hierarchical Softmax

Another option to reduce training complexity of the MaXEnt models is to use a hierarchical softmax (Goodman, 2001b; Morin and Bengio, 2005). The idea is to estimate probabilities of groups of words, like in a class based model -- only the classes that contain the positive examples need to be evaluated. In our case, we explored a binary Huffman tree representation of the vocabulary, such that evaluation of frequent words takes less time. The idea of using frequencies of words for a hierarchical softmax was presented previously in (Mikolov et al., 2011a).

4.4 Recurrent Neural Network Language Model

The Recurrent Neural Network (RNN) based LM have recently achieved outstanding performance on a number of tasks (Mikolov, 2012). It was shown that RNN LM significantly outperforms many other language modeling techniques on the Penn Treebank data set (Mikolov et al., 2011b). It was also shown that RNN models scale very well to data sets with hundreds of millions of words (Mikolov eta1., 2011c), although the reported training times for the largest models were in the order of weeks.

We cut down training times by a factor of 20-50 for large problems using a number of techniques, which allow RNN training in typically 1-10 days with billions of words, > 1M vocabularies and up to 20B parameters on a single standard machine without GPUs.

These techniques were in order of importance: a) Parallelization of trainingacross available CPU threads, b) Making use of SIMD instructions where possible, c) Reducing number of output parameters by 90%, d) Running a Maximum Entropy model in parallel to the RNN. Because of space limitations we cannot describe the exact details of the speed-ups here -- they will be reported in an upcoming paper.

We trained several models with varying number of neurons (Table 1) using regular SGD with a learning rate of 0.05 to 0.001 using 10 iterations over the data. The MaXEnt models running in parallel to the RNN capture a history of 9 previous words, and the models use as additional features the previous 15 words independently of order. While training times approach 2 weeks for the most complex model, slightly worse models can be trained in a few days. Note that we didn’t optimize for model size nor training speed, only test performance.

5 Results

5.1 Performance of Individual Models

Results achieved on the benchmark data with various types of LM are reported in Table 1. We focused on minimizing the perplexity when choosing hyper-parameters, however, we also report the time required to train the models. Training times are not necessarily comparable as they depend on the underlying implementation. Mapreduce can potentially process larger data sets than single-machine implementations, but come with a large overhead of communication and file I/O. Discussing details of the implementations is outside the scope as this paper.

5.2 Model Combination

The best perplexity results were achieved by linearly interpolating together probabilities from all models. However, only some models had significant weight in the combination; the weights were tuned on the held-out data. As can be seen in Table 2, the best perplexity is about 35% lower than the baseline -- the modified Kneser-Ney smoothed 5-gram model with no count cutoffs. This corresponds to about 10% reduction of cross-entropy (bits).

Somewhat surprisingly the SB0 model receives a relatively high weight in the linear combination of models, despite its poor performance in perplexity, whereas the KN baseline receives a fairly small weight relative to the other models in the combination.

**Table 2**: Model combination on the 1B Word Benchmark test set. The weights were tuned to minimize perplexity on held-out data. The optimal interpolation weights for the KN, rnn1024, rnn5l2, rnn256, SBO, HME were, respectively: 0.06, 0.61, 0.13, 0.00, 0.20, 0.00.
Model	Perplexity
Interpolated KN 5-gram, 1.1B n-grams	67.6
All models	43.8

6 Conclusion

We introduced a new data set for measuring research progress in statistical language modeling. The benchmark data set is based on resources that are freely available on the web, thus fair comparison of various techniques is easily possible. The importance of such effort is unquestionable: it has been seen many times in the history of research that significant progress can be achieved when various approaches are measurable, reproducible, and the barrier to entry is low.

The choice of approximately one billion words might seem somewhat restrictive. Indeed, it can be hardly expected that new techniques will be immediately competitive on a large data set. Computationally expensive techniques can still be compared using for example just the first or the first 10 partitions of this new dataset, corresponding to approx. 10 million and 100 million words. However, to achieve impactful results in domains such as speech recognition and machine translation, the language modeling techniques need to be scaled up to large data sets.

Another contribution of this paper is the comparison of a few novel modeling approaches when trained on a large data set. As far as we know, we were able to train the largest recurrent neural network language model ever reported. The performance gain is very promising; the perplexity reduction of 35% is large enough to let us hope for significant improvements in various applications.

In the future, we would like to encourage other researchers to participate in our efforts to make language modeling research more transparent. This would greatly help to transfer the latest discoveries into real-world applications. In the spirit of a benchmark our first goal was to achieve the best possible test perplexities regardless of model sizes or training time. However, this was a relatively limited collaborative effort, and some well known techniques are still missing. We invite other researchers to complete the picture by evaluating new, and well-known techniques on this corpus. Ideally the benchmark would also contain ASR or SMT lattices/N-best lists, such that one can evaluate application-specific performance as well.

Footnotes

References

2013

[Zweig and Makarychev2013] G. Zweig and K. MakarycheV. 2013. Speed Regularization and Optimality in Word Classing. In: Proceedings of ICASSP.

2012a

[Mikolov2012] T. Mikolov. 2012. Statistical Language Models based on Neural Networks. PhD. thesis, Brno University of Technology.

2012b

[Sundermeyer et al.2012] M. Sundermeyer, R. Schluter, and H. Ney. 2012. LSTM Neural Networksfor Language Modeling. In: Proceedings of Interspeech.

2012c

[Wu et al.2012] Y. Wu, H. Yamamoto, X. Lu, S. Matsuda, C. Hori, and H. Kashioka. 2012. Factored Recurrent Neural Network Language Model in TED Lecture Transcription. In: Proceedings of IWSLT.

2011a

[Mikolov et al.2011a] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. 2011. Extensions of Recurrent Neural Network Language Model. In: Proceedings of ICASSP.

2011b

[Mikolov et al.2011b] T. Mikolov, A. Deoras, S. Kombrink, L. Burget, and J. Cernocky. 2011a. Empirical Evaluation and Combination of Advanced Language Modeling Techniques. In: Proceedings of Interspeech.

2011c

[Mikolov et al.20llc] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. éernocky. 2011b. Strategies for Training Large Scale Neural Network Language Models. In: Proceedings of ASRU.

2011d

[Xu et al.2011] Puyang Xu. A. Gunawardana, and S. Khudanpur. 2011. Efficient Subsampling for Training Complex Language Models. In: Proceedings of EMNLP.

2010a

[Chelba et al.2010] C. Chelba, T. Brants, W. Neveitt, and P. Xu. 2010. Study 2011 Interaction between Entropy Pruning and Kneser-Ney Smoothing. In: Proceedings of Interspeech.

2010b

[Mikolov et al.2010] T. Mikolov, M. Karafiat, L. Burget, J . Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In: Proceedings of Interspeech.

2009

[Chen2009] S. F. Chen. 2009. Shrinking exponential language models. In: Proceedings of NAACL—HLT.

2007a

[Brants et al.2007] T. Brants, A. C. Popat, P. Xu. F. J. Och, and J. Dean. 2007. Large language models in machine translation. In: Proceedings of EMNLP.

2007b

[Mnih and Hinton2007] A. Mnih and G. Hinton. 2007. Three new graphical models for statistical language modelling. In: Proceedings of ICML.

2007c

[Schwenk2007] H. Schwenk. 2007. Continuous space language models. Computer Speech and Language, vol. 21.

2006a

[Emami2006] A. Emami. 2006. A Neural Syntactic Language Model. PhD. thesis, J ohns Hopkins University.

2006b

[Teh2006] Y. W. Teh. 2006. A hierarchical Bayesian language model based on PitmanYorprocesses. In: Proceedings of Coling/ACL.

2005a

[Morin and Bengio2005] F. Morin and Y. Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In: Proceedings of AISTATS.

2005b

[Xu2005] Peng Xu. 2005. Random forests and the data sparseness problem in language modeling. Ph.D. thesis, Johns Hopkins University.

2004

[Roark et al.2004] B. Roark, M. Saralar, M. Collins, and M. Johnson. 2004. Discriminative language modeling with conditional randomfields and the perceptron algorithm. In: Proceedings of ACL.

2003

(Bengio et al., 2003) ⇒ Y. Bengio, R. Ducharme, and P. Vincent. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3: 1 137—1155.

2001a

[Goodman2001a] J. T. Goodman. 2001a. A bit of progress in language modeling, extended version. Technical report MSR—TR—200 1—72.

2001b

[Goodman2001b] J. T. Goodman. 2001b. Classes for fast maximum entropy training. In: Proceedings of ICASSP.

2000

[Chelba and Jelinek2000] C. Chelba and F. J elinek. 2000. Structured language modeling. Computer Speech & Language.

1998

[Stolcke1998] A. Stolcke. 1998. Entropy-based Pruning ofBack-offLanguage Models. In: Proceedings of News Transcription and Understanding Workshop.

1996

[Chen and Goodmanl996] S. F. Chen and J . T. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In: Proceedings of ACL.

1995a

[Katzl995] S. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. In IEEE Transactions on Acoustics, Speech and Signal Processing.

1995b

[Kneser and Neyl995] R. Kneser and H. Ney. 1995. Im- proved Backing—Off For M—Gmm Language Modeling. In: Proceedings of ICASSP.

1994

[Rosenfeldl994] R. Rosenfeld. 1994. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD. thesis, Carnegie Mellon University.

1992

[Brown et al.1992] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class- Based n-gmm Models ofNatural Language. Computational Linguistics, 18, 467—479.

1991

[Jelinek et al.1991] F. Jelinek. B. Merialdo, S. Roukos, and M. Strauss. 1991. A Dynamic Language Model for Speech Recognition. In: Proceedings of the DARPA Workshop on Speech and Natural Language.

1990

[Elman1990] J. Elman. 1990. Finding Structure in Time. Cognitive Science, 14, 179—211.

1986

[Rumelhart et al.1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning internal representations by back-propagating errors. Nature, 323:533— 536.

BibTeX

@inproceedings{2014_OneBillionWordBenchmarkforMeasu,
  author    = {Ciprian Chelba and
               Tomas Mikolov and
               Mike Schuster and
               Qi Ge and
               Thorsten Brants and
               Phillipp Koehn and
               Tony Robinson},
  editor    = {Haizhou Li and
               Helen M. Meng and
               Bin Ma and
               Engsiong Chng and
               Lei Xie},
  title     = {One Billion Word Benchmark for Measuring Progress in Statistical Language
               Modeling},
  booktitle = {Proceedings of the 15th Annual Conference of the International Speech
               Communication Association (INTERSPEECH 2014)},
  pages     = {2635--2639},
  publisher = {ISCA},
  year      = {2014},
  url       = {http://www.isca-speech.org/archive/interspeech\_2014/i14\_2635.html},
}

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2014 OneBillionWordBenchmarkforMeasu	Ciprian Chelba Mike Schuster Qi Ge Thorsten Brants Phillipp Koehn Tony Robinson Tomáš Mikolov			One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling						2013 2014

[1] ttp://statmt.org/wmt11/training-monolingual.tgz

[ftn-2-2] ttps://code.google.com/p/1-billion-word-language-modeling-benchmark/

[ftn-3-3] ttp://www.statmt.org/lm-benchmark

[1]

[2]

[3]