2001 ABitofProgressinLanguageModeling


Subject Headings: Language Modeling, N-Gram Language Model, Trigram Language Model, Smoothed N-Gram Language Model, Katz-Smoothed Trigram Model, Caching-based Language Model, Clustering-based Language Model, Higher-Order N-Gram Language Model, Skipping Language Model, Sentence-Mixture Language Model.

Notes

Cited By

Quotes

Abstract

In the past several years, a number of different language modeling improvements over simple trigram models have been found, including caching, higher-order n-grams, skipping, interpolated Kneser-Ney smoothing, and clustering. We present explorations of variations on, or of the limits of, each of these techniques, including showing that sentence mixture models may have more potential. While all of these techniques have been studied separately, they have rarely been studied in combination. We compare a combination of all techniques together to a Katz smoothed trigram model with no count cutoffs. We achieve perplexity reductions between 38 and 50% (1 bit of entropy), depending on training data size, as well as a word error rate reduction of 8.9%. Our perplexity reductions are perhaps the highest reported compared to a fair baseline.

Introduction

1.1 Overview

Language modeling is the art of determining the probability of a sequence of words. This is useful in a large variety of areas including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Hull, 1992; Kernighan et al., 1990; Srihari and Baltus, 1992). The most commonly used language models are very simple (e.g., a Katz-smoothed trigram model). There are many improvements over this simple model, however, including caching, clustering, higher-order n-grams, skipping models, and sentence-mixture models, all of which we will describe below. Unfortunately, these more complicated techniques have rarely been examined in combination. It is entirely possible that two techniques that work well separately will not work well together, and, as we will show, even possible that some techniques will work better together than either one does by itself. In this paper, we will first examine each of the aforementioned techniques separately, looking at variations on the technique, or its limits. Then we will examine the techniques in various combinations, and compare them to a Katz-smoothed trigram model with no count cutoffs. On a small training data set, 100,000 words, we can get up to a 50% perplexity reduction, which is one bit of entropy. On larger data sets, the improvement declines, going down to 41% on our largest data set, 284,000,000 words. On a similar large set without punctuation, the reduction is 38%. On that data set, we achieve an 8.9% word error rate reduction. These are perhaps the largest reported perplexity reductions for a language model, versus a fair baseline.

The paper is organized as follows. First, in this section, we will describe our terminology, briefly introduce the various techniques we examined, and describe our evaluation methodology. In the following sections, we describe each technique in more detail, and give experimental results with variations on the technique, determining for each the best variation, or its limits. In particular, for caching, we show that trigram caches have nearly twice the potential of unigram caches. For clustering, we find variations that work slightly better than traditional clustering, and examine the limits. For n-gram models, we examine up to 20-grams, but show that even for the largest models, performance has plateaued by 5 to 7 grams. For skipping models, we give the first detailed comparison of different skipping techniques, and the first that we know of at the 5-gram level. For sentence mixture models, we show that mixtures of up to 64 sentence types can lead to improvements. We then give experiments comparing all techniques, and combining all techniques in various ways. All of our experiments are done on three or four data sizes, showing which techniques improve with more data, and which get worse. In the concluding section, we discuss our results. Finally, in the appendices, we give a proof helping to justify Kneser-Ney smoothing and we describe implementation tricks and details for handling large data sizes, for optimizing parameters, for clustering, and for smoothing.

1.2 Technique introductions

The goal of a language model is to determine the probability of a word sequence [math]\displaystyle{ w_1...w_n, P (w_1...w_n) }[/math]. This probability is typically broken down into its component probabilities:

[math]\displaystyle{ P (w_1...w_n) = P (w_1) \times P (w_2 \mid w_1) \times ... \times P (w_n \mid w_1...w_{n−1}) }[/math]

Since it may be difficult to compute a probability of the form [math]\displaystyle{ P(w_i \mid w_1...w_{i−1}) }[/math] for large i, we typically assume that the probability of a word depends on only the two previous words, the trigram assumption:

[math]\displaystyle{ P (w_i \mid w_1...w_{i−1}) \approx P (w_i \mid w_{i−2}w_{i−1}) }[/math]

which has been shown to work well in practice. The trigram probabilities can then be estimated from their counts in a training corpus. We let [math]\displaystyle{ C (w_{i−2}w_{i−1}w_i) }[/math] represent the number of occurrences of [math]\displaystyle{ w_{i−2}w_{i−1}w_i }[/math] in our training corpus, and similarly for [math]\displaystyle{ C (w_{i−2}w_{i−1}) }[/math]. Then, we can approximate:

[math]\displaystyle{ P (w_i \mid w_{i−2}w_{i−1}) \approx \frac{C (w_{i−2}w_{i−1}w_i)}{C (w_{i−2}w_{i−1})} }[/math]

Unfortunately, in general this approximation will be very noisy, because there are many three word sequences that never occur. Consider, for instance, the sequence “party on Tuesday”. What is [math]\displaystyle{ P (\text{Tuesday} \mid \text{party on}) }[/math]? Our training corpus might not contain any instances of the phrase, so C (party on Tuesday) would be 0, while there might still be 20 instances of the phrase “party on”. Thus, we would predict [math]\displaystyle{ P (\text{Tuesday} \mid \text{party on}) = 0 }[/math], clearly an underestimate. This kind of 0 probability can be very problematic in many applications of language models. For instance, in a speech recognizer, words assigned 0 probability cannot be recognized no matter how unambiguous the acoustics.
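To make this concrete, here is a minimal sketch (not from the paper) of maximum-likelihood trigram estimation from counts; the toy corpus and helper names are made up for illustration. It reproduces the zero-probability problem described above.

```python
from collections import Counter

def train_trigram_mle(tokens):
    """Collect trigram counts and the counts of their two-word contexts."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return trigram_counts, bigram_counts

def p_mle(word, context, trigram_counts, bigram_counts):
    """Maximum-likelihood P(word | context) = C(context, word) / C(context)."""
    c_context = bigram_counts[context]
    if c_context == 0:
        return 0.0
    return trigram_counts[context + (word,)] / c_context

# Toy corpus: "party on" is seen, but "party on Tuesday" never is.
corpus = "we will party on Monday and party on Wednesday".split()
tri, bi = train_trigram_mle(corpus)
print(p_mle("Monday", ("party", "on"), tri, bi))   # 0.5
print(p_mle("Tuesday", ("party", "on"), tri, bi))  # 0.0 -- the zero-probability problem
```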

Smoothing techniques take some probability away from some occurrences. Imagine that we have in our training data a single example of the phrase “party on Stan Chen’s birthday”.[1] Typically, when something occurs only one time, it is greatly overestimated. In particular,

[math]\displaystyle{ P (\text{Stan} \mid \text{party on}) \lt \frac{1}{20} = \frac{C (\text{party on Stan})} { C (\text{party on}) } }[/math]

By taking some probability away from some words, such as “Stan” and redistributing it to other words, such as “Tuesday”, zero probabilities can be avoided. In a smoothed trigram model, the extra probability is typically distributed according to a smoothed bigram model, etc. While the most commonly used smoothing techniques, Katz smoothing (Katz, 1987) and Jelinek-Mercer smoothing (Jelinek and Mercer, 1980) (sometimes called deleted interpolation) work fine, even better smoothing techniques exist. In particular, we have previously shown (Chen and Goodman, 1999) that versions of Kneser-Ney smoothing (Ney et al., 1994) outperform all other smoothing techniques. In the appendix, we give a proof partially explaining this optimality. In Kneser-Ney smoothing, the backoff distribution is modified: rather than a normal bigram distribution, a special distribution is used. Using Kneser-Ney smoothing instead of more traditional techniques is the first improvement we used.
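For illustration only, the following sketch interpolates trigram, bigram, and unigram maximum-likelihood estimates with fixed weights, in the spirit of Jelinek-Mercer smoothing. It is a simplified stand-in, not the Katz or Kneser-Ney smoothing used in the paper, and the weights would normally be learned on held-out data.

```python
from collections import Counter

class InterpolatedTrigramLM:
    """Jelinek-Mercer-style linear interpolation with fixed weights: a simplified
    stand-in for the smoothing methods discussed in the paper."""

    def __init__(self, tokens, lambdas=(0.7, 0.2, 0.1)):
        self.l3, self.l2, self.l1 = lambdas  # assumed weights; normally tuned on held-out data
        self.tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.uni = Counter(tokens)
        self.total = len(tokens)

    def prob(self, w, w1, w2):
        """P(w | w1 w2) as a weighted mix of trigram, bigram, and unigram MLEs."""
        p3 = self.tri[(w1, w2, w)] / self.bi[(w1, w2)] if self.bi[(w1, w2)] else 0.0
        p2 = self.bi[(w2, w)] / self.uni[w2] if self.uni[w2] else 0.0
        p1 = self.uni[w] / self.total
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1

lm = InterpolatedTrigramLM(
    "we will party on Monday and party on Wednesday but not Tuesday".split())
print(lm.prob("Tuesday", "party", "on"))  # small but nonzero: rescued by the unigram term
print(lm.prob("Monday", "party", "on"))   # larger: supported by the trigram itself
```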

The most obvious extension to trigram models is to simply move to higher-order n-grams, such as four-grams and five-grams. We will show that in fact, significant improvements can be gotten from moving to five-grams. Furthermore, in the past, we have shown that there is a significant interaction between smoothing and n-gram order (Chen and Goodman, 1999): higher-order n-grams work better with Kneser-Ney smoothing than with some other methods, especially Katz smoothing. We will also look at how much improvement can be gotten from higher order n-grams, examining up to 20-grams.

Another simple extension to n-gram models is skipping models (Rosenfeld, 1994; Huang et al., 1993; Ney et al., 1994), in which we condition on a different context than the previous two words. For instance, instead of computing [math]\displaystyle{ P (w_i \mid w_{i−2}w_{i−1}) }[/math], we could instead compute [math]\displaystyle{ P (w_i \mid w_{i−3}w_{i−2}) }[/math]. This latter model is probably not as good, but can be combined with the standard model to yield some improvements.
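As a hypothetical illustration, the sketch below estimates a skip distribution P(w_i | w_{i−3} w_{i−2}) alongside the standard trigram and interpolates the two with an assumed weight; the paper's skipping experiments are considerably more elaborate.

```python
from collections import Counter

def count_contexts(tokens, offsets):
    """Count (context, word) pairs where the context is the tokens at the given
    offsets before the predicted word, e.g. offsets=(2, 1) for a normal trigram."""
    joint, ctx = Counter(), Counter()
    for i in range(max(offsets), len(tokens)):
        context = tuple(tokens[i - o] for o in offsets)
        joint[context + (tokens[i],)] += 1
        ctx[context] += 1
    return joint, ctx

def mle(word, context, joint, ctx):
    return joint[context + (word,)] / ctx[context] if ctx[context] else 0.0

tokens = "the big dog ran and the big cat ran away".split()
tri_joint, tri_ctx = count_contexts(tokens, offsets=(2, 1))    # P(w_i | w_{i-2} w_{i-1})
skip_joint, skip_ctx = count_contexts(tokens, offsets=(3, 2))  # P(w_i | w_{i-3} w_{i-2})

lam = 0.6  # assumed interpolation weight
# The trigram context ("big", "bird") is unseen, but the skip context ("the", "big")
# still predicts "ran", so the combined model avoids a zero probability.
p = (lam * mle("ran", ("big", "bird"), tri_joint, tri_ctx)
     + (1 - lam) * mle("ran", ("the", "big"), skip_joint, skip_ctx))
print(p)  # 0.4
```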

Clustering (also called classing) models attempt to make use of the similarities between words. For instance, if we have seen occurrences of phrases like “party on Monday” and “party on Wednesday”, then we might imagine that the word “Tuesday”, being similar to both “Monday” and “Wednesday”, is also likely to follow the phrase “party on.” The majority of the previous research on word clustering has focused on how to get the best clusters. We have concentrated our research on the best way to use the clusters, and will report results showing some novel techniques that work a bit better than previous methods. We also show significant interactions between clustering and smoothing.
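One common way to use clusters, sketched hypothetically below, is to decompose P(w | context) into P(cluster(w) | context) × P(w | cluster(w)), so that "Tuesday" can inherit probability from other weekday words. The hand-assigned clusters and toy corpus are assumptions; the paper learns clusters automatically and compares several such decompositions.

```python
from collections import Counter

# Hand-assigned clusters for illustration; the paper learns clusters automatically.
cluster_of = {"Monday": "DAY", "Tuesday": "DAY", "Wednesday": "DAY", "party": "EVENT"}

tokens = "we will party on Monday we will party on Wednesday see you Tuesday".split()
clusters = [cluster_of.get(t, "OTHER") for t in tokens]

# MLE estimates of P(cluster | previous two words) and P(word | its cluster).
ctx_cluster = Counter(zip(tokens, tokens[1:], clusters[2:]))
ctx = Counter(zip(tokens, tokens[1:]))
word_in_cluster = Counter(zip(clusters, tokens))
cluster_size = Counter(clusters)

def p_clustered(word, w1, w2):
    c = cluster_of.get(word, "OTHER")
    p_c = ctx_cluster[(w1, w2, c)] / ctx[(w1, w2)] if ctx[(w1, w2)] else 0.0
    p_w = word_in_cluster[(c, word)] / cluster_size[c] if cluster_size[c] else 0.0
    return p_c * p_w

print(p_clustered("Tuesday", "party", "on"))  # > 0, although "party on Tuesday" is unseen
print(p_clustered("Monday", "party", "on"))
```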

Caching models (Kuhn, 1988; Kuhn and De Mori, 1990; Kuhn and De Mori, 1992) make use of the observation that if you use a word, you are likely to use it again. They tend to be easy to implement and to lead to relatively large perplexity improvements, but relatively small word-error rate improvements. We show that by using a trigram cache, we can get almost twice the improvement that a unigram cache provides.
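A hypothetical sketch of a unigram cache follows: the model keeps a window of recently seen words and interpolates their empirical frequency with a static model. The window size, cache weight, and static model here are assumptions; the trigram caches discussed in the paper additionally condition the cache on the preceding words.

```python
from collections import Counter, deque

class UnigramCache:
    """Keep a window of recently seen words and interpolate their empirical
    frequency with a static language model probability."""

    def __init__(self, static_prob, window=500, weight=0.1):
        self.static_prob = static_prob       # callable: static_prob(word, w1, w2)
        self.history = deque(maxlen=window)  # recent words from the current document
        self.counts = Counter()
        self.weight = weight                 # assumed cache weight; would be tuned

    def observe(self, word):
        if len(self.history) == self.history.maxlen:
            self.counts[self.history[0]] -= 1  # word about to fall out of the window
        self.history.append(word)
        self.counts[word] += 1

    def prob(self, word, w1, w2):
        p_cache = self.counts[word] / len(self.history) if self.history else 0.0
        return (1 - self.weight) * self.static_prob(word, w1, w2) + self.weight * p_cache

# Usage with any static model, here a flat distribution over a 10,000-word vocabulary.
cache_lm = UnigramCache(lambda w, w1, w2: 1.0 / 10000)
for w in "the acquisition of Acme by Beta".split():
    cache_lm.observe(w)
print(cache_lm.prob("Acme", "shares", "of"))  # boosted because "Acme" was just seen
```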

Sentence Mixture models (Iyer and Ostendorf, 1999; Iyer et al., 1994) make use of the observation that there are many different sentence types, and that making models for each type of sentence may be better than using one global model. Traditionally, only 4 to 8 types of sentences are used, but we show that improvements can be gotten by going to 64 mixtures, or perhaps more.
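The sketch below shows, under assumed toy models and mixture weights, how a sentence mixture model scores a sentence: each sentence-type model scores the whole sentence, and the results are mixed at the sentence level. The paper learns the sentence types and their parameters (typically with EM) rather than specifying them by hand.

```python
import math

def sentence_mixture_logprob(sentence, type_models, type_priors):
    """log P(sentence) = log sum_j prior_j * prod_i P_j(w_i | w_{i-2} w_{i-1})."""
    log_terms = []
    for model, prior in zip(type_models, type_priors):
        lp = math.log(prior)
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            lp += math.log(model(words[i], words[i - 2], words[i - 1]))
        log_terms.append(lp)
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Two toy "sentence types" with different word preferences (flat otherwise);
# real sentence-type models would be full smoothed trigram models.
news_model = lambda w, w1, w2: 0.02 if w in {"shares", "profit"} else 0.001
chat_model = lambda w, w1, w2: 0.02 if w in {"hey", "party"} else 0.001

print(sentence_mixture_logprob("hey lets party".split(),
                               [news_model, chat_model], [0.5, 0.5]))
```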

Footnotes

  1. Feb. 25. Stan will provide cake. You are all invited.

References


(Goodman, 2001) ⇒ Joshua T. Goodman. (2001). "A Bit of Progress in Language Modeling." doi:10.1006/csla.2001.0174