N-Gram Language Model

A N-Gram Language Model is a Language Model that estimates n-gram probabilities based on the Markov Assumption.

Context:
- It can be mathematically represented by the statistical model $\mathcal{M}(\mathcal{S}, \mathcal{P})$ where:
  - $\mathcal{S}$ is a set of all n-grams within a vocabulary $\mathcal{V}$.
  - $\mathcal{P}$ is the n-gram probability distribution defined as a conditional probability on context history.
- It can be a Maximum Likelihood Expectation (MLE)-based Language Model.
- It can be automatically learned by a N-Gram Language Model Training System.
- It can range from being an Unsmoothed N-Gram Model to being a Smoothed N-Gram Model.
Example(s):
Counter-Example(s):
- a Neural Network Language Model.
- an Exponential Language Model.
See: Unigram, Probability Distribution, American English, n-Gram, Bag of Words Model, Relative Likelihood, Natural Language Processing, Speech Recognition, Machine Translation, Part-of-Speech Tagging, Parsing, Optical Character Recognition.

References

2019a

(Wikipedia, 2019) ⇒ https://www.wikiwand.com/en/Language_model#/n-gram Retrieved:2019-12-22.
- In an n-gram model, the probability [math]\displaystyle{ P(w_1,\ldots,w_m) }[/math] of observing the sentence [math]\displaystyle{ w_1,\ldots,w_m }[/math] is approximated as
  [math]\displaystyle{ P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1})\approx \prod^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) }[/math]
  It is assumed that the probability of observing the i^th word w_i in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words (n^th order Markov property).
  The conditional probability can be calculated from n-gram model frequency counts:
  [math]\displaystyle{ P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})} }[/math]
  The terms bigram and trigram language models denote n-gram models with n = 2 and n = 3, respectively.^[1]
  Typically, the n-gram model probabilities are not derived directly from frequency counts, because models derived this way have severe problems when confronted with any n-grams that have not been explicitly seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good-Turing discounting or back-off models.

↑ Craig Trim, What is Language Modeling?, April 26th, 2013.

2019b

(Jurafsky & Martin, 2019) ⇒ Dan Jurafsky and James H. Martin (2019). "N-gram Language Models". In: Speech and Language Processing (3rd ed. draft)
- QUOTE: An n-gram is a sequence of $N$ n-gram words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or ”your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. We’ll see how to use n-gram models to estimate the probability of the last word of an n-gram given the previous words, and also to assign probabilities to entire sequences.
  
  ...
  
  Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a sequence is
  $P(w_n|w^{n-1}_1) \approx P(wn|w^{n-1}_{n-N+1} )$ (3.8)
  Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
  $P(w^n_1 ) \approx \displaystyle_{k=1}^n P(w_k |w_{k−1})$ (3.9)
  How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE.

2017

(Manjavacas et al.,2017) ⇒ Enrique Manjavacas, Jeroen De Gussem , Walter Daelemans, and Mike Kestemont (2017, September). "Assessing the Stylistic Properties of Neurally Generated Text in Authorship Attribution". In: Proceedings of the Workshop on Stylistic Variation. DOI:10.18653/v1/W17-4914.
- QUOTE: In short, a LM is a probabilistic model of linguistic sequences that, at each step in a sequence, assigns a probability distribution over the vocabulary conditioned on the prefix sequence. More formally, a LM is defined by Equation 1,
  $LM(w_t) = P(w_t |w_{t-n}, w_{t-(n-1)},\cdots, w_{t-1})$ (1)
  where $n$ refers to the scope of the model —i.e. the length of the prefix sequence taken into account to condition the output distribution at step $t$.
  (...) In the current study we compare two widely-used LM architectures – an ngram-based LM (NGLM) and a Recurrent Neural Network-based LM (RNNLM). An NGLM is basically a conditional probability table for Equation 1 that is estimated on the basis of the count data for ngrams of a given length $n$. Typically, NGLMs suffer from a data sparsity problem since for a large enough value of $n$ many possible conditioning prefixes will not be observed in the training data and the corresponding probability distribution will be missing. To alleviate the sparsity problem, two techniques — smoothing and backoff models — can be used that either reserve some probability mass and evenly redistribute it across unobserved ngrams (smoothing) or resort back to a lower-order model to provide an approximation to the conditional distribution of an unobserved ngram (backoff models).

2015

(Goldberg, 2015) ⇒ Yoav Goldberg. (2015). “The Unreasonable Effectiveness of Character-level Language Models (and Why RNNs Are Still Cool).” In: Blog Post.

1992

(Brown et al, 1992) ⇒ Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. (1992). “Class-based N-gram Models of Natural Language.” In: Computational Linguistics, 18(4).

[1] Craig Trim, What is Language Modeling?, April 26th, 2013.

[1]