Language Model (LM)


A Language Model (LM) is a probabilistic model that assigns probabilities to sequences of language units (such as words or characters).






  • (Wikipedia, 2018) ⇒ Retrieved:2018-4-8.
    • A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability [math]\displaystyle{ P(w_1,\ldots,w_m) }[/math] to the whole sequence. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications, especially ones that generate text as an output. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.

      In speech recognition, the computer tries to match sounds with word sequences. The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "recognize speech" and "wreck a nice beach" are pronounced almost the same but mean very different things. These ambiguities are easier to resolve when evidence from the language model is incorporated with the pronunciation model and the acoustic model.

      Language models are used in information retrieval in the query likelihood model. Here a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query Q in the document's language model [math]\displaystyle{ P(Q\mid M_d) }[/math]. Commonly, the unigram language model, otherwise known as the bag of words model, is used for this purpose.

      Data sparsity is a major problem in building language models. Most possible word sequences will not be observed in training. One solution is to make the assumption that the probability of a word only depends on the previous n words. This is known as an n-gram model or unigram model when n = 1.
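
The n-gram idea in the passage above can be illustrated with a minimal sketch. It assumes a bigram model, which under the common convention (an n-gram model conditions on the previous n - 1 words) conditions each word on a single preceding word. The toy corpus, the boundary markers `<s>` and `</s>`, the add-alpha smoothing, and all function names are illustrative assumptions, not part of the quoted source.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        # Assumed boundary markers so sentence starts and ends are modeled too.
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, cur, context_counts, bigram_counts, vocab, alpha=1.0):
    """Add-alpha smoothed estimate of P(cur | prev).

    Smoothing is one simple answer to data sparsity: unseen bigrams
    receive a small non-zero probability instead of zero.
    """
    numerator = bigram_counts[(prev, cur)] + alpha
    denominator = context_counts[prev] + alpha * len(vocab)
    return numerator / denominator

def sentence_prob(tokens, model):
    """Probability of a whole sentence as a product of bigram probabilities."""
    context_counts, bigram_counts, vocab = model
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, cur, context_counts, bigram_counts, vocab)
    return prob

# Toy corpus (an assumption for illustration only).
corpus = [["the", "dog", "barks"], ["the", "cat", "laughs"], ["the", "dog", "laughs"]]
model = train_bigram(corpus)
seen = sentence_prob(["the", "dog", "barks"], model)
unseen = sentence_prob(["barks", "the", "dog"], model)
print(seen, unseen)
```

In this sketch the word order observed in training receives a higher probability than the scrambled order, while the scrambled order still receives a non-zero probability because of smoothing.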





  • (Collins, 2013) ⇒ Michael Collins (2013). "Chapter 1: Language Modeling". In: Course notes for NLP, Columbia University.
    • QUOTE: Assume that we have a corpus, which is a set of sentences in some language. For example, we might have several years of text from the New York Times, or we might have a very large amount of text from the web. Given this corpus, we'd like to estimate the parameters of a language model. A language model is defined as follows. First, we will define [math]\displaystyle{ \mathcal{V} }[/math] to be the set of all words in the language. For example, when building a language model for English we might have

      [math]\displaystyle{ \mathcal{V} = \{\text{the, dog, laughs, saw, barks, cat, . . .}\} }[/math]

      In practice [math]\displaystyle{ \mathcal{V} }[/math] can be quite large: it might contain several thousands, or tens of thousands, of words. We assume that [math]\displaystyle{ \mathcal{V} }[/math] is a finite set. A sentence in the language is a sequence of words

      [math]\displaystyle{ x_1 x_2 \cdots x_n }[/math]

      where the integer [math]\displaystyle{ n }[/math] is such that [math]\displaystyle{ n \geq 1 }[/math], we have [math]\displaystyle{ x_i \in \mathcal{V} }[/math] for [math]\displaystyle{ i \in \{1 \cdots (n - 1)\} }[/math], and we assume that [math]\displaystyle{ x_n }[/math] is a special symbol (...)

      We will define [math]\displaystyle{ \mathcal{V}^{\dagger} }[/math] to be the set of all sentences with the vocabulary [math]\displaystyle{ \mathcal{V} }[/math]: this is an infinite set, because sentences can be of any length.

      We then give the following definition:

      Definition 1 (Language Model) A language model consists of a finite set [math]\displaystyle{ \mathcal{V} }[/math], and a function [math]\displaystyle{ p(x_1, x_2, \cdots, x_n) }[/math] such that:

      1. For any [math]\displaystyle{ \langle x_1 \cdots x_n \rangle \in \mathcal{V}^{\dagger},\; p(x_1, x_2, \cdots, x_n) \geq 0 }[/math]
      2. In addition, [math]\displaystyle{ \sum_{\langle x_1 \cdots x_n \rangle \in \mathcal{V}^{\dagger}} p(x_1, x_2, \cdots, x_n) = 1 }[/math]

      Hence [math]\displaystyle{ p(x_1, x_2, \cdots, x_n) }[/math] is a probability distribution over the sentences in [math]\displaystyle{ \mathcal{V}^{\dagger} }[/math].
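
The two conditions of the definition above can be checked numerically on a toy model. The sketch below assumes a very simple language model in which each word, including the special end symbol (written `STOP` here), is drawn independently from a fixed distribution `q`. The vocabulary, the probabilities, and the function names are hypothetical choices for illustration and do not come from the quoted notes.

```python
from itertools import product

# Hypothetical per-symbol probabilities; "STOP" plays the role of the
# special symbol that ends every sentence. The values sum to 1.
q = {"the": 0.3, "dog": 0.2, "barks": 0.1, "STOP": 0.4}
words = [w for w in q if w != "STOP"]

def p(sentence):
    """Probability of a sentence given as a tuple of symbols ending in STOP."""
    prob = 1.0
    for w in sentence:
        prob *= q[w]
    return prob

def total_mass(max_len):
    """Sum p over every sentence of length 1..max_len (STOP counts toward length)."""
    total = 0.0
    for n in range(1, max_len + 1):
        # The first n - 1 positions range over ordinary words; the last is STOP.
        for prefix in product(words, repeat=n - 1):
            total += p(prefix + ("STOP",))
    return total

for max_len in (1, 5, 10):
    print(max_len, total_mass(max_len))
```

Every value of `p` is non-negative, and the partial sums approach 1 as `max_len` grows. For this particular toy model the mass still missing after length `max_len` is the probability of not having stopped yet, which is 0.6 raised to the power `max_len`.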

