Statistical Language Model (SLM)

Context:
- It can (typically) be a Maximum-Likelihood LM.
- It can be mathematically represented by the statistical model $\mathcal{M}(\mathcal{S}, \mathcal{P})$ where:
  - $\mathcal{S}$ is the set of all possible sequences of language model units (e.g. characters, words, strings) drawn from a vocabulary $\mathcal{V}$.
  - $\mathcal{P}$ is the probability distribution on $\mathcal{S}$, i.e. $\mathcal{P}=\{P(w_1,w_2,\ldots,w_n) : w_1 \; w_2 \; \ldots \; w_n \in \mathcal{S}\}$.
- ...
Example(s):
- a Bigram LM.
- a Trigram LM.
- ...
See: Neural LM.

References

chat
- A statistical language model (SLM) is a type of language model that estimates the probability distribution of natural language sequences using statistical methods. It aims to predict the likelihood of a word or a sequence of words occurring in a given context. SLMs are widely used in various natural language processing (NLP) tasks, such as speech recognition, machine translation, and information retrieval.
  SLMs typically use n-grams, which are contiguous sequences of n items (words or characters) from a given text. The most common SLMs are unigram, bigram, and trigram models. An n-gram model assumes that the probability of a word depends only on the n-1 preceding words (none, in the case of a unigram model). The models are trained on large text corpora and estimate the probabilities of words or word sequences from their frequencies in the training data.
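
The count-based estimation described above can be illustrated with a minimal sketch of a maximum-likelihood bigram model. Everything here is an assumption made for illustration: the function names `train_bigram_mle` and `sequence_probability`, the `<s>` and `</s>` boundary markers, and the three-sentence toy corpus do not come from the entry itself. No smoothing is applied, so any bigram absent from the training data gets probability zero.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Return a function prob(word, prev) giving the MLE of P(word | prev).

    `sentences` is an iterable of token lists. The boundary markers
    "<s>" and "</s>" are an illustrative convention, not part of the entry.
    """
    history_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        # Every token except the last one serves as a history for the next token.
        history_counts.update(padded[:-1])
        bigram_counts.update(zip(padded[:-1], padded[1:]))

    def prob(word, prev):
        # Relative frequency: count(prev, word) / count(prev).
        if history_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


def sequence_probability(prob, tokens):
    """Multiply the conditional bigram probabilities along the sequence."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    p = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        p *= prob(word, prev)
    return p


# Hypothetical toy corpus, used only to exercise the functions.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
prob = train_bigram_mle(corpus)
p_cat_given_the = prob("cat", "the")
p_sentence = sequence_probability(prob, ["the", "cat", "sat"])
p_unseen = sequence_probability(prob, ["the", "bird", "sat"])
```

On this toy corpus, "cat" follows "the" in two of three sentences, so `p_cat_given_the` is 2/3, while `p_unseen` is zero because the bigram ("the", "bird") never occurs. Practical SLMs usually add smoothing or back-off to avoid such zero estimates.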