Maximum Likelihood-based Language Model (LM) Training Algorithm

Context:
- It can be implemented by an MLE-based Language Model Training System.
- It can range from being a Character-Level MLE-based LM Algorithm to being a Word-Level MLE-based LM Algorithm.
Example(s):
- the one described in (Goldberg, 2015).
See: Neural-based Language Modeling Algorithm.

References

chat
- Both statistical language models (SLMs) and large language models (LLMs) can be trained by maximum likelihood estimation.
  In SLMs, maximum likelihood estimation sets each n-gram probability to its relative frequency in the training corpus. For example, in a bigram model, the probability of a word given the previous word is estimated as the count of the bigram divided by the count of the preceding word: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
  LLMs, which are deep neural network models, are also trained by maximum likelihood estimation. Training maximizes the log-likelihood of the observed training data under the model's parameters, which is equivalent to minimizing the cross-entropy loss between the model's predicted next-token distributions and the tokens that actually occur. This process adjusts the model's weights so that the model better predicts the next word or token given the preceding context.
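
The two ideas above (count-based bigram estimation, and log-likelihood as a cross-entropy loss) can be illustrated with a minimal sketch. The function names `train_bigram_mle` and `average_nll` and the toy corpus are hypothetical, chosen only for illustration. The sketch assumes whitespace tokenization and no smoothing, and it counts a word as a context only where it has a successor, so that the estimates for each context sum to one.

```python
from collections import Counter
import math


def train_bigram_mle(tokens):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it acts as a context (has a successor).
    context_counts = Counter(tokens[:-1])
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


def average_nll(probs, tokens):
    """Average negative log-likelihood per bigram, in nats.

    This is the cross-entropy between the empirical next-word
    distribution of `tokens` and the model `probs`. It raises KeyError
    for unseen bigrams, because this sketch applies no smoothing.
    """
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(probs[pair]) for pair in pairs) / len(pairs)


corpus = "the cat sat on the mat".split()  # toy corpus (assumption)
probs = train_bigram_mle(corpus)

print(probs[("the", "cat")])       # relative frequency of "cat" after "the"
print(average_nll(probs, corpus))  # training-set cross-entropy
```

For a count-based model, the relative-frequency estimates are the parameters that minimize this average negative log-likelihood on the training corpus. A neural language model has no closed-form solution, so it approaches the same objective by iteratively adjusting its weights.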