Perplexity Performance (PP) Measure
A Perplexity Performance (PP) Measure is an intrinsic performance measure that quantifies how well a probability model predicts a sample.
- Context:
- It can produce a Perplexity Score as its output.
- It can be an input to a Perplexity Measuring Task.
- It can (often) be used for Language Model Evaluation.
- ...
- Example(s):
- [math]\displaystyle{ 2^{H} = 2^{-\sum_x p(x) \log_2 p(x)} }[/math] (see the sketch after this list).
- a Language Model Perplexity Measure.
- ...
- Counter-Example(s):
- an Extrinsic Performance Measure, such as word error rate (WER) for ASR or BLEU for machine translation (MT).
- See: Entropy Measure, Empirical Analysis.
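The following is a minimal sketch of the measure, assuming a model given as a hypothetical `model_probs` dictionary mapping symbols to probabilities; it illustrates the definition above and is not code from any cited source.
<pre>
import math

def perplexity(model_probs, sample):
    # Perplexity of `sample` under a model that assigns probability
    # model_probs[w] to each symbol w, treated as independent (unigram).
    # Computed in log space, 2**(-(1/N) * sum(log2 p(w))), to avoid
    # floating-point underflow on long samples.
    log_prob = sum(math.log2(model_probs[w]) for w in sample)
    return 2.0 ** (-log_prob / len(sample))

# A fair-coin model is "perplexed" between exactly two equally
# likely choices at every position:
fair_coin = {"H": 0.5, "T": 0.5}
print(perplexity(fair_coin, ["H", "T", "T", "H"]))  # 2.0
</pre>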
References
2020
- https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
- QUOTE: ... Intuitively, perplexity can be understood as a measure of uncertainty. The perplexity of a language model can be seen as the level of perplexity when predicting the following symbol. Consider a language model with an entropy of three bits, in which each bit encodes two possible outcomes of equal probability. This means that when predicting the next symbol, that language model has to choose among [math]\displaystyle{ 2^3 = 8 }[/math] possible options. Thus, we can argue that this language model has a perplexity of 8.
Mathematically, the perplexity of a language model is defined as: ...
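The formula itself is elided in the quote above; for reference, the standard definition it alludes to (consistent with the Stanford formulation quoted below, though not necessarily the blog post's exact notation) is [math]\displaystyle{ \text{PPL}(W) = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i \mid w_1, \ldots, w_{i-1})} = p(w_1, \ldots, w_N)^{-\frac{1}{N}} }[/math], i.e. two raised to the average per-word cross-entropy in bits.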
2019
- https://openreview.net/forum?id=HJePno0cYm&noteId=Hkla0-dp27
- QUOTE: This paper proposes a variant of transformer to train language model, ... Extensive experiments in terms of perplexity results are reported, specially on WikiText-103 corpus, significant perplexity reduction has been achieved.
Perplexity is not a gold standard for language model, the authors are encouraged to report experimental results on real world applications such as word error rate reduction on ASR or BLEU score improvement on machine translation.
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Perplexity Retrieved:2018-3-7.
- In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.
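A quick worked example of that definition (the die is ours, not Wikipedia's): a fair six-sided die has entropy [math]\displaystyle{ \log_2 6 \approx 2.585 }[/math] bits, so its perplexity is [math]\displaystyle{ 2^{\log_2 6} = 6 }[/math]; in general, a uniform distribution over [math]\displaystyle{ k }[/math] outcomes has perplexity exactly [math]\displaystyle{ k }[/math], and a model that predicts the rolls with perplexity below 6 is capturing real structure.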
2017
- https://web.stanford.edu/class/cs124/lec/languagemodeling2017.pdf
- QUOTE: The best language model is one that best predicts an unseen test set. Gives the highest P(sentence).
- Perplexity is the inverse probability of the test set, normalized by the number of words:
[math]\displaystyle{ \text{PP}(\bf{w}) = \it{p}(w_1, w_2, ..., w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{\it{p}(w_1, w_2, ..., w_N)}} }[/math]
- Chain rule: [math]\displaystyle{ \text{PP}(\bf{w}) = \sqrt[N]{\Pi^{N}_{i=1} \frac{1}{\it{p}(w_i \mid w_1, w_2, ..., w_{i-1})}} }[/math]
- For bigrams: [math]\displaystyle{ \text{PP}(\bf{w}) = \sqrt[N]{\Pi^{N}_{i=1} \frac{1}{\it{p}(w_i \mid w_{i-1})}} }[/math]
- Minimizing perplexity is the same as maximizing probability
- Lower perplexity = better model
Training 38 million words, test 1.5 million words, WSJ: Unigram = 962 ; Bigram = 170 ; Trigram = 109.
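As a sketch of the bigram form above: the code below estimates an add-one-smoothed bigram model from toy data (the smoothing choice and the tiny corpus are our own assumptions, not the setup behind the quoted WSJ numbers) and evaluates the product in log space, since multiplying N reciprocals directly underflows on long test sets.
<pre>
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    # Add-one-smoothed bigram model estimated from the training tokens;
    # a toy illustration of PP(w) = (prod_i 1/p(w_i | w_{i-1}))^(1/N).
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    unigrams = Counter(train_tokens)
    vocab = len(set(train_tokens) | set(test_tokens))

    log_prob, n = 0.0, 0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log2(p)
        n += 1
    return 2.0 ** (-log_prob / n)

train = "the cat sat on the mat".split()
test = "the cat sat on the mat".split()
print(bigram_perplexity(train, test))  # low: every test bigram was seen in training
</pre>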
2016
- https://www.slideshare.net/alopezfoo/edinburgh-mt-lecture-11-neural-language-models
- QUOTE: Given: [math]\displaystyle{ \bf{w}, \it{p}_{\text{LM}} }[/math]; [math]\displaystyle{ \text{PPL} = 2^{-\frac{1}{|\bf{w}|} \log_2 \it{p}_{\text{LM}}(\bf{w})} }[/math]; [math]\displaystyle{ 0 \le \text{PPL} \le \infty }[/math]
- Perplexity is a generalization of the notion of branching factor: How many choices do I have at each position?
- State-of-the-art English LMs have a PPL of ~100 word choices per position
- A uniform LM has a perplexity of [math]\displaystyle{ |\Sigma| }[/math]
- Humans do much better … and bad models can do even worse than uniform!
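A quick numerical check of the uniform-LM bullet above, using the [math]\displaystyle{ 2^{-\frac{1}{|\bf{w}|} \log_2 \it{p}_{\text{LM}}(\bf{w})} }[/math] definition just quoted (the alphabet and test string are invented for illustration):
<pre>
import math

# Under a uniform LM over an alphabet of size k, every next symbol has
# probability 1/k, so the average negative log2-probability is log2 k
# and the perplexity is exactly k = |Sigma|.
def uniform_lm_perplexity(alphabet, sequence):
    p = 1.0 / len(alphabet)  # uniform next-symbol probability
    avg_neg_log2 = -sum(math.log2(p) for _ in sequence) / len(sequence)
    return 2.0 ** avg_neg_log2

alphabet = list("abcde")  # |Sigma| = 5
print(uniform_lm_perplexity(alphabet, "abacabad"))  # 5.0
</pre>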
2009
- (Jurafsky & Martin, 2009) ⇒ Daniel Jurafsky, and James H. Martin. (2009). “Speech and Language Processing, 2nd edition." Pearson Education. ISBN:0131873210
- Perplexity is the most common intrinsic evaluation metric for N-gram language models.
1977
- (Jelinek et al., 1977) ⇒ Fred Jelinek, Robert L. Mercer, Lalit R. Bahl, and James K. Baker. (1977). “Perplexity — a Measure of the Difficulty of Speech Recognition Tasks.” In: The Journal of the Acoustical Society of America, 62(S1), S63.