Kullback-Leibler (KL) Divergence Measure

A Kullback-Leibler (KL) Divergence Measure is a non-symmetric probability distribution similarity measure that quantifies how one probability distribution diverges from a second, reference probability distribution.

AKA: KL Information Gain, [math]\displaystyle{ I }[/math], Directed Divergence, Discrimination Information, Relative Entropy.
Context:
- It can be used where [math]\displaystyle{ P }[/math] represents the "true" distribution of observations or a precisely calculated theoretical distribution.
- It can be used where [math]\displaystyle{ Q }[/math] represents an approximation, model, or theory of [math]\displaystyle{ P }[/math].
- It can measure the expected number of extra bits required to encode samples from [math]\displaystyle{ P }[/math] when using a code optimized for [math]\displaystyle{ Q }[/math], rather than using a code based on [math]\displaystyle{ P }[/math].
- It can be expressed mathematically as:
  - For Discrete Distributions: [math]\displaystyle{ D_{KL}(P \parallel Q) = \sum_i P(i) \log \left( \frac{P(i)}{Q(i)} \right) }[/math].
  - For Continuous Distributions: [math]\displaystyle{ D_{KL}(P \parallel Q) = \int_{-\infty}^\infty P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx }[/math].
- It can be used in information theory, statistics, and machine learning for tasks like variational inference and probability distribution analysis.
- It can be asymmetric, meaning that [math]\displaystyle{ D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P) }[/math].
- It can relate to the concept of information gain in decision trees.
- It can be used to derive other measures like Mutual Information and Jensen-Shannon Divergence.
- It can be referenced by a Kullback-Leibler Divergence Loss Function.
- ...
Example(s):
- The Mutual Information Metric, which measures the mutual dependence between two variables and is derived from the KL divergence.
- The Information Gain used in decision trees, based on the difference in entropy before and after a split.
- Calculating the KL divergence between two Gaussian distributions in machine learning models.
- ...
Counter-Example(s):
- A Jensen-Shannon Divergence, which is a symmetrized and smoothed version of the KL divergence.
- A Bhattacharyya Distance Function, used to measure the similarity between two probability distributions.
- A Hellinger Distance Function, another measure of the similarity between probability distributions.
- ...
See: Jensen-Shannon Divergence, Information Gain Impurity Function, Akaike Information Criterion (AIC) Statistic, Probability Distribution Analysis, Entropy, Cross-Entropy.

References

2023

chat
- Q: What are the similarities and differences between Kullback-Leibler (KL) Divergence and Jensen-Shannon Divergence?
- A: Kullback-Leibler (KL) Divergence and Jensen-Shannon Divergence are both measures of the difference between two probability distributions.
  The similarity between them is that they both compare two probability distributions and they both are symmetric.
  The main difference between them is that KL divergence is not a true metric, meaning it doesn't obey the triangle inequality, whereas the Jensen-Shannon divergence is a true metric. Additionally, the KL divergence is defined only when the distributions have non-zero probabilities at the same points, whereas the Jensen-Shannon divergence is defined for any two probability distributions. The KL divergence is also non-negative, meaning it always returns a positive or zero value, whereas the Jensen-Shannon divergence can return negative values.

2011

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- In probability theory and information theory, the Kullback–Leibler divergence^[1]^[2]^[3] (also information divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between two probability distributions P and Q. KL measures the expected number of extra bits required to code samples from P when using a code based on Q, rather than using a code based on P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory, model, description, or approximation of P. Although it is often intuited as a metric or distance, the KL divergence is not a true metric — for example, it is not symmetric: the KL from P to Q is generally not the same as the KL from Q to P. KL divergence is a special case of a broader class of divergences called f-divergences. Originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions, it is not the same as a divergence in calculus. However, the KL divergence can be derived from the Bregman divergence.
  For probability distributions P and Q of a discrete random variable their K–L divergence is defined to be :[math]\displaystyle{ D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \! }[/math] In words, it is the average of the logarithmic difference between the probabilities P and Q, where the average is taken using the probabilities P. The K-L divergence is only defined if P and Q both sum to 1 and if [math]\displaystyle{ Q(i)>0 }[/math] for any i such that [math]\displaystyle{ P(i)>0 }[/math]. If the quantity [math]\displaystyle{ 0 \log 0 }[/math] appears in the formula, it is interpreted as zero. For distributions P and Q of a continuous random variable, KL-divergence is defined to be the integral:^[4] :[math]\displaystyle{ D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty p(x) \log \frac{p(x)}{q(x)} \, {\rm d}x, \! }[/math] where p and q denote the densities of P and Q.
  More generally, if P and Q are probability measures over a set X, and Q is absolutely continuous with respect to P, then the Kullback–Leibler divergence from P to Q is defined as :[math]\displaystyle{ D_{\mathrm{KL}}(P\|Q) = -\int_X \log \frac{{\rm d}Q}{{\rm d}P} \,{\rm d}P, \! }[/math] where [math]\displaystyle{ \frac{{\rm d}Q}{{\rm d}P} }[/math] is the Radon–Nikodym derivative of Q with respect to P, and provided the expression on the right-hand side exists.

↑ Kullback, S.; Leibler, R.A. (1951). "On Information and Sufficiency". Annals of Mathematical Statistics 22 (1): 79–86. doi:10.1214/aoms/1177729694. MR 39968.
↑ S. Kullback (1959) Information theory and statistics (John Wiley and Sons, NY).
↑ Kullback, S.; Burnham, K. P.; Laubscher, N. F.; Dallal, G. E.; Wilkinson, L.; Morrison, D. F.; Loyer, M. W.; Eisenberg, B. et al. (1987). "Letter to the Editor: The Kullback–Leibler distance". The American Statistician 41 (4): 340–341. JSTOR 2684769.
↑ C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.

2008

(Upton & Cook, 2008) ⇒ Graham Upton, and Ian Cook. (2008). “A Dictionary of Statistics, 2nd edition revised." Oxford University Press. ISBN:0199541450
- QUOTE: A measure of the difference between two probability density functions taking non-zero value s on the same interval. Usually denoted by L it is related to entropy and is defined as the expectation of the logarithm of the ratio of likelihoods.

2007

(Hershey & Olsen, 2007) ⇒ J.R. Hershey, and P.A. Olsen. (2007). “Approximating the Kullback Leibler divergence between Gaussian mixture models.” In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. doi:10.1109/ICASSP.2007.366913

2000

(Nigam et al., 2000) ⇒ Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. (2000). “Text Classification from Labeled and Unlabeled Documents Using EM.” In: Machine Learning, 39(2/3). doi:10.1023/A:1007692713085
- QUOTE: 6) The weighted log likelihood ratio used to rank the words in Figure 3 is: [math]\displaystyle{ \Pr(w_t \mid c_j; \hat{\theta}) \log \left( \frac{\Pr(w_t \mid \lnot c_j; \hat{\theta})} {\Pr(w_t \mid c_j; \hat{\theta})} \right) }[/math] which can be understood in information-theoretic terms as word [math]\displaystyle{ w_t }[/math]'s contribution to the average inefficiency of encoding words from class [math]\displaystyle{ c_j }[/math] using a code that is optimal for the distribution of words in [math]\displaystyle{ \lnot c_j }[/math] . The sum of this quantity over all words is the Kullback-Leibler divergence between the distribution of words in [math]\displaystyle{ c_j }[/math] and the distribution of words in [math]\displaystyle{ \lnot c_j }[/math] (Cover & Thomas, 1991).

1989

(Joe, 1989) ⇒ Harry Joe. (1989). “Relative Entropy Measures of Multivariate Dependence.” In: Journal of the American Statistical Association, 84(405). doi:10.1080/01621459.1989.10478751
- QUOTE: The purpose of this article is to discuss measures of multivariate dependence and measures of conditional dependence based on relative entropies. These measures are conceptually very general, as they can be used for a set of variables that can be a mixture of continuous, ordinal-categorical, and nominal-categorical variables. For continuous or ordinal-categorical variables, a certain transformation of relative entropy to the interval [0,1] leads to generalizations of the correlation, multiple-correlation, and partial-correlation coefficients. If all variables are nominal categorical, the relative entropies are standardized to take a maximum of 1 and then transformed so that in the bivariate case, there is a relative reduction in variability interpretation like that for the correlation coefficient. The relative entropy measures of dependence are compared with commonly used bivariate measures of association such as Kendall's τb and Goodman and Kruskal's λ and with measures of dependence based on Pearson's ϕ2 distance.

1959

(Kullback, 1959) ⇒ Solomon Kullback. (1959). “Information Theory and Statistics." Wiley. ISBN:0486696847
- QUOTE: Information theory, as we shall be concerned with it, is a branch of mathematical theory of probability and statistics.

1951

(Kullback & Leibner, 1951).

[1] Kullback, S.; Leibler, R.A. (1951). "On Information and Sufficiency". Annals of Mathematical Statistics 22 (1): 79–86. doi:10.1214/aoms/1177729694. MR 39968.

[2] S. Kullback (1959) Information theory and statistics (John Wiley and Sons, NY).

[3] Kullback, S.; Burnham, K. P.; Laubscher, N. F.; Dallal, G. E.; Wilkinson, L.; Morrison, D. F.; Loyer, M. W.; Eisenberg, B. et al. (1987). "Letter to the Editor: The Kullback–Leibler distance". The American Statistician 41 (4): 340–341. JSTOR 2684769.

[4] C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.

[1]

[2]

[3]

[4]