# Kullback-Leibler (KL) Divergence Metric

A Kullback-Leibler (KL) Divergence Metric is a non-symmetric similarity measure of the difference between two probability distributions [math]P[/math] and [math]Q[/math].

**AKA:**Kullback-Leibler Information Gain, [math]I[/math], Directed Divergence, Discrimination Information.**Context**- It can measure the expected number of extra bits required to code samples from [math]P[/math] when using a code based on [math]Q[/math], rather than using a code based on [math]P[/math].
- It can (typically) be used such that [math]P[/math] represents the "true" distribution of observations or a precisely calculated theoretical distribution.
- It can (typically) be used such that the measure [math]Q[/math] represents a theory, model, description, or approximation of [math]P[/math].
- It can be expressed as the information in favor of f is :[math]I = \int_a^b In\bigg \{ \frac{f(x)}{g(x)}\bigg \} f(x)dx[/math]

**Example(s):**- Mutual Information Metric.
- Decision Tree Information Gain Metric (based on the entropy measure of parent node and the average entropy measure of the children nodes).

**Counter-Example(s):**

**See:** Jensen-Shannon Divergence, Information Gain Impurity Function, AIC Statistic.

## References

### 2011

- http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- In probability theory and information theory, the
**Kullback–Leibler divergence**^{[1]}^{[2]}^{[3]}(also**information divergence**,**information gain**, relative entropy, or**KLIC**) is a non-symmetric measure of the difference between two probability distributions*P*and*Q*. KL measures the expected number of extra bits required to code samples from*P*when using a code based on*Q*, rather than using a code based on*P*. Typically*P*represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure*Q*typically represents a theory, model, description, or approximation of*P*. Although it is often intuited as a metric or distance, the KL divergence is not a true metric — for example, it is not symmetric: the KL from*P*to*Q*is generally not the same as the KL from*Q*to*P*. KL divergence is a special case of a broader class of divergences called*f*-divergences. Originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions, it is not the same as a divergence in calculus. However, the KL divergence can be derived from the Bregman divergence.For probability distributions

*P*and*Q*of a discrete random variable their K–L divergence is defined to be :[math]D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \![/math] In words, it is the average of the logarithmic difference between the probabilities*P*and*Q*, where the average is taken using the probabilities*P*. The K-L divergence is only defined if*P*and*Q*both sum to 1 and if [math]Q(i)\gt 0[/math] for any*i*such that [math]P(i)\gt 0[/math]. If the quantity [math]0 \log 0[/math] appears in the formula, it is interpreted as zero. For distributions*P*and*Q*of a continuous random variable, KL-divergence is defined to be the integral:^{[4]}:[math]D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty p(x) \log \frac{p(x)}{q(x)} \, {\rm d}x, \![/math] where*p*and*q*denote the densities of*P*and*Q*.More generally, if

*P*and*Q*are probability measures over a set*X*, and*Q*is absolutely continuous with respect to*P*, then the Kullback–Leibler divergence from*P*to*Q*is defined as :[math] D_{\mathrm{KL}}(P\|Q) = -\int_X \log \frac{{\rm d}Q}{{\rm d}P} \,{\rm d}P, \![/math] where [math]\frac{{\rm d}Q}{{\rm d}P} [/math] is the Radon–Nikodym derivative of*Q*with respect to*P,*and provided the expression on the right-hand side exists.

- In probability theory and information theory, the

- ↑ Kullback, S.; Leibler, R.A. (1951). "On Information and Sufficiency".
*Annals of Mathematical Statistics***22**(1): 79–86. doi:10.1214/aoms/1177729694. MR39968. - ↑ S. Kullback (1959)
*Information theory and statistics*(John Wiley and Sons, NY). - ↑ Kullback, S.; Burnham, K. P.; Laubscher, N. F.; Dallal, G. E.; Wilkinson, L.; Morrison, D. F.; Loyer, M. W.; Eisenberg, B. et al. (1987). "Letter to the Editor: The Kullback–Leibler distance".
*The American Statistician***41**(4): 340–341. JSTOR 2684769. - ↑ C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.

### 2008

- (Upton & Cook, 2008) ⇒ Graham Upton, and Ian Cook. (2008). “A Dictionary of Statistics, 2nd edition revised." Oxford University Press. ISBN:0199541450
- QUOTE: A measure of the difference between two probability density functions taking non-zero value s on the same interval. Usually denoted by L it is related to entropy and is defined as the expectation of the logarithm of the ratio of likelihoods.

### 2007

- (Hershey & Olsen, 2007) ⇒ J.R. Hershey, and P.A. Olsen. (2007). “Approximating the Kullback Leibler divergence between Gaussian mixture models.” In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. doi:10.1109/ICASSP.2007.366913

### 2000

- (Nigam et al., 2000) ⇒ Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. (2000). “Text Classification from Labeled and Unlabeled Documents Using EM.” In: Machine Learning, 39(2/3). doi:10.1023/A:1007692713085
- QUOTE: 6) The weighted log likelihood ratio used to rank the words in Figure 3 is: [math]\Pr(w_t \mid c_j; \hat{\theta}) \log \left( \frac{\Pr(w_t \mid \lnot c_j; \hat{\theta})} {\Pr(w_t \mid c_j; \hat{\theta})} \right)[/math] which can be understood in information-theoretic terms as word [math]w_t[/math]'s contribution to the average inefficiency of encoding words from class [math]c_j[/math] using a code that is optimal for the distribution of words in [math]\lnot c_j[/math] . The sum of this quantity over all words is the Kullback-Leibler divergence between the distribution of words in [math]c_j[/math] and the distribution of words in [math]\lnot c_j[/math] (Cover & Thomas, 1991).

### 1989

- (Joe, 1989) ⇒ Harry Joe. (1989). “Relative Entropy Measures of Multivariate Dependence.” In: Journal of the American Statistical Association, 84(405). doi:10.1080/01621459.1989.10478751
- QUOTE: The purpose of this article is to discuss measures of multivariate dependence and measures of conditional dependence based on relative entropies. These measures are conceptually very general, as they can be used for a set of variables that can be a mixture of continuous, ordinal-categorical, and nominal-categorical variables. For continuous or ordinal-categorical variables, a certain transformation of relative entropy to the interval [0,1] leads to generalizations of the correlation, multiple-correlation, and partial-correlation coefficients. If all variables are nominal categorical, the relative entropies are standardized to take a maximum of 1 and then transformed so that in the bivariate case, there is a relative reduction in variability interpretation like that for the correlation coefficient. The relative entropy measures of dependence are compared with commonly used bivariate measures of association such as Kendall's τb and Goodman and Kruskal's λ and with measures of dependence based on Pearson's ϕ2 distance.

### 1959

- (Kullback, 1959) ⇒ Solomon Kullback. (1959). “Information Theory and Statistics." Wiley. ISBN:0486696847
- QUOTE: Information theory, as we shall be concerned with it, is a branch of mathematical theory of probability and statistics.