# Kullback-Leibler (KL) Divergence Metric

(Redirected from Information Gain Measure)

A Kullback-Leibler (KL) Divergence Metric is a non-symmetric similarity measure of the difference between two probability distributions $P$ and $Q$.

## References

### 2011

• http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
• In probability theory and information theory, the Kullback–Leibler divergence[1][2][3] (also information divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between two probability distributions P and Q. KL measures the expected number of extra bits required to code samples from P when using a code based on Q, rather than using a code based on P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory, model, description, or approximation of P. Although it is often intuited as a metric or distance, the KL divergence is not a true metric — for example, it is not symmetric: the KL from P to Q is generally not the same as the KL from Q to P. KL divergence is a special case of a broader class of divergences called f-divergences. Originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions, it is not the same as a divergence in calculus. However, the KL divergence can be derived from the Bregman divergence.

For probability distributions P and Q of a discrete random variable their K–L divergence is defined to be :$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \!$ In words, it is the average of the logarithmic difference between the probabilities P and Q, where the average is taken using the probabilities P. The K-L divergence is only defined if P and Q both sum to 1 and if $Q(i)\gt 0$ for any i such that $P(i)\gt 0$. If the quantity $0 \log 0$ appears in the formula, it is interpreted as zero. For distributions P and Q of a continuous random variable, KL-divergence is defined to be the integral:[4] :$D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^\infty p(x) \log \frac{p(x)}{q(x)} \, {\rm d}x, \!$ where p and q denote the densities of P and Q.

More generally, if P and Q are probability measures over a set X, and Q is absolutely continuous with respect to P, then the Kullback–Leibler divergence from P to Q is defined as :$D_{\mathrm{KL}}(P\|Q) = -\int_X \log \frac{{\rm d}Q}{{\rm d}P} \,{\rm d}P, \!$ where $\frac{{\rm d}Q}{{\rm d}P}$ is the Radon–Nikodym derivative of Q with respect to P, and provided the expression on the right-hand side exists.

1. Kullback, S.; Leibler, R.A. (1951). "On Information and Sufficiency". Annals of Mathematical Statistics 22 (1): 79–86. doi:10.1214/aoms/1177729694. MR39968.
2. S. Kullback (1959) Information theory and statistics (John Wiley and Sons, NY).
3. Kullback, S.; Burnham, K. P.; Laubscher, N. F.; Dallal, G. E.; Wilkinson, L.; Morrison, D. F.; Loyer, M. W.; Eisenberg, B. et al. (1987). "Letter to the Editor: The Kullback–Leibler distance". The American Statistician 41 (4): 340–341. JSTOR 2684769.
4. C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.