Multiclass Cross-Entropy Measure
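A Multiclass Cross-Entropy Measure is a dispersion measure that quantifies the average number of bits needed to identify an event from a set of possibilities when the coding scheme is based on an estimated probability distribution [math]q[/math] rather than the true distribution [math]p[/math].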
- AKA: Categorical Cross-Entropy, [math]H(P,Q)[/math].
- Context:
- It can be a generalization of log-loss for multi-class Classification.
- It can range from being a Normalized Cross Entropy to being an Unnormalized Cross Entropy.
- It can be used by Cross-Entropy Minimization.
- Example(s):
- Counter-Example(s):
- See: Cross-Entropy Loss Function, Information Entropy, Probability Distribution, Bit, Kullback–Leibler Divergence, Discrete Random Variable, Continuous Random Variable, Joint Entropy, Perplexity Measure, Squared Error.
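The context item on generalizing log-loss can be illustrated with a minimal sketch. The function names `multiclass_cross_entropy` and `binary_log_loss`, the clipping constant `eps`, and the sample data are illustrative assumptions, not part of any particular library. The sketch averages the negative log-probability assigned to the correct class and checks that, with two classes, it coincides with the usual binary log-loss.

```python
import math

def multiclass_cross_entropy(true_labels, predicted_probs, eps=1e-15):
    """Average cross-entropy (in nats) over labeled examples.

    true_labels: integer class indices, one per example.
    predicted_probs: per-class probability lists, one per example.
    eps: clipping constant to avoid log(0) (an illustrative choice).
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Only the probability assigned to the correct class contributes,
        # because the true distribution is one-hot.
        q = min(max(probs[label], eps), 1.0)
        total += -math.log(q)
    return total / len(true_labels)

def binary_log_loss(true_labels, predicted_pos_probs, eps=1e-15):
    """Standard two-class log-loss (in nats), for comparison."""
    total = 0.0
    for y, p in zip(true_labels, predicted_pos_probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(true_labels)

# Hypothetical data: three examples with predicted P(class = 1).
labels = [1, 0, 1]
pos_probs = [0.9, 0.2, 0.6]
two_class_probs = [[1.0 - p, p] for p in pos_probs]

binary = binary_log_loss(labels, pos_probs)
multi = multiclass_cross_entropy(labels, two_class_probs)
print(binary, multi)  # the two values agree
```

With more than two classes, the same function applies unchanged: each row of `predicted_probs` simply has one entry per class.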
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/cross_entropy Retrieved:2017-6-7.
- In information theory, the cross entropy between two probability distributions [math] p [/math] and [math] q [/math] over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution [math] q [/math] , rather than the "true" distribution [math] p [/math] .
The cross entropy for the distributions [math] p [/math] and [math] q [/math] over a given set is defined as follows: : [math] H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q),\! [/math] where [math] H(p) [/math] is the entropy of [math] p [/math] , and [math] D_{\mathrm{KL}}(p \| q) [/math] is the Kullback–Leibler divergence of [math] q [/math] from [math] p [/math] (also known as the relative entropy of p with respect to q — note the reversal of emphasis).
For discrete [math] p [/math] and [math] q [/math] this means : [math] H(p, q) = -\sum_x p(x)\, \log q(x). \! [/math] The situation for continuous distributions is analogous. We have to assume that [math] p [/math] and [math] q [/math] are absolutely continuous with respect to some reference measure [math] r [/math] (usually [math] r [/math] is a Lebesgue measure on a Borel σ-algebra). Let [math] P [/math] and [math] Q [/math] be probability density functions of [math] p [/math] and [math] q [/math] with respect to [math] r [/math] . Then : [math] -\int_X P(x)\, \log Q(x)\, dr(x) = \operatorname{E}_p[-\log Q]. \! [/math] NB: The notation [math] H(p,q) [/math] is also used for a different concept, the joint entropy of [math] p [/math] and [math] q [/math]
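The discrete definition and the decomposition [math] H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q) [/math] can be checked numerically. The following is a small sketch rather than a reference implementation: the helper names and the two example distributions are assumptions chosen for illustration, logarithms are base 2 so results are in bits, and [math] q(x) \gt 0 [/math] is assumed wherever [math] p(x) \gt 0 [/math].

```python
import math

def entropy(p):
    """H(p) in bits; terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Illustrative distributions over three events.
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.25, 0.5]   # distribution the code is optimized for

h_p = entropy(p)            # 1.5 bits
h_pq = cross_entropy(p, q)  # 1.75 bits
d_kl = kl_divergence(p, q)  # 0.25 bits

print(h_p, h_pq, d_kl)
print(abs(h_pq - (h_p + d_kl)) < 1e-12)  # True
```

Coding with [math] q [/math] instead of [math] p [/math] costs 0.25 extra bits per event on average in this example, which is exactly the Kullback–Leibler term.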
2017
- http://deeplearning.net/software/theano/library/tensor/nnet/nnet.html#theano.tensor.nnet.nnet.categorical_crossentropy
- QUOTE: Return the cross-entropy between an approximating distribution and a true distribution. The cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the “true” distribution p. Mathematically, this function computes [math]H(p,q) = - \sum_x p(x) \log(q(x))[/math], where p=true_dist and q=coding_dist.
2011a
- (Mikolov et al., 2011) ⇒ Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Černocký. (2011). “Empirical Evaluation and Combination of Advanced Language Modeling Techniques.” In: Proceedings of INTERSPEECH 2011.
- QUOTE: … Thus, the measure that we will aim to minimize is the cross entropy of the test data given the language model. The cross entropy is equal to [math]\log_2[/math] perplexity (PPL) ...
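The relation between cross entropy and perplexity stated in the quote can be sketched as follows. This assumes the cross entropy is the average negative base-2 log-probability per token; the function names and the per-token probabilities are hypothetical and are not taken from the paper.

```python
import math

def cross_entropy_bits(token_probs):
    """Average negative log2-probability the model assigns to each test token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity as 2 raised to the cross entropy (in bits)."""
    return 2.0 ** cross_entropy_bits(token_probs)

# Hypothetical model probabilities for four test tokens.
probs = [0.25, 0.25, 0.25, 0.25]

h = cross_entropy_bits(probs)  # 2.0 bits per token
ppl = perplexity(probs)        # 4.0

print(h, ppl, math.log2(ppl))
```

A model that assigns probability 1/4 to every token is as uncertain as a uniform choice among four alternatives, so its perplexity is 4 and its cross entropy is 2 bits.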
2011b
- (Yu et al., 2011) ⇒ Dong Yu, Jinyu Li, and Li Deng. (2011). “Calibration of Confidence Measures in Speech Recognition.” In: IEEE Transactions on Audio, Speech, and Language Processing, 19(8). doi:10.1109/TASL.2011.2141988
2004
- (Caruana & Niculescu-Mizil, 2004) ⇒ Rich Caruana, and Alexandru Niculescu-Mizil. (2004). “Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria.” In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ISBN:1-58113-888-1 doi:10.1145/1014052.1014063
- QUOTE: … compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low dimensional manifold.