# Information Entropy Metric

An Information Entropy Metric is a discrete random variable metric of the uncertainty associated with random variable [math]X[/math].

**AKA:**[math]H(X)[/math], Shannon Entropy.**Context:**- It can be written as [math]H(X)[/math]
- It can be expressed as [math]E[I(X)] = E[-\ln(P(X))][/math].
- It can range from being a Conditional Entropy Function [math]H(X|Y)[/math], to being a Joint Entropy Function [math]H(X,Y)[/math].
- It can be an Entropy Measure.
- It can estimate how many bits are required to transmit messages from event space E, where p(x) defines the probability of observing [math]X = x[/math].
- It can estimate average amount of information in bits in some attribute of an instance

**Counter-Example(s):****See:**Cross-Entropy, Maximum Entropy Principle

## References

### 2015

- (Li & Hovy, 2015) ⇒ Jiwei Li, and Eduard Hovy. (2015). “The NLP Engine: A Universal Turing Machine for NLP.” In: arXiv preprint arXiv:1503.00168.

### 2013

- http://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition
- In information theory,
**entropy**is a measure of the uncertainty in a random variable.^{[1]}In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message.^{[2]}Entropy is typically measured in bits, nats, or bans.^{[3]}Shannon entropy is the average unpredictability in a random variable, which is equivalent to its information content. The concept was introduced by Claude E. Shannon in his 1948 paper “A Mathematical Theory of Communication".^{[4]}Shannon entropy provides an absolute limit on the best possible lossless encoding or compression of any communication, assuming that^{[5]}the communication may be represented as a sequence of independent and identically distributed random variables. Shannon's source coding theorem shows that, in the limit, the average length of the shortest possible representation to encode the messages in a given alphabet is their entropy divided by the logarithm of the number of symbols in the target alphabet.^{[citation needed]}A single toss of a fair coin has an entropy of one bit. A series of two fair coin tosses has an entropy of two bits. The number of fair coin tosses is its entropy in bits. This random selection between two outcomes in a sequence over time, whether the outcomes are equally probable or not, is often referred to as a Bernoulli process. The entropy of such a process is given by the binary entropy function. The entropy rate for a fair coin toss is one bit per toss. However, if the coin is not fair, then the uncertainty, and hence the entropy rate, is lower. This is because, if asked to predict the next outcome, we could choose the most frequent result and be right more often than wrong. The difference between what we know, or predict, and the information that the unfair coin toss reveals to us is less than one heads-or-tails "message", or bit, per toss. As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,

^{[6]}or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text.^{[7]}

- In information theory,

- http://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition
- Named after Boltzmann's H-theorem, Shannon denoted the entropy
*H*of a discrete random variable*X*with possible values {x_{1}, ...,*x*_{n}} and probability mass function*P(X)*as, :[math]H(X) = E[I(X)] = E[-\ln(P(X))].[/math] Here E is the expected value operator, and*I*is the information content of*X*.^{[8]}^{[9]}*I*(*X*) is itself a random variable.When taken from a finite sample, the entropy can explicitly be written as :[math]H(X) = \sum_{i} {P(x_i)\,I(x_i)} = -\sum_{i} {P(x_i) \log_b P(x_i)} = -\sum_{i}{\frac{n_i}{N} \log_b \frac{n_i}{N}} = \log_b N - \frac 1 N \sum_{i} {n_i \log_b n_i},[/math] where

*b*is the base of the logarithm used. Common values of*b*are 2, Euler's number*e*, and 10, and the unit of entropy is bit for*b*= 2, nat for*b*=*e*, and dit (or digit) for*b*= 10.^{[10]}In the case of

*p*(*x*_{i}*) = 0 for some*i*, the value of the corresponding summand 0 log*b_{}*(0) is taken to be 0, which is consistent with the well-known limit: :[math]\lim_{p\to0+}p\log (p) = 0[/math].*

- Named after Boltzmann's H-theorem, Shannon denoted the entropy

- ↑ Ihara, Shunsuke (1993).
*Information theory for continuous systems*. World Scientific. p. 2. ISBN 978-981-02-0985-8. http://books.google.com/books?id=rwg5ndTuTC4C&pg=PA2. - ↑ In this context, a 'message' means a specific realization of the random variable.
- ↑ Brillouin, Léon (2004).
*Science & Information Theory*. Dover Publications. p. 293. ISBN 978-0-486-43918-1. http://books.google.com/books?id=DWo7lVRVnhcC&pg=PA293. - ↑ Shannon, Claude E. (July/October 1948). "A Mathematical Theory of Communication".
*Bell System Technical Journal***27**(3): 379–423. (PDF) - ↑ Goise, Francois & Olla, Stefano (2008).
*Entropy methods for the Boltzmann equation: lectures from a special semester at the Centre Émile Borel, Institut H. Poincaré, Paris, 2001*. Springer. p. 14. ISBN 978-3-540-73704-9. http://books.google.com/books?id=K3meGQ9ntjgC&pg=PA14. - ↑ Schneier, B:
*Applied Cryptography*, Second edition, page 234. John Wiley and Sons. - ↑ Shannon, Claude E.: Prediction and entropy of printed English,
*The Bell System Technical Journal*, 30:50–64, January 1951. - ↑ Borda, Monica (2011).
*Fundamentals in Information Theory and Coding*. Springer. p. 11. ISBN 978-3-642-20346-6. http://books.google.com/books?id=Lyte2yl1SPAC&pg=PA11. - ↑ Han, Te Sun & Kobayashi, Kingo (2002).
*Mathematics of Information and Coding*. American Mathematical Society. pp. 19–20. ISBN 978-0-8218-4256-0. http://books.google.com/books?id=VpRESN24Zj0C&pg=PA19. - ↑ Schneider, T.D, Information theory primer with an appendix on logarithms, National Cancer Institute, 14 April 2007.

### 2012

- http://en.wikipedia.org/wiki/Entropy_%28information_theory%29#Limitations_of_entropy_as_a_measure_of_unpredictability
- In cryptanalysis, entropy is often roughly used as a measure of the unpredictability of a cryptographic key. For example, a 128-bit key that is randomly generated has 128 bits of entropy. It takes (on average) [math]2^{128-1}[/math] guesses to break by brute force. If the key's first digit is 0, and the others random, then the entropy is 127 bits, and it takes (on average) [math]2^{127-1}[/math] guesses.
However, this measure fails if the possible keys are not of equal probability

^{[clarification needed]}. If the key is half the time "password" and half the time a true random 128-bit key, then the entropy is approximately 65 bits. Yet half the time the key may be guessed on the first try, if your first guess is "password", and on average, it takes around [math]2^{126}[/math] guesses (not [math]2^{65-1}[/math]) to break this password.Similarly, consider a 1000000-digit binary one-time pad. If the pad has 1000000 bits of entropy, it is perfect. If the pad has 999999 bits of entropy, evenly distributed (each individual bit of the pad having 0.999999 bits of entropy) it may still be considered very good. But if the pad has 999999 bits of entropy, where the first digit is fixed and the remaining 999999 digits are perfectly random, then the first digit of the ciphertext will not be encrypted at all.

- In cryptanalysis, entropy is often roughly used as a measure of the unpredictability of a cryptographic key. For example, a 128-bit key that is randomly generated has 128 bits of entropy. It takes (on average) [math]2^{128-1}[/math] guesses to break by brute force. If the key's first digit is 0, and the others random, then the entropy is 127 bits, and it takes (on average) [math]2^{127-1}[/math] guesses.

### 2008

- (Wilson, 2008a) ⇒ Bill Wilson. (2008). “The Machine Learning Dictionary for COMP9414." University of New South Wales, Australia.
- QUOTE: entropy: For our purposes, the entropy measure –Σi pilog2pi gives us the average amount of information in bits in some attribute of an instance. The information referred to is information about what class the instance belongs to, and pi is the probability that an instance belongs to class i. The rationale for this is as follows: –log2(p) is the amount of information in bits associated with an event of probability p - for example, with an event of probability ½, like flipping a fair coin, log2((p) is –log2(½) = 1, so there is one bit of information. This should coincide with our intuition of what a bit means (if we have one). If there is a range of possible outcomes with associated probabilities, then to work out the average number of bits, we need to multiply the number of bits for each outcome (–log2(p)) by the probability p and sum over all the outcomes. This is where the formula comes from. Entropy is used in the ID3 decision tree induction algorithm.