# Maximum Entropy-based Learning Algorithm

A Maximum Entropy-based Learning Algorithm is a supervised discriminative classification algorithm that favors class predictions with maximum entropy (least biased estimates).

**Context:**- It can produce a Maximum Entropy Model? Generalized Maximum Entropy Model?.
- It can range from being a Binomial Maximum Entropy-based Learning Algorithm to being a Multinomial Maximum Entropy-based Learning Algorithm.

**Example(s):****Counter-Example(s):****See:**Generalized Maximum Entropy Model, Maximum Entropy Network/Maximum Entropy Markov Network, Exponential Model.

## References

### 2009

- http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node2.html
- The maximum entropy method answers both these questions. Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. This is precisely the approach we took in selecting our model at each step in the above example. … In its most general formulation, maximum entropy can be used to estimate any probability distribution. In this paper we are interested in classication; thus we limit our further discussion to learning conditional distributions from labeled training data. Specically, we learn the conditional distribution of the class label given a document.

- http://homepages.inf.ed.ac.uk/s0450736/maxent.html
- MEGA Optimization Package

### 2004

- (Goodman, 2004) ⇒ J. Goodman. (2004). Exponential Priors for Maximum Entropy Models. In: Proceedings of HLTNAACL (2004). http://citeseer.ist.psu.edu/goodman04exponential.html

### 1999

- (Nigam et al., 1999) ⇒ Kamal Nigam, John Lafferty, and Andrew McCallum. (1999). “Using Maximum Entropy for Text Classification.” In: IJCAI-99 workshop on machine learning for information filtering.
- QUOTE: Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the results indicate that maximum entropy is a promising technique for text classification.

### 1996

- (Berger et al., 1996) ⇒ Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. (1996). “A Maximum Entropy Approach to Natural Language Processing.” In: Computational Linguistics, 22(1).

### 1989

- (Kesavan & Kapur, 1989) ⇒ H. K. Kesavan, and J. N. Kapur. (1989). “The Generalized Maximum Entropy Principle.” In: IEEE Transactions on Systems, Man and Cybernetics, 19(5). doi:10.1109/21.44019
- ABSTRACT: Generalizations of the maximum entropy principle (MEP) of E.T. Jaynes (1957) and the minimum discrimination information principle (MDIP) of S. Kullback (1959) are described. The generalizations have been achieved by enunciating the entropy maximization postulate and examining its consequences. The inverse principles which are inherent in the MEP and MDIP are made quite explicit. Several examples are given to illustrate the power and scope of the generalized maximum entropy principle that follows from the entropy maximization postulate.

### 1979

- (Thomas, 1979) ⇒ Marlin U. Thomas. (1979). “A Generalized Maximum Entropy Principle.” In: Operations Research, 27(6).
- ABSTRACT: We describe a generalized maximum entropy principle for dealing with decision problems involving uncertainty but with some prior knowledge about the probability space corresponding to nature. This knowledge is expressed through known bounds on event probabilities and moments, which can be incorporated into a nonlinear programming problem. The solution provides a maximum entropy distribution that is then used in treating the decision problem as one involving risk. We describe an example application that involves the selection of oil spill recovery systems for inland harbor regions.

### 1957

- (Jaynes, 1957) ⇒ E. T. Jaynes. (1957). “Information Theory and Statistical Mechanics.
- "Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.