# Maximum Entropy-based Learning Algorithm

A Maximum Entropy-based Learning Algorithm is a supervised discriminative classification algorithm that favors class predictions with maximum entropy (least biased estimates).

**Context:**- It can produce a Maximum Entropy Model? Generalized Maximum Entropy Model?.
- It can range from being a Binomial Maximum Entropy-based Learning Algorithm to being a Multinomial Maximum Entropy-based Learning Algorithm.

**Example(s):****Counter-Example(s):****See:**Generalized Maximum Entropy Model, Maximum Entropy Network/Maximum Entropy Markov Network, Exponential Model.

## References

### 2009

- http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node2.html
- The maximum entropy method answers both these questions. Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. This is precisely the approach we took in selecting our model at each step in the above example. … In its most general formulation, maximum entropy can be used to estimate any probability distribution. In this paper we are interested in classication; thus we limit our further discussion to learning conditional distributions from labeled training data. Specically, we learn the conditional distribution of the class label given a document.

- http://homepages.inf.ed.ac.uk/s0450736/maxent.html
- MEGA Optimization Package

### 2004

- (Goodman, 2004) ⇒ J. Goodman. (2004). Exponential Priors for Maximum Entropy Models. In: Proceedings of HLTNAACL (2004). http://citeseer.ist.psu.edu/goodman04exponential.html

### 2003

- (Caticha, 2004) ⇒ Ariel Caticha. (2004). “Relative Entropy and Inductive Inference.” In: AIP conference proceedings, vol. 707, no. 1, pp. 75-96 . American Institute of Physics,
- QUOTE: We discuss how the method of maximum entropy, MaxEnt, can be extended beyond its original scope, as a rule to assign a probability distribution, to a full‐fledged method for inductive inference. ...
... The method of maximum entropy, MaxEnt, as conceived by Jaynes [1], is a method to assign probabilities on the basis of partial information of a certain kind. The type of information in question is called testable information and consists in the specification of the family of acceptable distributions. The infor- mation is “testable” in the sense that one should be able to test whether any candidate distribution belongs or not to the family.

The purpose of this paper is to discuss how MaxEnt can be extended beyond its original scope, as a rule to assign a probability distribution, to a full-fledged method for inductive inference. To distinguish it from MaxEnt the extended method will henceforth be abbreviated as ME. [2] ...

- QUOTE: We discuss how the method of maximum entropy, MaxEnt, can be extended beyond its original scope, as a rule to assign a probability distribution, to a full‐fledged method for inductive inference. ...

### 1999

- (Nigam et al., 1999) ⇒ Kamal Nigam, John Lafferty, and Andrew McCallum. (1999). “Using Maximum Entropy for Text Classification.” In: IJCAI-99 workshop on machine learning for information filtering.
- QUOTE: Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the results indicate that maximum entropy is a promising technique for text classification.

### 1996

- (Berger et al., 1996) ⇒ Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. (1996). “A Maximum Entropy Approach to Natural Language Processing.” In: Computational Linguistics, 22(1).

### 1989

- (Kesavan & Kapur, 1989) ⇒ H. K. Kesavan, and J. N. Kapur. (1989). “The Generalized Maximum Entropy Principle.” In: IEEE Transactions on Systems, Man and Cybernetics, 19(5). doi:10.1109/21.44019
- ABSTRACT: Generalizations of the maximum entropy principle (MEP) of E.T. Jaynes (1957) and the minimum discrimination information principle (MDIP) of S. Kullback (1959) are described. The generalizations have been achieved by enunciating the entropy maximization postulate and examining its consequences. The inverse principles which are inherent in the MEP and MDIP are made quite explicit. Several examples are given to illustrate the power and scope of the generalized maximum entropy principle that follows from the entropy maximization postulate.

### 1979

- (Thomas, 1979) ⇒ Marlin U. Thomas. (1979). “A Generalized Maximum Entropy Principle.” In: Operations Research, 27(6).
- ABSTRACT: We describe a generalized maximum entropy principle for dealing with decision problems involving uncertainty but with some prior knowledge about the probability space corresponding to nature. This knowledge is expressed through known bounds on event probabilities and moments, which can be incorporated into a nonlinear programming problem. The solution provides a maximum entropy distribution that is then used in treating the decision problem as one involving risk. We describe an example application that involves the selection of oil spill recovery systems for inland harbor regions.

### 1957

- (Jaynes, 1957) ⇒ E. T. Jaynes. (1957). “Information Theory and Statistical Mechanics.
- "Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.