# Maximum Likelihood Estimation (MLE) Algorithm

A Maximum Likelihood Estimation (MLE) Algorithm is a likelihood-based supervised point estimation algorithm that produces maximum likelihood estimates (parameter values that maximize the likelihood function).

**Context:**
- It can (typically) neglect the PDF's Prior Probability.
- It can be implemented into an MLE-based System (one that solves an MLE Task to find a PDF's maximum likelihood estimate).
- It can be a Generalized Maximum Likelihood Algorithm.
- It can range from being a Penalized MLE Algorithm to being an Unpenalized MLE Algorithm (unsmoothed MLE algorithm).

**Example(s):**

**Counter-Example(s):**
- a Max Pseudolikelihood Estimation Algorithm (MPLE), which approximates the (log-)likelihood by conditioning each variable on its Markov blanket.
- a Maximum a Posteriori Estimation Algorithm.
- a Least Squares Estimation Algorithm, such as Linear Least Squares or Logistic Least Squares.
- a Max-Margin Weight Learning Algorithm (MM) (discriminative), which maximizes the difference between the scores of the ground truth and the best alternative state.
- an Expected Log MLE Algorithm.
- a Bayesian Parameter Estimation Algorithm.
- a Method of Moments.
- a Minimum Mean Squared Error Algorithm.
- a Minimum Chi-Square Estimation Algorithm.

**See:** EM Algorithm, Frequentist Algorithm, Unconstrained Optimization Algorithm, Global Unconstrained Optimization Algorithm, Generative Statistical Model, Statistical Model Parameter Estimation Task.

## References

### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/maximum_likelihood Retrieved:2015-5-11.
- QUOTE: In statistics, **maximum-likelihood estimation** (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters. The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins but be unable to measure the height of every penguin in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while knowing only the heights of a sample of the overall population. MLE accomplishes this by taking the mean and variance as parameters and finding the particular parameter values that make the observed results the most probable (given the model).

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some complicated problems, difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do not exist.
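The Gaussian case sketched in this quote has a closed-form solution: the MLE of the mean is the sample mean, and the MLE of the variance is the biased (1/n) sample variance. A minimal illustration in Python, with hypothetical height data:

```python
def gaussian_mle(xs):
    """Closed-form MLE for a normal sample: the sample mean and the
    biased (1/n) variance jointly maximize the Gaussian likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # note 1/n, not 1/(n-1)
    return mu, var

heights = [55.0, 52.0, 58.0, 54.0, 56.0]  # hypothetical penguin heights (cm)
mu_hat, var_hat = gaussian_mle(heights)
```

Note that the 1/n variance estimator is the likelihood maximizer even though it is biased; the familiar 1/(n-1) estimator is the bias-corrected variant.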


### 2012a

- (Levy, 2012) ⇒ Roger Levy. (2012). “Probabilistic Models in the Study of Language - Chapter 4: Parameter Estimation."
- QUOTE: … In this chapter we delve more deeply into the theory of probability density estimation, focusing on inference within parametric families of probability distributions (see discussion in Section 2.11.2). We start with some important properties of estimators, then turn to basic frequentist parameter estimation (maximum-likelihood estimation and corrections for bias), and finally basic Bayesian parameter estimation.

### 2012b

- http://en.wikibooks.org/wiki/R_Programming/Maximum_Likelihood#Introduction
- QUOTE: Maximum likelihood estimation is just an optimization problem. You have to write down your log likelihood function and use some optimization technique. Sometimes you also need to write your score (the first derivative of the log likelihood) and/or the Hessian (the second derivative of the log likelihood).
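The recipe in this quote (log likelihood, score, Hessian) can be sketched for an exponential model, where Newton's method uses the score and Hessian directly. A minimal sketch in Python; the data and starting value are illustrative:

```python
def exp_mle_newton(xs, lam=1.0, iters=50):
    """MLE for the rate of an exponential distribution via Newton's method.
    Log likelihood: l(lam) = n*log(lam) - lam*sum(xs)
    Score:          l'(lam)  = n/lam - sum(xs)
    Hessian:        l''(lam) = -n/lam**2
    """
    n, s = len(xs), sum(xs)
    for _ in range(iters):
        score = n / lam - s
        hessian = -n / (lam * lam)
        lam -= score / hessian  # Newton update
    return lam

# The closed-form MLE is n / sum(xs) = 1 / mean(xs), which Newton recovers.
```

In practice one would hand the negative log likelihood (and optionally the score and Hessian) to a general-purpose optimizer, as the quoted page does with R's `optim`; the explicit loop above just makes the role of each derivative visible.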

### 2011

- https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/
- QUOTE: Why is minimizing the negative log likelihood equivalent to maximum likelihood estimation (MLE)? Or, equivalently, in Bayesian-speak: Why is minimizing the negative log likelihood equivalent to maximum a posteriori probability (MAP), given a uniform prior?
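The equivalence raised in this quote follows from the strict monotonicity of the logarithm: the p that maximizes the likelihood is the p that minimizes the negative log likelihood. A small numeric check with hypothetical coin-toss data:

```python
import math

# 7 heads in 10 tosses (illustrative data); scan a grid of candidate p values.
k, n = 7, 10
grid = [i / 100 for i in range(1, 100)]

def likelihood(p):
    return p ** k * (1 - p) ** (n - k)

def neg_log_likelihood(p):
    return -(k * math.log(p) + (n - k) * math.log(1 - p))

# Because log is strictly increasing, maximizing L and minimizing -log L
# select the same p -- here k/n = 0.7.
p_star_like = max(grid, key=likelihood)
p_star_nll = min(grid, key=neg_log_likelihood)
```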


### 2009

- (Gentle, 2009) ⇒ James E. Gentle. (2009). “Computational Statistics." Springer. ISBN:978-0-387-98143-7
- QUOTE: One of the most commonly used approaches to statistical estimation is maximum likelihood. The concept has an intuitive appeal, and the estimators based on this approach have a number of desirable mathematical properties, at least for broad classes of distributions.

### 2008

- (Upton & Cook, 2008) ⇒ Graham Upton, and Ian Cook. (2008). “A Dictionary of Statistics, 2nd edition revised." Oxford University Press. ISBN:0199541450
- QUOTE: A commonly used method for obtaining an estimate of an unknown parameter of an assumed population distribution. The likelihood of a data set depends upon the parameter(s) of the distribution (or probability density function) from which the observations have been taken. In cases where one or more of these parameters are unknown, a shrewd choice as an estimate would be the value that maximizes the likelihood. This is the maximum likelihood estimate (mle). Expressions for maximum likelihood estimates are frequently obtained by maximizing the natural logarithm of the likelihood rather than the likelihood itself (the result is the same). … Sir Ronald Fisher introduced the method in 1912.
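A standard worked example of the log trick mentioned in this quote: for a Bernoulli sample with [math]\displaystyle{ k }[/math] successes in [math]\displaystyle{ n }[/math] trials, the likelihood is [math]\displaystyle{ L(p) = p^k (1-p)^{n-k} }[/math], so [math]\displaystyle{ \log L(p) = k \log p + (n-k)\log(1-p) }[/math]. Setting the derivative of the log likelihood to zero gives [math]\displaystyle{ \frac{k}{p} - \frac{n-k}{1-p} = 0 \Rightarrow \hat{p} = \frac{k}{n}, }[/math] which is the same maximizer that direct maximization of [math]\displaystyle{ L(p) }[/math] yields, since [math]\displaystyle{ \log }[/math] is strictly increasing.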

### 2007

- (Minka, 2007) ⇒ Thomas P. Minka. (2007). “A Comparison of Numerical Optimizers for Logistic Regression." Technical Report.
- QUOTE: Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum a-posteriori parameter estimate. A full derivation of each algorithm is given. In particular, a new derivation of Iterative Scaling is given which applies more generally than the conventional one. A new derivation is also given for the Modified Iterative Scaling algorithm of Collins et al. (2002). Most of the algorithms operate in the primal space, but can also work in dual space. All algorithms are compared in terms of computational complexity by experiments on large data sets. The fastest algorithms turn out to be conjugate gradient ascent and quasi-Newton algorithms, which far outstrip Iterative Scaling and its variants.
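One of the second-order schemes compared in the report, Newton's method, can be sketched for a one-parameter logistic model with a Gaussian prior (i.e., an L2-penalized MAP estimate). This is an illustrative sketch, not Minka's code; the data, function names, and `alpha` penalty are assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_map_newton(xs, ys, alpha=1.0, iters=25):
    """One-parameter logistic regression P(y=1|x) = sigmoid(w*x),
    maximizing the log posterior (log likelihood minus alpha*w**2/2,
    from a Gaussian prior on w) by Newton's method."""
    w = 0.0
    for _ in range(iters):
        grad = -alpha * w            # prior contribution to the score
        hess = -alpha                # prior contribution to the curvature
        for x, y in zip(xs, ys):
            p = sigmoid(w * x)
            grad += (y - p) * x      # data contribution to the score
            hess -= p * (1 - p) * x * x  # data contribution (always negative)
        w -= grad / hess             # Newton step on a concave objective
    return w
```

Because the penalized log posterior is strictly concave in w, the Hessian is always negative and each Newton step is well-defined; this is the property that makes second-order and quasi-Newton methods so effective for logistic regression.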

### 2003

- (Myung, 2003) ⇒ In Jae Myung. (2003). “Tutorial on Maximum Likelihood Estimation.” In: Journal of Mathematical Psychology, 47.
- QUOTE: There are two general methods of parameter estimation. They are least-squares estimation (LSE) and maximum likelihood estimation (MLE). The former has been a popular choice of model fitting in psychology (e.g., Rubin, Hinton, & Wenzel, 1999; Lamberts, 2000 but see Usher & McClelland, 2001) and is tied to many familiar statistical concepts such as linear regression, sum of squares error, proportion variance accounted for (i.e. [math]\displaystyle{ r^2 }[/math]), and root mean squared deviation. LSE, which unlike MLE requires no or minimal distributional assumptions, is useful for obtaining a descriptive measure for the purpose of summarizing observed data, but it has no basis for testing hypotheses or constructing confidence intervals. … Unlike least-squares estimation which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and is an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. … MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest contained in its MLE estimator); consistency (true parameter value that generated the data recovered asymptotically, i.e. for data of sufficiently large samples); efficiency (lowest-possible variance of parameter estimates achieved asymptotically); and parameterization invariance (same MLE solution obtained independent of the parametrization used). In contrast, no such things can be said about LSE. As such, most statisticians would not view LSE as a general method for parameter estimation, but rather as an approach that is primarily used with linear regression models. Further, many of the inference methods in statistics are developed based on MLE. 
For example, MLE is a prerequisite for the chi-square test, the G-square test, Bayesian methods, inference with missing data, modeling of random effects, and many model selection criteria such as the Akaike information criterion (Akaike, 1973) and the Bayesian information criterion (Schwarz, 1978).

### 2000

- (Valpola, 2000) ⇒ Harri Valpola. (2000). “Bayesian Ensemble Learning for Nonlinear Factor Analysis." PhD Dissertation, Helsinki University of Technology.
- QUOTE: … The two point estimates in wide use are the maximum likelihood (ML) and the maximum a posteriori (MAP) estimator. The ML estimator neglects the prior probability of the models and maximises only the probability which the model gives for the observation. The MAP estimator chooses the model which has the highest posterior probability mass or density.
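The contrast drawn in this quote can be made concrete for a Bernoulli parameter with a Beta(a, b) prior, where both point estimators have closed forms; the prior hyperparameters below are illustrative:

```python
def ml_estimate(k, n):
    """ML: maximize the likelihood alone; the prior is ignored."""
    return k / n

def map_estimate(k, n, a=2.0, b=2.0):
    """MAP for a Bernoulli parameter with a Beta(a, b) prior:
    maximize likelihood * prior, i.e. the posterior density."""
    return (k + a - 1) / (n + a + b - 2)

# With 9 heads in 10 tosses, ML gives 0.9, while the Beta(2, 2) prior
# pulls the MAP estimate toward 1/2: (9+1)/(10+2) = 10/12.
```

With a uniform Beta(1, 1) prior the two estimators coincide, which is the sense in which ML is MAP with a flat prior.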

### 1997

- (Huelsenbeck & Crandall, 1997) ⇒ John P. Huelsenbeck and Keith A. Crandall. (1997). “Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood.” In: Annual Review of Ecology and Systematics, 28. http://www.jstor.org/stable/2952500
- QUOTE: … The application of maximum likelihood estimation to the phylogeny problem was first suggested by Edwards & Cavalli-Sforza (20). However, they found the problem too computationally difficult at the time and attempted approxim-...

### 1991

- (Efron & Tibshirani, 1991) ⇒ Bradley Efron, and Robert Tibshirani. (1991). “Statistical Data Analysis in the Computer Age.” In: Science, 253(5018). 10.1126/science.253.5018.390
- QUOTE: Most of our familiar statistical methods, such as hypothesis testing, linear regression, analysis of variance, and maximum likelihood estimation, were designed to be implemented on mechanical calculators. ...

### 1977

- (Dempster et al., 1977) ⇒ Arthur P. Dempster, Nan Laird, and Donald Rubin. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” In: Journal of the Royal Statistical Society, Series B, 39(1):1-38.

### 1973

- (Akaike, 1973) ⇒ Hirotugu Akaike. (1973). “Information Theory and an Extension of the Maximum Likelihood Principle.” In: Proceedings of the Second International Symposium on Information Theory.