# Logistic Model Fitting Algorithm

A Logistic Model Fitting Algorithm is a discriminative maximum entropy-based generalized linear classification algorithm that accepts a logistic model family.

**Context:**
- It can range from (typically) being a Binomial Logistic Regression Algorithm to being a Multinomial Logistic Regression Algorithm.
- It can be implemented by a Logistic Regression System (that solves a logistic function fitting task to produce a fitted logistic function).
- It can (typically) assume that Class-Conditional Densities are members of the (same) exponential family distribution.
- It can (typically) ignore Marginal Density information P(X).
- It can (typically) be represented as a Generalized Linear Model (a linear classifier that minimizes the classification error based on the sum of differences).
- It can use an Unconstrained Optimization Algorithm to maximize the log-likelihood of the logistic regression model (such as Newton-Raphson).
- It can range from being an Unregularized Logistic Regression Algorithm to being a Regularized Logistic Regression Algorithm (such as an L1-regularized LR or an L2-regularized LR).
- It can range from being a Dense Logistic Regression Algorithm to being a Sparse Logistic Regression Algorithm.
- It can be a Logistic Regression Algorithm with Random Intercepts.

**Example(s):**
- a Maximum Likelihood Estimation (MLE)-based Logistic Regression Algorithm.
- an Iteratively Reweighted Least Squares (IRLS)-based Binary Logistic Regression Algorithm.
- a Maximum Entropy Markov Model.
- a Stepwise Logistic Regression Algorithm.
- a Reduced Error Logistic Regression Algorithm.
- a Limited-Memory BFGS-based Logistic Regression Algorithm (using L-BFGS).
- an Iterative Scaling-based Logistic Regression Algorithm (using iterative scaling).
- a Truncated Newton-based Logistic Regression Algorithm (using truncated Newton).
- …

**Counter-Example(s):**
- a Decision Tree Training Algorithm.
- a Linear Discriminant Analysis Algorithm.
- a Linear Regression Algorithm.
- a Classification Tree Training Algorithm.
- a Perceptron Training Algorithm.
- a Probit Regression Algorithm.
- a Linear SVM Algorithm (that maximizes the margin of a linear kernel).
- a Generative Model Training Algorithm, such as a linear discriminant analysis.
- a Naive Bayes Classification Algorithm.

**See:** Logistic Regression Model Parameter, Linear Logistic Regression, Maximum Entropy.

## References

### 2020

- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Logistic_regression#Model_fitting Retrieved:2020-9-6.
- Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable Y being 0 or 1 given experimental data.
Consider a generalized linear model function parameterized by $\theta$,

$$h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} = \Pr(Y=1 \mid X; \theta)$$

Therefore,

$$\Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X)$$

and since $Y \in \{0,1\}$, we see that $\Pr(y \mid X; \theta)$ is given by

$$\Pr(y \mid X; \theta) = h_\theta(X)^y (1 - h_\theta(X))^{(1-y)}.$$

We now calculate the likelihood function assuming that all the observations in the sample are independently Bernoulli distributed,

$$\begin{aligned} L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\ &= \prod_i \Pr(y_i \mid x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i} (1 - h_\theta(x_i))^{(1-y_i)} \end{aligned}$$

Typically, the log likelihood is maximized,

$$N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)$$

which is maximized using optimization techniques such as gradient descent.
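The fitting step described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular library: it maximizes the mean log-likelihood by plain gradient ascent (equivalently, gradient descent on its negative). The function names, step size, iteration count, and toy data are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mean_log_likelihood(theta, xs, ys):
    """N^{-1} * sum_i log Pr(y_i | x_i; theta) for labels y_i in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total / len(xs)

def fit_gradient_ascent(xs, ys, step=0.2, iterations=5000):
    """Maximize the mean log-likelihood by plain gradient ascent."""
    theta = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(iterations):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
            for j, xj in enumerate(x):
                # Gradient of the mean log-likelihood: mean of (y - h) * x_j.
                grad[j] += (y - h) * xj / n
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data; the first column is a constant 1 (intercept term).
# The classes overlap, so the maximum likelihood estimate is finite.
xs = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
ys = [0, 0, 1, 0, 1, 1]

theta_hat = fit_gradient_ascent(xs, ys)
print(theta_hat, mean_log_likelihood(theta_hat, xs, ys))
```

A fixed step size is used only to keep the sketch short; practical solvers usually rely on line search or second-order methods.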

Assuming the (x, y) pairs are drawn uniformly from the underlying distribution, then in the limit of large *N*,

$$\begin{aligned} & \lim_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt] = {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt] = {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X) \end{aligned}$$

where $H(Y \mid X)$ is the conditional entropy and $D_\text{KL}$ is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively searching for the model that makes the fewest assumptions in its parameters.
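The decomposition of the expected log-likelihood into a KL term and a conditional entropy term can be checked numerically. The sketch below uses a small hypothetical joint distribution and a deliberately imperfect model; all numbers and names are assumptions chosen for illustration.

```python
import math

# Hypothetical joint distribution Pr(X=x, Y=y) over x in {0, 1} and y in {0, 1}.
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

# Hypothetical model conditionals Pr(Y=1 | X=x; theta), deliberately imperfect.
model_p1 = {0: 0.35, 1: 0.60}

def true_conditional(x, y):
    """Pr(Y=y | X=x) computed from the joint distribution."""
    px = joint[(x, 0)] + joint[(x, 1)]
    return joint[(x, y)] / px

def model_conditional(x, y):
    """Pr(Y=y | X=x; theta) under the model."""
    return model_p1[x] if y == 1 else 1.0 - model_p1[x]

# Left-hand side: expected log-likelihood under the true joint distribution.
expected_ll = sum(p * math.log(model_conditional(x, y)) for (x, y), p in joint.items())

# Right-hand side terms: conditional KL divergence and conditional entropy H(Y | X).
kl = sum(p * math.log(true_conditional(x, y) / model_conditional(x, y))
         for (x, y), p in joint.items())
cond_entropy = -sum(p * math.log(true_conditional(x, y)) for (x, y), p in joint.items())

print(expected_ll, -kl - cond_entropy)
```

Because $H(Y \mid X)$ does not depend on $\theta$, raising the expected log-likelihood can only come from lowering the KL term.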


### 2017

- Digio. (2017). “Differences between Logistic Regression and Perceptrons." URL (version: 2017-06-07): https://stats.stackexchange.com/q/284013
- QUOTE: … Logistic regression models a function of the mean of a Bernoulli distribution as a linear equation (the mean being equal to the probability $p$ of a Bernoulli event). By using the logit link as a function of the mean ($p$), the logarithm of the odds (log-odds) can be derived analytically and used as the response of a so-called generalised linear model. Parameter estimation on this GLM is then a statistical process which yields p-values and confidence intervals for model parameters. On top of prediction, this allows you to interpret the model in causal inference. This is something that you cannot achieve with a linear Perceptron.
The Perceptron is a reverse engineering process of logistic regression: Instead of taking the logit of $y$, it takes the inverse logit (logistic) function of $w_x$, and doesn't use probabilistic assumptions for neither the model nor its parameter estimation. Online training will give you exactly the same estimates for the model weights/parameters, but you won't be able to interpret them in causal inference due to the lack of p-values, confidence intervals, and well, an underlying probability model.

Long story short, logistic regression is a GLM which can perform prediction and inference, whereas the linear Perceptron can only achieve prediction (in which case it will perform the same as logistic regression). The difference between the two is also the fundamental difference between statistical modelling and machine learning.


### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/logistic_regression Retrieved:2015-5-13.
- In statistics, **logistic regression**, or logit regression, or **logit model**^{[1]} is a direct probability model that was developed by statistician D. R. Cox in 1958^{[2]}^{[3]} although much work was done in the single independent variable case almost two decades earlier. The binary logistic model is used to predict a binary response based on one or more predictor variables (features). That is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and hereafter in this article) "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary—that is, the number of available categories is two—while problems with more than two categories are referred to as multinomial logistic regression or **polytomous logistic regression**, or, if the multiple categories are ordered, as ordinal logistic regression.

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by estimating probabilities. Thus, it treats the same set of problems as does probit regression using similar techniques; the first assumes a logistic function and the second a standard normal distribution function.

Logistic regression can be seen as a special case of generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular the key differences of these two models can be seen in the following two features of logistic regression. First, the conditional distribution [math]\displaystyle{ p(y \mid x) }[/math] is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the estimated probabilities are restricted to [0,1] through the logistic distribution function because logistic regression predicts the probability of the instance being positive.

Logistic regression is an alternative to Fisher's 1936 classification method, linear discriminant analysis. If the assumptions of linear discriminant analysis hold, application of Bayes' rule to reverse the conditioning results in the logistic model, so if linear discriminant assumptions are true, logistic regression assumptions must hold. The converse is not true, so the logistic model has fewer assumptions than discriminant analysis and makes no assumption on the distribution of the independent variables.
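The step from linear discriminant assumptions to the logistic model can be written out explicitly. The derivation below is a sketch under the usual LDA assumptions, with notation introduced here for illustration: class priors $\pi_0, \pi_1$, class means $\mu_0, \mu_1$, and a shared covariance matrix $\Sigma$.

```latex
\begin{aligned}
\log \frac{\Pr(Y=1 \mid x)}{\Pr(Y=0 \mid x)}
  &= \log \frac{\pi_1 \, \mathcal{N}(x; \mu_1, \Sigma)}{\pi_0 \, \mathcal{N}(x; \mu_0, \Sigma)} \\
  &= \log \frac{\pi_1}{\pi_0}
     - \tfrac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)
     + \tfrac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \\
  &= \underbrace{(\mu_1 - \mu_0)^T \Sigma^{-1}}_{\beta^T} x
     + \underbrace{\log \frac{\pi_1}{\pi_0}
     - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1
     + \tfrac{1}{2} \mu_0^T \Sigma^{-1} \mu_0}_{\beta_0}
\end{aligned}
```

The quadratic terms in $x$ cancel because both classes share $\Sigma$, so the log-odds are linear in $x$, which is exactly the logistic model form. The converse fails because a linear log-odds does not pin down the distribution of $x$.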


- ↑ Tolles, Juliana; Meurer, William J (2016). "Logistic Regression Relating Patient Characteristics to Outcomes". JAMA. 316 (5): 533–4. doi:10.1001/jama.2016.7653. ISSN 0098-7484. OCLC 6823603312. PMID 27483067.
- ↑ Cox, David R. (1958). "The regression analysis of binary sequences (with discussion)". J R Stat Soc B. 20 (2): 215–242. JSTOR 2983890.
- ↑ Walker, SH; Duncan, DB (1967). "Estimation of the probability of an event as a function of several independent variables". Biometrika. 54 (1/2): 167–178. doi:10.2307/2333860. JSTOR 2333860

### 2013

- (scikit-learn, 2013)⇒ http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- Logistic regression, despite its name, is a linear model for classification rather than regression. As such, it minimizes a “hit or miss” cost function rather than the sum of square residuals (as in ordinary regression). Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.

### 2011

- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Logistic_regression
- … Logistic regression analyzes binomially distributed data of the form [math]\displaystyle{ Y_i \ \sim B(n_i,p_i),\text{ for }i = 1, \dots , m, }[/math] where the numbers of Bernoulli trials [math]\displaystyle{ n_i }[/math] are known and the probabilities of success [math]\displaystyle{ p_i }[/math] are unknown. An example of this distribution is the fraction of seeds ([math]\displaystyle{ p_i }[/math]) that germinate after [math]\displaystyle{ n_i }[/math] are planted.

The model proposes for each trial [math]\displaystyle{ i }[/math] there is a set of explanatory variables that might inform the final probability. These explanatory variables can be thought of as being in a [math]\displaystyle{ k }[/math]-dimensional vector [math]\displaystyle{ X_i }[/math] and the model then takes the form [math]\displaystyle{ p_i = \operatorname{E}\left(\left.\frac{Y_i}{n_{i}}\right|X_i \right). \, }[/math]

The logits, natural logs of the odds, of the unknown binomial probabilities are modeled as a linear function of the [math]\displaystyle{ X_i }[/math]. [math]\displaystyle{ \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}. }[/math] ...


### 2011

- (nzumel, 2011) ⇒ http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/
- QUOTE: Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes. Logistic regression preserves the marginal probabilities of the training data. The coefficients of the model also provide some hint of the relative importance of each input variable.

### 2012

- (Shalizi, 2012) ⇒ Cosma Shalizi. (2012). “Chapter 12 - Logistic Regression.” In: Carnegie Mellon University, 36-402, Undergraduate Advanced Data Analysis.
- QUOTE: Finally, the easiest modification of $\log \;p$ which has an unbounded range is the logistic (or logit) transformation, $\log \dfrac{p}{1-p}$. We can make this a linear function of $x$ without fear of nonsensical results. (Of course the results could still happen to be wrong, but they’re not guaranteed to be wrong.) This last alternative is logistic regression. Formally, the logistic regression model is that $\log \dfrac{p(x)}{1-p(x)} = \beta_0 + x \beta$ Solving for $p$, this gives
$p(x;\beta) = \dfrac{\exp(\beta_0+x\beta)}{1+\exp (\beta_0+x\beta)}$

Notice that the over-all specification is a lot easier to grasp in terms of the transformed probability than in terms of the untransformed probability.^{[1]}
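The relationship between the logit transformation and the solved form for $p$ can be illustrated with a short round-trip check. This is a minimal sketch for a single scalar predictor; the coefficient values and function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Logistic function; equals exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p_of_x(x, beta0, beta):
    """p(x; beta) obtained by solving logit(p) = beta0 + x * beta for p."""
    return inverse_logit(beta0 + x * beta)

# Hypothetical coefficients for a single scalar predictor.
beta0, beta = -1.0, 0.5

for x in (-10.0, 0.0, 2.0, 10.0):
    p = p_of_x(x, beta0, beta)
    # The probability stays inside (0, 1) while its logit is linear in x.
    print(x, p, logit(p))
```

The linear predictor is unbounded, yet every resulting probability lies strictly between 0 and 1, which is the point of the transformation.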


### 2008

- (Lin et al., 2008) ⇒ Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. (2008). “Trust region newton method for logistic regression.” In: The Journal of Machine Learning Research, 9.
- QUOTE: There are many methods for training logistic regression models. In fact, most unconstrained optimization techniques can be considered. Those which have been used in large-scale scenarios are, for example, iterative scaling (Darroch and Ratcliff, 1972; Pietra et al., 1997; Goodman, 2002; Jin et al., 2003), nonlinear conjugate gradient, quasi Newton (in particular, limited memory BFGS) (Liu and Nocedal, 1989; Benson and Moré, 2001), and truncated Newton (Komarek and Moore, 2005). All these optimization methods are iterative procedures, which generate a sequence [math]\displaystyle{ \{\mathbf{w}^k\}^\infty_{k=1} }[/math] converging to the optimal solution of (2). One can distinguish them according to the following two extreme situations of optimization methods: Low cost per iteration; slow convergence. ↔ High cost per iteration; fast convergence. ...

The logistic regression model is useful for two-class classification. Given data [math]\displaystyle{ \mathbf{x} }[/math] and weights [math]\displaystyle{ (\mathbf{w},b) }[/math], it assumes the following probability model: [math]\displaystyle{ P(y=\pm 1 \mid \mathbf{x},\mathbf{w}) = \frac{1}{1 + \exp(-y(\mathbf{w}^T \mathbf{x} + b))}, }[/math] where [math]\displaystyle{ y }[/math] is the class label. If training instances are [math]\displaystyle{ \mathbf{x}_i }[/math], [math]\displaystyle{ i=1,...,l }[/math] and labels are [math]\displaystyle{ y_i \in \{1,-1\}, }[/math] one estimates [math]\displaystyle{ (\mathbf{w}, b) }[/math] by minimizing the negative log-likelihood: [math]\displaystyle{ \min_{\mathbf{w},b} \sum_{i=1}^{l} \log (1 + e^{-y_i(\mathbf{w}^T \mathbf{x}_i + b)}) }[/math]
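The objective written with labels in {+1, -1} is the same quantity as the Bernoulli negative log-likelihood written with labels in {0, 1}. The sketch below checks this numerically; the helper names, weights, and data are hypothetical and not taken from the cited paper.

```python
import math

def dot(w, x):
    """Inner product w^T x for plain Python lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def loss_pm1(w, b, xs, ys_pm1):
    """sum_i log(1 + exp(-y_i (w^T x_i + b))) with labels y_i in {+1, -1}."""
    return sum(log1p_exp(-y * (dot(w, x) + b)) for x, y in zip(xs, ys_pm1))

def nll_01(w, b, xs, ys_01):
    """Bernoulli negative log-likelihood with labels y_i in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys_01):
        p = 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical instances, labels, and weights.
xs = [[0.5, 1.0], [1.5, -0.5], [2.0, 0.3], [3.0, -1.2]]
ys_pm1 = [-1, -1, 1, 1]
ys_01 = [(y + 1) // 2 for y in ys_pm1]  # maps -1 to 0 and +1 to 1
w, b = [0.8, -0.4], -1.0

print(loss_pm1(w, b, xs, ys_pm1), nll_01(w, b, xs, ys_01))
```

Any of the unconstrained optimizers named in the quote could be applied to `loss_pm1`; the choice mainly trades cost per iteration against convergence speed.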


### 2007

- (Sutton & McCallum, 2007) ⇒ Charles Sutton, and Andrew McCallum. (2007). “An Introduction to Conditional Random Fields for Relational Learning.” In: (Getoor & Taskar, 2007).
- QUOTE: Another well-known classifier that is naturally represented as a graphical model is **logistic regression** (sometimes known as the **maximum entropy classifier** in the NLP community). In statistics, this classifier is motivated by the assumption that the log probability, [math]\displaystyle{ \log p(y \vert x) }[/math], of each class is a linear function of x, plus a normalization constant. This leads to the conditional distribution ...


### 2006

- (Chu et al., 2006) ⇒ Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. (2006). “Map-Reduce for Machine Learning on Multicore.” In: Proceedings of the 19th International Conference on Neural Information Processing Systems.
- QUOTE: ... Logistic Regression (LR): For logistic regression (Pregibon, 1981), we choose the form of hypothesis as [math]\displaystyle{ h_\theta(x) = g(\theta^T x) = 1/(1 + \exp(-\theta^T x)) }[/math]. Learning is done by fitting [math]\displaystyle{ \theta }[/math] to the training data where the likelihood function can be optimized by using Newton-Raphson to update [math]\displaystyle{ \theta := \theta - H^{-1} \nabla_\theta \ell(\theta) }[/math]. [math]\displaystyle{ \nabla_\theta \ell(\theta) }[/math] is the gradient, which can be computed in parallel by mappers summing up [math]\displaystyle{ \sum_{\text{subgroup}} (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)} }[/math] each NR step i. The computation of the hessian matrix can be also written in a summation form of [math]\displaystyle{ H(j,k) := H(j,k) + h_\theta(x^{(i)}) (h_\theta(x^{(i)}) - 1) x_j^{(i)} x_k^{(i)} }[/math] for the mappers. The reducer will then sum up the values for gradient and hessian to perform the update for [math]\displaystyle{ \theta }[/math]. ...
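The summation form described in the quote can be sketched as follows. This is a single-process illustration, assuming two parameters (intercept and one feature) so that the 2x2 system can be solved directly; the function names, chunking, and toy data are hypothetical and do not reproduce the authors' implementation.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def map_partial_sums(theta, chunk):
    """Mapper: gradient and Hessian partial sums over one data chunk."""
    d = len(theta)
    grad = [0.0] * d
    hess = [[0.0] * d for _ in range(d)]
    for x, y in chunk:
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for j in range(d):
            grad[j] += (y - h) * x[j]
            for k in range(d):
                hess[j][k] += h * (h - 1.0) * x[j] * x[k]
    return grad, hess

def reduce_sums(partials):
    """Reducer: add up the mappers' partial gradients and Hessians."""
    d = len(partials[0][0])
    grad = [sum(p[0][j] for p in partials) for j in range(d)]
    hess = [[sum(p[1][j][k] for p in partials) for k in range(d)] for j in range(d)]
    return grad, hess

def solve_2x2(a, b):
    """Solve a z = b for a 2x2 matrix a by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def newton_raphson(chunks, steps=10):
    """Repeat theta := theta - H^{-1} grad using the reduced sums."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad, hess = reduce_sums([map_partial_sums(theta, c) for c in chunks])
        delta = solve_2x2(hess, grad)
        theta = [t - dl for t, dl in zip(theta, delta)]
    return theta

# Hypothetical data split into two chunks; each item is ([1, feature], label).
chunks = [
    [([1.0, 0.5], 0), ([1.0, 1.5], 0), ([1.0, 2.0], 1)],
    [([1.0, 3.0], 0), ([1.0, 3.5], 1), ([1.0, 4.5], 1)],
]

theta_hat = newton_raphson(chunks)
print(theta_hat)
```

Because the gradient and Hessian are plain sums over training instances, the result does not depend on how the data is partitioned among mappers.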


### 2004a

- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
- QUOTE: Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression … . Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3].

### 2004b

- (Hastie et al., 2004) ⇒ Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. (2004). “The Entire Regularization Path for the Support Vector Machine.” In: The Journal of Machine Learning Research, 5.
- QUOTE: There are many ways to fit such a linear classifier, including linear regression, Fisher’s linear discriminant analysis, and logistic regression

### 2000

- (Hosmer & Lemeshow, 2000) ⇒ David W. Hosmer, and Stanley Lemeshow. (2000). “Applied Logistic Regression, 2nd Edition." Wiley. ISBN:0471356328
- QUOTE: What distinguishes a **logistic regression model** from the linear regression model is that the outcome variable in **logistic regression** is *binary* or *dichotomous*. This difference between logistic and linear regression is reflected both in the choice of a parametric model and in the assumptions.


### 1978

- (Press & Wilson, 1978) ⇒ S. James Press, and Sandra Wilson. (1978). “Choosing Between Logistic Regression and Discriminant Analysis.” In: Journal of the American Statistical Association, 73(364). http://www.jstor.org/stable/2286261
- ABSTRACT: Classifying an observation into one of several populations is discriminant analysis, or classification. Relating qualitative variables to other variables through a logistic cdf functional form is logistic regression. Estimators generated for one of these problems are often used in the other. If the populations are normal with identical covariance matrices, discriminant analysis estimators are preferred to logistic regression estimators for the discriminant analysis problem. In most discriminant analysis applications, however, at least one variable is qualitative (ruling out multivariate normality). Under nonnormality, we prefer the logistic regression model with maximum likelihood estimators for solving both problems. In this article we summarize the related arguments, and report on our own supportive empirical studies.
- KEYWORDS: Logistic regression; Discriminant analysis; Qualitative variables; Classification.
- QUOTE: The rationale for a logistic formulation of the relationship between qualitative and other variables, rather than a normal (probit analysis), angular (such as arcsine), or other relationship, has been discussed extensively in the literature and is summarized in the excellent book by Cox (1970). We do not repeat it here. To provide additional support for the logistic formulation, however, we note that Anderson (1972) pointed out that it results from a wide variety of underlying assumptions about the explanatory variables. In particular, the logistic formulation results not only from assuming that the explanatory variables are multivariate normally distributed with equal covariance matrices, but also from assuming that the explanatory variables are independent and dichotomous zero-or-one variables, or that some are multivariate normal and some dichotomous. Thus, one advantage of using the logistic model for discriminant analysis (rather than a linear discriminant function) is that it is relatively robust; i.e., many types of underlying assumptions lead to the same logistic formulation. The linear discriminant analysis approach, by contrast, is strictly applicable only when the underlying variables are jointly normal with equal covariance matrices.

### 1970

- (Cox, 1970) ⇒ D. R. Cox. (1970). “The Analysis of Binary Data." Methuen & Co.

- ↑ Unless you've taken thermodynamics or physical chemistry, in which case you recognize that this is the Boltzmann distribution for a system with two states, which differ in energy by $\beta_0 + x\beta$ .