# Logistic Model Fitting Algorithm

## References

### 2020

• (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Logistic_regression#Model_fitting Retrieved:2020-9-6.
• Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable Y being 0 or 1 given experimental data.

Consider a generalized linear model function parameterized by $\theta$:

$$h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} = \Pr(Y=1 \mid X; \theta)$$

Therefore,

$$\Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X)$$

and since $Y \in \{0,1\}$, we see that $\Pr(y \mid X; \theta)$ is given by

$$\Pr(y \mid X; \theta) = h_\theta(X)^y (1 - h_\theta(X))^{(1-y)}.$$

We now calculate the likelihood function, assuming that all the observations in the sample are independently Bernoulli distributed:

$$\begin{aligned} L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\ &= \prod_i \Pr(y_i \mid x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)} \end{aligned}$$

Typically, it is the (normalized) log likelihood that is maximized,

$$N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)$$

using optimization techniques such as gradient descent (applied to the negative log likelihood, which is equivalent to gradient ascent on the log likelihood itself).
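The fitting step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the toy data set `XS`/`YS`, the learning rate `lr`, and the step count `steps` are all made up for the example, and plain gradient ascent stands in for whatever optimizer is used in practice. The first entry of each `x` is a constant 1 so that `theta[0]` acts as an intercept.

```python
import math

def softplus(z):
    """log(1 + e^z), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """h = 1 / (1 + e^{-z}), written via softplus for numerical stability."""
    return math.exp(-softplus(-z))

def dot(theta, x):
    return sum(t * xj for t, xj in zip(theta, x))

def mean_log_likelihood(theta, xs, ys):
    """N^{-1} * sum_i log Pr(y_i | x_i; theta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = dot(theta, x)
        # log h(x) = -softplus(-z) and log(1 - h(x)) = -softplus(z)
        total += -y * softplus(-z) - (1 - y) * softplus(z)
    return total / len(xs)

def gradient(theta, xs, ys):
    """Gradient of the mean log likelihood: N^{-1} * sum_i (y_i - h(x_i)) * x_i."""
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = y - sigmoid(dot(theta, x))
        for j, xj in enumerate(x):
            g[j] += residual * xj
    return [gj / len(xs) for gj in g]

def fit(xs, ys, lr=0.5, steps=5000):
    """Gradient ascent on the mean log likelihood, starting from theta = 0."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(theta, xs, ys)
        theta = [t + lr * gj for t, gj in zip(theta, g)]
    return theta

# Hypothetical, non-separable toy data: [1, x] rows and binary labels.
XS = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
YS = [0, 0, 1, 0, 1, 1]

theta_hat = fit(XS, YS)
print("theta_hat:", theta_hat)
print("mean log likelihood:", mean_log_likelihood(theta_hat, XS, YS))
```

The labels are deliberately not linearly separable; with perfectly separable data the maximum-likelihood estimate does not exist as a finite vector and the iterates would keep growing.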

Assuming the $(x, y)$ pairs are drawn uniformly from the underlying distribution, then in the limit of large $N$,

$$\begin{aligned} \lim_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt] &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt] &= - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X) \end{aligned}$$

where $H(Y \mid X)$ is the conditional entropy and $D_\text{KL}$ is the Kullback–Leibler divergence. Because $H(Y \mid X)$ does not depend on $\theta$, maximizing the log-likelihood is equivalent to minimizing the KL divergence term. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution, that is, intuitively searching for the model that makes the fewest assumptions in its parameters.
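The decomposition above can be checked numerically on a small discrete example. The sketch below is only an illustration: the distribution `p_x`, the true conditional probabilities `p_y1_given_x`, and the parameter vector `theta` are hypothetical values chosen for the example. It computes the expected log likelihood, the KL term, and the conditional entropy as separate sums and compares the two sides of the identity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical discrete setting: X takes three values, Y is binary.
p_x = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}            # Pr(X = x)
p_y1_given_x = {0.0: 0.1, 1.0: 0.6, 2.0: 0.8}   # true Pr(Y = 1 | X = x)
theta = (-1.0, 1.5)                              # arbitrary model parameters

def model_prob(y, x):
    """Pr(Y = y | X = x; theta) under the logistic model."""
    h = sigmoid(theta[0] + theta[1] * x)
    return h if y == 1 else 1.0 - h

def true_prob(y, x):
    """Pr(Y = y | X = x) under the data-generating distribution."""
    q = p_y1_given_x[x]
    return q if y == 1 else 1.0 - q

expected_ll = 0.0    # sum_x sum_y Pr(x, y) log Pr(y | x; theta)
kl = 0.0             # D_KL(Y || Y_theta), averaged over X
cond_entropy = 0.0   # H(Y | X)

for x, px in p_x.items():
    for y in (0, 1):
        joint = px * true_prob(y, x)
        expected_ll += joint * math.log(model_prob(y, x))
        kl += joint * math.log(true_prob(y, x) / model_prob(y, x))
        cond_entropy -= joint * math.log(true_prob(y, x))

print("expected log likelihood:", expected_ll)
print("-KL - H(Y|X):           ", -kl - cond_entropy)
```

Only `kl` changes when `theta` changes, which is why maximizing the expected log likelihood and minimizing the KL term pick out the same parameters.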

### 2015

• (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/logistic_regression Retrieved:2015-5-13.
• In statistics, logistic regression, or logit regression, or logit model[1] is a direct probability model that was developed by statistician D. R. Cox in 1958[2] [3] although much work was done in the single independent variable case almost two decades earlier. The binary logistic model is used to predict a binary response based on one or more predictor variables (features). That is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and hereafter in this article) "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary—that is, the number of available categories is two—while problems with more than two categories are referred to as multinomial logistic regression or polytomous logistic regression, or, if the multiple categories are ordered, as ordinal logistic regression.

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by estimating probabilities. Thus, it treats the same set of problems as does probit regression using similar techniques; the first assumes a logistic function and the second a standard normal distribution function.
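To make the logit/probit contrast concrete, the following sketch evaluates the two response functions side by side: the logistic function and the standard normal distribution function. The function names and the grid of `z` values are illustrative choices, not part of the quoted source.

```python
import math

def logistic_cdf(z):
    """Logistic function, the response function assumed by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

def std_normal_cdf(z):
    """Standard normal distribution function, as assumed by probit regression."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z={z:+.1f}  logistic={logistic_cdf(z):.4f}  normal={std_normal_cdf(z):.4f}")
```

Both curves are S-shaped and pass through 0.5 at zero; on this unscaled grid the logistic curve approaches 0 and 1 more slowly, reflecting its heavier tails.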

Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution $p(y \mid x)$ is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the estimated probabilities are restricted to [0,1] through the logistic distribution function because logistic regression predicts the probability of the instance being positive.

Logistic regression is an alternative to Fisher's 1936 classification method, linear discriminant analysis. If the assumptions of linear discriminant analysis hold, application of Bayes' rule to reverse the conditioning results in the logistic model, so if linear discriminant assumptions are true, logistic regression assumptions must hold. The converse is not true, so the logistic model has fewer assumptions than discriminant analysis and makes no assumption on the distribution of the independent variables.
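The claim that the discriminant-analysis assumptions imply the logistic model can be illustrated in one dimension. The sketch below assumes two Gaussian class-conditional densities with a shared standard deviation; the means, standard deviation, and class priors are hypothetical values. It computes the posterior once by Bayes' rule and once as a logistic function of a linear predictor, with coefficients `B0` and `B1` derived from the Gaussian parameters.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical class-conditional Gaussians with a shared standard deviation.
MU0, MU1, SIGMA = -1.0, 2.0, 1.5
PI0, PI1 = 0.7, 0.3   # class priors Pr(Y = 0), Pr(Y = 1)

def posterior_bayes(x):
    """Pr(Y = 1 | x) obtained by applying Bayes' rule to the Gaussian model."""
    a = PI1 * normal_pdf(x, MU1, SIGMA)
    b = PI0 * normal_pdf(x, MU0, SIGMA)
    return a / (a + b)

# The log posterior odds are linear in x when the variances are equal:
# log(a / b) = B0 + B1 * x
B1 = (MU1 - MU0) / SIGMA ** 2
B0 = math.log(PI1 / PI0) - (MU1 ** 2 - MU0 ** 2) / (2.0 * SIGMA ** 2)

def posterior_logistic(x):
    """The same posterior written as a logistic function of a linear predictor."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x)))

for x in (-3.0, -1.0, 0.0, 0.5, 2.0, 4.0):
    print(x, posterior_bayes(x), posterior_logistic(x))
```

The two columns agree up to floating-point rounding. The reverse direction does not hold: a logistic posterior places no constraint on how $x$ itself is distributed.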

1. Tolles, Juliana; Meurer, William J (2016). "Logistic Regression Relating Patient Characteristics to Outcomes". JAMA. 316 (5): 533–4. doi:10.1001/jama.2016.7653. ISSN 0098-7484. OCLC 6823603312. PMID 27483067.
2. Cox, David R. (1958). "The regression analysis of binary sequences (with discussion)". J R Stat Soc B. 20 (2): 215–242. JSTOR 2983890.
3. Walker, SH; Duncan, DB (1967). "Estimation of the probability of an event as a function of several independent variables". Biometrika. 54 (1/2): 167–178. doi:10.2307/2333860. JSTOR 2333860.

### 2011

• (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Logistic_regression
• … Logistic regression analyzes binomially distributed data of the form $Y_i \sim B(n_i,p_i),\text{ for }i = 1, \dots , m,$ where the numbers of Bernoulli trials $n_i$ are known and the probabilities of success $p_i$ are unknown. An example of this distribution is the fraction of seeds ($p_i$) that germinate after $n_i$ are planted.

The model proposes that for each trial $i$ there is a set of explanatory variables that might inform the final probability. These explanatory variables can be thought of as being in a $k$-dimensional vector $X_i$, and the model then takes the form $p_i = \operatorname{E}\left(\left.\frac{Y_i}{n_i}\right|X_i \right).$

The logits, natural logs of the odds, of the unknown binomial probabilities are modeled as a linear function of the $X_i$: $\operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}.$ ...
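As a small worked illustration of the logit link, the sketch below maps a linear predictor to a success probability and back, and then computes the expected number of successes $n_i p_i$ for each group. The coefficients in `BETA`, the covariate values in `X`, and the trial counts in `N_TRIALS` are hypothetical numbers in the spirit of the seed-germination example; a single explanatory variable ($k = 1$) is assumed.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

BETA = [-1.0, 0.8]          # hypothetical beta_0, beta_1
X = [0.0, 1.0, 2.5]         # hypothetical x_{1,i}, one value per group i
N_TRIALS = [20, 20, 40]     # hypothetical n_i, e.g. seeds planted per group

etas = [BETA[0] + BETA[1] * x for x in X]               # beta_0 + beta_1 * x_{1,i}
probs = [inv_logit(eta) for eta in etas]                # p_i
expected = [n * p for n, p in zip(N_TRIALS, probs)]     # E[Y_i | X_i] = n_i * p_i

for x, eta, p, n, e in zip(X, etas, probs, N_TRIALS, expected):
    print(f"x={x}  logit(p)={eta:+.2f}  p={p:.3f}  n={n}  expected successes={e:.2f}")
```

Applying `logit` to each entry of `probs` recovers the corresponding entry of `etas`, which is exactly the linear relationship stated in the model.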

### 1970

• (Cox, 1970) ⇒ D. R. Cox. (1970). “The Analysis of Binary Data.” Methuen & Co.
