2007 AComparisonofNumericalOptimizer


Subject Headings: MAP Algorithm.

Notes

Cited By

Quotes

Abstract

Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum a-posteriori parameter estimate. A full derivation of each algorithm is given. In particular, a new derivation of Iterative Scaling is given which applies more generally than the conventional one. A new derivation is also given for the Modified Iterative Scaling algorithm of Collins et al. (2002). Most of the algorithms operate in the primal space, but can also work in dual space. All algorithms are compared in terms of computational complexity by experiments on large data sets. The fastest algorithms turn out to be conjugate gradient ascent and quasi-Newton algorithms, which far outstrip Iterative Scaling and its variants.

1 Introduction

The logistic regression model is :[math]\displaystyle{ p(y = \pm 1 \mid \mathbf{x},\mathbf{w}) = \sigma(y\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-y\mathbf{w}^T\mathbf{x})} \ \ (1) }[/math]

It can be used for binary classification or for predicting the certainty of a binary outcome. See Cox & Snell (1970) for the use of this model in statistics. This note focuses only on computational issues related to maximum-likelihood or more generally maximum a-posteriori (MAP) estimation. A common prior to use with MAP is:

[math]\displaystyle{ \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I}) \ \ (2) }[/math]

Using [math]\displaystyle{ \lambda \gt 0 }[/math] gives a “regularized” estimate of [math]\displaystyle{ \mathbf{w} }[/math] which often has superior generalization performance, especially when the dimensionality is high (Nigam et al., 1999).
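To make the model and prior concrete, here is a minimal NumPy sketch of (1) and (2); the names sigmoid, predict_proba, and log_prior are illustrative choices, not from the paper:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(t):
    # Logistic function sigma(t) = 1 / (1 + exp(-t)); a numerically stable
    # implementation such as scipy.special.expit is preferable in practice.
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(w, x, y):
    # Model (1): p(y | x, w) = sigma(y * w^T x) for a label y in {-1, +1}.
    return sigmoid(y * np.dot(w, x))

def log_prior(w, lam):
    # Log-density of the prior (2), w ~ N(0, lam^{-1} I), up to an additive constant.
    return -0.5 * lam * np.dot(w, w)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])
print(predict_proba(w, x, +1) + predict_proba(w, x, -1))  # the two label probabilities sum to 1
</syntaxhighlight>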

Given a data set [math]\displaystyle{ (\mathbf{X}, \mathbf{y}) = [(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)] }[/math], we want to find the parameter vector [math]\displaystyle{ \mathbf{w} }[/math] which maximizes:

[math]\displaystyle{ l(\mathbf{w}) = -\sum_{i=1}^{N} \log(1 + \exp(-y_i\mathbf{w}^T\mathbf{x}_i)) - \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \ \ (3) }[/math]

The gradient of this objective is

[math]\displaystyle{ \mathbf{g} = \nabla_\mathbf{w} l(\mathbf{w}) = \sum_i (1 - \sigma(y_i\mathbf{w}^T\mathbf{x}_i))y_i\mathbf{x}_i - \lambda\mathbf{w} \ \ (4) }[/math]

Gradient ascent using (4) resembles the Perceptron learning algorithm, except that it will always converge for a suitable step size, regardless of whether the classes are separable.
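As an illustration of (3) and (4), the following NumPy sketch runs plain gradient ascent with a fixed step size on synthetic data; the layout (one example per column of X) and all names are assumptions made for the example:

<syntaxhighlight lang="python">
import numpy as np

def log_posterior(w, X, y, lam):
    # Objective (3): -sum_i log(1 + exp(-y_i w^T x_i)) - (lam/2) w^T w.
    # X is d x N with one example per column; y holds labels in {-1, +1}.
    margins = y * (w @ X)
    return -np.sum(np.log1p(np.exp(-margins))) - 0.5 * lam * np.dot(w, w)

def gradient(w, X, y, lam):
    # Gradient (4): sum_i (1 - sigma(y_i w^T x_i)) y_i x_i - lam w.
    margins = y * (w @ X)
    coeff = 1.0 - 1.0 / (1.0 + np.exp(-margins))
    return X @ (coeff * y) - lam * w

# Plain gradient ascent with a fixed step size, on synthetic data.
rng = np.random.default_rng(0)
d, N, lam = 5, 100, 0.1
X = rng.normal(size=(d, N))
y = np.sign(rng.normal(size=N))
w = np.zeros(d)
for _ in range(200):
    w += 0.01 * gradient(w, X, y, lam)
print(log_posterior(w, X, y, lam))
</syntaxhighlight>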

The Hessian of the objective is

[math]\displaystyle{ \mathbf{H} = \frac{\partial^2 l(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T} = -\sum_i \sigma(\mathbf{w}^T\mathbf{x}_i)(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))\mathbf{x}_i\mathbf{x}_i^T - \lambda\mathbf{I} \ \ (5) }[/math]

which in matrix form can be written

[math]\displaystyle{ a_{ii} = \sigma(\mathbf{w}^T\mathbf{x}_i)(1 - \sigma(\mathbf{w}^T\mathbf{x}_i)) \ \ (6) }[/math]
[math]\displaystyle{ \mathbf{H} = -\mathbf{XAX}^T - \lambda \mathbf{I} \ \ (7) }[/math]

Note that the Hessian does not depend on how the x’s are labeled. It is nonpositive definite, which means l(w) is concave.
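A short sketch of (5)–(7), under the same assumed layout (one example per column of X), which also checks numerically that the Hessian is negative definite when λ > 0:

<syntaxhighlight lang="python">
import numpy as np

def hessian(w, X, lam):
    # Hessian (5)-(7): H = -X A X^T - lam I, with a_ii = sigma(w^T x_i)(1 - sigma(w^T x_i)).
    # Note that the labels y_i do not appear anywhere.
    s = 1.0 / (1.0 + np.exp(-(w @ X)))   # sigma(w^T x_i) for each column x_i
    a = s * (1.0 - s)                    # diagonal of A, equation (6)
    return -(X * a) @ X.T - lam * np.eye(len(w))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
H = hessian(np.zeros(3), X, lam=0.1)
print(np.linalg.eigvalsh(H).max())       # <= -lam < 0: H is negative definite, so l(w) is concave
</syntaxhighlight>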

For sufficiently large N, the posterior for w will be approximately Gaussian:

[math]\displaystyle{ p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \approx \mathcal{N}(\mathbf{w};\, \hat{\mathbf{w}},\, -\mathbf{H}^{-1}) \ \ (8) }[/math]
[math]\displaystyle{ \hat{\mathbf{w}} = \arg\max_\mathbf{w}\, p(\mathbf{w}) \prod_i p(y_i \mid \mathbf{x}_i, \mathbf{w}) \ \ (9) }[/math]

So the model likelihood can be approximated by :[math]\displaystyle{ p(\mathbf{y} \mid \mathbf{X}) \approx p(\mathbf{y} \mid \mathbf{X}, \hat{\mathbf{w}})\, p(\hat{\mathbf{w}})\, (2\pi)^{d/2} \left|-\mathbf{H}\right|^{-1/2} \ \ (10) }[/math] where d is the dimensionality of w.
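In log form, approximation (10) could be computed as in the sketch below; the helper name and inputs (the MAP estimate from (9) and the Hessian from (5)) are assumptions made for illustration:

<syntaxhighlight lang="python">
import numpy as np

def log_evidence_laplace(w_hat, H, X, y, lam):
    # Log of approximation (10):
    # log p(y|X) ~ log p(y|X, w_hat) + log p(w_hat) + (d/2) log(2 pi) - (1/2) log|-H|.
    d = len(w_hat)
    margins = y * (w_hat @ X)
    log_lik = -np.sum(np.log1p(np.exp(-margins)))        # log p(y | X, w_hat)
    log_prior = 0.5 * d * np.log(lam / (2 * np.pi)) - 0.5 * lam * np.dot(w_hat, w_hat)
    _, logdet_negH = np.linalg.slogdet(-H)               # log |-H|; H is negative definite
    return log_lik + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet_negH
</syntaxhighlight>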

2 Newton’s method

If we define :[math]\displaystyle{ z_i = \mathbf{x}_i^T\mathbf{w}_{\text{old}} + \frac{(1 - \sigma(y_i\mathbf{w}^T\mathbf{x}_i))y_i}{a_{ii}} \ \ (11) }[/math] then a Newton step is :[math]\displaystyle{ \mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} + (\mathbf{XAX}^T + \lambda \mathbf{I})^{-1} \left( \sum_i (1 - \sigma(y_i\mathbf{w}^T\mathbf{x}_i))y_i\mathbf{x}_i - \lambda\mathbf{w} \right) \ \ (12) }[/math]

[math]\displaystyle{ = (\mathbf{XAX}^T + \lambda\mathbf{I})^{-1}\mathbf{XA}\left(\mathbf{X}^T\mathbf{w}_{\text{old}} + \left[\frac{(1 - \sigma(y_i\mathbf{w}^T\mathbf{x}_i))y_i}{a_{ii}}\right]_i\right) \ \ (13) }[/math]
[math]\displaystyle{ = (\mathbf{XAX}^T + \lambda\mathbf{I})^{-1}\mathbf{XAz} \ \ (14) }[/math]
which is the solution to a [[weighted least squares problem]]. This efficient implementation of [[Newton’s method]] is called [[Iteratively Reweighted Least Squares]]. It takes [math]\displaystyle{ O(nd^2) }[/math] time per iteration.
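A minimal sketch of one IRLS update, implementing (11)–(14) with a dense solve of the weighted least squares system; the data layout (d x N matrix X, labels in {-1, +1}) and names are illustrative assumptions:

<syntaxhighlight lang="python">
import numpy as np

def irls_step(w_old, X, y, lam):
    # One Newton/IRLS update following (11)-(14). X is d x N, y in {-1, +1}.
    s_y = 1.0 / (1.0 + np.exp(-y * (w_old @ X)))   # sigma(y_i w^T x_i)
    s = 1.0 / (1.0 + np.exp(-(w_old @ X)))         # sigma(w^T x_i)
    a = s * (1.0 - s)                              # a_ii, equation (6)
    z = X.T @ w_old + (1.0 - s_y) * y / a          # working response z, equation (11)
    XA = X * a                                     # columns of X scaled by a_ii
    # Solve the weighted least squares system (X A X^T + lam I) w_new = X A z, equation (14).
    return np.linalg.solve(XA @ X.T + lam * np.eye(len(w_old)), XA @ z)

# A few IRLS iterations on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 200))
y = np.sign(rng.normal(size=4) @ X + 0.1 * rng.normal(size=200))
w = np.zeros(4)
for _ in range(10):
    w = irls_step(w, X, y, lam=0.1)
print(w)
</syntaxhighlight>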

3 Coordinate ascent

Because the objective is concave, we can also optimize each [math]\displaystyle{ w_k }[/math] alternately. A coordinate-wise Newton update is …
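The exact update is elided above; as a generic illustration only (not necessarily the paper’s rule), one Newton step in coordinate k can use the k-th component of the gradient (4) and the k-th diagonal entry of the Hessian (5), as in this sketch:

<syntaxhighlight lang="python">
import numpy as np

def coordinate_newton_pass(w, X, y, lam):
    # One sweep over the coordinates: for each k, take a Newton step
    # w_k <- w_k - g_k / H_kk, using the k-th gradient component from (4)
    # and the k-th diagonal Hessian entry from (5).
    w = w.astype(float).copy()
    for k in range(len(w)):
        s_y = 1.0 / (1.0 + np.exp(-y * (w @ X)))            # sigma(y_i w^T x_i)
        s = 1.0 / (1.0 + np.exp(-(w @ X)))                  # sigma(w^T x_i)
        g_k = np.sum((1.0 - s_y) * y * X[k]) - lam * w[k]   # gradient component, from (4)
        h_kk = -np.sum(s * (1.0 - s) * X[k] ** 2) - lam     # diagonal Hessian entry, from (5)
        w[k] -= g_k / h_kk                                  # h_kk < 0, so this moves uphill
    return w
</syntaxhighlight>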

References


Thomas P. Minka (2007). “A Comparison of Numerical Optimizers for Logistic Regression.”