# 2008 TrustRegionNewtonMethForLR

- (Lin et al., 2008) ⇒ Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. (2008). “Trust Region Newton Method for Logistic Regression.” In: The Journal of Machine Learning Research, 9.

**Subject Headings:** Logistic Regression, Newton Method, Conjugate Gradient, Support Vector Machine.

## Notes

- It focuses on Learning Record Sets with:
    - a large number of Learning Records.
    - a large number of Predictor Features.
    - a large number of Sparse Learning Records.

- It proposes the use of a Trust Region Newton Method as the Optimization Algorithm.

## Cited By

- ~68 …

## Quotes

### Author Keywords

logistic regression, newton method, trust region, conjugate gradient, support vector machines

### Abstract

Large-scale logistic regression arises in many applications such as document classification and natural language processing. In this paper, we apply a trust region Newton method to maximize the log-likelihood of the logistic regression model. The proposed method uses only approximate Newton steps in the beginning, but achieves fast convergence in the end. Experiments show that it is faster than the commonly used quasi Newton approach for logistic regression. We also extend the proposed method to large-scale L2-loss linear support vector machines (SVM).

### 1. Introduction

The logistic regression model is useful for two-class classification. Given data [math]\displaystyle{ \mathbf{x} }[/math] and weights [math]\displaystyle{ (\mathbf{w},b) }[/math], it assumes the following probability model

- [math]\displaystyle{ P(y=\pm 1 \mid \mathbf{x},\mathbf{w}) = \frac{1}{1 + \exp(-y(\mathbf{w}^T \mathbf{x} + b))}, }[/math]

where [math]\displaystyle{ y }[/math] is the class label. If training instances are [math]\displaystyle{ \mathbf{x}_i }[/math], [math]\displaystyle{ i=1,\ldots,l }[/math] and labels are [math]\displaystyle{ y_i \in \{1,-1\}, }[/math] one estimates [math]\displaystyle{ (\mathbf{w}, b) }[/math] by minimizing the negative log-likelihood:

- [math]\displaystyle{ \min_{\mathbf{w},b} \sum_{i=1}^{l} \log \left(1 + e^{-y_i(\mathbf{w}^T \mathbf{x}_i + b)}\right) }[/math]
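This negative log-likelihood is simple to compute directly. A minimal numeric sketch (function names and the toy data are my own, and the unregularized form above is used):

```python
import numpy as np

def neg_log_likelihood(w, b, X, y):
    """Negative log-likelihood of the logistic model.

    X : (l, n) data matrix; y : (l,) labels in {+1, -1}.
    """
    z = y * (X @ w + b)
    # log(1 + exp(-z)) computed stably as logaddexp(0, -z)
    return np.sum(np.logaddexp(0.0, -z))

# Toy check: on perfectly separated points the loss shrinks as w grows.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
print(neg_log_likelihood(np.array([5.0]), 0.0, X, y))  # close to 0
print(neg_log_likelihood(np.zeros(1), 0.0, X, y))      # = 2 log 2
```

Note that without a regularization term the minimizer need not exist for separable data (the loss decreases as [math]\displaystyle{ \|\mathbf{w}\| \to \infty }[/math]), which is one reason the paper works with a regularized objective.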

…

…

There are many methods for training logistic regression models. In fact, most unconstrained optimization techniques can be considered. Those which have been used in large-scale scenarios are, for example, iterative scaling (Darroch and Ratcliff, 1972; Pietra et al., 1997; Goodman, 2002; Jin et al., 2003), nonlinear conjugate gradient, quasi Newton (in particular, limited memory BFGS) (Liu and Nocedal, 1989; Benson and Moré, 2001), and truncated Newton (Komarek and Moore, 2005). All these optimization methods are iterative procedures, which generate a sequence [math]\displaystyle{ \{\mathbf{w}^k\}^\infty_{k=1} }[/math] converging to the optimal solution of (2). One can distinguish them according to the following two extreme situations:

Low cost per iteration; slow convergence. ↔ High cost per iteration; fast convergence.

For instance, iterative scaling updates one component of [math]\displaystyle{ \mathbf{w} }[/math] at a time, so the cost per iteration is low but the number of iterations is high. In contrast, the Newton method, which is expensive at each iteration, has very fast convergence rates. Many have attempted to compare these methods for logistic regression. Minka (2003) experiments with small data sets, and Malouf (2002) has done an extensive comparison for large-scale sets. Currently, most argue that the limited memory BFGS method is the most efficient and effective (e.g., Malouf, 2002; Sutton and McCallum, 2006, and references therein). In this article, we aim at situations for which both [math]\displaystyle{ l }[/math] (number of instances) and [math]\displaystyle{ n }[/math] (number of features) are very large. In addition, the data instances [math]\displaystyle{ \mathbf{x}_1, \ldots, \mathbf{x}_l }[/math] are sparse (i.e., many feature values are zero). Many recent applications from document classification and computational linguistics are of this type.
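The appeal of a truncated (or trust region) Newton method in this large sparse setting is that the Newton direction can be found approximately by conjugate gradient using only Hessian-vector products, so the [math]\displaystyle{ n \times n }[/math] Hessian is never formed. The sketch below illustrates that inner CG loop for the L2-regularized logistic objective; it is a generic illustration in the spirit of the paper, not the paper's exact TRON algorithm (which adds the trust-region safeguard), the bias term is omitted, and all function names are my own:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(w, X, y, C):
    # f(w) = 0.5 w.w + C * sum_i log(1 + exp(-y_i w.x_i))
    z = y * (X @ w)
    return w - C * (X.T @ (y * sigmoid(-z)))

def hess_vec(w, X, y, C, s):
    # Hessian-vector product (I + C X^T D X) s with D_ii = sigma(z_i)(1 - sigma(z_i));
    # only two matrix-vector products with (sparse) X, the n x n Hessian is never formed.
    z = y * (X @ w)
    sig = sigmoid(z)
    return s + C * (X.T @ ((sig * (1.0 - sig)) * (X @ s)))

def cg_newton_direction(w, X, y, C=1.0, max_iter=50, tol=1e-10):
    # Approximately solve H s = -g by conjugate gradient: the inner loop
    # of a truncated Newton method.
    g = grad(w, X, y, C)
    s = np.zeros_like(w)
    r = -g
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Hd = hess_vec(w, X, y, C, d)
        alpha = rs / (d @ Hd)
        s += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return s

# Usage sketch: a few full Newton iterations on synthetic separable data.
rng = np.random.default_rng(0)
Xt = rng.standard_normal((80, 4))
yt = np.sign(Xt @ np.array([1.5, -2.0, 0.5, 3.0]))
w = np.zeros(4)
for _ in range(6):
    w = w + cg_newton_direction(w, Xt, yt)
print(np.linalg.norm(grad(w, Xt, yt, 1.0)))  # much smaller than at w = 0
```

Because each `hess_vec` call costs only two passes over the nonzeros of X, the per-iteration cost stays close to that of a gradient evaluation, which is exactly the trade-off between the two extremes described above.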

…
