# Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods (2006)

## Quotes

### Abstract

We apply Stochastic Meta-Descent (SMD), a stochastic gradient optimization method with gain vector adaptation, to the training of Conditional Random Fields (CRFs). On several large data sets, the resulting optimizer converges to the same quality of solution over an order of magnitude faster than limited-memory BFGS, the leading method reported to date. We report results for both exact and inexact inference techniques.

### 1. Introduction

Conditional Random Fields (CRFs) have recently gained popularity in the machine learning community (Lafferty et al., 2001; Sha & Pereira, 2003; Kumar & Hebert, 2004). Current training methods for CRFs[1] include generalized iterative scaling (GIS), conjugate gradient (CG), and limited-memory BFGS. These are all batch-only algorithms that do not work well in an online setting, and require many passes through the training data to converge. This currently limits the scalability and applicability of CRFs to large real-world problems. In addition, for many graph structures with large tree-width, such as 2D lattices, computing the exact gradient is intractable. Various approximate inference methods can be employed, but these cause many optimizers to break.

Stochastic gradient methods, on the other hand, are online and scale sub-linearly with the amount of training data, making them very attractive for large data sets; empirically we have also found them more resilient to errors made when approximating the gradient. Unfortunately their asymptotic convergence to the optimum is often painfully slow. Gain adaptation methods like Stochastic Meta-Descent (SMD) accelerate this process by using second-order information to adapt the gradient step sizes (Schraudolph, 1999, 2002). Key to SMD’s efficiency is the implicit computation of fast Hessian-vector products (Pearlmutter, 1994; Griewank, 2000).
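The fast Hessian-vector products mentioned above can be computed exactly, at roughly the cost of two gradient evaluations, via Pearlmutter's (1994) R-operator. The idea can be illustrated numerically with a central-difference approximation; the toy quadratic objective below (matrix `A`, step `eps`) is an invented example, not anything from the paper:

```python
import numpy as np

# Toy quadratic f(theta) = 0.5 * theta^T A theta, so its Hessian is A
# and its gradient is A theta. A is an arbitrary symmetric example matrix.
A = np.array([[3.0, 1.0], [1.0, 2.0]])

def grad(theta):
    return A @ theta

def hess_vec(theta, v, eps=1e-5):
    # H v ~= (grad(theta + eps v) - grad(theta - eps v)) / (2 eps);
    # this central difference costs two gradient evaluations, mirroring
    # the cost of the exact R-operator computation.
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

theta = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
# For a quadratic the approximation matches A @ v up to floating-point error.
print(hess_vec(theta, v))
```

For a quadratic the gradient is linear, so the central difference is exact up to round-off; for general objectives the R-operator gives the exact product without choosing an `eps`.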

In this paper we marry the above two techniques and show how SMD can be used to significantly accelerate the training of CRFs. The rest of the paper is organized as follows: Section 2 gives a brief overview of CRFs while Section 3 introduces stochastic gradient methods. We present experimental results for 1D chain CRFs in Section 4, and 2D lattice CRFs in Section 5. We conclude with a discussion in Section 6.

### 2. Conditional Random Fields (CRFs)

CRFs are a probabilistic framework for labeling and segmenting data. Unlike Hidden Markov Models (HMMs) and Markov Random Fields (MRFs), which model the joint density P(X, Y) over inputs X and labels Y, CRFs directly model P(Y|x) for a given input sample x. Furthermore, instead of maintaining a per-state normalization, which leads to the so-called label bias problem (Lafferty et al., 2001), CRFs utilize a global normalization which allows them to take long-range interactions into account.

We now introduce exponential families, and describe CRFs as conditional models in the exponential family.

...

In this section we describe stochastic gradient descent and discuss how its convergence can be improved by gain vector adaptation via the Stochastic Meta-Descent (SMD) algorithm (Schraudolph, 1999, 2002).

#### 3.1. Stochastic Approximation of Gradients

Since the log-likelihood (7) is summed over a potentially large number $m$ of data points, we approximate it by subsampling batches of $b \ll m$ points:

$L(\theta) \approx \sum_{t=0}^{m/b-1} L_b(\theta, t)$, where (16)

$L_b(\theta, t) = \frac{b\,\|\theta_t\|^2}{2m\sigma^2} - \sum_{i=1}^{b} \left[\langle \phi(x_{bt+i}, y_{bt+i}), \theta_t \rangle - z(\theta_t|x_{bt+i})\right].$ (17)

Note that for $\theta_t =$ const. (16) would be exact. We will, however, interleave an optimization step that modifies $\theta$ with each evaluation of $L_b(\theta, t)$, resp. its gradient

$g_t := \frac{\partial}{\partial\theta} L_b(\theta, t).$ (18)

The batch size $b$ controls the stochasticity of the approximation. At one extreme, $b = m$ recovers the conventional deterministic algorithm; at the other, $b = 1$ adapts $\theta$ fully online, based on individual data samples. Typically small batches of data ($5 \le b \le 20$) are found to be computationally most efficient.
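The interleaving of one parameter update per minibatch term, as in (16)–(18), can be sketched as follows. This is a hedged illustration only: a toy regularized least-squares loss stands in for the CRF negative log-likelihood, and the data, step size `eta`, and all variable names are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
m, b, sigma2 = 100, 10, 1.0          # m data points, batches of size b
X = rng.normal(size=(m, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=m)

def batch_grad(theta, t):
    # Gradient of the t-th minibatch term: each of the m/b terms carries
    # a b/m share of the L2 regularizer, cf. the b/(2m sigma^2) factor
    # in Eq. (17), plus the data-fit gradient for its b points.
    Xb, yb = X[t * b:(t + 1) * b], y[t * b:(t + 1) * b]
    return (b / (m * sigma2)) * theta + Xb.T @ (Xb @ theta - yb)

theta = np.zeros(3)
eta = 0.01                            # fixed scalar gain, for simplicity
for epoch in range(50):
    for t in range(m // b):           # one update interleaved per batch
        theta -= eta * batch_grad(theta, t)
```

Setting `b = m` collapses the inner loop to one deterministic full-gradient step per epoch; `b = 1` updates fully online, one sample at a time.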

Unfortunately most advanced gradient methods do not tolerate the sampling noise inherent in stochastic approximation: it collapses conjugate search directions (Schraudolph & Graepel, 2003) and confuses the line searches that both conjugate gradient and quasi-Newton methods depend upon. Full second-order methods are unattractive here because the computational cost of inverting the Hessian is better amortized over a large data set.

This leaves plain first-order gradient descent. Though this can be very slow to converge, the speed-up gained by stochastic approximation dominates on large, redundant data sets, making this strategy more efficient overall than even sophisticated deterministic methods. The convergence of stochastic gradient descent can be further improved by gain vector adaptation.
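A minimal sketch of gain vector adaptation in the style of SMD (Schraudolph, 1999, 2002) follows, run on a hypothetical ill-conditioned quadratic rather than a CRF; the meta-parameters `mu` and `lam` and all constants are illustrative choices, not values from the paper:

```python
import numpy as np

# Ill-conditioned quadratic: curvature 10 along one axis, 1 along the other,
# so a single scalar step size would be a poor fit for both directions.
A = np.array([[10.0, 0.0], [0.0, 1.0]])

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
eta = np.full(2, 0.02)    # per-parameter gain vector
v = np.zeros(2)           # auxiliary vector (long-term step-size dependence)
mu, lam = 0.05, 0.99      # meta-gain and decay factor

for _ in range(200):
    g = grad(theta)
    # Grow a gain when recent steps and the current gradient agree,
    # shrink it (by at most half) when they conflict.
    eta = eta * np.maximum(0.5, 1.0 - mu * g * v)
    theta = theta - eta * g               # elementwise-scaled descent step
    Hv = A @ v                            # Hessian-vector product (exact here)
    v = lam * v - eta * (g + lam * Hv)    # update auxiliary vector

# theta should end up near the minimum at the origin.
print(np.linalg.norm(theta))
```

The Hessian-vector product `Hv` is the only second-order quantity needed, which is what makes the fast implicit computation cited above (Pearlmutter, 1994; Griewank, 2000) the key to SMD's efficiency.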

...

...

## References

Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. P. (2006). Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. doi:10.1145/1143844.1143966