Limited-Memory Quasi-Newton Algorithm
(Redirected from Limited Memory BFGS)
- AKA: L-BFGS.
- See: Newton's Method Algorithm, Iterative Scaling Algorithm, CRFpp Package, LogReg Algorithm.
- QUOTE: The limited-memory BFGS (L-BFGS or LM-BFGS) algorithm is a member of the broad family of quasi-Newton optimization methods that uses a limited memory variation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update to approximate the inverse Hessian matrix (denoted by [math]H_k[/math]). Unlike the original BFGS method which stores a dense [math] n \times n [/math] approximation, L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its moderate memory requirement, L-BFGS method is particularly well suited for optimization problems with a large number of variables. L-BFGS never explicitly forms or stores [math]H_k[/math]. Instead, it maintains a history of the past [math]m\,\![/math] updates of the position [math]x\,\![/math] and gradient [math]\nabla f(x)[/math], where generally the history [math]m\,\![/math] can be short, often less than 10. These updates are used to implicitly do operations requiring the [math]H_k[/math]-vector product. While strictly, a straightforward BFGS implementation at the [math]i\,\![/math]-th iteration would represent the inverse Hessian approximation as informed by all updates on [math]0 \ldots i-1[/math], L-BFGS does quite well using updates from only the most recent iterations [math]i-m \ldots i-1[/math].
- (Tasi et al., 2006) ⇒ Tzong-han Tsai, Wen-Chi Chou, Shih-Hung Wu, Ting-Yi Sung, Jieh Hsiang, and Wen-Lian Hsu. (2006). “Integrating Linguistic Knowledge into a Conditional Random Field Framework to Identify Biomedical Named Entities.” In: Expert Systems with Applications: An International Journal, 30(1). doi:10.1016/j.eswa.2005.09.072
- … Our CRF package uses limited-memory quasi-Newton (L-BFGS) (Nocedal & Wright, 1999) method to perform comparably on very large problems (about millions of features). Newton methods for nonlinear optimization use second order information to find search directions. It is not practical to obtain exact second order information for CRF training. L-BFGS estimates the curvature numerically from previous gradients and updates. (Malouf, 2002) indicates that L-BFGS performs well in maximum-entropy classifier training. More detailed description of this method can be found in (Nocedal & Wright, 1999).
- (Vishwanathan et al., 2006) ⇒ S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. (2006). “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In: Proceedings of the 23rd International Conference on Machine learning (ICML-2006). doi:10.1145/1143844.1143966
- QUOTE: … Current training methods for CRFs (In this paper, “training” specifically means penalized maximum likelihood parameter estimation) include generalized iterative scaling (GIS), conjugate gradient (CG), and limited-memory BFGS. These are all batch-only algorithms that do not work well in an online setting, and require many passes through the training data to converge. … Stochastic gradient methods, on the other hand, are online and scale sub-linearly with the amount of training data, making them very attractive for large data sets;
- (Nocedal & Wright, 1999) ⇒ Jorge Nocedal, and Stephen J. Wright. (1999). “Numerical Optimization." Springer, ISBN:0387987932.
- (Liu & Nocedal, 1989) ⇒ Dong C. Liu, and Jorge Nocedal. (1989). “On the Limited Memory BFGS Method for Large Scale Optimization.” In: Mathematical Programming, 45(2). doi:10.1007/BF01589116
- QUOTE: We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cyles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.