2007 TheTradeoffsOfLargeScaleLearning

From GM-RKB

Subject Headings: Large-Scale Supervised Learning Algorithm, Stochastic Gradient Descent Optimization Algorithm.

Notes

Cited By

  • http://leon.bottou.org/papers/bottou-bousquet-2008
    • QUOTE: This paper establishes a formal distinction between small-scale and large-scale learning systems. The generalization performance of small-scale systems is entirely defined by the statistical properties of their estimation procedure. The generalization performance of large-scale systems also depends on the chosen optimization algorithm. In particular, stochastic gradient algorithms display very good generalization performances despite being very poor optimization algorithms. Here is a very clear experimental validation.

Quotes

Abstract

This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.

1 Motivation

The computational complexity of learning algorithms has seldom been taken into account by learning theory. Valiant [1] states that a problem is "learnable" when there exists a probably approximately correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been said about the complexity side of this proposal (e.g., [5]).

Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind:

  • Data mining exists because competitive advantages can be achieved by analyzing the masses of data that describe the life of our computerized society. Since virtually every computer generates data, the data volume is proportional to the available computing power. Therefore one needs learning algorithms that scale roughly linearly with the total volume of data.
  • Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our biological brains can learn quite efficiently from the continuous streams of perceptual data generated by our six senses, using limited amounts of sugar as a source of power. This observation suggests that there are learning algorithms whose computing time requirements scale roughly linearly with the total volume of data.

This contribution finds its source in the idea that approximate optimization algorithms might be sufficient for learning purposes. The first part proposes a new decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. The second part explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error.

2 Approximate Optimization

2.1 Setup

Following [6, 2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a probability distribution P(x, y). The conditional distribution P(y|x) represents the unknown relationship between inputs and outputs. The discrepancy between the predicted output ŷ and the real output y is measured with a loss function ℓ(ŷ, y). Our benchmark is the function f* that minimizes the expected risk

[math]\displaystyle{ E(f) = \int \ell(f(x), y)\, dP(x, y) = \mathbb{E}[\ell(f(x), y)], }[/math]

that is,

[math]\displaystyle{ f^*(x) = \underset{\hat{y}}{\operatorname{argmin}}\ \mathbb{E}[\ell(\hat{y}, y)\,|\,x]. }[/math]

Although the distribution P(x, y) is unknown, we are given a sample S of n independently drawn training examples (x_i, y_i), i = 1 … n. We define the empirical risk

[math]\displaystyle{ E_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) = \mathbb{E}_n[\ell(f(x), y)]. }[/math]

Our first learning principle consists in choosing a family F of candidate prediction functions and finding the function [math]\displaystyle{ f_n = \operatorname{argmin}_{f \in F} E_n(f) }[/math] that minimizes the empirical risk. Well known combinatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently restrictive. Since the optimal function f* is unlikely to belong to the family F, we also define [math]\displaystyle{ f^*_F = \operatorname{argmin}_{f \in F} E(f) }[/math]. For simplicity, we assume that f*, f*_F and f_n are well defined and unique. We can then decompose the excess error as

:[math]\displaystyle{ \mathbb{E}[E(f_n) - E(f^*)] = \mathbb{E}[E(f^*_F) - E(f^*)] + \mathbb{E}[E(f_n) - E(f^*_F)] = \mathcal{E}_{app} + \mathcal{E}_{est}, \quad (1) }[/math]

where the expectation is taken with respect to the random choice of training set. The approximation error E_app measures how closely functions in F can approximate the optimal solution f*. The estimation error E_est measures the effect of minimizing the empirical risk E_n(f) instead of the expected risk E(f). The estimation error is determined by the number of training examples and by the capacity of the family of functions [2]. Large families of functions have smaller approximation errors but lead to higher estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and leads to excess errors that scale between the inverse and the inverse square root of the number of examples [7, 8].
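
To make the setup concrete, here is a minimal sketch of empirical risk minimization, assuming a linearly parametrized family f_w(x) = w·x and the squared loss; both choices are illustrative and not prescribed by the paper at this point.

  import numpy as np

  rng = np.random.default_rng(0)

  def loss(y_hat, y):
      # example loss ℓ(ŷ, y): squared error
      return (y_hat - y) ** 2

  def empirical_risk(w, X, Y):
      # E_n(f_w) = (1/n) Σ_i ℓ(f_w(x_i), y_i) with f_w(x) = w·x
      return np.mean(loss(X @ w, Y))

  # Synthetic training sample S of n i.i.d. pairs (x_i, y_i).
  n, d = 1000, 5
  X = rng.normal(size=(n, d))
  Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

  # f_n = argmin_{f in F} E_n(f): for the squared loss and a linear
  # family this is ordinary least squares.
  w_n, *_ = np.linalg.lstsq(X, Y, rcond=None)
  print("E_n(f_n) =", empirical_risk(w_n, X, Y))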

3.2 Gradient Optimization Algorithms

We now discuss and compare the asymptotic learning properties of four gradient optimization algorithms. Recall that the family of functions F is linearly parametrized by [math]\displaystyle{ w \in \mathbb{R}^d }[/math]. Let [math]\displaystyle{ w^*_F }[/math] and [math]\displaystyle{ w_n }[/math] correspond to the functions [math]\displaystyle{ f^*_F }[/math] and [math]\displaystyle{ f_n }[/math] defined in section 2.1. In this section, we assume that the functions [math]\displaystyle{ w \mapsto \ell(f_w(x), y) }[/math] are convex and twice differentiable with continuous second derivatives. Convexity ensures that the empirical cost function [math]\displaystyle{ C(w) = E_n(f_w) }[/math] has a single minimum.

Two matrices play an important role in the analysis: the Hessian matrix H and the gradient covariance matrix G, both measured at the empirical optimum [math]\displaystyle{ w_n }[/math].
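
As an illustration, the following sketch computes both matrices on synthetic data, assuming the natural definitions (H: Hessian of the empirical cost C(w) at w_n; G: covariance of the per-example gradients at w_n); equation (8) itself is not reproduced in this excerpt, so treat the formulas as an assumption, with the squared loss as an illustrative choice.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 1000, 3
  X = rng.normal(size=(n, d))
  Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

  # Empirical optimum w_n for the squared loss (ordinary least squares).
  w_n, *_ = np.linalg.lstsq(X, Y, rcond=None)

  # Per-example gradients of ℓ(f_w(x_i), y_i) = (w·x_i - y_i)² at w = w_n.
  residuals = X @ w_n - Y
  grads = 2.0 * residuals[:, None] * X       # shape (n, d)

  H = 2.0 / n * X.T @ X                      # Hessian of C(w) = E_n(f_w)
  G = grads.T @ grads / n                    # gradient covariance (mean gradient ≈ 0 at w_n)
  print(np.linalg.eigvalsh(H), np.linalg.eigvalsh(G))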

The four algorithms considered in this paper use information about the gradient of the cost function to iteratively update their current estimate w(t) of the parameter vector.

Gradient Descent (GD) iterates

:[math]\displaystyle{ w(t+1) = w(t) - \eta \frac{\partial C}{\partial w}(w(t)) = w(t) - \eta\, \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_i), y_i\big) }[/math]

where [math]\displaystyle{ \eta \gt 0 }[/math] is a small enough gain. GD is an algorithm with linear convergence [16]. When [math]\displaystyle{ \eta = 1/\lambda_{max} }[/math], this algorithm requires [math]\displaystyle{ O(\kappa \log(1/\rho)) }[/math] iterations to reach accuracy [math]\displaystyle{ \rho }[/math]. The exact number of iterations depends on the choice of the initial parameter vector.
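
A minimal sketch of the GD update on a synthetic squared-loss problem, with the gain chosen as η = 1/λ_max as suggested above; the loss and data are illustrative.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 1000, 5
  X = rng.normal(size=(n, d))
  Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

  def full_gradient(w):
      # (1/n) Σ_i ∂/∂w ℓ(f_w(x_i), y_i) for ℓ(ŷ, y) = (ŷ - y)², cost O(nd)
      return 2.0 / n * X.T @ (X @ w - Y)

  H = 2.0 / n * X.T @ X
  eta = 1.0 / np.linalg.eigvalsh(H).max()    # gain η = 1/λ_max

  w = np.zeros(d)
  for t in range(200):                       # O(κ log(1/ρ)) iterations suffice
      w = w - eta * full_gradient(w)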

Second Order Gradient Descent (2GD) iterates

:[math]\displaystyle{ w(t+1) = w(t) - H^{-1} \frac{\partial C}{\partial w}(w(t)) = w(t) - \frac{1}{n} H^{-1} \sum_{i=1}^n \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_i), y_i\big) }[/math]

where matrix [math]\displaystyle{ H^{-1} }[/math] is the inverse of the Hessian matrix (8). This is more favorable than Newton's algorithm because we do not evaluate the local Hessian at each iteration but simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a single iteration is sufficient. In the general case, [math]\displaystyle{ O(\log\log(1/\rho)) }[/math] iterations are required to reach accuracy [math]\displaystyle{ \rho }[/math].
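
A corresponding sketch of 2GD, assuming the Hessian at the optimum is known in advance and kept fixed; since this illustrative cost is quadratic, a single step already reaches the empirical optimum, as the text predicts.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 1000, 5
  X = rng.normal(size=(n, d))
  Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

  H_inv = np.linalg.inv(2.0 / n * X.T @ X)   # fixed inverse Hessian at the optimum

  w = np.zeros(d)
  for t in range(3):                         # O(log log(1/ρ)) iterations in general
      grad = 2.0 / n * X.T @ (X @ w - Y)     # average gradient, O(nd)
      w = w - H_inv @ grad                   # Newton-like step with the fixed H⁻¹
  # For this quadratic cost the first step already lands on the empirical optimum w_n.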

Stochastic Gradient Descent (SGD) picks a random training example [math]\displaystyle{ (x_t, y_t) }[/math] at each iteration and updates the parameter [math]\displaystyle{ w }[/math] on the basis of this example only,

:[math]\displaystyle{ w(t + 1) = w(t) - \frac{\eta}{t} \frac{\partial}{\partial w} \ell(f_{w(t)}(x_t),y_t). }[/math]

Murata [17, section 2.2] characterizes the mean [math]\displaystyle{ \mathbb{E}_S[w(t)] }[/math] and variance [math]\displaystyle{ \mathbb{V}ar_S[w(t)] }[/math] with respect to the distribution implied by the random examples drawn from the training set S at each iteration. Applying this result to the discrete training set distribution for [math]\displaystyle{ \eta = 1/\lambda_{min} }[/math], we have [math]\displaystyle{ \delta w(t)^2 = O(1/t) }[/math] where [math]\displaystyle{ \delta w(t) }[/math] is a shorthand notation for [math]\displaystyle{ w(t) - w_n }[/math].

We can then write

:[math]\displaystyle{ \mathbb{E}_S[C(w(t)) - \inf C] = \mathbb{E}_S\left[\operatorname{tr}\left(H\, \delta w(t)\, \delta w(t)^\top\right)\right] + o\left(\tfrac{1}{t}\right) = \operatorname{tr}\left(H\, \mathbb{E}_S[\delta w(t)]\, \mathbb{E}_S[\delta w(t)]^\top + H\, \mathbb{V}ar_S[w(t)]\right) + o\left(\tfrac{1}{t}\right) \leq \frac{\nu\kappa^2}{t} + o\left(\tfrac{1}{t}\right). \quad (11) }[/math]

Therefore the SGD algorithm reaches accuracy [math]\displaystyle{ \rho }[/math] after [math]\displaystyle{ \nu\kappa^2/\rho + o(1/\rho) }[/math] iterations on average. The SGD convergence is essentially limited by the stochastic noise induced by the random choice of one example at each iteration. Neither the initial value of the parameter vector [math]\displaystyle{ w }[/math] nor the total number of examples n appear in the dominant term of this bound! When the training set is large, one could reach the desired accuracy [math]\displaystyle{ \rho }[/math] measured on the whole training set without even visiting all the training examples. This is in fact a kind of generalization bound.
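
A minimal sketch of the SGD update with the η/t gain schedule quoted above; the gain constant η and the squared loss are illustrative choices, not prescribed by this excerpt.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 100000, 5
  X = rng.normal(size=(n, d))
  Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

  eta = 1.0                                  # gain constant (illustrative)
  w = np.zeros(d)
  for t in range(1, 20000):                  # possibly far fewer than n updates
      i = rng.integers(n)                    # pick one random training example
      grad_i = 2.0 * (X[i] @ w - Y[i]) * X[i]   # ∂/∂w ℓ(f_w(x_t), y_t), cost O(d)
      w = w - (eta / t) * grad_i             # w(t+1) = w(t) - (η/t) ∂ℓ/∂w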

Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second last column (time to optimize) with the last column (time to reach the excess test error ε). Legend: n number of examples; d parameter dimension; κ, ν see equation (10).

  Algorithm   Cost of one iteration   Iterations to reach ρ   Time to reach accuracy ρ    Time to reach E ≤ c(E_app + ε)
  GD          O(nd)                   O(κ log(1/ρ))           O(ndκ log(1/ρ))             …
  2GD         O(d² + nd)              O(log log(1/ρ))         O((d² + nd) log log(1/ρ))   …
  SGD         O(d)                    νκ²/ρ + o(1/ρ)          O(dνκ²/ρ)                   …
  2SGD        O(d²)                   ν/ρ + o(1/ρ)            O(d²ν/ρ)                    …

Second Order Stochastic Gradient Descent (2SGD) replaces the gain [math]\displaystyle{ \eta }[/math] by the inverse of the Hessian matrix H:

:[math]\displaystyle{ w(t + 1) = w(t) - \frac{1}{t} H^{-1} \frac{\partial}{\partial w} \ell(f_{w(t)}(x_t), y_t). }[/math]

Unlike standard gradient algorithms, using the second order information does not change the influence of [math]\displaystyle{ \rho }[/math] on the convergence rate but improves the constants. Using again [17, theorem 4], accuracy [math]\displaystyle{ \rho }[/math] is reached after [math]\displaystyle{ \nu/\rho + o(1/\rho) }[/math] iterations.
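
A minimal sketch of 2SGD, replacing the scalar gain by the (assumed known) inverse Hessian scaled by 1/t, as in the update rule above; same illustrative squared-loss setup as the previous sketches.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 100000, 5
  X = rng.normal(size=(n, d))
  Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

  H_inv = np.linalg.inv(2.0 / n * X.T @ X)   # inverse Hessian, assumed known in advance

  w = np.zeros(d)
  for t in range(1, 20000):
      i = rng.integers(n)
      grad_i = 2.0 * (X[i] @ w - Y[i]) * X[i]   # single-example gradient, O(d)
      w = w - (1.0 / t) * (H_inv @ grad_i)   # O(d²) per iteration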


References

  • [1] Leslie G. Valiant. A theory of the learnable. In Proceedings of the 1984 STOC, pages 436–445, 1984.
  • [2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, 1982.
  • [3] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  • [4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
  • [5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177–192, 1988.
  • [6] Richard O. Duda and Peter E. Hart. Pattern Classification And Scene Analysis. Wiley and Son, 1973.
  • [7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.
  • [8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir, editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture Notes in Computer Science, pages 279–294, Bertinoro, Italy, June 2005. Springer-Verlag.
  • [9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994.
  • [10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.
  • [11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statististics, 32(1), 2004.
  • [12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.
  • [13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, (2):245–303, 2000.
  • [14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.
  • [15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer Science, pages 1–40. Springer-Verlag, Berlin, 2003.
  • [16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
  • [17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
  • [18] Léon Bottou and Yann LeCun. Large scale online learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
  • [19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD'06, Philadelphia, PA, USA, August 20–23, 2006. ACM.
  • [20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7:733–769, 2006.


Léon Bottou, and Olivier Bousquet. (2007). "The Tradeoffs of Large Scale Learning." In: Advances in Neural Information Processing Systems 20 (NIPS 2007). http://books.nips.cc/papers/files/nips20/NIPS2007_0726.pdf