2015 CompetingwiththeEmpiricalRiskMi


Subject Headings: Streaming Stochastic Variance Reduced Gradient Algorithm, Data Streaming Algorithm, Empirical Risk Minimization Algorithm.

Notes

Cited By

2018

Quotes

Abstract

In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data – commonly referred to as either the empirical risk minimizer (ERM) or the M-estimator – is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties:

  1. The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
  2. The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors.
  3. The algorithm’s performance depends on the initial error at a rate that decreases super-polynomially.
  4. The algorithm is easily parallelizable.

    Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.

1. Introduction

Consider the following optimization problem:

[math]\displaystyle{ \underset{w \in S}{\min}\, P(w),\quad }[/math] where [math]\displaystyle{ P(w) \overset{def}{=}\mathbb{E}_{\psi\sim \mathcal{D}}[\psi (w)] \quad }[/math](1)

and [math]\displaystyle{ \mathcal{D} }[/math] is a distribution over convex functions from a Euclidean space [math]\displaystyle{ S }[/math] to [math]\displaystyle{ \mathbb{R} }[/math] (e.g. [math]\displaystyle{ S = \mathbb{R}^d }[/math] in the finite dimensional setting). Let [math]\displaystyle{ w_\star }[/math] be a minimizer of [math]\displaystyle{ P }[/math] and suppose we observe the functions [math]\displaystyle{ \psi_1,\psi_2,\cdots, \psi_N }[/math] independently sampled from [math]\displaystyle{ \mathcal{D} }[/math]. Our objective is to compute an estimator [math]\displaystyle{ \hat{w}_N }[/math] so that the expected error (or, equivalently, the excess risk):

[math]\displaystyle{ \mathbb{E}[P(\hat{w}_N)- P(w_\star)] }[/math]

is small, where the expectation is over the estimator [math]\displaystyle{ \hat{w}_N }[/math] (which depends on the sampled functions).
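To make the setting concrete, here is a minimal Python sketch (not from the paper) instantiating problem (1) for a synthetic, well-specified linear least-squares model: each sampled loss is [math]\psi(w) = (Y - w^T X)^2[/math], the ERM is ordinary least squares, and its excess risk can be compared with the classical [math]d\,\sigma^2_{noise}/N[/math] rate discussed in Section 3.1. The dimension, distribution, and noise level are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch (not from the paper) of problem (1) for a synthetic, well-specified
# linear least-squares instance: psi(w) = (y - <w, x>)^2 and P(w) = E[psi(w)].
rng = np.random.default_rng(0)
d, noise_std = 5, 0.5
w_star = rng.standard_normal(d)                    # population minimizer

def sample_psi():
    """Draw one loss psi ~ D, represented by the data point (x, y) that defines it."""
    x = rng.standard_normal(d)                     # so Sigma = E[x x^T] = I
    y = x @ w_star + noise_std * rng.standard_normal()
    return x, y

def excess_risk(w):
    """P(w) - P(w_star); for this instance it equals ||w - w_star||_Sigma^2 with Sigma = I."""
    return float(np.sum((w - w_star) ** 2))

# One candidate estimator: the ERM (ordinary least squares) on N samples.
N = 1000
X, Y = map(np.array, zip(*[sample_psi() for _ in range(N)]))
w_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)
# Compare with the classical rate d * noise_std^2 / N for this instance (see Section 3.1).
print(excess_risk(w_erm), d * noise_std**2 / N)
```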

Stochastic approximation algorithms, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), are the most widely used in practice, due to their ease of implementation and their efficiency with regards to runtime and memory. Without consideration for computational constraints, we often wish to compute the empirical risk minimizer (ERM; or, equivalently, the M-estimator):

[math]\displaystyle{ \hat{w}^{ERM}_{N}\in \underset{w\in S}{\arg\min}\dfrac{1}{N}\sum^N_{i=1}\psi_i(w)\quad }[/math] (2)

In the context of statistical modeling, the ERM is the maximum likelihood estimator (MLE). Under certain regularity conditions, and under correct model specification[1], the MLE is asymptotically efficient, in that no unbiased estimator can have a lower variance in the limit (see Lehmann and Casella (1998); van der Vaart (2000))[2]. Analogous arguments have been made in the stochastic approximation setting, where we do not necessarily have a statistical model of the distribution [math]\displaystyle{ \mathcal{D} }[/math] (see Kushner and Yin (2003)). The regularity conditions we consider herein are standard; without such assumptions, the work in Shalev-Shwartz et al. (2010) shows that the ERM may be unstable.

The question we aim to address is as follows. Consider the ratio:

[math]\displaystyle{ \dfrac{\mathbb{E}[P(\hat{w}^{ERM}_{N})-P(w_\star)]}{\mathbb{E}[P(\hat{w}_{N})-P(w_\star)]}\quad }[/math] (3)

We seek an algorithm to compute [math]\hat{w}_N[/math] for which: (1) under sufficient regularity conditions, this ratio approaches 1 on every problem [math]\mathcal{D}[/math], and (2) it does so quickly, at a rate quantifiable in terms of the number of samples, the dependence on the initial error (and other relevant quantities), and the computational time and space usage.


  1. A well-specified statistical model is one where the data is generated under some model in the parametric class. See the linear regression example in Section 3.1.
  2. Note that biased estimators, e.g. the James-Stein estimator, can outperform the MLE (Lehmann and Casella, 1998).

1.1. This work

Under certain smoothness assumptions on [math]\psi[/math] and strong convexity assumptions on [math]P[/math] (applicable to linear and logistic regression, generalized linear models, smooth Huber losses, and various other M-estimation problems), we provide an algorithm where:

1. The algorithm achieves the same statistical rate of convergence as the ERM on every problem, even considering constant factors, and we quantify the sample size at which this occurs.

2. The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.

3. The algorithm decreases the standard notion of initial error at a super-polynomial rate.3

4. The algorithm is trivially parallelizable (see Remark 6).

Table 1 compares previous (and concurrent) algorithms that enjoy the first two guarantees; this work is the first with a finite-sample analysis handling the more general class of problems. Our algorithm is a variant of the stochastic variance reduced gradient procedure of Johnson and Zhang (2013). Importantly, we quantify how fast we obtain a rate comparable to that of the ERM. For the case of linear regression, we have non-trivial guarantees when the sample size [math]N[/math] is larger than a constant times what can be interpreted as a condition number, [math]\kappa = L/\mu[/math], where [math]\mu[/math] is a strong convexity parameter of [math]P[/math] and [math]L[/math] is a smoothness parameter of each [math]\psi[/math].


3. A function is super-polynomial if it grows faster than any polynomial.

Algorithm / analysis | Problem | Step size | Initial error dependence | Parallelizable | Finite-step analysis
Polyak and Juditsky (1992) | general | decaying: [math]1/n^c[/math] | ? | ? | ✗
Polyak and Juditsky (1992) / Dieuleveut and Bach (2014) | linear regression | constant | [math]O(1/n^2)[/math] | ? | ✓
This work: Streaming SVRG | general | constant | [math]1/n^{\omega(1)}[/math] | ✓ | ✓

Table 1: Comparison of known streaming algorithms which achieve a constant competitive ratio to the ERM. Polyak and Juditsky (1992) is an SGD algorithm with iterate averaging. Concurrent to and independently from our work, Dieuleveut and Bach (2014) provide a finite-sample analysis for SGD with averaging in the linear regression problem setting (where the learning rate can be taken as constant). In the "problem" column, "general" indicates problems under the regularity assumptions herein. Polyak and Juditsky (1992) require the step size to decay with the sample size [math]n[/math], as [math]1/n^c[/math] with [math]c[/math] strictly in the range [math]1/2 \lt c \lt 1[/math]. The dependence on [math]c[/math] in a finite-sample analysis is unclear (and tuning the decay of learning rates is often undesirable in practice). The initial error is [math]P(w_0) - P(w_\star)[/math], where [math]w_0[/math] is the starting point of the algorithm. We seek algorithms in which the initial-error dependence is significantly lower in order, and we write [math]1/n^{\omega(1)}[/math] to indicate that it can be driven down to an arbitrarily low-order polynomial. See Remark 6 with regard to parallelization.

Critically, once [math]N[/math] is larger than [math]\kappa[/math], the initial error is divided by a factor that can be larger than any polynomial in [math]N/\kappa[/math].

Finally, in order to address this question on a per-problem basis, we provide both upper and lower bounds for the rate of convergence of the ERM.

1.2. Related work

Stochastic optimization dates back to the work of Robbins and Monro (1951) and has seen much subsequent work (Kushner and Clark, 1978; Kushner and Yin, 2003; Nemirovski and Yudin, 1983). More recently, questions of how to quantify and compare rates of estimation procedures, with implications for machine learning problems in the streaming and large-dataset settings, have been raised and discussed several times (see Bottou and Bousquet (2008); Agarwal and Bottou (2014)).

Stochastic approximation. The pioneering work of Polyak and Juditsky (1992) and Ruppert (1988) provides an asymptotically optimal streaming algorithm, by averaging the iterates of an SGD procedure. It is unclear how quickly these algorithms converge to the rate of the ERM in finite sample; the relevant dependencies, such as the dependence on the initial error (that is, [math]P(w_0) - P(w_\star)[/math], where [math]w_0[/math] is the starting point of the algorithm), are not specified. In particular, they characterize the limiting distribution of [math]\sqrt{N}(\bar{w}_N - w_\star)[/math], essentially arguing that the variance of the iterate-averaging procedure matches the asymptotic distribution of the ERM (see Kushner and Yin (2003)).

In a series of papers, Bach and Moulines (2011), Bach and Moulines (2013), Dieuleveut and Bach (2014), and Defossez and Bach (2015) provide non-asymptotic analyses of the same averaging schemes. Of these, for the specific case of linear least-squares regression, Dieuleveut and Bach (2014) and Defossez and Bach (2015) provide rates which are competitive with the ERM, concurrently with and independently of the results presented herein. In Dieuleveut and Bach (2014) and Defossez and Bach (2015), the dependence on the initial error is shown to decay as [math]1/N^2[/math]. In non-parametric (least-squares) settings, there are various results looking at optimal rates of online algorithms in comparison to batch estimators (see Dieuleveut and Bach (2014); Yao (2010); Tarres and Yao (2014); Ying and Pontil (2008)). The work in Bach and Moulines (2011) and Bach and Moulines (2013) either does not achieve the ERM rate or has a dependence on the initial error which is not lower in order.

For the special case of least squares, one could adapt the algorithm and guarantees of Dieuleveut and Bach (2014) and Defossez and Bach (2015), by replacing global averaging with random restarts, to obtain super-polynomial rates (results comparable to ours when specializing to linear regression). For more general problems, it is unclear how such an adaptation would work; using constant step sizes alone may not suffice. In contrast, as shown in Table 1, our algorithm is identical for a wide variety of cases and does not need decaying rates (whose choices may be difficult in practice).

We should also note that much work has characterized rates of convergence under various assumptions on [math]P[/math] and [math]\psi[/math] different from our own. Our case of interest is when [math]P[/math] is strongly convex. For such [math]P[/math], the rates of convergence of many algorithms are [math]O(1/N)[/math], often achieved by averaging the iterates in some way (Nemirovski et al., 2009; Juditsky and Nesterov, 2010; Rakhlin et al., 2012; Hazan and Kale, 2014). These results do not achieve a constant competitive ratio, for a variety of reasons (they have leading-order dependencies on various quantities, including the initial error along with the strong convexity and smoothness parameters). Solely in terms of the dependence on the sample size [math]N[/math], these rates are known to be optimal (Nemirovski and Yudin, 1983; Nesterov, 2004; Agarwal et al., 2012).

Empirical risk minimization (M-estimation). In statistics, it is classically argued that the MLE, under certain restrictions, is an asymptotically efficient estimator for well-specified statistical models (Lehmann and Casella, 1998; van der Vaart, 2000). Analogously, in an optimization context, applicable to mis-specified models, similar asymptotic arguments have been made: under certain restrictions, the asymptotically optimal estimator is one which has a limiting variance that is equivalent to that of the ERM (Anbar, 1971; Fabian, 1973; Kushner and Clark, 1978).

With regards to finite-sample rates, Agarwal et al. (2012) provide information-theoretic lower bounds (for any strategy) for certain stochastic convex optimization problems. This result does not imply our bounds, as they do not consider the same smoothness assumptions on [math]\psi[/math]. For the special case of linear least-squares regression, there are several upper bounds (for instance, Caponnetto and De Vito (2007); Hsu et al. (2014)). Recently, Shamir (2014) provides lower bounds specifically for the least-squares estimator, applicable under model mis-specification, and sharp only for specific problems.

Linearly convergent optimization (and approaches based on doubling). There are numerous algorithms for optimizing sums of convex functions that converge linearly, i.e. that depend only logarithmically on the target precision. Notably, several recently developed such algorithms are applicable in the setting where the sample size [math]N[/math] becomes large, due to their stochastic nature (Strohmer and Vershynin, 2009; Le Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). These procedures minimize a sum of [math]N[/math] losses in time (nearly) linear in [math]N[/math], provided [math]N[/math] is sufficiently large relative to the dimension and the condition number.

Naively, one could attempt to use one of these algorithms to directly compute the ERM. Such an attempt poses two difficulties. First, we would need to prove concentration results for the empirical function [math]P_N(w) = \frac{1}{N}\sum_{i=1}^N \psi_i(w)[/math]; in order to argue that these algorithms perform well in linear time with respect to the objective [math]P[/math], one must relate the condition number of [math]P_N(w)[/math] to the condition number of [math]P(w)[/math]. Second, we would need a new generalization analysis in order to relate the in-sample error [math]\mathcal{E}_N(\hat{w}_N)[/math], where [math]\mathcal{E}_N(w) \overset{def}{=} P_N(w) - \min_{w'} P_N(w')[/math], to the generalization error [math]\mathbb{E}[P(\hat{w}_N) - P(w_\star)][/math]. To use existing generalization analyses would demand that [math]\mathcal{E}_N(\hat{w}_N) = O(1/N)[/math], but the algorithms in question all require at least [math]\log N[/math] passes of the data (furthermore scaled by other problem-dependent factors) to achieve such an in-sample error. Hence, this approach would not immediately describe the generalization error obtained in time linear in [math]N[/math]. Finally, it requires that the entire observed data sample, constituting the sum, be stored in memory.

A second natural question is: can one naively use a doubling trick with an extant algorithm to compete with the ERM? By this we mean to iteratively run such a linearly convergent optimization algorithm, on increasingly larger subsets of the data, with the hope of cutting the error at each iteration by a constant fraction, eventually down to that of the ERM. There are two points to note for this approach. First, the approach is not implementable in a streaming model as one would eventually have to run the algorithm on a constant fraction of the entire dataset size, thus essentially holding the entire dataset in memory. Second, proving such an algorithm succeeds would similarly involve the aforementioned type of generalization argument.

We conjecture that the tight generalization arguments described above are attainable, although with a somewhat involved analysis. For linear regression, the bounds in Hsu et al. (2014) may suffice. More generally, we believe the detailed ERM analysis provided herein could be used.

In contrast, the statistical convergence analysis of our single-pass algorithm is self-contained and does not go through any generalization arguments about the ERM. In fact, it avoids matrix concentration arguments entirely.

Comparison to related work. To our knowledge, this work provides the first streaming algorithm guaranteed to have a rate that approaches that of the ERM (under certain regularity assumptions on [math]\mathcal{D}[/math]), where the initial error is decreased at a super-polynomial rate. The previous work, in the general case that we consider, only provides asymptotic convergence guarantees (Polyak and Juditsky, 1992). For the special case of linear least-squares regression, the concurrent and independent work presented in Dieuleveut and Bach (2014) and Defossez and Bach (2015) also converges to the rate of the ERM, with a lower-order dependence on the initial error of [math]O(1/N^2)[/math]. Furthermore, even if we ignored memory constraints and focused solely on computational complexity, our algorithm compares favorably to using state-of-the-art algorithms for minimizing sums of functions (such as the linearly convergent algorithms in Le Roux et al. (2012); Shalev-Shwartz and Zhang (2013); Johnson and Zhang (2013)); as discussed above, obtaining a convergence rate with these algorithms would entail some further generalization analysis.

It would be interesting if one could quantify an approach of restarting the algorithm of Polyak and Juditsky (1992) to obtain guarantees comparable to our streaming algorithm. Such an analysis could be delicate in settings other than linear regression, as their learning rates must decay neither too quickly nor too slowly (they must decay strictly faster than [math]1/\sqrt{N}[/math], yet more slowly than [math]1/N[/math]). In contrast, our algorithm takes a constant learning rate to obtain its constant competitive ratio. Furthermore, our algorithm is easily parallelizable and its analysis, we believe, is relatively transparent.

1.3. Organization

Section 2 summarizes our main results, and Section 3 provides applications to a few standard statistical models. The Appendix contains all proofs.


2. Main results

This section summarizes our main results, as corollaries of more general theorems provided later. After providing our assumptions in Section 2.1, Section 2.2 provides the algorithm, along with performance guarantees. Then Section 2.3 provides upper and lower bounds of the statistical rate of the empirical risk minimizer.

First, a few preliminaries and definitions are needed. Denote [math]\|x\|^2_M \overset{def}{=} x^T M x[/math] for a vector [math]x[/math] and a matrix [math]M[/math] of appropriate dimensions. Denote [math]\lambda_{max}(M)[/math] and [math]\lambda_{min}(M)[/math] as the maximal and minimal eigenvalues of a matrix [math]M[/math]. Let [math]I[/math] denote the identity matrix. Also, for positive semidefinite symmetric matrices [math]A[/math] and [math]B[/math], [math]A \preceq B[/math] if and only if [math]x^T A x \le x^T B x[/math] for all [math]x[/math].

Throughout, define [math]\sigma^2[/math] as:

[math]\displaystyle{ \sigma^2 \overset{def}{=} \mathbb{E}_{\psi\sim\mathcal{D}}\left[\tfrac{1}{2}\|\nabla\psi(w_\star)\|^2_{(\nabla^2 P(w_\star))^{-1}}\right] \quad }[/math] (4)

This quantity governs the precise (problem-dependent) convergence rate of the ERM. Namely, under certain restrictions on [math]\mathcal{D}[/math], we have

[math]\displaystyle{ \lim_{N\to\infty}\dfrac{\mathbb{E}[P(\hat{w}^{ERM}_N) - P(w_\star)]}{\sigma^2/N} = 1. \quad }[/math] (5)

This limiting rate is well-established in asymptotic statistics (see, for instance, van der Vaart (2000)), whereas Section 2.3 provides upper and lower bounds on this rate for finite sample sizes [math]N[/math]. Analogous to the Cramér-Rao lower bound, under certain restrictions, [math]\sigma^2/N[/math] is the asymptotically efficient rate for stochastic approximation problems (Anbar, 1971; Fabian, 1973; Kushner and Yin, 2003).[4] The problem-dependent rate [math]\sigma^2/N[/math] sets the benchmark. Statistically, we hope to achieve a leading-order dependency of [math]\sigma^2/N[/math] quickly, with a rapidly decaying dependence on the initial error.
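As a sanity check on definition (4), the following sketch estimates [math]\sigma^2[/math] by Monte Carlo for the synthetic least-squares instance used earlier, where [math]\nabla\psi(w_\star)[/math] and [math]\nabla^2 P(w_\star)[/math] are available in closed form; for that instance the estimate should come out near [math]d\,\sigma^2_{noise}[/math], matching Section 3.1. The closed-form gradients and the identity covariance are assumptions of this toy example, not part of the paper's analysis.

```python
import numpy as np

# Monte Carlo estimate of sigma^2 from (4) for the synthetic least-squares instance above,
# where grad psi(w*) = -2*(y - <w*, x>)*x and Hess P(w*) = 2*Sigma = 2*I (assumptions of
# this toy setup, not quantities the paper computes this way).
rng = np.random.default_rng(1)
d, noise_std = 5, 0.5
w_star = rng.standard_normal(d)
hess_P_inv = np.linalg.inv(2.0 * np.eye(d))        # (Hess P(w*))^{-1}

num_mc = 200_000
x = rng.standard_normal((num_mc, d))
y = x @ w_star + noise_std * rng.standard_normal(num_mc)
grad_psi = -2.0 * (y - x @ w_star)[:, None] * x    # per-sample gradients at w_star
# 0.5 * ||grad psi(w*)||^2 in the (Hess P(w*))^{-1} norm, averaged over samples:
sigma_sq = 0.5 * np.mean(np.einsum('ij,jk,ik->i', grad_psi, hess_P_inv, grad_psi))
print(sigma_sq, d * noise_std**2)                  # both should be close to 1.25
```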

2.1. Assumptions

We now provide two assumptions under which we analyze the convergence rate of our streaming algorithm, Algorithm 1. Our first assumption is relatively standard. It provides upper and lower quadratic approximations (the lower approximation is on the full objective P).

Assumption 2.1 Suppose that:

1. The objective [math]P[/math] is twice differentiable.

2. (Strong convexity) The objective [math]P[/math] is [math]\mu[/math]-strongly convex, i.e. for all [math]w, w' \in S[/math],

[math]\displaystyle{ P(w) \ge P(w') + \nabla P(w')^T(w - w') + \frac{\mu}{2}\|w - w'\|^2_2. \quad }[/math] (6)

3. (Smoothness) Each loss [math]\psi[/math] is [math]L[/math]-smooth (with probability one), i.e. for all [math]w, w' \in S[/math],

[math]\displaystyle{ \psi(w) \le \psi(w') + \nabla \psi(w')^T(w - w') + \frac{L}{2}\|w - w'\|^2_2. \quad }[/math] (7)

4. Though, as with Cramér-Rao, this may be improvable with biased estimators.


Our results in fact hold under a slightly weaker version of this assumption; see Remark 21. Define:

[math]\displaystyle{ \kappa \overset{def}{=} \frac{L}{\mu}. \quad }[/math] (8)

The quantity [math]\kappa[/math] can be interpreted as the condition number of the optimization objective (1). The following definition quantifies a global bound on the Hessian.

Definition 1 ([math]\alpha[/math]-bounded Hessian) Let [math]\alpha \ge 1[/math] be the smallest value (if it exists) such that for all [math]w \in S[/math], [math]\nabla^2 P(w_\star) \preceq \alpha \nabla^2 P(w)[/math].

Under Assumption 2.1, we have [math]\alpha \le \kappa[/math], because [math]L[/math]-smoothness implies [math]\nabla^2 P(w_\star) \preceq L I[/math] and [math]\mu[/math]-strong convexity implies [math]\mu I \preceq \nabla^2 P(w)[/math]. However, [math]\alpha[/math] could be much smaller. For instance, [math]\alpha = 1[/math] in linear regression, whereas [math]\kappa[/math] is the maximum-to-minimum eigenvalue ratio of the design matrix.

Our second assumption offers a stronger, local relationship on the objective's Hessian, namely self-concordance. A function is self-concordant if its third-order derivative is bounded by a multiple of its second-order derivative. Formally, [math]f : \mathbb{R} \to \mathbb{R}[/math] is [math]M[/math]-self-concordant if and only if [math]f[/math] is convex and [math]|f'''(x)| \le M f''(x)^{3/2}[/math]. A multivariate function [math]f : \mathbb{R}^d \to \mathbb{R}[/math] is [math]M[/math]-self-concordant if and only if its restriction to any line is [math]M[/math]-self-concordant.

Assumption 2.2 (Self-concordance) Suppose that:

1. [math]P[/math] is [math]M[/math]-self-concordant (or that the weaker condition in Equation (31) holds).

2. A kurtosis condition holds, which bounds [math]M[/math] in terms of the moments of [math]\|\nabla\psi(w_\star)\|_{(\nabla^2 P(w_\star))^{-1}}[/math] (in particular, in terms of [math]\mathbb{E}_{\psi\sim\mathcal{D}}\big[\|\nabla\psi(w_\star)\|^2_{(\nabla^2 P(w_\star))^{-1}}\big][/math]).

Note that these two assumptions are also standard assumptions in the analysis of the two phases of Newton's method (aside from the kurtosis condition): the first phase of Newton's method gets close to the minimizer quickly (based on a global strong convexity assumption), and the second phase obtains quadratic convergence (based on local curvature assumptions on how fast the local Hessian changes, e.g. self-concordance). Moreover, our proof for the streaming algorithm follows a similar structure; we use Assumption 2.1 to analyze the progress of our algorithm when the current point is far away from optimality and Assumption 2.2 when it is close.

2.2. Algorithm

Here we describe a streaming algorithm and provide its convergence guarantees. Algorithm 1 is inspired by the Stochastic Variance Reduced Gradient (SVRG) algorithm of Johnson and Zhang (2013) for minimizing a strongly convex sum of smooth losses. The algorithm follows a simple framework that proceeds in stages. In each stage [math]s[/math], we draw [math]k_s[/math] samples independently at random from [math]\mathcal{D}[/math] and use these samples to obtain an estimate of the gradient of [math]P[/math] at the current point [math]\hat{w}_s[/math] (see (9)). This stable gradient, denoted [math]\widehat{\nabla P}(\hat{w}_s)[/math], is then used to decrease the variance of a gradient descent procedure. For each of [math]\tilde{m}[/math] steps (where [math]\tilde{m}[/math] is chosen uniformly at random from [math]\{1, 2, \ldots, m\}[/math]), we draw a sample [math]\psi_t[/math] from [math]\mathcal{D}[/math] and take a step opposite to its gradient at the current point, plus a zero-bias correction given by [math]\nabla\psi_t(\hat{w}_s) - \widehat{\nabla P}(\hat{w}_s)[/math] (see (10)). The remainder of this section shows that, for suitable choices of [math]k_s[/math] and [math]m[/math], Algorithm 1 achieves desirable convergence rates under the aforementioned assumptions.

Algorithm 1 Streaming Stochastic Variance Reduced Gradient (Streaming SVRG)

input: initial point [math]\hat{w}_0[/math], batch sizes [math]\{k_0, k_1, \ldots\}[/math], update frequency [math]m[/math], learning rate [math]\eta[/math], smoothness [math]L[/math]
for each stage [math]s = 0, 1, 2, \ldots[/math] do
  Sample [math]\psi_1, \ldots, \psi_{k_s}[/math] from [math]\mathcal{D}[/math] and compute the estimate
  [math]\displaystyle{ \widehat{\nabla P}(\hat{w}_s) = \frac{1}{k_s}\sum_{i\in[k_s]}\nabla\psi_i(\hat{w}_s). \quad }[/math] (9)
  Sample [math]\tilde{m}[/math] uniformly at random from [math]\{1, 2, \ldots, m\}[/math].
  [math]w_0 \leftarrow \hat{w}_s[/math]
  for [math]t = 0, 1, \ldots, \tilde{m}-1[/math] do
    Sample [math]\psi_t[/math] from [math]\mathcal{D}[/math] and set
    [math]\displaystyle{ w_{t+1} \leftarrow w_t - \frac{\eta}{L}\left(\nabla\psi_t(w_t) - \nabla\psi_t(\hat{w}_s) + \widehat{\nabla P}(\hat{w}_s)\right). \quad }[/math] (10)
  end for
  [math]\hat{w}_{s+1} \leftarrow w_{\tilde{m}}[/math]
end for
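The following Python sketch is one possible reading of Algorithm 1 as reconstructed above (with the step size taken as [math]\eta/L[/math], matching the update (10)); the ridge-regression gradient oracle and all hyperparameter values are illustrative assumptions, not the settings analyzed in Theorems 25 and 26. Note that each stage uses fresh samples both for the batch estimate (9) and for the inner loop, so the procedure makes a single pass over the stream.

```python
import numpy as np

def streaming_svrg(sample_grad, w0, k0, b, m, eta_over_L, num_stages, rng):
    """Sketch of Algorithm 1 (Streaming SVRG). `sample_grad()` draws one loss psi ~ D
    and returns a callable w -> grad psi(w); batch sizes grow as k_s = b * k_{s-1}."""
    w_hat = np.array(w0, dtype=float)      # stage point, w_hat_s
    k = k0
    for s in range(num_stages):
        # Batch estimate of grad P at the stage point, cf. (9).
        batch = [sample_grad() for _ in range(int(k))]
        mu = np.mean([g(w_hat) for g in batch], axis=0)
        # Inner-loop length drawn uniformly from {1, ..., m}.
        m_tilde = int(rng.integers(1, m + 1))
        w = w_hat.copy()
        for _ in range(m_tilde):
            g = sample_grad()              # fresh sample psi_t from the stream
            # Variance-reduced update, cf. (10).
            w = w - eta_over_L * (g(w) - g(w_hat) + mu)
        w_hat = w                          # w_hat_{s+1} <- w_{m_tilde}
        k *= b
    return w_hat

# Illustrative gradient oracle: ridge-regularized least squares on synthetic data
# (an assumption of this sketch; any per-sample gradient oracle for (1) would do).
rng = np.random.default_rng(0)
d, lam, noise_std = 5, 0.1, 0.5
w_gen = rng.standard_normal(d)

def sample_grad():
    x = rng.standard_normal(d)
    y = x @ w_gen + noise_std * rng.standard_normal()
    return lambda w: -2.0 * (y - w @ x) * x + 2.0 * lam * w

w_hat = streaming_svrg(sample_grad, np.zeros(d), k0=100, b=3, m=50,
                       eta_over_L=0.05, num_stages=5, rng=rng)
print(np.linalg.norm(w_hat - w_gen))       # distance to the generating parameter (illustrative)
```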

Remark 2 (Generalizing SVRG) Note that Algorithm 1 is a generalization of SVRG. In particular, if we choose [math]k_s = \infty[/math], i.e. if [math]\widehat{\nabla P}(\hat{w}_s) = \nabla P(\hat{w}_s)[/math], then our algorithm coincides with the SVRG algorithm of Johnson and Zhang (2013). Also, note that Johnson and Zhang (2013) do not make use of any self-concordance assumptions.

Remark 3 (Non-conformance to stochastic first-order oracle models) Algorithm 1 is not implementable in the standard stochastic first-order oracle model, e.g. that which is assumed in order to obtain the lower bounds in Nemirovski and Yudin (1983) and Agarwal et al. (2012). Streaming SVRG computes the gradient of the randomly drawn [math]\psi[/math] at two points, while the oracle model only allows gradient queries at one point.

We have the following algorithmic guarantee under only Assumption 2.1, which is a corollary of Theorem 25 (also see the Appendix).

Corollary 4 (Convergence under [math]\alpha[/math]-bounded Hessians) Suppose Assumption 2.1 holds. Fix [math]\hat{w}_0 \in \mathbb{R}^d[/math]. For [math]p \ge 2[/math] and [math]b \ge 3[/math], set the constant learning rate [math]\eta[/math] and the update frequency [math]m[/math] as functions of [math]b[/math] and [math]p[/math] (as in Theorem 25), set [math]k_0 = 20\alpha\kappa b^{p+1}[/math], and let [math]k_s = b\,k_{s-1}[/math]. Denote

[math]\displaystyle{ N_s \overset{def}{=} \sum_{t=0}^{s-1}(k_t + m) }[/math]

([math]N_s[/math] is an upper bound on the number of samples drawn up to the end of stage [math]s[/math]). Let [math]\hat{w}_{N_s}[/math] be the parameter returned at iteration [math]s[/math] by Algorithm 1. For [math]N_s \ge b^{p^2+6p}\kappa[/math] (and so [math]s \gt p^2 + 6p[/math]), the expected excess risk [math]\mathbb{E}[P(\hat{w}_{N_s}) - P(w_\star)][/math] is bounded by a leading term of order [math](1 + O(1/b))\,\alpha\sigma^2/N_s[/math], plus a term proportional to the initial error [math]P(\hat{w}_0) - P(w_\star)[/math] that is suppressed by a factor polynomial of degree [math]p[/math] in [math]\kappa/N_s[/math]; see Theorem 25 for the exact constants.

When [math]\alpha = 1[/math] (such as for least-squares regression), the above bound achieves the ERM rate of [math]\sigma^2/N[/math] (up to a constant factor, which can be driven to one, as discussed later). Furthermore, under self-concordance, we can drive the competitive ratio (3) down from [math]\alpha[/math] to arbitrarily close to 1. The following is a corollary of Theorem 26 (also see the Appendix):

Corollary 5 (Convergence under self-concordance) Suppose Assumptions 2.1 and 2.2 hold. Consider [math]\hat{w}_0 \in \mathbb{R}^d[/math]. For [math]p \ge 2[/math] and [math]b \ge 3[/math], set the constant learning rate [math]\eta[/math] and the update frequency [math]m[/math] as functions of [math]b[/math] and [math]p[/math] (as in Theorem 26), set [math]k_0 = \max\{400\kappa^2 b^{2p+3}, 100\}[/math], and let [math]k_s = b\,k_{s-1}[/math]. Denote [math]N_s \overset{def}{=} \sum_{t=0}^{s-1}(k_t + m)[/math] (an upper bound on the number of samples drawn up to the end of stage [math]s[/math]). Let [math]\hat{w}_{N_s}[/math] be the parameter returned at iteration [math]s[/math] by Algorithm 1, and let [math]\epsilon_0 = P(\hat{w}_0) - P(w_\star)[/math]. Then [math]\mathbb{E}[P(\hat{w}_{N_s}) - P(w_\star)][/math] is at most [math](1 + O(1/b))\,\sigma^2/N_s[/math], plus a correction term of the same order multiplied by a factor that vanishes as [math]N_s[/math] grows (this is what removes the factor [math]\alpha[/math] of Corollary 4), plus a term depending on [math]\epsilon_0[/math] that decays polynomially in [math]N_s[/math] with degree controlled by [math]p[/math]; see Theorem 26 for the precise statement.

Remark 6 (Implementation and parallelization) Note that Algorithm 1 is simple to implement and requires little space. In each iteration, the space usage is linear in the size of a single sample (along with needing to count to [math]k_s[/math] and [math]m[/math]). Furthermore, the algorithm is easily parallelizable once we have run enough stages. In both Theorem 25 and Theorem 26, as [math]s[/math] increases, [math]k_s[/math] grows geometrically, whereas [math]m[/math] remains constant. Hence, the majority of the computation time is spent averaging the gradients, i.e. computing (9), which is easily parallelizable.
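A minimal sketch of the parallelization pattern described in this remark: the batch average (9) splits into independent partial sums computed by separate worker processes. For self-containedness, each worker here draws its own synthetic least-squares samples (an assumption of the sketch); in a streaming deployment the [math]k_s[/math] stream samples would instead be sharded across the workers.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# The least-squares gradient below is an illustrative stand-in for draws from D.
D_DIM, NOISE_STD = 5, 0.5
W_STAR = np.arange(D_DIM, dtype=float)        # fixed synthetic generator (assumption)

def partial_grad_sum(args):
    """Worker: draw `n` fresh samples and return the sum of grad psi_i(w_hat) over them."""
    seed, n, w_hat = args
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, D_DIM))
    y = x @ W_STAR + NOISE_STD * rng.standard_normal(n)
    return (-2.0 * (y - x @ w_hat)) @ x       # shape (d,): sum_i -2*(y_i - <w_hat, x_i>)*x_i

def parallel_batch_gradient(w_hat, k_s, num_workers=4):
    """Estimate of grad P(w_hat) as in (9), averaged over k_s samples split across workers."""
    chunk = k_s // num_workers
    jobs = [(seed, chunk, w_hat) for seed in range(num_workers)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        partial = sum(pool.map(partial_grad_sum, jobs))
    return partial / (chunk * num_workers)

if __name__ == "__main__":
    print(parallel_batch_gradient(np.zeros(D_DIM), k_s=40_000))
```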

Note that the constants in the parameter settings for the algorithm have not been optimized. Furthermore, we have not attempted to fully optimize the time it takes the algorithm to enter the second phase (in which self-concordance is relevant), and we conjecture that the algorithm in fact enjoys even better dependencies. Our emphasis is on an analysis that is flexible in that it allows for a variety of assumptions in driving the competitive ratio to 1 (as is done in the case of logistic regression in Section 3, where we use a slight variant of self-concordance).

Before providing statistical rates for the ERM, let us remark that the above achieves super-polynomial convergence rates and that the competitive ratio can be driven to 1 (recall that [math]\sigma^2/N[/math] is the rate of the ERM).

Remark 7 (Linear convergence and super-polynomial convergence) Suppose the ratio [math]\gamma[/math] between [math]P(\hat{w}_0) - P(w_\star)[/math] and [math]\sigma^2[/math] is known approximately (within a multiplicative factor). We can then let [math]k_s = k_0[/math] for [math]\log_b \gamma[/math] iterations, and then start increasing [math]k_s = b\,k_{s-1}[/math]. This way, in the first [math]\log_b \gamma[/math] iterations, [math]\mathbb{E}[P(\hat{w}_{N_s}) - P(w_\star)][/math] decreases geometrically. Furthermore, even without knowing the ratio [math]\gamma[/math], we can obtain a super-polynomial rate of convergence by setting the parameters as we specify in the next remark. (The dependence on the initial error will then be [math]2^{-\Omega\left((\log N/\log\log N)^2\right)}[/math].)

Remark 8 (Driving the ratio to 1) By choosing [math]b[/math] sufficiently large, the competitive ratio (3) can be made close to 1 (on every problem). Furthermore, we can ensure this constant goes to 1 by altering the parameter choices adaptively: let [math]k_s = 4^s (s!)\, k_0[/math], and let [math]\eta_s = \eta/2^s[/math], [math]m_s = m \cdot 4^s[/math]. Intuitively, [math]k_s[/math] grows so fast that [math]\lim_{s\to\infty} k_s/N_s = 1[/math]; [math]\eta_s[/math] and [math]m_s[/math] also change fast enough that the initial error vanishes very quickly.
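For concreteness, a small helper (hypothetical, with placeholder numbers) that generates the two parameter schedules just discussed: the geometric batch growth [math]k_s = b\,k_{s-1}[/math] used by Corollaries 4 and 5, and the faster-growing adaptive choice of Remark 8.

```python
import math

def geometric_schedule(k0, b, num_stages):
    """Batch sizes k_s = b * k_{s-1}, the growth used by Corollaries 4 and 5."""
    return [k0 * b**s for s in range(num_stages)]

def adaptive_schedule(k0, eta, m, num_stages):
    """Per-stage (k_s, eta_s, m_s) following the adaptive choice sketched in Remark 8:
    k_s = 4^s * s! * k0, eta_s = eta / 2^s, m_s = m * 4^s (batches grow super-geometrically,
    so the final stage's batch dominates the total sample count)."""
    return [(4**s * math.factorial(s) * k0, eta / 2**s, m * 4**s) for s in range(num_stages)]

print(geometric_schedule(k0=100, b=3, num_stages=5))      # [100, 300, 900, 2700, 8100]
print(adaptive_schedule(k0=100, eta=0.05, m=50, num_stages=4))
```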

2.3. Competing with the ERM

Now we provide a finite-sample characterization of the rate of convergence of the ERM under regularity conditions. This essentially gives the numerator of (3), allowing us to compare the rate of the ERM against the rate achieved by Streaming SVRG. We provide the more general result in Theorem 31; this section focuses on a corollary.

In the following, we constrain the domain to [math]S[/math]; the ERM, as defined in (2), is taken over this restricted set. Further discussion appears in Theorem 31 and the comments thereafter.

Corollary 9 (of Theorem 31) Suppose [math]\psi_1, \psi_2, \ldots, \psi_N[/math] are an independently drawn sample from [math]\mathcal{D}[/math]. Assume the following regularity conditions hold; see Theorem 31 for weaker conditions.

  1. [math]S[/math] is compact.
  2. [math]\psi[/math] is convex (with probability one).
  3. [math]w_\star[/math] is an interior point of [math]S[/math], and [math]\nabla^2 P(w_\star)[/math] exists and is positive definite.
  4. (Smoothness) The first, second, and third derivatives of [math]\psi[/math] exist and are uniformly bounded on [math]S[/math].

Then, for the ERM [math]\hat{w}^{ERM}_N[/math] (as defined in (2)), we have

[math]\displaystyle{ \lim_{N\to\infty}\dfrac{\mathbb{E}[P(\hat{w}^{ERM}_N) - P(w_\star)]}{\sigma^2/N} = 1. }[/math]

In particular, the following lower and upper bounds hold. With problem-dependent constants [math]C_0[/math] and [math]C_1[/math] (polynomial in the relevant quantities, as specified in Theorem 31), we have for all [math]p \ge 2[/math]: if [math]N[/math] satisfies the sample-size condition governed by [math]C_0[/math] in Theorem 31, then

[math]\displaystyle{ \left(1 - C_1\sqrt{\tfrac{p\log(dN)}{N}}\right)\frac{\sigma^2}{N} \;\le\; \mathbb{E}\big[P(\hat{w}^{ERM}_N) - P(w_\star)\big] \;\le\; \left(1 + C_1\sqrt{\tfrac{p\log(dN)}{N}}\right)\frac{\sigma^2}{N} + O\!\left(\frac{1}{N^p}\right), }[/math]

where the [math]O(\cdot)[/math] term hides problem-dependent constants (see Theorem 31).

3. Applications: one pass learning and generalization

We now provide a few applications to standard statistical models. For least-squares regression, we also instantiate upper and lower bounds for the ERM. The applications in this section can be extended to include generalized linear models, some M-estimation problems, and other loss functions (e.g. the Huber loss).

3.1. Linear least-squares regression

In linear regression, the goal is to minimize the (possibly [math]\ell_2[/math]-regularized) squared loss [math]\psi_{X,Y}(w) = (Y - w^T X)^2 + \lambda\|w\|^2_2[/math] for a random data point [math](X, Y) \in \mathbb{R}^d \times \mathbb{R}[/math]. The objective (1) is

[math]\displaystyle{ P(w) = \mathbb{E}_{X,Y}\left[(Y - w^T X)^2\right] + \lambda\|w\|^2_2. \quad }[/math] (11)
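A short sketch for this objective (assuming Gaussian features with identity covariance, purely for illustration): it writes down the per-sample gradient of the ridge loss and numerically checks the first-order condition [math]\nabla P(w_\star) = 0[/math] at the closed-form regularized minimizer [math]w_\star = (\Sigma + \lambda I)^{-1}\mathbb{E}[YX][/math].

```python
import numpy as np

# Sketch for objective (11), assuming Gaussian features with identity covariance
# (an illustrative choice, not from the paper). With Sigma = E[X X^T] and c = E[Y X],
# the regularized minimizer is w_star = (Sigma + lam*I)^{-1} c; we check grad P(w_star) = 0.
rng = np.random.default_rng(0)
d, lam, noise_std = 5, 0.1, 0.5
w_gen = rng.standard_normal(d)                       # parameter generating the data

def grad_psi(w, x_i, y_i):
    """Per-sample gradient of psi_{X,Y}(w) = (y - <w, x>)^2 + lam * ||w||_2^2."""
    return -2.0 * (y_i - w @ x_i) * x_i + 2.0 * lam * w

Sigma, c = np.eye(d), w_gen                          # E[X X^T] = I and E[Y X] = Sigma @ w_gen
w_star = np.linalg.solve(Sigma + lam * np.eye(d), c) # regularized population minimizer

n = 50_000
x = rng.standard_normal((n, d))
y = x @ w_gen + noise_std * rng.standard_normal(n)
avg_grad = np.mean([grad_psi(w_star, xi, yi) for xi, yi in zip(x, y)], axis=0)
print(np.linalg.norm(avg_grad))                      # small: Monte Carlo estimate of ||grad P(w_star)||
```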

3.1.1. Upper Bound For The Algorithm

Using that [math]\alpha = 1[/math], the following corollary shows that Algorithm 1 achieves the rate of the ERM.

Corollary 10 (Least-squares performance of streaming SVRG) Suppose that [math]\|X\|^2_2 \le L[/math] (with probability one). Using the parameter settings of Corollary 4, taking [math]\kappa = L/(\lambda + \lambda_{min}(\Sigma))[/math], and supposing the sample size [math]N[/math] exceeds the threshold required by Corollary 4, the expected excess risk [math]\mathbb{E}[P(\hat{w}_N) - P(w_\star)][/math] is at most [math](1 + O(1/b))\,\sigma^2/N[/math] plus a term proportional to the initial error [math]P(\hat{w}_0) - P(w_\star)[/math] that decays polynomially in [math]N/\kappa[/math] with degree controlled by [math]p[/math].

Remark 11 (When [math]N \le \kappa[/math]) If the sample size is less than [math]\kappa[/math] and [math]\lambda = 0[/math], there exist distributions on [math]X[/math] for which the ERM is not unique (as the sample matrix [math]\frac{1}{N}\sum_i X_i X_i^T[/math] will not be invertible, with reasonable probability, on these distributions by construction).

Remark 12 (When do the streaming SVRG bounds become meaningful?) Algorithm 1 is competitive with the performance of the ERM when the sample size [math]N[/math] is slightly larger than a constant times [math]\kappa[/math]. In particular, as the sample size [math]N[/math] grows larger than [math]\kappa[/math], the initial error is decreased at an arbitrary polynomial rate in [math]N/\kappa[/math].

Let us consider a few special cases. First, consider the unregularized setting where [math]\lambda = 0[/math]. Assume also that the least-squares problem is well-specified: that is, [math]Y = w_\star^T X + \eta[/math], where [math]\mathbb{E}[\eta] = 0[/math] and [math]\mathbb{E}[\eta^2] = \sigma^2_{noise}[/math]. Define [math]\Sigma = \mathbb{E}[XX^T][/math]. Here, we have

[math]\displaystyle{ \sigma^2 = \mathbb{E}\left[\|\eta X\|^2_{\Sigma^{-1}}\right] = d\,\sigma^2_{noise}. }[/math]

In other words, Corollary 10 recovers the classical rate in this case. In the mis-specified case, where we do not assume the aforementioned model is correct (i.e. [math]\mathbb{E}[Y|X][/math] may not equal [math]w_\star^T X[/math]), define [math]\bar{Y}(X) = w_\star^T X[/math], and we have

[math]\displaystyle{ \sigma^2 = \mathbb{E}\left[(Y - \bar{Y}(X))^2\|X\|^2_{\Sigma^{-1}}\right] = \mathbb{E}\left[(Y - \mathbb{E}[Y|X])^2\|X\|^2_{\Sigma^{-1}}\right] + \mathbb{E}\left[(\mathbb{E}[Y|X] - \bar{Y}(X))^2\|X\|^2_{\Sigma^{-1}}\right] = \mathbb{E}\left[\mathrm{var}(Y|X)\,\|X\|^2_{\Sigma^{-1}}\right] + \mathbb{E}\left[\mathrm{bias}(X)^2\,\|X\|^2_{\Sigma^{-1}}\right], }[/math]

where the last equality exposes the effects of the approximation error: [math]\mathrm{var}(Y|X) \overset{def}{=} \mathbb{E}[(Y - \mathbb{E}[Y|X])^2 \mid X][/math] and [math]\mathrm{bias}(X) \overset{def}{=} \mathbb{E}[Y|X] - \bar{Y}(X)[/math]. In the regularized setting (a.k.a. ridge regression), also not necessarily well-specified, we have

[math]\displaystyle{ \sigma^2 = \mathbb{E}\left[\left\|(Y - \bar{Y}(X))X - \lambda w_\star\right\|^2_{(\Sigma+\lambda I)^{-1}}\right]. \quad }[/math] (12)

3.1.2. Statistical Upper And Lower Bounds

For comparison, the following corollary (of Theorem 31) provides lower and upper bounds for the statistical rate of the ERM.

Corollary 13 (Least-squares ERM bounds) Suppose that [math]\|X\|^2_{(\Sigma+\lambda I)^{-1}} \le \tilde{\kappa}[/math] holds with probability one, and that the dimension is [math]d[/math] (in the infinite-dimensional setting, we may take [math]d[/math] to be the intrinsic dimension, as per Remark 32). Let [math]c[/math] be an appropriately chosen universal constant. For all [math]p \gt 0[/math], if [math]\tfrac{\tilde{\kappa}\,p\log(dN)}{N} \le c[/math], then

[math]\displaystyle{ \mathbb{E}[P(\hat{w}^{ERM}_N) - P(w_\star)] \;\ge\; \left(1 - c\sqrt{\tfrac{\tilde{\kappa}\,p\log(dN)}{N}}\right)\frac{\sigma^2}{N} - O\!\left(\frac{1}{N^p}\right), }[/math]

where the [math]O(\cdot)[/math] term depends on the moments of [math]Z = \|\nabla\psi(w_\star)\|_{(\nabla^2 P(w_\star))^{-1}} = \|(Y - w_\star^T X)X - \lambda w_\star\|_{(\Sigma+\lambda I)^{-1}}[/math]. For an upper bound, we have two cases:

  * (Unregularized case) Suppose [math]\lambda = 0[/math]. Assume that we constrain the ERM to lie in some compact set [math]S[/math] (and suppose [math]w_\star \in S[/math]). Then for all [math]p \gt 0[/math], if [math]\tfrac{\tilde{\kappa}\,p\log(dN)}{N} \le c[/math], we have

[math]\displaystyle{ \mathbb{E}[P(\hat{w}^{ERM}_N) - P(w_\star)] \;\le\; \left(1 + c\sqrt{\tfrac{\tilde{\kappa}\,p\log(dN)}{N}}\right)\frac{\sigma^2}{N} + \frac{\max_{w\in S}\big(P(w) - P(w_\star)\big)}{N^p}. }[/math]

  * (Regularized case) Suppose [math]\lambda \gt 0[/math]. Then for all [math]p \gt 0[/math], if [math]\tfrac{\tilde{\kappa}\,p\log(dN)}{N} \le c[/math], we have

[math]\displaystyle{ \mathbb{E}[P(\hat{w}^{ERM}_N) - P(w_\star)] \;\le\; \left(1 + c\sqrt{\tfrac{\lambda_{max}(\Sigma+\lambda I)}{\lambda}\cdot\tfrac{\tilde{\kappa}\,p\log(dN)}{N}}\right)\frac{\sigma^2}{N} + O\!\left(\frac{1}{N^p}\right) }[/math]

(this last bound follows from a modification of the argument in Equation (38)).

Remark 14 (ERM comparisons) Interestingly, for the upper bound (when [math]\lambda = 0[/math]), we see no way to avoid constraining the ERM to lie in some compact set; this allows us to bound the loss [math]P[/math] in the event of some extremely low-probability failure (see Theorem 31). The ERM upper bound has a term comparable to the initial error of our algorithm. In contrast, the lower bound is for the usual unconstrained least-squares estimator.

3.2. Logistic regression

In (binary) logistic regression, we have a distribution on [math](X, Y) \in \mathbb{R}^d \times \{0, 1\}[/math]. For any [math]w[/math], define

[math]\displaystyle{ \mathbb{P}(Y = y \mid w, X) \overset{def}{=} \frac{\exp(y X^T w)}{1 + \exp(X^T w)} }[/math]

for [math]X \in \mathbb{R}^d[/math] and [math]y \in \{0, 1\}[/math]. We do not assume the best-fit model [math]w_\star[/math] is correct. The loss function is taken to be the regularized negative log-likelihood [math]\psi_{X,Y}(w) = -\log \mathbb{P}(Y \mid w, X) + \lambda\|w\|^2_2[/math], and the objective (1) instantiates as the expected negative (regularized) log-likelihood [math]P(w) = \mathbb{E}[-\log \mathbb{P}(Y \mid w, X)] + \lambda\|w\|^2_2[/math]. Define [math]\bar{Y}(X) = \mathbb{P}(Y = 1 \mid w_\star, X)[/math] and [math]\bar{\Sigma} = \nabla^2 P(w_\star) = \mathbb{E}[\bar{Y}(X)(1 - \bar{Y}(X))XX^T] + \lambda I[/math]. Analogous to the least-squares case, we can interpret [math]\bar{Y}(X)[/math] as the conditional expectation of [math]Y[/math] under the (possibly mis-specified) best-fit model. With this notation, [math]\sigma^2[/math] is similar to its instantiation under regularized least squares (Equation (12)):

[math]\displaystyle{ \sigma^2 = \mathbb{E}\left[\tfrac{1}{2}\left\|(Y - \bar{Y}(X))X - \lambda w_\star\right\|^2_{\bar{\Sigma}^{-1}}\right]. }[/math]

Under this definition of 0'2, by Corollary 5 together with the following defined quantities, the single-pass estimator of Algorithm 1 achieves a rate competitive with that of the ERM:

Corollary 15 (Logistic regression performance) Suppose that [math]\|X\|^2_2 \le L[/math] (with probability one). Under the parameters of Corollary 5, taking [math]\kappa = L/\lambda[/math] and taking [math]M[/math] according to the self-concordance bound for the logistic loss (see Lemma 17), and denoting [math]\epsilon_0 = P(\hat{w}_0) - P(w_\star)[/math], the single-pass estimator of Algorithm 1 achieves a rate competitive with that of the ERM: [math]\mathbb{E}[P(\hat{w}_N) - P(w_\star)][/math] is at most [math](1 + O(1/b))\,\sigma^2/N[/math], plus correction terms, depending on [math]\sqrt{\epsilon_0}[/math] and [math]\epsilon_0[/math], that decay polynomially in [math]N[/math] with degree controlled by [math]p[/math] (the bound has the same form as in Corollary 5).

The corollary uses Lemma 17, a straightforward lemma handling self-concordance for logistic regression, which is included for completeness. See Bach (2010) for techniques for analyzing the self-concordance of logistic regression.
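To connect this instantiation back to the algorithm, the sketch below (with an arbitrary synthetic data-generating model, an assumption of the sketch) implements the regularized logistic loss gradient, i.e. the per-sample gradient oracle one could plug into the Streaming SVRG sketch given earlier.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_psi(w, x, y, lam):
    """Per-sample gradient of the regularized logistic loss
    psi_{X,Y}(w) = -log P(Y = y | w, X = x) + lam * ||w||_2^2, for y in {0, 1}."""
    return (sigmoid(x @ w) - y) * x + 2.0 * lam * w

# Illustrative synthetic data-generating model (an assumption of this sketch).
rng = np.random.default_rng(0)
d, lam = 5, 0.01
w_true = rng.standard_normal(d)

def sample_grad():
    """One-sample gradient oracle for psi ~ D, as consumed by the Streaming SVRG sketch."""
    x = rng.standard_normal(d)
    y = float(rng.random() < sigmoid(x @ w_true))
    return lambda w: grad_psi(w, x, y, lam)

g = sample_grad()
print(g(np.zeros(d)))    # one stochastic gradient at w = 0
```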

Acknowledgments

We thank the COLT reviewers for their detailed reviews and helpful comments, and we thank Jonathan Kelner, Yin Tat Lee, and Boaz Barak for helpful discussion. Part of this work was done while AS was visiting the Simons Institute for the Theory of Computing, UC Berkeley. This work was partially supported by NSF awards 0843915 and 1111109, NSF Graduate Research Fellowship (grant no. 1122374).

References

Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford (2015). "Competing with the Empirical Risk Minimizer in a Single Pass." In: Proceedings of the 2015 Conference on Learning Theory (COLT 2015).