2018 VarianceReducedStochasticGradie

From GM-RKB

Subject Headings: STRSAGA, Stochastic Gradient Descent, Streaming Stochastic Variance Reduced Gradient.

Notes

Cited By

Quotes

Abstract

We present an algorithm STRSAGA for efficiently maintaining a machine learning model over data points that arrive over time, quickly updating the model as new training data is observed. We present a competitive analysis comparing the suboptimality of the model maintained by STRSAGA with that of an offline algorithm that is given the entire data beforehand, and analyze the risk-competitiveness of STRSAGA under different arrival patterns. Our theoretical and experimental results show that the risk of STRSAGA is comparable to that of offline algorithms on a variety of input arrival patterns, and its experimental performance is significantly better than prior algorithms suited for streaming data, such as SGD and SSVRG.

1 Introduction

2 Related work

Stochastic Gradient Descent (SGD) [RM51] and its extensions are used extensively in practice for learning from large datasets. While an iteration of SGD is cheap relative to an iteration of a full gradient method, the variance of its gradient estimates can be high. To control the variance, the learning rate of SGD must decay over time, resulting in a sublinear convergence rate. Newer variance-reduced versions of SGD, on the other hand, achieve linear convergence on strongly convex objective functions, generally by incorporating a correction term in each update step that approximates a full gradient, while still ensuring each iteration is efficient like SGD.
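To make the contrast concrete, the following minimal Python sketch (our illustration, not from the paper; it assumes a simple least-squares loss) places a plain SGD step next to an SVRG-style variance-reduced step, where a correction term built from a snapshot iterate stands in for the full gradient:

import numpy as np

def sgd_step(w, x_i, y_i, lr):
    # Plain SGD on a least-squares example: the noisy per-point gradient alone
    # drives the update, so lr must decay over time to control its variance.
    grad_i = (x_i @ w - y_i) * x_i
    return w - lr * grad_i

def svrg_step(w, w_snap, full_grad_snap, x_i, y_i, lr):
    # SVRG-style step: correct the per-point gradient at w using the same
    # point's gradient at a snapshot w_snap and the full gradient at that
    # snapshot. The correction keeps the update unbiased while shrinking its
    # variance, which is what allows a constant learning rate.
    grad_i = (x_i @ w - y_i) * x_i
    grad_i_snap = (x_i @ w_snap - y_i) * x_i
    return w - lr * (grad_i - grad_i_snap + full_grad_snap)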

SAG [RSB12] was among the first variance reduction methods proposed and achieves a linear convergence rate for smooth and strongly convex problems. SAG requires storage of the last gradient computed for each data point and uses their average in each update. SAGA [DBLJ14] improves on SAG by eliminating a bias in the update. Stochastic Variance-Reduced Gradient (SVRG) [JZ13] is another variance reduction method that does not store the computed gradients, but periodically computes a full-data gradient, requiring more computation than SAGA. Semi-Stochastic Gradient Descent (S2GD) [KR13] differs from SVRG by computing a random stochastic gradient at each iteration and a full gradient occasionally; the gap between two full gradient computations is determined by a geometric law. CHEAPSVRG [SAKS16] is another variant of SVRG. In contrast with SVRG, it estimates the full gradient by computing gradients on a subset of the training data points rather than all of them. However, all of the above variance-reduced methods require [math]\displaystyle{ O(n \log n) }[/math] iterations to guarantee convergence to statistical accuracy (to yield a good fit to the underlying data) for [math]\displaystyle{ n }[/math] data points. DYNASAGA [DLH16] achieves statistical accuracy in only [math]\displaystyle{ O(n) }[/math] iterations by using a gradually increasing sample set and running SAGA on it.
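As an illustration of the bookkeeping these methods rely on, here is a hedged Python sketch of a SAGA-style loop for least squares (the loss and names are our own illustration, not the authors' code): a table holds the last gradient computed for each data point, and each update combines the fresh gradient, the stored one, and the running table average.

import numpy as np

def saga(X, y, lr, num_steps, seed=0):
    # SAGA-style loop for least squares (illustrative sketch only).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grad_table = np.zeros((n, d))   # last gradient computed for each data point
    grad_avg = np.zeros(d)          # running average of the stored gradients
    for _ in range(num_steps):
        i = rng.integers(n)
        g_new = (X[i] @ w - y[i]) * X[i]
        # g_new - grad_table[i] + grad_avg is an unbiased, variance-reduced
        # estimate of the full gradient at w.
        w -= lr * (g_new - grad_table[i] + grad_avg)
        grad_avg += (g_new - grad_table[i]) / n
        grad_table[i] = g_new
    return w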

So far, all the algorithms we have discussed are offline, and assume the entire dataset is available beforehand. Streaming SVRG (SSVRG) [FGKS15] is an algorithm that handles streaming data arrivals, processing them in a single pass over the data using limited memory. In our experimental study, we found STRSAGA to be significantly more accurate than SSVRG. Further, our analysis of STRSAGA shows that it handles arrival distributions that allow for burstiness in the stream, while SSVRG is not suited for this case. In many practical situations, requiring a streaming data algorithm to use only limited memory is overly restrictive and, as our results show, leads to worse accuracy.
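For context on this streaming setting, the sketch below shows one way (our illustration, not the paper's pseudocode; saga_step is a hypothetical helper such as the SAGA step above) a STRSAGA-like scheme could interleave newly arrived points with variance-reduced steps over a growing sample set; the actual step schedule and its analysis are given in Sections 4 and 5.

def process_stream(model, saga_step, arriving_batches, steps_per_point=2):
    # Illustrative streaming loop: fold each arriving batch into the effective
    # sample set, then spend a fixed budget of variance-reduced (SAGA-style)
    # steps over the enlarged set before the next arrival.
    sample_set = []
    for batch in arriving_batches:
        sample_set.extend(batch)
        for _ in range(steps_per_point * len(batch)):
            model = saga_step(model, sample_set)
    return model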

3 Model and preliminaries

4 STRSAGA: gradient descent over streaming data

5 Competitive analysis of STRSAGA on specific arrival distribution

6 Experimental results

7 Conclusion

References

  1. [FGKS15] - Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. (2015). “Competing with the Empirical Risk Minimizer in a Single Pass.” In: Proceedings of the Conference on Learning Theory (COLT 2015).


Author: Ellango Jothimurugesan, Ashraf Tahmasbi, Phillip B. Gibbons, Srikanta Tirthapura, Rong Ge, Sham M. Kakade, Roy Frostig, Aaron Sidford
Title: Variance-Reduced Stochastic Gradient Descent on Streaming Data
Year: 2018