2011 AdaptiveSubgradientMethodsforOnlineLearningandStochasticOptimization

From GM-RKB

Subject Headings:

Notes

Cited By

Quotes

Abstract

We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.

1 Introduction

In many applications of online and stochastic learning, the input instances are of very high dimension, yet within any particular instance only a few features are non-zero. It is often the case, however, that the infrequently occurring features are highly informative and discriminative. The informativeness of rare features has led practitioners to craft domain-specific feature weightings, such as TF-IDF (Salton and Buckley, 1988), which pre-emphasize infrequently occurring features. We use this old idea as a motivation for applying modern learning-theoretic techniques to the problem of online and stochastic learning, focusing specifically on (sub)gradient methods.

Standard stochastic subgradient methods largely follow a predetermined procedural scheme that is oblivious to the characteristics of the data being observed. In contrast, our algorithms dynamically incorporate knowledge of the geometry of the data from earlier iterations to perform more informative gradient-based learning. Informally, our procedures associate frequently occurring features with low learning rates and infrequent features with high learning rates. This construction prompts the learner to “take notice” each time an infrequent feature is observed. Thus, the adaptation facilitates identification and adaptation of highly predictive but comparatively rare features.

1.1 The Adaptive Gradient Algorithm

For simplicity, consider the basic online convex optimization setting. The algorithm iteratively makes a prediction [math]\displaystyle{ x_t ∈ \mathcal{X} }[/math], where [math]\displaystyle{ \mathcal{X} ⊆ \mathbb{R}^d }[/math] is a closed convex set, and then receives a convex loss function [math]\displaystyle{ f_t }[/math]. Define the regret with respect to the (optimal) predictor [math]\displaystyle{ x^∗ \in \mathcal{X} }[/math] as

[math]\displaystyle{ R(T) \triangleq \sum_{t=1}^{T}\lbrack f_t(x_t) − f_t(x^∗) \rbrack . }[/math]

... Before introducing our adaptive gradient algorithm, which we term ADAGRAD, we establish notation. Vectors and scalars are lower case italic letters, such as [math]x \in \mathcal{X}[/math]. We denote a sequence of vectors by subscripts, that is, [math]x_t, x_{t+1}, \ldots[/math], and entries of each vector by an additional subscript, for example, [math]x_{t,j}[/math]. The subdifferential set of a function [math]f[/math] evaluated at [math]x[/math] is denoted [math]\partial f(x)[/math], and a particular vector in the subdifferential set is denoted by [math]f'(x) \in \partial f(x)[/math] or [math]g_t \in \partial f_t(x_t)[/math]. When a function is differentiable, we write [math]\nabla f(x)[/math]. We use [math]\langle x, y \rangle[/math] to denote the inner product between [math]x[/math] and [math]y[/math]. The Bregman divergence associated with a strongly convex and differentiable function [math]\psi[/math] is

(...)

We also make frequent use of the following two matrices. Let [math]g_{1:t} = [g_1 \cdots g_t][/math] denote the matrix obtained by concatenating the subgradient sequence. We denote the ith row of this matrix, which amounts to the concatenation of the ith component of each subgradient we observe, by [math]g_{1:t,i}[/math]. We also define the outer product matrix [math]G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top[/math].

Online learning and stochastic optimization are closely related and basically interchangeable (Cesa-Bianchi et al., 2004). In order to keep our presentation simple, we confine our discussion and algorithmic descriptions to the online setting with the regret bound model. In online learning, the learner repeatedly predicts a point [math]x_t \in \mathcal{X} \subseteq \mathbb{R}^d[/math], which often represents a weight vector assigning importance values to various features. The learner's goal is to achieve low regret with respect to a static predictor [math]x^*[/math] in the (closed) convex set [math]\mathcal{X} \subseteq \mathbb{R}^d[/math] (possibly [math]\mathcal{X} = \mathbb{R}^d[/math]) on a sequence of functions [math]f_t(x)[/math], measured as

(...) At every timestep [math]t[/math], the learner receives the (sub)gradient information [math]g_t \in \partial f_t(x_t)[/math]. Standard subgradient algorithms then move the predictor [math]x_t[/math] in the opposite direction of [math]g_t[/math] while maintaining [math]x_{t+1} \in \mathcal{X}[/math] via the projected gradient update (e.g., Zinkevich, 2003)

(...)

In contrast, let the Mahalanobis norm be [math]\|\cdot\|_A = \sqrt{\langle \cdot, A\,\cdot\rangle}[/math] and denote the projection of a point [math]y[/math] onto [math]\mathcal{X}[/math] according to [math]A[/math] by [math]\Pi_{\mathcal{X}}^{A}(y) = \arg\min_{x\in\mathcal{X}} \|x - y\|_A = \arg\min_{x\in\mathcal{X}} \langle x - y, A(x - y)\rangle[/math]. Using this notation, our generalization of standard gradient descent employs the update

(...)

The above algorithm is computationally impractical in high dimensions since it requires computation of the root of the matrix [math]G_t[/math], the outer product matrix. Thus we specialize the update to

(...)

Both the inverse and root of diag(Gt) can be computed in linear time. Moreover, as we discuss later, when the gradient vectors are sparse the update above can often be performed in time proportional to the support of the gradient. We now elaborate and give a more formal discussion of our setting.
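To make the diagonal specialization concrete, the following is a minimal NumPy sketch of one round of the update, assuming a box constraint for the projection and a user-chosen step size; the names (`adagrad_diag_step`, `G_diag`, `subgradient_of_loss`) are illustrative and not from the paper.

<pre>
import numpy as np

def adagrad_diag_step(x, g, G_diag, eta, delta=0.0, box=1.0):
    """One round of diagonal ADAGRAD: accumulate squared gradient entries,
    rescale the gradient per coordinate by diag(G_t)^{-1/2}, then project
    back onto the feasible set (here a box {x : ||x||_inf <= box})."""
    G_diag = G_diag + g * g                 # running sum of g_{t,i}^2
    H = delta + np.sqrt(G_diag)             # per-coordinate scaling
    x = x - eta * g / np.where(H > 0, H, 1.0)
    return np.clip(x, -box, box), G_diag    # clipping = projection for a box

# Hypothetical usage, with subgradient_of_loss supplied by the application:
#   x, G = np.zeros(d), np.zeros(d)
#   each round: g = subgradient_of_loss(x); x, G = adagrad_diag_step(x, g, G, eta=0.1)
</pre>

Note that the paper's projection is taken with respect to the adaptive Mahalanobis norm; for a box constraint that projection decomposes per coordinate and coincides with the clip used above, but for general domains it does not.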

In this paper we consider several different online learning algorithms and their stochastic convex optimization counterparts. Formally, we consider online learning with a sequence of composite functions [math]\phi_t[/math]. Each function is of the form [math]\phi_t(x) = f_t(x) + \varphi(x)[/math] where [math]f_t[/math] and [math]\varphi[/math] are (closed) convex functions. In the learning settings we study, [math]f_t[/math] is either an instantaneous loss or a stochastic estimate of the objective function in an optimization task. The function [math]\varphi[/math] serves as a fixed regularization function and is typically used to control the complexity of [math]x[/math]. At each round the algorithm makes a prediction [math]x_t \in \mathcal{X}[/math] and then receives the function [math]f_t[/math]. We define the regret with respect to the fixed (optimal) predictor [math]x^*[/math] as

(...)

Our goal is to devise algorithms which are guaranteed to suffer asymptotically sub-linear regret, namely, [math]R_\phi(T) = o(T)[/math].

Our analysis applies to related, yet different, methods for minimizing the regret (2). The first is Nesterov's primal-dual subgradient method (2009), and in particular Xiao's (2010) extension, regularized dual averaging, and the follow-the-regularized-leader (FTRL) family of algorithms (see for instance Kalai and Vempala, 2003; Hazan et al., 2006). In the primal-dual subgradient method the algorithm makes a prediction [math]x_t[/math] on round [math]t[/math] using the average gradient [math]\bar{g}_t = \frac{1}{t}\sum_{\tau=1}^{t} g_\tau[/math]. The update encompasses a trade-off between a gradient-dependent linear term, the regularizer [math]\varphi[/math], and a strongly-convex term [math]\psi_t[/math] for well-conditioned predictions. Here [math]\psi_t[/math] is the proximal term. The update amounts to solving

(...)

where [math]\eta[/math] is a fixed step-size and [math]x_1 = \arg\min_{x\in\mathcal{X}} \varphi(x)[/math]. The second method similarly has numerous names, including proximal gradient, forward-backward splitting, and composite mirror descent (Tseng, 2008; Duchi et al., 2010). We use the term composite mirror descent. The composite mirror descent method employs a more immediate trade-off between the current gradient [math]g_t[/math], [math]\varphi[/math], and staying close to [math]x_t[/math] using the proximal function [math]\psi_t[/math],

(...)

Our work focuses on temporal adaptation of the proximal function in a data driven way, while previous work simply sets [math]\psi_t \equiv \psi[/math], [math]\psi_t(\cdot) = \sqrt{t}\,\psi(\cdot)[/math], or [math]\psi_t(\cdot) = t\,\psi(\cdot)[/math] for some fixed [math]\psi[/math].

We provide formal analyses equally applicable to the above two updates and show how to automatically choose the function [math]\psi_t[/math] so as to achieve asymptotically small regret. We describe and analyze two algorithms. Both algorithms use squared Mahalanobis norms as their proximal functions, setting [math]\psi_t(x) = \langle x, H_t x\rangle[/math] for a symmetric matrix [math]H_t \succeq 0[/math]. The first uses diagonal matrices while the second constructs full dimensional matrices. Concretely, for some small fixed [math]\delta \geq 0[/math] (specified later, though in practice [math]\delta[/math] can be set to 0) we set

(...)

Plugging the appropriate matrix from the above equation into [math]\psi_t[/math] in (3) or (4) gives rise to our ADAGRAD family of algorithms. Informally, we obtain algorithms which are similar to second-order gradient descent by constructing approximations to the Hessian of the functions [math]f_t[/math], though we use roots of the matrices.
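As a rough illustration of the two choices, the sketch below builds the diagonal and full-matrix versions of [math]H_t[/math] from a batch of observed gradients; the eigendecomposition route to the matrix square root is one standard option and an assumption here, not necessarily the paper's implementation.

<pre>
import numpy as np

def proximal_matrices(grads, delta=0.0):
    """grads: array of shape (t, d) whose rows are the gradients g_1..g_t.
    Returns the diagonal and full-matrix versions of H_t = delta*I + G_t^{1/2},
    where G_t = sum_tau g_tau g_tau^T."""
    G = grads.T @ grads                        # outer product matrix G_t
    d = G.shape[0]
    H_diag = delta + np.sqrt(np.diag(G))       # delta + diag(G_t)^{1/2}
    evals, evecs = np.linalg.eigh(G)           # symmetric PSD eigendecomposition
    G_root = (evecs * np.sqrt(np.maximum(evals, 0.0))) @ evecs.T
    H_full = delta * np.eye(d) + G_root        # delta*I + G_t^{1/2}
    return H_diag, H_full
</pre>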

1.2 Outline of Results

We now outline our results, deferring formal statements of the theorems to later sections. Recall the definitions of [math]g_{1:t}[/math] as the matrix of concatenated subgradients and [math]G_t[/math] as the outer product matrix from the prequel. The ADAGRAD algorithm with full matrix divergences entertains bounds of the form

(...)

We further show that

(...)

These results are formally given in Theorem 7 and its corollaries. When our proximal function is [math]\psi_t(x) = \langle x, \mathrm{diag}(G_t)^{1/2} x\rangle[/math], we have bounds attainable in time at most linear in the dimension [math]d[/math] of our problems, of the form

(...)

Similar to the above, we will show that

(...)

We formally state the above two regret bounds in Theorem 5 and its corollaries.

Following are a simple example and corollary to Theorem 5 to illustrate one regime in which we expect substantial improvements (see also the next subsection). Let [math]\varphi \equiv 0[/math] and consider Zinkevich's online gradient descent algorithm. Given a compact convex set [math]\mathcal{X} \subseteq \mathbb{R}^d[/math] and a sequence of convex functions [math]f_t[/math], Zinkevich's algorithm makes the sequence of predictions [math]x_1, \ldots, x_T[/math] with [math]x_{t+1} = \Pi_{\mathcal{X}}(x_t - (\eta/\sqrt{t})\, g_t)[/math]. If the diameter of [math]\mathcal{X}[/math] is bounded, thus [math]\sup_{x,y\in\mathcal{X}} \|x - y\|_2 \leq D_2[/math], then online gradient descent, with the optimal choice in hindsight for the stepsize [math]\eta[/math] (see the bound (7) in Section 1.4), achieves a regret bound of

(...)

When [math]\mathcal{X}[/math] is bounded via [math]\sup_{x,y\in\mathcal{X}} \|x - y\|_\infty \leq D_\infty[/math], the following corollary is a simple consequence of our Theorem 5.

Corollary 1 Let the sequence [math]\{x_t\} \subset \mathbb{R}^d[/math] be generated by the update (4) and assume that [math]\max_t \|x^* - x_t\|_\infty \leq D_\infty[/math]. Using stepsize [math]\eta = D_\infty/\sqrt{2}[/math], for any [math]x^*[/math], the following bound holds.

(...)

The important feature of the bound above is the infimum under the square root, which allows us to perform better than simply using the identity matrix, and the fact that the stepsize is easy to set a priori. For example, if the set [math]\mathcal{X} = \{x : \|x\|_\infty \leq 1\}[/math], then [math]D_2 = 2\sqrt{d}[/math] while [math]D_\infty = 2[/math], which suggests that if we are learning a dense predictor over a box, the adaptive method should perform well. Indeed, in this case we are guaranteed that the bound in Corollary 1 is better than (6) as the identity matrix belongs to the set over which we take the infimum.

To conclude the outline of results, we would like to point to two relevant research papers. First, Zinkevich's regret bound is tight and cannot be improved in a minimax sense (Abernethy et al., 2008). Therefore, improving the regret bound requires further reasonable assumptions on the input space. Second, in an independent work, performed concurrently with the research presented in this paper, McMahan and Streeter (2010) study competitive ratios, showing guaranteed improvements of the above bounds relative to families of online algorithms.

1.3 Improvements and Motivating Example

As mentioned in the prequel, we expect our adaptive methods to outperform standard online learning methods when the gradient vectors are sparse. We give empirical evidence supporting the improved performance of the adaptive methods in Section 6. Here we give a few abstract examples that show that for sparse data (input sequences where [math]g_t[/math] has many zeros) the adaptive methods herein have better performance than non-adaptive methods. In our examples we use the hinge loss, that is,

(...)

where [math]y_t[/math] is the label of example [math]t[/math] and [math]z_t \in \mathbb{R}^d[/math] is the data vector.

For our first example, which was also given by McMahan and Streeter (2010), consider the following sparse random data scenario, where the vectors [math]z_t \in \{-1, 0, 1\}^d[/math]. Assume that in each round [math]t[/math], feature [math]i[/math] appears with probability [math]p_i = \min\{1, c\,i^{-\alpha}\}[/math] for some [math]\alpha \in (1, \infty)[/math] and a dimension-independent constant [math]c[/math]. Then taking the expectation of the gradient terms in the bound in Corollary 1, we have

(...)

by Jensen's inequality. In the rightmost sum, we have [math]c\sum_{i=1}^{d} i^{-\alpha/2} = O(\log d)[/math] for [math]\alpha \geq 2[/math], and [math]\sum_{i=1}^{d} i^{-\alpha/2} = O(d^{1-\alpha/2})[/math] for [math]\alpha \in (1, 2)[/math]. If the domain [math]\mathcal{X}[/math] is a hypercube, say [math]\mathcal{X} = \{x : \|x\|_\infty \leq 1\}[/math], then in Corollary 1 [math]D_\infty = 2[/math], and the regret of ADAGRAD is [math]O(\max\{\log d, d^{1-\alpha/2}\}\sqrt{T})[/math]. For contrast, the standard regret bound (6) for online gradient descent has [math]D_2 = 2\sqrt{d}[/math] and [math]\|g_t\|_2 \geq 1[/math], yielding best case regret [math]O(\sqrt{dT})[/math]. So we see that in this sparse yet heavy tailed feature setting, ADAGRAD's regret guarantee can be exponentially smaller in the dimension [math]d[/math] than the non-adaptive regret bound.

Our remaining examples construct a sparse sequence for which there is a perfect predictor that the adaptive methods learn after [math]d[/math] iterations, while standard online gradient descent (Zinkevich, 2003) suffers significantly higher loss. We assume the domain [math]\mathcal{X}[/math] is compact, so that for online gradient descent we set [math]\eta_t = \eta/\sqrt{t}[/math], which gives the optimal [math]O(\sqrt{T})[/math] regret (the setting of [math]\eta[/math] does not matter to the adversary we construct).

1.3.1 DIAGONAL ADAPTATION

Consider the diagonal version of our proposed update (4) with [math]\mathcal{X} = \{x : \|x\|_\infty \leq 1\}[/math]. Evidently, we can take [math]D_\infty = 2[/math], and this choice simply results in the update [math]x_{t+1} = x_t - \sqrt{2}\,\mathrm{diag}(G_t)^{-1/2} g_t[/math] followed by projection onto [math]\mathcal{X}[/math] for ADAGRAD (we use a pseudo-inverse if the inverse does not exist). Let [math]e_i[/math] denote the ith unit basis vector, and assume that for each [math]t[/math], [math]z_t = \pm e_i[/math] for some [math]i[/math]. Also let [math]y_t = \mathrm{sign}(\langle \mathbf{1}, z_t\rangle)[/math] so that there exists a perfect classifier [math]x^* = \mathbf{1} \in \mathcal{X} \subset \mathbb{R}^d[/math]. We initialize [math]x_1[/math] to be the zero vector. Fix some [math]\varepsilon > 0[/math], and on rounds [math]t = 1, \ldots, \eta^2/\varepsilon^2[/math], set [math]z_t = e_1[/math]. After these rounds, simply choose [math]z_t = \pm e_i[/math] for index [math]i \in \{2, \ldots, d\}[/math] chosen at random. It is clear that the update to parameter [math]x_i[/math] at these iterations is different, and amounts to

(...)

(Here [math][\cdot]_{[-1,1]^d}[/math] denotes the truncation of the vector to [math][-1,1]^d[/math].) In particular, after suffering [math]d - 1[/math] more losses, ADAGRAD has a perfect classifier. However, on the remaining iterations gradient descent has [math]\eta/\sqrt{t} \leq \varepsilon[/math] and thus evidently suffers loss at least [math]d/(2\varepsilon)[/math]. Of course, for small [math]\varepsilon[/math], we have [math]d/(2\varepsilon) \gg d[/math]. In short, ADAGRAD achieves constant regret per dimension while online gradient descent can suffer arbitrary loss (for unbounded [math]t[/math]). It seems quite silly, then, to use a global learning rate rather than one for each feature.

Full Matrix Adaptation. We use a similar construction to the diagonal case to show a situation in which the full matrix update from (5) gives substantially lower regret than stochastic gradient descent. For full divergences we set [math]\mathcal{X} = \{x : \|x\|_2 \leq \sqrt{d}\}[/math]. Let [math]V = [v_1 \cdots v_d] \in \mathbb{R}^{d\times d}[/math] be an orthonormal matrix. Instead of having [math]z_t[/math] cycle through the unit vectors, we make [math]z_t[/math] cycle through the [math]v_i[/math] so that [math]z_t = \pm v_i[/math]. We let the label [math]y_t = \mathrm{sign}(\langle \mathbf{1}, V^\top z_t\rangle) = \mathrm{sign}\left(\sum_{i=1}^{d}\langle v_i, z_t\rangle\right)[/math]. We provide an elaborated explanation in Appendix A. Intuitively, with [math]\psi_t(x) = \langle x, H_t x\rangle[/math] and [math]H_t[/math] set to be the full matrix from (5), ADAGRAD again needs to observe each orthonormal vector [math]v_i[/math] only once while stochastic gradient descent's loss can be made [math]\Omega(d/\varepsilon)[/math] for any [math]\varepsilon > 0[/math].

1.4 Related Work

Many successful algorithms have been developed over the past few years to minimize regret in the online learning setting. A modern view of these algorithms casts the problem as the task of following the (regularized) leader (see Rakhlin, 2009, and the references therein), or FTRL in short. Informally, FTRL methods choose the best decision in hindsight at every iteration. Verbatim usage of the FTRL approach fails to achieve low regret; however, adding a proximal term to the past predictions leads to numerous low regret algorithms (Kalai and Vempala, 2003; Hazan and Kale, 2008; Rakhlin, 2009). The proximal term strongly affects the performance of the learning algorithm. Therefore, adapting the proximal function to the characteristics of the problem at hand is desirable.

Our approach is thus motivated by two goals. The first is to generalize the agnostic online learning paradigm to the meta-task of specializing an algorithm to fit a particular data set. Specifically, we change the proximal function to achieve performance guarantees which are competitive with the best proximal term found in hindsight. The second, as alluded to earlier, is to automatically adjust the learning rates for online learning and stochastic gradient descent on a per-feature basis. The latter can be very useful when our gradient vectors [math]g_t[/math] are sparse, for example, in a classification setting where examples may have only a small number of non-zero features. As we demonstrated in the examples above, it is rather deficient to employ exactly the same learning rate for a feature seen hundreds of times and for a feature seen only once or twice.

Our techniques stem from a variety of research directions, and as a byproduct we also extend a few well-known algorithms. In particular, we consider variants of the follow-the-regularized-leader (FTRL) algorithms mentioned above, which are kin to Zinkevich's lazy projection algorithm. We use Xiao's recently analyzed regularized dual averaging (RDA) algorithm (2010), which builds upon Nesterov's (2009) primal-dual subgradient method. We also consider forward-backward splitting (FOBOS) (Duchi and Singer, 2009) and its composite mirror-descent (proximal gradient) generalizations (Tseng, 2008; Duchi et al., 2010), which in turn include as special cases projected gradients (Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010) considered efficient and robust methods for stochastic optimization, especially in the case when the expected objective [math]f[/math] is smooth. It may be interesting to investigate adaptive metric approaches in smooth stochastic optimization.

The idea of adapting first order optimization methods is by no means new and can be traced back at least to the 1970s with the work on space dilation methods of Shor (1972) and variable metric methods, such as the BFGS family of algorithms (e.g., Fletcher, 1970). This prior work often assumed that the function to be minimized was differentiable and, to our knowledge, did not consider stochastic, online, or composite optimization. In her thesis, Nedic (2002) studied variable metric subgradient methods, though it seems difficult to derive explicit rates of convergence from the results there, and the algorithms apply only when the constraint set [math]\mathcal{X} = \mathbb{R}^d[/math]. More recently, Bordes et al. (2009) proposed a quasi-Newton stochastic gradient-descent procedure, which is similar in spirit to our methods. However, their convergence results assume a smooth objective with positive definite Hessian bounded away from 0. Our results apply more generally.

Prior to the analysis presented in this paper for online and stochastic optimization, the strongly convex function [math]\psi[/math] in the update equations (3) and (4) either remained intact or was simply multiplied by a time-dependent scalar throughout the run of the algorithm. Zinkevich's projected gradient, for example, uses [math]\psi_t(x) = \|x\|_2^2[/math], while RDA (Xiao, 2010) employs [math]\psi_t(x) = \sqrt{t}\,\psi(x)[/math] where [math]\psi[/math] is a strongly convex function. The bounds for both types of algorithms are similar, and both rely on the norm [math]\|\cdot\|[/math] (and its associated dual [math]\|\cdot\|_*[/math]) with respect to which [math]\psi[/math] is strongly convex. Mirror-descent type first order algorithms, such as projected gradient methods, attain regret bounds of the form (Zinkevich, 2003; Bartlett et al., 2007; Duchi et al., 2010)

(...)

Choosing [math]\eta \propto 1/\sqrt{T}[/math] gives [math]R_\phi(T) = O(\sqrt{T})[/math]. When [math]B_\psi(x, x^*)[/math] is bounded for all [math]x \in \mathcal{X}[/math], we choose step sizes [math]\eta_t \propto 1/\sqrt{t}[/math], which is equivalent to setting [math]\psi_t(x) = \sqrt{t}\,\psi(x)[/math]. Therefore, no assumption on the time horizon is necessary. For RDA and follow-the-leader algorithms, the bounds are similar (Xiao, 2010, Theorem 3):

(...)

The problem of adapting to data and obtaining tighter data-dependent bounds for algorithms such as those above is a natural one and has been studied in the mistake-bound setting for online learning in the past. A framework that is somewhat related to ours is the confidence weighted learning scheme by Crammer et al. (2008) and the adaptive regularization of weights algorithm (AROW) of Crammer et al. (2009). These papers provide mistake-bound analyses for second-order algorithms, which in turn are similar in spirit to the second-order Perceptron algorithm (Cesa-Bianchi et al., 2005). The analyses by Crammer and colleagues, however, yield mistake bounds dependent on the runs of the individual algorithms and are thus difficult to compare with our regret bounds.

AROW maintains a mean prediction vector [math]\mu_t \in \mathbb{R}^d[/math] and a covariance matrix [math]\Sigma_t \in \mathbb{R}^{d\times d}[/math] over [math]\mu_t[/math] as well. At every step of the algorithm, the learner receives a pair [math](z_t, y_t)[/math] where [math]z_t \in \mathbb{R}^d[/math] is the tth example and [math]y_t \in \{-1, +1\}[/math] is the label. Whenever the predictor [math]\mu_t[/math] attains a margin value smaller than 1, AROW performs the update

(...)

In the above scheme, one can force [math]\Sigma_t[/math] to be diagonal, which reduces the run-time and storage requirements of the algorithm but still gives good performance (Crammer et al., 2009). In contrast to AROW, the ADAGRAD algorithm uses the root of the inverse covariance matrix, a consequence of our formal analysis. Crammer et al.'s algorithm and our algorithms have similar run times, generally linear in the dimension [math]d[/math], when using diagonal matrices. However, when using full matrices the runtime of the AROW algorithm is [math]O(d^2)[/math], which is faster than ours, as our algorithm requires computing the root of a matrix.

In concurrent work, McMahan and Streeter (2010) propose and analyze an algorithm which is very similar to some of the algorithms presented in this paper. Our analysis builds on recent advances in online learning and stochastic optimization (Duchi et al., 2010; Xiao, 2010), whereas McMahan and Streeter use first principles to derive their regret bounds. As a consequence of our approach, we are able to apply our analysis to algorithms for composite minimization with a known additional objective term [math]\varphi[/math]. We are also able to generalize and analyze both the mirror descent and dual-averaging family of algorithms. McMahan and Streeter focus on what they term the competitive ratio, which is the ratio of the worst case regret of the adaptive algorithm to the worst case regret of a non-adaptive algorithm with the best proximal term [math]\psi[/math] chosen in hindsight. We touch on this issue briefly in the sequel, but refer the interested reader to McMahan and Streeter (2010) for this alternative elegant perspective. We believe that both analyses shed insights into the problems studied in this paper and complement each other.

There are also other lines of work on adaptive gradient methods that are not directly related to our work but nonetheless relevant. Tighter regret bounds using the variation of the cost functions [math]f_t[/math] were proposed by Cesa-Bianchi et al. (2007) and derived by Hazan and Kale (2008). Bartlett et al. (2007) explore another adaptation technique for [math]\eta_t[/math] where they adapt the step size to accommodate both strongly and weakly convex functions. Our approach differs from previous approaches as it does not focus on a particular loss function or mistake bound. Instead, we view the problem of adapting the proximal function as a meta-learning problem. We then obtain a bound comparable to the bound obtained using the best proximal function chosen in hindsight.

2. Adaptive Proximal Functions

Examining the bounds (7) and (8), we see that most of the regret depends on dual norms of [math]f_t'(x_t)[/math], and the dual norms in turn depend on the choice of [math]\psi[/math]. This naturally leads to the question of whether we can modify the proximal term [math]\psi[/math] along the run of the algorithm in order to lower the contribution of the aforementioned norms. We achieve this goal by keeping second order information about the sequence [math]f_t[/math] and allowing [math]\psi[/math] to vary on each round of the algorithms.

We begin by providing two corollaries based on previous work that give the regret of our base algorithms when the proximal function [math]\psi_t[/math] is allowed to change. These corollaries are used in the sequel in our regret analysis. We assume that [math]\psi_t[/math] is monotonically non-decreasing, that is, [math]\psi_{t+1}(x) \geq \psi_t(x)[/math]. We also assume that [math]\psi_t[/math] is 1-strongly convex with respect to a time-dependent semi-norm [math]\|\cdot\|_{\psi_t}[/math]. Formally, [math]\psi[/math] is 1-strongly convex with respect to [math]\|\cdot\|_\psi[/math] if

(...)

Strong convexity is guaranteed if and only if [math]B_{\psi_t}(x, y) \geq \frac{1}{2}\|x - y\|_{\psi_t}^2[/math]. We also denote the dual norm of [math]\|\cdot\|_{\psi_t}[/math] by [math]\|\cdot\|_{\psi_t^*}[/math]. For completeness, we provide the proofs of the following two results in Appendix F, as they build straightforwardly on work by Duchi et al. (2010) and Xiao (2010). For the primal-dual subgradient update, the following bound holds.

Proposition 2 Let the sequence [math]\{x_t\}[/math] be defined by the update (3). For any [math]x^* \in \mathcal{X}[/math],

(...)

For composite mirror descent algorithms a similar result holds.

Proposition 3 Let the sequence [math]\{x_t\}[/math] be defined by the update (4). Assume w.l.o.g. that [math]\varphi(x_1) = 0[/math]. For any [math]x^* \in \mathcal{X}[/math],

(...)

The above corollaries allow us to prove regret bounds for a family of algorithms that iteratively modify the proximal functions [math]\psi_t[/math] in an attempt to lower the regret bounds.

(...)

Primal-Dual Subgradient Update (3):

(...)

Composite Mirror Descent Update (4):

(...)

Figure 1: ADAGRAD with diagonal matrices

3. Diagonal Matrix Proximal Functions

We begin by restricting ourselves to using diagonal matrices to define matrix proximal functions and (semi)norms. This restriction serves a two-fold purpose. First, the analysis for the general case is somewhat complicated and thus the analysis of the diagonal restriction serves as a proxy for better understanding. Second, in problems with high dimension where we expect this type of modification to help, maintaining more complicated proximal functions is likely to be prohibitively expensive. Whereas earlier analysis requires a learning rate to slow changes between predictors [math]x_t[/math] and [math]x_{t+1}[/math], we will instead automatically grow the proximal function we use to achieve asymptotically low regret. To remind the reader, [math]g_{1:t,i}[/math] is the ith row of the matrix obtained by concatenating the subgradients from iteration 1 through [math]t[/math] in the online algorithm.

To provide some intuition for the algorithm we show in Algorithm 1, let us examine the problem

(...) This problem is solved by setting [math]s_i = \|g_{1:T,i}\|_2[/math] and scaling [math]s[/math] so that [math]\langle s, \mathbf{1}\rangle = c[/math]. To see this, we can write the Lagrangian of the minimization problem by introducing multipliers [math]\lambda \succeq 0[/math] and [math]\theta \geq 0[/math] to get

(...)

Taking partial derivatives to find the infimum of [math]L[/math], we see that [math]-\|g_{1:T,i}\|_2^2/s_i^2 - \lambda_i + \theta = 0[/math], and complementarity conditions on [math]\lambda_i s_i[/math] (Boyd and Vandenberghe, 2004) imply that [math]\lambda_i = 0[/math]. Thus we have

(...)

As a final note, we can plug [math]s_i[/math] into the objective above to see

(...)

Let [math]\mathrm{diag}(v)[/math] denote the diagonal matrix with diagonal [math]v[/math]. It is natural to suspect that for [math]s[/math] achieving the infimum in Equation (12), if we use a proximal function similar to [math]\psi(x) = \langle x, \mathrm{diag}(s)\,x\rangle[/math] with associated squared dual norm [math]\|x\|_{\psi^*}^2 = \langle x, \mathrm{diag}(s)^{-1} x\rangle[/math], we should do well lowering the gradient terms in the regret bounds (10) and (11).
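A small sketch of the closed-form scaling just derived, assuming the gradients are stacked as rows of a matrix; the function name and the returned objective value are illustrative only.

<pre>
import numpy as np

def optimal_diag_scaling(grads, c=1.0):
    """Minimize sum_i ||g_{1:T,i}||_2^2 / s_i subject to s >= 0, <1, s> <= c:
    set s_i proportional to ||g_{1:T,i}||_2 and rescale so the entries sum to c."""
    col_norms = np.linalg.norm(grads, axis=0)       # ||g_{1:T,i}||_2 per feature i
    s = c * col_norms / col_norms.sum()
    # objective at the optimum equals (sum_i ||g_{1:T,i}||_2)^2 / c
    objective = np.sum(col_norms ** 2 / np.where(s > 0, s, np.inf))
    return s, objective
</pre>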

To prove a regret bound for our Algorithm 1, we note that both types of updates suffer losses that include a term depending solely on the gradients obtained along their run. The following lemma is applicable to both updates, and was originally proved by Auer and Gentile (2000), though we provide a proof in Appendix C. McMahan and Streeter (2010) also give an identical lemma.

Lemma 4 Let [math]g_t = f_t'(x_t)[/math] and [math]g_{1:t}[/math] and [math]s_t[/math] be defined as in Algorithm 1. Then

(...) To obtain a regret bound, we need to consider the terms consisting of the dual norm of the subgradient in the regret bounds (10) and (11), which is [math]\|f_t'(x_t)\|_{\psi_t^*}^2[/math]. When [math]\psi_t(x) = \langle x, (\delta I + \mathrm{diag}(s_t))\,x\rangle[/math], it is easy to see that the associated dual norm is

(...)

From the definition of [math]s_t[/math] in Algorithm 1, we clearly have [math]\|f_t'(x_t)\|_{\psi_t^*}^2 \leq \langle g_t, \mathrm{diag}(s_t)^{-1} g_t\rangle[/math]. Note that if [math]s_{t,i} = 0[/math] then [math]g_{t,i} = 0[/math] by definition of [math]s_{t,i}[/math]. Thus, for any [math]\delta \geq 0[/math], Lemma 4 implies

(...)

To obtain a bound for a primal-dual subgradient method, we set [math]\delta \geq \max_t \|g_t\|_\infty[/math], which gives

(...)

It remains to bound the various Bregman divergence terms for Corollary 3 and the term [math]\psi_T(x^*)[/math] for Corollary 2. We focus first on the composite mirror-descent update. Examining the bound (11) and Algorithm 1, we notice that

(...)

[math]\leq \langle g_t, \mathrm{diag}(s_t)^{-1} g_t\rangle[/math], and we follow the same lines of reasoning to achieve the inequality

(...) Since (...) We also have

(...)

Combining the above arguments with Corollaries 2 and 3, and using (14) with the fact that [math]B_{\psi_1}(x^*, x_1) \leq \frac{1}{2}\|x^* - x_1\|_\infty^2\,\langle \mathbf{1}, s_1\rangle[/math], we have proved the following theorem.

Theorem 5 Let the sequence [math]\{x_t\}[/math] be defined by Algorithm 1. For [math]x_t[/math] generated using the primal-dual subgradient update (3) with [math]\delta \geq \max_t \|g_t\|_\infty[/math], for any [math]x^* \in \mathcal{X}[/math],

(...)

For [math]x_t[/math] generated using the composite mirror-descent update (4), for any [math]x^* \in \mathcal{X}[/math],

(...)

The above theorem is a bit unwieldy. We thus perform a few algebraic simplifications to get the next corollary, which has a more intuitive form. Let us assume that [math]\mathcal{X}[/math] is compact and set [math]D_\infty = \sup_{x\in\mathcal{X}} \|x - x^*\|_\infty[/math]. Furthermore, define

(...)

Also w.l.o.g. let [math]0 \in \mathcal{X}[/math]. The following corollary is immediate (this is equivalent to Corollary 1, though we have moved the [math]\sqrt{d}[/math] term in the earlier bound).

Corollary 6 Assume that [math]D_\infty[/math] and [math]\gamma_T[/math] are defined as above. For [math]\{x_t\}[/math] generated by Algorithm 1 using the primal-dual subgradient update (3) with [math]\eta = \|x^*\|_\infty[/math], for any [math]x^* \in \mathcal{X}[/math] we have

(...)

Using the composite mirror descent update (4) to generate [math]\{x_t\}[/math] and setting [math]\eta = D_\infty/\sqrt{2}[/math], we have

(...)

We now give a short derivation of Corollary 1 from the introduction: use Theorem 5, Corollary 6,

and the fact that

(...)

as in (12) in the beginning of Section 3. Plugging the [math]\gamma_T[/math] term in from Corollary 6 and multiplying [math]D_\infty[/math] by [math]\sqrt{2}[/math] completes the proof of the corollary.

(...)

As discussed in the introduction, Algorithm 1 should have lower regret than non-adaptive algorithms on sparse data, though this depends on the geometry of the underlying optimization space [math]\mathcal{X}[/math]. For example, suppose that our learning problem is a logistic regression with 0/1-valued features. Then the gradient terms are likewise based on 0/1-valued features and sparse, so the gradient terms in the bound [math]\sum_{i=1}^{d} \|g_{1:T,i}\|_2[/math] should all be much smaller than [math]\sqrt{T}[/math]. If some features appear much more frequently than others, then the infimal representation of [math]\gamma_T[/math] and the infimal equality in Corollary 1 show that we have significantly lower regret by using higher learning rates for infrequent features and lower learning rates on commonly appearing features. Further, if the optimal predictor is relatively dense, as is often the case in prediction problems with sparse inputs, then [math]\|x^*\|_\infty[/math] is the best p-norm we can have in the regret.

More precisely, McMahan and Streeter (2010) show that if [math]\mathcal{X}[/math] is contained within an ℓ∞ ball of radius [math]R[/math] and contains an ℓ∞ ball of radius [math]r[/math], then the bound in the above corollary is within a factor of [math]\sqrt{2}\,R/r[/math] of the regret of the best diagonal proximal matrix, chosen in hindsight. So, for example, if [math]\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_p \leq C\}[/math], then [math]R/r = d^{1/p}[/math], which shows that the domain [math]\mathcal{X}[/math] does affect the guarantees we can give on the optimality of ADAGRAD.

4. Full Matrix Proximal Functions

In this section we derive and analyze new updates when we estimate a full matrix for the divergence [math]\psi_t[/math] instead of a diagonal one. In this generalized case, we use the root of the matrix of outer products of the gradients that we have observed to update our parameters. As in the diagonal case, we build on intuition garnered from an optimization problem, and in particular, we seek a matrix [math]S[/math] which is the solution to the following minimization problem:

(...)

The solution is obtained by defining [math]G_T = \sum_{t=1}^{T} g_t g_t^\top[/math] and setting [math]S[/math] to be a normalized version of the root of [math]G_T[/math], that is, [math]S = c\,G_T^{1/2}/\mathrm{tr}(G_T^{1/2})[/math]. For a proof, see Lemma 15 in Appendix E, which also

shows that when [math]G_T[/math] is not full rank we can instead use its pseudo-inverse. If we iteratively use divergences of the form [math]\psi_t(x) = \langle x, G_t^{1/2} x\rangle[/math], we might expect, as in the diagonal case, to attain low

regret by collecting gradient information. We achieve our low regret goal by employing a similar doubling lemma to Lemma 4 and bounding the gradient norm terms. The resulting algorithm is given in Algorithm 2, and the next theorem provides a quantitative analysis of the brief motivation above.

Theorem 7 Let [math]G_t[/math] be the outer product matrix defined above and the sequence [math]\{x_t\}[/math] be defined by Algorithm 2. For [math]x_t[/math] generated using the primal-dual subgradient update of (3) and [math]\delta \geq \max_t \|g_t\|_2[/math], for any [math]x^* \in \mathcal{X}[/math],

(...)

(...)

Figure 2: ADAGRAD with full matrices

Proof To begin, we consider the difference between the divergence terms at time [math]t+1[/math] and time [math]t[/math] from the regret bound (11) in Corollary 3. Let [math]\lambda_{\max}(M)[/math] denote the largest eigenvalue of a matrix [math]M[/math]. We have

(...)

For the last inequality we used the fact that the trace of a matrix is equal to the sum of its eigenvalues

along with the property [math]G_{t+1}^{1/2} \succeq G_t^{1/2} \succeq 0[/math] (see Lemma 13 in Appendix B), and therefore the trace difference [math]\mathrm{tr}(G_{t+1}^{1/2}) - \mathrm{tr}(G_t^{1/2})[/math] satisfies

(...) Now we use the fact that [math]G_1[/math] is a rank-1 PSD matrix with non-negative trace to see that

(...)

It remains to bound the gradient terms common to all our bounds. We use the following three lemmas, which are essentially directly applicable. We prove the first two in Appendix D.

Lemma 8 Let [math]B \succeq 0[/math] and [math]B^{-1/2}[/math] denote the root of the inverse of [math]B[/math] when [math]B \succ 0[/math] and the root of the pseudo-inverse of [math]B[/math] otherwise. For any [math]\nu[/math] such that [math]B - \nu g g^\top \succeq 0[/math] the following inequality holds.

(...)

Lemma 9 (...)

Lemma 10 Let [math]S_t = G_t^{1/2}[/math] be as defined in Algorithm 2 and [math]A^\dagger[/math] denote the pseudo-inverse of [math]A[/math]. Then (...) Proof We prove the lemma by induction. The base case is immediate, since we have

(...) Now, assume the lemma is true for [math]T - 1[/math], so from the inductive assumption we get

(...)

Since [math]S_{T-1}[/math] does not depend on [math]t[/math] we can rewrite [math]\sum_{t=1}^{T-1}\langle g_t, S_{T-1}^\dagger g_t\rangle[/math] as

(...)

where the right-most equality follows from the definitions of [math]S_t[/math] and [math]G_t[/math]. Therefore, we get

(...)

Using Lemma 8 with the substitution [math]B = G_T[/math], [math]\nu = 1[/math], and [math]g = g_T[/math] lets us exploit the concavity of the function [math]\mathrm{tr}(A^{1/2})[/math] to bound the above sum by [math]2\,\mathrm{tr}(G_T^{1/2})[/math]. ∎

We can now finalize our proof of the theorem. As in the diagonal case, we have that the squared dual norm (seminorm when [math]\delta = 0[/math]) associated with [math]\psi_t[/math] is

(...)

Thus it is clear that [math]\|g_t\|_{\psi_t^*}^2 \leq \langle g_t, S_t^\dagger g_t\rangle[/math]. For the dual-averaging algorithms, we use Lemma 9 above to show that [math]\|g_t\|_{\psi_{t-1}^*}^2 \leq \langle g_t, S_t^\dagger g_t\rangle[/math] so long as [math]\delta \geq \|g_t\|_2[/math]. Lemma 10's doubling inequality then implies

(...) for the mirror-descent and primal-dual subgradient algorithms, respectively.

To finish the proof, note that [math]B_{\psi_1}(x^*, x_1) \leq \frac{1}{2}\|x^* - x_1\|_2^2\,\mathrm{tr}(G_1^{1/2})[/math] when [math]\delta = 0[/math]. By combining this with the first of the bounds (17) and the bound (16) on [math]\sum_{t=1}^{T-1}\big[B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1})\big][/math], Corollary 3 gives the theorem's statement for the mirror-descent family of algorithms. Combining the fact that [math]\sum_{t=1}^{T}\|f_t'(x_t)\|_{\psi_{t-1}^*}^2 \leq 2\,\mathrm{tr}(G_T^{1/2})[/math] and the bound (16) with Corollary 2 gives the desired bound on [math]R_\phi(T)[/math] for the primal-dual subgradient algorithms, which completes the proof of the theorem. ∎

As before, we can give a corollary that simplifies the bound implied by Theorem 7. The infimal equality in the corollary uses Lemma 15 in Appendix B. The corollary underscores that for learning problems in which there is a rotation [math]U[/math] of the space for which the gradient vectors [math]g_t[/math] have small inner products [math]\langle g_t, U g_t\rangle[/math] (essentially a sparse basis for the [math]g_t[/math]), using full-matrix proximal functions can attain significantly lower regret.

Corollary 11 Assume that [math]\varphi(x_1) = 0[/math]. Then the regret of the sequence [math]\{x_t\}[/math] generated by Algorithm 2 when using the primal-dual subgradient update with [math]\eta = \|x^*\|_2[/math] is

(...) Let [math]\mathcal{X}[/math] be a compact set so that [math]\sup_{x\in\mathcal{X}} \|x - x^*\|_2 \leq D[/math]. Taking [math]\eta = D/\sqrt{2}[/math] and using the composite mirror descent update with [math]\delta = 0[/math], we have

(...)

5. Derived Algorithms

In this section, we derive updates using concrete regularization functions [math]\varphi[/math] and settings of the domain [math]\mathcal{X}[/math] for the ADAGRAD framework. We focus on showing how to solve Equations (3) and (4) with the diagonal matrix version of the algorithms we have presented. We focus on the diagonal case for two reasons. First, the updates often take closed form in this case and carry some intuition. Second, the diagonal case is feasible to implement in very high dimensions, whereas the full matrix version is likely to be confined to a few thousand dimensions. We also discuss how to efficiently compute the updates when the gradient vectors are sparse.

We begin by noting a simple but useful fact. Let [math]G_t[/math] denote either the outer product matrix of gradients or its diagonal counterpart and let [math]H_t = \delta I + G_t^{1/2}[/math], as usual. Simple algebraic manipulations yield that each of the updates (3) and (4) in the prequel can be written in the following form (omitting the stepsize [math]\eta[/math]):

(...)

In particular, at time [math]t[/math] for the RDA update, we have [math]u = t\,\bar{g}_t[/math]. For the composite gradient update (4),

(...)

so that [math]u = \eta g_t - H_t x_t[/math]. We now derive algorithms for solving the general update (18). Since most of the derivations are known, we generally provide only the closed-form solutions or algorithms for the solutions in the remainder of the subsection, deferring detailed derivations to Appendix G for the interested reader.

5.1 ℓ1-regularization

We begin by considering how to solve the minimization problems necessary for Algorithm 1 with diagonal matrix divergences and [math]\varphi(x) = \lambda\|x\|_1[/math]. We consider the two updates we proposed and denote the ith diagonal element of the matrix [math]H_t = \delta I + \mathrm{diag}(s_t)[/math] from Algorithm 1 by [math]H_{t,ii} = \delta + \|g_{1:t,i}\|_2[/math]. For the primal-dual subgradient update, the solution to (3) amounts to the following simple update for [math]x_{t+1,i}[/math]:

(...)

Comparing the update (19) to the standard dual averaging update (Xiao, 2010), which is [math]x_{t+1,i} = \mathrm{sign}(-\bar{g}_{t,i})\,\eta\sqrt{t}\,\big[|\bar{g}_{t,i}| - \lambda\big]_+[/math], it is clear that the difference distills to the step size employed for each coordinate. Our generalization of RDA yields a dedicated step size for each coordinate inversely proportional to the time-based norm of the coordinate in the sequence of gradients. Due to the normalization by this term the step size scales linearly with [math]t[/math], so when [math]H_{t,ii}[/math] is small, gradient information on coordinate [math]i[/math] is quickly incorporated.

The composite mirror-descent update (4) has a similar form that essentially amounts to iterative shrinkage and thresholding, where the shrinkage differs per coordinate:

(...)

We compare the actual performance of the newly derived algorithms to previously studied versions in the next section.
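For concreteness, here is a rough sketch of the two coordinate-wise ℓ1 updates discussed above, written for a diagonal [math]H_t[/math]; since the exact update equations are omitted from the excerpt, the thresholded forms below are our reading of (19) and the shrinkage update and should be treated as illustrative rather than authoritative.

<pre>
import numpy as np

def adagrad_rda_l1_step(gbar, H, t, eta, lam):
    """Primal-dual (RDA-style) coordinate update: gbar is the running average
    gradient, H the diagonal of H_t; each coordinate gets a step size
    proportional to eta*t/H_ii and is soft-thresholded at lam."""
    return np.sign(-gbar) * (eta * t / H) * np.maximum(np.abs(gbar) - lam, 0.0)

def adagrad_fobos_l1_step(x, g, H, eta, lam):
    """Composite mirror-descent (FOBOS-style) coordinate update: take a
    per-coordinate gradient step, then shrink toward zero by eta*lam/H_ii."""
    v = x - eta * g / H
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam / H, 0.0)
</pre>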

For both updates it is clear that we can perform “lazy" computation when the gradient vectors are sparse, a frequently occurring setting when learning for instance from text corpora. Suppose that from time step [math]t_0[/math] through [math]t[/math], the ith component of the gradient is 0. Then we can evaluate the above updates on demand since [math]H_{t,ii}[/math] remains intact. For composite mirror-descent, at time [math]t[/math] when [math]x_{t,i}[/math] is needed, we update

(...)

Even simpler just-in-time evaluation can be performed for the primal-dual subgradient update. Here we need to keep an unnormalized version of the average [math]\bar{g}_t[/math]. Concretely, we keep track of [math]u_t = t\,\bar{g}_t = \sum_{\tau=1}^{t} g_\tau = u_{t-1} + g_t[/math], then use the update (19):

(...)

where [math]H_t[/math] can clearly be updated lazily in a similar fashion.
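A minimal sketch of this lazy bookkeeping for the primal-dual variant, assuming sparse gradients arrive as dictionaries and the thresholded update form sketched above; the class and attribute names are our own.

<pre>
import numpy as np
from collections import defaultdict

class LazyAdagradRDA:
    """Sparse ADAGRAD-RDA with l1 regularization: keep the unnormalized
    gradient sum u_t = sum_tau g_tau and the per-coordinate squared-gradient
    sums needed for H_{t,ii}; coordinates are materialized only on demand."""

    def __init__(self, eta, lam, delta=0.0):
        self.eta, self.lam, self.delta = eta, lam, delta
        self.u = defaultdict(float)    # running gradient sum per coordinate
        self.sq = defaultdict(float)   # running squared-gradient sum per coordinate
        self.t = 0

    def observe(self, sparse_grad):
        """sparse_grad: dict {coordinate index: gradient value} for this round."""
        self.t += 1
        for i, gi in sparse_grad.items():
            self.u[i] += gi
            self.sq[i] += gi * gi

    def coordinate(self, i):
        """Evaluate x_{t+1,i} on demand from the stored statistics."""
        if self.t == 0 or self.sq[i] == 0.0:
            return 0.0
        H_ii = self.delta + np.sqrt(self.sq[i])
        gbar_i = self.u[i] / self.t
        return np.sign(-gbar_i) * (self.eta * self.t / H_ii) * max(abs(gbar_i) - self.lam, 0.0)
</pre>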

5.2 ℓ1-ball Projections

We next consider the setting in which [math]\varphi \equiv 0[/math] and [math]\mathcal{X} = \{x : \|x\|_1 \leq c\}[/math], for which it is straightforward to adapt efficient solutions to continuous quadratic knapsack problems (Brucker, 1984). We

(...)

Figure 3: Project [math]v \succeq 0[/math] to [math]\{z : \langle a, z\rangle \leq c,\ z \succeq 0\}[/math].

use the matrix [math]H_t = \delta I + \mathrm{diag}(G_t)^{1/2}[/math] from Algorithm 1. We provide a brief derivation sketch and an [math]O(d\log d)[/math] algorithm in this section. First, we convert the problem (18) into a projection problem onto a scaled ℓ1-ball. By making the substitutions [math]z = H^{1/2}x[/math] and [math]A = H^{-1/2}[/math], it is clear that problem (18) is equivalent to

(...)

Now, by appropriate choice of [math]v[/math], namely [math]v = -\eta t H_t^{-1/2}\bar{g}_t[/math] for the primal-dual update (3) and [math]v = H_t^{1/2}x_t - \eta H_t^{-1/2}g_t[/math] for the mirror-descent update (4), we arrive at the problem

(...)

We can clearly recover [math]x_{t+1}[/math] from the solution [math]z^*[/math] of the projection (20) via [math]x_{t+1} = H_t^{-1/2}z^*[/math].

By the symmetry of the objective (20), we can assume without loss of generality that [math]v \succeq 0[/math] and constrain [math]z \succeq 0[/math]; a bit of manipulation with the Lagrangian (see Appendix G) for the problem shows that the solution [math]z^*[/math] has the form

(...)

for some [math]\theta^* \geq 0[/math]. The algorithm in Figure 3 constructs the optimal [math]\theta[/math] and returns [math]z^*[/math].
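The pseudocode of Figure 3 is not reproduced in this excerpt, but a standard sort-based routine consistent with the solution form [math]z_i^* = [v_i - \theta^* a_i]_+[/math] looks as follows; the names and the [math]O(d\log d)[/math] strategy are ours, assuming [math]v \succeq 0[/math], [math]a \succ 0[/math], and [math]c > 0[/math].

<pre>
import numpy as np

def project_weighted_l1(v, a, c):
    """Solve min_z 0.5*||z - v||_2^2  s.t.  <a, z> <= c, z >= 0.
    The solution has the form z_i = max(v_i - theta*a_i, 0) for some theta >= 0."""
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    if a @ v <= c:                         # constraint already satisfied
        return v.copy()
    ratios = v / a                         # coordinate i is active iff v_i/a_i > theta
    order = np.argsort(-ratios)            # sort by ratio, largest first
    csum_av = np.cumsum((a * v)[order])
    csum_aa = np.cumsum((a * a)[order])
    thetas = (csum_av - c) / csum_aa       # theta if exactly the top rho coords are active
    rho = np.nonzero(ratios[order] > thetas)[0][-1]
    theta = max(thetas[rho], 0.0)
    return np.maximum(v - theta * a, 0.0)
</pre>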

5.3 ℓ2 Regularization

We now turn to the case where [math]\varphi(x) = \lambda\|x\|_2[/math] while [math]\mathcal{X} = \mathbb{R}^d[/math]. This type of regularization is useful for zeroing multiple weights in a group, for example in multi-task or multiclass learning (Obozinski et al., 2007). Recalling the general proximal step (18), we must solve

(...)

There is no closed form solution for this problem, but we give an efficient bisection-based procedure for solving (21). We start by deriving the dual. Introducing a variable [math]z = x[/math], we get the equivalent problem of minimizing [math]\langle u, x\rangle + \frac{1}{2}\langle x, Hx\rangle + \lambda\|z\|_2[/math] subject to [math]x = z[/math]. With Lagrange multipliers [math]\alpha[/math] for the equality constraint, we obtain the Lagrangian

(...)

Figure 4: Minimize [math]\langle u, x\rangle + \frac{1}{2}\langle x, Hx\rangle + \lambda\|x\|_2[/math].

Taking the infimum of [math]L[/math] with respect to the primal variables [math]x[/math] and [math]z[/math], we see that the infimum is attained at [math]x = -H^{-1}(u + \alpha)[/math]. Coupled with the fact that [math]\inf_z \lambda\|z\|_2 - \langle \alpha, z\rangle = -\infty[/math] unless [math]\|\alpha\|_2 \leq \lambda[/math], in which case the infimum is 0, we arrive at the dual form

(...)

Setting [math]v = -H^{-1}u[/math], we further distill the dual to

(...)

We can solve problem (22) efficiently using a bisection search of its equivalent representation in Lagrange form,

(...)

where [math]\theta > 0[/math] is an unknown scalar. The solution to the latter as a function of [math]\theta[/math] is clearly [math]\alpha(\theta) = (H^{-1} + \theta I)^{-1}v = -(H^{-1} + \theta I)^{-1}H^{-1}u[/math]. Since [math]\|\alpha(\theta)\|_2[/math] is monotonically decreasing in [math]\theta[/math] (consider the eigen-decomposition of the positive definite [math]H^{-1}[/math]), we can simply perform a bisection search over [math]\theta[/math], checking at each point whether [math]\|\alpha(\theta)\|_2 \geq \lambda[/math].

To find initial upper and lower bounds on 0, we note that

(...)

where [math]\sigma_{\max}(H)[/math] denotes the maximum singular value of [math]H[/math] and [math]\sigma_{\min}(H)[/math] the minimum. To guarantee [math]\|\alpha(\theta_{\max})\|_2 \leq \lambda[/math], we thus set [math]\theta_{\max} = \|v\|_2/\lambda - 1/\sigma_{\max}(H)[/math]. Similarly, for [math]\theta_{\min}[/math] we see that so long as [math]\theta \leq \|v\|_2/\lambda - 1/\sigma_{\min}(H)[/math] we have [math]\|\alpha(\theta)\|_2 \geq \lambda[/math]. The fact that [math]\partial\|x\|_2 = \{z : \|z\|_2 \leq 1\}[/math] when [math]x = 0[/math] implies that the solution of the original problem (21) is [math]x = 0[/math] if and only if [math]\|u\|_2 \leq \lambda[/math]. We provide pseudocode for solving (21) in Algorithm 4.
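Algorithm 4 itself is not shown in this excerpt; the sketch below implements the bisection described above for a diagonal [math]H[/math] (the common ADAGRAD case), with the function name, tolerance, and iteration cap chosen for illustration.

<pre>
import numpy as np

def l2_regularized_prox(u, H_diag, lam, tol=1e-10, max_iter=100):
    """Approximately solve min_x <u,x> + 0.5*<x, H x> + lam*||x||_2 for
    diagonal H (entries H_diag > 0) via bisection on theta, following the
    dual derivation above: alpha(theta) = (H^{-1} + theta*I)^{-1} v."""
    u = np.asarray(u, dtype=float)
    h = np.asarray(H_diag, dtype=float)
    if np.linalg.norm(u) <= lam:                 # subgradient condition: x = 0
        return np.zeros_like(u)
    v = -u / h                                   # v = -H^{-1} u
    Hinv = 1.0 / h

    def alpha(theta):
        return v / (Hinv + theta)

    lo = max(np.linalg.norm(v) / lam - 1.0 / h.min(), 0.0)   # ||alpha(lo)|| >= lam
    hi = np.linalg.norm(v) / lam - 1.0 / h.max()             # ||alpha(hi)|| <= lam
    for _ in range(max_iter):
        theta = 0.5 * (lo + hi)
        if np.linalg.norm(alpha(theta)) >= lam:
            lo = theta
        else:
            hi = theta
        if hi - lo < tol:
            break
    return -(u + alpha(0.5 * (lo + hi))) / h      # x = -H^{-1}(u + alpha)
</pre>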

5.4 ℓ∞ Regularization

We again let [math]\mathcal{X} = \mathbb{R}^d[/math] but now choose [math]\varphi(x) = \lambda\|x\|_\infty[/math]. This type of update, similarly to ℓ2, zeroes groups of variables, which is handy in finding structurally sparse solutions for multitask or multiclass problems. Solving the ℓ∞ regularized problem amounts to

(...)

The dual of this problem is a modified ℓ1-projection problem. As in the case of ℓ2 regularization, we introduce an equality constrained variable [math]z = x[/math] with associated Lagrange multipliers [math]\alpha \in \mathbb{R}^d[/math] to obtain

(...)

Performing identical manipulations to the ℓ2 case, we take derivatives and get that [math]x = -H^{-1}(u + \alpha)[/math] and, similarly, unless [math]\|\alpha\|_1 \leq \lambda[/math], [math]\inf_z L(x, z, \alpha) = -\infty[/math]. Thus the dual problem for (23) is

(...)

When [math]H[/math] is diagonal we can find the optimal [math]\alpha^*[/math] using the generalized ℓ1-projection in Algorithm 3, then reconstruct the optimal [math]x[/math] via [math]x = -H^{-1}(u + \alpha^*)[/math].

5.5 Mixed-norm Regularization

Finally, we combine the above results to show how to solve problems with matrix-valued inputs [math]X \in \mathbb{R}^{d\times k}[/math], where [math]X = [x_1 \cdots x_d]^\top[/math]. We consider mixed-norm regularization, which is very useful for encouraging sparsity across several tasks (Obozinski et al., 2007). Now [math]\varphi[/math] is an ℓ1/ℓp norm, that is, [math]\varphi(X) = \lambda\sum_{i=1}^{d}\|x_i\|_p[/math]. By imposing an ℓ1-norm over p-norms of the rows of [math]X[/math], entire rows are nulled at once.

When [math]p \in \{2, \infty\}[/math] and the proximal [math]H[/math] in (18) is diagonal, the previous algorithms can be readily used to solve the mixed norm problems. We simply maintain diagonal matrix information for each of the rows [math]x_i[/math] of [math]X[/math] separately, then solve one of the previous updates for each row independently. We use this form of regularization in our experiments with multiclass prediction problems in the next section.
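A small sketch of the row-wise reduction for the ℓ1/ℓ2 case. For simplicity it assumes a single scalar proximal weight per row, in which case the row subproblem has the closed form shown in the comments; with a genuinely diagonal per-row [math]H[/math] one would instead call a bisection routine like the one sketched in Section 5.3 for each row. Names are illustrative.

<pre>
import numpy as np

def mixed_norm_l1l2_prox(U, h, lam):
    """Row-wise prox for l1/l2 mixed-norm regularization: for each row u of U,
    solve min_x <u,x> + (h/2)*||x||_2^2 + lam*||x||_2, whose solution is
    x = -(1/h) * max(1 - lam/||u||_2, 0) * u  (entire rows are zeroed at once)."""
    U = np.asarray(U, dtype=float)
    X = np.zeros_like(U)
    for r, u in enumerate(U):
        norm_u = np.linalg.norm(u)
        if norm_u > lam:                      # otherwise the whole row stays zero
            X[r] = -(1.0 / h) * (1.0 - lam / norm_u) * u
    return X
</pre>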

6. Experiments

We performed experiments with several real world data sets with different characteristics: the ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on the completely online (fully stochastic) optimization setting, in which at each iteration the learning algorithm receives a single example. We measure performance using two metrics: the online loss or error and the test set performance of the predictor the learning algorithm outputs at the end of a single pass through the training data. We also give some results that show how imposing sparsity constraints (in the form of ℓ1 and mixed-norm regularization) affects the learning algorithm's performance. One benefit of the ADAGRAD framework is its ability to straightforwardly generalize to domain constraints [math]\mathcal{X} \neq \mathbb{R}^d[/math] and arbitrary regularization functions [math]\varphi[/math], in contrast to previous adaptive online algorithms.

(...) Table 1: Test set error rates and proportion non-zero (in parenthesis) on Reuters RCV1.

We experiment with RDA (Xiao, 2010), FOBOS (Duchi and Singer, 2009), adaptive RDA, adaptive FOBOS, the Passive-Aggressive (PA) algorithm (Crammer et al., 2006), and AROW (Crammer et al., 2009). To remind the reader, PA is an online learning procedure with the update

(...)

where [math]\lambda[/math] is a regularization parameter. PA's update is similar to the update employed by AROW (see (9)), but the latter maintains second order information on [math]x[/math]. By using a representer theorem it is also possible to derive efficient updates for PA and AROW when the loss is the logistic loss, [math]\log(1 + \exp(-y_t\langle z_t, x_t\rangle))[/math]. We thus compare the above six algorithms using both hinge and logistic loss.

6.1 Text Classification

The Reuters RCV1 data set consists of a collection of approximately 800,000 text articles, each of which is assigned multiple labels. There are 4 high-level categories, Economics, Commerce, Medical, and Government (ECAT, CCAT, MCAT, GCAT), and multiple more specific categories. We focus on training binary classifiers for each of the four major categories. The input features we use are 0/1 bigram features, which, post word stemming, give data of approximately 2 million dimensions. The feature vectors are very sparse, however, and most examples have fewer than 5000 non-zero features.

We compare the twelve different algorithms mentioned in the prequel as well as variants of FOBOS and RDA with ℓ1-regularization. We summarize the results of the ℓ1-regularized runs as well as AROW and PA in Table 1. The results for both hinge and logistic losses are qualitatively and quantitatively very similar, so we report results only for training with the hinge loss in Table 1. Each row in the table represents the average of four different experiments in which we hold out 25% of the data for a test set and perform an online pass on the remaining 75% of the data. For RDA and FOBOS, we cross-validate the stepsize parameter [math]\eta[/math] by simply running multiple passes and then choosing the output of the learner that had the fewest mistakes during training. For PA and AROW we choose [math]\lambda[/math] using the same approach. We use the same regularization multiplier on the ℓ1 term for RDA and FOBOS, selected so that RDA achieved approximately 10% non-zero predictors.

It is evident from the results presented in Table 1 that the adaptive algorithms (AROW and ADAGRAD) are far superior to non-adaptive algorithms in terms of error rate on test data. The ADAGRAD algorithms naturally incorporate sparsity as well since they are run with ℓ1-regularization, though RDA has significantly higher sparsity levels (PA and AROW do not have any sparsity). Furthermore, although omitted from the table to avoid clutter, in every test with the RCV1 corpus, the adaptive algorithms outperformed the non-adaptive algorithms. Moreover, both ADAGRAD-RDA and ADAGRAD-FOBOS outperform AROW on all the classification tasks. Unregularized RDA and FOBOS attained similar results as did the ℓ1-regularized variants (of course without sparsity), but we omit the results to avoid clutter and because they do not give much more understanding.

(...) Table 2: Test set precision for ImageNet

6.2 Image Ranking

ImageNet (Deng et al., 2009) consists of images organized according to the nouns in the WordNet hierarchy, where each noun is associated on average with more than 500 images collected from the web. We selected 15,000 important nouns from the hierarchy and conducted a large scale image ranking task for each noun. This approach is identical to the task tackled by Grangier and Bengio (2008) using the Passive-Aggressive algorithm. To solve this problem, we train 15,000 ranking machines using Grangier and Bengio's Visterms features, which represent patches in an image with 79-dimensional sparse vectors. There are approximately 120 patches per image, resulting in a 10,000-dimensional feature space.

Based on the results in the previous section, we focus on four algorithms for solving this task: AROW, ADAGRAD with RDA updates and ℓ1-regularization, vanilla RDA with ℓ1, and Passive-Aggressive. We use the ranking hinge loss, which is [math][1 - \langle x, z_1 - z_2\rangle]_+[/math] when [math]z_1[/math] is ranked above [math]z_2[/math]. We train a ranker [math]x_c[/math] for each of the image classes individually, cross-validating the choice of initial stepsize for each algorithm on a small held-out set. To train an individual ranker for class [math]c[/math], at each step of the algorithm we randomly sample a positive image [math]z_1[/math] for the category [math]c[/math] and an image [math]z_2[/math] from the training set (which with high probability is a negative example for class [math]c[/math]) and perform an update on the example [math]z_1 - z_2[/math]. We let each algorithm take 100,000 such steps for each image category, we train four sets of rankers with each algorithm, and the training set includes approximately 2 million images.

For evaluation, we use a distinct test set of approximately 1 million images. To evaluate a set of rankers, we iterate through all 15,000 classes in the data set. For each class we take all the positive image examples in the test set and sample 10 times as many negative image examples. Following Grangier and Bengio, we then rank the set of positive and negative images and compute precision-at-k for [math]k = \{1, \ldots, 10\}[/math] and the average precision for each category. The precision-at-k is defined as the proportion of examples ranked in the top [math]k[/math] for a category [math]c[/math] that actually belong to [math]c[/math], and the average precision is the average of the precisions at each position in which a relevant picture appears. Letting [math]\mathrm{Pos}(c)[/math] denote the positive examples for category [math]c[/math] and [math]p(i)[/math] denote the position of the ith returned picture in the list of images sorted by inner product with [math]x_c[/math], the average precision is [math]\frac{1}{|\mathrm{Pos}(c)|}\sum_{i=1}^{|\mathrm{Pos}(c)|} i/p(i)[/math]. We compute the mean of each measurement across all classes, performing this twelve times for each of the sets of rankers trained. Table 2 summarizes our results. We do not report variance as the variance was on the order of [math]10^{-5}[/math] for each algorithm. One apparent characteristic to note from the table is that ADAGRAD RDA achieves higher levels of sparsity than the other algorithms: using only 73% of the input features it achieves very high performance. Moreover, it outperforms all the algorithms in average precision. AROW has better results than the other algorithms in terms of precision-at-k for [math]k \leq 10[/math], though ADAGRAD's performance catches up to and eventually surpasses AROW's as [math]k[/math] grows.

(...)

Figure 5: Learning curves on MNIST

6.3 Multiclass Optical Character Recognition

In the well-known MNIST multiclass classification data set, we are given 28 × 28 pixel images, and the learner's task is to classify each image as a digit in {0, ..., 9}. Linear classifiers do not work well on a simple pixel-based representation. Thus we learn classifiers built on top of a kernel machine with Gaussian kernels, as do Duchi and Singer (2009), which gives a different (and non-sparse) structure to the feature space in contrast to our previous experiments. In particular, we use a support set of approximately 3000 images to compute the kernels and train multiclass predictors, which consist of one vector [math]x_c \in \mathbb{R}^{3000}[/math] for each class [math]c[/math], giving a 30,000 dimensional problem. There is no known multiclass AROW algorithm. We therefore compare adaptive RDA with and without mixed-norm ℓ1/ℓ2 and ℓ1/ℓ∞ regularization (see Section 5.5), RDA, and multiclass Passive-Aggressive to one another using the multiclass hinge loss (Crammer et al., 2006). For each algorithm we used the first 5000 of 60,000 training examples to choose the stepsize [math]\eta[/math] (for RDA) and [math]\lambda[/math] (for PA).

In Figure 5, we plot the learning curves (cumulative mistakes made) of multiclass PA, RDA, RDA with ℓ1/ℓ2 regularization, adaptive RDA, and adaptive RDA with ℓ1/ℓ2 regularization (ℓ1/ℓ∞ is similar). From the curves, we see that adaptive RDA seems to have similar performance to PA, and the adaptive versions of RDA are vastly superior to their non-adaptive counterparts. Table 3 further supports this, where we see that the adaptive RDA algorithms outperform their non-adaptive counterparts both in terms of sparsity (the proportion of non-zero rows) and test set error rates.

(...) Table 3: Test set error rates and sparsity proportions on MNIST. The scalar [math]\lambda[/math] is the multiplier on the ℓ1/ℓ2 regularization term.

6.4 Income Prediction

The KDD census income data set from the UCI repository (Asuncion and Newman, 2007) contains census data extracted from 1994 and 1995 population surveys conducted by the U.S. Census Bureau. The data consists of 40 demographic and employment related variables which are used to predict whether a respondent has income above or below $50,000. We quantize each feature into bins (5 per feature for continuous features) and take products of features to give a 4001 dimensional feature space with 0/1 features. The data is divided into a training set of 199,523 instances and a test set of 99,762 instances.

As in the prequel, we compare AROW, PA, RDA, and adaptive RDA with and without ℓ1-regularization on this data set. We use the first 10,000 examples of the training set to select the step size parameters [math]\lambda[/math] for AROW and PA and [math]\eta[/math] for RDA. We perform ten experiments on random shuffles of the training data. Each experiment consists of a training pass through some proportion of the data (.05, .1, .25, .5, or the entire training set) and computing the test set error rate of the learned predictor. Table 4 and Figure 6 summarize the results of these experiments. The variance of the test error rates is on the order of [math]10^{-6}[/math] so we do not report it. As earlier, the table and figure make it clear that the adaptive methods (AROW and ADAGRAD-RDA) give better performance than non-adaptive methods. Further, as detailed in the table, the ADAGRAD methods can give extremely sparse predictors that still give excellent test set performance. This is consistent with the experiments we have seen to this point, where ADAGRAD gives sparse but highly accurate predictors.

6.5 Experiments with Sparsity-Accuracy Tradeoffs

In our final set of experiments, we investigate the tradeoff between the level of sparsity and the classification accuracy for the ADAGRAD-RDA algorithms. Using the same experimental setup as for the initial text classification experiments described in Section 6.1, we record the average test-set performance of ADAGRAD-RDA versus the proportion of features that are non-zero in the predictor ADAGRAD outputs after a single pass through the training data. To achieve this, we run ADAGRAD with ℓ1-regularization, and we sweep the regularization multiplier [math]\lambda[/math] from [math]10^{-8}[/math] to [math]10^{-1}[/math]. These values result in predictors ranging from a completely dense predictor to an all-zeros predictor, respectively.

Figure 6: Test set error rates as a function of the proportion of training data seen on the Census Income data set.

Table 4: Test set error rates as a function of the proportion of training data seen (proportion of non-zeros in parentheses where appropriate) on the Census Income data set.

We summarize our results in Figure 7, which shows the test set performance of ADAGRAD for each of the four categories ECAT, CCAT, GCAT, and MCAT. Within each plot, the horizontal black line labeled AROW designates the baseline performance of AROW on the text classification task, though we would like to note that AROW generates fully dense predictors. The plots all portray a similar story. With high regularization values, ADAGRAD exhibits, as expected, poor performance as it retains no predictive information from the learning task. Put another way, when the regularization value is high ADAGRAD is confined to an overly sparse predictor which exhibits poor generalization. However, as the regularization multiplier λ decreases, the learned predictor becomes less sparse and eventually the accuracy of ADAGRAD exceeds AROW's accuracy. It is interesting to note that for these experiments, as soon as the predictor resulting from a single pass through the data has more than 1% non-zero coefficients, ADAGRAD's performance matches that of AROW. We also would like to note that the variance in the test-set error rates for these experiments is on the order of 10^-6, and we thus do not draw error bars in the graphs. The performance of ADAGRAD as a function of regularization for other sparse data sets, especially in relation to that of AROW, was qualitatively similar to this experiment.

Figure 7: Test set error rates as a function of the proportion of non-zeros in the predictor output by ADAGRAD (AROW plotted for reference).

7 Conclusions

We presented a paradigm that adapts subgradient methods to the geometry of the problem at hand. The adaptation allows us to derive strong regret guarantees, which for some natural data distributions achieve better performance guarantees than previous algorithms. Our online convergence results can be naturally converted into rate of convergence and generalization bounds (Cesa-Bianchi et al., 2004). Our experiments show that adaptive methods, specifically ADAGRAD-FOBOS, ADAGRAD-RDA, and AROW clearly outperform their non-adaptive counterparts. Furthermore, the ADAGRAD family of algorithms naturally incorporates regularization and gives very sparse solutions with similar performance to dense solutions. Our experiments with adaptive methods use a diagonal approximation to the matrix obtained by taking outer products of subgradients computed along the run of the algorithm. It remains to be tested whether using the full outer product matrix can further improve performance.

To conclude, we would like to underscore a possible elegant generalization that interpolates between full-matrix proximal functions and diagonal approximations using block diagonal matrices. Specifically, for v ∈ R^d written as the concatenation of subvectors v_l ∈ R^{d_l}, where the v_l are subvectors of v with ∑_{l=1}^{k} d_l = d, we can define the associated block-diagonal approximation to the outer product matrix ∑_{τ=1}^{t} g_τ g_τ^T by

(...)

In this case, a combination of Theorems 5 and 7 gives the next corollary.

Corollary 12 Let G_t be the block-diagonal outer product matrix defined above and let the sequence {x_t} be defined by the RDA update of (3) with ψ_t(x) = ⟨x, G_t^{1/2} x⟩. Then, for any x* ∈ X,

(...)

A similar bound holds for composite mirror-descent updates, and it is straightforward to get infimal equalities similar to those in Corollary 11 with the infimum taken over block-diagonal matrices. Such an algorithm can interpolate between the computational simplicity of the diagonal proximal functions and the ability of full matrices to capture correlation in the gradient vectors.
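As a small numeric sketch of the block-diagonal construction above (block sizes and data are arbitrary illustrations of ours, not the authors' implementation), one can accumulate the outer products of the subgradients within each coordinate block and zero everything across blocks:

```python
import numpy as np

def block_diag_outer_product(grads, block_sizes):
    """Accumulate sum_t g_t g_t^T restricted to the given coordinate blocks,
    yielding a block-diagonal approximation of the full outer product matrix."""
    d = sum(block_sizes)
    starts = np.cumsum([0] + list(block_sizes))
    G = np.zeros((d, d))
    for g in grads:
        for s, e in zip(starts[:-1], starts[1:]):
            G[s:e, s:e] += np.outer(g[s:e], g[s:e])
    return G

rng = np.random.default_rng(0)
grads = [rng.normal(size=6) for _ in range(10)]
G = block_diag_outer_product(grads, block_sizes=[2, 3, 1])
print(np.count_nonzero(G[0:2, 2:6]))   # 0: cross-block entries are zeroed by construction
```

The proximal function of Corollary 12 would then use G_t^{1/2} computed block by block, which costs far less than a full d × d matrix root when the blocks are small.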

A few open questions stem from this line of research. The first is whether we can efficiently use full matrices in the proximal functions, as in Section 4. A second open issue is whether non-Euclidean proximal functions, such as the relative entropy, can be used. We also think that the strongly convex case (when f_t or φ is strongly convex) presents interesting challenges that we have not completely resolved. We hope to investigate both empirical and formal extensions of this work in the near future.

Acknowledgments

There are many people to whom we owe our sincere thanks for this research. Fernando Pereira helped push us in the direction of working on adaptive online methods and has been a constant source of discussion and helpful feedback. Samy Bengio provided us with a processed version of the ImageNet data set and was instrumental in helping to get our experiments running, and Adam Sadovsky gave many indispensable coding suggestions. The anonymous reviewers also gave several suggestions that improved the quality of the paper. Lastly, Sam Roweis was a sounding board for some of our earlier ideas on the subject, and we will miss him dearly.

Appendix A. Full Matrix Motivating Example

As in the diagonal case, as the adversary we choose ε > 0 and on rounds t = 1, . . . , η²/ε² play the vector ±v_1. After the first η²/ε² rounds, the adversary simply cycles through the vectors v_2, . . . , v_d. Thus, for Zinkevich's projected gradient, we have x_t = α_{t,1} v_1 for some multiplier α_{t,1} > 0 when t ≤ η²/ε². After the first η²/ε² rounds, we perform the updates

(...)

for some index i, but as in the diagonal case, η/√t ≤ ε, and by orthogonality of the v_i, v_j, we have x_t = Vα_t for some α_t ⪰ 0, and the projection step can only shrink the multiplier α_{t,i} for index i. Thus, each coordinate incurs loss at least 1/(2ε), and projected gradient descent suffers losses Ω(d/ε). On the other hand, ADAGRAD suffers loss at most d. Indeed, since g_1 = v_1 and ‖v_1‖_2 = 1, we

(...)

Since ⟨x_2, v_1⟩ = 1, we see that ADAGRAD suffers no loss (and G_t = G_1) until a vector z_t = ±v_i for i ≠ 1 is played by the adversary. However, an identical argument shows that G_t is simply updated to v_1v_1^T + v_iv_i^T, in which case x_t = v_1 + v_i. Indeed, an inductive argument shows that until all the vectors v_i are seen, we have ‖x_t‖_2 < √d by orthogonality, and eventually we have

(...)

so that x_t ∈ X = {x : ‖x‖_2 ≤ √d} for ADAGRAD for all t. All future predictions thus achieve margin 1 and suffer no loss.

Appendix B. Technical Lemmas

Lemma 13 Let A ⪰ B ⪰ 0 be symmetric d × d PSD matrices. Then A^{1/2} ⪰ B^{1/2}.

Proof This is Example 3 of Davis (1963). We include a proof for convenience of the reader. Let λ be any eigenvalue (with corresponding eigenvector x) of A^{1/2} − B^{1/2}; we show that λ ≥ 0.

Clearly A^{1/2}x − λx = B^{1/2}x. Taking the inner product of both sides with A^{1/2}x, we have ‖A^{1/2}x‖_2² − λ⟨A^{1/2}x, x⟩ = ⟨A^{1/2}x, B^{1/2}x⟩. We use the Cauchy-Schwarz inequality:

(...)

where the last inequality follows from the assumption that A ⪰ B. Thus we must have λ⟨A^{1/2}x, x⟩ ≥ 0, which implies λ ≥ 0. ∎
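Lemma 13 is easy to sanity-check numerically; the following small experiment (ours, not part of the paper) draws random matrices with A ⪰ B ⪰ 0 and verifies that A^{1/2} − B^{1/2} has no negative eigenvalues:

```python
import numpy as np

def psd_sqrt(X):
    """Unique PSD square root via eigen-decomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
for _ in range(100):
    M, N = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
    B = M @ M.T                     # PSD
    A = B + N @ N.T                 # A - B is PSD, so A >= B >= 0
    diff = psd_sqrt(A) - psd_sqrt(B)
    assert np.linalg.eigvalsh(diff).min() > -1e-8   # A^{1/2} - B^{1/2} is PSD
print("operator monotonicity of the square root held on all trials")
```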

The gradient of the function tr(X^p) is easy to compute for integer values of p. However, when p is real we need the following lemma. The lemma tacitly uses the fact that there is a unique positive semidefinite X^p when X ⪰ 0 (Horn and Johnson, 1985, Theorem 7.2.6).

Lemma 14 Let p ∈ R and X ≻ 0. Then ∇_X tr(X^p) = pX^{p−1}.

Proof We do a first order expansion of (X + A)^p when X ≻ 0 and A is symmetric. Let X = UΛU^T be the symmetric eigen-decomposition of X and VDV^T be the decomposition of Λ^{−1/2}U^T A U Λ^{−1/2}. Then

[math]\displaystyle{ \begin{aligned} (X + A)^p &= (U\Lambda U^\top + A)^p = U(\Lambda + U^\top A U)^p U^\top = U\Lambda^{p/2}\big(I + \Lambda^{-1/2}U^\top A U \Lambda^{-1/2}\big)^p \Lambda^{p/2} U^\top = U\Lambda^{p/2} V (I + D)^p V^\top \Lambda^{p/2} U^\top \\ &= U\Lambda^{p/2} V \big(I + pD + o(D)\big) V^\top \Lambda^{p/2} U^\top = U\Lambda^{p} U^\top + p\, U\Lambda^{p/2} V D V^\top \Lambda^{p/2} U^\top + o\!\big(U\Lambda^{p/2} V D V^\top \Lambda^{p/2} U^\top\big) \\ &= X^p + p\, U\Lambda^{(p-1)/2} U^\top A U \Lambda^{(p-1)/2} U^\top + o(A) = X^p + p\, X^{(p-1)/2} A X^{(p-1)/2} + o(A). \end{aligned} }[/math]

In the above, o(A) is a matrix that goes to zero faster than A → 0, and the second line follows via a first-order Taylor expansion of (1 + d_i)^p. From the above, we immediately have

[math]\displaystyle{ \operatorname{tr}\big((X + A)^p\big) = \operatorname{tr}(X^p) + p\,\operatorname{tr}(X^{p-1}A) + o(\operatorname{tr} A), }[/math]

which completes the proof. ∎
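A quick finite-difference check of Lemma 14 (an illustration we add, with arbitrary test matrices) compares the directional derivative of tr(X^p) against p tr(X^{p−1}A):

```python
import numpy as np

def psd_power(X, p):
    """X^p for symmetric positive definite X, via eigen-decomposition."""
    w, V = np.linalg.eigh(X)
    return (V * w**p) @ V.T

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
X = M @ M.T + 4 * np.eye(4)                  # strictly positive definite
A = rng.normal(size=(4, 4)); A = (A + A.T) / 2
p, eps = 0.5, 1e-6

# central finite difference of tr((X + eps*A)^p) versus the claimed gradient formula
numeric = (np.trace(psd_power(X + eps * A, p)) - np.trace(psd_power(X - eps * A, p))) / (2 * eps)
analytic = p * np.trace(psd_power(X, p - 1) @ A)
print(abs(numeric - analytic))               # difference should be tiny
```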

Appendix C. Proof of Lemma 4

We prove the lemma by considering an arbitrary real-valued sequence {a_i} and its vector representation a_{1:t} = [a_1 · · · a_t]. We are next going to show that

(...)

where we define 0/0 = 0. We use induction on T to prove inequality (24). For T = 1, the inequality trivially holds. Assume the bound (24) holds true for T − 1, in which case

(...)

where the inequality follows from the inductive hypothesis. We define b_T = ∑_{t=1}^{T} a_t² and use concavity to obtain that √(b_T − a_T²) ≤ √(b_T) − a_T²/(2√(b_T)) so long as b_T − a_T² ≥ 0.² Thus,

(...)

Having proved the bound (24), we note that by construction s_{t,i} = (...), so

(...)

2. We note that we use an identical technique in the full-matrix case. See Lemma 8.

(...)
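The scalar inequality established in this appendix, which we read from the surrounding argument as ∑_{t=1}^{T} a_t²/‖a_{1:t}‖_2 ≤ 2‖a_{1:T}‖_2 (our transcription, since the displayed statement is elided above), can be spot-checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    a = rng.normal(size=int(rng.integers(1, 50)))
    prefix_norms = np.sqrt(np.cumsum(a ** 2))              # ||a_{1:t}||_2 for each t
    terms = np.divide(a ** 2, prefix_norms,
                      out=np.zeros_like(a), where=prefix_norms > 0)   # convention 0/0 = 0
    assert terms.sum() <= 2 * np.sqrt(np.sum(a ** 2)) + 1e-12
print("sum_t a_t^2 / ||a_{1:t}||_2 <= 2 ||a_{1:T}||_2 held on all trials")
```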

Appendix D. Proof of Lemmas 8 and 9

We begin with the more difficult proof of Lemma 8.

Proof of Lemma 8 The core of the proof is based on the concavity of the function tr(A^{1/2}). However, careful analysis is required as A might not be strictly positive definite. We also use the previous lemma, which implies that the gradient of tr(A^{1/2}) is ½A^{−1/2} when A ≻ 0.

First, A^p is matrix-concave for A ≻ 0 and 0 ≤ p ≤ 1 (see, for example, Corollary 4.1 in Ando, 1979, or Theorem 16.1 in Bondar, 1994). That is, for A, B ≻ 0 and α ∈ [0, 1] we have

(...)

Now suppose simply A, B ⪰ 0 (but neither is necessarily strictly positive definite). Then for any δ > 0, we have A + δI ≻ 0 and B + δI ≻ 0 and therefore

(...)

where we used Lemma 13 for the second matrix inequality. Moreover, αA + (1 − α)B + δI → αA + (1 − α)B as δ → 0. Since A^p is continuous (when we use the unique PSD root), this line of reasoning proves that (25) holds for A, B ⪰ 0. Thus, we have proved that

(...)

Recall now that Lemma 14 implies that the gradient of tr(A^{1/2}) is ½A^{−1/2} when A ≻ 0. Therefore, from the concavity of A^{1/2} and the form of its gradient, we can use the standard first-order inequality for concave functions so that for any A, B ≻ 0,

(...)

Let A = B − νgg^T ⪰ 0 and suppose only that B ⪰ 0. We must take some care since B^{−1/2} may not necessarily exist, and the above inequality does not hold true in the pseudo-inverse sense when B is not strictly positive definite. However, for any δ > 0 we know that 2∇_B tr((B + δI)^{1/2}) = (B + δI)^{−1/2}, and A − B = −νgg^T. From (26) and Lemma 13, we have

(...)

Note that g ∈ Range(B), because if it were not, we could choose some u with Bu = 0 and ⟨g, u⟩ ≠ 0, which would give ⟨u, (B − νgg^T)u⟩ = −ν⟨g, u⟩² < 0, a contradiction. Now let B = V diag(λ)V^T be the eigen-decomposition of B. Since g ∈ Range(B),

(...)

Thus, by taking δ ↓ 0 in (27), and since both tr((B + δI)^{1/2}) and tr((B + δI)^{−1/2}gg^T) are evidently continuous in δ, we complete the proof. ∎

Proof of Lemma 9 We begin by noting that δ²I ⪰ gg^T, so from Lemma 13 we get (A + gg^T)^{1/2} ⪯ (A + δ²I)^{1/2}. Since A and I are simultaneously diagonalizable, we can generalize the inequality √(a + b) ≤ √a + √b, which holds for a, b ≥ 0, to positive semidefinite matrices, thus,

(...)

Therefore, if A + gg^T is of full rank, we have (A + gg^T)^{−1/2} ⪰ (A^{1/2} + δI)^{−1} (Horn and Johnson, 1985, Corollary 7.7.4(a)). Since g ∈ Range((A + gg^T)^{1/2}), we can apply an analogous limiting argument to the one used in the proof of Lemma 8 and discard all zero eigenvalues of A + gg^T, which completes the lemma. ∎

Appendix E. Solution to Problem (15)

We prove here a technical lemma that is useful in characterizing the solution of the optimization problem below. Note that the second part of the lemma implies that we can treat the inverse of the solution matrix S^{−1} as the pseudo-inverse S^†. We consider solving

[math]\displaystyle{ \min_{S} \; \operatorname{tr}(S^{-1}A) \quad \text{subject to} \quad S \succeq 0,\; \operatorname{tr}(S) \le c, \quad \text{where } A \succeq 0. \qquad (28) }[/math]

Lemma 15 If A is of full rank, then the minimizer of (28) is S = cA^{1/2}/tr(A^{1/2}). If A is not of full rank, then setting (...) gives

(...)

Proof Both proofs rely on constructing the Lagrangian for (28). We introduce θ ≥ 0 for the trace constraint and Z ⪰ 0 for the positive semidefinite constraint on S. In this case, the Lagrangian is (...) The derivative of L with respect to S is (...)

If S is full rank, then to satisfy the generalized complementarity conditions for the problem (Boyd and Vandenberghe, 2004), we must have Z = 0. Therefore, we get S^{−1}AS^{−1} = θI. We can now multiply by S on the right and the left to get that A = θS², which implies that S ∝ A^{1/2}. If A is of full rank, the optimal solution for S ≻ 0 forces θ to be positive so that tr(S) = c. This yields the solution S = cA^{1/2}/tr(A^{1/2}). In order to verify optimality of this solution, we set Z = 0 and θ = c^{−2} tr(A^{1/2})², which gives ∇_S L(S, θ, Z) = 0, as is indeed required.

Suppose now that A is not full rank and that

(...) is the eigen-decomposition of A. Let n be the dimension of the null space of A (so the rank of A is d − n). Define the variables (...)

It is easy to see that tr S(δ) = c, and

(...)

Further, let g(θ) = inf_S L(S, θ, Z(θ)) be the dual of (28). From the above analysis and (29), it is evident that

(...) So S(θ, δ) achieves the infimum in the dual for any δ > 0, tr(S(θ)Z(θ)) = 0, and g(θ) = 2√θ tr(A^{1/2}) + √θ δn − θc.

Setting θ = tr(A^{1/2})²/c² gives g(θ) = tr(A^{1/2})²/c + δn tr(A^{1/2})/c. Taking δ → 0 gives g(θ) = tr(A^{1/2})²/c, which means that lim_{δ→0} tr(S(δ)^{−1}A) = tr(A^{1/2})²/c = g(θ). Thus the duality gap for the original problem is 0, so S(0) is the limiting solution.

The last statement of the lemma follows by simply plugging S^{−1} = (A^†)^{1/2} tr(A^{1/2})/c into the objective being minimized. ∎
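The full-rank case of Lemma 15 is simple to verify numerically; the following check (our illustration, with an arbitrary random A and trace budget c) compares the closed-form minimizer against randomly sampled feasible matrices:

```python
import numpy as np

def psd_sqrt(X):
    """Unique PSD square root via eigen-decomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
d, c = 5, 3.0
M = rng.normal(size=(d, d))
A = M @ M.T + 0.1 * np.eye(d)                 # full-rank PSD A

root = psd_sqrt(A)
S_star = c * root / np.trace(root)            # claimed minimizer of tr(S^{-1} A) s.t. tr(S) <= c
best = np.trace(np.linalg.solve(S_star, A))   # equals tr(A^{1/2})^2 / c

for _ in range(2000):                         # random feasible competitors with tr(S) = c
    N = rng.normal(size=(d, d))
    S = N @ N.T + 1e-3 * np.eye(d)
    S *= c / np.trace(S)
    assert np.trace(np.linalg.solve(S, A)) >= best - 1e-8
print("closed-form S achieved the smallest objective among sampled feasible S")
```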

Appendix F. Proofs of Propositions 2 and 3

We begin with the proof of Proposition 2. The proof essentially builds upon Xiao (2010) and Nesterov (2009), with some modification to deal with the indexing of ψ_t. We include the proof for completeness.

Proof of Proposition 2 Define ψ_t* to be the conjugate dual of tφ(x) + ψ_t(x)/η: (...)

Since ψ_t/η is 1/η-strongly convex with respect to the norm ‖·‖_{ψ_t}, the function ψ_t* has η-Lipschitz continuous gradients with respect to the dual norm ‖·‖_{ψ_t}*:

(...)

for any g_1, g_2 (see, e.g., Nesterov, 2005, Theorem 1, or Hiriart-Urruty and Lemaréchal, 1996, Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if f has L-Lipschitz gradients, f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖², and

(...)

Using the bound (30) and identity (31), we can give the proof of the proposition. Indeed, letting g_t ∈ ∂f_t(x_t) and defining z_t = ∑_{τ=1}^{t} g_τ, we have

(...) Since ψ_{t+1} ≥ ψ_t, it is clear that

(...)

We can repeat the same sequence of steps that gave the last equality to see that

(...)

Recalling that x_1 = argmin_{x∈X} φ(x) and that ψ_0*(0) = 0 completes the proof. ∎

We now turn to the proof of Proposition 3. We begin by stating and fully proving an (essentially) immediate corollary to Lemma 2.3 of Duchi et al. (2010).

Lemma 16 Let {x_t} be the sequence defined by the update (4) and assume that B_{ψ_t}(·, ·) is strongly convex with respect to a norm ‖·‖_{ψ_t}. Let ‖·‖_{ψ_t}* be the associated dual norm. Then for any x*,

(...)

Proof The optimality of x_{t+1} for (4) implies that for all x ∈ X and φ′(x_{t+1}) ∈ ∂φ(x_{t+1}),

(...)

In particular, this obtains for x = x*. From the subgradient inequality for convex functions, we have

f_t(x*) ≥ f_t(x_t) + ⟨f_t′(x_t), x* − x_t⟩, or f_t(x_t) − f_t(x*) ≤ ⟨f_t′(x_t), x_t − x*⟩, and likewise for φ(x_{t+1}). We thus have

(...)

Now, by (32), the first term in the last equation is non-positive. Thus we have that

(...)

In the above, the first equality follows from simple algebra with Bregman divergences, the second-to-last inequality follows from Fenchel's inequality applied to the conjugate functions ½‖·‖²_{ψ_t} and ½(‖·‖_{ψ_t}*)² (Boyd and Vandenberghe, 2004, Example 3.27), and the last inequality follows from the assumed strong convexity of B_{ψ_t} with respect to the norm ‖·‖_{ψ_t}. ∎

Proof of Proposition 3 Sum the equation in the conclusion of Lemma 16. ∎

Appendix G. Derivations of Algorithms

In this appendix, we give the formal derivations of the solution to the ADAGRAD update for ℓ1-regularization and projection onto an ℓ1-ball, as described originally in Section 5.

G.1 ℓ1-regularization

We give the derivation for the primal-dual subgradient update, as composite mirror-descent is entirely similar. We need to solve update (3), which amounts to

(...)

Let x̂ denote the optimal solution of the above optimization problem. Standard subgradient calculus implies that when |ḡ_{t,i}| ≤ λ the solution is x̂_i = 0. Similarly, when ḡ_{t,i} < −λ, then x̂_i > 0, the objective is differentiable, and the solution is obtained by setting the gradient to zero: (...) Likewise, when ḡ_{t,i} > λ, then x̂_i < 0, and the solution is x̂_i = (ηt/H_{t,ii})(−ḡ_{t,i} + λ). Combining the three cases, we obtain the simple update (19) for x_{t+1,i}. ∎
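The three cases combine into a coordinate-wise soft-threshold form. The sketch below is our paraphrase of that combined update; the per-coordinate scaling ηt/H_{t,ii} and all variable names follow our reading of the primal-dual update, so treat them as assumptions rather than the paper's exact equation (19).

```python
import numpy as np

def adagrad_rda_l1_step(g_bar, H_diag, t, eta, lam):
    """Coordinate-wise closed form combining the three cases:
    x_{t+1,i} = 0 when |gbar_i| <= lam, otherwise a step of size eta*t/H_ii
    in the direction -sign(gbar_i), shrunk by the regularization multiplier lam."""
    shrunk = np.maximum(np.abs(g_bar) - lam, 0.0)      # [|gbar_i| - lam]_+
    return -np.sign(g_bar) * (eta * t / H_diag) * shrunk

# toy usage: average gradient gbar and accumulated per-coordinate scaling H_ii
g_bar = np.array([0.05, -0.4, 0.8])
H_diag = np.array([1.0, 2.0, 4.0])
print(adagrad_rda_l1_step(g_bar, H_diag, t=10, eta=0.1, lam=0.1))
# first coordinate is zeroed; the others move opposite the sign of gbar
```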

G.2 ℓ1-ball projections

The derivation we give is somewhat terse, and we refer the interested reader to Brucker (1984) or Pardalos and Rosen (1990) for more depth. Recall that our original problem (20) is symmetric in its objective and constraints, so we assume without loss of generality that v ⪰ 0 (otherwise, we reverse the sign of each negative component in v, then flip the sign of the corresponding component in the solution vector). This gives

(...) Clearly, if ⟨a, v⟩ ≤ c the optimal z* = v, hence we assume that ⟨a, v⟩ > c. We also assume without loss of generality that v_i/a_i ≥ v_{i+1}/a_{i+1} for simplicity of our derivation. (We revisit this assumption at the end of the derivation.) Introducing Lagrange multipliers θ ≥ 0 for the constraint that ⟨a, z⟩ ≤ c and α ⪰ 0 for the positivity constraint on z, we get

(...)

Computing the gradient of L, we have ∇_z L(z, α, θ) = z − v + θa − α. Suppose that we knew the optimal θ* ≥ 0. Using the complementarity conditions on z and α for optimality of z (Boyd and Vandenberghe, 2004), we see that the solution z_i* satisfies

(...) 0 otherwise .

Analogously, the complementarity conditions on ⟨a, z⟩ ≤ c show that given θ*, we have

(...)

Conversely, had we obtained a value θ ≥ 0 satisfying the above equation, then θ would evidently induce the optimal z* through the equation z_i = [v_i − θa_i]_+.

Now, let p be the largest index in {1, . . . , d} such that v_i − θ*a_i > 0 for i ≤ p and v_i − θ*a_i ≤ 0 for i > p. From the assumption that v_i/a_i ≥ v_{i+1}/a_{i+1}, we have v_{p+1}/a_{p+1} ≤ θ* < v_p/a_p. Thus, had we known the last non-zero index p, we would have obtained

(....)

Given p satisfying the above inequalities, we can reconstruct the optimal θ* by noting that the latter inequality should equal c exactly when we replace v_p/a_p with θ, that is,

(...)

The above derivation results in the following procedure (when ⟨a, v⟩ > c). We sort v in descending order of v_i/a_i and find the largest index p such that ∑_{i=1}^{p} a_i v_i − (v_p/a_p) ∑_{i=1}^{p} a_i² < c. We then reconstruct θ* using equality (33) and return the soft-thresholded values of v_i (see Algorithm 3). It is easy to verify that the algorithm can be implemented in O(d log d) time. A randomized search with bookkeeping (Pardalos and Rosen, 1990) can straightforwardly be used to derive a linear-time algorithm.
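A compact rendering of this sort-based procedure follows (our sketch, assuming v ⪰ 0 and a ≻ 0 with the sign-flip reduction handled by the caller; Algorithm 3 in the paper is the authoritative version):

```python
import numpy as np

def project_weighted_l1(v, a, c):
    """Project v >= 0 onto {z >= 0 : <a, z> <= c} with a > 0:
    sort by v_i/a_i, find the last index p kept non-zero, recover theta*, soft-threshold."""
    if a @ v <= c:
        return v.copy()                           # already feasible
    order = np.argsort(-(v / a))                  # descending order of v_i / a_i
    vs, ws = v[order], a[order]
    av_cum = np.cumsum(ws * vs)                   # prefix sums of a_i * v_i
    a2_cum = np.cumsum(ws ** 2)                   # prefix sums of a_i^2
    ratios = vs / ws
    # largest p with sum_{i<=p} a_i v_i - (v_p/a_p) sum_{i<=p} a_i^2 < c
    p = np.max(np.nonzero(av_cum - ratios * a2_cum < c)[0])
    theta = (av_cum[p] - c) / a2_cum[p]           # makes the constraint <a, z> = c tight
    return np.maximum(v - theta * a, 0.0)

v = np.array([0.9, 0.5, 0.3, 0.05])
a = np.ones(4)
z = project_weighted_l1(v, a, c=1.0)
print(z, a @ z)                                   # <a, z> = 1.0
```

Replacing the sort with a randomized pivot search gives the linear-time variant mentioned above.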

(...)

References

John Duchi, Elad Hazan, and Yoram Singer (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization."
  1. The proximal term is also referred to as regularization in the online learning literature. We use the phrase proximal term in order to avoid confusion with the statistical regularization function [math]\displaystyle{ \phi }[/math].