2003 TheNonstochasticMultiarmedBanditProblem

From GM-RKB

Subject Headings: Adversarial k-Bandit Task.

Notes

Cited By

Quotes

Author Keywords

Abstract

In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines.

In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate [math]\displaystyle{ O(T^{-1/2}) }[/math]. We show by a matching lower bound that this is the best possible.

We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the best strategy at the rate [math]\displaystyle{ O((\log N)^{1/2}T^{-1/2}) }[/math]. Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate [math]\displaystyle{ O(T^{-1/2}) }[/math].

1 Introduction

In the multi-armed bandit problem, originally proposed by Robbins [19], a gambler must choose which of [math]\displaystyle{ K }[/math] slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler's purpose is to maximize his return, i.e. the sum of the rewards he receives over a sequence of pulls. In this model, each arm is assumed to deliver rewards that are independently drawn from a fixed and unknown distribution. As reward distributions differ from arm to arm, the goal is to find the arm with the highest expected payoff as early as possible, and then to keep gambling using that best arm.

The problem is a paradigmatic example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best (“exploitation”), he may fail to discover that one of the other arms actually has a higher expected payoff. On the other hand, if he spends too much time trying out all the machines and gathering statistics (“exploration”), he may fail to play the best arm often enough to get a high return.

The gambler's performance is typically measured in terms of “regret”. This is the difference between the expected return of the optimal strategy (consistently pulling the best arm) and the gambler's expected return. Lai and Robbins [14] proved that the gambler's regret over [math]\displaystyle{ T }[/math] pulls can be made, for [math]\displaystyle{ T \rightarrow \infty }[/math], as small as [math]\displaystyle{ O(\ln{T}) }[/math]. Furthermore, they proved that this bound is optimal in the following sense: no strategy for the gambler has a better asymptotic performance.

Though this formulation of the bandit problem allows an elegant statistical treatment of the exploration-exploitation trade-off, it may not be adequate to model certain environments. As a motivating example, consider the task of repeatedly choosing a route for transmitting packets between two points in a communication network. To cast this scenario within the bandit problem, suppose there is only a fixed number of possible routes and the transmission cost is reported back to the sender. Now, it is likely that the costs associated with each route cannot be modeled by a stationary distribution, so a more sophisticated set of statistical assumptions would be required. In general, it may be difficult or impossible to determine the right statistical assumptions for a given domain, and some domains may exhibit dependencies to an extent that no such assumptions are appropriate.

To provide a framework where one could model scenarios like the one sketched above, we present the adversarial bandit problem, a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. We only assume that each slot machine is initially assigned an arbitrary and unknown sequence of rewards, one for each time step, chosen from a bounded real interval.
Each time the gambler pulls the arm of a slot machine he receives the corresponding reward from the sequence assigned to that slot machine. To measure the gambler's performance in this setting we replace the notion of (statistical) regret with that of worst-case regret. Given any sequence [math]\displaystyle{ (j_1,\ldots,j_T) }[/math] of pulls, where [math]\displaystyle{ T \gt 0 }[/math] is an arbitrary time horizon and each [math]\displaystyle{ j_t }[/math] is the index of an arm, the worst-case regret of a gambler for this sequence of pulls is the difference between the return the gambler would have had by pulling arms [math]\displaystyle{ j_1,\ldots,j_T }[/math] and the actual gambler's return, where both returns are determined by the initial assignment of rewards. It is easy to see that, in this model, the gambler cannot keep his regret small (say, sublinear in [math]\displaystyle{ T }[/math]) for all sequences of pulls and with respect to the worst-case assignment of rewards to the arms. Thus, to make the problem feasible, we allow the regret to depend on the “hardness” of the sequence of pulls for which it is measured, where the hardness of a sequence is roughly the number of times one has to change the slot machine currently being played in order to pull the arms in the order given by the sequence. This trick allows us to effectively control the worst-case regret simultaneously for all sequences of pulls, even though (as one should expect) our regret bounds become trivial when the hardness of the sequence [math]\displaystyle{ (j_1,\ldots,j_T) }[/math] we compete against gets too close to [math]\displaystyle{ T }[/math].

As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem is thus one of optimization, rather than exploration and exploitation.

Our most general result is a very efficient, randomized player algorithm whose expected regret for any sequence of pulls is [math]\displaystyle{ O(S\sqrt{KT\ln(KT)}) }[/math] [1], where [math]\displaystyle{ S }[/math] is the hardness of the sequence (see Theorem 8.1 and Corollaries 8.2, 8.4). Note that this bound holds simultaneously for all sequences of pulls, for any assignments of rewards to the arms, and uniformly over the time horizon [math]\displaystyle{ T }[/math]. If the gambler is willing to impose an upper bound [math]\displaystyle{ S }[/math] on the hardness of the sequences of pulls for which he wants to measure his regret, an improved bound [math]\displaystyle{ O(\sqrt{SKT\ln(KT)}) }[/math] on the expected regret for these sequences can be proven (see Corollaries 8.3 and 8.5).

With the purpose of establishing connections with certain results in game theory, we also look at a special case of the worst-case regret, which we call “weak regret”. Given a time horizon [math]\displaystyle{ T }[/math], call “best arm” the arm that has the highest return (sum of assigned rewards) up to time [math]\displaystyle{ T }[/math] with respect to the initial assignment of rewards. The gambler's weak regret is the difference between the return of this best arm and the actual gambler's return. In the paper we introduce a randomized player algorithm, tailored to this notion of regret, whose expected weak regret is [math]\displaystyle{ O(\sqrt{K G_{\max} \ln K}) }[/math], where [math]\displaystyle{ G_{\max} }[/math] is the return of the best arm — see Theorem 4.1 in Section 4. As before, this bound holds for any assignments of rewards to the arms and uniformly over the choice of the time horizon [math]\displaystyle{ T }[/math]. Using a more complex player algorithm, we also prove that the weak regret is [math]\displaystyle{ O(\sqrt{KT\ln(KT/\delta)}) }[/math] with probability at least [math]\displaystyle{ 1-\delta }[/math] over the algorithm's randomization, for any fixed [math]\displaystyle{ \delta \gt 0 }[/math]; see Theorems 6.3 and 6.4 in Section 6. This also implies that, asymptotically for [math]\displaystyle{ T \rightarrow \infty }[/math] and [math]\displaystyle{ K }[/math] constant, the weak regret is [math]\displaystyle{ O(\sqrt{T\ln T}) }[/math] with probability 1; see Corollary 6.5.

Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [14] of the form [math]\displaystyle{ O(\ln T) }[/math]. However, when comparing our results to those in the statistics literature, it is important to point out an important difference in the asymptotic quantification. In the work of Lai and Robbins the assumption is that the distribution of rewards associated with each arm is fixed as the total number [math]\displaystyle{ T }[/math] of iterations increases to infinity. In contrast, our bounds hold for any finite [math]\displaystyle{ T }[/math], and, by the generality of our model, these bounds are applicable when the payoffs are randomly (or adversarially) chosen in a manner that does depend on [math]\displaystyle{ T }[/math]. It is this quantification order, and not the adversarial nature of our framework, which is the cause for the apparent gap. We prove this point in Theorem 5.1 where we show that, for any player algorithm for the [math]\displaystyle{ K }[/math]-armed bandit problem and for any [math]\displaystyle{ T }[/math], there exists a set of [math]\displaystyle{ K }[/math] reward distributions such that the expected weak regret of the algorithm when playing on these arms for [math]\displaystyle{ T }[/math] time steps is [math]\displaystyle{ \Omega(\sqrt{KT}) }[/math].

So far we have considered notions of regret that compare the return of the gambler to the return of a sequence of pulls or to the return of the best arm. A further notion of regret which we explore is the regret for the best strategy in a given set of strategies that are available to the gambler. The notion of “strategy” generalizes that of “sequence of pulls”: at each time step a strategy gives a recommendation, in the form of a probability distribution over the [math]\displaystyle{ K }[/math] arms, as to which arm to play next. Given an assignment of rewards to the arms and a set of [math]\displaystyle{ N }[/math] strategies for the gambler, call “best strategy” the strategy that yields the highest return with respect to this assignment. Then the regret for the best strategy is the difference between the return of this best strategy and the actual gambler's return. Using a randomized player that combines the choices of the [math]\displaystyle{ N }[/math] strategies (in the same vein as the algorithms for “prediction with expert advice” from [3]), we show that the expected regret for the best strategy is [math]\displaystyle{ O(\sqrt{KT\ln N}) }[/math]. Note that the dependence on the number of strategies is only logarithmic, and therefore the bound is quite reasonable even when the player is combining a very large number of strategies.

The adversarial bandit problem is closely related to the problem of learning to play an unknown N-person finite game, where the same game is played repeatedly by N players. A desirable property for a player is Hannan-consistency, which is similar to saying (in our bandit framework) that the weak regret per time step of the player converges to 0 with probability 1. Examples of Hannan-consistent player strategies have been provided by several authors in the past (see [18] for a survey of these results). By applying (slight extensions of) Theorems 6.3 and 6.4, we can provide an example of a simple Hannan-consistent player whose convergence rate is optimal up to logarithmic factors.

Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth's [15] weighted majority algorithm, and Vovk's [20] aggregating strategies.
In the setting analyzed by Freund and Schapire the player scores on each pull the reward of the chosen arm, but gains access to the rewards associated with all of the arms (not just the one that was chosen).
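
For concreteness, the two notions of regret discussed above can be written out in the notation introduced in Section 2 below; the shorthand [math]\displaystyle{ G_A(T) }[/math] for the player's return and [math]\displaystyle{ G_{\max}(T) }[/math] for the best arm's return is used here for illustration and may differ from the paper's exact symbols. If the player pulls arms [math]\displaystyle{ i_1,\ldots,i_T }[/math] and the rewards are [math]\displaystyle{ x_i(t) }[/math], then

[math]\displaystyle{ G_A(T) = \sum_{t=1}^{T} x_{i_t}(t), \qquad G_{\max}(T) = \max_{1 \le j \le K} \sum_{t=1}^{T} x_j(t), }[/math]

the weak regret is [math]\displaystyle{ G_{\max}(T) - G_A(T) }[/math], and the worst-case regret with respect to a sequence of pulls [math]\displaystyle{ (j_1,\ldots,j_T) }[/math] is [math]\displaystyle{ \sum_{t=1}^{T} x_{j_t}(t) - G_A(T) }[/math].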

2 Notation and terminology

An adversarial bandit problem is specified by the number [math]\displaystyle{ K }[/math] of possible actions, where each action is denoted by an integer [math]\displaystyle{ i }[/math], [math]\displaystyle{ 1 \le i \le K }[/math], and by an assignment of rewards, i.e. an infinite sequence [math]\displaystyle{ \mathbf{x}(1), \mathbf{x}(2), \ldots }[/math] of vectors [math]\displaystyle{ \mathbf{x}(t) = (x_1(t),\ldots,x_K(t)) }[/math], where [math]\displaystyle{ x_i(t) }[/math] denotes the reward obtained if action [math]\displaystyle{ i }[/math] is chosen at time step (also called “trial”) [math]\displaystyle{ t }[/math]. (Even though throughout the paper we will assume that all rewards belong to the [math]\displaystyle{ [0,1] }[/math] interval, the generalization of our results to rewards in [math]\displaystyle{ [a,b] }[/math] for arbitrary [math]\displaystyle{ a \lt b }[/math] is straightforward.) We assume that the player knows the number [math]\displaystyle{ K }[/math] of actions. Furthermore, after each trial [math]\displaystyle{ t }[/math], we assume the player only knows the rewards [math]\displaystyle{ x_{i_1}(1),\ldots,x_{i_t}(t) }[/math] of the previously chosen actions [math]\displaystyle{ i_1,\ldots,i_t }[/math]. In this respect, we can view the player algorithm as a sequence [math]\displaystyle{ I_1, I_2, \ldots }[/math], where each [math]\displaystyle{ I_t }[/math] is a mapping from the set of action indices and previous rewards to the set of action indices.
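
As an illustration of this interaction protocol, the following sketch (in Python, not part of the original paper) simulates a player against a fixed assignment of rewards. The Exp3Player class is a minimal exponential-weights player in the spirit of the algorithms analyzed in the paper; the exploration parameter gamma and the update constants are illustrative choices, not the paper's tuned values.

<pre>
import math
import random

class Exp3Player:
    """Minimal exponential-weights player for the adversarial bandit problem.

    Arms are drawn from a mixture of the normalized weights and the uniform
    distribution; only the reward of the chosen arm (assumed to lie in [0, 1])
    is observed, and it is fed back through an importance-weighted estimate.
    """

    def __init__(self, num_actions, gamma=0.1):
        self.K = num_actions
        self.gamma = gamma                       # illustrative exploration rate
        self.weights = [1.0] * num_actions
        self.probs = [1.0 / num_actions] * num_actions

    def choose_action(self):
        total = sum(self.weights)
        self.probs = [(1.0 - self.gamma) * w / total + self.gamma / self.K
                      for w in self.weights]
        return random.choices(range(self.K), weights=self.probs)[0]

    def update(self, action, reward):
        # Importance-weighted estimate: dividing by the probability of the
        # chosen arm keeps the estimate unbiased for every arm.
        estimate = reward / self.probs[action]
        self.weights[action] *= math.exp(self.gamma * estimate / self.K)


def play(player, reward_sequences, horizon):
    """Run the bandit interaction: at trial t the player sees only the reward
    x_{i_t}(t) of the arm i_t it pulled."""
    total_reward = 0.0
    for t in range(horizon):
        i_t = player.choose_action()
        x = reward_sequences[i_t][t]   # reward assigned in advance to arm i_t at trial t
        player.update(i_t, x)
        total_reward += x
    return total_reward
</pre>

For example, with reward_sequences given as [math]\displaystyle{ K }[/math] lists of length [math]\displaystyle{ T }[/math] with entries in [math]\displaystyle{ [0,1] }[/math], comparing the returned total reward to the best single arm's return gives an empirical estimate of the weak regret.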

References

Yoav Freund, Robert E. Schapire, Peter Auer, and Nicolò Cesa-Bianchi (2003). “The Nonstochastic Multiarmed Bandit Problem.” In: SIAM Journal on Computing. doi:10.1137/S0097539701398375
  1. ↑ Though in this introduction we use the compact asymptotic notation, our bounds are proven for each finite [math]\displaystyle{ T }[/math] and almost always with explicit constants.