2006 BanditProblems

Jump to: navigation, search

Subject Headings: k-Armed Bandit; Statistical Decision Model


Cited By


Author Keywords


We survey the literature on multi-armed bandit models and their applications in economics. The multi-armed bandit problem is a statistical decision model of an agent trying to optimize his decisions while improving his information at the same time. This classic problem has received much attention in economics as it concisely models the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff).


The multi-armed bandit problem, originally described by Robbins (1952), is a statistical decision model of an agent trying to optimize his decisions while improving his information at the same time. In the multiarm bandit problem, the gambler has to decide which arm of K different slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Each choice of an arm results in an immediate random payoff, but the process determining these payoffs evolves during the play of the bandit. The distinguishing feature of bandit problems is that the distribution of returns from one arm only changes when that arm is chosen. Hence the rewards from an arm do not depend on the rewards obtained from other arms. This feature also implies that the distributions of returns do not depend explicitly on calendar time.

Practical examples of the bandit problem include clinical trials where different treatments need to be experimented with while minimizing patient losses, or adaptive routing efforts for minimizing delays in a network. In an economics environment, experimental consumption is an example of intertemporal allocation problems where the trade-off between current payoff and value of information plays a key role. Alternatively, the use of arms may change their physical properties as in learning by doing where experience with the arm increases its future payoffs.

Basic Model

It is easiest to formulate the bandit problem as an infinite horizon Markov decision problem in discrete time with time index [math]t = 0, 1, ....[/math] At each [math]t[/math], the decision maker chooses amongst [math]K[/math] arms and we denote this choice by at [math]a_t \in {1, ...,K}[/math]. If [math]a_t = k[/math], a random payoff [math]x^k_t[/math] is realized and we denote the associated random variable by [math]Xk t[/math]. The state variable of the Markovian decision problem is given by [math]s_t[/math]. We can then write the distribution of [math]x^k_t[/math] as [math]F^k (·; s_t)[/math] . The state transition function [math]\phi[/math] depends on the choice of the arm and the realized payoff: :[math]s_{t+1} = \phi (x^k_t;s_t)[/math] Let St denote the set of all possible states in period t. A feasible Markov policy a = {at}1 t=0 selects an available alternative for each conceivable state st, i.e. [math]a_t : S_t \rightarrow {1, ...,K}[/math]

The following two assumptions must be met for the problem to qualify as a bandit problem.

  1. . Payoffs are evaluated according to the discounted expected payoff criterion where the discount factor � satisfies [math]0 \le \delta \lt 1[/math].
  1. . The payoff from each k depends only on outcomes of periods with at = k. In other words, we can decompose the state variable st into K components 􀀀 s1t , ..., sKt � such that for all k : :[math]sk t+1 = skt if at a_t \ne k,[/math] :[math]sk t+1 = �(skt , xt) if at = k,[/math] and :[math]Fk (·, st) = Fk 􀀀 ·; skt.[/math]

Notice that when the second assumption holds, the alternatives must be statistically independent.

It is easy to see that many situations of economic interest are special cases of the above formulation. First, it could be that Fk 􀀀 ·; �k� is a fixed distribution with an unknown parameter �k. The state variable is then the posterior probability distribution on �k. Alternatively, Fk 􀀀 ·; sk � could denote the random yield per period from a resource k after extracting sk units.

The value function V (s0) of the bandit problem can be written as follows. Let Xk 􀀀 skt � denote the random variable with distribution Fk 􀀀 ·; skt � . Then the problem of finding an optimal allocation policy is the solution to the following intertemporal optimization problem:

V (s0) = sup a (E X1 t=0 �tXat (sat t ) ) .

The celebrated index theorem due to Gittins and Jones (1974) transforms the problem of finding the optimal policy into a collection of k stopping problems. For each alternative k, we calculate the following index k 􀀀 skt � , which only depends on the state variable of alternative k:

mk 􀀀 skt � = sup � (E P� u=t �tXk 􀀀 sk u � E P� u=t �t ) , (1)

where \tau is a stopping time with respect to {skt }. The idea is to find for each k the stopping time � that results in the highest discounted expected return per discounted expected number of periods in operation. The Gittins index theorem then states that the optimal way of choosing arms in a bandit problem is to select in each period the arm with the highest Gittins index, mk 􀀀 skt � , as defined by (1).



Market Learning

In economics, Bandit problems have first been used to model search processes. The first paper that used a one-armed bandit problem in economics is Rothschild (1974) in which a single firm is facing a market with unknown demand. The true market demand is given by a specific probability distribution over consumer valuations. However the firm initially has a prior probability over several possible market demands. The problem for the firm is find an optimal sequence of prices to learn more about the true demand while maximizing its expected discounted profits. In particular, Rothschild shows that ex ante optimal pricing rules may well end up using prices that are ex post suboptimal (i.e. suboptimal if the true distribution were to be known). If several firms were to experiment independently in the same market, they might offer different prices in the long run. Optimal experimentation may therefore lead to price dispersion in the long run as shown formally in McLennan (1984).



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2006 BanditProblemsDirk Bergemann
Juuso Välimäki
Bandit Problems2006