2005 Multi-Armed Bandit Algorithms and Empirical Evaluation

From GM-RKB

Subject Headings: Multi-Armed Bandit Algorithm; Multi-Armed Bandit Task.

Notes

Cited By

Quotes

Abstract

The multi-armed bandit problem for a gambler is to decide which arm of a K-slot machine to pull to maximize his total reward in a series of trials. Many real-world learning and optimization problems can be modeled in this way. Several strategies or algorithms have been proposed as a solution to this problem in the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms.

This paper provides a preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm, Poker (Price Of Knowledge and Estimated Reward), whose performance compares favorably to that of other existing algorithms in several experiments. One remarkable outcome of our experiments is that the most naive approach, the ε-greedy strategy, often proves hard to beat.
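
The paper does not restate the ε-greedy rule at this point; as a rough, non-authoritative sketch of the standard strategy (the helper name epsilon_greedy, the pull callback, and the optimistic handling of untried levers are illustrative assumptions, not the paper's formulation):

 import random
 
 def epsilon_greedy(pull, n_arms, horizon, epsilon=0.1):
     """With probability epsilon pull a lever uniformly at random (explore);
     otherwise pull the lever with the highest empirical mean so far (exploit).
     Unpulled levers are treated optimistically so each is tried at least once."""
     counts = [0] * n_arms        # pulls per lever
     sums = [0.0] * n_arms        # cumulative reward per lever
     total = 0.0
     for _ in range(horizon):
         if random.random() < epsilon:
             arm = random.randrange(n_arms)
         else:
             means = [s / c if c else float('inf') for s, c in zip(sums, counts)]
             arm = max(range(n_arms), key=lambda k: means[k])
         r = pull(arm)            # observe only the chosen lever's reward
         counts[arm] += 1
         sums[arm] += r
         total += r
     return total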

1 Introduction

In many real-world situations, decisions are made in order to maximize some expected numerical reward. But decisions, or the actions they generate, do not just bring in more reward, they can also help discover new knowledge that could be used to improve future decisions. Such situations include clinical trials [11] where different treatments need to be experimented with while minimizing patient losses, or adaptive routing efforts for minimizing delays in a network [4]. The questions that arise in all these cases are related to the problem of balancing reward maximization based on the knowledge already acquired and attempting new actions to further increase knowledge, which is known as the exploitation vs. exploration tradeoff in reinforcement learning.

The multi-armed bandit problem, originally described by Robbins [19], is an instance of this general problem. A multi-armed bandit, also called a K-armed bandit, is similar to a traditional slot machine (one-armed bandit) but in general has more than one lever. When pulled, each lever provides a reward drawn from a distribution associated with that specific lever. Initially, the gambler has no knowledge about the levers, but through repeated trials, he can focus on the most rewarding levers.

This paper considers the opaque bandit problem, where a single reward is observed at each round, in contrast with the transparent one, where all rewards are observed [14]. To our knowledge, there is no empirical comparison for the transparent bandit problem either. More formally, the opaque stochastic K-armed bandit (bandit for short) can be seen as a set of real reward distributions [math]\displaystyle{ B = \{R_1, \ldots, R_K\} }[/math], each distribution being associated with the rewards brought in by a specific lever.[1] Let [math]\displaystyle{ \mu_1, \ldots, \mu_K }[/math] be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. His objective is to maximize the sum of the collected rewards. The horizon [math]\displaystyle{ H }[/math] is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov Decision Process (MDP), but the general study of MDPs goes beyond the scope of this paper.
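
To make the formal setting concrete, here is a minimal Python sketch of such a bandit environment (not from the paper); the class name GaussianBandit, the choice of Gaussian reward distributions, and the specific means are illustrative assumptions:

 import random
 
 class GaussianBandit:
     """Opaque stochastic K-armed bandit: lever k returns rewards drawn from a
     fixed Gaussian distribution with mean means[k]; the means are hidden from
     the gambler, who only sees the reward of the lever pulled at each round."""
     def __init__(self, means, sigma=1.0):
         self.means = list(means)   # mu_1, ..., mu_K
         self.sigma = sigma
 
     def pull(self, k):
         # one round: draw a single reward from distribution R_k
         return random.gauss(self.means[k], self.sigma)
 
 # usage: a 3-armed bandit played uniformly at random for 1000 rounds
 bandit = GaussianBandit([0.2, 0.5, 0.8])
 rewards = [bandit.pull(random.randrange(3)) for _ in range(1000)]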

A different version of the bandit problem has been studied in [10, 23, 9, 8], where the reward distributions are assumed to be known to the player. That problem is not about balancing exploration and exploitation: it admits an optimal solution based on the so-called Gittins indices. This paper deals with bandit problems found in practice, where the assumption of prior knowledge of the payoffs typically does not hold (see for example Section 4).

The regret [math]\displaystyle{ \rho }[/math] after [math]\displaystyle{ T }[/math] rounds is defined as the difference between the reward sum associated with an optimal strategy and the sum of the collected rewards: [math]\displaystyle{ \rho = T\mu^* - \sum_{t=1}^T \hat{r}_t }[/math], where [math]\displaystyle{ \mu^* }[/math] is the maximal reward mean, [math]\displaystyle{ \mu^* = \operatorname{max}_k \mu_k }[/math], and [math]\displaystyle{ \hat{r}_t }[/math] the reward at time [math]\displaystyle{ t }[/math]. A strategy whose average regret per round tends to zero with probability 1 for any bandit problem when the horizon tends to infinity is a zero-regret strategy. Intuitively, zero-regret strategies are guaranteed to converge to an optimal strategy, not necessarily unique, if enough rounds are played.
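
As a small illustration of this definition (the helper name regret and the reuse of the hypothetical bandit sketch above are ours, not the paper's):

 def regret(mu_star, collected_rewards):
     """Empirical regret after T rounds: rho = T * mu_star - sum of rewards,
     where mu_star = max_k mu_k is the best lever's mean reward."""
     T = len(collected_rewards)
     return T * mu_star - sum(collected_rewards)
 
 # e.g. for the hypothetical GaussianBandit above, mu_star = 0.8:
 # rho = regret(0.8, rewards)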

The problem of determining the best strategy for the gambler is called the multi-armed bandit problem. Many strategies or algorithms have been proposed as a solution to this problem in the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms. This paper provides the first preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm, Poker (Price Of Knowledge and Estimated Reward), whose performance compares favorably to that of other existing algorithms in several experiments.

The paper is organized as follows. We first present an overview of several bandit strategies or algorithms (Section 2), then introduce a new algorithm, Poker (Section 3), and describe our experiments with both an artificially generated dataset and a real networking dataset. The results of an empirical evaluation of several bandit algorithms, including Poker, are reported in Section 4.

References

Joannès Vermorel, and Mehryar Mohri. (2005). "Multi-armed Bandit Algorithms and Empirical Evaluation." In: Proceedings of the 16th European Conference on Machine Learning (ECML 2005). doi:10.1007/11564096_42
  1. Several algorithms have also been designed for the non-stochastic bandit problem [3] where much weaker assumptions are made about the levers’ rewards, but this paper will focus on the stochastic bandit problem which has been studied the most so far.