# k-Armed Bandit Maximization Task

A k-Armed Bandit Maximization Task is an online rewards-maximization task where the decision-making agent must make a finite sequence of choices against [math]k[/math] independent systems such that rewards are maximized.

**Context:**- It can be solved by a k-Armed Bandit System (that implements a k-armed bandit algorithm).
- It can range from being a Contextual k-Armed Bandit Task to being a Non-contextual k-Armed Bandit Task.
- It can range from being an Opaque k-Armed Bandit Task to being a Transparent k-Armed Bandit Task.
- It can range from being an Adversarial k-Armed Bandit Task to being a Nonadversarial k-Armed Bandit Task.
- It can range from being a 2-Armed Bandit Task to being a Multi-Armed Bandit Task.

**Example(s):**- Maximize profits from a set of 5 slot machines.
- Allocate resources among the competing projects whose properties are only partially known at the time of allocation but which may become better understood as time passes.
- a Stochastic Multi-Armed Bandit Task.
- Associative Bandit Problems.

**Counter-Example(s):****See:**Reward Maximization, Thompson Sampling, Markov Processes, PAC Learning.

## References

### 2018

- (McInerney et al., 2018) ⇒ [[::James McInerney]], [[::Benjamin Lacker]], [[::Samantha Hansen]], [[::Karl Higley]], [[::Hugues Bouchard]], [[::Alois Gruson]], and [[::Rishabh Mehrotra]]. ([[::2018]]). “Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits.” In: Proceedings of the 12th ACM Conference on Recommender Systems. ISBN:978-1-4503-5901-6 doi:10.1145/3240323.3240354
- QUOTE: ... The multi-armed bandit is an important framework for balancing exploration with exploitation in recommendation. Exploitation recommends content (e.g., products, movies, music playlists) with the highest predicted user engagement and has traditionally been the focus of recommender systems. Exploration recommends content with uncertain predicted user engagement for the purpose of gathering more information. The importance of exploration has been recognized in recent years, particularly in settings with new users, new items, non-stationary preferences and attributes. ...

### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/slot_machine Retrieved:2015-6-20.
- ... A gambler strategically operating multiple machines in order to draw the highest possible profits is called a
*multi-armed bandit*. ...

- ... A gambler strategically operating multiple machines in order to draw the highest possible profits is called a

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/multi-armed_bandit Retrieved:2015-6-20.
- In probability theory, the '
*multi-armed bandit problem (sometimes called the*) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them.*K*-^{[1]}or*N*-armed bandit problem^{[2]}When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.^{[3]}^{[4]}Robbins in 1952, realizing the importance of the problem, constructed convergent population selection strategies in "some aspects of the sequential design of experiments". A theorem, the Gittins index published first by John C. Gittins gives an optimal policy in the Markov setting for maximizing the expected discounted reward.In practice, multi-armed bandits have been used to model the problem of managing research projects in a large organization, like a science foundation or a pharmaceutical company. Given a fixed budget, the problem is to allocate resources among the competing projects, whose properties are only partially known at the time of allocation, but which may become better understood as time passes.

^{[3]}^{[4]}In early versions of the multi-armed bandit problem, the gambler has no initial knowledge about the machines. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in reinforcement learning.

- In probability theory, the '

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`doi10.1023/A:1013689704352`

- ↑ Cite error: Invalid
`<ref>`

tag; no text was provided for refs named`weber`

- ↑
^{3.0}^{3.1}Cite error: Invalid`<ref>`

tag; no text was provided for refs named`Gittins89`

- ↑
^{4.0}^{4.1}Cite error: Invalid`<ref>`

tag; no text was provided for refs named`BF`

### 2011

- (Mannor, 2011) ⇒ Shie Mannor. (2011). “k-Armed Bandit.” In: (Sammut & Webb, 2011) p.561
- QUOTE: In the classical k-armed bandit problem, there are k alternative arms, each with a stochastic reward whose probability distribution is initially unknown. A decision maker can try these arms in some order, which may depend on the rewards that have been observed so far. A common objective in this context is to find a policy for choosing the next arm to be tried, under which the sum of the expected rewards comes as close as possible to the ideal reward, that is, the expected reward that would be obtained if it were to try the “best” arm at all times. There are many variants of the k-armed bandit problem that are distinguished by the objective of the decision maker, the process governing the reward of each arm, and the information available to the decision maker at the end of every trial.

### 2006

- (Bergemann & Välimäki, 2006) ⇒ Dirk Bergemann, and Juuso Välimäki. (2006). “Bandit Problems." Technical Report-93, HECER.
- QUOTE: The multi-armed bandit problem, originally described by Robbins (1952), is a statistical decision model of an agent trying to optimize his decisions while improving his information at the same time. In the multiarm bandit problem, the gambler has to decide which arm of K different slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Each choice of an arm results in an immediate random payoff, but the process determining these payoffs evolves during the play of the bandit. The distinguishing feature of bandit problems is that the distribution of returns from one arm only changes when that arm is chosen. Hence the rewards from an arm do not depend on the rewards obtained from other arms. This feature also implies that the distributions of returns do not depend explicitly on calendar time.

### 2005

- (Vermorel et al., 2005) ⇒ Joannès Vermorel, and Mehryar Mohri. (2005). “Multi-armed Bandit Algorithms and Empirical Evaluation.” In: Proceedings of the 16th European conference on Machine Learning. doi:10.1007/11564096_42
- The multi-armed bandit problem for a gambler is to decide which arm of a
*K*-slot machine to pull to maximize his total reward in a series of trials. Many real-world learning and optimization problems can be modeled in this way. ... This paper considers the opaque bandit problem where a unique reward is observed at each round, in contrast with the transparent one where all rewards are observed [14].

- The multi-armed bandit problem for a gambler is to decide which arm of a

### 2003

### 1989

- (Gittins, 1989) ⇒ J. C. Gittins. (1989). “Multi-Armed Bandit Allocation Indices." John Wiley & Sons, Ltd., ISBN 0-471-92059-2.

### 1985

- (Berry & Fristedt) ⇒ Donald A. Berry, and Bert Fristedt. (1985). “Bandit Problems: Sequential allocation of experiments." Chapman & Hall, ISBN 0-412-24810-7.

### 1952

- (Robbins, 1952) ⇒ Herbert Robbins. (1952). “Some Aspects of the Sequential Design of Experiments.” In: Bulletin of the American Mathematical Society, 58 (5) doi:10.1090/S0002-9904-1952-09620-8

*
*