Difference between revisions of "k-Armed Bandit Maximization Task"

Jump to: navigation, search
m (Text replacement - "System]] (that applies a " to "System (that implements a [[")
(One intermediate revision by the same user not shown)
Line 18: Line 18:
== References ==
== References ==
=== 2018 ===
* ([[2018_ExploreExploitandExplainPersona|McInerney et al., 2018]]) ⇒ [[::James McInerney]], [[::Benjamin Lacker]], [[::Samantha Hansen]], [[::Karl Higley]], [[::Hugues Bouchard]], [[::Alois Gruson]], and [[::Rishabh Mehrotra]]. ([[::2018]]). “[https://static1.squarespace.com/static/5ae0d0b48ab7227d232c2bea/t/5ba849e3c83025fa56814f45/1537755637453/BartRecSys.pdf Explore, Exploit, and Explain: Personalizing Explainable Recommendations with Bandits].” In: [[Proceedings of the 12th ACM Conference on Recommender Systems]]. ISBN:978-1-4503-5901-6 [http://dx.doi.org/10.1145/3240323.3240354 doi:10.1145/3240323.3240354]
** QUOTE: ... The [[multi-armed bandit]] is an [[important framework]] for [[balancing exploration]] with exploitation in [[recommendation]]. </s> Exploitation [[recommends content]] (e.g., [[product]]s, [[movi]]es, [[music playlist]]s) with the highest [[predicted user engagement]] and has traditionally been the focus of [[recommender system]]s. </s> [[Exploration recommends content]] with [[uncertain predicted user engagement]] for the purpose of [[gathering more information]]. </s> The [[importance of exploration]] has been recognized in recent years, particularly in [[setting]]s with new [[user]]s, new [[item]]s, [[non-stationary preference]]s and [[attribute]]s. </s> ...
=== 2015 ===
=== 2015 ===

Latest revision as of 00:30, 5 December 2019

A k-Armed Bandit Maximization Task is an online rewards-maximization task where the decision-making agent must make a finite sequence of choices against [math]k[/math] independent systems such that rewards are maximized.




  • (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/multi-armed_bandit Retrieved:2015-6-20.
    • In probability theory, the 'multi-armed bandit problem (sometimes called the K-[1] or N-armed bandit problem ) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them.[2] When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.[3][4] Robbins in 1952, realizing the importance of the problem, constructed convergent population selection strategies in "some aspects of the sequential design of experiments". A theorem, the Gittins index published first by John C. Gittins gives an optimal policy in the Markov setting for maximizing the expected discounted reward.

      In practice, multi-armed bandits have been used to model the problem of managing research projects in a large organization, like a science foundation or a pharmaceutical company. Given a fixed budget, the problem is to allocate resources among the competing projects, whose properties are only partially known at the time of allocation, but which may become better understood as time passes.[3][4]

      In early versions of the multi-armed bandit problem, the gambler has no initial knowledge about the machines. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in reinforcement learning.

  1. Cite error: Invalid <ref> tag; no text was provided for refs named doi10.1023/A:1013689704352
  2. Cite error: Invalid <ref> tag; no text was provided for refs named weber
  3. 3.0 3.1 Cite error: Invalid <ref> tag; no text was provided for refs named Gittins89
  4. 4.0 4.1 Cite error: Invalid <ref> tag; no text was provided for refs named BF






  • (Gittins, 1989) ⇒ J. C. Gittins. (1989). “Multi-Armed Bandit Allocation Indices." John Wiley & Sons, Ltd., ISBN 0-471-92059-2.


  • (Berry & Fristedt) ⇒ Donald A. Berry, and Bert Fristedt. (1985). “Bandit Problems: Sequential allocation of experiments." Chapman & Hall, ISBN 0-412-24810-7.