Thompson Sampling Algorithm


A Thompson Sampling Algorithm is a heuristic sampling algorithm that addresses the exploration-exploitation trade-off by, in each round, sampling a parameter from the posterior distribution over reward models and playing the action that maximizes the expected reward under that sample.



References

2017

  • (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Thompson_sampling#Description Retrieved:2017-4-24.
    • Consider a set of contexts [math]\displaystyle{ \mathcal{X} }[/math], a set of actions [math]\displaystyle{ \mathcal{A} }[/math], and rewards in [math]\displaystyle{ \mathbb{R} }[/math]. In each round, the player obtains a context [math]\displaystyle{ x \in \mathcal{X} }[/math], plays an action [math]\displaystyle{ a \in \mathcal{A} }[/math], and receives a reward [math]\displaystyle{ r \in \mathbb{R} }[/math] following a distribution that depends on the context and the issued action. The aim of the player is to play actions so as to maximize the cumulative reward.

      The elements of Thompson sampling are as follows:

      1. a likelihood function [math]\displaystyle{ P(r|\theta,a,x) }[/math];
      2. a set [math]\displaystyle{ \Theta }[/math] of parameters [math]\displaystyle{ \theta }[/math] of the distribution of [math]\displaystyle{ r }[/math];
      3. a prior distribution [math]\displaystyle{ P(\theta) }[/math] on these parameters;
      4. a set of past observations, given as triplets [math]\displaystyle{ \mathcal{D} = \{(x; a; r)\} }[/math];
      5. a posterior distribution [math]\displaystyle{ P(\theta|\mathcal{D}) \propto P(\mathcal{D}|\theta)P(\theta) }[/math], where [math]\displaystyle{ P(\mathcal{D}|\theta) }[/math] is the likelihood function.
    • Thompson sampling consists of playing the action [math]\displaystyle{ a^\ast \in \mathcal{A} }[/math] according to the probability that it maximizes the expected reward, i.e.: [math]\displaystyle{ \int \mathbb{I}\left[\mathbb{E}(r|a^\ast,x,\theta) = \max_{a'} \mathbb{E}(r|a',x,\theta)\right] P(\theta|\mathcal{D}) \, d\theta, }[/math] where [math]\displaystyle{ \mathbb{I} }[/math] is the indicator function.

      In practice, the rule is implemented by sampling, in each round, a parameter [math]\displaystyle{ \theta^\ast }[/math] from the posterior [math]\displaystyle{ P(\theta|\mathcal{D}) }[/math], and choosing the action [math]\displaystyle{ a^\ast }[/math] that maximizes [math]\displaystyle{ \mathbb{E}[r|\theta^\ast,a^\ast,x] }[/math], i.e. the expected reward given the sampled parameter, the action, and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round and then acts optimally according to them.
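
As an illustrative special case of the elements above (not part of the cited description): for a context-free Bernoulli bandit with [math]\displaystyle{ K }[/math] arms, the reward is [math]\displaystyle{ r \in \{0,1\} }[/math], the likelihood is [math]\displaystyle{ P(r|\theta,a) = \theta_a^r (1-\theta_a)^{1-r} }[/math], and each arm parameter [math]\displaystyle{ \theta_a }[/math] can be given a Beta prior. Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior after observing [math]\displaystyle{ S_a }[/math] rewards of 1 and [math]\displaystyle{ F_a }[/math] rewards of 0 on arm [math]\displaystyle{ a }[/math] is [math]\displaystyle{ \theta_a | \mathcal{D} \sim \mathrm{Beta}(\alpha + S_a, \beta + F_a) }[/math].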
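
A minimal Python sketch of this sample-then-act rule for the Bernoulli case follows; the Beta(1,1) prior, the arm probabilities, the round count, and the function name are illustrative assumptions rather than anything prescribed by the cited description.

import random

# Illustrative Thompson Sampling loop for a K-armed Bernoulli bandit.
# Each arm a keeps Beta(1 + S_a, 1 + F_a) posterior counts, where S_a and
# F_a are the observed numbers of 1-rewards and 0-rewards for that arm.
def thompson_sampling_bernoulli(true_probs, n_rounds=1000, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k  # S_a per arm
    failures = [0] * k   # F_a per arm
    total_reward = 0
    for _ in range(n_rounds):
        # 1. Sample theta* from the posterior: theta_a ~ Beta(1 + S_a, 1 + F_a).
        theta = [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in range(k)]
        # 2. Play the action that maximizes the expected reward under theta*.
        action = max(range(k), key=lambda a: theta[a])
        # 3. Observe a Bernoulli reward and update the posterior counts.
        reward = 1 if rng.random() < true_probs[action] else 0
        if reward:
            successes[action] += 1
        else:
            failures[action] += 1
        total_reward += reward
    return successes, failures, total_reward

if __name__ == "__main__":
    # Three hypothetical arms; play should concentrate on the 0.7 arm.
    s, f, r = thompson_sampling_bernoulli([0.2, 0.5, 0.7], n_rounds=5000)
    print("pulls per arm:", [si + fi for si, fi in zip(s, f)])
    print("total reward:", r)

Because the sampled [math]\displaystyle{ \theta_a }[/math] values are drawn from the posterior rather than fixed at their means, arms with few observations are still selected often enough to be explored, which is the mechanism behind the exploration-exploitation balance described above.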
