Upper Confidence Bound (UCB) Algorithm

From GM-RKB
Jump to navigation Jump to search

An Upper Confidence Bound (UCB) Algorithm is a sequential model-based optimization multi-armed bandit algorithm that balance exploration and exploitation by estimating the mean reward of each arm and choosing the arm with the highest upper confidence bound based on the estimated variance.



References

2023

  • GBard
    • The Upper Confidence Bound (UCB) algorithm is a commonly used algorithm in machine learning, and it works by estimating the mean reward of each arm and choosing the arm with the highest upper confidence bound based on the estimated variance.
    • Multi-Armed Bandit Problem: The UCB algorithm is used to solve the multi-armed bandit problem (MAB problem). The MAB problem is a sequential decision-making problem where an agent has a set of actions (arms) to choose from, and each arm has a different reward distribution. The agent's goal is to maximize its cumulative reward over time.
    • UCB Algorithm: The UCB algorithm works by maintaining an estimate of the mean reward for each arm. At each time step, the algorithm chooses the arm with the highest upper confidence bound. The upper confidence bound is calculated as follows:
      • UCB(a) = Q(a) + sqrt(c * ln(t) / N(a))
    • where:
   a is the arm
   Q(a) is the estimated mean reward for arm a
   c is a constant that controls the trade-off between exploration and exploitation
   t is the current time step
   N(a) is the number of times arm a has been chosen

The UCB algorithm balances exploration and exploitation by choosing arms with both high estimated rewards and high uncertainty. By choosing arms with high uncertainty, the algorithm is able to learn more about the reward distributions of the arms.

2014