No-Regret Online Learning Algorithm
A No-Regret Online Learning Algorithm is an online learning algorithm that provides a theoretical guarantee that its cumulative performance falls short of the best fixed strategy in hindsight by at most a sublinearly growing amount, so that its average performance asymptotically matches or exceeds that of the best fixed strategy, regardless of the sequence of examples encountered.
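In the standard formalization: if the algorithm plays action [math]\displaystyle{ a_t }[/math] at round [math]\displaystyle{ t }[/math] and each action [math]\displaystyle{ a \in A }[/math] receives utility [math]\displaystyle{ u_t(a) }[/math], the external regret after [math]\displaystyle{ T }[/math] rounds is [math]\displaystyle{ R_T = \max_{a \in A} \sum_{t=1}^{T} u_t(a) - \sum_{t=1}^{T} u_t(a_t) }[/math], and the algorithm is no-regret when [math]\displaystyle{ R_T = o(T) }[/math], i.e. when the average regret [math]\displaystyle{ R_T / T }[/math] tends to zero as [math]\displaystyle{ T \to \infty }[/math].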
- AKA: No-Regret Algorithm, No-Regret Learning, Zero-Regret Learning, No-Regret Online Learner, Regret Minimization Algorithm, External Regret Minimization Algorithm, Sublinear Regret Algorithm, Vanishing Regret Algorithm.
- Context:
- It can typically provide performance guarantees that the cumulative regret grows sublinearly with respect to the horizon length.
- It can typically make sequential decisions under uncertainty where the environment may be non-stationary or even adversarial.
- It can typically define regret as the difference between the best fixed action's cumulative reward in hindsight and the algorithm's cumulative reward.
- It can typically guarantee asymptotic consistency by ensuring that the average regret per round converges to zero as the number of rounds approaches infinity.
- It can typically achieve regret bounds without making distributional assumptions about the data generation process.
- It can typically respond to observed feedback after each decision to improve future performance.
- It can typically generate a stable sequence of predictors where new data causes increasingly smaller changes to the learned hypothesis.
- ...
- It can often serve as a component in more complex learning systems like imitation learning and structured prediction.
- It can often provide theoretical foundations for empirical algorithms in various domains.
- It can often connect learning theory with equilibrium concepts in game theory.
- It can often minimize external regret by comparing against the best single action in hindsight (a minimal Hedge sketch follows this list).
- It can often be extended to minimize internal regret by comparing against local action swaps.
- It can often be adapted to bandit settings where only the reward of chosen actions is observed.
- It can often handle continuous action spaces through online convex optimization techniques.
- It can often update learning rates adaptively based on observed sequence characteristics.
- It can often relate to online machine learning algorithms through theoretical guarantees about convergence.
- It can often differ from supervised incremental learning algorithms by focusing on worst-case performance rather than distributional assumptions.
- ...
- It can range from being a Full-Information No-Regret Algorithm to being a Partial-Information No-Regret Algorithm, depending on its feedback model.
- It can range from being a Stochastic No-Regret Algorithm to being an Adversarial No-Regret Algorithm, depending on its environment assumption.
- It can range from being a First-Order No-Regret Algorithm to being a Second-Order No-Regret Algorithm, depending on its update rule complexity.
- It can range from being an External Regret Minimization Algorithm to being an Internal Regret Minimization Algorithm, depending on its regret notion.
- It can range from being a Simple No-Regret Algorithm to being an Adaptive No-Regret Algorithm, depending on its parameter tuning approach.
- It can range from being a Reinforcement Learning-Adjacent No-Regret Algorithm to being a Pure No-Regret Algorithm, depending on its learning framework relationship.
- ...
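To make the external-regret guarantee above concrete, the following is a minimal Python sketch of the Hedge (multiplicative-weights) algorithm; the function name, the [0, 1]-bounded losses, and the known horizon are assumptions of this illustration rather than details from the cited sources. With the standard learning-rate tuning, its regret against the best fixed action grows as O(√(T ln K)), so the average regret per round vanishes.

```python
# A minimal sketch of the Hedge (multiplicative weights) algorithm, a classic
# external-regret-minimizing learner for prediction with expert advice in the
# full-information setting. Assumptions: losses lie in [0, 1] and the horizon T
# is known in advance; the learning rate uses the standard sqrt(2 ln K / T)
# tuning, which yields O(sqrt(T ln K)) regret against the best fixed action.
import numpy as np

def hedge(loss_matrix, eta=None):
    """Run Hedge on a T x K loss matrix (rows = rounds, columns = actions).

    Returns (expected cumulative loss, external regret vs. best fixed action).
    """
    T, K = loss_matrix.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(K) / T)       # horizon-dependent learning rate
    weights = np.ones(K)
    algo_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()           # play the normalized weight vector
        algo_loss += probs @ loss_matrix[t]       # expected loss this round
        weights *= np.exp(-eta * loss_matrix[t])  # exponentially down-weight bad actions
    best_fixed = loss_matrix.sum(axis=0).min()    # best single action in hindsight
    return algo_loss, algo_loss - best_fixed

# Even on an arbitrary (here: random) loss sequence, average regret per round shrinks with T.
rng = np.random.default_rng(0)
total_loss, regret = hedge(rng.uniform(size=(10_000, 5)))
print(f"average regret per round: {regret / 10_000:.4f}")
```

The same multiplicative update underlies the Randomized Weighted Majority algorithm and, when combined with importance-weighted loss estimates, extends to the bandit setting sketched after the Examples list.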
- Examples:
- Follow The Leader No-Regret Algorithms, such as the Follow-The-Regularized-Leader (FTRL) Algorithm and the Follow-The-Perturbed-Leader (FTPL) Algorithm.
- Multiplicative Weights No-Regret Algorithms, such as the Hedge Algorithm and the Randomized Weighted Majority Algorithm.
- Online Gradient Descent No-Regret Algorithms, such as Zinkevich's Online Gradient Descent Algorithm for online convex optimization.
- Multi-Armed Bandit No-Regret Algorithms, such as the EXP3 Algorithm and the UCB1 Algorithm (see the bandit sketch after this list).
- Contextual Bandit No-Regret Algorithms, such as the EXP4 Algorithm and the LinUCB Algorithm.
- Game-Theoretic No-Regret Algorithms, such as the Regret Matching Algorithm and Counterfactual Regret Minimization (CFR).
- Specialized Application No-Regret Algorithms, such as the DAgger Algorithm for imitation learning (Ross et al., 2011).
- ...
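For the partial-information (bandit) case referenced in the Examples above, the following is a minimal Python sketch of EXP3; the reward function, the exploration rate gamma, and the three-arm instance are hypothetical illustrations rather than details from the cited references. Because only the chosen arm's reward is observed, an importance-weighted estimate replaces the full reward vector before the multiplicative update.

```python
# A minimal sketch of EXP3, a no-regret algorithm for the adversarial
# multi-armed bandit setting (partial information: only the pulled arm's
# reward is observed). Assumptions: rewards lie in [0, 1]; gamma is the usual
# exploration-mixing parameter; with an appropriately tuned gamma the expected
# regret is O(sqrt(T K log K)).
import numpy as np

def exp3(reward_fn, K, T, gamma=0.1, seed=0):
    """reward_fn(t, arm) -> reward in [0, 1]; the sequence may be adversarial."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    total_reward = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        r = reward_fn(t, arm)
        total_reward += r
        estimate = r / probs[arm]                     # importance-weighted, unbiased estimate
        weights[arm] *= np.exp(gamma * estimate / K)  # multiplicative update on the pulled arm
    return total_reward

# Hypothetical stochastic instance: three Bernoulli arms with means 0.4, 0.5, 0.7.
bandit_rng = np.random.default_rng(1)
means = [0.4, 0.5, 0.7]
avg = exp3(lambda t, a: float(bandit_rng.uniform() < means[a]), K=3, T=20_000) / 20_000
print(f"average reward: {avg:.3f} (best arm's mean is 0.7)")
```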
- Counter-Examples:
- Greedy Algorithms, which make locally optimal choices without regret minimization guarantees.
- Epsilon-Greedy Reinforcement Learning Algorithms, which balance exploration and exploitation but lack formal regret bounds.
- Pure Exploration Algorithms, which focus solely on information gathering rather than reward maximization.
- Batch Learning Algorithms, which process the entire dataset at once without sequential decision making.
- Maximum Likelihood Estimation Algorithms, which optimize for statistical efficiency rather than worst-case regret.
- Posterior Sampling Algorithms (without regret analysis), which focus on Bayesian optimality without adversarial guarantees.
- Typical Supervised Learning Algorithms, which assume i.i.d. data rather than adversarial sequences.
- See: Reinforcement Learning (RL) Algorithm, Online Machine Learning (ML) Algorithm, Supervised Incremental Learning Algorithm, Repeated Interaction, Iterative Learning, Imitation Learning, Gradient Descent, Online Convex Optimization, Game Theory Equilibrium, Expert Advice Problem, Multi-Armed Bandit Problem, Prediction with Expert Advice.
References
2011
- (Ross et al., 2011) ⇒ Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. (2011). “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In: Proceedings of AISTATS-2011.
2007
- (Blum & Monsour, 2007) ⇒ Avrim Blum, and Yishay Monsour. (2007). “Learning, Regret Minimization, and Equilibria.” xx xxx.
- ABSTRACT: Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive to work each day, or repeated play of a game against an opponent with an unknown strategy. In this chapter we describe learning algorithms with strong guarantees for settings of this type, along with connections to game-theoretic equilibria when all players in a system are simultaneously adapting in such a manner.
We begin by presenting algorithms for repeated play of a matrix game with the guarantee that against any opponent, they will perform nearly as well as the best fixed action in hindsight (also called the problem of combining expert advice or minimizing external regret). In a zero-sum game, such algorithms are guaranteed to approach or exceed the minimax value of the game, and even provide a simple proof of the minimax theorem. We then turn to algorithms that minimize an even stronger form of regret, known as internal or swap regret. We present a general reduction showing how to convert any algorithm for minimizing external regret to one that minimizes this stronger form of regret as well. Internal regret is important because when all players in a game minimize this stronger type of regret, the empirical distribution of play is known to converge to correlated equilibrium.
The third part of this chapter explains a different reduction: how to convert from the full information setting in which the action chosen by the opponent is revealed after each time step, to the partial information (bandit) setting, where at each time step only the payoff of the selected action is observed (such as in routing), and still maintain a small external regret.
Finally, we end by discussing routing games in the Wardrop model, where one can show that if all participants minimize their own external regret, then overall traffic is guaranteed to converge to an approximate Nash Equilibrium. This further motivates price-of-anarchy results.
2005
- (Vermorel et al., 2005) ⇒ Joannès Vermorel, and Mehryar Mohri. (2005). “Multi-armed Bandit Algorithms and Empirical Evaluation.” In: Proceedings of the 16th European Conference on Machine Learning (ECML 2005). doi:10.1007/11564096_42
- QUOTE: A different version of the bandit problem has been studied by [10, 23, 9, 8] where the reward distributions are assumed to be known to the player. This problem is not about balancing exploration and exploitation, it admits an optimal solution based on the so-called Gittins indices. This paper deals with bandit problems found in practice where the assumption about the prior knowledge of the payoffs typically does not hold (see for example section 4).
The regret [math]\displaystyle{ \rho }[/math] after [math]\displaystyle{ T }[/math] rounds is defined as the difference between the reward sum associated to an optimal strategy and the sum of the collected rewards, [math]\displaystyle{ \rho = T\mu^* - \sum_{t=1}^T \hat{r}_t }[/math], where [math]\displaystyle{ \mu^* }[/math] is the maximal reward mean, [math]\displaystyle{ \mu^* = \operatorname{max}_k{\mu_k} }[/math], and [math]\displaystyle{ \hat{r}_t }[/math] the reward at time [math]\displaystyle{ t }[/math]. A strategy whose average regret per round tends to zero with probability 1 for any bandit problem when the horizon tends to infinity is a zero-regret strategy. Intuitively, zero-regret strategies are guaranteed to converge to an optimal strategy, not necessarily unique, if enough rounds are played.
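As a minimal numeric illustration of the quoted definition (with hypothetical numbers): over [math]\displaystyle{ T = 4 }[/math] rounds with maximal reward mean [math]\displaystyle{ \mu^* = 0.9 }[/math] and collected rewards [math]\displaystyle{ 0.5, 0.8, 0.9, 0.7 }[/math], the regret is [math]\displaystyle{ \rho = 4 \times 0.9 - 2.9 = 0.7 }[/math]; a zero-regret strategy drives [math]\displaystyle{ \rho / T }[/math] toward zero as [math]\displaystyle{ T \to \infty }[/math].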