# Online Reward-Maximization Task

An Online Reward-Maximization Task is an online learning task with rewardable decisions (with a utility function and constraint rules), characterized by the pursuit of maximizing rewards over time through interactions with a dynamic environment.

**AKA:** Trial-and-Error Learning.

**Context:**
- It can be solved by a Reinforcement Learning System that implements a reinforcement learning algorithm designed to maximize cumulative rewards.
- It can be solved by an Online Reward-Maximization System (that implements an online reward-maximization algorithm).
- It can (typically) require optimizing Future Reward, focusing on long-term gains rather than immediate rewards.
- It can (typically) involve a Credit Assignment Task to determine which actions contribute most to future rewards, especially in contexts with delayed rewards such as multi-decision competitive games.
- It can (typically) require the learning of Behavior Policies, which are strategies that dictate the actions an agent takes in various states, guiding the agent towards reward-maximizing behaviors.
- It can range from being Passive Reinforcement Learning, where the agent's policy is fixed and the goal is to evaluate a given policy, to Active Reinforcement Learning, which requires the agent to explore and exploit to discover the optimal policy in the absence of a fixed strategy.
- It can range from being a Discrete-Space Online Reward-Maximization Task to being a Continuous-Space Online Reward-Maximization Task.
- …
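The emphasis on long-term gains over immediate rewards can be made concrete with a discounted return, the quantity such tasks typically maximize. A minimal sketch (the function name and discount factor are illustrative, not taken from any cited source):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards weighted by gamma**t: later rewards count for less."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A delayed reward of 10 at step 3 is worth less than an immediate reward of 8:
print(discounted_return([0, 0, 0, 10]))  # ≈ 7.29 (10 * 0.9**3)
print(discounted_return([8, 0, 0, 0]))   # 8.0
```

With `gamma` close to 1 the agent is far-sighted; with `gamma` close to 0 it effectively optimizes immediate reward only.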

**Example(s):**
- a Reinforcement Learning-based Online Reward-Maximization Task, which requires the use of an RL algorithm.
- a k-Armed Bandit Task, such as a scenario involving several slot machines (bandits), each with a different but fixed probability of winning, where the goal is to discover the bandit with the highest payout rate through trial and error.
- an RL Benchmark Task from `http://www.rl-competition.org/`, like the CartPole task, where the goal is to balance a pole on a moving cart for as long as possible by applying forces to the cart's base.
- an Autonomous Helicopter Flight Task, which involves teaching an autonomous helicopter to fly and perform complex maneuvers autonomously by learning from simulations and real-world trials, dealing with high-dimensional state spaces and the helicopter's dynamic responses.
- a Robot Control Task for a robot control scenario, such as teaching a mobile robot to navigate complex environments and avoid obstacles while optimizing travel time.
- a Game Playing Task, where agents learn to play and excel at complex games like Go or chess, facing challenges such as large state spaces, strategic depth, and delayed rewards.
- an Autonomous System Task for autonomous systems like self-driving cars, focusing on safe and efficient navigation in unpredictable real-world conditions.
- an Adaptive User Interface Task, such as a streaming platform that adapts its interface and recommendations based on individual user behavior to enhance engagement and satisfaction.
- a Dynamic Item Recommendation Task, which involves personalizing content suggestions based on user interactions and feedback, balancing the need to explore new content with the exploitation of known preferences.
- a Real-Time Traffic Light Control Task, aimed at optimizing traffic flow through intersections by adapting signal timings in response to real-time traffic conditions.
- a Personalized Healthcare Decision Support Task, where treatment protocols or patient scheduling need to be continuously optimized based on individual patient data, outcomes, and evolving best practices.
- an Adaptive Energy Management Task, such as optimizing the operation of HVAC systems in commercial buildings based on occupancy patterns and weather forecasts to reduce energy consumption and costs.
- …
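The k-Armed Bandit example above can be sketched in a few lines. A minimal simulation, assuming epsilon-greedy action selection and Bernoulli payouts (all names and parameter values here are illustrative):

```python
import random

def run_bandit(payout_probs, steps=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a k-armed Bernoulli bandit.

    Keeps a running mean reward estimate per arm; pulls a random arm
    with probability epsilon (explore), otherwise pulls the arm with
    the highest current estimate (exploit).
    """
    rng = random.Random(seed)
    k = len(payout_probs)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total += reward
    return estimates, total / steps

estimates, avg = run_bandit([0.2, 0.5, 0.8])
# After enough trials the estimates should rank arm 2 (true p = 0.8) highest.
```

This is the trial-and-error pattern in its purest form: the agent discovers the highest-payout machine only by sacrificing some pulls to exploration.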

**Counter-Example(s):**
- an Unsupervised Learning Task, where the focus is on discovering underlying patterns in data without explicit guidance or reward signals.
- an i.i.d. Learning Task, which assumes that the samples are independent and identically distributed, unlike the sequential and interdependent nature of decisions in reinforcement learning tasks.

**See:** Active Learning, Exploration/Exploitation Tradeoff, Operant Conditioning, Behaviorism, Software Agent, Game Theory, Control Theory, Operations Research, Information Theory, Simulation-Based Optimization, Genetic Algorithm, Optimal Control Theory, Bounded Rationality, Motivated Learning Theory.

## References

### 2017

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/reinforcement_learning Retrieved:2017-2-27.
**Reinforcement learning** is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take *actions* in an *environment* so as to maximize some notion of cumulative *reward*. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called *approximate dynamic programming*. The problem has been studied in the theory of optimal control, though most studies are concerned with the existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP) as many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

### 2017b

- https://frnsys.com/ai_notes/artificial_intelligence/reinforcement_learning.html
- QUOTE: ... With active reinforcement learning, the agent is actively trying new things rather than following a fixed policy. The fundamental trade-off in active reinforcement learning is exploitation vs exploration. When you land on a decent strategy, do you just stick with it? What if there's a better strategy out there? How do you balance using your current best strategy and searching for an even better one? ...
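One common answer to the balancing question posed in the quote, offered here only as an illustration and not drawn from the quoted source, is to anneal the exploration rate over time: explore widely while estimates are poor, then increasingly stick with the current best strategy.

```python
def epsilon_schedule(step, start=1.0, end=0.01, decay_steps=10000):
    """Linearly anneal exploration probability from start down to end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(epsilon_schedule(0))       # 1.0  — purely exploratory at the start
print(epsilon_schedule(10000))   # ≈ 0.01 — mostly exploiting by the end
```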

### 2016

- http://incompleteideas.net/RL-FAQ.html
- QUOTE: Reinforcement learning (RL) is learning from interaction with an environment, from the consequences of action, rather than from explicit teaching. …
… Modern reinforcement learning concerns both trial-and-error learning without a model of the environment, and deliberative planning with a model. By "a model" here we mean a model of the dynamics of the environment. In the simplest case, this means just an estimate of the state-transition probabilities and expected immediate rewards of the environment. In general it means any predictions about the environment's future behavior conditional on the agent's behavior.
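The "simplest case" of a model described in the quote, an estimate of state-transition probabilities and expected immediate rewards, can be built directly from experience. A minimal sketch (the function name and `(state, action, next_state, reward)` tuple layout are assumptions for illustration):

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate transition probabilities and expected immediate rewards
    from observed (state, action, next_state, reward) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    totals = defaultdict(int)
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a)] += r
        totals[(s, a)] += 1
    probs = {sa: {s2: n / totals[sa] for s2, n in nexts.items()}
             for sa, nexts in counts.items()}
    rewards = {sa: reward_sums[sa] / totals[sa] for sa in totals}
    return probs, rewards

# Three observations of taking action "a" in state "s":
probs, rewards = estimate_model([("s", "a", "t", 1.0),
                                 ("s", "a", "t", 1.0),
                                 ("s", "a", "u", 0.0)])
# probs[("s", "a")] -> {"t": 2/3, "u": 1/3}; rewards[("s", "a")] -> 2/3
```

Such an estimated model is what a deliberative planner would use in place of the true environment dynamics.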


### 2015

- (Mnih et al., 2015) ⇒ Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. (2015). “Human-level Control through Deep Reinforcement Learning.” In: Nature, 518(7540).
- QUOTE: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.

### 2013

- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/reinforcement_learning Retrieved:2013-12-4.
**Reinforcement learning** is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take *actions* in an *environment* so as to maximize some notion of cumulative *reward*. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called *approximate dynamic programming*. The problem has been studied in the theory of optimal control, though most studies there are concerned with existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are highly related to dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.
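The model-free setting described here is often illustrated with tabular Q-learning, which learns action values from interaction alone, with no access to the MDP's transition probabilities. A minimal sketch on a hypothetical two-state chain environment (the environment, names, and parameter values are all illustrative):

```python
import random
from collections import defaultdict

def q_learning(step_fn, actions, episodes=2000, alpha=0.1, gamma=0.9,
               epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(actions)                        # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step_fn(state, action, rng)
            target = reward if done else reward + gamma * max(
                q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q

# Hypothetical chain: "right" from state 0 reaches state 1; "right" from
# state 1 ends the episode with reward 1; "left" always ends with reward 0.
def step(state, action, rng):
    if action == "left":
        return state, 0.0, True
    if state == 0:
        return 1, 0.0, False
    return state, 1.0, True

q = q_learning(step, ["left", "right"])
# The learned values should prefer "right" in both states, even though the
# reward for "right" in state 0 is delayed by one step.
```

Note how the delayed reward propagates backwards through the bootstrapped `target`, which is exactly the credit-assignment mechanism discussed in the Context section above.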

### 2011

- (Peter Stone, 2011b) ⇒ Peter Stone. (2011). “Reinforcement Learning.” In: (Sammut & Webb, 2011) p.849
- QUOTE: Reinforcement Learning describes a large class of learning problems characteristic of autonomous agents interacting in an environment: sequential decision-making problems with delayed reward. Reinforcement learning algorithms seek to learn a policy (mapping from states to actions) that maximizes the reward received over time.
Unlike in supervised learning problems, in reinforcement-learning problems, there are no labeled examples of correct and incorrect behavior. However, unlike unsupervised learning problems, a reward signal can be perceived. ...

### 2010

- (Szepesvari, 2010) ⇒ Csaba Szepesvari. (2010). “Algorithms for Reinforcement Learning.” Morgan and Claypool Publishers. ISBN:1608454924, 9781608454921 doi:10.2200/S00268ED1V01Y201005AIM009
- QUOTE: Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective.