# Online Reward-Maximization Task

An Online Reward-Maximization Task is an online learning task with rewardable decisions (with a utility function and constraint rules).

**AKA:** Reinforcement (Trial-and-Error) Learning.

**Context:**

- It can be solved by a Reinforcement Learning System (that implements a reinforcement learning algorithm).
- It can (typically) require the optimization of Future Reward.
- It can (typically) involve a Credit Assignment Task (particularly for delayed rewards such as a multi-decision competitive game).
- It can (typically) require the learning of Behavior Policies.
- It can range from being Passive Reinforcement Learning, where the agent’s policy is fixed, to being Active Reinforcement Learning, where there is no fixed policy and the agent must decide what to do.
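
The link between future reward and credit assignment in the context items above can be illustrated with a discounted return. This is a minimal sketch, not a definition taken from the sources: the discount factor `gamma`, the function name `discounted_returns`, and the three-move episode are all assumptions made for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the reward that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-decision game whose only reward arrives at the final move.
episode_rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
print(returns)
```

Under this assumed scheme, earlier decisions receive a smaller share of the delayed reward than later ones, which is one simple way to assign credit across a multi-decision episode.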

**Example(s):**

**Counter-Example(s):**

**See:** Active Learning, Exploration/Exploitation Tradeoff, Operant Conditioning, Behaviorism, Software Agent, Game Theory, Control Theory, Operations Research, Information Theory, Simulation-Based Optimization, Genetic Algorithm, Optimal Control Theory, Bounded Rationality, Motivated Learning Theory.

## References

### 2017

- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/reinforcement_learning Retrieved:2017-2-27.
**Reinforcement learning** is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take *actions* in an *environment* so as to maximize some notion of cumulative *reward*. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called *approximate dynamic programming*. The problem has been studied in the theory of optimal control, though most studies are concerned with the existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP) as many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

### 2017b

- https://frnsys.com/ai_notes/artificial_intelligence/reinforcement_learning.html
- QUOTE: ... With active reinforcement learning, the agent is actively trying new things rather than following a fixed policy. The fundamental trade-off in active reinforcement learning is exploitation vs exploration. When you land on a decent strategy, do you just stick with it? What if there's a better strategy out there? How do you balance using your current best strategy and searching for an even better one? ...
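
One common way to balance the two sides of this trade-off is an epsilon-greedy rule on a multi-armed bandit. The sketch below is illustrative only: the function name `epsilon_greedy_bandit`, the Bernoulli arms, and the values of `epsilon`, `steps`, and `seed` are assumptions, and epsilon-greedy is just one of several strategies, not the method of the quoted source.

```python
import random
from typing import List, Tuple


def epsilon_greedy_bandit(arm_means: List[float], steps: int = 2000,
                          epsilon: float = 0.1,
                          seed: int = 0) -> Tuple[List[float], List[int]]:
    """Play a Bernoulli bandit, returning per-arm value estimates and pull counts."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an arm at random, even if it looks worse.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: stick with the current best-looking arm.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical arms with success probabilities unknown to the agent.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, counts)
```

A larger `epsilon` spends more pulls searching for a better strategy, while a smaller one commits earlier to the current best estimate.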

### 2016

- http://incompleteideas.net/RL-FAQ.html
- QUOTE: Reinforcement learning (RL) is learning from interaction with an environment, from the consequences of action, rather than from explicit teaching. …
… Modern reinforcement learning concerns both trial-and-error learning without a model of the environment, and deliberative planning with a model. By "a model" here we mean a model of the dynamics of the environment. In the simplest case, this means just an estimate of the state-transition probabilities and expected immediate rewards of the environment. In general it means any predictions about the environment's future behavior conditional on the agent's behavior.
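
The "simplest case" of a model described above, an estimate of state-transition probabilities and expected immediate rewards, can be sketched by counting observed transitions. This is a minimal illustration under assumptions: the function name `estimate_model`, the tuple layout, and the toy log of experience are hypothetical and do not come from the FAQ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One observed step: (state, action, reward, next_state).
Transition = Tuple[str, str, float, str]


def estimate_model(transitions: List[Transition]):
    """Estimate P(next_state | state, action) and E[reward | state, action] by counting."""
    next_counts: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    reward_sums: Dict[Tuple[str, str], float] = defaultdict(float)
    for state, action, reward, next_state in transitions:
        next_counts[(state, action)][next_state] += 1
        reward_sums[(state, action)] += reward

    probs = {}
    expected_rewards = {}
    for key, nexts in next_counts.items():
        total = sum(nexts.values())
        # Relative frequencies serve as the transition-probability estimate.
        probs[key] = {s2: n / total for s2, n in nexts.items()}
        expected_rewards[key] = reward_sums[key] / total
    return probs, expected_rewards


# Hypothetical experience gathered while interacting with an environment.
log = [
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 0.0, "s1"),
    ("s0", "go", 1.0, "s2"),
    ("s0", "go", 0.0, "s1"),
]
probs, expected_rewards = estimate_model(log)
print(probs[("s0", "go")], expected_rewards[("s0", "go")])
```

An agent that plans deliberatively could query such estimates instead of the real environment; a model-free agent would skip this step and learn from the rewards directly.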

### 2015

- (Mnih et al., 2015) ⇒ Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. (2015). “Human-level Control through Deep Reinforcement Learning.” In: Nature, 518(7540).
- QUOTE: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.

### 2013

- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/reinforcement_learning Retrieved:2013-12-4.
**Reinforcement learning** is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take *actions* in an *environment* so as to maximize some notion of cumulative *reward*. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called *approximate dynamic programming*. The problem has been studied in the theory of optimal control, though most studies there are concerned with existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are highly related to dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.
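
The claim that reinforcement learning algorithms need no prior knowledge of the MDP can be illustrated with tabular Q-learning, which learns only from sampled transitions. The sketch below is an assumption-laden example, not the article's method: the five-state chain environment, the `step` function, and the values of `alpha`, `gamma`, `epsilon`, and `seed` are all hypothetical choices.

```python
import random
from typing import List, Tuple

N_STATES = 5          # states 0..4; state 4 is the terminal goal (assumed toy chain)
LEFT, RIGHT = 0, 1


def step(state: int, action: int) -> Tuple[int, float]:
    """Environment dynamics; the learner only ever sees sampled outcomes."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
               epsilon: float = 0.2, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(1000):  # safety cap on episode length
            if rng.random() < epsilon:
                action = rng.choice((LEFT, RIGHT))          # explore
            else:
                best = max(q[state])                        # exploit, random tie-break
                action = rng.choice([a for a in (LEFT, RIGHT) if q[state][a] == best])
            next_state, reward = step(state, action)
            # Move Q(s, a) toward the sampled one-step target.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q


q_table = q_learning()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

The update never reads transition probabilities or a reward table; it uses only the `(state, action, reward, next_state)` samples returned by `step`, in contrast with classical dynamic programming, which would require the full MDP.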

### 2011

- (Peter Stone, 2011b) ⇒ Peter Stone. (2011). “Reinforcement Learning.” In: (Sammut & Webb, 2011) p.849
- QUOTE: Reinforcement Learning describes a large class of learning problems characteristic of autonomous agents interacting in an environment: sequential decision-making problems with delayed reward. Reinforcement learning algorithms seek to learn a policy (mapping from states to actions) that maximize the reward received over time.
Unlike in supervised learning problems, in reinforcement-learning problems, there are no labeled examples of correct and incorrect behavior. However, unlike unsupervised learning problems, a reward signal can be perceived. ...

### 2010

- (Szepesvari, 2010) ⇒ Csaba Szepesvari. (2010). “Algorithms for Reinforcement Learning.” Morgan and Claypool Publishers. ISBN:1608454924, 9781608454921 doi:10.2200/S00268ED1V01Y201005AIM009
- QUOTE: Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective.