# Reinforcement Learning (RL) Algorithm

A Reinforcement Learning (RL) Algorithm is an online learning algorithm that can be implemented by a reinforcement learning system to solve an online reward-maximization task (i.e., to maximize a cumulative reward metric).

**AKA:** Online Reward-Maximization Algorithm.

**Context:**
- It can (typically) have an RL Reward Function.
- ...
- It can range from being a Model-Free Reinforcement Learning Algorithm to being a Model-Based Reinforcement Learning Algorithm.
- ...
- It can be implemented by a Reinforcement Learning System (to solve a reinforcement learning task).
- …

**Example(s):**
- a k-Armed Bandit Algorithm,
- an Associative Reinforcement Learning Algorithm,
- an Average-Reward Reinforcement Learning Algorithm,
- a Bayesian Reinforcement Learning Algorithm,
- a Deep Reinforcement Learning Algorithm,
- a Gaussian Process Reinforcement Learning Algorithm,
- a Hierarchical Reinforcement Learning Algorithm,
- an Instance-Based Reinforcement Learning Algorithm,
- a Least Squares Reinforcement Learning Algorithm,
- a One-Step Reinforcement Learning Algorithm,
- a Q-Learning Algorithm,
- a Relational Reinforcement Learning Algorithm,
- a Reinforcement Neural Learning Algorithm, such as a Deep Reinforcement Learning Algorithm,
- a Reinforcement Learning from Human Feedback (RLHF) Algorithm.
- …

**Counter-Example(s):**
- a Supervised Learning Algorithm,
- an Unsupervised Learning Algorithm.

**See:** Serial Decision Task; Average-Reward Reinforcement Learning; Efficient Exploration in Reinforcement Learning; Gaussian Process Reinforcement Learning; Inverse Reinforcement Learning; Policy Gradient Methods; Reward Shaping; Symbolic Dynamic Programming; Temporal Difference Learning; Value Function Approximation.

## References

### 2024

- (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Reinforcement_learning Retrieved:2024-4-10.
**Reinforcement learning** (**RL**) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge), with the goal of maximizing the long-term reward, whose feedback might be incomplete or delayed.

The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques. The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the Markov decision process and they target large Markov decision processes where exact methods become infeasible.
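The contrast drawn above can be made concrete with classical dynamic programming, which does assume an exact model of the MDP. Below is a minimal value-iteration sketch over a hypothetical two-state, two-action MDP; all transition probabilities and rewards are illustrative, not taken from the quoted text:

```python
# Hypothetical MDP: states 0-1, actions 0-1.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor

def value_iteration(P, gamma, tol=1e-8):
    """Sweep the Bellman optimality backup until the value function stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration(P, GAMMA)
```

Because the backup uses the known transition model `P` directly, this is exact dynamic programming; model-free RL algorithms instead estimate the same quantities from sampled interaction.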


### 2019a

- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Reinforcement_learning Retrieved:2019-5-12.
**Reinforcement learning** (**RL**) is an area of machine learning concerned with how software agents ought to take *actions* in an *environment* so as to maximize some notion of cumulative *reward*. Reinforcement learning is considered one of three machine learning paradigms, alongside supervised learning and unsupervised learning.

It differs from supervised learning in that labelled input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). ...
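The exploration/exploitation balance mentioned above is commonly handled with an ε-greedy rule. A minimal sketch follows; the function names and the incremental sample-average update are illustrative choices, not anything specified in the quoted text:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon explore a random arm; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def update(q_values, counts, arm, reward):
    """Incremental sample-average update of one arm's value estimate."""
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]
```

With `epsilon=0.0` the rule always exploits; raising ε trades more exploration for slower exploitation of current knowledge.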

### 2019b

- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Q-learning Retrieved:2019-5-12.
**Q-learning** is a model-free reinforcement learning algorithm. The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances. It does not require a model (hence the connotation "model-free") of the environment, and it can handle problems with stochastic transitions and rewards, without requiring adaptations.

For any finite Markov decision process (FMDP), *Q*-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state.^{[1]} *Q*-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy. "Q" names the function that returns the reward used to provide the reinforcement and can be said to stand for the "quality" of an action taken in a given state.^{[2]}

- ↑ Melo, Francisco S. "Convergence of Q-learning: a simple proof" (PDF).
- ↑ Matiisen, Tambet (December 19, 2015). "Demystifying Deep Reinforcement Learning". neuro.cs.ut.ee. Computational Neuroscience Lab. Retrieved 2018-04-06.
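The tabular update behind the description above can be sketched as follows. The chain environment is a made-up toy (states 0–3, with a reward of 1 for reaching state 3), not anything from the quoted article:

```python
import random

# Toy deterministic chain MDP, purely illustrative:
# actions 0 = left, 1 = right; reaching state GOAL ends the episode.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == GOAL else 0.0
    return s2, reward, s2 == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:            # explore
                a = rng.randrange(N_ACTIONS)
            else:                                 # exploit
                a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # Model-free update: bootstrap from the best next-state value.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
```

Note that no transition model appears in the update; the agent learns `Q` purely from sampled `(s, a, r, s')` tuples, which is what "model-free" means here.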

### 2017

- (Stone, 2017) ⇒ Peter Stone (2017). Reinforcement Learning. In: Sammut & Webb (2017).
- QUOTE: Reinforcement learning describes a large class of learning problems characteristic of autonomous agents interacting in an environment: sequential decision-making problems with delayed reward. Reinforcement-learning algorithms seek to learn a policy (mapping from states to actions) that maximizes the reward received over time.
Unlike in supervised learning problems, in reinforcement-learning problems, there are no labeled examples of correct and incorrect behavior. However, unlike unsupervised learning problems, a reward signal can be perceived.

Many different algorithms for solving reinforcement-learning problems are covered in other entries. This entry provides just a brief high-level classification of the algorithms. Perhaps the most well-known approach to solving reinforcement-learning problems, as covered in detail by Sutton and Barto (1998), is based on learning a value function, which represents the long-term expected reward of each state the agent may encounter, given a particular policy.


### 2016

- (Krakovsky, 2016) ⇒ Marina Krakovsky. (2016). “Reinforcement Renaissance.” In: Communications of the ACM Journal, 59(8). doi:10.1145/2949662
- QUOTE: The two types of learning — reinforcement learning and deep learning through deep neural networks — complement each other beautifully, says Sutton. "Deep learning is the greatest thing since sliced bread, but it quickly becomes limited by the data," he explains. "If we can use reinforcement learning to automatically generate data, even if the data is more weakly labeled than having humans go in and label everything, there can be much more of it because we can generate it automatically, so these two together really fit well." Despite the buzz around DeepMind, combining reinforcement learning with neural networks is not new. TD-Gammon, a backgammon-playing program developed by IBM's Gerald Tesauro in 1992, was a neural network that learned to play backgammon through reinforcement learning (the TD in the name stands for Temporal-Difference learning, still a dominant algorithm in reinforcement learning). "Back then, computers were 10,000 times slower per dollar, which meant you couldn't have very deep networks because those are harder to train ..." "Deep reinforcement learning is just a buzzword for traditional reinforcement learning combined with deeper neural networks," he says.

### 1998

- (Sutton & Barto, 1998) ⇒ Richard S. Sutton, and Andrew G. Barto. (1998). “Reinforcement Learning: An introduction." MIT Press. ISBN:0262193981
- BOOK OVERVIEW: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment....

### 1997

- (Mitchell, 1997) ⇒ Tom M. Mitchell. (1997). “Machine Learning." McGraw-Hill.

### 1996

- (Kaelbling et al., 1996) ⇒ L. P. Kaelbling, M. L. Littman, and A. W. Moore. (1996). “Reinforcement Learning: A Survey.” In: Journal of Artificial Intelligence Research, Vol. 4, 237-285.
- ABSTRACT: This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word “reinforcement.”
*The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.*

### 1990

- (Sutton, 1990) ⇒ Richard S. Sutton. (1990). “Integrated Architecture for Learning, Planning, and Reacting based on Approximating Dynamic Programming.” In: Proceedings of the Seventh International Conference on Machine Learning (1990). ISBN:1-55860-141-4
- QUOTE: This paper extends previous work with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned model of the world.