# Q-Learning Algorithm

A Q-Learning Algorithm is a model-free reinforcement learning algorithm that searches for an optimal action-selection policy for any given finite Markov decision process.

## References

### 2016

- (Wikipedia, 2016) ⇒ http://wikipedia.org/wiki/Q-learning Retrieved:2016-3-31.
- QUOTE: **Q-learning** is a model-free reinforcement learning technique. Specifically, *Q*-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of *Q*-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, *Q*-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations. It has been proven that for any finite MDP, *Q*-learning eventually finds an optimal policy, in the sense that the expected value of the total reward return over all successive steps, starting from the current state, is the maximum achievable.
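
The two ideas in this passage, learning an action-value table and then reading a policy off it, can be illustrated with a minimal sketch. The function names `q_update` and `greedy_policy`, the state and action labels, and the values of the step size `alpha` and discount `gamma` are all assumptions made for illustration; they do not come from the quoted source, and terminal-state handling is left out for brevity.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update to q in place and return the new value."""
    # Value of the best action available in the next state (no environment model needed).
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

def greedy_policy(q, states, actions):
    """Select the highest-valued action in each state (ties go to the first action listed)."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

# Hypothetical two-state example: one observed transition s0 --right--> s1 with reward 1.0.
actions = ["left", "right"]
q = defaultdict(float)          # every unseen (state, action) pair starts at 0.0
q_update(q, "s0", "right", 1.0, "s1", actions)
policy = greedy_policy(q, ["s0", "s1"], actions)
```

After this single update the table prefers `right` in `s0`, since that is the only entry with a non-zero value. In practice the update would be applied to many sampled transitions before the greedy policy is extracted.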

### 2011

- (Peter Stone, 2011a) ⇒ Peter Stone. (2011). “Q-Learning.” In: (Sammut & Webb, 2011) p.819
- QUOTE: Q-learning is a form of temporal difference learning. As such, it is a model-free reinforcement learning method combining elements of dynamic programming with Monte Carlo estimation. Due in part to Watkins’ (1989) proof that it converges to the optimal value function, Q-learning is among the most commonly used and well-known reinforcement learning algorithms.
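
For reference, the temporal-difference form of the update is commonly written as below. The notation is an assumption borrowed from standard textbook presentations rather than from the quoted entry: $\alpha$ is a step size, $\gamma$ a discount factor, and $r_{t+1}$ the reward observed after taking action $a_t$ in state $s_t$.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Bigl[ \underbrace{r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)}_{\text{TD target}} - Q(s_t, a_t) \Bigr]
```

The bracketed quantity is the temporal-difference error. The target bootstraps from the current estimate at the next state, as in dynamic programming, while the transition itself is sampled from experience, as in Monte Carlo estimation.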

### 2001

- (Precup et al., 2001) ⇒ Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. (2001). “Off-policy Temporal-difference Learning with Function Approximation.” In: Proceedings of ICML-2001 (ICML-2001).
- QUOTE: … , called off-policy methods. Q-learning is an off-policy method in that it learns the optimal policy even when actions are selected according to a more exploratory or even random policy. Q-learning requires only that all …
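
The off-policy property can be illustrated with a small sketch in which actions are chosen uniformly at random while the update still bootstraps from the greedy (maximizing) action. The five-state chain environment, the `step` and `train` helpers, and all hyperparameters are hypothetical choices made for this illustration, not anything described in the cited paper.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)      # 0 = step left, 1 = step right

def step(state, action):
    """Deterministic chain dynamics: reward 1.0 only on entering the terminal state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning driven entirely by a uniformly random behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)                 # behavior policy: random
            nxt, reward, done = step(state, action)
            # Target policy: greedy with respect to the current table.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q = train()
# Greedy action in each non-terminal state, read off the learned table.
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Although the agent never acts greedily during training, the greedy policy read off the table moves right in every non-terminal state, which is the optimal behavior for this chain. The random policy suffices here because it keeps visiting every state-action pair.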

### 1992

- (Watkins & Dayan, 1992) ⇒ Christopher J. C. H. Watkins, and Peter Dayan. (1992). “Technical Note: [math]\cal{Q}[/math]-Learning.” In: Machine Learning Journal, 8(3-4). doi:10.1007/BF00992698
- ABSTRACT: [math]\cal{Q}[/math]-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states.

### 1989

- (Watkins, 1989) ⇒ Christopher Watkins. (1989). “Learning from Delayed Rewards.” PhD diss., King's College, Cambridge.