# Partially Observable Markov Decision Process (POMDP)

## References

### 2017a

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process Retrieved:2017-6-14.
• A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the operations research community, and was later adapted by the artificial intelligence and automated planning communities.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.
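The belief maintenance described in this excerpt can be sketched as a Bayes-filter update, $b'(s') \propto \Omega(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)$. The following is a minimal sketch, not a definitive implementation: the function name `update_belief`, the dictionary layout, the two-state toy model, and the 0.85 sensor accuracy are all assumptions made for illustration and do not come from the quoted source.

```python
def update_belief(belief, action, observation, T, Omega):
    """Return the posterior belief after taking `action` and seeing `observation`.

    belief: dict mapping state -> probability
    T[(s, a)]: dict mapping next state -> P(s' | s, a)
    Omega[(s_next, a)]: dict mapping observation -> P(o | s', a)
    """
    unnormalized = {}
    for s_next in belief:
        # Prediction step: probability of landing in s_next under the old belief.
        predicted = sum(
            T[(s, action)].get(s_next, 0.0) * p for s, p in belief.items()
        )
        # Correction step: weight by how likely the observation is in s_next.
        unnormalized[s_next] = Omega[(s_next, action)].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical toy model: the hidden state never changes, and the "listen"
# action reports the true state with probability 0.85 (an assumed value).
T = {
    ("left", "listen"): {"left": 1.0},
    ("right", "listen"): {"right": 1.0},
}
Omega = {
    ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
    ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
}

prior = {"left": 0.5, "right": 0.5}
posterior = update_belief(prior, "listen", "hear_left", T, Omega)
# posterior["left"] is 0.85 up to floating-point rounding.
```

Starting from a uniform prior, a single "hear_left" observation shifts the belief toward "left" in proportion to the assumed sensor accuracy; repeated consistent observations sharpen it further.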

### 2012a

• (Wikipedia, 2012) ⇒ http://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process
• A Partially Observable Markov Decision Process (POMDP) is a generalization of a Markov Decision Process. A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the Operations Research community, and was later taken over by the Artificial Intelligence and Automated Planning communities.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.

### 2012b

• (Wikipedia, 2012) ⇒ http://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process#Formal_definition
• A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a tuple $(S,A,O,T,\Omega,R)$, where
• $S$ is a set of states,
• $A$ is a set of actions,
• $O$ is a set of observations,
• $T$ is a set of conditional transition probabilities,
• $\Omega$ is a set of conditional observation probabilities,
• $R: S \times A \to \mathbb{R}$ is the reward function.
• At each time period, the environment is in some state $s \in S$. The agent takes an action $a \in A$, which causes the environment to transition to state $s'$ with probability $T(s'\mid s,a)$. Finally, the agent receives a reward with expected value, say $r$, and the process repeats.
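The tuple $(S,A,O,T,\Omega,R)$ and the per-step loop can be sketched as a small Python data structure. This is an illustrative sketch under stated assumptions: the class name `POMDP`, the helpers `sample` and `step`, and the toy model values are hypothetical. The observation step, which samples $o$ from $\Omega$ given the new state $s'$ and the action $a$, follows a common convention and is an assumption here, since the excerpt does not spell it out.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class POMDP:
    states: List[str]                               # S
    actions: List[str]                              # A
    observations: List[str]                         # O
    T: Dict[Tuple[str, str], Dict[str, float]]      # T[(s, a)][s'] = P(s' | s, a)
    Omega: Dict[Tuple[str, str], Dict[str, float]]  # Omega[(s', a)][o] = P(o | s', a) (assumed convention)
    R: Callable[[str, str], float]                  # R(s, a)


def sample(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    weights = [dist[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(model: POMDP, state: str, action: str, rng: random.Random):
    """Simulate one time period: transition, emit an observation, pay a reward."""
    next_state = sample(model.T[(state, action)], rng)
    observation = sample(model.Omega[(next_state, action)], rng)
    reward = model.R(state, action)
    return next_state, observation, reward


# Hypothetical two-state model; all numbers are illustrative assumptions.
toy = POMDP(
    states=["left", "right"],
    actions=["listen"],
    observations=["hear_left", "hear_right"],
    T={
        ("left", "listen"): {"left": 1.0},
        ("right", "listen"): {"right": 1.0},
    },
    Omega={
        ("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
        ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85},
    },
    R=lambda s, a: -1.0,
)

rng = random.Random(0)
next_state, observation, reward = step(toy, "left", "listen", rng)
```

The agent only ever sees `observation` and `reward`; `next_state` stays hidden inside the simulator, which is what distinguishes this loop from a fully observable MDP.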

### 2004

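Markov models, classified by whether we have control over the state transitions and whether the states are completely observable:

| Are the states completely observable? | Control over the state transitions: NO | Control over the state transitions: YES |
|---|---|---|
| YES | | MDP (Markov Decision Process) |
| NO | HMM (Hidden Markov Model) | POMDP (Partially Observable Markov Decision Process) |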