2015 OnMarkovDecisionProcesses

From GM-RKB

Subject Headings: Markov Decision Process

Notes

Cited By

Quotes

Abstract

Sequential planning under uncertainty is a basic optimization problem that arises in many different settings, ranging from artificial intelligence to operations research. In a generic system, we have an agent who chooses among different actions and then receives a reward, after which the system evolves stochastically. Usually the aim is to maximize the expected (discounted) reward of the system over a finite time horizon or, in certain cases described below, an infinite one.

To obtain a tractable problem, it is often assumed that the transition law of the underlying state process is Markovian, i.e., that only the current state has an influence on future states. Such a situation leads to a Markov decision process (MDP); textbooks on MDPs include [1, 3, 5, 7]. MDPs differ from general stochastic control problems in that the actions are taken at discrete time points, rather than continuously. Stochastic shortest-path problems, consumption and investment of money, allocation of resources, production planning, and harvesting problems are a few examples of MDPs.

The formulation and development of MDPs started in the 1950s with Shapley, Bellman, and, later, Howard, Dubins, Savage, and Blackwell. The early achievements are closely related to the study of stochastic dynamic games. Subtle mathematical problems in the theory include measurability issues with arbitrary Borel state spaces, which naturally arise in, for example, partially observable Markov decision processes. However, for a problem with enough structure, the solution method is a simple backward induction, which works as follows:

Suppose that the decision time points are numbered [math]\displaystyle{ n=0, 1, \ldots, N }[/math], the one-stage rewards are [math]\displaystyle{ r_n }[/math], the terminal reward is [math]\displaystyle{ r_N }[/math], the transition kernel of the state process is [math]\displaystyle{ Q_n (\mathrm{d}y|x, a) }[/math], and the set of admissible actions at stage [math]\displaystyle{ n }[/math] given state [math]\displaystyle{ x }[/math] is [math]\displaystyle{ D_n(x) }[/math]. The value functions [math]\displaystyle{ V_n(x) }[/math] that represent the maximal expected reward starting in state [math]\displaystyle{ x }[/math] from stage [math]\displaystyle{ n }[/math] up to the final stage [math]\displaystyle{ N }[/math] can then be recursively computed by the following optimality equation:

[math]\displaystyle{ V_N(x) =r_N(x). \ \ (1) }[/math]
[math]\displaystyle{ V_n(x) = \underset{a \in D_n(x)}{\operatorname{sup}} \lbrace r_n(x,a) + \int V_{n+1}(y)\, Q_n (\mathrm{d}y|x, a) \rbrace. \ \ (2) }[/math]

When we finally reach [math]\displaystyle{ V_0 }[/math] we have computed the maximal expected reward of the system up to the time horizon [math]\displaystyle{ N }[/math]; the maximizers in this recursion yield the optimal strategy. This algorithm is very straightforward; the real challenge in applying the theory comes from the “curse of dimensionality.” Going forward in time, the number of states that can be reached often increases dramatically with the number of admissible actions. For large [math]\displaystyle{ n }[/math], the number of optimization problems to be solved is vast. If, for example, every state has just two possible successors, then at stage [math]\displaystyle{ n }[/math] there are already [math]\displaystyle{ 2^n }[/math] different states for which we have to solve (2). This is not feasible for most interesting applications.
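The following is a minimal sketch of this backward induction for a toy finite-horizon MDP; the states, actions, rewards, and transition probabilities are illustrative placeholders rather than data from the article:

```python
# Backward induction (equations (1)-(2)) for a small finite-horizon MDP.
# States, actions, rewards, and transition probabilities are illustrative
# placeholders, not data from the article.

N = 3                          # decision time points n = 0, ..., N
states = [0, 1]                # finite state space
actions = [0, 1]               # D_n(x) = all actions, for simplicity


def reward(n, x, a):
    """One-stage reward r_n(x, a) (placeholder)."""
    return float(x) - 0.1 * a


def terminal_reward(x):
    """Terminal reward r_N(x) (placeholder)."""
    return float(x)


def transition(n, x, a):
    """Transition kernel Q_n(. | x, a) as a dict {next state: probability}."""
    p = 0.7 if a == 1 else 0.4           # probability of moving to state 1
    return {1: p, 0: 1.0 - p}


# V[n][x] holds the value function V_n(x); policy[n][x] holds a maximizer.
V = [{x: 0.0 for x in states} for _ in range(N + 1)]
policy = [{x: None for x in states} for _ in range(N)]

for x in states:                         # equation (1): V_N = r_N
    V[N][x] = terminal_reward(x)

for n in reversed(range(N)):             # equation (2): n = N-1, ..., 0
    for x in states:
        values = {a: reward(n, x, a)
                     + sum(p * V[n + 1][y] for y, p in transition(n, x, a).items())
                  for a in actions}
        best = max(values, key=values.get)
        V[n][x], policy[n][x] = values[best], best

print("V_0:", V[0])
print("optimal actions at stage 0:", policy[0])
```

With two successor states per action, the toy example stays tractable; the point of the sketch is only the structure of the recursion, which scales poorly once the reachable state space grows exponentially.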

But we do have hope of being able to solve such problems. If we have a stationary model and a long time horizon, one trick is to approximate the problem by one with an infinite time horizon. With an infinite time horizon, we are left with a fixed-point problem to solve; various algorithms are available for such problems, including policy iteration and linear programming.
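As an illustration of the stationary infinite-horizon case, here is a minimal policy iteration sketch for a small discounted MDP; the transition matrices, rewards, and discount factor are illustrative assumptions, not taken from the article:

```python
# Policy iteration for a small stationary, discounted infinite-horizon MDP.
# Transition probabilities, rewards, and the discount factor are placeholders.
import numpy as np

gamma = 0.9                       # discount factor
n_states, n_actions = 3, 2

# P[a][x, y] = probability of moving from state x to y under action a.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]],   # action 1
])
# r[x, a] = one-stage reward (placeholder).
r = np.array([[1.0, 0.5], [0.0, 1.5], [2.0, 0.0]])

policy = np.zeros(n_states, dtype=int)          # initial stationary policy
while True:
    # Policy evaluation: solve the linear system (I - gamma * P_pi) V = r_pi.
    P_pi = P[policy, np.arange(n_states), :]    # transition rows chosen by the policy
    r_pi = r[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

    # Policy improvement: act greedily with respect to the current V.
    Q = r + gamma * np.einsum("axy,y->xa", P, V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):      # fixed point reached
        break
    policy = new_policy

print("optimal value function:", V)
print("optimal stationary policy:", policy)
```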

Many problems are not stationary, however. In such cases we need to use tools from approximate dynamic programming (see, for example, [6]). Loosely speaking, the solution methods combine backward induction with a forward simulation of states. The idea is to improve a given approximation of the value functions along a simulated path of the state process.
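A rough sketch of this idea, in the style of regression Monte Carlo (backward induction over forward-simulated states, with the value function replaced by a fitted polynomial); the dynamics, rewards, and cubic basis are assumptions made for illustration, not the specific method of [6]:

```python
# Approximate dynamic programming sketch: backward induction combined with
# forward simulation, with V_n approximated by a fitted cubic polynomial.
# Dynamics, rewards, and the basis choice are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, n_paths = 10, 5000                    # horizon and number of simulated paths
actions = [0.0, 1.0]                     # two admissible actions at every stage


def simulate_next(x, a, rng):
    """One forward step of the (assumed) controlled state process."""
    return 0.9 * x + 0.1 * a + 0.2 * rng.standard_normal(x.shape)


def reward(x, a):
    """One-stage reward r(x, a) (placeholder)."""
    return x * a - 0.05 * a


# Forward simulation: paths[n] holds the simulated states at stage n,
# generated under randomly chosen exploratory actions.
paths = [rng.standard_normal(n_paths)]   # sampled initial states
for n in range(N):
    a = rng.choice(actions, size=n_paths)
    paths.append(simulate_next(paths[n], a, rng))

# Backward induction on the simulated states, with V_{n+1} replaced by the
# previously fitted polynomial in the state variable.
coeffs = np.zeros(4)                     # V_N = 0 (placeholder terminal reward)
for n in reversed(range(N)):
    x = paths[n]
    continuation = []
    for a in actions:
        y = simulate_next(x, a, rng)     # resampled successor states
        continuation.append(reward(x, a) + np.polyval(coeffs, y))
    targets = np.maximum.reduce(continuation)   # pathwise max over actions
    coeffs = np.polyfit(x, targets, deg=3)      # refit the approximation of V_n

print("coefficients of the fitted stage-0 value function:", coeffs)
```

The sketch trades the exhaustive state enumeration of exact backward induction for a regression over simulated states, which is the basic mechanism behind the approximate methods mentioned above.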

---

In the remainder of this article, we look at a specific application: valuation of a gas storage facility. A storage facility, which is often a depleted reservoir in an oil or gas field or a salt cavern, can be used not only to balance supply and demand but also to create profit through active storage management on a mark-to-market basis. Contracts for natural gas storage essentially represent real options on natural gas prices.

To find a fair price for storage, we first need a stochastic model for the gas price process. Most gas price models are formulated in continuous time. One of the first is the Schwartz model (see [8]), in which the log-price of gas follows a mean-reverting process.
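In the one-factor version of this model, the log-price [math]\displaystyle{ X_t = \ln S_t }[/math] solves an Ornstein–Uhlenbeck equation of the form [math]\displaystyle{ \mathrm{d}X_t = \kappa(\alpha - X_t)\,\mathrm{d}t + \sigma\, \mathrm{d}W_t }[/math], where the parameter names here are generic rather than the article's notation. A minimal Euler simulation sketch, with purely illustrative parameter values:

```python
# Simulation sketch of a one-factor Schwartz-type gas price model:
# the log-price X_t = ln S_t follows a mean-reverting Ornstein-Uhlenbeck
# process dX_t = kappa * (alpha - X_t) dt + sigma * dW_t.
# All parameter values below are illustrative, not calibrated to market data.
import numpy as np

rng = np.random.default_rng(1)
kappa, alpha, sigma = 2.0, np.log(20.0), 0.5   # mean reversion, long-run log level, volatility
T, n_steps, n_paths = 1.0, 365, 1000           # one year of daily steps
dt = T / n_steps

X = np.full(n_paths, np.log(20.0))             # start at a spot price of 20
for _ in range(n_steps):
    X += kappa * (alpha - X) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

S = np.exp(X)                                  # simulated gas spot prices at time T
print("mean terminal price:", S.mean())
```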

References

  1. N. Bäuerle and U. Rieder, Markov Decision Processes with Applications to Finance, Springer, Berlin, 2011.
  2. N. Bäuerle and V. Riess, Gas storage valuation with regime switching, arXiv:1412.1298, 2014.
  3. D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 2, Athena Scientific, Belmont, MA, 2012.
  4. A. Boogert and C. de Jong, Gas storage valuation using a multifactor price process, J. Energy Markets, 4 (2011), 29–52.
  5. O. Hernández-Lerma and J.B. Lasserre, Discrete-time Markov Control Processes, Springer, Berlin, 1996.
  6. W.B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley & Sons, Hoboken, NJ, 2011.
  7. M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Hoboken, NJ, 2009.
  8. E. Schwartz, The stochastic behavior of commodity prices: Implications for valuation and hedging, J. Finance, 52 (1997), 923–973.
  9. N. Secomandi, Optimal commodity trading with a capacitated storage asset, Management Science, 56 (2010), 449–467.


Viola Riess, Nicole Bäuerle (2015). "On Markov Decision Processes".