# ddp

Module for solving dynamic programs (also known as Markov decision processes) with finite states and actions.

## Discrete Dynamic Programming

A discrete dynamic program consists of the following components:

• finite set of states $$S = \{0, \ldots, n-1\}$$;
• finite set of available actions $$A(s)$$ for each state $$s \in S$$ with $$A = \bigcup_{s \in S} A(s) = \{0, \ldots, m-1\}$$, where $$\mathit{SA} = \{(s, a) \in S \times A \mid a \in A(s)\}$$ is the set of feasible state-action pairs;
• reward function $$r\colon \mathit{SA} \to \mathbb{R}$$, where $$r(s, a)$$ is the reward when the current state is $$s$$ and the action chosen is $$a$$;
• transition probability function $$q\colon \mathit{SA} \to \Delta(S)$$, where $$q(s'|s, a)$$ is the probability that the state in the next period is $$s'$$ when the current state is $$s$$ and the action chosen is $$a$$; and
• discount factor $$0 \leq \beta < 1$$.
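To make the setup concrete, the primitives can be stored as arrays. The following minimal sketch uses made-up numbers and assumes every action is feasible in every state, so that $$\mathit{SA} = S \times A$$; rewards go in an $$n \times m$$ array and transition probabilities in an $$n \times m \times n$$ array:

```python
import numpy as np

# Hypothetical example: n = 2 states, m = 2 actions, all feasible.
# R[s, a]: reward for choosing action a in state s.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])

# Q[s, a, s']: probability of moving to state s' from state s under action a.
# Each Q[s, a] must be a probability distribution over S.
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])

beta = 0.95  # discount factor in [0, 1)

# Sanity check: each Q[s, a] sums to one.
assert np.allclose(Q.sum(axis=-1), 1.0)
```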

For a policy function $$\sigma$$, let $$r_{\sigma}$$ and $$Q_{\sigma}$$ be the reward vector and the transition probability matrix for $$\sigma$$, which are defined by $$r_{\sigma}(s) = r(s, \sigma(s))$$ and $$Q_{\sigma}(s, s') = q(s'|s, \sigma(s))$$, respectively. The policy value function $$v_{\sigma}$$ for $$\sigma$$ is defined by

$v_{\sigma}(s) = \sum_{t=0}^{\infty} \beta^t (Q_{\sigma}^t r_{\sigma})(s) \quad (s \in S).$
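Since $$Q_{\sigma}$$ is a stochastic matrix and $$0 \leq \beta < 1$$, this geometric series sums to $$v_{\sigma} = (I - \beta Q_{\sigma})^{-1} r_{\sigma}$$, so the policy value function can be computed with a single linear solve. A minimal sketch with hypothetical primitives (the arrays and the policy are made-up illustration values):

```python
import numpy as np

# Hypothetical two-state, two-action example.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
beta = 0.95

sigma = np.array([0, 1])  # a policy: take action sigma[s] in state s
n = R.shape[0]

r_sigma = R[np.arange(n), sigma]  # r_sigma(s) = r(s, sigma(s))
Q_sigma = Q[np.arange(n), sigma]  # Q_sigma(s, s') = q(s'|s, sigma(s))

# Summing the series: v_sigma = (I - beta * Q_sigma)^{-1} r_sigma.
v_sigma = np.linalg.solve(np.eye(n) - beta * Q_sigma, r_sigma)
```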

The optimal value function $$v^*$$ is the function such that $$v^*(s) = \max_{\sigma} v_{\sigma}(s)$$ for all $$s \in S$$. A policy function $$\sigma^*$$ is optimal if $$v_{\sigma^*}(s) = v^*(s)$$ for all $$s \in S$$.

The Bellman equation is written as

$v(s) = \max_{a \in A(s)} \left[ r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \right] \quad (s \in S).$

The Bellman operator $$T$$ is defined by the right hand side of the Bellman equation:

$(T v)(s) = \max_{a \in A(s)} \left[ r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \right] \quad (s \in S).$
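With rewards in an $$n \times m$$ array `R` and transitions in an $$n \times m \times n$$ array `Q`, the Bellman operator vectorizes naturally. A sketch with hypothetical numbers (`bellman_operator` is an illustrative name, not the library's API):

```python
import numpy as np

# Hypothetical two-state, two-action example.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
beta = 0.95

def bellman_operator(v):
    """Apply T: (T v)(s) = max_a [ r(s, a) + beta * sum_{s'} q(s'|s, a) v(s') ]."""
    vals = R + beta * (Q @ v)  # shape (n, m): value of each (s, a) pair
    return vals.max(axis=1)

v = np.zeros(2)
Tv = bellman_operator(v)  # with v = 0 this is just max_a r(s, a)
```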

For a policy function $$\sigma$$, the operator $$T_{\sigma}$$ is defined by

$(T_{\sigma} v)(s) = r(s, \sigma(s)) + \beta \sum_{s' \in S} q(s'|s, \sigma(s)) v(s') \quad (s \in S),$

or $$T_{\sigma} v = r_{\sigma} + \beta Q_{\sigma} v$$.
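Like $$T$$, the operator $$T_{\sigma}$$ is a $$\beta$$-contraction in the sup norm, so iterating it from any starting point converges to its unique fixed point $$v_{\sigma}$$. A small sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical two-state example; policy sigma takes action sigma[s] in state s.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
beta = 0.95
sigma = np.array([0, 1])
n = R.shape[0]

r_sigma = R[np.arange(n), sigma]
Q_sigma = Q[np.arange(n), sigma]

# Iterating T_sigma v = r_sigma + beta * Q_sigma v converges to v_sigma.
v = np.zeros(n)
for _ in range(1000):
    v = r_sigma + beta * Q_sigma @ v

# Exact value for comparison: (I - beta * Q_sigma)^{-1} r_sigma.
v_exact = np.linalg.solve(np.eye(n) - beta * Q_sigma, r_sigma)
```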

The main result of the theory of dynamic programming states that the optimal value function $$v^*$$ is the unique solution to the Bellman equation, or the unique fixed point of the Bellman operator, and that $$\sigma^*$$ is an optimal policy function if and only if it is $$v^*$$-greedy, i.e., it satisfies $$T v^* = T_{\sigma^*} v^*$$.
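One simple way to approximate $$v^*$$ and read off a $$v^*$$-greedy policy is to iterate the contraction $$T$$ from any starting point and then take the maximizing action in each state. A sketch with hypothetical numbers (this is an illustration of the fixed-point characterization, not the library's solver):

```python
import numpy as np

# Hypothetical two-state, two-action example.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
beta = 0.95

# Approximate the unique fixed point v* by iterating T from v = 0.
v = np.zeros(2)
for _ in range(2000):
    v = (R + beta * (Q @ v)).max(axis=1)

# A v*-greedy policy attains the max in the Bellman equation at v*.
sigma_star = (R + beta * (Q @ v)).argmax(axis=1)
```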

## Solution Algorithms

The DiscreteDP class currently implements the following solution algorithms:

• value iteration;
• policy iteration;
• modified policy iteration.

Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an $$\varepsilon$$-optimal policy and an $$\varepsilon/2$$-approximation of the optimal value function for a prespecified value of $$\varepsilon$$.
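Policy iteration alternates exact policy evaluation with greedy improvement, and terminates when the policy reproduces itself, i.e., when it is greedy with respect to its own value. A minimal sketch with hypothetical numbers (`policy_iteration` is an illustrative helper, not the DiscreteDP API):

```python
import numpy as np

# Hypothetical two-state, two-action example.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
beta = 0.95
n = R.shape[0]

def policy_iteration(sigma):
    """Howard's policy iteration: evaluate the policy, then improve greedily."""
    while True:
        # Policy evaluation: v_sigma = (I - beta * Q_sigma)^{-1} r_sigma.
        r_sig = R[np.arange(n), sigma]
        Q_sig = Q[np.arange(n), sigma]
        v = np.linalg.solve(np.eye(n) - beta * Q_sig, r_sig)
        # Policy improvement: a v_sigma-greedy policy.
        sigma_new = (R + beta * (Q @ v)).argmax(axis=1)
        if np.array_equal(sigma_new, sigma):
            return sigma, v  # sigma is greedy w.r.t. its own value: optimal
        sigma = sigma_new

sigma_star, v_star = policy_iteration(np.zeros(n, dtype=int))
```

Because there are only finitely many policy functions and each improvement step weakly increases the value, the loop must terminate after finitely many passes.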

Our implementations of value iteration and modified policy iteration employ the norm-based and span-based termination rules, respectively.

• Value iteration is terminated when the condition $$\lVert T v - v \rVert < [(1 - \beta) / (2\beta)] \varepsilon$$ is satisfied.
• Modified policy iteration is terminated when the condition $$\mathrm{span}(T v - v) < [(1 - \beta) / \beta] \varepsilon$$ is satisfied, where $$\mathrm{span}(z) = \max(z) - \min(z)$$.
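The norm-based rule for value iteration can be sketched as follows, with hypothetical numbers (an illustration of the termination condition above, not the DiscreteDP implementation):

```python
import numpy as np

# Hypothetical two-state, two-action example.
R = np.array([[5.0, 10.0],
              [-1.0, 1.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
beta = 0.95

def value_iteration(epsilon=1e-4):
    """Iterate T until ||T v - v|| < [(1 - beta) / (2 * beta)] * epsilon."""
    v = np.zeros(R.shape[0])
    tol = (1 - beta) / (2 * beta) * epsilon
    while True:
        Tv = (R + beta * (Q @ v)).max(axis=1)
        if np.max(np.abs(Tv - v)) < tol:
            break
        v = Tv
    # A greedy policy w.r.t. the final iterate is epsilon-optimal.
    sigma = (R + beta * (Q @ Tv)).argmax(axis=1)
    return sigma, Tv

sigma, v = value_iteration()
```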