ddp

Filename: ddp.py

Author: Daisuke Oyama

Module for solving dynamic programs (also known as Markov decision processes) with finite states and actions.

Discrete Dynamic Programming

A discrete dynamic program consists of the following components:

  • finite set of states \(S = \{0, \ldots, n-1\}\);
  • finite set of available actions \(A(s)\) for each state \(s \in S\) with \(A = \bigcup_{s \in S} A(s) = \{0, \ldots, m-1\}\), where \(\mathit{SA} = \{(s, a) \in S \times A \mid a \in A(s)\}\) is the set of feasible state-action pairs;
  • reward function \(r\colon \mathit{SA} \to \mathbb{R}\), where \(r(s, a)\) is the reward when the current state is \(s\) and the action chosen is \(a\);
  • transition probability function \(q\colon \mathit{SA} \to \Delta(S)\), where \(q(s'|s, a)\) is the probability that the state in the next period is \(s'\) when the current state is \(s\) and the action chosen is \(a\); and
  • discount factor \(0 \leq \beta < 1\).

For a policy function \(\sigma\), let \(r_{\sigma}\) and \(Q_{\sigma}\) be the reward vector and the transition probability matrix for \(\sigma\), which are defined by \(r_{\sigma}(s) = r(s, \sigma(s))\) and \(Q_{\sigma}(s, s') = q(s'|s, \sigma(s))\), respectively. The policy value function \(v_{\sigma}\) for \(\sigma\) is defined by

\[v_{\sigma}(s) = \sum_{t=0}^{\infty} \beta^t (Q_{\sigma}^t r_{\sigma})(s) \quad (s \in S).\]

The optimal value function \(v^*\) is the function such that \(v^*(s) = \max_{\sigma} v_{\sigma}(s)\) for all \(s \in S\). A policy function \(\sigma^*\) is optimal if \(v_{\sigma^*}(s) = v^*(s)\) for all \(s \in S\).

The Bellman equation is written as

\[v(s) = \max_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \quad (s \in S).\]

The Bellman operator \(T\) is defined by the right hand side of the Bellman equation:

\[(T v)(s) = \max_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \quad (s \in S).\]

For a policy function \(\sigma\), the operator \(T_{\sigma}\) is defined by

\[(T_{\sigma} v)(s) = r(s, \sigma(s)) + \beta \sum_{s' \in S} q(s'|s, \sigma(s)) v(s') \quad (s \in S),\]

or \(T_{\sigma} v = r_{\sigma} + \beta Q_{\sigma} v\).

The main result of the theory of dynamic programming states that the optimal value function \(v^*\) is the unique solution to the Bellman equation, or the unique fixed point of the Bellman operator, and that \(\sigma^*\) is an optimal policy function if and only if it is \(v^*\)-greedy, i.e., it satisfies \(T v^* = T_{\sigma^*} v^*\).
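
For concreteness, here is a minimal NumPy sketch of one application of the Bellman operator under the n x m / n x m x n array formulation described under DiscreteDP below; the function name bellman_update is illustrative and not part of the library.

import numpy as np

def bellman_update(R, Q, beta, v):
    # R has shape (n, m), Q has shape (n, m, n), v has shape (n,).
    # Value of each (s, a) pair: r(s, a) + beta * sum_{s'} q(s'|s, a) v(s')
    vals = R + beta * (Q @ v)   # shape (n, m); infeasible pairs carry reward -inf
    # (T v)(s) maximizes over the actions available in state s
    return vals.max(axis=1)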

Solution Algorithms

The DiscreteDP class currently implements the following solution algorithms:

  • value iteration;
  • policy iteration;
  • modified policy iteration.

Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an \(\varepsilon\)-optimal policy and an \(\varepsilon/2\)-approximation of the optimal value function for a prespecified value of \(\varepsilon\).

Our implementations of value iteration and modified policy iteration employ the norm-based and span-based termination rules, respectively (a short code sketch of both checks follows the list below).

  • Value iteration is terminated when the condition \(\lVert T v - v \rVert < [(1 - \beta) / (2\beta)] \varepsilon\) is satisfied.
  • Modified policy iteration is terminated when the condition \(\mathrm{span}(T v - v) < [(1 - \beta) / \beta] \varepsilon\) is satisfied, where \(\mathrm{span}(z) = \max(z) - \min(z)\).
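
As a sketch, the two checks can be written directly from the conditions above (illustrative helper functions, not part of the library API):

import numpy as np

def value_iteration_done(Tv, v, beta, epsilon):
    # Norm-based rule: sup-norm of Tv - v below [(1 - beta) / (2 beta)] * epsilon
    return np.max(np.abs(Tv - v)) < (1 - beta) / (2 * beta) * epsilon

def modified_policy_iteration_done(Tv, v, beta, epsilon):
    # Span-based rule: span(Tv - v) below [(1 - beta) / beta] * epsilon,
    # where span(z) = max(z) - min(z)
    z = Tv - v
    return np.max(z) - np.min(z) < (1 - beta) / beta * epsilon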

References

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley-Interscience, 2005.

class quantecon.markov.ddp.DPSolveResult[source]

Bases: dict

Contains information about the dynamic programming result.

Attributes

v (ndarray(float, ndim=1)) Computed optimal value function
sigma (ndarray(int, ndim=1)) Computed optimal policy function
num_iter (int) Number of iterations
mc (MarkovChain) Controlled Markov chain
method (str) Method employed
epsilon (float) Value of epsilon
max_iter (int) Maximum number of iterations

Methods

clear() Remove all items from the dict.
copy() Return a shallow copy of the dict.
fromkeys(iterable[, value]) Return a new dict with keys from iterable and values set to value.
get(k[, d]) Return the value for key k if k is in the dict, else d.
items() Return a view of the dict's (key, value) pairs.
keys() Return a view of the dict's keys.
pop(k[, d]) Remove key k and return its value; if k is not found, return d if given, otherwise raise KeyError.
popitem() Remove and return a (key, value) pair; raise KeyError if the dict is empty.
setdefault(k[, d]) Return the value for key k, inserting d if k is not present.
update([E, ]**F) Update the dict from E and F: for k in E: D[k] = E[k] (if E has a .keys() method), then likewise for F.
values() Return a view of the dict's values.
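
Since DPSolveResult subclasses dict, each field listed above is also available as a dictionary entry, while the examples further below read the fields as attributes. A small sketch, assuming ddp is a DiscreteDP instance such as the one constructed in the examples below:

res = ddp.solve(method='policy_iteration')
v_by_attribute = res.v      # attribute access, as used in the examples below
v_by_key = res['v']         # plain dict access to the same field
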
class quantecon.markov.ddp.DiscreteDP(R, Q, beta, s_indices=None, a_indices=None)[source]

Bases: object

Class for dealing with a discrete dynamic program.

There are two ways to represent the data for instantiating a DiscreteDP object. Let n, m, and L denote the numbers of states, actions, and feasible state-action pairs, respectively.

  1. DiscreteDP(R, Q, beta)

    with parameters:

    • n x m reward array R,
    • n x m x n transition probability array Q, and
    • discount factor beta,

    where R[s, a] is the reward for action a when the state is s and Q[s, a, s’] is the probability that the state in the next period is s’ when the current state is s and the action chosen is a.

  2. DiscreteDP(R, Q, beta, s_indices, a_indices)

    with parameters:

    • length L reward vector R,
    • L x n transition probability array Q,
    • discount factor beta,
    • length L array s_indices, and
    • length L array a_indices,

    where the pairs (s_indices[0], a_indices[0]), ..., (s_indices[L-1], a_indices[L-1]) enumerate feasible state-action pairs, and R[i] is the reward for action a_indices[i] when the state is s_indices[i] and Q[i, s'] is the probability that the state in the next period is s' when the current state is s_indices[i] and the action chosen is a_indices[i]. With this formulation, Q may be represented by a scipy.sparse matrix (a sketch follows this list).
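
As a sketch of the sparse option just mentioned, Q can be built with scipy.sparse (the numbers anticipate the example further below; ddp_sparse is an illustrative name):

from scipy import sparse
from quantecon.markov.ddp import DiscreteDP

s_indices = [0, 0, 1]
a_indices = [0, 1, 0]
R = [5, 10, -1]
# L x n transition probabilities stored sparsely, one row per feasible state-action pair
Q = sparse.csr_matrix([[0.5, 0.5], [0.0, 1.0], [0.0, 1.0]])
ddp_sparse = DiscreteDP(R, Q, 0.95, s_indices, a_indices)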

Parameters:

R : array_like(float, ndim=2 or 1)

Reward array.

Q : array_like(float, ndim=3 or 2) or scipy.sparse matrix

Transition probability array.

beta : scalar(float)

Discount factor. Must be in [0, 1].

s_indices : array_like(int, ndim=1), optional(default=None)

Array containing the indices of the states.

a_indices : array_like(int, ndim=1), optional(default=None)

Array containing the indices of the actions.

Notes

DiscreteDP accepts beta=1 for convenience. In this case, infinite horizon solution methods are disabled, and the instance is treated simply as an object carrying the Bellman operator, which may be used for backward induction in finite horizon problems.
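
A minimal sketch of this use case, combining beta=1 with the backward_induction function documented at the end of this page (the data are those of the example below; ddp_finite is an illustrative name):

from quantecon.markov.ddp import DiscreteDP, backward_induction

R = [[5, 10], [-1, -float('inf')]]
Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
ddp_finite = DiscreteDP(R, Q, 1)                 # beta = 1: solve() is disabled
vs, sigmas = backward_induction(ddp_finite, 10)  # 10-period finite horizon problem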

Examples

Consider the following example, taken from Puterman (2005), Section 3.1, pp.33-35.

  • Set of states S = {0, 1}

  • Set of actions A = {0, 1}

  • Set of feasible state-action pairs SA = {(0, 0), (0, 1), (1, 0)}

  • Rewards r(s, a):

    r(0, 0) = 5, r(0, 1) = 10, r(1, 0) = -1

  • Transition probabilities q(s’|s, a):

    q(0|0, 0) = 0.5, q(1|0, 0) = 0.5, q(0|0, 1) = 0, q(1|0, 1) = 1, q(0|1, 0) = 0, q(1|1, 0) = 1

  • Discount factor 0.95

Creating a `DiscreteDP` instance

Product formulation

This approach uses the product set S x A as the domain by treating action 1 as yielding a reward of negative infinity at state 1.

>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta)

(Q[1, 1] is an arbitrary probability vector.)

State-action pairs formulation

This approach takes the set of feasible state-action pairs SA as given.

>>> s_indices = [0, 0, 1]  # State indices
>>> a_indices = [0, 1, 0]  # Action indices
>>> R = [5, 10, -1]
>>> Q = [(0.5, 0.5), (0, 1), (0, 1)]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta, s_indices, a_indices)

Solving the model

Policy iteration

>>> res = ddp.solve(method='policy_iteration', v_init=[0, 0])
>>> res.sigma  # Optimal policy function
array([0, 0])
>>> res.v  # Optimal value function
array([ -8.57142857, -20.        ])
>>> res.num_iter  # Number of iterations
2

Value iteration

>>> res = ddp.solve(method='value_iteration', v_init=[0, 0],
...                 epsilon=0.01)
>>> res.sigma  # (Approximate) optimal policy function
array([0, 0])
>>> res.v  # (Approximate) optimal value function
array([ -8.5665053 , -19.99507673])
>>> res.num_iter  # Number of iterations
162

Modified policy iteration

>>> res = ddp.solve(method='modified_policy_iteration',
...                 v_init=[0, 0], epsilon=0.01)
>>> res.sigma  # (Approximate) optimal policy function
array([0, 0])
>>> res.v  # (Approximate) optimal value function
array([ -8.57142826, -19.99999965])
>>> res.num_iter  # Number of iterations
3

Attributes

R, Q, beta (see Parameters.)
num_states (scalar(int)) Number of states.
num_sa_pairs (scalar(int)) Number of feasible state-action pairs (or those that yield finite rewards).
epsilon (scalar(float), default=1e-3) Default value for epsilon-optimality.
max_iter (scalar(int), default=250) Default value for the maximum number of iterations.

Methods

RQ_sigma(sigma) Given a policy sigma, return the reward vector R_sigma and the transition probability matrix Q_sigma.
T_sigma(sigma) Given a policy sigma, return the T_sigma operator.
bellman_operator(v[, Tv, sigma]) The Bellman operator, which computes and returns the updated value function Tv for a value function v.
compute_greedy(v[, sigma]) Compute the v-greedy policy.
controlled_mc(sigma) Returns the controlled Markov chain for a given policy sigma.
evaluate_policy(sigma) Compute the value of a policy.
modified_policy_iteration([v_init, epsilon, ...]) Solve the optimization problem by modified policy iteration.
operator_iteration(T, v, max_iter[, tol]) Iteratively apply the operator T to v.
policy_iteration([v_init, max_iter]) Solve the optimization problem by policy iteration.
solve([method, v_init, epsilon, max_iter, k]) Solve the dynamic programming problem.
value_iteration([v_init, epsilon, max_iter]) Solve the optimization problem by value iteration.
RQ_sigma(sigma)[source]

Given a policy sigma, return the reward vector R_sigma and the transition probability matrix Q_sigma.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

R_sigma : ndarray(float, ndim=1)

Reward vector for sigma, of length n.

Q_sigma : ndarray(float, ndim=2)

Transition probability matrix for sigma, of shape (n, n).
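
A usage sketch, continuing the two-state example above; the expected values follow directly from the definitions of r_sigma and Q_sigma:

R_sigma, Q_sigma = ddp.RQ_sigma([0, 0])
# R_sigma[s] = R[s, sigma[s]]        -> [5., -1.]
# Q_sigma[s, :] = Q[s, sigma[s], :]  -> [[0.5, 0.5], [0., 1.]]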

T_sigma(sigma)[source]

Given a policy sigma, return the T_sigma operator.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

callable

The T_sigma operator.
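
A usage sketch, assuming the returned callable maps v to r_sigma + beta Q_sigma v as in the definition of T_sigma above:

import numpy as np

T_0 = ddp.T_sigma([0, 0])    # operator for the policy sigma = [0, 0]
Tv = T_0(np.zeros(2))        # with v = 0 this reduces to r_sigma, i.e. [5., -1.]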

bellman_operator(v, Tv=None, sigma=None)[source]

The Bellman operator, which computes and returns the updated value function Tv for a value function v.

Parameters:

v : array_like(float, ndim=1)

Value function vector, of length n.

Tv : ndarray(float, ndim=1), optional(default=None)

Optional output array for Tv.

sigma : ndarray(int, ndim=1), optional(default=None)

If not None, the v-greedy policy vector is stored in this array. Must be of length n.

Returns:

Tv : ndarray(float, ndim=1)

Updated value function vector, of length n.
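
A usage sketch with the two-state example above; with v = 0, (Tv)(s) = max_a r(s, a), and the v-greedy policy picks action 1 in state 0 and action 0 in state 1:

import numpy as np

v = np.zeros(2)
Tv = ddp.bellman_operator(v)            # expected: array([10., -1.])
sigma = np.empty(2, dtype=int)
ddp.bellman_operator(v, sigma=sigma)    # sigma is filled with the v-greedy policy [1, 0]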

compute_greedy(v, sigma=None)[source]

Compute the v-greedy policy.

Parameters:

v : array_like(float, ndim=1)

Value function vector, of length n.

sigma : ndarray(int, ndim=1), optional(default=None)

Optional output array for sigma.

Returns:

sigma : ndarray(int, ndim=1)

v-greedy policy vector, of length n.
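
A usage sketch with the two-state example above, where res denotes the solve result from the examples:

ddp.compute_greedy([0, 0])   # greedy w.r.t. the zero vector: expected array([1, 0])
# A policy greedy w.r.t. the optimal value function is optimal (see the main result above):
ddp.compute_greedy(res.v)    # expected array([0, 0]), equal to res.sigma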

controlled_mc(sigma)[source]

Returns the controlled Markov chain for a given policy sigma.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

mc : MarkovChain

Controlled Markov chain.
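
A usage sketch, continuing the example above; the chain's transition matrix is Q_sigma, exposed as mc.P on quantecon's MarkovChain:

mc = ddp.controlled_mc([0, 0])
# mc.P is Q_sigma, i.e. [[0.5, 0.5], [0., 1.]] for the example above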

evaluate_policy(sigma)[source]

Compute the value of a policy.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

v_sigma : ndarray(float, ndim=1)

Value vector of sigma, of length n.
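
A usage sketch with the two-state example above; the value of the policy sigma = [0, 0] coincides with the optimal value function found by policy iteration:

v_sigma = ddp.evaluate_policy([0, 0])
# expected: approximately [-8.5714, -20.], matching res.v in the policy iteration example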

modified_policy_iteration(v_init=None, epsilon=None, max_iter=None, k=20)[source]

Solve the optimization problem by modified policy iteration. See the solve method.

operator_iteration(T, v, max_iter, tol=None, *args, **kwargs)[source]

Iteratively apply the operator T to v. Modify v in-place. Iteration is performed at most max_iter times. If tol is specified, iteration terminates once the distance between T(v) and v (in the max norm) falls below tol.

Parameters:

T : callable

Operator that acts on v.

v : ndarray

Object on which T acts. Modified in-place.

max_iter : scalar(int)

Maximum number of iterations.

tol : scalar(float), optional(default=None)

Error tolerance.

args, kwargs :

Other arguments and keyword arguments that are passed directly to the function T each time it is called.

Returns:

num_iter : scalar(int)

Number of iterations performed.
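
A usage sketch, assuming T returns the updated vector (as bellman_operator does); v is modified in place and the number of iterations performed is returned:

import numpy as np

v = np.zeros(ddp.num_states)
num_iter = ddp.operator_iteration(ddp.bellman_operator, v, max_iter=200, tol=1e-4)
# v now approximates the fixed point of the Bellman operator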

policy_iteration(v_init=None, max_iter=None)[source]

Solve the optimization problem by policy iteration. See the solve method.

solve(method='policy_iteration', v_init=None, epsilon=None, max_iter=None, k=20)[source]

Solve the dynamic programming problem.

Parameters:

method : str in {'value_iteration', 'vi', 'policy_iteration', 'pi', 'modified_policy_iteration', 'mpi'}, optional(default='policy_iteration')

Solution method.

v_init : array_like(float, ndim=1), optional(default=None)

Initial value function, of length n. If None, v_init is set such that v_init(s) = max_a r(s, a) for value iteration and policy iteration; for modified policy iteration, v_init(s) = min_{(s', a)} r(s', a) / (1 - beta) to guarantee convergence.

epsilon : scalar(float), optional(default=None)

Value for epsilon-optimality. If None, the value stored in the attribute epsilon is used.

max_iter : scalar(int), optional(default=None)

Maximum number of iterations. If None, the value stored in the attribute max_iter is used.

k : scalar(int), optional(default=20)

Number of iterations for partial policy evaluation in modified policy iteration (irrelevant for other methods).

Returns:

res : DPSolveResult

Optimization result represented as a DPSolveResult. See DPSolveResult for details.

value_iteration(v_init=None, epsilon=None, max_iter=None)[source]

Solve the optimization problem by value iteration. See the solve method.

quantecon.markov.ddp.backward_induction(ddp, T, v_term=None)[source]

Solve by backward induction a \(T\)-period finite horizon discrete dynamic program with stationary reward and transition probability functions \(r\) and \(q\) and discount factor \(\beta \in [0, 1]\).

The optimal value functions \(v^*_0, \ldots, v^*_T\) and policy functions \(\sigma^*_0, \ldots, \sigma^*_{T-1}\) are obtained by \(v^*_T = v_T\), and

\[v^*_{t-1}(s) = \max_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v^*_t(s') \quad (s \in S)\]

and

\[\sigma^*_{t-1}(s) \in \operatorname*{arg\,max}_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v^*_t(s') \quad (s \in S)\]

for \(t = T, \ldots, 1\), where the terminal value function \(v_T\) is exogenously given.

Parameters:

ddp : DiscreteDP

DiscreteDP instance storing reward array R, transition probability array Q, and discount factor beta.

T : scalar(int)

Number of decision periods.

v_term : array_like(float, ndim=1), optional(default=None)

Terminal value function, of length equal to n (the number of states). If None, it defaults to the vector of zeros.

Returns:

vs : ndarray(float, ndim=2)

Array of shape (T+1, n) where vs[t] contains the optimal value function at period t = 0, ..., T.

sigmas : ndarray(int, ndim=2)

Array of shape (T, n) where sigmas[t] contains the optimal policy function at period t = 0, ..., T-1.
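
A usage sketch with the two-state example from the DiscreteDP documentation above (the terminal value function defaults to the zero vector):

from quantecon.markov.ddp import backward_induction

vs, sigmas = backward_induction(ddp, 10)
# vs has shape (11, 2): optimal value functions v*_0, ..., v*_10
# sigmas has shape (10, 2): optimal policy functions sigma*_0, ..., sigma*_9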