ddp

Filename: ddp.py

Author: Daisuke Oyama

Module for solving dynamic programs (also known as Markov decision processes) with finite states and actions.

Discrete Dynamic Programming

A discrete dynamic program consists of the following components:

  • finite set of states \(S = \{0, \ldots, n-1\}\);
  • finite set of available actions \(A(s)\) for each state \(s \in S\) with \(A = \bigcup_{s \in S} A(s) = \{0, \ldots, m-1\}\), where \(\mathit{SA} = \{(s, a) \in S \times A \mid a \in A(s)\}\) is the set of feasible state-action pairs;
  • reward function \(r\colon \mathit{SA} \to \mathbb{R}\), where \(r(s, a)\) is the reward when the current state is \(s\) and the action chosen is \(a\);
  • transition probability function \(q\colon \mathit{SA} \to \Delta(S)\), where \(q(s'|s, a)\) is the probability that the state in the next period is \(s'\) when the current state is \(s\) and the action chosen is \(a\); and
  • discount factor \(0 \leq \beta < 1\).

For a policy function \(\sigma\), let \(r_{\sigma}\) and \(Q_{\sigma}\) be the reward vector and the transition probability matrix for \(\sigma\), which are defined by \(r_{\sigma}(s) = r(s, \sigma(s))\) and \(Q_{\sigma}(s, s') = q(s'|s, \sigma(s))\), respectively. The policy value function \(v_{\sigma}\) for \(\sigma\) is defined by

\[v_{\sigma}(s) = \sum_{t=0}^{\infty} \beta^t (Q_{\sigma}^t r_{\sigma})(s) \quad (s \in S).\]
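Since \(0 \leq \beta < 1\), this series converges, and \(v_{\sigma}\) can equivalently be characterized as the unique solution of the linear system \(v = r_{\sigma} + \beta Q_{\sigma} v\), that is,

\[v_{\sigma} = (I - \beta Q_{\sigma})^{-1} r_{\sigma}.\]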

The optimal value function \(v^*\) is the function such that \(v^*(s) = \max_{\sigma} v_{\sigma}(s)\) for all \(s \in S\). A policy function \(\sigma^*\) is optimal if \(v_{\sigma^*}(s) = v^*(s)\) for all \(s \in S\).

The Bellman equation is written as

\[v(s) = \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \right\} \quad (s \in S).\]

The Bellman operator \(T\) is defined by the right hand side of the Bellman equation:

\[(T v)(s) = \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \right\} \quad (s \in S).\]

For a policy function \(\sigma\), the operator \(T_{\sigma}\) is defined by

\[(T_{\sigma} v)(s) = r(s, \sigma(s)) + \beta \sum_{s' \in S} q(s'|s, \sigma(s)) v(s') \quad (s \in S),\]

or \(T_{\sigma} v = r_{\sigma} + \beta Q_{\sigma} v\).
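As a concrete illustration, here is a minimal NumPy sketch of the Bellman update under the product formulation described below (an n x m reward array R and an n x m x n transition array Q); the helper name bellman_update is illustrative and this is not the library's internal implementation:

>>> import numpy as np
>>> def bellman_update(R, Q, beta, v):
...     # R: (n, m) rewards, Q: (n, m, n) transition probabilities, v: (n,) value function
...     R, Q = np.asarray(R, dtype=float), np.asarray(Q, dtype=float)
...     vals = R + beta * Q @ np.asarray(v, dtype=float)
...     # vals[s, a] = r(s, a) + beta * sum_{s'} q(s'|s, a) v(s')
...     return vals.max(axis=1), vals.argmax(axis=1)  # Tv and a v-greedy policy

With the two-state example from the Examples section below and v = 0, this gives Tv = [10, -1] and the v-greedy policy [1, 0].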

The main result of the theory of dynamic programming states that the optimal value function \(v^*\) is the unique solution to the Bellman equation, or the unique fixed point of the Bellman operator, and that \(\sigma^*\) is an optimal policy function if and only if it is \(v^*\)-greedy, i.e., it satisfies \(T v^* = T_{\sigma^*} v^*\).

Solution Algorithms

The DiscreteDP class currently implements the following solution algorithms:

  • value iteration;
  • policy iteration;
  • modified policy iteration.

Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an \(\varepsilon\)-optimal policy and an \(\varepsilon/2\)-approximation of the optimal value function for a prespecified value of \(\varepsilon\).

Our implementations of value iteration and modified policy iteration employ the norm-based and span-based termination rules, respectively.

  • Value iteration is terminated when the condition \(\lVert T v - v \rVert < [(1 - \beta) / (2\beta)] \varepsilon\) is satisfied.
  • Modified policy iteration is terminated when the condition \(\mathrm{span}(T v - v) < [(1 - \beta) / \beta] \varepsilon\) is satisfied, where \(\mathrm{span}(z) = \max(z) - \min(z)\).
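To illustrate the norm-based rule, here is a simplified sketch of value iteration on the dense product-formulation arrays, reusing np and bellman_update from the sketch above (the function name is illustrative; this is not the library's implementation):

>>> def value_iteration_sketch(R, Q, beta, v, epsilon, max_iter=250):
...     # Norm-based rule: stop when ||Tv - v|| < [(1 - beta) / (2 * beta)] * epsilon
...     tol = (1 - beta) / (2 * beta) * epsilon  # assumes 0 < beta < 1
...     for num_iter in range(1, max_iter + 1):
...         Tv, _ = bellman_update(R, Q, beta, v)
...         error = np.max(np.abs(Tv - v))
...         v = Tv
...         if error < tol:
...             break
...     _, sigma = bellman_update(R, Q, beta, v)  # epsilon-optimal (v-greedy) policy
...     return v, sigma, num_iter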

References

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley-Interscience, 2005.

class quantecon.markov.ddp.DPSolveResult[source]

Bases: dict

Contains information about the dynamic programming result.

Attributes

v (ndarray(float, ndim=1)) Computed optimal value function
sigma (ndarray(int, ndim=1)) Computed optimal policy function
num_iter (int) Number of iterations
mc (MarkovChain) Controlled Markov chain
method (str) Method employed
epsilon (float) Value of epsilon
max_iter (int) Maximum number of iterations

Methods

clear() Remove all items from D.
copy() Return a shallow copy of D.
fromkeys(iterable[, value]) Return a new dict with keys from iterable and values set to value.
get(k[, d]) Return D[k] if k is in D, else d.
items() Return a view of D's (key, value) pairs.
keys() Return a view of D's keys.
pop(k[, d]) Remove key k and return its value; if k is not found, return d if given, otherwise raise KeyError.
popitem() Remove and return a (key, value) 2-tuple; raise KeyError if D is empty.
setdefault(k[, d]) Return D.get(k, d), inserting k with value d if k is not in D.
update(E, **F) Update D from mapping or iterable E and keyword arguments F (if E has a .keys() method, does: for k in E: D[k] = E[k]).
values() Return a view of D's values.

class quantecon.markov.ddp.DiscreteDP(R, Q, beta, s_indices=None, a_indices=None)[source]

Bases: object

Class for dealing with a discrete dynamic program.

There are two ways to represent the data for instantiating a DiscreteDP object. Let n, m, and L denote the numbers of states, actions, and feasible state-action pairs, respectively.

  1. DiscreteDP(R, Q, beta)

    with parameters:

    • n x m reward array R,
    • n x m x n transition probability array Q, and
    • discount factor beta,

    where R[s, a] is the reward for action a when the state is s and Q[s, a, s’] is the probability that the state in the next period is s’ when the current state is s and the action chosen is a.

  2. DiscreteDP(R, Q, beta, s_indices, a_indices)

    with parameters:

    • length L reward vector R,
    • L x n transition probability array Q,
    • discount factor beta,
    • length L array s_indices, and
    • length L array a_indices,

    where the pairs (s_indices[0], a_indices[0]), ..., (s_indices[L-1], a_indices[L-1]) enumerate feasible state-action pairs, and R[i] is the reward for action a_indices[i] when the state is s_indices[i] and Q[i, s’] is the probability that the state in the next period is s’ when the current state is s_indices[i] and the action chosen is a_indices[i]. With this formulation, Q may be represented by a scipy.sparse matrix.

Parameters:

R : array_like(float, ndim=2 or 1)

Reward array.

Q : array_like(float, ndim=3 or 2) or scipy.sparse matrix

Transition probability array.

beta : scalar(float)

Discount factor. Must be 0 <= beta < 1.

s_indices : array_like(int, ndim=1), optional(default=None)

Array containing the indices of the states.

a_indices : array_like(int, ndim=1), optional(default=None)

Array containing the indices of the actions.

Examples

Consider the following example, taken from Puterman (2005), Section 3.1, pp.33-35.

  • Set of states S = {0, 1}

  • Set of actions A = {0, 1}

  • Set of feasible state-action pairs SA = {(0, 0), (0, 1), (1, 0)}

  • Rewards r(s, a):

r(0, 0) = 5, r(0, 1) = 10, r(1, 0) = -1

  • Transition probabilities q(s’|s, a):

    q(0|0, 0) = 0.5, q(1|0, 0) = 0.5, q(0|0, 1) = 0, q(1|0, 1) = 1, q(0|1, 0) = 0, q(1|1, 0) = 1

  • Discount factor 0.95

Creating a `DiscreteDP` instance

Product formulation

This approach uses the product set S x A as the domain by treating action 1 as yielding a reward of negative infinity at state 1.

>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta)

(Q[1, 1] is an arbitrary probability vector.)

State-action pairs formulation

This approach takes the set of feasible state-action pairs SA as given.

>>> s_indices = [0, 0, 1]  # State indices
>>> a_indices = [0, 1, 0]  # Action indices
>>> R = [5, 10, -1]
>>> Q = [(0.5, 0.5), (0, 1), (0, 1)]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta, s_indices, a_indices)
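
With this formulation, Q may also be passed as a scipy.sparse matrix. A sketch (assuming scipy is installed; the variable names are illustrative):

>>> from scipy import sparse
>>> Q_sparse = sparse.csr_matrix(Q)  # same L x n transition probabilities, stored sparsely
>>> ddp_sparse = DiscreteDP(R, Q_sparse, beta, s_indices, a_indices)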

Solving the model

Policy iteration

>>> res = ddp.solve(method='policy_iteration', v_init=[0, 0])
>>> res.sigma  # Optimal policy function
array([0, 0])
>>> res.v  # Optimal value function
array([ -8.57142857, -20.        ])
>>> res.num_iter  # Number of iterations
2

Value iteration

>>> res = ddp.solve(method='value_iteration', v_init=[0, 0],
...                 epsilon=0.01)
>>> res.sigma  # (Approximate) optimal policy function
array([0, 0])
>>> res.v  # (Approximate) optimal value function
array([ -8.5665053 , -19.99507673])
>>> res.num_iter  # Number of iterations
162

Modified policy iteration

>>> res = ddp.solve(method='modified_policy_iteration',
...                 v_init=[0, 0], epsilon=0.01)
>>> res.sigma  # (Approximate) optimal policy function
array([0, 0])
>>> res.v  # (Approximate) optimal value function
array([ -8.57142826, -19.99999965])
>>> res.num_iter  # Number of iterations
3

Attributes

R, Q, beta (see Parameters.)
num_states (scalar(int)) Number of states.
num_actions (scalar(int)) Number of actions.
num_sa_pairs (scalar(int)) Number of feasible state-action pairs (or those that yield finite rewards).
epsilon (scalar(float), default=1e-3) Default value for epsilon-optimality.
max_iter (scalar(int), default=250) Default value for the maximum number of iterations.

Methods

RQ_sigma(sigma) Given a policy sigma, return the reward vector R_sigma and the transition probability matrix Q_sigma.
T_sigma(sigma) Given a policy sigma, return the T_sigma operator.
bellman_operator(v[, Tv, sigma]) The Bellman operator, which computes and returns the updated value function Tv for a value function v.
compute_greedy(v[, sigma]) Compute the v-greedy policy.
controlled_mc(sigma) Returns the controlled Markov chain for a given policy sigma.
evaluate_policy(sigma) Compute the value of a policy.
modified_policy_iteration([v_init, epsilon, ...]) Solve the optimization problem by modified policy iteration.
operator_iteration(T, v, max_iter[, tol]) Iteratively apply the operator T to v.
policy_iteration([v_init, max_iter]) Solve the optimization problem by policy iteration.
solve([method, v_init, epsilon, max_iter, k]) Solve the dynamic programming problem.
value_iteration([v_init, epsilon, max_iter]) Solve the optimization problem by value iteration.
RQ_sigma(sigma)[source]

Given a policy sigma, return the reward vector R_sigma and the transition probability matrix Q_sigma.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

R_sigma : ndarray(float, ndim=1)

Reward vector for sigma, of length n.

Q_sigma : ndarray(float, ndim=2)

Transition probability matrix for sigma, of shape (n, n).
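
For example, with the ddp instance from the Examples above and the policy sigma = [0, 0], one would expect R_sigma = [5, -1] and Q_sigma = [[0.5, 0.5], [0, 1]] (a usage sketch; output is omitted):

>>> R_sigma, Q_sigma = ddp.RQ_sigma([0, 0])
>>> # R_sigma[s] = r(s, sigma(s)), Q_sigma[s, s'] = q(s'|s, sigma(s))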

T_sigma(sigma)[source]

Given a policy sigma, return the T_sigma operator.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

callable

The T_sigma operator.
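
For example (a usage sketch building on the Examples above; the variable name is illustrative), the returned operator maps a value function v to r_sigma + beta * Q_sigma v:

>>> T_sigma_0 = ddp.T_sigma([0, 0])
>>> T_sigma_0([0, 0])  # equals r_sigma + beta * Q_sigma @ [0, 0], i.e. [5, -1]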

bellman_operator(v, Tv=None, sigma=None)[source]

The Bellman operator, which computes and returns the updated value function Tv for a value function v.

Parameters:

v : array_like(float, ndim=1)

Value function vector, of length n.

Tv : ndarray(float, ndim=1), optional(default=None)

Optional output array for Tv.

sigma : ndarray(int, ndim=1), optional(default=None)

If not None, the v-greedy policy vector is stored in this array. Must be of length n.

Returns:

Tv : ndarray(float, ndim=1)

Updated value function vector, of length n.
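
For example (a usage sketch with the ddp instance from the Examples above; output omitted), applying the operator once to v = 0 picks the larger of the two rewards in state 0, and the v-greedy policy can be captured via the sigma argument:

>>> import numpy as np
>>> sigma = np.empty(2, dtype=int)
>>> Tv = ddp.bellman_operator([0, 0], sigma=sigma)  # Tv = [10, -1], sigma becomes [1, 0]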

compute_greedy(v, sigma=None)[source]

Compute the v-greedy policy.

Parameters:

v : array_like(float, ndim=1)

Value function vector, of length n.

sigma : ndarray(int, ndim=1), optional(default=None)

Optional output array for sigma.

Returns:

sigma : ndarray(int, ndim=1)

v-greedy policy vector, of length n.
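
For example (a usage sketch with the ddp instance from the Examples above), the policy that is greedy with respect to v = 0 simply maximizes the instantaneous reward in each state:

>>> ddp.compute_greedy([0, 0])  # action 1 in state 0 (reward 10 > 5), action 0 in state 1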

controlled_mc(sigma)[source]

Returns the controlled Markov chain for a given policy sigma.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

mc : MarkovChain

Controlled Markov chain.
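
For example (a usage sketch with the ddp instance from the Examples above; output omitted), the chain controlled by sigma = [0, 0] has transition matrix Q_sigma:

>>> mc = ddp.controlled_mc([0, 0])
>>> mc.P  # equals Q_sigma = [[0.5, 0.5], [0, 1]]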

evaluate_policy(sigma)[source]

Compute the value of a policy.

Parameters:

sigma : array_like(int, ndim=1)

Policy vector, of length n.

Returns:

v_sigma : ndarray(float, ndim=1)

Value vector of sigma, of length n.
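
For example (a usage sketch with the ddp instance from the Examples above; output omitted), the value of sigma = [0, 0] solves v = r_sigma + beta Q_sigma v:

>>> ddp.evaluate_policy([0, 0])  # approximately [-8.571, -20.], as in the solve output above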

modified_policy_iteration(v_init=None, epsilon=None, max_iter=None, k=20)[source]

Solve the optimization problem by modified policy iteration. See the solve method.

operator_iteration(T, v, max_iter, tol=None, *args, **kwargs)[source]

Iteratively apply the operator T to v, modifying v in place. Iteration is performed at most max_iter times. If tol is specified, iteration terminates once the distance between T(v) and v (in the max norm) is less than tol.

Parameters:

T : callable

Operator that acts on v.

v : ndarray

Object on which T acts. Modified in-place.

max_iter : scalar(int)

Maximum number of iterations.

tol : scalar(float), optional(default=None)

Error tolerance.

args, kwargs :

Other arguments and keyword arguments that are passed directly to the function T each time it is called.

Returns:

num_iter : scalar(int)

Number of iterations performed.
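
For example, the Bellman operator can be iterated with this method (a usage sketch with the ddp instance from the Examples above; this mirrors the norm-based value iteration rule, not necessarily the library's internals):

>>> import numpy as np
>>> v = np.zeros(2)
>>> num_iter = ddp.operator_iteration(ddp.bellman_operator, v, max_iter=1000, tol=1e-4)
>>> # v has been updated in place to an approximation of the optimal value function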

policy_iteration(v_init=None, max_iter=None)[source]

Solve the optimization problem by policy iteration. See the solve method.

solve(method='policy_iteration', v_init=None, epsilon=None, max_iter=None, k=20)[source]

Solve the dynamic programming problem.

Parameters:

method : str in {'value_iteration', 'vi', 'policy_iteration', 'pi', 'modified_policy_iteration', 'mpi'}, optional(default='policy_iteration')

Solution method.

v_init : array_like(float, ndim=1), optional(default=None)

Initial value function, of length n. If None, v_init is set such that v_init(s) = max_a r(s, a) for value iteration and policy iteration; for modified policy iteration, v_init(s) = min_(s’, a) r(s’, a)/(1 - beta) to guarantee convergence.

epsilon : scalar(float), optional(default=None)

Value for epsilon-optimality. If None, the value stored in the attribute epsilon is used.

max_iter : scalar(int), optional(default=None)

Maximum number of iterations. If None, the value stored in the attribute max_iter is used.

k : scalar(int), optional(default=20)

Number of iterations for partial policy evaluation in modified policy iteration (irrelevant for other methods).

Returns:

res : DPSolveResult

Optimization result represented as a DPSolveResult. See DPSolveResult for details.
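
For example (a usage sketch with the ddp instance from the Examples above; output omitted), the abbreviated method names listed above may be passed, and the result also carries the Markov chain controlled by the computed policy:

>>> res = ddp.solve(method='vi', epsilon=0.01)  # same as method='value_iteration'
>>> res.mc.P  # transition matrix of the chain controlled by res.sigma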

value_iteration(v_init=None, epsilon=None, max_iter=None)[source]

Solve the optimization problem by value iteration. See the solve method.