ddp
Module for solving dynamic programs (also known as Markov decision processes) with finite states and actions.
Discrete Dynamic Programming
A discrete dynamic program consists of the following components:
finite set of states \(S = \{0, \ldots, n-1\}\);
finite set of available actions \(A(s)\) for each state \(s \in S\) with \(A = \bigcup_{s \in S} A(s) = \{0, \ldots, m-1\}\), where \(\mathit{SA} = \{(s, a) \in S \times A \mid a \in A(s)\}\) is the set of feasible state-action pairs;
reward function \(r\colon \mathit{SA} \to \mathbb{R}\), where \(r(s, a)\) is the reward when the current state is \(s\) and the action chosen is \(a\);
transition probability function \(q\colon \mathit{SA} \to \Delta(S)\), where \(q(s'|s, a)\) is the probability that the state in the next period is \(s'\) when the current state is \(s\) and the action chosen is \(a\); and
discount factor \(0 \leq \beta < 1\).
For a policy function \(\sigma\), let \(r_{\sigma}\) and \(Q_{\sigma}\) be the reward vector and the transition probability matrix for \(\sigma\), which are defined by \(r_{\sigma}(s) = r(s, \sigma(s))\) and \(Q_{\sigma}(s, s') = q(s'|s, \sigma(s))\), respectively. The policy value function \(v_{\sigma}\) for \(\sigma\) is defined by
\[v_{\sigma}(s) = \sum_{t=0}^{\infty} \beta^t (Q_{\sigma}^t r_{\sigma})(s) \quad (s \in S),\]
or equivalently, \(v_{\sigma} = (I - \beta Q_{\sigma})^{-1} r_{\sigma}\).
The optimal value function \(v^*\) is the function such that \(v^*(s) = \max_{\sigma} v_{\sigma}(s)\) for all \(s \in S\). A policy function \(\sigma^*\) is optimal if \(v_{\sigma^*}(s) = v^*(s)\) for all \(s \in S\).
The Bellman equation is written as
\[v(s) = \max_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \quad (s \in S).\]
The Bellman operator \(T\) is defined by the right hand side of the Bellman equation:
\[(T v)(s) = \max_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v(s') \quad (s \in S).\]
For a policy function \(\sigma\), the operator \(T_{\sigma}\) is defined by
\[(T_{\sigma} v)(s) = r(s, \sigma(s)) + \beta \sum_{s' \in S} q(s'|s, \sigma(s)) v(s') \quad (s \in S),\]
or \(T_{\sigma} v = r_{\sigma} + \beta Q_{\sigma} v\).
The main result of the theory of dynamic programming states that the optimal value function \(v^*\) is the unique solution to the Bellman equation, or the unique fixed point of the Bellman operator, and that \(\sigma^*\) is an optimal policy function if and only if it is \(v^*\)-greedy, i.e., it satisfies \(T v^* = T_{\sigma^*} v^*\).
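To see the fixed-point property in code, here is an illustrative sketch (not part of the reference material on this page) using the DiscreteDP class documented below and the example data from the Examples section: repeated application of the Bellman operator drives an arbitrary initial value function toward the optimal value function returned by policy iteration.
>>> import numpy as np
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, beta=0.95)
>>> v = np.zeros(2)                       # arbitrary starting point
>>> for _ in range(1000):
...     v = ddp.bellman_operator(v)       # v <- T v
>>> v_star = ddp.solve(method='policy_iteration').v
>>> np.max(np.abs(v - v_star))            # close to zero: v has converged to v*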
Solution Algorithms
The DiscreteDP class currently implements the following solution algorithms:
value iteration;
policy iteration;
modified policy iteration;
linear programming.
Policy iteration computes an exact optimal policy in finitely many iterations, while value iteration and modified policy iteration return an \(\varepsilon\)-optimal policy and an \(\varepsilon/2\)-approximation of the optimal value function for a prespecified value of \(\varepsilon\).
Our implementations of value iteration and modified policy iteration employ the norm-based and span-based termination rules, respectively.
Value iteration is terminated when the condition \(\lVert T v - v \rVert < [(1 - \beta) / (2\beta)] \varepsilon\) is satisfied.
Modified policy iteration is terminated when the condition \(\mathrm{span}(T v - v) < [(1 - \beta) / \beta] \varepsilon\) is satisfied, where \(\mathrm{span}(z) = \max(z) - \min(z)\).
The linear programming method solves the problem as a linear program by the simplex method, using the optimize.linprog_simplex routines (implemented only for the dense matrix formulation).
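As a hedged sketch of the norm-based rule above (not the library's internal implementation), plain value iteration can be written directly with bellman_operator and compute_greedy, stopping once the sup-norm distance between T v and v falls below [(1 - beta) / (2 beta)] * epsilon:
>>> import numpy as np
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> beta, epsilon = 0.95, 0.01
>>> ddp = DiscreteDP(R, Q, beta)
>>> tol = (1 - beta) / (2 * beta) * epsilon     # norm-based threshold
>>> v = np.zeros(ddp.num_states)
>>> while True:
...     Tv = ddp.bellman_operator(v)
...     if np.max(np.abs(Tv - v)) < tol:        # sup-norm termination rule
...         break
...     v = Tv
>>> sigma = ddp.compute_greedy(Tv)              # epsilon-optimal policy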
References
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley-Interscience, 2005.
- class quantecon.markov.ddp.DPSolveResult[source]
Bases:
dict
Contains the information about the dynamic programming result.
- Attributes:
- v : ndarray(float, ndim=1)
Computed optimal value function
- sigma : ndarray(int, ndim=1)
Computed optimal policy function
- num_iter : int
Number of iterations
- mc : MarkovChain
Controlled Markov chain
- method : str
Method employed
- epsilon : float
Value of epsilon
- max_iter : int
Maximum number of iterations
Methods
- clear()
- copy()
- fromkeys(iterable[, value])
Create a new dictionary with keys from iterable and values set to value.
- get(key[, default])
Return the value for key if key is in the dictionary, else default.
- items()
- keys()
- pop(key[, default])
If key is not found, default is returned if given, otherwise KeyError is raised.
- popitem(/)
Remove and return a (key, value) pair as a 2-tuple.
- setdefault(key[, default])
Insert key with a value of default if key is not in the dictionary.
- update([E, ]**F)
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
- values()
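A small usage sketch: DPSolveResult subclasses dict, so the fields above are read as attributes (as in the examples below) and, assuming each field is also stored as a dictionary entry (in the style of scipy's OptimizeResult), through the inherited dict interface as well.
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> res = DiscreteDP(R, Q, 0.95).solve(method='policy_iteration')
>>> res.v, res.sigma, res.num_iter     # attribute access
>>> sorted(res.keys())                 # dict interface, assuming keys mirror the attributes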
- class quantecon.markov.ddp.DiscreteDP(R, Q, beta, s_indices=None, a_indices=None)[source]
Bases:
object
Class for dealing with a discrete dynamic program.
There are two ways to represent the data for instantiating a DiscreteDP object. Let n, m, and L denote the numbers of states, actions, and feasible state-action pairs, respectively.
DiscreteDP(R, Q, beta)
with parameters:
n x m reward array R,
n x m x n transition probability array Q, and
discount factor beta,
where R[s, a] is the reward for action a when the state is s and Q[s, a, s_next] is the probability that the state in the next period is s_next when the current state is s and the action chosen is a.
DiscreteDP(R, Q, beta, s_indices, a_indices)
with parameters:
length L reward vector R,
L x n transition probability array Q,
discount factor beta,
length L array s_indices, and
length L array a_indices,
where the pairs (s_indices[0], a_indices[0]), …, (s_indices[L-1], a_indices[L-1]) enumerate feasible state-action pairs, and R[i] is the reward for action a_indices[i] when the state is s_indices[i] and Q[i, s_next] is the probability that the state in the next period is s_next when the current state is s_indices[i] and the action chosen is a_indices[i]. With this formulation, Q may be represented by a scipy.sparse matrix.
- Parameters:
- R : array_like(float, ndim=2 or 1)
Reward array.
- Q : array_like(float, ndim=3 or 2) or scipy.sparse matrix
Transition probability array.
- beta : scalar(float)
Discount factor. Must be in [0, 1].
- s_indices : array_like(int, ndim=1), optional(default=None)
Array containing the indices of the states.
- a_indices : array_like(int, ndim=1), optional(default=None)
Array containing the indices of the actions.
- Attributes:
- R, Q, beta : see Parameters.
- num_states : scalar(int)
Number of states.
- num_sa_pairs : scalar(int)
Number of feasible state-action pairs (or those that yield finite rewards).
- epsilon : scalar(float), default=1e-3
Default value for epsilon-optimality.
- max_iter : scalar(int), default=250
Default value for the maximum number of iterations.
Methods
- RQ_sigma(sigma)
Given a policy sigma, return the reward vector R_sigma and the transition probability matrix Q_sigma.
- T_sigma(sigma)
Given a policy sigma, return the T_sigma operator.
- bellman_operator(v[, Tv, sigma])
The Bellman operator, which computes and returns the updated value function Tv for a value function v.
- compute_greedy(v[, sigma])
Compute the v-greedy policy.
- controlled_mc(sigma)
Returns the controlled Markov chain for a given policy sigma.
- evaluate_policy(sigma)
Compute the value of a policy.
- modified_policy_iteration([v_init, epsilon, ...])
Solve the optimization problem by modified policy iteration.
- operator_iteration(T, v, max_iter[, tol])
Iteratively apply the operator T to v.
- policy_iteration([v_init, max_iter])
Solve the optimization problem by policy iteration.
- solve([method, v_init, epsilon, max_iter, k])
Solve the dynamic programming problem.
- to_product_form()
Convert this instance of DiscreteDP to the "product" form.
- to_sa_pair_form([sparse])
Convert this instance of DiscreteDP to SA-pair form.
- value_iteration([v_init, epsilon, max_iter])
Solve the optimization problem by value iteration.
- linprog_simplex
Notes
DiscreteDP accepts beta=1 for convenience. In this case, infinite horizon solution methods are disabled, and the instance is then seen as merely an object carrying the Bellman operator, which may be used for backward induction for finite horizon problems.
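For instance, with beta=1 the Bellman operator can still be applied period by period. The following illustrative sketch (using the example data from the Examples section) performs backward induction by hand over a 10-period horizon with a zero terminal value; the backward_induction function documented at the end of this page automates this pattern.
>>> import numpy as np
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, beta=1)            # infinite-horizon solvers disabled
>>> v = np.zeros(ddp.num_states)              # terminal value v_T = 0
>>> sigma = np.empty(ddp.num_states, dtype=int)
>>> for t in range(10):
...     v = ddp.bellman_operator(v, sigma=sigma)   # one backward step
>>> v, sigma                                  # value and policy for period 0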
Examples
Consider the following example, taken from Puterman (2005), Section 3.1, pp.33-35.
Set of states S = {0, 1}
Set of actions A = {0, 1}
Set of feasible state-action pairs SA = {(0, 0), (0, 1), (1, 0)}
Rewards r(s, a):
r(0, 0) = 5, r(0, 1) = 10, r(1, 0) = -1
Transition probabilities q(s_next|s, a):
q(0|0, 0) = 0.5, q(1|0, 0) = 0.5, q(0|0, 1) = 0, q(1|0, 1) = 1, q(0|1, 0) = 0, q(1|1, 0) = 1
Discount factor 0.95
Creating a `DiscreteDP` instance
Product formulation
This approach uses the product set S x A as the domain by treating action 1 as yielding a reward of negative infinity at state 1.
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta)
(Q[1, 1] is an arbitrary probability vector.)
State-action pairs formulation
This approach takes the set of feasible state-action pairs SA as given.
>>> s_indices = [0, 0, 1]  # State indices
>>> a_indices = [0, 1, 0]  # Action indices
>>> R = [5, 10, -1]
>>> Q = [(0.5, 0.5), (0, 1), (0, 1)]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta, s_indices, a_indices)
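As noted above, Q may also be passed as a scipy.sparse matrix in this formulation. A minimal sketch with the same data, using the CSR format:
>>> import numpy as np
>>> from scipy import sparse
>>> Q_sp = sparse.csr_matrix(np.array([[0.5, 0.5],
...                                    [0.0, 1.0],
...                                    [0.0, 1.0]]))
>>> ddp = DiscreteDP(R, Q_sp, beta, s_indices, a_indices)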
Solving the model
Policy iteration
>>> res = ddp.solve(method='policy_iteration', v_init=[0, 0])
>>> res.sigma  # Optimal policy function
array([0, 0])
>>> res.v  # Optimal value function
array([ -8.57142857, -20.        ])
>>> res.num_iter  # Number of iterations
2
Value iteration
>>> res = ddp.solve(method='value_iteration', v_init=[0, 0],
...                 epsilon=0.01)
>>> res.sigma  # (Approximate) optimal policy function
array([0, 0])
>>> res.v  # (Approximate) optimal value function
array([ -8.5665053 , -19.99507673])
>>> res.num_iter  # Number of iterations
162
Modified policy iteration
>>> res = ddp.solve(method='modified_policy_iteration',
...                 v_init=[0, 0], epsilon=0.01)
>>> res.sigma  # (Approximate) optimal policy function
array([0, 0])
>>> res.v  # (Approximate) optimal value function
array([ -8.57142826, -19.99999965])
>>> res.num_iter  # Number of iterations
3
Linear programming
>>> res = ddp.solve(method='linear_programming', v_init=[0, 0])
>>> res.sigma  # Optimal policy function
array([0, 0])
>>> res.v  # Optimal value function
array([ -8.57142857, -20.        ])
>>> res.num_iter  # Number of iterations (within the LP solver)
4
- RQ_sigma(sigma)[source]
Given a policy sigma, return the reward vector R_sigma and the transition probability matrix Q_sigma.
- Parameters:
- sigma : array_like(int, ndim=1)
Policy vector, of length n.
- Returns:
- R_sigma : ndarray(float, ndim=1)
Reward vector for sigma, of length n.
- Q_sigma : ndarray(float, ndim=2)
Transition probability matrix for sigma, of shape (n, n).
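An illustrative sketch (with the Puterman example from the Examples section): R_sigma and Q_sigma allow a policy to be evaluated directly via v_sigma = (I - beta Q_sigma)^{-1} r_sigma, matching evaluate_policy.
>>> import numpy as np
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> beta = 0.95
>>> ddp = DiscreteDP(R, Q, beta)
>>> sigma = [0, 0]                            # choose action 0 in both states
>>> R_sigma, Q_sigma = ddp.RQ_sigma(sigma)
>>> v_sigma = np.linalg.solve(np.eye(2) - beta * Q_sigma, R_sigma)
>>> np.allclose(v_sigma, ddp.evaluate_policy(sigma))   # True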
- T_sigma(sigma)[source]
Given a policy sigma, return the T_sigma operator.
- Parameters:
- sigma : array_like(int, ndim=1)
Policy vector, of length n.
- Returns:
- callable
The T_sigma operator.
- bellman_operator(v, Tv=None, sigma=None)[source]
The Bellman operator, which computes and returns the updated value function Tv for a value function v.
- Parameters:
- v : array_like(float, ndim=1)
Value function vector, of length n.
- Tv : ndarray(float, ndim=1), optional(default=None)
Optional output array for Tv.
- sigma : ndarray(int, ndim=1), optional(default=None)
If not None, the v-greedy policy vector is stored in this array. Must be of length n.
- Returns:
- Tv : ndarray(float, ndim=1)
Updated value function vector, of length n.
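A short sketch of the optional output arrays (example data as above): with v = 0, Tv reduces to max_a r(s, a) and sigma holds the corresponding greedy actions.
>>> import numpy as np
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, 0.95)
>>> v = np.zeros(2)
>>> Tv = np.empty(2)
>>> sigma = np.empty(2, dtype=int)
>>> _ = ddp.bellman_operator(v, Tv=Tv, sigma=sigma)   # writes into Tv and sigma
>>> Tv, sigma          # Tv = [10., -1.] and sigma = [1, 0] at v = 0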
- compute_greedy(v, sigma=None)[source]
Compute the v-greedy policy.
- Parameters:
- v : array_like(float, ndim=1)
Value function vector, of length n.
- sigma : ndarray(int, ndim=1), optional(default=None)
Optional output array for sigma.
- Returns:
- sigma : ndarray(int, ndim=1)
v-greedy policy vector, of length n.
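By the main result stated at the top of this page, a v*-greedy policy is optimal; an illustrative sketch with the same example data:
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, 0.95)
>>> v_star = ddp.solve(method='policy_iteration').v
>>> ddp.compute_greedy(v_star)     # v*-greedy, hence optimal: array([0, 0])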
- controlled_mc(sigma)[source]
Returns the controlled Markov chain for a given policy sigma.
- Parameters:
- sigma : array_like(int, ndim=1)
Policy vector, of length n.
- Returns:
- mc : MarkovChain
Controlled Markov chain.
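An illustrative sketch: the returned MarkovChain has transition matrix Q_sigma, so the long-run behavior of the controlled process can be studied with quantecon's MarkovChain tools (here only its transition matrix attribute P is read, assuming the usual MarkovChain interface).
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, 0.95)
>>> sigma_star = ddp.solve(method='policy_iteration').sigma
>>> mc = ddp.controlled_mc(sigma_star)
>>> mc.P               # transition matrix under the optimal policy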
- evaluate_policy(sigma)[source]
Compute the value of a policy.
- Parameters:
- sigma : array_like(int, ndim=1)
Policy vector, of length n.
- Returns:
- v_sigma : ndarray(float, ndim=1)
Value vector of sigma, of length n.
- modified_policy_iteration(v_init=None, epsilon=None, max_iter=None, k=20)[source]
Solve the optimization problem by modified policy iteration. See the solve method.
- operator_iteration(T, v, max_iter, tol=None, *args, **kwargs)[source]
Iteratively apply the operator T to v. v is modified in-place. Iteration is performed at most max_iter times. If tol is specified, iteration is terminated once the distance between T(v) and v (in the max norm) is less than tol.
- Parameters:
- T : callable
Operator that acts on v.
- v : ndarray
Object on which T acts. Modified in-place.
- max_iter : scalar(int)
Maximum number of iterations.
- tol : scalar(float), optional(default=None)
Error tolerance.
- args, kwargs
Other arguments and keyword arguments that are passed directly to the function T each time it is called.
- Returns:
- num_iter : scalar(int)
Number of iterations performed.
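A hedged sketch based on the documented behavior, iterating the T_sigma operator until v approximates the policy value v_sigma; v is updated in-place and the number of iterations performed is returned.
>>> import numpy as np
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, 0.95)
>>> T_sigma = ddp.T_sigma([0, 0])            # v -> r_sigma + beta Q_sigma v
>>> v = np.zeros(2)                          # modified in-place
>>> num_iter = ddp.operator_iteration(T_sigma, v, max_iter=10000, tol=1e-8)
>>> v, num_iter                              # v approximates ddp.evaluate_policy([0, 0])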
- policy_iteration(v_init=None, max_iter=None)[source]
Solve the optimization problem by policy iteration. See the solve method.
- solve(method='policy_iteration', v_init=None, epsilon=None, max_iter=None, k=20)[source]
Solve the dynamic programming problem.
- Parameters:
- method : str, optional(default='policy_iteration')
Solution method, str in {'value_iteration', 'vi', 'policy_iteration', 'pi', 'modified_policy_iteration', 'mpi', 'linear_programming', 'lp'}.
- v_init : array_like(float, ndim=1), optional(default=None)
Initial value function, of length n. If None, v_init is set such that v_init(s) = max_a r(s, a) for value iteration, policy iteration, and linear programming; for modified policy iteration, v_init(s) = min_(s_next, a) r(s_next, a)/(1 - beta) to guarantee convergence.
- epsilon : scalar(float), optional(default=None)
Value for epsilon-optimality. If None, the value stored in the attribute epsilon is used.
- max_iter : scalar(int), optional(default=None)
Maximum number of iterations. If None, the value stored in the attribute max_iter is used.
- k : scalar(int), optional(default=20)
Number of iterations for partial policy evaluation in modified policy iteration (irrelevant for other methods).
- Returns:
- res : DPSolveResult
Optimization result represented as a DPSolveResult. See DPSolveResult for details.
- to_product_form()[source]
Convert this instance of DiscreteDP to the “product” form.
The product form uses the version of the init method taking R, Q and beta.
- Returns:
- ddp_sa : DiscreteDP
The corresponding DiscreteDP instance in product form.
Notes
If this instance is already in product form, then it is returned unmodified.
- to_sa_pair_form(sparse=True)[source]
Convert this instance of DiscreteDP to SA-pair form
- Parameters:
- sparse : bool, optional(default=True)
Should the Q matrix be stored as a sparse matrix? If true, the CSR format is used.
- Returns:
- ddp_sa : DiscreteDP
The corresponding DiscreteDP instance in SA-pair form.
Notes
If this instance is already in SA-pair form, then it is returned unmodified.
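A quick sketch of converting between the two representations; both forms describe the same dynamic program and solve to the same optimal policy.
>>> from quantecon.markov.ddp import DiscreteDP
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, 0.95)             # product form
>>> ddp_sa = ddp.to_sa_pair_form()           # SA-pair form, sparse Q by default
>>> ddp_prod = ddp_sa.to_product_form()      # back to product form
>>> ddp_sa.solve().sigma, ddp_prod.solve().sigma   # both array([0, 0])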
- quantecon.markov.ddp.backward_induction(ddp, T, v_term=None)[source]
Solve by backward induction a \(T\)-period finite horizon discrete dynamic program with stationary reward and transition probability functions \(r\) and \(q\) and discount factor \(\beta \in [0, 1]\).
The optimal value functions \(v^*_0, \ldots, v^*_T\) and policy functions \(\sigma^*_0, \ldots, \sigma^*_{T-1}\) are obtained by \(v^*_T = v_T\), and
\[v^*_{t-1}(s) = \max_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v^*_t(s') \quad (s \in S)\]
and
\[\sigma^*_{t-1}(s) \in \operatorname*{arg\,max}_{a \in A(s)} r(s, a) + \beta \sum_{s' \in S} q(s'|s, a) v^*_t(s') \quad (s \in S)\]
for \(t = T, \ldots, 1\), where the terminal value function \(v_T\) is exogenously given.
- Parameters:
- ddp : DiscreteDP
DiscreteDP instance storing reward array R, transition probability array Q, and discount factor beta.
- T : scalar(int)
Number of decision periods.
- v_term : array_like(float, ndim=1), optional(default=None)
Terminal value function, of length equal to n (the number of states). If None, it defaults to the vector of zeros.
- Returns:
- vs : ndarray(float, ndim=2)
Array of shape (T+1, n) where vs[t] contains the optimal value function at period t = 0, …, T.
- sigmas : ndarray(int, ndim=2)
Array of shape (T, n) where sigmas[t] contains the optimal policy function at period t = 0, …, T-1.
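An illustrative finite horizon sketch with the example data used throughout this page: vs has T+1 rows (one per period, including the terminal one) and sigmas has T rows.
>>> from quantecon.markov.ddp import DiscreteDP, backward_induction
>>> R = [[5, 10], [-1, -float('inf')]]
>>> Q = [[(0.5, 0.5), (0, 1)], [(0, 1), (0.5, 0.5)]]
>>> ddp = DiscreteDP(R, Q, beta=0.95)
>>> vs, sigmas = backward_induction(ddp, T=5)   # terminal value defaults to zeros
>>> vs.shape, sigmas.shape                      # (6, 2) and (5, 2)
>>> vs[0], sigmas[0]                            # optimal value and actions for period 0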