Dynamic Programming
Dynamic Programming (DP) refers to a collection of algorithms that can compute optimal policies given a perfect model of the environment (i.e., complete knowledge of the MDP). Classical DP algorithms have limited practical utility in reinforcement learning (due to their requirement for a perfect model and high computational cost), but they are of great theoretical importance — nearly all subsequent RL methods can be viewed as attempts to achieve the same effect as DP with less computation and without requiring a perfect environment model.
Prerequisites: Before starting this chapter, you should be familiar with the basic concepts of finite MDPs, including the state-value function \(v_\pi(s)\), the action-value function \(q_\pi(s,a)\), Bellman equations, etc. (see Finite MDPs).
Core Idea of DP
The core idea of DP is to use Bellman equations as update rules, turning the computation of value functions into an iterative update process.
Recall the Bellman expectation equation (state-value function version):

\[
v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi(s') \big]
\]
And the Bellman optimality equation:

\[
v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big]
\]
The key assumption of DP methods is that we know the transition probabilities \(p(s', r | s, a)\), i.e., the complete environment model.
Policy Evaluation
Policy evaluation addresses the following problem: given a policy \(\pi\), compute its state-value function \(v_\pi\).
Iterative Policy Evaluation
We turn the Bellman expectation equation into an iterative update rule:

\[
v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_k(s') \big]
\]
Starting from an arbitrary initial value \(v_0\) (typically all zeros), we iterate until convergence. It can be shown that as \(k \to \infty\), \(v_k\) converges to \(v_\pi\).
Algorithm:
- Initialize \(V(s) = 0\) for all \(s \in \mathcal{S}\)
- Repeat until convergence (\(\Delta < \theta\), where \(\theta\) is a small positive threshold):
- \(\Delta \leftarrow 0\)
- For each state \(s\):
- \(v \leftarrow V(s)\)
- \(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]\)
- \(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
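The sweep above can be sketched in Python. The model format here (`P[s][a]` as a list of `(prob, next_state, reward)` triples, terminal states as empty dicts) and the function name are illustrative assumptions, not a fixed API:

```python
import numpy as np

def policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation with in-place updates.

    Assumed model format: P[s][a] is a list of (prob, next_state, reward)
    transitions; pi[s][a] is the probability of action a in state s.
    Terminal states are modeled as states with no actions (empty dict).
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:      # terminal state: value stays 0
                continue
            v = V[s]
            # Bellman expectation backup, using already-updated values
            # of other states (the in-place scheme described above)
            V[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                      for p, s2, r in trans)
                       for a, trans in actions.items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

# Tiny example MDP: from state 0, the single action loops back with
# probability 0.5 (reward 0) or reaches terminal state 1 (reward +1).
P = {0: {0: [(0.5, 0, 0.0), (0.5, 1, 1.0)]}, 1: {}}
pi = {0: {0: 1.0}}
V = policy_evaluation(P, pi, gamma=0.9)
# Analytically v(0) = 0.5 * 0.9 * v(0) + 0.5, i.e. v(0) = 0.5 / 0.55
```

The analytic fixed point gives a quick sanity check on the iteration.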
Update scheme: The algorithm above uses an "in-place update," meaning that when updating \(V(s)\), it uses the already-updated new values of other states. This typically converges faster than using old values for updates (which would require maintaining two arrays).
Gridworld Example
Consider a 4x4 gridworld where the top-left and bottom-right corners are terminal states. The agent has four equiprobable actions (up, down, left, right) in each non-terminal state, and every step yields a reward of -1.
Under a uniform random policy (probability 0.25 for each direction), iterative policy evaluation gradually computes the value of each state. States farther from the terminal states have lower values (since more steps are needed to reach a terminal state, accumulating more -1 rewards).
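A minimal sketch of this example, assuming the usual encoding (states 0 to 15 in row-major order, moves off the grid leave the state unchanged); the function and variable names are made up for illustration:

```python
import numpy as np

# 4x4 gridworld: states 0..15 row-major, corners 0 and 15 terminal,
# reward -1 on every step. Moves off the grid leave the state unchanged.
N = 4
TERMINAL = {0, N * N - 1}

def step(s, a):
    """Deterministic transition for action a (0=up, 1=down, 2=left, 3=right)."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def evaluate_random_policy(gamma=1.0, theta=1e-10):
    """In-place iterative policy evaluation of the uniform random policy."""
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v = V[s]
            # Each action has probability 0.25; the reward is always -1
            V[s] = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in range(4))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

V = evaluate_random_policy()
# Converged values match the classic figure: states adjacent to a terminal
# corner evaluate to -14, the two non-terminal corners to -22.
```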
Policy Improvement
Policy evaluation tells us how good a policy is, while policy improvement uses the value function to find a better policy.
Policy Improvement Theorem
For a policy \(\pi\), suppose a modified policy \(\pi'\) selects, in some state \(s\), an action \(a\) with \(q_\pi(s, a) \geq v_\pi(s)\), and otherwise follows \(\pi\). Then \(\pi'\) is at least as good as \(\pi\). More generally, if \(q_\pi(s, \pi'(s)) \geq v_\pi(s)\) for every state \(s\), then \(v_{\pi'}(s) \geq v_\pi(s)\) for all \(s\).
Specifically, given \(v_\pi\), we can construct a greedy policy \(\pi'\):

\[
\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi(s') \big]
\]
This greedy policy \(\pi'\) satisfies \(v_{\pi'}(s) \geq v_\pi(s)\) for all \(s\). Equality \(v_{\pi'} = v_\pi\) holds if and only if \(\pi\) is already optimal.
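The greedy construction is a one-step lookahead over the model. A minimal sketch, reusing the hypothetical `P[s][a]` list-of-`(prob, next_state, reward)` model format (the function name is also an assumption):

```python
import numpy as np

def greedy_policy(P, V, gamma=1.0):
    """Greedy policy improvement with respect to a value estimate V.

    Assumed model format: P[s][a] is a list of (prob, next_state, reward)
    transitions; terminal states (no actions) map to None.
    """
    pi = {}
    for s, actions in P.items():
        if not actions:
            pi[s] = None
            continue
        # One-step lookahead: q(s, a) = sum_{s', r} p(s', r | s, a)[r + gamma V(s')]
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
             for a, trans in actions.items()}
        pi[s] = max(q, key=q.get)
    return pi

# Two actions from state 0, both ending in terminal state 1: action 0 pays 0,
# action 1 pays +1, so the greedy policy must pick action 1.
P = {0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 1.0)]}, 1: {}}
pi = greedy_policy(P, np.zeros(2))
```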
Policy Iteration
Policy iteration alternates between policy evaluation (E) and policy improvement (I), progressively approaching the optimal policy:

\[
\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*
\]
Algorithm:
- Initialization: Arbitrarily initialize \(V(s)\) and \(\pi(s)\) for all \(s\)
- Policy Evaluation: Use iterative policy evaluation to compute \(V \approx v_\pi\) (until convergence)
- Policy Improvement: For each \(s\), set \(\pi(s) \leftarrow \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\)
- If the policy is unchanged, stop; otherwise, return to the Policy Evaluation step
Convergence: Since a finite MDP has only a finite number of policies and each policy improvement step strictly improves the policy (unless it is already optimal), policy iteration is guaranteed to converge to the optimal policy \(\pi_*\) in a finite number of steps.
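The full loop can be sketched on the gridworld example above. One illustrative choice here: \(\gamma\) is set slightly below 1 so that evaluating improper intermediate policies (e.g. the arbitrary initial "always move up" policy, which loops forever in the top row) still converges:

```python
import numpy as np

# Policy iteration on the 4x4 gridworld: corners 0 and 15 terminal,
# reward -1 per step, moves off the grid leave the state unchanged.
N = 4
TERMINAL = {0, N * N - 1}
GAMMA = 0.95  # < 1 so improper intermediate policies still have finite values

def step(s, a):
    """Deterministic transition for action a (0=up, 1=down, 2=left, 3=right)."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def q(s, a, V):
    """One-step lookahead q(s, a) under the deterministic model."""
    return -1 + GAMMA * V[step(s, a)]

def evaluate(policy, V, theta=1e-10):
    """Policy evaluation: iterate the Bellman expectation backup in place."""
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v = V[s]
            V[s] = q(s, policy[s], V)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

def policy_iteration():
    policy = np.zeros(N * N, dtype=int)  # arbitrary start: always move up
    V = np.zeros(N * N)
    while True:
        V = evaluate(policy, V)
        stable = True
        for s in range(N * N):
            if s in TERMINAL:
                continue
            best = max(range(4), key=lambda a: q(s, a, V))
            # Switch only on strict improvement, so ties cannot cause cycling
            if q(s, best, V) > q(s, policy[s], V):
                policy[s] = best
                stable = False
        if stable:
            return policy, V

policy, V = policy_iteration()
```

The strict-improvement check in the loop is one common way to make the "if the policy is unchanged, stop" test robust to ties between equally good actions.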
Value Iteration
One drawback of policy iteration is that each iteration requires a complete policy evaluation (multiple sweeps), which is computationally expensive. Value iteration truncates the policy evaluation to a single sweep while simultaneously incorporating policy improvement:
This is essentially the iterative form of the Bellman optimality equation.
Algorithm:
- Initialize \(V(s)\) to arbitrary values (terminal states to 0)
- Repeat until convergence:
- For each state \(s\):
- \(V(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\)
- After convergence, output the deterministic optimal policy for each state \(s\):
- \(\pi(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\)
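A minimal sketch on the gridworld example above, again assuming the row-major state encoding; note there is no separate evaluation phase, only repeated optimality backups:

```python
import numpy as np

# Value iteration on the 4x4 gridworld: one sweep of Bellman optimality
# backups per iteration, then greedy policy extraction at the end.
N = 4
TERMINAL = {0, N * N - 1}

def step(s, a):
    """Deterministic transition for action a (0=up, 1=down, 2=left, 3=right)."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def value_iteration(gamma=1.0, theta=1e-10):
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v = V[s]
            # Bellman optimality backup: max over actions instead of
            # an expectation under a fixed policy
            V[s] = max(-1 + gamma * V[step(s, a)] for a in range(4))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract the deterministic greedy policy from the converged values
    policy = {s: max(range(4), key=lambda a: -1 + gamma * V[step(s, a)])
              for s in range(N * N) if s not in TERMINAL}
    return V, policy

V, policy = value_iteration()
# With gamma = 1, V(s) converges to minus the number of steps to the
# nearest terminal corner.
```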
Relationship to policy iteration: Value iteration can be viewed as a special case of policy iteration in which policy improvement is performed after just one sweep of policy evaluation. It typically needs more sweeps than policy iteration needs outer iterations, but each sweep is far cheaper, so its total computation is often lower.
Policy Iteration vs. Value Iteration
| Dimension | Policy Iteration | Value Iteration |
|---|---|---|
| Per-iteration operation | Full policy evaluation + policy improvement | One-step Bellman optimality update |
| Convergence speed (iterations) | Typically fewer outer iterations | Typically more outer iterations |
| Computation per iteration | High (full policy evaluation) | Low (single sweep) |
| Total computation | Usually faster for small state spaces | Usually faster for large state spaces |
Generalized Policy Iteration (GPI)
GPI is a central concept that pervades all of reinforcement learning. It refers to the general idea of interleaving policy evaluation and policy improvement, without being tied to any specific implementation:
- Policy evaluation drives the value function toward the true value of the current policy
- Policy improvement makes the policy greedy with respect to the current value function
These two processes compete with and complement each other — evaluation and improvement can be complete (as in policy iteration), partial (as in value iteration), or interleaved (as in the TD methods discussed in later chapters).
Nearly all RL methods introduced later in the book can be understood through the GPI framework: Monte Carlo methods, TD learning, Sarsa, Q-learning, etc. They differ mainly in what method is used for policy evaluation (sampling vs. full expectations) and in the frequency of evaluation and improvement.
Limitations of DP
- Requires a complete environment model: The transition probabilities \(p(s',r|s,a)\) must be known, which is unavailable in most real-world problems
- High computational cost: Each update requires sweeping over all states, which becomes infeasible for large state spaces (the "curse of dimensionality")
- High memory requirements: The entire value function table must be stored
These limitations motivate the methods covered in subsequent chapters: Monte Carlo methods (model-free), TD learning (no need for complete episodes), and function approximation (handling large state spaces).
References
- Sutton & Barto, "Reinforcement Learning: An Introduction", 2nd Edition, Chapter 4