Dynamic Programming
Dynamic Programming (DP) refers to a collection of algorithms that can compute optimal policies given a perfect model of the environment (i.e., complete knowledge of the MDP). Classical DP algorithms have limited practical utility in reinforcement learning (due to their requirement for a perfect model and high computational cost), but they are of great theoretical importance — nearly all subsequent RL methods can be viewed as attempts to achieve the same effect as DP with less computation and without requiring a perfect environment model.
Prerequisites: Before starting this chapter, you should be familiar with the basic concepts of finite MDPs, including the state-value function \(v_\pi(s)\), the action-value function \(q_\pi(s,a)\), Bellman equations, etc. (see Finite MDPs).
Core Idea of DP
The core idea of DP is to use Bellman equations as update rules, turning the computation of value functions into an iterative update process.
Recall the Bellman expectation equation (state-value function version):

\[
v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi(s') \big]
\]
And the Bellman optimality equation:

\[
v_*(s) = \max_a \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big]
\]
The key assumption of DP methods is that we know the transition probabilities \(p(s', r | s, a)\), i.e., the complete environment model.
Policy Evaluation
Policy evaluation addresses the following problem: given a policy \(\pi\), compute its state-value function \(v_\pi\).
Iterative Policy Evaluation
We turn the Bellman expectation equation into an iterative update rule:

\[
v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_k(s') \big]
\]
Starting from an arbitrary initial value \(v_0\) (typically all zeros), we iterate until convergence. It can be shown that as \(k \to \infty\), \(v_k\) converges to \(v_\pi\).
Algorithm:
- Initialize \(V(s) = 0\) for all \(s \in \mathcal{S}\)
- Repeat until convergence (\(\Delta < \theta\), where \(\theta\) is a small positive threshold):
- \(\Delta \leftarrow 0\)
- For each state \(s\):
- \(v \leftarrow V(s)\)
- \(V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')]\)
- \(\Delta \leftarrow \max(\Delta, |v - V(s)|)\)
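The sweep above can be sketched in Python. The model format here (`P[s][a]` as a list of `(prob, next_state, reward)` triples, terminal states as empty dicts) and the function name are illustrative assumptions, not a fixed API:

```python
import numpy as np

def policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation with in-place updates.

    Assumed model format: P[s][a] is a list of (prob, next_state, reward)
    transitions; pi[s][a] is the probability of action a in state s.
    Terminal states are modeled as states with no actions (empty dict).
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:      # terminal state: value stays 0
                continue
            v = V[s]
            # Bellman expectation backup, using already-updated values
            # of other states (the in-place scheme described above)
            V[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                      for p, s2, r in trans)
                       for a, trans in actions.items())
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

# Tiny example MDP: from state 0, the single action loops back with
# probability 0.5 (reward 0) or reaches terminal state 1 (reward +1).
P = {0: {0: [(0.5, 0, 0.0), (0.5, 1, 1.0)]}, 1: {}}
pi = {0: {0: 1.0}}
V = policy_evaluation(P, pi, gamma=0.9)
# Analytically v(0) = 0.5 * 0.9 * v(0) + 0.5, i.e. v(0) = 0.5 / 0.55
```

The analytic fixed point gives a quick sanity check on the iteration.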
Update scheme: The algorithm above uses an "in-place update," meaning that when updating \(V(s)\), it uses the already-updated new values of other states. This typically converges faster than using old values for updates (which would require maintaining two arrays).
Gridworld Example
Consider a 4x4 gridworld where the top-left and bottom-right corners are terminal states. The agent has four equiprobable actions (up, down, left, right) in each non-terminal state, and every step yields a reward of -1.
Under a uniform random policy (probability 0.25 for each direction), iterative policy evaluation gradually computes the value of each state. States farther from the terminal states have lower values (since more steps are needed to reach a terminal state, accumulating more -1 rewards).
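A minimal sketch of this example, assuming the usual encoding (states 0 to 15 in row-major order, moves off the grid leave the state unchanged); the function and variable names are made up for illustration:

```python
import numpy as np

# 4x4 gridworld: states 0..15 row-major, corners 0 and 15 terminal,
# reward -1 on every step. Moves off the grid leave the state unchanged.
N = 4
TERMINAL = {0, N * N - 1}

def step(s, a):
    """Deterministic transition for action a (0=up, 1=down, 2=left, 3=right)."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def evaluate_random_policy(gamma=1.0, theta=1e-10):
    """In-place iterative policy evaluation of the uniform random policy."""
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v = V[s]
            # Each action has probability 0.25; the reward is always -1
            V[s] = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in range(4))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

V = evaluate_random_policy()
# Converged values match the classic figure: states adjacent to a terminal
# corner evaluate to -14, the two non-terminal corners to -22.
```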
Policy Improvement
Policy evaluation tells us how good a policy is, while policy improvement uses the value function to find a better policy.
Policy Improvement Theorem
For a policy \(\pi\), suppose a modified policy \(\pi'\) selects, in some state \(s\), an action \(a\) with \(q_\pi(s, a) \geq v_\pi(s)\), and otherwise follows \(\pi\). Then \(\pi'\) is at least as good as \(\pi\). More generally, if \(q_\pi(s, \pi'(s)) \geq v_\pi(s)\) for every state \(s\), then \(v_{\pi'}(s) \geq v_\pi(s)\) for all \(s\).
Specifically, given \(v_\pi\), we can construct a greedy policy \(\pi'\):

\[
\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi(s') \big]
\]
This greedy policy \(\pi'\) satisfies \(v_{\pi'}(s) \geq v_\pi(s)\) for all \(s\). Equality \(v_{\pi'} = v_\pi\) holds if and only if \(\pi\) is already optimal.
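The greedy construction is a one-step lookahead over the model. A minimal sketch, reusing the hypothetical `P[s][a]` list-of-`(prob, next_state, reward)` model format (the function name is also an assumption):

```python
import numpy as np

def greedy_policy(P, V, gamma=1.0):
    """Greedy policy improvement with respect to a value estimate V.

    Assumed model format: P[s][a] is a list of (prob, next_state, reward)
    transitions; terminal states (no actions) map to None.
    """
    pi = {}
    for s, actions in P.items():
        if not actions:
            pi[s] = None
            continue
        # One-step lookahead: q(s, a) = sum_{s', r} p(s', r | s, a)[r + gamma V(s')]
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
             for a, trans in actions.items()}
        pi[s] = max(q, key=q.get)
    return pi

# Two actions from state 0, both ending in terminal state 1: action 0 pays 0,
# action 1 pays +1, so the greedy policy must pick action 1.
P = {0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 1.0)]}, 1: {}}
pi = greedy_policy(P, np.zeros(2))
```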
Policy Iteration
Policy iteration alternates between policy evaluation (E) and policy improvement (I), progressively approaching the optimal policy:

\[
\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*
\]
Algorithm:
- Initialization: Arbitrarily initialize \(V(s)\) and \(\pi(s)\) for all \(s\)
- Policy Evaluation: Use iterative policy evaluation to compute \(V \approx v_\pi\) (until convergence)
- Policy Improvement: For each \(s\), set \(\pi(s) \leftarrow \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\)
- If the policy is unchanged, stop; otherwise, return to the Policy Evaluation step
Convergence: Since a finite MDP has only a finite number of policies and each policy improvement step strictly improves the policy (unless it is already optimal), policy iteration is guaranteed to converge to the optimal policy \(\pi_*\) in a finite number of steps.
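The full loop can be sketched on the gridworld example above. One illustrative choice here: \(\gamma\) is set slightly below 1 so that evaluating improper intermediate policies (e.g. the arbitrary initial "always move up" policy, which loops forever in the top row) still converges:

```python
import numpy as np

# Policy iteration on the 4x4 gridworld: corners 0 and 15 terminal,
# reward -1 per step, moves off the grid leave the state unchanged.
N = 4
TERMINAL = {0, N * N - 1}
GAMMA = 0.95  # < 1 so improper intermediate policies still have finite values

def step(s, a):
    """Deterministic transition for action a (0=up, 1=down, 2=left, 3=right)."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def q(s, a, V):
    """One-step lookahead q(s, a) under the deterministic model."""
    return -1 + GAMMA * V[step(s, a)]

def evaluate(policy, V, theta=1e-10):
    """Policy evaluation: iterate the Bellman expectation backup in place."""
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v = V[s]
            V[s] = q(s, policy[s], V)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

def policy_iteration():
    policy = np.zeros(N * N, dtype=int)  # arbitrary start: always move up
    V = np.zeros(N * N)
    while True:
        V = evaluate(policy, V)
        stable = True
        for s in range(N * N):
            if s in TERMINAL:
                continue
            best = max(range(4), key=lambda a: q(s, a, V))
            # Switch only on strict improvement, so ties cannot cause cycling
            if q(s, best, V) > q(s, policy[s], V):
                policy[s] = best
                stable = False
        if stable:
            return policy, V

policy, V = policy_iteration()
```

The strict-improvement check in the loop is one common way to make the "if the policy is unchanged, stop" test robust to ties between equally good actions.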
Value Iteration
One drawback of policy iteration is that each iteration requires a complete policy evaluation (multiple sweeps), which is computationally expensive. Value iteration truncates the policy evaluation to a single sweep while simultaneously incorporating policy improvement:
This is essentially the iterative form of the Bellman optimality equation.
Algorithm:
- Initialize \(V(s)\) to arbitrary values (terminal states to 0)
- Repeat until convergence:
- For each state \(s\):
- \(V(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\)
- After convergence, output the deterministic optimal policy for each state \(s\):
- \(\pi(s) = \arg\max_a \sum_{s',r} p(s',r|s,a)[r + \gamma V(s')]\)
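A minimal sketch on the gridworld example above, again assuming the row-major state encoding; note there is no separate evaluation phase, only repeated optimality backups:

```python
import numpy as np

# Value iteration on the 4x4 gridworld: one sweep of Bellman optimality
# backups per iteration, then greedy policy extraction at the end.
N = 4
TERMINAL = {0, N * N - 1}

def step(s, a):
    """Deterministic transition for action a (0=up, 1=down, 2=left, 3=right)."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def value_iteration(gamma=1.0, theta=1e-10):
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            v = V[s]
            # Bellman optimality backup: max over actions instead of
            # an expectation under a fixed policy
            V[s] = max(-1 + gamma * V[step(s, a)] for a in range(4))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract the deterministic greedy policy from the converged values
    policy = {s: max(range(4), key=lambda a: -1 + gamma * V[step(s, a)])
              for s in range(N * N) if s not in TERMINAL}
    return V, policy

V, policy = value_iteration()
# With gamma = 1, V(s) converges to minus the number of steps to the
# nearest terminal corner.
```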
Relationship to policy iteration: Value iteration can be viewed as a special case of policy iteration in which policy improvement is performed after just one sweep of policy evaluation. It typically needs more sweeps than policy iteration needs outer iterations, but each sweep is far cheaper, so its total computation is often lower.
Policy Iteration vs. Value Iteration
| Dimension | Policy Iteration | Value Iteration |
|---|---|---|
| Per-iteration operation | Full policy evaluation + policy improvement | One-step Bellman optimality update |
| Convergence speed (iterations) | Typically fewer outer iterations | Typically more outer iterations |
| Computation per iteration | High (full policy evaluation) | Low (single sweep) |
| Total computation | Usually faster for small state spaces | Usually faster for large state spaces |
Generalized Policy Iteration (GPI)
GPI is a central concept that pervades all of reinforcement learning. It refers to the general idea of interleaving policy evaluation and policy improvement, without being tied to any specific implementation:
- Policy evaluation drives the value function toward the true value of the current policy
- Policy improvement makes the policy greedy with respect to the current value function
These two processes compete with and complement each other — evaluation and improvement can be complete (as in policy iteration), partial (as in value iteration), or interleaved (as in the TD methods discussed in later chapters).
Nearly all RL methods introduced later in the book can be understood through the GPI framework: Monte Carlo methods, TD learning, Sarsa, Q-learning, etc. They differ mainly in what method is used for policy evaluation (sampling vs. full expectations) and in the frequency of evaluation and improvement.
Limitations of DP
- Requires a complete environment model: The transition probabilities \(p(s',r|s,a)\) must be known, which is unavailable in most real-world problems
- High computational cost: Each update requires sweeping over all states, which becomes infeasible for large state spaces (the "curse of dimensionality")
- High memory requirements: The entire value function table must be stored
These limitations motivate the methods covered in subsequent chapters: Monte Carlo methods (model-free), TD learning (no need for complete episodes), and function approximation (handling large state spaces).
References
- Sutton & Barto, "Reinforcement Learning: An Introduction", 2nd Edition, Chapter 4