
Policy Gradient

This chapter covers the basics of early policy gradient methods, the actor-critic framework, and the policy gradient theorem.

Fundamentals

The Policy-Based Approach

Let us first understand the basic idea behind policy gradient methods. In value-based methods, we are already familiar with the approach of maintaining a Q-table: for any state-action pair, we can look up a corresponding Q-value. Under this paradigm, once an agent is trained, it encounters a state, looks up all possible Q-values for that state, and selects the action with the highest Q-value. This is the value-based strategy.

Let us revisit the entire process using the MC method. Suppose we are still playing the grid-world game. The MC method runs several episodes, then for each state-action pair, it averages the returns observed across those episodes to serve as that pair's Q-value:

\[ Q(s, a) = \text{average of the returns } G \text{ observed for } (s, a) \text{ across episodes} \]

We need a Q-table to store these state-action Q-values, for example:

  • Q(s, up) = 10.5
  • Q(s, down) = -2.1

Now, we shift our thinking — this is the crucial conceptual transition from value-based methods to policy gradient. We throw away the Q-table and replace it with a policy gradient table (PG table) that stores the following:

  • H(s, up) = 0.5
  • H(s, down) = -0.1

Here, \(H\) is a preference value; in this tabular setting, the H entries are exactly the policy parameters \(\theta\). These numbers have no physical meaning whatsoever. They do not represent scores; they only represent the agent's "degree of preference" for a given action.

Suppose the agent finishes an episode and obtains a total return of \(G = 100\) via MC sampling. At state s, the agent happened to choose the action "up." Under the Q-table approach, the agent would check the Q-value, find that the stored value is only 10.5 — far from 100 — and adjust the Q-value upward. Ultimately, \(Q(s, a)\) converges to the average of all observed \(G\) values.

But under the policy gradient mindset, the agent does not care whether H(s, up) = 0.5 is too large or too small. The agent only looks at two things:

  1. The actual total return from the episode is G=100 — a positive reward, not bad.
  2. The action chosen was "up."

The policy gradient method then reasons: since the outcome was this good, let us increase the preference value H for "up" so that it is more likely to be selected next time. Where does the probability come from? The entries in the H-table are passed through a Softmax function to produce probabilities:

First, compute the exponentials:

  • \(e^{H(s, \text{up})} = e^{0.5} \approx 1.65\)
  • \(e^{H(s, \text{down})} = e^{-0.1} \approx 0.90\)

The sum is \(1.65 + 0.90 = 2.55\), and then we compute the probabilities:

  • \(\pi(\text{up} \mid s) = 1.65 / 2.55 \approx \mathbf{0.65}\)
  • \(\pi(\text{down} \mid s) = 0.90 / 2.55 \approx \mathbf{0.35}\)

For the selected action (up), the most basic preference update formula is:

\[ \Delta H = \alpha \cdot G \cdot (1 - \pi(\text{up} \mid s)) \]

Substituting in the numbers (assuming a learning rate of 0.1):

\[ \Delta H = 0.1 \cdot 100 \cdot (1 - 0.65) = 10 \cdot 0.35 = \mathbf{3.5} \]

Finally, update the preference table H:

\[ H(s, \text{up})_{\text{new}} = 0.5 + 3.5 = \mathbf{4.0} \]
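To make the arithmetic above concrete, here is a minimal Python sketch of the tabular softmax-plus-preference update, using the same numbers; the dictionary layout and variable names are ours, not a fixed API.

```python
import numpy as np

# Preference table for one state: H(s, up) = 0.5, H(s, down) = -0.1
H = {"up": 0.5, "down": -0.1}
alpha = 0.1   # learning rate
G = 100.0     # Monte Carlo return observed for this episode

# Softmax over the preferences gives the action probabilities
exps = {a: np.exp(h) for a, h in H.items()}
Z = sum(exps.values())
pi = {a: e / Z for a, e in exps.items()}
print(pi)          # {'up': ~0.65, 'down': ~0.35}

# Naive preference update for the action that was actually taken ("up"):
# push its preference up in proportion to G and to (1 - pi(up|s))
chosen = "up"
H[chosen] += alpha * G * (1.0 - pi[chosen])
print(H[chosen])   # ~4.04; the text rounds pi(up|s) to 0.65 and gets 4.0
```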

In this early, naive approach, as long as the reward is positive, we increase the preference. We can see that the essential difference between value-based methods and policy gradient methods is:

  • Value-based methods are fundamentally about regression — the goal is to make the stored numbers equal the true scores.
  • Policy gradient methods are fundamentally about optimization — the goal is to make the stored numbers as large as possible, as long as the corresponding actions lead to high scores.

Let us compare the two update formulas:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \mathbf{(G - Q(s, a))} \]
\[ H(s, a) \leftarrow H(s, a) + \alpha \cdot \mathbf{G} \cdot \mathbf{\nabla \log \pi} \]

We can observe that:

  • The driving force in the value-based approach is the error: when Q = G, the update stops.
  • In policy gradient, there is no subtraction. G serves merely as a weight — as long as G is positive (in the naive algorithm), it keeps pushing the preference value H upward.

This distinction may not matter much in small-scale tasks like grid-world with few actions, but in complex tasks, the policy gradient approach has the following significant advantages:

  1. A Q-table cannot store Q-values for continuous actions, such as a car's steering angle. Even if it could, selecting the action with the highest Q-value from a huge action set would be computationally expensive. Policy gradient methods only need to output a normal distribution (mean \(\mu\) and standard deviation \(\sigma\)) and adjust \(\mu\) and \(\sigma\) via gradients to shift probability mass toward good actions (e.g., steer left by 30%); a minimal sketch follows this list.
  2. In Q-learning, a small update to the Q-values can abruptly change which action is greedy, so the selected action can flip back and forth during training. Policy gradient adjusts action probabilities gradually and smoothly, which leads to more stable learning.
  3. In tasks like rock-paper-scissors, Q-learning always tries to find the single action with the highest Q-value, making it predictable to opponents. If policy gradient discovers that maintaining a uniform 33% distribution maximizes the expected total return G, it will stay there.
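To make point 1 above concrete, here is a hedged sketch of a Gaussian policy for a one-dimensional continuous action such as a steering angle. The parameterization (mu, log_sigma) and the naive return-weighted update are illustrative choices, not something prescribed by the text.

```python
import numpy as np

# Policy parameters theta = (mu, log_sigma); names are ours
mu, log_sigma = 0.0, 0.0
alpha, G = 0.01, 100.0            # learning rate and a sampled episode return

sigma = np.exp(log_sigma)
a = np.random.normal(mu, sigma)   # sample a continuous action from N(mu, sigma^2)

# Gradients of log N(a | mu, sigma^2) with respect to mu and log_sigma
grad_mu = (a - mu) / sigma**2
grad_log_sigma = (a - mu)**2 / sigma**2 - 1.0

# Naive policy-gradient step: make the sampled action more likely, weighted by G
mu += alpha * G * grad_mu
log_sigma += alpha * G * grad_log_sigma
```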

For the most naive policy gradient, we can initialize an H-table to store preference values. \(H\) can be any real number, from \(-\infty\) to \(+\infty\). \(H\) has no direct physical meaning. It is neither a reward nor a step count.

For example, a value of \(10.5\) in the H-table simply means "I strongly prefer this action." Why the preference? Because previously, when this action was selected, \(G\) was large and pushed it to this level.

Probabilities are typically computed on the fly. For instance, when the agent is in state s, it performs the following operations:

  1. Lookup: Extract the entire row for that state from the H-table, e.g., [H(up)=2.0, H(down)=1.0, H(left)=0.0, H(right)=0.0].
  2. Softmax transformation: Run Softmax over this row to produce a probability distribution that sums to 1: [P(up)≈0.61, P(down)≈0.22, P(left)≈0.08, P(right)≈0.08].
  3. Sampling: Roll the dice according to this probability distribution to select an action.
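As a small sketch, the three operations above might look like this in Python, with the H row taken from the example in step 1.

```python
import numpy as np

# Step 1, lookup: the H-table row for state s
actions = ["up", "down", "left", "right"]
H_row = np.array([2.0, 1.0, 0.0, 0.0])

# Step 2, softmax: turn preferences into a probability distribution
probs = np.exp(H_row) / np.exp(H_row).sum()   # ~[0.61, 0.22, 0.08, 0.08]

# Step 3, sampling: roll the dice in proportion to these probabilities
a = np.random.choice(actions, p=probs)
```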

In policy gradient methods, we do not use \(\epsilon\)-greedy. In Q-learning, the policy is "hard" (all-or-nothing). Without \(\epsilon\), it rigidly selects only the current highest-scoring action, potentially never discovering a better path. In policy gradient methods, the policy is "soft" (probabilistic). The probability distribution itself serves the role of \(\epsilon\).

The sampling step may seem slightly counterintuitive: sampling is proportional, like a lottery wheel divided according to each action's probability. If we initialize the H-table to zeros, all actions start out equally likely. As the preferences diverge, policy gradient tends to favor the highest-probability action, but it never selects that action deterministically; it always respects the current probability distribution.

This lottery-style sampling is also what keeps the method differentiable: argmax is non-differentiable, whereas the probability assigned to each sampled action is a smooth function of the preferences, so gradients can flow through \(\log \pi\). This is why policy gradient does not rely on \(\epsilon\)-greedy for exploration.

REINFORCE

REINFORCE is the oldest and purest member of the policy gradient family. Its essence is Monte Carlo + policy gradient.

The main steps are as follows:

  1. Generate a trajectory: Using the current policy \(\pi_\theta\) (your lottery wheel), play a complete game (one episode) from start to finish. You obtain a sequence of records:
\[ (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T) \]
  2. Compute returns: For each step \(t\) in this episode, calculate the discounted total score from that step to the end:
\[ G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k \]
  3. Gradient update: For every action taken in this episode, adjust the parameters \(\theta\) based on its corresponding \(G_t\):
\[ \theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta(a_t|s_t) \]
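The loop below is a hedged sketch of these three steps for a tabular softmax policy. The environment interface (reset() returning a state index, step(a) returning (next_state, reward, done)) and the table shapes are assumptions, not part of the text.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def reinforce_episode(env, H, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a single episode.

    H is an (n_states, n_actions) preference table (the parameters theta);
    env is assumed to offer reset() -> s and step(a) -> (s_next, r, done).
    """
    n_actions = H.shape[1]

    # 1. Generate a trajectory with the current policy
    s, done, traj = env.reset(), False, []
    while not done:
        pi = softmax(H[s])
        a = np.random.choice(n_actions, p=pi)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next

    # 2. Compute the returns G_t backwards through the trajectory
    G, returns = 0.0, []
    for _, _, r in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # 3. Gradient ascent: for a softmax policy, grad log pi = one_hot(a) - pi
    for t, ((s, a, _), G_t) in enumerate(zip(traj, returns)):
        grad_log_pi = -softmax(H[s])
        grad_log_pi[a] += 1.0
        H[s] += alpha * (gamma ** t) * G_t * grad_log_pi
    return H
```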

Note that the formula uses \(\nabla \log \pi\) (the gradient of the log-probability), not \(\nabla \pi\) (the gradient of the probability). This is because:

\[ \nabla \log \pi_\theta(a|s) = \frac{\nabla \pi_\theta(a|s)}{\pi_\theta(a|s)} \]

The intuition is: if an action already has a high probability (say 0.9), it will naturally be selected more often during sampling. Without dividing by \(0.9\), its preference value would grow disproportionately simply due to its high "appearance rate," even if its performance is mediocre. Dividing by \(\pi\) cancels out the effect of appearance frequency, allowing the algorithm to focus on whether the action truly delivered above-average returns.

Although REINFORCE is elegant in its purity, it has a fatal weakness: it is extremely unstable. REINFORCE relies on MC sampling, and each episode can vary wildly: with good luck, the agent stumbles upon treasure and all actions get rewarded; with bad luck, the agent dies just before the finish line, and all actions get penalized despite earlier good decisions. The consequence is that the parameter \(\theta\) updates bounce around erratically, requiring tens of thousands of training episodes to converge.

Policy Gradient Theorem

In the earlier discussion of REINFORCE, we presented the policy gradient update formula intuitively without rigorously deriving why it works. The policy gradient theorem provides this rigorous mathematical foundation.

Formal Statement

We define the performance measure \(J(\boldsymbol{\theta})\) as the expected return of policy \(\pi_\theta\). The policy gradient theorem states:

\[ \nabla_\theta J(\boldsymbol{\theta}) = \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t | S_t) \cdot Q^\pi(S_t, A_t) \right] \]

The profound insight of this theorem is that the gradient of the performance measure \(J\) with respect to parameters \(\boldsymbol{\theta}\) can be expressed as the expected product of the policy gradient \(\nabla \log \pi\) and the action-value function \(Q^\pi\), without needing to differentiate through the state distribution. This is crucial because the state distribution \(d^\pi(s)\) itself depends on \(\boldsymbol{\theta}\) (changing the policy changes the frequency with which states are visited), yet the policy gradient theorem elegantly avoids the need to differentiate \(d^\pi(s)\).

Proof Sketch (Likelihood Ratio Trick)

The core technique in the proof is the likelihood ratio trick (also called the log-derivative trick):

For any policy \(\pi_\theta(a|s)\) that is differentiable in \(\theta\), we have:

\[ \nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \cdot \nabla_\theta \log \pi_\theta(a|s) \]

This follows from:

\[ \nabla_\theta \log \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \]

Multiplying both sides by \(\pi_\theta(a|s)\) yields the identity above. The elegance of this identity is that it transforms the derivative of a probability into the derivative of a log-probability multiplied by the probability itself — the latter can be estimated via sampling.

Using this trick, we can derive the policy gradient. Consider the performance measure in the single-step case:

\[ J(\boldsymbol{\theta}) = \sum_s d^\pi(s) \sum_a \pi_\theta(a|s) Q^\pi(s, a) \]

Taking the gradient with respect to \(\boldsymbol{\theta}\) (considering only the dependence through \(\pi_\theta\); the dependence through the state distribution can be handled via a more refined analysis):

\[ \nabla_\theta J(\boldsymbol{\theta}) = \sum_s d^\pi(s) \sum_a \nabla_\theta \pi_\theta(a|s) \cdot Q^\pi(s, a) \]

Applying the likelihood ratio trick:

\[ = \sum_s d^\pi(s) \sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s, a) \]
\[ = \mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s, a) \right] \]

This is the core of the policy gradient theorem. The final step converts the summation into an expectation, meaning we can estimate this gradient through sampling.
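One way to see the theorem in action is a small numerical check: for a single-state softmax policy with a fixed (illustrative) Q vector, the Monte Carlo average of \(\nabla_\theta \log \pi \cdot Q\) should match the analytic gradient of \(J(\theta) = \sum_a \pi_\theta(a) Q(a)\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.3, 0.1])   # preferences for 3 actions in one state
Q = np.array([1.0, 2.0, 0.5])        # fixed action values (illustrative)

pi = np.exp(theta) / np.exp(theta).sum()

# Analytic gradient of J(theta) = sum_a pi(a) Q(a) for a softmax policy:
# dJ/dtheta_i = pi_i * (Q_i - sum_a pi_a Q_a)
analytic = pi * (Q - pi @ Q)

# Sampling estimate of E_{a ~ pi}[ grad log pi(a) * Q(a) ]
est = np.zeros_like(theta)
n = 100_000
for _ in range(n):
    a = rng.choice(3, p=pi)
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0
    est += grad_log_pi * Q[a]
est /= n

print(analytic)   # the two vectors agree up to sampling noise
print(est)
```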

Revisiting REINFORCE

With the policy gradient theorem in hand, the REINFORCE update formula now has rigorous theoretical backing. REINFORCE uses the MC-sampled return \(G_t\) to approximate \(Q^\pi(S_t, A_t)\):

\[ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta(A_t | S_t) \]

Since \(\mathbb{E}[G_t | S_t, A_t] = Q^\pi(S_t, A_t)\), \(G_t\) is an unbiased estimate of \(Q^\pi\). This means the REINFORCE update direction is, in expectation, aligned with \(\nabla J(\boldsymbol{\theta})\) — it is an unbiased but high-variance gradient estimate.

Baseline

The high variance of REINFORCE can be mitigated by introducing a baseline. The policy gradient theorem allows us to subtract any function \(b(s)\) that depends only on the state \(s\) from \(Q^\pi\), without changing the expected value of the gradient:

\[ \nabla_\theta J(\boldsymbol{\theta}) = \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot (Q^\pi(s, a) - b(s)) \right] \]

Why does subtracting \(b(s)\) not introduce bias? Because:

\[ \sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s) \cdot \nabla_\theta \underbrace{\sum_a \pi_\theta(a|s)}_{=1} = 0 \]

The sum of probabilities is always 1, and its gradient is always 0. Therefore, the \(b(s)\) term vanishes in expectation.
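A quick numerical check of this identity for a three-action softmax policy (the numbers are arbitrary): the expectation of \(\nabla_\theta \log \pi\) times any constant baseline is the zero vector.

```python
import numpy as np

theta = np.array([0.7, -1.2, 0.3])          # arbitrary preferences
pi = np.exp(theta) / np.exp(theta).sum()
b = 42.0                                     # any baseline value

# sum_a pi(a) * grad_theta log pi(a) * b, with grad log pi(a) = one_hot(a) - pi
total = sum(pi[a] * (np.eye(3)[a] - pi) * b for a in range(3))
print(total)   # ~[0, 0, 0]: the baseline contributes nothing in expectation
```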

The most natural choice for the baseline is the state-value function \(V^\pi(s)\), in which case \(Q^\pi(s, a) - V^\pi(s) = A^\pi(s, a)\) is precisely the advantage function. The advantage function measures "how much better is action \(a\) in state \(s\) compared to the average."

The REINFORCE update with a baseline becomes:

\[ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \gamma^t (G_t - b(S_t)) \nabla_\theta \log \pi_\theta(A_t | S_t) \]

Intuitively: without a baseline, all actions with positive returns get reinforced, even if some returns are far below average. With a baseline, only actions whose returns exceed the average are reinforced, while below-average actions are suppressed. This substantially reduces the variance of the gradient estimate.
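Relative to the plain REINFORCE sketch earlier, only the gradient step changes. A hedged variant with a per-state running-average baseline (the table b, its learning rate beta, and the reuse of the earlier softmax helper are our own additions) might look like this.

```python
import numpy as np

def reinforce_with_baseline_update(traj, returns, H, b,
                                   alpha=0.01, beta=0.1, gamma=0.99):
    """Gradient step of REINFORCE with a per-state running-average baseline b.

    traj and returns are produced as in the earlier REINFORCE sketch;
    b has shape (n_states,). All names here are illustrative.
    """
    for t, ((s, a, _), G_t) in enumerate(zip(traj, returns)):
        b[s] += beta * (G_t - b[s])          # nudge the baseline toward G_t
        pi = np.exp(H[s] - H[s].max())
        pi /= pi.sum()
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        # weight by (G_t - b), not G_t: below-average returns now push H down
        H[s] += alpha * (gamma ** t) * (G_t - b[s]) * grad_log_pi
    return H, b
```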


Actor-Critic Framework

Motivation: From REINFORCE to Actor-Critic

Although REINFORCE with Baseline reduces variance, it still uses MC sampling to estimate \(G_t\) and must wait until the episode ends to perform updates. More importantly, the MC estimate still has high variance — a single episode's sample is too heavily influenced by randomness.

The core idea of the Actor-Critic framework is: replace MC sampling with a learned value function (the Critic), thereby further reducing variance while enabling online updates (no need to wait for the episode to end).

Dual-Component Architecture

Actor-Critic consists of two components:

Actor: The policy network \(\pi_\theta(a|s)\), responsible for "making decisions." It outputs a probability distribution over actions given the current state. Its parameters are \(\boldsymbol{\theta}\).

Critic: The value network \(\hat{v}(s, \mathbf{w})\) (or \(\hat{q}(s, a, \mathbf{w})\)), responsible for "scoring." It evaluates how good the Actor's current policy is. Its parameters are \(\mathbf{w}\).

Their interaction can be likened to: the actor performs on stage (Actor selects actions), the critic scores from the audience (Critic evaluates value), and the actor adjusts the performance based on the critic's feedback (Actor updates the policy based on Critic's evaluation).

Update Rules

Critic update: The value function is updated using the TD Error (identical to what was covered in the TD Learning chapter):

\[ \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \]
\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha_w \delta_t \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \]

Actor update: The TD Error from the Critic is used as the policy gradient signal:

\[ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha_\theta \delta_t \nabla_\theta \log \pi_\theta(A_t | S_t) \]

The key insight here is that we replace \(G_t\) from REINFORCE with \(\delta_t\). The TD Error \(\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}) - \hat{v}(S_t)\) is actually an estimate of the advantage function \(A^\pi(S_t, A_t)\). Therefore, Actor-Critic is essentially performing policy gradient updates using an estimated advantage function.

Advantages of Actor-Critic over REINFORCE:

  • Online updates: Updates can be performed at every step, without waiting for the episode to end.
  • Lower variance: The Critic's estimates are much more stable than MC sampling.
  • Trade-off: This introduces bias (since the Critic's estimate is imperfect), but in practice, this bias-variance trade-off is usually worthwhile.

Advantage Actor-Critic (A2C)

A2C is an important variant of the Actor-Critic framework that explicitly uses the advantage function to update the Actor:

\[ A(S_t, A_t) = Q(S_t, A_t) - V(S_t) \]

In practice, we do not need to maintain separate \(Q\) and \(V\) networks. We use the TD Error as an estimate of the advantage function:

\[ A(S_t, A_t) \approx \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \]

The complete A2C algorithm:

Initialize Actor parameters theta and Critic parameters w
For each episode:
    Initialize state S
    For each step of the episode:
        Sample action A according to pi_theta(.|S)
        Execute A, observe R, S'

        Compute the TD error (advantage estimate):
            delta = R + gamma * v_hat(S', w) - v_hat(S, w)
            (if S' is a terminal state, then delta = R - v_hat(S, w))

        Update the Critic:
            w = w + alpha_w * delta * grad_w v_hat(S, w)

        Update the Actor:
            theta = theta + alpha_theta * delta * grad_theta log pi_theta(A|S)

        S <- S'
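A hedged Python rendering of this loop with a tabular Critic and a softmax Actor; the environment interface (reset/step) and the learning rates are assumptions on our part.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def a2c_tabular(env, n_states, n_actions, episodes=500,
                alpha_w=0.1, alpha_theta=0.01, gamma=0.99):
    """One-step Actor-Critic with tabular parameters.

    Assumes a gym-style env: reset() -> s, step(a) -> (s_next, r, done).
    """
    theta = np.zeros((n_states, n_actions))   # Actor: preference table
    w = np.zeros(n_states)                    # Critic: state values v_hat(s)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            pi = softmax(theta[s])
            a = np.random.choice(n_actions, p=pi)
            s_next, r, done = env.step(a)

            # TD error (advantage estimate); no bootstrap at terminal states
            target = r + (0.0 if done else gamma * w[s_next])
            delta = target - w[s]

            # Critic update (for a table, grad_w v_hat(S, w) is one-hot at S)
            w[s] += alpha_w * delta

            # Actor update: grad log pi = one_hot(a) - pi for a softmax policy
            grad_log_pi = -pi
            grad_log_pi[a] += 1.0
            theta[s] += alpha_theta * delta * grad_log_pi

            s = s_next
    return theta, w
```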

Connection to Deep Reinforcement Learning

The Actor-Critic framework is the cornerstone of modern deep reinforcement learning. Nearly all advanced deep RL algorithms can be viewed as variants of Actor-Critic:

  • A3C / A2C: Asynchronous/synchronous parallel training of Actor-Critic, using multiple environment instances to collect data simultaneously.
  • PPO (Proximal Policy Optimization): Builds on Actor-Critic by introducing a clipped objective that limits the magnitude of each policy update, preventing the policy from drifting too far.
  • SAC (Soft Actor-Critic): Adds an entropy regularization term to the objective function, encouraging the policy to maintain exploration; also uses two Critic networks to mitigate overestimation.
  • DDPG / TD3: Actor-Critic methods designed for continuous action spaces, where the Actor outputs deterministic actions rather than probability distributions.

What these methods share in common is that they all maintain an Actor (policy network) and a Critic (value network), using the Critic's feedback to guide the Actor's learning.


Summary of Policy Gradient Methods

| Method | Gradient Signal | Requires Critic | Bias | Variance | Online Updates |
| --- | --- | --- | --- | --- | --- |
| REINFORCE | \(G_t\) (MC return) | No | Unbiased | High | No (must wait for episode end) |
| REINFORCE with Baseline | \(G_t - b(S_t)\) | Requires baseline \(b\) | Unbiased | Medium | No (must wait for episode end) |
| Actor-Critic | \(\delta_t\) (TD Error) | Yes | Biased | Low | Yes (per-step updates) |
| A2C | \(\delta_t\) (Advantage estimate) | Yes | Biased | Low | Yes (per-step updates) |

This table reveals the evolutionary trajectory of policy gradient methods:

  1. REINFORCE: The purest form — directly uses MC returns for gradient estimation. Unbiased but with extremely high variance, leading to unstable training.
  2. REINFORCE with Baseline: Introduces a baseline to reduce variance, but still relies on MC sampling.
  3. Actor-Critic: Replaces MC sampling with a Critic, trading a small amount of bias for a substantial reduction in variance, while enabling online updates.
  4. A2C and beyond: Builds on Actor-Critic with various techniques (parallel data collection, trust region constraints, entropy regularization, etc.), establishing the standard paradigm for modern deep RL.

The central tension throughout this evolutionary path is the bias-variance trade-off: MC sampling is unbiased but high-variance; bootstrapping is low-variance but biased. The history of policy gradient methods is a story of continually finding better balance points within this trade-off.

