A Complete Guide to Policy Gradient Methods
Policy Gradient (PG) is a major paradigm in reinforcement learning, standing alongside value-based methods. While DQN learns Q-values to indirectly derive a policy, policy gradient methods directly parameterize the policy \(\pi_\theta(a|s)\) and optimize it via gradient ascent. From the most basic REINFORCE to PPO and SAC, all policy gradient algorithms share the same underlying pipeline. This note dissects that pipeline end-to-end, covering every stage from data collection to policy updates.
Design Principle: The Bias-Variance Tradeoff
The bias-variance tradeoff is the central tension running through the entire policy gradient pipeline. Understanding this point is key to understanding why so many seemingly different PG variants exist — they are all searching for different balance points on the same seesaw.
Why This Is the Core Tension in RL
In supervised learning, the bias-variance tradeoff manifests in the choice of model complexity. In RL, however, the tension is more fundamental — it arises at the return estimation step.
The core formula of policy gradients is:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \Psi_t\right]
\]
where \(\Psi_t\) is some form of "signal" that tells the gradient "how good this action was." The fundamental difference between PG methods lies in the choice of \(\Psi_t\).
MC vs TD: Two Extremes
Monte Carlo Return:
\[
G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}
\]
- Zero bias: \(G_t\) is an unbiased estimate of the return because it uses the complete sequence of actual rewards without relying on any function approximation.
- High variance: \(G_t\) depends on all sources of randomness from time step \(t\) to the end of the episode (stochasticity in both environment transitions and policy choices). The more random factors involved, the larger the fluctuations in the estimate.
TD Target:
\[
y_t^{\text{TD}(0)} = r_t + \gamma V(s_{t+1}), \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\]
- Low variance: \(\delta_t\) depends on only one step of randomness (a single reward \(r_t\) and a single transition \(s_{t+1}\)), resulting in much smaller fluctuations.
- Biased: \(V(s_{t+1})\) is a function approximation (the output of a neural network) and does not equal the true state value, thus introducing bias.
Bias-Variance Tradeoff Spectrum
(High bias, Low variance)                           (Zero bias, High variance)
     │                                                          │
   TD(0)         n-step TD             GAE(λ)                  MC
    δ_t       r+γr'+...+γⁿV           Σ(γλ)^l δ                G_t
     │                                                          │
     └──────────────────────────────────────────────────────────┘
              GAE's λ parameter slides along this spectrum
Each choice along this spectrum corresponds to a different PG variant. GAE (Generalized Advantage Estimation) uses its parameter \(\lambda\) to slide freely along this spectrum and is currently the most widely adopted approach.
Why We Cannot Have It Both Ways
Ideally, we would want \(\Psi_t\) to be both unbiased and low-variance. But these two properties are in conflict under finite samples:
- Reducing bias requires more real information (longer rollouts), which introduces more randomness and thus increases variance.
- Reducing variance requires using function approximation to "smooth" the estimates, but the approximation itself is imperfect and thus increases bias.
The entire PG pipeline — from data collection to advantage estimation to policy updates — is designed around navigating this tension.
Sampling and Data Collection (Rollout)
Policy gradient is a sampling-based method. We cannot compute the expectation \(\mathbb{E}_{\tau \sim \pi_\theta}[\cdot]\) exactly; instead, we approximate it by sampling trajectories from the policy \(\pi_\theta\). This step is called the Rollout.
What Is a Rollout / Trajectory
A trajectory \(\tau\) is the complete record of an agent's interaction with the environment:
\[
\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)
\]
The probability of a trajectory is jointly determined by the policy and the environment dynamics:
\[
p(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)
\]
where \(p(s_0)\) is the initial state distribution, and \(p(s_{t+1}|s_t,a_t)\) is the environment transition probability (unknown in the model-free setting).
Rollout refers to executing this process: interacting with the environment using the current policy \(\pi_\theta\) to collect a batch of trajectory data.
On-Policy vs Off-Policy Data Collection
On-Policy:
Data must be generated by the current policy \(\pi_\theta\). After each update to the policy parameters, old data must be discarded and new data sampled from the updated policy.
- Pros: Unbiased gradient estimates, strong theoretical guarantees
- Cons: Extremely low sample efficiency — each batch of data is used only once (or a few times, as in PPO)
Off-Policy:
Data can come from any policy (called the behavior policy \(\mu\)), with importance sampling used to correct for the distributional mismatch.
- Pros: Historical data can be reused, leading to high sample efficiency
- Cons: Importance weights can have high variance, requiring additional techniques (e.g., clipping, truncation) to stabilize training
PPO occupies a clever middle ground: it operates within an on-policy framework but uses importance sampling to allow multiple update passes over the same batch of data (typically 3–10 epochs), with clipping to limit how far the new policy can diverge from the old one.
Vectorized Environments
To improve data collection efficiency for on-policy methods, modern implementations typically run multiple environment instances in parallel:
┌─────────────────────────────────────────────────────┐
│                 Policy network π_θ                   │
│                (one shared network)                  │
└───────┬──────┬──────┬──────┬──────┬──────┬──────────┘
        │      │      │      │      │      │
       ┌▼┐    ┌▼┐    ┌▼┐    ┌▼┐    ┌▼┐    ┌▼┐
       │E│    │E│    │E│    │E│    │E│    │E│    N parallel environments
       │1│    │2│    │3│    │4│    │5│    │6│
       └┬┘    └┬┘    └┬┘    └┬┘    └┬┘    └┬┘
        │      │      │      │      │      │
        └──────┴──────┴──────┴──────┴──────┘
                         │
                  Rollout Buffer
               (N × T transitions)
Each environment runs independently but shares the same policy network. This way, a single rollout collects \(N \times T\) transitions (\(N\) environments, each running for \(T\) steps), greatly increasing data throughput.
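For concreteness, a minimal sketch using Gymnasium's vector API (an assumption for illustration only; the generic pseudocode later in this note uses an abstract `VectorizedEnv` wrapper instead):

import gymnasium as gym

N, T = 8, 128                                    # number of envs, rollout length per env
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(N)]
)

obs, info = envs.reset(seed=0)                   # obs has shape (N, obs_dim)
for t in range(T):
    actions = envs.action_space.sample()         # placeholder for sampling from π_θ(a|s)
    obs, rewards, terminated, truncated, infos = envs.step(actions)
    # store (obs, actions, rewards, terminated | truncated, ...) in the rollout buffer
envs.close()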
Data Structure of the Trajectory Buffer
The data collected during a rollout is typically stored in the following structure (using PPO as an example):
| Field | Shape | Description |
|---|---|---|
| `states` | \((N \times T, \text{obs\_dim})\) | Observed states |
| `actions` | \((N \times T, \text{act\_dim})\) | Actions taken |
| `rewards` | \((N \times T,)\) | Immediate rewards from the environment |
| `dones` | \((N \times T,)\) | Whether the episode terminated |
| `log_probs` | \((N \times T,)\) | \(\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) |
| `values` | \((N \times T,)\) | \(V_\phi(s_t)\), the Critic's value estimates |
Note that log_probs and values are computed using the old parameters at data collection time; they will later be used to compute importance weights and advantage estimates.
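A minimal sketch of such a buffer, with field names following the table above (the `compute_gae` and `get_minibatches` methods used later in this note are omitted for brevity):

import numpy as np

class RolloutBuffer:
    """Minimal on-policy buffer holding one rollout of N*T transitions."""
    def __init__(self, size, obs_dim, act_dim):
        self.states    = np.zeros((size, obs_dim), dtype=np.float32)
        self.actions   = np.zeros((size, act_dim), dtype=np.float32)
        self.rewards   = np.zeros(size, dtype=np.float32)
        self.dones     = np.zeros(size, dtype=np.float32)
        self.log_probs = np.zeros(size, dtype=np.float32)   # log pi_old(a|s)
        self.values    = np.zeros(size, dtype=np.float32)   # V_old(s)
        self.ptr = 0

    def store(self, s, a, r, done, log_prob, value):
        i = self.ptr
        self.states[i], self.actions[i] = s, a
        self.rewards[i], self.dones[i] = r, done
        self.log_probs[i], self.values[i] = log_prob, value
        self.ptr += 1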
Return and Target Computation
After collecting data, the first step is to compute a "signal" for each time step that tells the policy gradient how good that action was. Different computation methods yield different bias-variance characteristics.
Monte Carlo Return
The most straightforward approach: wait until an episode ends, then compute the discounted cumulative reward from the current time step to the end.
\[
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{T-t} \gamma^k r_{t+k}
\]
where \(\gamma \in [0, 1)\) is the discount factor:
- \(\gamma = 0\): Only cares about immediate reward — completely myopic
- \(\gamma \to 1\): Weighs all future rewards nearly equally — extremely far-sighted
- Common value: \(\gamma = 0.99\) (effectively considers roughly the next 100 steps)
Recursive computation: Computing backwards from the end of the trajectory is most efficient, using \(G_t = r_t + \gamma G_{t+1}\) with \(G_T = r_T\).
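As a sketch, the backward recursion in NumPy for a single completed episode:

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} backwards over one episode."""
    G = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# e.g. discounted_returns(np.array([1.0, 1.0, 1.0]), gamma=0.9) -> [2.71, 1.9, 1.0]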
Properties of MC Return:
- Unbiased: \(\mathbb{E}[G_t | s_t] = V^{\pi}(s_t)\) (under policy \(\pi\))
- High variance: \(\text{Var}[G_t]\) grows with the episode length, since every additional step contributes its own reward and transition noise
- Requires waiting until the episode ends to compute (not suitable for continuing tasks)
TD Target
TD(0) Target: Uses only one step of real reward and bootstraps the remainder with a value function:
\[
y_t^{\text{TD}(0)} = r_t + \gamma V(s_{t+1})
\]
- Low variance: Involves only one step of randomness
- Biased: \(V(s_{t+1})\) is a function approximation, so \(\mathbb{E}[y_t^{\text{TD}(0)}] \neq V^\pi(s_t)\) (unless \(V\) happens to equal the true value function)
The TD target is central to DQN and many off-policy methods. In the on-policy PG pipeline, it serves as the building block for GAE.
n-step Return
The n-step return is a compromise between MC and TD(0): it uses \(n\) steps of real rewards plus a bootstrapped value at step \(t+n\).
\[
G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})
\]
- \(n = 1\): Reduces to the TD(0) target \(r_t + \gamma V(s_{t+1})\)
- \(n = T - t\): Reduces to the MC return (no more bootstrapping)
- Larger \(n\) means less bias but more variance
Bias-Variance Analysis:
The bias comes from the gap between \(V(s_{t+n})\) and the true value, but it is multiplied by \(\gamma^n\), so its contribution shrinks geometrically as \(n\) grows. Meanwhile, variance increases with \(n\) because more steps of stochastic rewards are included.
n-step Return Illustration
 t      t+1     t+2     t+3     t+4    ...     T
 |───────|───────|───────|───────|───────|──────|
 r_t    r_{t+1} r_{t+2} r_{t+3}

 n=1:   r_t + γV(s_{t+1})                          ← heavy bootstrapping, low variance, high bias
 n=2:   r_t + γr_{t+1} + γ²V(s_{t+2})
 n=3:   r_t + γr_{t+1} + γ²r_{t+2} + γ³V(s_{t+3})
 ...
 n=T-t: r_t + γr_{t+1} + ... + γ^{T-t}r_T          ← pure MC, high variance, zero bias
In practice, using a fixed \(n\) is not flexible enough. GAE provides a more elegant solution by computing an exponentially weighted average over all \(n\)-step returns.
Value Estimation (Critic Learning)
In the Actor-Critic architecture, the Critic network \(V_\phi(s)\) is responsible for estimating the state value function. It serves two purposes: (1) providing a baseline to reduce the variance of the policy gradient, and (2) providing bootstrap targets for computing TD errors and GAE.
TD Error
The TD error (Temporal Difference Error) is the core training signal for the Critic:
\[
\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
\]
Intuitive interpretation: The TD error measures the degree of "surprise." If \(\delta_t > 0\), the actual return (\(r_t + \gamma V_\phi(s_{t+1})\)) was better than expected (\(V_\phi(s_t)\)) — a "positive surprise." The opposite indicates a "negative surprise."
The Critic's objective is to drive \(\delta_t \to 0\), i.e., to make the value estimates consistent with actual returns.
Fitting \(V(s)\) with a Neural Network
The Critic network \(V_\phi(s)\) is a neural network with parameters \(\phi\) that takes a state \(s\) as input and outputs a scalar value.
Loss function: Minimize the mean squared error between the value estimate and the target value:
\[
L(\phi) = \mathbb{E}_t\left[\big(V_\phi(s_t) - V_t^{\text{target}}\big)^2\right]
\]
The target value \(V_t^{\text{target}}\) can be set to:
- MC Return: \(V_t^{\text{target}} = G_t\) (unbiased but high variance)
- TD Target: \(V_t^{\text{target}} = r_t + \gamma V_{\phi_{\text{old}}}(s_{t+1})\) (low variance but biased)
- GAE Return: \(V_t^{\text{target}} = \hat{A}_t^{\text{GAE}} + V_{\phi_{\text{old}}}(s_t)\) (common practice in PPO)
Target Network Stability
In DQN, the target network is a "frozen" copy of the network maintained through periodic copying or soft updates, used to compute the bootstrap term of the TD target (in DQN's case, \(\max_a Q(s_{t+1}, a)\)).
In on-policy PG methods (such as PPO), the situation is slightly different. PPO uses the network parameters \(\phi_{\text{old}}\) from the beginning of the rollout to compute \(V_t^{\text{target}}\); these values are computed and stored in the buffer during the data collection phase. During the multiple epochs of updates, the target values remain fixed while only \(V_\phi(s_t)\) changes as parameters are updated. This inherently serves an "anchoring" function similar to a target network.
For off-policy methods (such as SAC), a target network remains standard, typically maintained via soft (Polyak) updates with \(\tau \ll 1\):
\[
\phi_{\text{target}} \leftarrow \tau\, \phi + (1 - \tau)\, \phi_{\text{target}}
\]
Training Procedure
Critic Training Loop:
1. Sample a mini-batch {(s_t, a_t, r_t, s_{t+1}, done_t)} from the buffer
2. Compute targets: V_target = A_t^GAE + V_old(s_t) (or r_t + γV_target_net(s_{t+1}))
3. Compute predictions: V_pred = V_φ(s_t)
4. Compute loss: L = MSE(V_pred, V_target)
5. Backpropagate and update φ
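A minimal PyTorch sketch of one such update step; `value_net`, `optimizer`, and the `states`/`targets` tensors are assumed to already exist, mirroring the loop above rather than any specific library:

import torch
import torch.nn.functional as F

def critic_update(value_net, optimizer, states, targets):
    """One gradient step on L = MSE(V_phi(s), V_target)."""
    values = value_net(states).squeeze(-1)   # predictions V_phi(s_t), shape (batch,)
    loss = F.mse_loss(values, targets)       # Step 4: MSE against the fixed targets
    optimizer.zero_grad()
    loss.backward()                          # Step 5: backpropagate and update phi
    optimizer.step()
    return loss.item()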
Advantage Estimation
Advantage estimation is one of the most critical stages in the PG pipeline. The quality of the advantage estimates directly determines the quality of the policy gradient.
Baseline Subtraction — Why Subtract a Baseline
The original REINFORCE algorithm uses the full return \(G_t\) as the signal:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]
\]
The problem is that \(G_t\) is typically a large positive number (especially in environments where all rewards are positive), causing the gradient for every action to point in the "increase probability" direction, differing only in magnitude. This leads to extremely high variance in the gradient estimate.
Baseline Subtraction: Introduce a baseline function \(b(s_t)\) that depends only on the state, changing the signal to \(G_t - b(s_t)\):
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \big(G_t - b(s_t)\big)\right]
\]
Key theorem: Subtracting any state-dependent baseline does not introduce bias.
The crux of the proof is that \(\sum_a \pi_\theta(a|s) = 1\) holds for all \(\theta\), so its gradient with respect to \(\theta\) is always zero.
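Written out, the extra term contributed by the baseline vanishes in expectation:
\[
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\big[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\big]
= b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)
= b(s) \sum_a \nabla_\theta \pi_\theta(a|s)
= b(s)\, \nabla_\theta \Big(\sum_a \pi_\theta(a|s)\Big)
= b(s)\, \nabla_\theta 1 = 0
\]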
Optimal baseline: It can be shown theoretically that the variance-minimizing optimal baseline is a return weighted by the squared gradient magnitude:
\[
b^*(s) = \frac{\mathbb{E}_{a \sim \pi}\big[\|\nabla_\theta \log \pi_\theta(a|s)\|^2\, G_t\big]}{\mathbb{E}_{a \sim \pi}\big[\|\nabla_\theta \log \pi_\theta(a|s)\|^2\big]}
\]
In practice, this optimal baseline is difficult to compute exactly. Experience shows that \(b(s) = V(s)\) (the state value function) is already an excellent approximation — it conveniently transforms "absolute value" into "advantage relative to the average."
Advantage Function
The advantage function is defined as:
\[
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
\]
where:
- \(Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a \right]\): the expected return of taking action \(a\) in state \(s\) and then following policy \(\pi\)
- \(V^\pi(s) = \mathbb{E}_{a \sim \pi} [Q^\pi(s,a)]\): the expected return of following policy \(\pi\) from state \(s\)
Properties:
- \(\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0\): The expected advantage is zero (good and bad cancel out)
- \(A^\pi(s,a) > 0\): Action \(a\) is better than the average performance of policy \(\pi\) in state \(s\)
- \(A^\pi(s,a) < 0\): Action \(a\) is worse than average
Replacing the raw return with the advantage function, the policy gradient becomes:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}_t\right]
\]
This is the standard form seen in PPO and other modern PG algorithms.
GAE (Generalized Advantage Estimation)
GAE, proposed by Schulman et al. in 2016, is the de facto standard for advantage estimation in industry. Its core idea is: compute an exponentially weighted average over TD errors of different step lengths, using the parameter \(\lambda\) to control the bias-variance balance point.
Derivation
First, define the \(n\)-step advantage estimate:
\[
\hat{A}_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)
\]
GAE is defined as the exponentially weighted average of these \(n\)-step advantages (with weights \((1-\lambda)\lambda^{n-1}\)):
\[
\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, \hat{A}_t^{(n)}
\]
Expanding and simplifying (using the properties of geometric series) yields an elegant closed form:
\[
\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}
\]
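The simplification swaps the order of summation, using the fact that each \(n\)-step advantage telescopes into a sum of TD errors, \(\hat{A}_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l \delta_{t+l}\):
\[
(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{l=0}^{n-1}\gamma^{l}\delta_{t+l}
= \sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}\,(1-\lambda)\sum_{n=l+1}^{\infty}\lambda^{n-1}
= \sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}\,\lambda^{l}
= \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l}
\]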
Two Extremes
- \(\lambda = 0\): \(\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)
- Equivalent to TD(0) advantage estimation
- Low variance, high bias (entirely dependent on the accuracy of \(V\))
- \(\lambda = 1\): \(\hat{A}_t = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l} = G_t - V(s_t)\)
- Equivalent to MC return minus baseline
- Zero bias, high variance
Recursive Computation
In implementation, GAE can be computed recursively from the end of the trajectory backwards, which is highly efficient:
\[
\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}
\]
Pseudocode:
# Assumes: rewards[0:T], values[0:T+1], dones[0:T], gamma, lam
advantages = zeros(T)
last_gae = 0
for t in reversed(range(T)):
delta = rewards[t] + gamma * values[t+1] * (1 - dones[t]) - values[t]
advantages[t] = last_gae = delta + gamma * lam * (1 - dones[t]) * last_gae
returns = advantages + values[0:T] # V_target = A + V_old
Note the (1 - dones[t]) handling: when an episode ends, the next state belongs to a new episode and should not be bootstrapped from.
How to Choose \(\lambda\)
| \(\lambda\) Value | Characteristics | Applicable Scenarios |
|---|---|---|
| 0.0 | Pure TD(0), lowest variance, highest bias | When the Critic is very accurate |
| 0.9 | Leans toward TD, relatively low variance | Dense rewards, good Critic |
| 0.95 | Classic default | Most tasks |
| 0.97-0.99 | Leans toward MC, relatively high variance | Sparse rewards, short episodes |
| 1.0 | Pure MC, zero bias, highest variance | Extremely sparse rewards |
Practical experience: \(\lambda = 0.95\) combined with \(\gamma = 0.99\) is the most common pairing, performing well on standard benchmarks such as Atari and MuJoCo.
Policy Optimization
With the advantage estimates \(\hat{A}_t\) in hand, the next step is to use them to update the policy parameters \(\theta\). There are many approaches, ranging from the simplest vanilla policy gradient to PPO's clipped objective, with progressively increasing sophistication and performance.
Policy Gradient Objective Function
The goal of the policy gradient is to maximize the expected return \(J(\theta)\). By the policy gradient theorem:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}_t\right]
\]
In practice, we define a surrogate objective function to leverage automatic differentiation frameworks:
\[
L^{PG}(\theta) = \mathbb{E}_t\left[\log \pi_\theta(a_t|s_t)\, \hat{A}_t\right]
\]
We then perform gradient ascent (i.e., maximization) on \(L^{PG}\). The gradient of this objective function is exactly the policy gradient.
Problems with Vanilla Policy Gradient:
- Step size sensitivity: Learning rate too large leads to policy collapse; too small leads to extremely slow convergence
- Parameter space is not policy space: A "small step" in parameter space may cause a "large jump" in policy space, and vice versa
- No monotonic improvement guarantee: A single update can make the policy worse, and a worse policy collects worse data, creating a vicious cycle
Natural Gradient
Why is the vanilla gradient the "wrong" direction in parameter space?
Consider a simple example: suppose the policy network outputs logits of \((10, 0)\) for two actions, yielding softmax probabilities of \((0.99995, 0.00005)\). Changing the logits from \((10, 0)\) to \((11, 0)\) barely changes the policy (the probability goes from 0.99995 to 0.99998). But changing the logits from \((0.5, 0)\) to \((1.5, 0)\) produces a significant policy change (the probability goes from roughly 0.62 to 0.82).
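A quick numerical check of this example:

import numpy as np

def softmax(logits):
    z = np.exp(np.array(logits) - np.max(logits))
    return z / z.sum()

print(softmax([10.0, 0.0])[0])   # ~0.99995
print(softmax([11.0, 0.0])[0])   # ~0.99998  (almost no policy change)
print(softmax([0.5, 0.0])[0])    # ~0.62
print(softmax([1.5, 0.0])[0])    # ~0.82     (a large policy change for the same +1 in logits)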
The issue is: Euclidean distance measures the change in parameters \(\|\Delta\theta\|_2\), but what we truly care about is the change in the policy distribution.
The Fisher Information Matrix provides the correct "metric" in policy space:
\[
F(\theta) = \mathbb{E}_{s,\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top\right]
\]
The physical meaning of the Fisher matrix: it describes the "sensitivity" of the policy distribution to changes along each direction in parameter space. We should take small steps in high-sensitivity directions and can take large steps in low-sensitivity directions.
Natural Gradient:
\[
\tilde{\nabla}_\theta J = F^{-1}\, \nabla_\theta J
\]
The natural gradient uses the inverse of the Fisher matrix to "correct" the gradient direction so that the resulting change in policy space is uniform. This is equivalent to solving the following constrained optimization problem:
\[
\max_{\Delta\theta}\ \nabla_\theta J^\top \Delta\theta \quad \text{s.t.}\quad \tfrac{1}{2}\, \Delta\theta^\top F\, \Delta\theta \le \epsilon
\]
That is, find the update direction that maximizes the objective improvement subject to a KL divergence constraint.
Practical difficulty: The Fisher matrix \(F\) has size \(|\theta| \times |\theta|\). For networks with millions of parameters, storing and inverting it is infeasible. This is the problem TRPO was designed to solve.
Trust Region
The core idea of trust region methods: construct a local approximation of the objective function near the current parameters, and constrain the update step to stay within the region where this approximation is "trustworthy."
In the reinforcement learning context, the trust region constraint is typically defined using KL divergence:
\[
\mathbb{E}_{s}\left[D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right] \le \delta
\]
Why is KL divergence the right constraint?
KL divergence directly measures the difference between two policy distributions. If the KL divergence between new and old policies is small, then:
- The importance sampling weights \(r_t(\theta)\) stay close to 1, keeping variance under control
- The approximation error of the surrogate objective relative to the true objective is bounded
- The Kakade-Langford inequality guarantees monotonic policy improvement
Relationship between KL divergence and the Fisher matrix:
\[
D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}\,\|\,\pi_\theta\big) \approx \tfrac{1}{2}\, (\theta - \theta_{\text{old}})^\top F(\theta_{\text{old}})\, (\theta - \theta_{\text{old}})
\]
This is the second-order Taylor expansion of the KL divergence around \(\theta_{\text{old}}\). We can see that the Fisher matrix is precisely the Hessian of the KL divergence (evaluated at \(\theta_{\text{old}}\)).
TRPO in Detail
TRPO (Trust Region Policy Optimization, Schulman et al., 2015) formalizes the above ideas as a constrained optimization problem:
\[
\max_\theta\ \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\, \hat{A}_t\right] \quad \text{s.t.}\quad \mathbb{E}_t\left[D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\right] \le \delta
\]
Solution procedure:
Step 1: Linearize the objective function around \(\theta_{\text{old}}\):
\[
L(\theta) \approx g^\top (\theta - \theta_{\text{old}}), \qquad g = \nabla_\theta L(\theta)\big|_{\theta_{\text{old}}}
\]
Step 2: Quadratically approximate the KL constraint:
\[
\tfrac{1}{2}\, (\theta - \theta_{\text{old}})^\top F\, (\theta - \theta_{\text{old}}) \le \delta
\]
Step 3: Solve via Lagrange duality
This is a classic linear-objective-with-quadratic-constraint problem, with the analytical solution:
\[
\theta = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\; F^{-1} g
\]
That is, update along the natural gradient direction \(F^{-1}g\), with the step size determined by \(\delta\).
Step 4: Conjugate Gradient Method
Directly computing \(F^{-1}g\) requires \(O(|\theta|^2)\) storage and \(O(|\theta|^3)\) computation. TRPO uses the conjugate gradient method, which only requires Fisher-vector products \(Fv\) (computable via automatic differentiation in \(O(|\theta|)\) time), typically converging to a good approximation of \(F^{-1}g\) in about 10–20 iterations.
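A minimal sketch of that conjugate-gradient loop. It assumes a user-supplied function `fvp(v)` returning the Fisher-vector product \(Fv\) (in practice obtained as a Hessian-vector product of the KL divergence via automatic differentiation):

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x starts at zero)
    p = g.copy()            # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x                # x ≈ F^{-1} g, the natural gradient direction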
Step 5: Line Search
The conjugate gradient method yields an approximate solution that may not satisfy the KL constraint. TRPO performs a backtracking line search along the update direction to find the largest step size that both satisfies the constraint and actually improves the objective.
Theoretical guarantee of TRPO: Each update step satisfies
\[
J(\pi_{\text{new}}) \ \ge\ L_{\pi_{\text{old}}}(\pi_{\text{new}}) \;-\; C\, D_{\text{KL}}^{\max}(\pi_{\text{old}}, \pi_{\text{new}})
\]
where \(L_{\pi_{\text{old}}}\) is the surrogate objective and \(C\) is a constant proportional to \(\gamma \max_{s,a} |A^{\pi_{\text{old}}}(s,a)| / (1-\gamma)^2\).
That is, policy performance is monotonically non-decreasing (within the bounds of approximation error).
Engineering challenges of TRPO:
- High implementation complexity (conjugate gradient + Fisher-vector products + line search)
- Incompatible with parameter sharing (KL constraints are difficult to handle when Actor and Critic share layers)
- Incompatible with regularization techniques such as Dropout/BatchNorm
- High per-step computational cost
PPO's Clipped Objective
PPO approximates TRPO's trust region constraint in an extremely simple way (see PPO.md for details). The core formula:
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big(r_t(\theta)\, \hat{A}_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t\Big)\right]
\]
where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the probability ratio and \(\epsilon = 0.2\) is the clipping range.
Clipping vs Trust Region: Clipping is not a strict trust region method — it does not guarantee that the KL divergence stays below a certain threshold. In practice, however, clipping the probability ratio indirectly limits the magnitude of policy changes and works surprisingly well. Moreover, clipping requires only a first-order optimizer (Adam) and involves roughly one-tenth of TRPO's code.
For a detailed analysis of PPO (case-by-case breakdown of the four scenarios, value function loss, entropy bonus, etc.), see PPO.md.
KL Penalty Method
The original PPO paper also proposed another variant: PPO-Penalty, which replaces clipping with a KL divergence penalty:
\[
L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\left[r_t(\theta)\, \hat{A}_t \;-\; \beta\, D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\right]
\]
where \(\beta\) is an adaptively adjusted coefficient:
- If \(D_{\text{KL}} > d_{\text{target}}\): \(\beta \leftarrow 2\beta\) (KL too large — strengthen the penalty)
- If \(D_{\text{KL}} < d_{\text{target}} / 1.5\): \(\beta \leftarrow \beta / 2\) (KL too small — relax the penalty)
PPO-Penalty is theoretically closer to TRPO (explicitly using a KL constraint), but in practice PPO-Clip is more stable and more commonly used. One reason is that computing KL divergence requires the full distributions of both old and new policies, whereas clipping only needs the probability ratio (a scalar), making it computationally simpler.
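A minimal sketch of the adaptive-\(\beta\) rule described above; `kl` would be the mean KL divergence measured after an update, and the thresholds follow the rule as stated here:

def update_beta(beta, kl, d_target=0.01):
    """PPO-Penalty style adaptive KL coefficient."""
    if kl > d_target:
        beta *= 2.0           # KL too large: strengthen the penalty
    elif kl < d_target / 1.5:
        beta /= 2.0           # KL too small: relax the penalty
    return beta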
Regularization and Engineering Tricks
Theoretically correct algorithms often require extensive engineering techniques to work stably in practice. The following are commonly used regularization and optimization strategies in modern PG implementations.
Entropy Bonus
In PPO's complete objective function, the entropy bonus term encourages the policy to maintain a degree of randomness:
\[
L(\theta, \phi) = \mathbb{E}_t\left[L^{\text{CLIP}}(\theta) \;-\; c_1\, L^{\text{VF}}(\phi) \;+\; c_2\, H\big(\pi_\theta(\cdot|s_t)\big)\right]
\]
where the policy entropy is \(H(\pi_\theta(\cdot|s)) = -\sum_a \pi_\theta(a|s) \log \pi_\theta(a|s)\).
Purposes:
- Prevent premature convergence: The policy may collapse early in training to a locally optimal deterministic action; the entropy bonus prevents this
- Numerical stability: Prevents action probabilities from approaching zero (\(\log 0 = -\infty\))
- Encourage exploration: Maintains policy diversity, helping discover better strategies
Typical coefficient \(c_2 = 0.01\). For environments with high exploration requirements, it can be increased to \(0.05\). It can also be linearly decayed over the course of training.
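For a discrete policy head, this term is typically computed directly from the action distribution; a minimal PyTorch sketch (the random logits stand in for the policy network's output):

import torch

logits = torch.randn(4, 3)   # illustrative: a batch of 4 states, 3 discrete actions
dist = torch.distributions.Categorical(logits=logits)
entropy = dist.entropy().mean()    # H(pi_theta(.|s)) averaged over the batch
c2 = 0.01
# The bonus is added to the maximized objective, i.e. subtracted from the minimized loss:
# loss = policy_loss + c1 * value_loss - c2 * entropy
print(entropy.item())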
Mini-batch SGD
After collecting a batch of rollout data, rather than computing a single gradient over all the data at once, the data is shuffled and split into multiple mini-batches.
Benefits of mini-batch SGD:
- Introduces stochasticity that helps escape local optima
- Reduces memory usage (no need to compute gradients over all data simultaneously)
- More frequent parameter updates
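A minimal sketch of the shuffle-and-split step over the \(N \times T\) buffer indices:

import numpy as np

def minibatch_indices(buffer_size, batch_size, rng=None):
    """Yield shuffled index arrays that together cover the whole rollout buffer once."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(buffer_size)
    for start in range(0, buffer_size, batch_size):
        yield perm[start:start + batch_size]

# Usage: for idx in minibatch_indices(N * T, 256): update on states[idx], actions[idx], ...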
Multi-Epoch Updates (PPO-style)
On-policy data is normally used only once. PPO's key engineering innovation is: performing multiple epochs of SGD updates on the same batch of rollout data (typically 3–10 epochs), using clipping to prevent the policy from diverging too far.
PPO Single Iteration:
1. Collect N×T transitions using π_θ_old
2. Compute GAE advantage estimates A_t
3. for epoch = 1, 2, ..., K: ← Reuse the same batch of data multiple times
Shuffle the data
for each mini-batch:
Compute r_t(θ) = π_θ(a|s) / π_θ_old(a|s)
Compute L^CLIP, L^VF, H
Gradient ascent to update θ and φ
4. θ_old ← θ ← Update the old policy
\(K\) should not be too large; otherwise the policy drifts too far from the distribution of the old data, and even clipping cannot recover.
Gradient Clipping
Limit the norm of the gradient to prevent excessively large single-step updates:
\[
g \leftarrow g \cdot \min\left(1,\ \frac{g_{\max}}{\|g\|_2}\right)
\]
Typical value \(g_{\max} = 0.5\). This is especially important in RL because gradient variance is inherently large, and occasionally occurring extreme gradients can destabilize training.
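In PyTorch this corresponds to `torch.nn.utils.clip_grad_norm_`, applied after `backward()` and before `optimizer.step()`; a minimal self-contained sketch:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                  # stand-in for the actor/critic network
out = model(torch.randn(8, 4)).sum()
out.backward()
# Rescale gradients so their global L2 norm does not exceed g_max = 0.5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)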
Advantage Normalization
Within each mini-batch, standardize the advantage values:
\[
\hat{A}_t \leftarrow \frac{\hat{A}_t - \mu}{\sigma + \epsilon}
\]
where \(\mu\) and \(\sigma\) are the mean and standard deviation of the advantage values in the current mini-batch, and \(\epsilon = 10^{-8}\) prevents division by zero.
Why it works: After normalization, roughly half of the advantages are positive and half are negative, turning the policy gradient into "simultaneously increase the probability of good actions and decrease the probability of bad actions." This is more stable than having all-positive signals (or all-negative signals).
Orthogonal Initialization
Use orthogonal initialization for the weights of both Actor and Critic networks, with a smaller scale (e.g., 0.01) for the final layer. This helps:
- Maintain consistent gradient magnitudes in both forward and backward passes
- Prevent the policy network from being overly deterministic at initialization
- Accelerate convergence in the early stages of training
# Typical initialization scheme
import numpy as np
import torch.nn as nn

for layer in network.layers:
    nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
    nn.init.constant_(layer.bias, 0.0)
# Use a small gain for the final policy output layer
nn.init.orthogonal_(policy_head.weight, gain=0.01)
# Use gain 1 for the final value output layer
nn.init.orthogonal_(value_head.weight, gain=1.0)
Complete Pipeline Summary
Full Pipeline Diagram
┌──────────────────────────────────────────────────────────────────────┐
│ Policy Gradient Pipeline │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────┐
│  Env         │──── N parallel environments
└──────┬───────┘
       │ (s, a, r, s', done)
┌──────▼───────┐
│ 1. Rollout   │──── Interact with the envs using π_θ_old, collect N×T transitions
│ (data        │     Record log π_old(a|s), V_old(s)
│  collection) │
└──────┬───────┘
       │
┌──────▼───────┐
│ 2. Compute   │──── Compute TD errors δ_t = r + γV(s') - V(s)
│ returns &    │     Recursively compute GAE: A_t = δ_t + γλ·A_{t+1}
│ GAE advantage│     Compute returns: R_t = A_t + V_old(s_t)
└──────┬───────┘
       │
┌──────▼───────┐
│ 3. Multi-    │──── for epoch = 1 to K:
│ epoch mini-  │       for each mini-batch:
│ batch optim. │
│ ┌──────────┐ │     ┌────────────────────────────────────────┐
│ │ Actor    │ │     │ r(θ) = π_θ(a|s) / π_old(a|s)           │
│ │ (policy  │ │     │ L_clip = min(r·A, clip(r,1±ε)·A)       │
│ │  update) │ │     │        + c2 · H(π_θ)  ← entropy bonus  │
│ └──────────┘ │     ├────────────────────────────────────────┤
│ ┌──────────┐ │     │ L_vf = (V_φ(s) - R_t)²                 │
│ │ Critic   │ │     └────────────────────────────────────────┘
│ │ (value   │ │
│ │  update) │ │
│ └──────────┘ │
└──────┬───────┘
       │
┌──────▼───────┐
│ 4. Update    │──── θ_old ← θ
│ old policy,  │
│ clear buffer │
└──────┬───────┘
       │
       └──── Back to Step 1 and repeat
Generic PG Training Loop Pseudocode
# ─── Initialization ─────────────────────────
policy_net = PolicyNetwork(obs_dim, act_dim)   # Actor: π_θ(a|s)
value_net = ValueNetwork(obs_dim)              # Critic: V_φ(s)
optimizer = Adam([policy_net.params, value_net.params], lr=3e-4)
envs = VectorizedEnv(env_name, num_envs=N)

# ─── Hyperparameters ────────────────────────
gamma = 0.99       # discount factor
lam = 0.95         # GAE λ
epsilon = 0.2      # PPO clipping range
c1 = 0.5           # value loss coefficient
c2 = 0.01          # entropy bonus coefficient
K_epochs = 4       # update epochs per rollout
T_horizon = 128    # rollout length per environment
M_batch = 256      # mini-batch size
max_grad = 0.5     # gradient clipping threshold

# ─── Main loop ──────────────────────────────
for iteration in range(max_iterations):
    # ── Step 1: Rollout ──────────────────────
    buffer = RolloutBuffer()
    obs = envs.reset()
    for t in range(T_horizon):
        with no_grad():
            action, log_prob = policy_net.sample(obs)
            value = value_net(obs)
        next_obs, reward, done, info = envs.step(action)
        buffer.store(obs, action, reward, done, log_prob, value)
        obs = next_obs

    # ── Step 2: Compute GAE & returns ────────
    with no_grad():
        last_value = value_net(obs)  # bootstrap from the final state
    buffer.compute_gae(last_value, gamma, lam)

    # ── Step 3: Multi-epoch mini-batch optimization ──
    for epoch in range(K_epochs):
        for batch in buffer.get_minibatches(M_batch):
            states, actions, old_log_probs, returns, advantages = batch

            # Advantage normalization
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

            # Actor loss (PPO-Clip)
            new_log_probs = policy_net.log_prob(states, actions)
            ratio = exp(new_log_probs - old_log_probs)
            surr1 = ratio * advantages
            surr2 = clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
            policy_loss = -mean(min(surr1, surr2))

            # Critic loss
            value_pred = value_net(states)
            value_loss = mean((value_pred - returns) ** 2)

            # Entropy bonus
            entropy = policy_net.entropy(states)

            # Total loss
            loss = policy_loss + c1 * value_loss - c2 * entropy

            # Update
            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(all_params, max_grad)
            optimizer.step()

    # ── Step 4: Update the old policy ────────
    # PPO: θ_old is maintained implicitly via the old_log_probs stored in the buffer
    buffer.clear()
Differences Among PG Algorithms in the Pipeline
| Algorithm | Advantage Estimate \(\hat{A}_t\) | Policy Update Method | Data Reuse |
|---|---|---|---|
| REINFORCE | \(G_t\) (MC Return) | \(\nabla\log\pi \cdot G_t\) | 1 pass |
| REINFORCE w/ Baseline | \(G_t - V(s_t)\) | \(\nabla\log\pi \cdot (G_t - V)\) | 1 pass |
| A2C | \(\delta_t\) or n-step | \(\nabla\log\pi \cdot \hat{A}\) | 1 pass |
| TRPO | GAE | Natural gradient + KL constraint | 1 pass |
| PPO | GAE | Clipped Objective | K passes |