
PPO

DQN addressed discrete action problems, prompting researchers to tackle the more realistic challenge of continuous control. Today, the continuous control landscape is largely dominated by two algorithms: PPO (on-policy) and SAC (off-policy). However, the path to this point involved considerable evolution. This chapter focuses on PPO as the culmination of the on-policy lineage, providing a concise overview of the most essential concepts along the way.

PPO (Proximal Policy Optimization), proposed by OpenAI in 2017, is one of the most widely used reinforcement learning algorithms in both industry and academia. From RLHF training for ChatGPT, to robotic control, to game AI (OpenAI Five), PPO is virtually ubiquitous. Its core selling point: it achieves performance comparable to TRPO with remarkably simple code, while remaining stable to train and easy to tune.

Development History

Before understanding PPO, we need to trace the entire evolutionary chain from basic policy gradient methods to PPO:

\[ \text{REINFORCE} \to \text{Actor-Critic} \to \text{A3C/A2C} \to \text{TRPO} \to \text{PPO} \]

A3C - Asynchronous Advantage Actor-Critic

A3C was proposed by DeepMind in 2016 (Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning") and was one of the first deep reinforcement learning algorithms to achieve large-scale success on continuous control tasks.

Core idea: Multiple parallel workers (threads/processes) simultaneously interact with their own independent copies of the environment, asynchronously updating a shared global network.

A3C introduced two key innovations:

  1. Asynchronous parallel training: Each worker independently collects experience and computes gradients, then asynchronously pushes those gradients to the global network. The global network updates its parameters as soon as it receives gradients, after which the worker pulls the latest parameters and continues training. This approach dramatically improves data collection efficiency. Moreover, since different workers encounter different state distributions, it naturally "breaks data correlation" -- addressing the same problem that DQN solved with experience replay, but through a completely different approach.

  2. Advantage Actor-Critic architecture: A3C simultaneously trains two networks (typically sharing lower-level parameters):

    • Actor (policy network): Outputs the action probability distribution \(\pi_\theta(a|s)\)
    • Critic (value network): Outputs the state value \(V_\phi(s)\)

    The Actor updates the policy based on the Advantage function \(A(s,a) = Q(s,a) - V(s)\). Using the Advantage instead of the raw return \(G_t\) significantly reduces variance while maintaining unbiasedness.

The problem with A3C: Asynchronous updates lead to a "stale gradient" problem. When worker A finishes computing its gradient and is about to upload it, the global network may have already been updated by worker B. Worker A's gradient was computed based on outdated parameters and may push the global network in the wrong direction.

A2C - Synchronous Advantage Actor-Critic

A2C is the synchronous version of A3C. The improvement is straightforward: all workers synchronously collect one round of data, wait for everyone to finish, and then update the global network together.

Although this sacrifices some wall-clock efficiency (faster workers must wait for slower ones), it eliminates the gradient noise introduced by asynchronous updates. Experiments show that A2C often achieves the same or even better performance than A3C under the same computational budget, while being simpler to implement and easier to debug.

A2C is an important stepping stone for understanding PPO, because PPO is essentially A2C with an added "trust region" constraint to stabilize training.

Trust Region - The Concept

Before formally introducing TRPO, we need to understand a core concept from optimization: the Trust Region.

In standard gradient descent, we compute the gradient direction and take a step along it. But how far should we go? This is determined by the learning rate. The problem is: the gradient is only accurate within an extremely small neighborhood of the current point. If the learning rate is too large, you step outside this neighborhood where the gradient direction is no longer reliable, potentially causing a catastrophic drop in performance.

The trust region approach takes an entirely different perspective:

  1. Around the current parameters \(\theta\), construct a local approximate model of the objective function (e.g., a quadratic approximation).
  2. Define a trust region -- the range within which we "trust" this approximate model to be accurate.
  3. Within this trust region, find the optimal solution of the approximate model as the next set of parameters.
  4. If the new parameters actually improve the true objective, expand the trust region; if they make it worse, shrink it.

In reinforcement learning, the trust region concept is especially important. Policy gradient algorithms suffer from a critical issue: a single overly large policy update can cause the policy to collapse entirely, and recovery is extremely difficult. Unlike supervised learning, where an increase in loss at one step can be corrected at the next, in RL, once a policy degrades, the quality of collected data also degrades, which in turn makes the next update even worse -- creating a vicious cycle.

TRPO - Trust Region Policy Optimization

TRPO was proposed by Schulman et al. in 2015 and was the first algorithm to formally incorporate trust region methods into policy gradient optimization. It provides theoretical guarantees that each update monotonically improves policy performance (or at least does not degrade it).

Theoretical foundation: TRPO builds upon an important inequality proved by Kakade and Langford in 2002. For any two policies \(\pi\) and \(\pi'\), the expected return of the new policy can be expressed as:

\[ \eta(\pi') = \eta(\pi) + \mathbb{E}_{s \sim d^{\pi'}, a \sim \pi'} [A^\pi(s, a)] \]

where \(\eta(\pi)\) is the expected return of policy \(\pi\), \(A^\pi(s,a)\) is the advantage function under the old policy, and \(d^{\pi'}\) is the state visitation distribution under the new policy.

The problem is that the expectation on the right-hand side is computed under the new policy \(\pi'\)'s state distribution, which we cannot obtain since the new policy has not yet been executed. TRPO's key insight is that if the old and new policies are sufficiently close, we can approximate using the old policy's state distribution, yielding a surrogate objective:

\[ L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \]

To ensure this approximation is valid (i.e., the two policies are indeed sufficiently close), TRPO imposes a KL divergence constraint on the policy update:

\[ \max_\theta \quad L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \quad \text{s.t.} \quad \mathbb{E}_s \left[ D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s) \| \pi_\theta(\cdot|s)) \right] \leq \delta \]

where \(D_{\text{KL}}\) is the KL divergence measuring the difference between two probability distributions, and \(\delta\) is a small threshold limiting how much the new policy can differ from the old one.

Solution method: This is a constrained optimization problem that cannot be solved directly with SGD. TRPO employs the following techniques:

  1. First-order Taylor expansion of the objective (linear approximation):
\[ L(\theta) \approx L(\theta_{\text{old}}) + g^T (\theta - \theta_{\text{old}}) \]

where \(g = \nabla_\theta L |_{\theta_{\text{old}}}\) is the policy gradient.

  2. Second-order Taylor expansion of the KL constraint (quadratic approximation):
\[ D_{\text{KL}}(\theta_{\text{old}} \| \theta) \approx \frac{1}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}}) \]

where \(F\) is the Fisher Information Matrix, which describes the curvature of policy changes in parameter space.

  3. Solving with conjugate gradient: Directly computing \(F^{-1}g\) requires inverting the Fisher matrix, which is infeasible for large neural networks. TRPO cleverly uses the conjugate gradient method, which only requires computing Fisher matrix-vector products \(Fv\) (efficiently achievable via automatic differentiation) to approximately solve for the update direction (a short code sketch follows this list).

  4. Line search: After the conjugate gradient provides the update direction, a line search is performed along this direction to ensure the KL constraint is actually satisfied and the objective function actually improves.
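
To make step 3 concrete, below is a minimal PyTorch sketch, under my own assumptions, of the two ingredients: a Fisher-vector product obtained by differentiating the mean KL divergence twice, and a conjugate gradient loop that approximately solves \(Fx = g\) using only such products. The helper name `kl_fn` (a closure that recomputes the mean KL between the old and current policy) is illustrative, not part of any specific library.

import torch

def fisher_vector_product(kl_fn, params, v, damping=0.1):
    # F v via a Hessian-vector product of the mean KL divergence:
    # at theta = theta_old, the Hessian of KL(pi_old || pi_theta) equals the Fisher matrix.
    kl = kl_fn()
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v_product = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v_product, params)
    return torch.cat([h.reshape(-1) for h in hvp]) + damping * v

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    # Approximately solve F x = g without ever forming or inverting F.
    x = torch.zeros_like(g)
    r = g.clone()
    p = g.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rs_old / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x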

Critical drawbacks of TRPO:

  • Extremely complex implementation (requires conjugate gradient, line search, Fisher-vector products, etc.)
  • Incompatible with parameter-sharing architectures (difficult to handle when Actor and Critic share layers)
  • Incompatible with common regularization techniques like Dropout and Batch Normalization
  • High computational overhead

It is precisely because of these engineering challenges that PPO was born.

PPO Core Algorithm

PPO's design philosophy is: preserve TRPO's core idea of "limiting the policy update step size," but implement it in an extremely simple way. No conjugate gradient, no Fisher matrix, no line search -- just a first-order optimizer (such as Adam).

Policy Gradient Review

Before presenting PPO's specific formulas, let us review the basic Policy Gradient theorem.

The goal of policy gradient is to maximize the expected return:

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right] \]

By the Policy Gradient Theorem, the gradient of \(J(\theta)\) with respect to parameters \(\theta\) is:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^{\pi_\theta}(s_t, a_t) \right] \]

where:

  • \(\pi_\theta(a_t|s_t)\) is the probability of selecting action \(a_t\) in state \(s_t\) under the policy
  • \(A^{\pi_\theta}(s_t, a_t)\) is the advantage function, measuring how good action \(a_t\) is relative to the average
  • \(\nabla_\theta \log \pi_\theta(a_t|s_t)\) is the gradient of the log-probability, indicating the parameter update direction that increases the probability of this action

Intuition: If the advantage of action \(a_t\) is \(A > 0\) (better than average), increase the probability of that action; if \(A < 0\) (worse than average), decrease it. The magnitude of the increase/decrease is proportional to \(|A|\).
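
As a minimal sketch of this update (the function and variable names here are illustrative, not from a specific library), the gradient above corresponds to minimizing the negative of \(\log \pi_\theta(a_t|s_t) \cdot A_t\) averaged over a batch:

import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t|s_t) for the sampled actions, shape [batch]
    # advantages: advantage estimates A_t, shape [batch], treated as constants
    # Negative sign because optimizers minimize, while we want to maximize E[log pi * A]
    return -(log_probs * advantages.detach()).mean()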

The problem with vanilla policy gradient: It is an on-policy algorithm -- each batch of collected data can only be used for a single parameter update, after which the data must be discarded (since it was generated by the old policy and cannot be used for gradient estimation under the new policy). This results in extremely low sample efficiency.

Importance Sampling

To improve sample efficiency, we want to use data collected by the old policy \(\pi_{\theta_{\text{old}}}\) to update the new policy \(\pi_\theta\). This requires importance sampling.

The basic principle of importance sampling is: if we want to compute the expectation of \(f(x)\) under distribution \(p(x)\), but only have samples from a different distribution \(q(x)\), we can correct for this through reweighting:

\[ \mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right] \]

where \(\frac{p(x)}{q(x)}\) is the importance weight.

Applying this to policy gradient, we use data from the old policy to estimate the new policy's objective:

\[ L^{IS}(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \]

We define the probability ratio:

\[ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \]

The surrogate objective can then be written concisely as:

\[ L^{IS}(\theta) = \mathbb{E}_t \left[ r_t(\theta) \cdot A_t \right] \]

When \(\theta = \theta_{\text{old}}\), \(r_t = 1\), and the surrogate objective reduces to the standard policy gradient objective.
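
In code, the ratio is usually computed from stored log-probabilities rather than raw probabilities, which is numerically more stable. A minimal sketch (names are illustrative):

import torch

def probability_ratio(log_probs_new: torch.Tensor, log_probs_old: torch.Tensor) -> torch.Tensor:
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
    return torch.exp(log_probs_new - log_probs_old)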

The problem with importance sampling: When the old and new policies diverge significantly, the importance weight \(r_t(\theta)\) can become very large or very small, causing high variance in gradient estimates and unstable training. TRPO addresses this via a KL divergence constraint, while PPO takes a more direct approach -- Clipping.

Clipped Surrogate Objective

This is PPO's most important innovation and the most elegant part of the entire algorithm.

PPO-Clip directly clips the probability ratio \(r_t(\theta)\) within the objective function:

\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right] \]

where:

  • \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the probability ratio
  • \(\epsilon\) is a hyperparameter, typically set to \(0.2\), controlling the clipping range
  • \(\text{clip}(r_t, 1-\epsilon, 1+\epsilon)\) restricts the probability ratio to \([1-\epsilon, 1+\epsilon]\), i.e., \([0.8, 1.2]\) for \(\epsilon = 0.2\)
  • \(\min\) takes the smaller of the two terms
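
A minimal PyTorch sketch of this objective, written as a loss to minimize (the inputs `ratio` and `advantages` are assumed to be precomputed per the definitions above; the function name is mine):

import torch

def clipped_surrogate_loss(ratio: torch.Tensor, advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) term, then negate so that minimizing maximizes L^CLIP
    return -torch.min(unclipped, clipped).mean()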

Why does clipping work? Let us analyze four cases:

Case 1: \(A_t > 0\) (action better than average), \(r_t > 1\) (new policy favors this action more)

  • Unclipped term: \(r_t \cdot A_t\), which keeps growing as \(r_t\) grows
  • Clipped term: \(\text{clip}(r_t, 1-\epsilon, 1+\epsilon) \cdot A_t\), capped at \((1+\epsilon) A_t\) once \(r_t > 1+\epsilon\)
  • \(\min\) takes the smaller value, so the objective is capped at \((1+\epsilon) A_t\) (the clipped term)

Effect: The new policy already favors this good action more than the old policy, but the objective no longer provides additional reward. This prevents "over-optimization" -- avoiding excessively boosting the probability of an action just because it happened to work well once.

Case 2: \(A_t > 0\) (action better than average), \(r_t < 1\) (new policy favors this action less)

  • Unclipped term: \(r_t \cdot A_t\), less than \(A_t\)
  • Clipped term: \(\max(r_t, 1-\epsilon) \cdot A_t\), which is never smaller than the unclipped term
  • \(\min\) takes the smaller value \(= r_t \cdot A_t\) (the unclipped term)

Effect: The probability of a good action is declining; the objective provides a normal gradient signal encouraging recovery of this action's probability.

Case 3: \(A_t < 0\) (action worse than average), \(r_t < 1\) (new policy reduces this action's probability)

  • Unclipped term: \(r_t \cdot A_t\); since \(A_t < 0\), the product rises toward \(0\) as \(r_t\) decreases
  • Clipped term: \((1-\epsilon) \cdot A_t\) once \(r_t < 1-\epsilon\), a constant floor
  • \(\min\) takes the smaller (more negative) value \(= (1-\epsilon) \cdot A_t\) (the clipped one), whose gradient with respect to \(\theta\) is zero

Effect: The new policy is already reducing the probability of this bad action, but the objective no longer provides additional penalty. This prevents "over-avoidance."

Case 4: \(A_t < 0\) (action worse than average), \(r_t > 1\) (new policy actually increases this action's probability)

  • Unclipped term: \(r_t \cdot A_t\); as \(r_t\) increases, the value becomes more and more negative
  • Clipped term: capped at \((1+\epsilon) \cdot A_t\) once \(r_t > 1+\epsilon\)
  • \(\min\) takes the smaller (more negative) value \(= r_t \cdot A_t\) (the unclipped term), so the penalty is never capped

Effect: A bad action's probability is rising; the objective provides a normal gradient signal penalizing this behavior.

Summary: The essential logic of clipping is: when the policy changes in the correct direction (increasing probability of good actions or decreasing probability of bad actions), limit the magnitude of change; when the policy changes in the wrong direction, impose no limit and let the gradient correct freely. This implements a "pessimistic lower bound" estimate: we always optimize a lower bound of the objective, ensuring that true performance is no worse than our estimate.

Value Function Loss

The Critic network in PPO estimates the state value function \(V_\phi(s)\), which is used to compute the advantage function. The Critic's loss function is typically a simple MSE:

\[ L^{VF}(\phi) = \mathbb{E}_t \left[ (V_\phi(s_t) - V_t^{\text{target}})^2 \right] \]

where \(V_t^{\text{target}}\) is the value function target, typically the return estimate \(\hat{R}_t\) computed via GAE (discussed in detail later):

\[ V_t^{\text{target}} = \hat{R}_t = A_t + V_{\phi_{\text{old}}}(s_t) \]

In the original PPO paper, the authors also proposed a variant that applies clipping to the value function:

\[ L^{VF,\text{clip}}(\phi) = \mathbb{E}_t \left[ \max \left( (V_\phi(s_t) - V_t^{\text{target}})^2, \; (V_{\phi}^{\text{clip}}(s_t) - V_t^{\text{target}})^2 \right) \right] \]

where \(V_\phi^{\text{clip}} = \text{clip}(V_\phi(s_t), V_{\phi_{\text{old}}}(s_t) - \epsilon, V_{\phi_{\text{old}}}(s_t) + \epsilon)\). However, in practice, value function clipping has shown inconsistent effects, and many high-performance implementations simply use the plain MSE loss.
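
For reference, a short sketch of both variants, assuming `values`, `old_values`, and `returns` are per-timestep tensors (names are illustrative):

import torch

def value_function_loss(values, returns, old_values=None, clip_eps=0.2):
    if old_values is None:
        return ((values - returns) ** 2).mean()  # plain MSE variant
    # Clipped variant: limit how far the new value estimate may move from the old one
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()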

Entropy Bonus

To prevent the policy from converging prematurely to a deterministic action (i.e., "premature convergence"), PPO adds an entropy bonus to the objective function:

\[ H(\pi_\theta(\cdot|s_t)) = -\sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t) \]

Entropy measures the "randomness" of the policy output. Higher entropy means a more stochastic policy (more uniform action probabilities); lower entropy means a more deterministic policy (concentrated on a single action).

The purposes of adding the entropy bonus are:

  1. Encouraging exploration: Prevents the policy from collapsing to a local optimum early in training.
  2. Avoiding zero action probabilities: If an action's probability approaches 0, \(\log \pi\) approaches negative infinity, causing numerical issues. The entropy bonus prevents this from happening.
  3. Improving robustness: A slightly stochastic policy is more robust to environmental perturbations.
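
For a discrete action space, the entropy term is one line with `torch.distributions` (a sketch; `logits` is assumed to be the Actor's output for a batch of states):

import torch
from torch.distributions import Categorical

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    # logits: shape [batch, num_actions]; returns the mean policy entropy H
    return Categorical(logits=logits).entropy().mean()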

PPO's Complete Objective Function

Combining the three components above, PPO's complete objective function is:

\[ L^{\text{PPO}}(\theta, \phi) = \mathbb{E}_t \left[ L_t^{\text{CLIP}}(\theta) - c_1 \cdot L_t^{VF}(\phi) + c_2 \cdot H(\pi_\theta(\cdot|s_t)) \right] \]

where:

  • \(L_t^{\text{CLIP}}(\theta)\): Clipped surrogate objective (maximized)
  • \(L_t^{VF}(\phi)\): Value function loss (minimized, hence the negative sign)
  • \(H(\pi_\theta)\): Policy entropy (maximized, hence the positive sign)
  • \(c_1\): Value function loss coefficient, typically \(0.5\) or \(1.0\)
  • \(c_2\): Entropy bonus coefficient, typically \(0.01\)

Note that there are two common implementation approaches for PPO:

  1. Shared network: The Actor and Critic share lower-level feature extraction layers (e.g., convolutional layers in a CNN), with two separate heads at the top. In this case, all three loss terms are optimized jointly, with \(c_1\) and \(c_2\) balancing their relative importance.
  2. Separate networks: The Actor and Critic use entirely independent networks. In this case, the policy loss and value function loss can be optimized separately.
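
Putting the pieces together, here is a sketch of the combined loss for the shared-network case, mirroring the formula above (it inlines the clipped surrogate, a plain MSE value loss, and the discrete-action entropy term; all argument names are illustrative):

import torch
from torch.distributions import Categorical

def ppo_total_loss(ratio, advantages, values, returns, logits, clip_eps=0.2, c1=0.5, c2=0.01):
    # -L^CLIP: pessimistic clipped surrogate, negated because we minimize
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # L^VF: plain MSE value loss (the clipped variant could be substituted here)
    value_loss = ((values - returns) ** 2).mean()
    # H: entropy of the current policy (discrete case shown)
    entropy = Categorical(logits=logits).entropy().mean()
    # Minimizing -L^PPO = -L^CLIP + c1 * L^VF - c2 * H
    return policy_loss + c1 * value_loss - c2 * entropy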

Implementation Details

The PPO paper itself is quite concise, but what truly makes PPO perform well is a wealth of implementation details. These details are either mentioned in passing or omitted entirely from the original paper, yet they have an enormous impact on performance.

GAE - Generalized Advantage Estimation

GAE (Generalized Advantage Estimation) is the standard method for computing the advantage function \(A_t\) in PPO, proposed by Schulman in a separate 2016 paper.

Why is GAE needed? Computing the advantage function involves a classic bias-variance trade-off:

Method 1: Monte Carlo returns (high variance, zero bias)

\[ A_t^{MC} = \left( \sum_{l=0}^{T-t} \gamma^l r_{t+l} \right) - V(s_t) \]

Uses the complete episode return to estimate the advantage. The benefit is zero bias; the drawback is high variance (since the return is affected by all subsequent randomness).

Method 2: One-step TD error (low variance, biased)

\[ A_t^{TD} = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t \]

Only considers one step of reward plus the next state's value estimate. Low variance (depends on only one step of randomness), but biased (since \(V(s_{t+1})\) is itself an imperfect estimate).

GAE's solution: use a parameter \(\lambda\) to smoothly interpolate between the two.

First, define the one-step TD error:

\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]

Then, GAE is defined as the exponentially weighted average of TD errors:

\[ \hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l} \]

Expanded:

\[ \hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + \cdots \]

  • When \(\lambda = 0\): \(\hat{A}_t = \delta_t\), reducing to the one-step TD error (low variance, high bias)
  • When \(\lambda = 1\): \(\hat{A}_t = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}\), equivalent to the Monte Carlo return minus the baseline (high variance, zero bias)

In practice, \(\lambda\) is typically set to \(0.95\), striking a good balance between variance and bias.

Recursive computation of GAE: In implementation, GAE can be computed recursively from the end of the trajectory backward, which is very efficient:

\[ \hat{A}_T = \delta_T \]
\[ \hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1} \]

This has exactly the same form as the recursive formula for computing discounted returns, making the code implementation very concise:

# GAE computation (backward recursion)
# Assumes rewards and dones have shape [T] and values has shape [T+1],
# where the extra entry is the bootstrap value for the state after the last step.
import torch

advantages = torch.zeros_like(rewards)
last_gae = 0.0
for t in reversed(range(T)):
    # One-step TD error; (1 - dones[t]) zeroes the bootstrap term at episode boundaries
    delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
    # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
    advantages[t] = last_gae = delta + gamma * lam * (1 - dones[t]) * last_gae
returns = advantages + values[:-1]  # V_target = A + V

Mini-batch Updates

A key implementation detail of PPO is: after collecting a large batch of data, shuffle it into multiple mini-batches and iterate over them multiple times.

The procedure is as follows:

  1. Use the current policy \(\pi_{\theta_{\text{old}}}\) to interact with the environment across \(N\) parallel environments for \(T\) steps each, collecting a total of \(N \times T\) transitions.
  2. Use the Critic network to compute value estimates \(V(s_t)\) for all states, then use GAE to compute advantages \(\hat{A}_t\) for all timesteps.
  3. Randomly shuffle all \(N \times T\) data points.
  4. Partition the shuffled data into mini-batches (each typically of size 64 or 128).
  5. For each mini-batch, compute the PPO loss and update network parameters using Adam.
  6. Repeat steps 3-5 for \(K\) epochs.
  7. Discard all data and return to step 1.
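
A sketch of steps 3-6 of this procedure (the `batch` dictionary and the `update_minibatch` callback, which would compute the PPO loss and take one Adam step, are hypothetical placeholders):

import numpy as np

def run_ppo_epochs(batch, update_minibatch, num_epochs=4, minibatch_size=64):
    # batch: dict of arrays of length N*T (obs, actions, old_log_probs, advantages, returns)
    batch_size = len(next(iter(batch.values())))
    indices = np.arange(batch_size)
    for _ in range(num_epochs):                                 # step 6: repeat for K epochs
        np.random.shuffle(indices)                              # step 3: shuffle all data points
        for start in range(0, batch_size, minibatch_size):      # step 4: partition into mini-batches
            mb_idx = indices[start:start + minibatch_size]
            minibatch = {k: v[mb_idx] for k, v in batch.items()}
            update_minibatch(minibatch)                         # step 5: PPO loss + Adam update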

Multiple Epochs per Rollout

PPO allows multiple epochs of updates on the same batch of data (typically \(K = 3 \sim 10\)), which is a huge efficiency improvement over vanilla policy gradient (which can only update once).

This is feasible because the clipping mechanism inherently limits the magnitude of policy changes. Even when updating multiple times on the same batch of data, the policy will not deviate too far from the old policy. When the probability ratio \(r_t\) exceeds the \([1-\epsilon, 1+\epsilon]\) range, the gradient is clipped to zero, effectively halting further updates automatically.

However, the number of epochs should not be set too high. Empirically, an excessively large \(K\) leads to:

  • The policy overfitting to the current batch of data
  • Probability ratios being clipped across the board, effective gradients approaching zero, and updates stalling
  • KL divergence becoming too large, with the policy drifting too far from the old policy

Advantage Normalization

Within each mini-batch, the advantage function is typically normalized:

\[ \hat{A}_t \leftarrow \frac{\hat{A}_t - \mu(\hat{A})}{\sigma(\hat{A}) + \epsilon} \]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of the advantage values in the current mini-batch, and \(\epsilon\) is a small constant (e.g., \(10^{-8}\)) to prevent division by zero.
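
In code this is a single step applied to each mini-batch (a sketch; `advantages` is the mini-batch advantage tensor):

def normalize_advantages(advantages, eps=1e-8):
    # Shift to zero mean and scale to unit standard deviation within the mini-batch
    return (advantages - advantages.mean()) / (advantages.std() + eps)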

The benefits of normalization are:

  1. The scale of advantage values becomes independent of the task's specific reward magnitude
  2. Roughly half the advantage values are positive (increase probability) and half are negative (decrease probability), leading to more balanced gradient directions
  3. Significantly improves training stability

Hyperparameters

The following are PPO's most important hyperparameters and their common values:

| Hyperparameter | Symbol | Typical Value | Description |
| --- | --- | --- | --- |
| Learning rate | \(\alpha\) | \(3 \times 10^{-4}\) | Actor and Critic typically use the same learning rate |
| Clipping coefficient | \(\epsilon\) | \(0.2\) | Controls the magnitude of policy updates; core hyperparameter |
| Discount factor | \(\gamma\) | \(0.99\) | Discount for future rewards |
| GAE parameter | \(\lambda\) | \(0.95\) | Controls the bias-variance trade-off |
| Number of epochs | \(K\) | \(3 \sim 10\) | Training epochs per batch of data |
| Mini-batch size | \(M\) | \(64\) | Number of samples per gradient update |
| Value function coefficient | \(c_1\) | \(0.5\) | Weight of the value function loss |
| Entropy coefficient | \(c_2\) | \(0.01\) | Weight of the entropy bonus |
| Number of parallel environments | \(N\) | \(8 \sim 128\) | Number of environment copies running simultaneously |
| Rollout length | \(T\) | \(128 \sim 2048\) | Number of steps collected per environment |
| Max gradient norm | - | \(0.5\) | Gradient clipping to prevent gradient explosion |

Note: These are empirical values and may need adjustment for different tasks. Among them, \(\epsilon\) and the learning rate have the greatest impact on performance.

PPO Training Procedure

Integrating all the above, the complete PPO training procedure is as follows:

PPO Algorithm Pseudocode:
1. Initialize policy network pi_theta and value network V_phi (may share lower layers)
2. for iteration = 1, 2, ... do
3.     Run current policy pi_theta_old in N parallel environments for T steps each
4.     Collect trajectory data: {(s_t, a_t, r_t, s_{t+1}, done_t, log pi_theta_old(a_t|s_t))}
5.     Compute value estimates for all states using V_phi
6.     Compute advantage estimates A_hat_t and return targets R_hat_t using GAE
7.     for epoch = 1, ..., K do
8.         Randomly shuffle all N x T data points
9.         Partition data into mini-batches of size M
10.        for each mini-batch do
11.            Compute log pi_theta(a_t|s_t) under the current policy
12.            Compute probability ratio r_t = exp(log pi_theta - log pi_theta_old)
13.            Compute clipped surrogate objective L^CLIP
14.            Compute value function loss L^VF
15.            Compute entropy bonus H
16.            Total loss L = -L^CLIP + c1 * L^VF - c2 * H
17.            Backpropagate, clip gradients, update with Adam
18.        end for
19.    end for
20.    theta_old <- theta (update old policy parameters)
21. end for

PPO vs TRPO

| Dimension | TRPO | PPO |
| --- | --- | --- |
| Year | 2015 | 2017 |
| Constraint method | Hard constraint (KL divergence \(\leq \delta\)) | Soft constraint (clipping) |
| Optimization method | Second-order (conjugate gradient + line search) | First-order (Adam / SGD) |
| Implementation complexity | Very high (requires Fisher matrix, conjugate gradient) | Very low (core code under 50 lines) |
| Computational overhead | High (multiple conjugate gradient iterations per step) | Low (standard mini-batch SGD) |
| Theoretical guarantees | Strict monotonic improvement guarantee | No strict guarantee, but works well in practice |
| Architecture compatibility | Incompatible with parameter sharing, Dropout, etc. | Fully compatible with all standard deep learning techniques |
| Performance | Comparable on most tasks | Comparable on most tasks |
| Tuning difficulty | Harder (sensitive to choice of \(\delta\)) | Easy (\(\epsilon = 0.2\) is nearly universal) |
| Industry adoption | Rare | Extremely widespread |

PPO's success conveys a profound lesson: in engineering practice, simple and robust methods are often more valuable than theoretically superior but complex alternatives. TRPO provided crucial theoretical insights, and PPO transformed those insights into a practical tool accessible to everyone.

PPO Variants and Supplements

PPO-Penalty

In addition to PPO-Clip, the original paper also proposed another variant, PPO-Penalty, which foregoes clipping and instead adds the KL divergence directly as a penalty term in the objective:

\[ L^{\text{KLPEN}}(\theta) = \mathbb{E}_t \left[ r_t(\theta) A_t - \beta \cdot D_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \right] \]

where \(\beta\) is an adaptive coefficient:

  • If the actual KL divergence \(> 1.5 \cdot d_{\text{targ}}\) (exceeds the target by too much), then \(\beta \leftarrow 2\beta\)
  • If the actual KL divergence \(< d_{\text{targ}} / 1.5\) (well below the target), then \(\beta \leftarrow \beta / 2\)
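
A minimal sketch of this adaptive rule (variable names are illustrative):

def update_kl_penalty(beta, kl_mean, kl_target):
    # Adjust the penalty coefficient after each policy update based on the measured KL
    if kl_mean > 1.5 * kl_target:
        return beta * 2.0       # policy moved too far: strengthen the penalty
    if kl_mean < kl_target / 1.5:
        return beta / 2.0       # policy barely moved: relax the penalty
    return beta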

In practice, PPO-Clip generally outperforms PPO-Penalty, making PPO-Clip the default choice.

PPO in RLHF

PPO plays a central role in RLHF (Reinforcement Learning from Human Feedback) training for large language models. In this setting:

  • Environment: The language model generates responses given a prompt
  • Actions: The generation of each token
  • Reward: Provided by a Reward Model trained on human preferences
  • Policy: The language model itself serves as the policy network

RLHF-based PPO also incorporates an additional KL penalty term to prevent the model from drifting too far from the pretrained model:

\[ r_t = r_{\text{RM}}(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \]

where \(\pi_{\text{ref}}\) is the reference model after SFT (Supervised Fine-Tuning).
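
As a rough sketch of how this penalty is often implemented in practice, the per-token log-probability difference serves as a KL estimate and the reward model score is added at the final token (this is one common implementation choice, not the only one, and all names here are illustrative):

import torch

def kl_penalized_rewards(logprobs_policy, logprobs_ref, rm_score, beta=0.1):
    # logprobs_policy / logprobs_ref: per-token log-probabilities of the generated response
    kl_per_token = logprobs_policy - logprobs_ref    # simple per-token KL estimate
    rewards = -beta * kl_per_token                   # KL penalty at every token
    rewards[-1] = rewards[-1] + rm_score             # reward model score at the final token
    return rewards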

Discussion on On-Policy Learning

PPO is an on-policy algorithm, meaning data must be recollected after every update. Although clipping and multi-epoch updates improve data utilization, PPO's sample efficiency remains lower compared to off-policy algorithms such as SAC.

However, PPO possesses an advantage that off-policy algorithms struggle to match: training stability. Because PPO trains only on freshly collected on-policy data rather than stale replay-buffer samples (avoiding the off-policy ingredient of the deadly triad) and needs no target network, its training process is typically very smooth and rarely exhibits the performance collapse sometimes seen in DQN.

In practice, the choice depends on the scenario:

  • If sample acquisition is cheap (e.g., simulated environments): PPO is preferred for its stability and reliability
  • If sample acquisition is expensive (e.g., real robots): SAC is preferred for its higher sample efficiency
  • If the action space is discrete: PPO natively supports this; SAC requires additional modifications
