
LLM Post-Training

LLM post-training is a cutting-edge field that applies reinforcement learning to large language model alignment. A pretrained LLM is essentially a "next-token predictor" — it learns to predict the probability distribution of the next token over massive text corpora, but this does not mean it will behave in accordance with human expectations. The goal of post-training is to use various reinforcement learning methods to transform the model from "being able to talk" to "talking sensibly."

These notes approach the topic from an RL perspective, providing in-depth coverage of post-training methods ranging from RLHF to GRPO and from DPO to RLVR, with an emphasis on mathematical derivations and algorithmic principles.


Why Post-Training Is Needed

Limitations of Pretraining

The objective function of pretraining is the standard language modeling loss:

\[ \mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t}) \]

This objective teaches the model "given the preceding context, which token is most likely to appear." However, there is a fundamental gap between "the most likely token" and "the token a human would want to see":

  • Internet text is rife with harmful, false, and biased content, and the model faithfully learns these distributions
  • A language model has no concept of "refusal" — it simply continues completing text
  • Among multiple reasonable answers, the model cannot judge which one better aligns with human preferences

The Necessity and Limitations of SFT

SFT (Supervised Fine-Tuning) is the first step of post-training, performing supervised learning on high-quality (instruction, response) data pairs:

\[ \mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P_\theta(y_t | x, y_{<t}) \]
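As a concrete reference, a minimal PyTorch sketch of this objective (the tensor names logits, input_ids, and loss_mask are assumptions for illustration; loss_mask marks response tokens so that prompt tokens do not contribute to the loss):

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, loss_mask):
    # Position t predicts token t+1, so shift logits and targets by one.
    logp = F.log_softmax(logits[:, :-1], dim=-1)                                  # (B, T-1, V)
    target_logp = logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)     # (B, T-1)
    mask = loss_mask[:, 1:].float()   # 1 for response tokens, 0 for prompt/padding
    return -(target_logp * mask).sum() / mask.sum()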

SFT teaches the model the basic format and behavioral patterns for "answering questions," but it has fundamental limitations:

  1. Scarcity of supervision signal: High-quality labeled data is extremely expensive, especially answers that require domain experts to write
  2. Limited generalization: SFT can only imitate answer patterns present in the training data and cannot generalize to new scenarios
  3. Inability to distinguish quality: SFT treats all training data equally and cannot learn "which answer is better"
  4. Exposure Bias: During training the model only sees correct prefixes, but during inference it may generate incorrect prefixes, leading to error accumulation

The RL Perspective on Post-Training

Framing LLM post-training as a reinforcement learning problem is an extremely natural formulation:

| RL Concept | Counterpart in LLM Post-Training |
|---|---|
| Agent | LLM |
| Policy \(\pi_\theta\) | The LLM's parameterized conditional distribution \(P_\theta(y \mid x)\) |
| State \(s_t\) | Prompt \(x\) + previously generated tokens \(y_{<t}\) |
| Action \(a_t\) | Next token \(y_t\) |
| Trajectory \(\tau\) | Complete generated sequence \(y = (y_1, y_2, \ldots, y_T)\) |
| Reward \(r\) | Human preference score / Reward Model output / Verifiable reward |
| Environment | Dialogue context + evaluation mechanism |

Note several distinctive characteristics of this RL problem:

  • Enormous action space: Vocabulary size is typically \(32{,}000 \sim 150{,}000\), far exceeding that of typical RL problems
  • Sparse Reward: Rewards are usually given only at the end of generation, with no reward signal for intermediate steps
  • Deterministic environment: Given a state and action, the next state is deterministic (autoregressive concatenation)
  • Policy is the generation process: All stochasticity in the policy comes from token sampling

RLHF (Reinforcement Learning from Human Feedback)

RLHF is the alignment method systematically introduced by InstructGPT (Ouyang et al., 2022) and the core training technique behind ChatGPT. Its central idea is: use human preference feedback to construct reward signals, then optimize the LLM's policy with RL algorithms.

The complete RLHF pipeline consists of three stages. The first two stages (SFT and Reward Model training) prepare for the RL optimization in the third stage.

Stage 1: SFT (Supervised Fine-Tuning)

Starting from the pretrained model, standard supervised fine-tuning is performed on high-quality (instruction, response) data pairs to obtain an initial policy \(\pi_{\text{SFT}}\). The purpose of this step is to give the model a reasonable starting point so that it at least learns the basic format of "answering questions."

\[ \pi_{\text{SFT}} = \arg\min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \left[ -\sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t}) \right] \]

Stage 2: Reward Model Training

This is the most critical step in RLHF: training a Reward Model to simulate human preference judgments.

Data Collection

Given a prompt \(x\), the SFT model generates multiple responses, and human annotators perform preference ranking. The most common form is pairwise comparison: given two responses \(y_w\) (preferred/winner) and \(y_l\) (dispreferred/loser), annotate \(y_w \succ y_l\).

Bradley-Terry Model

To convert discrete preference annotations into a continuous reward function, RLHF adopts the classical Bradley-Terry preference model. This model assumes that the probability of a human choosing \(y_w\) over \(y_l\) is proportional to the sigmoid of their reward difference:

\[ P(y_w \succ y_l | x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) = \frac{1}{1 + \exp\left(-(r_\phi(x, y_w) - r_\phi(x, y_l))\right)} \]

where \(r_\phi(x, y) \in \mathbb{R}\) is the Reward Model's scalar score for a (prompt, response) pair.

Intuition: The Bradley-Terry model says: if response A is much better than response B (\(r(A) \gg r(B)\)), the probability of a human choosing A approaches 1; if the two are similar (\(r(A) \approx r(B)\)), the selection probability approaches 0.5. This perfectly captures the probabilistic nature of human preferences — even when the quality gap between two responses is large, annotators occasionally "make mistakes."

Reward Model Training Objective

Maximizing the likelihood of the annotated data is equivalent to minimizing the negative log-likelihood:

\[ \mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{pref}}} \left[ \log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \right] \]

Reward Model Architecture: Typically, the language model head of a pretrained LLM (or SFT model) is replaced with a linear layer that outputs a scalar score:

\[ r_\phi(x, y) = \text{Linear}\left(\mathbf{h}_{\text{last}}^{(\text{Transformer})}(x, y)\right) \in \mathbb{R} \]

where \(\mathbf{h}_{\text{last}}\) is the hidden state at the last token position. This way, the Reward Model inherits the pretrained LLM's language understanding capabilities and only needs to learn the mapping from semantic understanding to preference scores.
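A minimal sketch of this setup, assuming a transformer backbone that already produces last_hidden_state; the class and argument names are illustrative rather than a specific library API:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Linear head mapping the hidden state at the final token to a scalar score."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state, last_token_idx):
        batch_idx = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
        h_last = last_hidden_state[batch_idx, last_token_idx]   # (B, H)
        return self.score(h_last).squeeze(-1)                   # (B,)

def reward_model_loss(r_w, r_l):
    # Negative log-likelihood of the Bradley-Terry model: -log sigma(r_w - r_l).
    return -F.logsigmoid(r_w - r_l).mean()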

From Rankings to Pairs

In practice, human annotators may perform a complete ranking of \(K\) responses: \(y_{\sigma(1)} \succ y_{\sigma(2)} \succ \cdots \succ y_{\sigma(K)}\). This can be decomposed into \(\binom{K}{2}\) pairwise comparisons for training. InstructGPT used \(K=4\) to \(K=9\), with each ranking producing 6 to 36 training samples.
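For illustration, turning one ranking into pairwise samples is a one-liner (a small sketch; ranked_responses is assumed to be ordered best to worst):

from itertools import combinations

def ranking_to_pairs(ranked_responses):
    # Each earlier element is preferred over each later one.
    return [(winner, loser) for winner, loser in combinations(ranked_responses, 2)]

# A ranking of K=4 responses yields C(4,2)=6 pairs; K=9 yields C(9,2)=36.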

Stage 3: PPO Optimization

This is the most central and complex part of RLHF. We need to use the PPO algorithm to optimize the LLM's policy so that it achieves high scores under the Reward Model while not deviating too far from the SFT model.

Optimization Objective

\[ \max_\theta \quad J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \cdot \mathbb{E}_{x \sim \mathcal{D}} \left[ D_{\text{KL}}\left(\pi_\theta(\cdot|x) \| \pi_{\text{SFT}}(\cdot|x)\right) \right] \]

This objective function has two terms:

  1. Reward maximization: Encourage the model to generate responses that receive high Reward Model scores
  2. KL penalty: Prevent the policy from deviating too far from the SFT model

The KL divergence is computed as:

\[ D_{\text{KL}}\left(\pi_\theta(\cdot|x) \| \pi_{\text{SFT}}(\cdot|x)\right) = \mathbb{E}_{y \sim \pi_\theta} \left[ \log \frac{\pi_\theta(y|x)}{\pi_{\text{SFT}}(y|x)} \right] \]

For autoregressive models, the log-probability of the entire sequence can be decomposed into a token-level sum:

\[ \log \pi_\theta(y|x) = \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t}) \]

Therefore, the KL divergence can be accumulated incrementally at the token level.
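In code, both quantities reduce to per-token log-probabilities; a minimal sketch (assuming response_mask marks the generated tokens, and the KL term is the usual sampled estimate \(\log \pi_\theta - \log \pi_{\text{SFT}}\) summed over the response):

import torch
import torch.nn.functional as F

def token_logprobs(logits, input_ids):
    # Log-probability of each realized token under the model that produced `logits`.
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)   # (B, T-1)

def sampled_sequence_kl(logp_policy, logp_ref, response_mask):
    # Monte-Carlo estimate of D_KL(pi_theta || pi_SFT) for sampled responses.
    return ((logp_policy - logp_ref) * response_mask).sum(dim=-1)        # (B,)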

The Importance of the KL Penalty

\(\beta\) is a critical hyperparameter that controls the balance between exploitation (leveraging the Reward Model for high scores) and exploration (maintaining consistency with the SFT model).

What happens without a KL constraint? The model quickly finds exploits in the Reward Model (Reward Hacking):

  • Generating excessively long responses (the Reward Model may assign higher scores to longer responses)
  • Repeatedly using certain phrases that "please" the Reward Model
  • Generating grammatically correct but substantively empty responses
  • In extreme cases, the model may even generate gibberish that receives high scores

The KL penalty ensures the model does not sacrifice basic language competence in order to "please" the Reward Model. In practice, \(\beta\) is often dynamically adjusted through an adaptive KL controller: if the KL divergence exceeds a target value, \(\beta\) is increased, and vice versa.

PPO Implementation for LLMs

Applying PPO to LLM training requires maintaining four models (or copies of models):

Four-model architecture:

1. Actor (π_θ)        -- The LLM policy being trained, with parameters continuously updated
2. Critic (V_ψ)       -- Value function, estimates state values, assists in computing Advantage
3. Reference (π_SFT)  -- Frozen copy of the SFT model, used for computing KL penalty
4. Reward Model (r_φ)  -- Frozen Reward Model, provides reward signals

This means training a single LLM requires simultaneously loading four (nearly) equally sized large models on GPUs, which is one of the primary reasons RLHF is computationally expensive.

PPO single-step update process:

Step 1 — Data Collection (Rollout): Sample a batch of prompts \(x\) from the dataset and generate responses \(y\) using the current policy \(\pi_\theta\).

Step 2 — Compute Rewards: For each \((x, y)\) pair, compute the reward \(r_\phi(x, y)\) using the Reward Model. This reward is given only at the last token (sparse reward), with zero reward for intermediate tokens. However, the KL penalty can be computed token by token:

\[ r_t = \begin{cases} -\beta \cdot \log \frac{\pi_\theta(y_t|x, y_{<t})}{\pi_{\text{SFT}}(y_t|x, y_{<t})} & t < T \\[6pt] r_\phi(x, y) - \beta \cdot \log \frac{\pi_\theta(y_T|x, y_{<T})}{\pi_{\text{SFT}}(y_T|x, y_{<T})} & t = T \end{cases} \]

That is, a per-token KL penalty is applied at every step, with the Reward Model score added at the final step.
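A sketch of this reward construction (assuming logp_policy and logp_ref are per-token log-probabilities of the generated response, rm_score is the scalar Reward Model output, and response_mask marks generated tokens):

import torch

def per_token_rewards(logp_policy, logp_ref, rm_score, response_mask, beta):
    # KL penalty at every generated token.
    rewards = -beta * (logp_policy - logp_ref) * response_mask             # (B, T)
    # Add the Reward Model score at the last generated token of each sequence.
    positions = torch.arange(response_mask.size(1), device=response_mask.device)
    last_idx = (response_mask * positions).argmax(dim=-1)
    rewards[torch.arange(rewards.size(0)), last_idx] += rm_score
    return rewards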

Step 3 — Compute Advantage: Use GAE (Generalized Advantage Estimation) to compute the Advantage value at each token position:

\[ \hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l} \]

where the TD residual is:

\[ \delta_t = r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t) \]

\(\gamma\) is the discount factor (typically set to 1.0) and \(\lambda\) is the GAE trade-off parameter (typically set to 0.95).
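The GAE recursion runs backward over token positions; a minimal sketch (assuming values carries one extra bootstrap entry \(V(s_{T+1})\), typically 0 after the final token):

import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    # rewards: (B, T), values: (B, T+1). Returns advantages and returns, both (B, T).
    T = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros(rewards.size(0), device=rewards.device)
    for t in reversed(range(T)):
        delta = rewards[:, t] + gamma * values[:, t + 1] - values[:, t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[:, t] = gae
    returns = advantages + values[:, :-1]                                  # targets for the Critic
    return advantages, returns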

Step 4 — PPO Clipping Update: For each token position, compute the importance ratio and clipped objective:

\[ r_t(\theta) = \frac{\pi_\theta(y_t | x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t | x, y_{<t})} \]
\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right] \]

where \(\epsilon\) is typically set to \(0.2\). The Critic's value function is updated simultaneously:

\[ L^{\text{VF}}(\psi) = \mathbb{E}_t \left[ \left( V_\psi(s_t) - \hat{R}_t \right)^2 \right] \]

where \(\hat{R}_t = \hat{A}_t + V_{\psi_{\text{old}}}(s_t)\) is the return estimated by GAE.
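Putting the clipped policy loss and the value loss together, a minimal sketch (log-probabilities and values are per token; response_mask excludes prompt and padding positions):

import torch

def ppo_losses(logp_new, logp_old, advantages, values, returns, response_mask, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -(torch.min(unclipped, clipped) * response_mask).sum() / response_mask.sum()
    value_loss = (((values - returns) ** 2) * response_mask).sum() / response_mask.sum()
    return policy_loss, value_loss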

Step 5 — Multiple Mini-Batch Updates: Multiple epochs of mini-batch updates (typically 2–4 epochs) are performed on the same batch of rollout data to improve data efficiency. This is precisely the core advantage of PPO over vanilla policy gradient methods.

Limitations of RLHF

Despite the enormous success of RLHF (ChatGPT, Claude, etc.), it suffers from several fundamental issues that have motivated various alternative methods:

  1. High cost of human annotation: Large amounts of high-quality human preference annotation data are required, and annotation in specialized domains (medicine, law, etc.) is particularly expensive
  2. The Reward Model is a bottleneck: The Reward Model itself is a finite-capacity neural network whose modeling of human preferences is imperfect. When policy optimization is pushed to the extreme, the process is less about optimizing human preferences and more about exploiting the Reward Model's weaknesses
  3. PPO training instability: PPO requires careful hyperparameter tuning (\(\beta\), \(\epsilon\), learning rate, batch size, etc.), the memory footprint of four models is enormous, and training is prone to various instabilities
  4. Mode Collapse: The model may converge to a monotonous, "safe" but uninteresting response style
  5. High engineering complexity: Managing forward passes, backward passes, and communication for four large models simultaneously makes distributed training implementation extremely complex

RLAIF (RL from AI Feedback)

Core Idea

Constitutional AI (CAI, Bai et al., 2022), proposed by Anthropic, is a representative RLAIF method. The core idea is remarkably simple: use the AI itself to replace human annotators, evaluating and improving the model's outputs according to an explicit set of "constitutional" rules.

Method Pipeline

The Constitutional AI pipeline consists of two phases:

Phase 1 — Self-Critique and Revision (Critique-Revision):

  1. Have the model generate an initial response to a potentially harmful prompt
  2. Have the model critique itself according to constitutional rules ("Does this response violate rule X?")
  3. Have the model revise its response based on the critique
  4. Use the (original response, revised response) pairs as preference data

Phase 2 — RLAIF Training:

  1. Train a Reward Model using AI-generated preference data
  2. Optimize the LLM using RL (PPO)

Scalable Oversight

The deeper significance of RLAIF lies in the concept of scalable oversight: as AI capabilities grow, it becomes increasingly difficult for humans to directly evaluate AI output quality (especially in areas like complex reasoning and code generation). Using stronger AI to supervise weaker AI, forming a recursive supervision chain, is an important approach toward achieving superintelligence alignment.

\[ \text{Human oversight} \to \text{AI-assisted oversight} \to \text{AI supervising AI} \to \text{Recursive scalable oversight} \]

DPO (Direct Preference Optimization)

DPO, proposed by Rafailov et al. (2023), is the most important simplification of RLHF in recent years. Its core finding is: we do not need to explicitly train a Reward Model or run PPO at all — preference optimization can be performed directly on the policy.

Complete Derivation from RLHF to DPO

This is the most elegant part of DPO. Starting from the RLHF optimization objective, we derive the DPO loss function step by step.

Step 1: The RLHF Optimization Objective

Recall the RLHF Stage 3 objective (with KL penalty):

\[ \max_\theta \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r(x, y) \right] - \beta \cdot D_{\text{KL}}\left(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right) \]

Here \(\pi_{\text{ref}}\) is the reference model (typically the SFT model). Expanding the KL divergence:

\[ \max_\theta \quad \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta} \left[ r(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right] \]

Step 2: Closed-Form Solution for the Optimal Policy

This is a KL-constrained optimization problem. For a fixed \(x\), we want to maximize over the distribution \(\pi_\theta(\cdot|x)\):

\[ \max_{\pi} \quad \mathbb{E}_{y \sim \pi} \left[ r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right] \]

This is a classical variational problem. Taking the functional derivative with respect to \(\pi(y|x)\) and setting it to zero (with the normalization constraint \(\sum_y \pi(y|x) = 1\)), we obtain the closed-form solution for the optimal policy:

\[ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x, y)\right) \]

where the partition function is:

\[ Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{1}{\beta} r(x, y)\right) = \mathbb{E}_{y \sim \pi_{\text{ref}}} \left[ \exp\left(\frac{1}{\beta} r(x, y)\right) \right] \]

Intuition: The optimal policy is a "reweighted version" of the reference model. For responses with high reward, the probability is exponentially amplified; for responses with low reward, the probability is exponentially suppressed. \(\beta\) controls the degree of this amplification/suppression — smaller \(\beta\) concentrates the optimal policy on high-reward responses; larger \(\beta\) keeps it closer to the reference model.

Step 3: Recovering the Reward from the Optimal Policy

From the closed-form solution above, we can solve for the reward:

\[ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \cdot \exp\left(\frac{r(x,y)}{\beta}\right) \]

Taking the logarithm of both sides:

\[ \log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) + \frac{1}{\beta} r(x, y) - \log Z(x) \]

Rearranging, we obtain the closed-form expression for the reward:

\[ \boxed{r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)} \]

This is the most critical step in the DPO derivation. It tells us: the reward can be expressed entirely in terms of the log-probability ratio between the optimal policy and the reference policy. The partition function \(Z(x)\) depends only on the prompt \(x\) and is independent of the response \(y\).

Step 4: Substituting into the Bradley-Terry Model

Substituting the reward expression into the preference model:

\[ P(y_w \succ y_l | x) = \sigma\left(r(x, y_w) - r(x, y_l)\right) \]
\[ = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \cancel{\beta \log Z(x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \cancel{\beta \log Z(x)}\right) \]

The key observation: \(Z(x)\) cancels perfectly when taking the difference! This means we do not need to compute this intractable partition function.

\[ P(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \]

Step 5: The DPO Loss Function

Now, we parameterize the optimal policy \(\pi^*\) with the trainable policy \(\pi_\theta\) and maximize the likelihood of the preference data:

\[ \boxed{\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{pref}}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \right]} \]

Defining the implicit reward:

\[ \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \]

The DPO loss can be written concisely as:

\[ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E} \left[ \log \sigma\left(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\right) \right] \]

This is formally identical to the Reward Model training objective! The only difference is that the reward is no longer the output of a separate network but rather the log-probability ratio of the policy itself.
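The whole algorithm therefore fits in a few lines; a minimal sketch (assuming the four inputs are summed sequence log-probabilities of the chosen/rejected responses under the policy and the frozen reference model):

import torch.nn.functional as F

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    chosen_reward = beta * (logp_w - logp_ref_w)       # implicit reward of preferred response
    rejected_reward = beta * (logp_l - logp_ref_l)     # implicit reward of dispreferred response
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()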

Gradient Analysis of DPO

The gradient of the DPO loss with respect to parameters \(\theta\) is:

\[ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \cdot \mathbb{E} \left[ \underbrace{\sigma\left(-\hat{r}_\theta(x, y_w) + \hat{r}_\theta(x, y_l)\right)}_{\text{weight: degree to which the model errs}} \left( \underbrace{\nabla_\theta \log \pi_\theta(y_w|x)}_{\text{increase preferred probability}} - \underbrace{\nabla_\theta \log \pi_\theta(y_l|x)}_{\text{decrease dispreferred probability}} \right) \right] \]

Intuition: The gradient direction increases the probability of preferred responses and decreases the probability of dispreferred responses. The gradient magnitude is controlled by \(\sigma(-\hat{r}_\theta(x, y_w) + \hat{r}_\theta(x, y_l))\) — when the model already correctly assigns a higher implicit reward to the preferred response, this weight approaches 0 and the gradient is small; when the model "errs," the weight approaches 1 and the gradient is large. This is a natural adaptive learning rate mechanism.

DPO vs RLHF

| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Training stages | Three stages (SFT + RM + RL) | Two stages (SFT + DPO) |
| Requires Reward Model | Yes (separately trained) | No (implicit reward) |
| Requires online sampling | Yes (PPO needs rollouts) | No (directly uses offline preference data) |
| GPU memory | Very high (4 models loaded simultaneously) | Lower (2 models: \(\pi_\theta\) + \(\pi_{\text{ref}}\)) |
| Training stability | Poor, requires careful tuning | Good, similar to standard supervised learning |
| Theoretical equivalence | Approximately optimal | Equivalent to RLHF's closed-form optimum |
| Empirical performance | Generally stronger (due to online exploration) | Close to RLHF, on par or better in some settings |
| Engineering complexity | Very high | Low |

Limitations of DPO

  1. Reference model dependency: DPO requires a frozen reference model to compute probability ratios. This increases memory overhead, and the quality of the reference model directly affects DPO's performance
  2. Sensitivity to offline data quality: DPO relies entirely on offline preference data; if the data distribution diverges significantly from the current policy, learning performance degrades
  3. Length Bias: DPO tends to generate longer responses, because longer responses exhibit larger differences in log-probability space
  4. Lack of online exploration: Unlike PPO, DPO does not perform online sampling and exploration, potentially missing good responses outside the coverage of the preference data
  5. Overfitting risk: On small-scale preference data, DPO is prone to overfitting, leading to reduced generalization

GRPO (Group Relative Policy Optimization)

GRPO was proposed by the DeepSeek team (Shao et al., 2024) and is the core training algorithm behind DeepSeek-R1. GRPO's design philosophy is: eliminate the Value function entirely and estimate Advantage through within-group relative ranking, so that the reward can come from a Reward Model or directly from verifiable rules.

Motivation

The main pain points of PPO in LLM training:

  1. Difficulty of training the Value function: The state space of LLMs is extremely complex (prompt + generated tokens), making training an accurate Value function a significant challenge in itself. Moreover, the Value function's parameter count is typically comparable to the LLM's, further increasing memory overhead
  2. Limitations of the Reward Model: Training a good Reward Model requires large amounts of high-quality preference data, and the Reward Model is susceptible to exploitation

GRPO's core idea: since it is difficult to judge the "absolute quality" of a single response, it is better to let multiple responses "compete" within a group and use relative ranking to determine the optimization direction.

Algorithm Pipeline

For each prompt \(x\), GRPO performs the following steps:

Step 1 — Group Sampling: Use the current policy \(\pi_{\theta_{\text{old}}}\) to generate a group of \(G\) responses for the same prompt:

\[ \{y_1, y_2, \ldots, y_G\} \sim \pi_{\theta_{\text{old}}}(\cdot | x) \]

Typical values for \(G\) range from \(8\) to \(64\).

Step 2 — Compute Rewards: Score each response using a reward function (which can be a Reward Model or a verifiable reward):

\[ r_i = r(x, y_i), \quad i = 1, 2, \ldots, G \]

Step 3 — Within-Group Normalized Advantage: This is GRPO's core innovation. The rewards are z-score normalized within the group to obtain Advantage estimates:

\[ \hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})} \]

That is:

\[ \hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r}, \quad \text{where} \quad \bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j, \quad \sigma_r = \sqrt{\frac{1}{G}\sum_{j=1}^{G}(r_j - \bar{r})^2} \]

Intuition: Within a group of responses, those above average receive positive Advantage (and are encouraged), while those below average receive negative Advantage (and are suppressed). Normalization ensures that the Advantage scale is consistent across different prompts, preventing prompts with larger absolute reward values from dominating training.

This method is essentially an instance of REINFORCE with baseline, where the baseline \(b(x) = \bar{r}\) is the within-group mean reward. Compared to a traditional learned baseline (i.e., a Value function), this baseline has slightly higher variance but requires no additional training whatsoever.
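A sketch of the within-group normalization (the small eps guarding against a zero standard deviation is an implementation detail, not part of the formula):

import torch

def group_advantages(rewards, eps=1e-6):
    # rewards: (G,) scores of G responses to the same prompt.
    mean = rewards.mean()
    std = rewards.std(unbiased=False)      # population std, matching the formula above
    return (rewards - mean) / (std + eps)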

Step 4 — Policy Optimization: A PPO-clip-style objective function is used, but with a sequence-level Advantage (every token of response \(y_i\) shares the same \(\hat{A}_i\)) rather than PPO's per-token Advantage:

\[ \mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \left[ \min\left(r_{i,t}(\theta) \hat{A}_i, \; \text{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_i \right) \right] \]

where \(r_{i,t}(\theta)\) is the importance ratio for the \(t\)-th token of the \(i\)-th response:

\[ r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} | x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} | x, y_{i,<t})} \]

Note that \(\hat{A}_i\) is a sequence-level Advantage (the entire response shares the same Advantage value), while clipping is performed at the token level.

KL Regularization: GRPO also requires a KL penalty to prevent the policy from drifting too far. GRPO uses an approximate KL divergence estimate:

\[ D_{\text{KL}}^{\text{approx}} = \frac{\pi_{\text{ref}}(y_t | x, y_{<t})}{\pi_\theta(y_t | x, y_{<t})} - \log \frac{\pi_{\text{ref}}(y_t | x, y_{<t})}{\pi_\theta(y_t | x, y_{<t})} - 1 \]

This per-token estimator of \(D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\) is unbiased and always non-negative, and it closely tracks the exact KL divergence when the two distributions are close, which makes it a convenient substitute for the standard KL term.
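A sketch combining the clipped objective with this KL term (per-token log-probabilities for the \(G\) responses are assumed to be padded into (G, T) tensors with response_mask; the coefficient kl_coef is illustrative):

import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, response_mask, eps=0.2, kl_coef=0.04):
    adv = advantages.unsqueeze(-1)                         # sequence-level advantage, broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    pg_term = torch.min(ratio * adv, clipped * adv)
    # Approximate per-token KL(pi_theta || pi_ref): r - log r - 1 with r = pi_ref / pi_theta.
    log_ratio_ref = logp_ref - logp_new
    kl_term = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    per_token = pg_term - kl_coef * kl_term
    per_seq = (per_token * response_mask).sum(dim=-1) / response_mask.sum(dim=-1)
    return -per_seq.mean()                                 # negate: we minimize the loss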

Connection Between GRPO and REINFORCE

From a theoretical perspective, GRPO can be viewed as a variant of REINFORCE with baseline. The REINFORCE gradient estimator is:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(\tau) \cdot (R(\tau) - b) \right] \]

GRPO's key improvements are:

  1. Within-group mean as baseline: \(b = \bar{r}\), eliminating the need to learn a Value function
  2. Normalization: Dividing by the within-group standard deviation \(\sigma_r\) adaptively adjusts the gradient scale
  3. PPO-clip for training stability: Using clipping instead of raw policy gradients to prevent large updates

Summary of GRPO's Advantages

  • No Reward Model required (can use one, but it is not necessary — rule-based or verifiable rewards can be used directly)
  • No Value function needed (completely eliminates the Critic network)
  • Memory efficient: only needs to load Actor + Reference, two models
  • Simple implementation: far less code than PPO-based RLHF
  • Demonstrated strong reasoning capability improvement in DeepSeek-R1

Other Preference Optimization Methods

IPO (Identity Preference Optimization)

IPO, proposed by Azar et al. (2023), aims to address a theoretical flaw of DPO: DPO assumes that preference data perfectly follows the Bradley-Terry model, but actual human preference data is full of noise.

During training, DPO may overconfidently fit the preference data, causing the implicit reward magnitudes to trend toward infinity. IPO modifies the loss function to regularize this behavior:

\[ \mathcal{L}_{\text{IPO}}(\theta) = \mathbb{E}_{(x, y_w, y_l)} \left[ \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta} \right)^2 \right] \]

Intuition: IPO replaces DPO's log-sigmoid loss with an MSE loss. The MSE loss "pulls" the implicit reward difference toward a target value of \(\frac{1}{2\beta}\), rather than allowing it to grow without bound as in DPO. This effectively applies implicit regularization to the implicit reward.
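A minimal sketch of this loss (inputs are summed sequence log-probabilities of the two responses under the policy and the reference model, as in the DPO sketch above):

def ipo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    margin = (logp_w - logp_ref_w) - (logp_l - logp_ref_l)   # log-ratio difference
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()       # pulled toward the target 1/(2*beta)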

KTO (Kahneman-Tversky Optimization)

KTO, proposed by Ethayarajh et al. (2024), addresses a very practical problem: in many scenarios, obtaining pairwise preference annotations (\(y_w\) vs \(y_l\)) is difficult, but obtaining independent good/bad labels (thumbs up / thumbs down) is relatively easy.

KTO's name derives from Kahneman and Tversky's Prospect Theory, which posits that humans are more sensitive to losses than to gains.

KTO's loss function handles "good" and "bad" samples separately:

\[ \mathcal{L}_{\text{KTO}}(\theta) = \mathbb{E}_{(x,y)} \left[ w(y) \cdot \left(1 - v_\theta(x, y) \right) \right] \]

where:

\[ v_\theta(x, y) = \begin{cases} \sigma\left(\beta \cdot \hat{r}_\theta(x, y) - z_{\text{ref}}\right) & \text{if } y \text{ is desirable} \\[6pt] \sigma\left(z_{\text{ref}} - \beta \cdot \hat{r}_\theta(x, y)\right) & \text{if } y \text{ is undesirable} \end{cases} \]
\[ z_{\text{ref}} = \mathbb{E}_{(x', y') \sim \mathcal{D}} \left[ \beta \cdot D_{\text{KL}}\left(\pi_\theta(y'|x') \| \pi_{\text{ref}}(y'|x')\right) \right] \]

Here \(z_{\text{ref}}\) is a baseline estimated by the expected KL divergence across the dataset. \(w(y)\) is a weighting term that assigns higher weight to undesirable samples (reflecting Prospect Theory's "loss aversion").

KTO's core advantage: Lower data requirements — only binary labels (good/bad) are needed, not pairwise comparisons. This substantially reduces data collection costs.

ORPO (Odds Ratio Preference Optimization)

ORPO, proposed by Hong et al. (2024), goes a step further: it merges SFT and preference optimization into a single training stage and does not require a reference model.

ORPO's loss function:

\[ \mathcal{L}_{\text{ORPO}}(\theta) = \underbrace{\mathcal{L}_{\text{SFT}}(\theta)}_{\text{SFT term}} + \lambda \cdot \underbrace{\mathcal{L}_{\text{OR}}(\theta)}_{\text{Odds Ratio term}} \]

where the Odds Ratio term is:

\[ \mathcal{L}_{\text{OR}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left(\log \frac{\text{odds}_\theta(y_w|x)}{\text{odds}_\theta(y_l|x)} \right) \right] \]
\[ \text{odds}_\theta(y|x) = \frac{P_\theta(y|x)}{1 - P_\theta(y|x)} \]

Core idea: By replacing DPO's log-probability ratio with an odds ratio, the reference model becomes unnecessary. The SFT term simultaneously provides a regularization effect, preventing the policy from drifting too far.
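A sketch of the Odds Ratio term, assuming avg_logp_w and avg_logp_l are length-averaged per-token log-probabilities (a common implementation convention; strictly negative, so the log-odds are well defined):

import torch
import torch.nn.functional as F

def odds_ratio_loss(avg_logp_w, avg_logp_l):
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability.
    log_odds_w = avg_logp_w - torch.log1p(-torch.exp(avg_logp_w))
    log_odds_l = avg_logp_l - torch.log1p(-torch.exp(avg_logp_l))
    return -F.logsigmoid(log_odds_w - log_odds_l).mean()

# Full ORPO loss: standard SFT cross-entropy on y_w plus lambda times this term.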

SimPO (Simple Preference Optimization)

SimPO, proposed by Meng et al. (2024), further simplifies DPO with two key modifications:

  1. Length-normalized log-probabilities replace probability ratios: \(\hat{r}_\theta(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)\), eliminating the need for a reference model
  2. Introduction of a margin term: A fixed margin \(\gamma\) is added between the preferred and dispreferred rewards
\[ \mathcal{L}_{\text{SimPO}}(\theta) = -\mathbb{E} \left[ \log \sigma\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma \right) \right] \]

Length normalization elegantly resolves DPO's length bias problem.
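A minimal sketch (len_w and len_l are response lengths in tokens; the default values of \(\beta\) and \(\gamma\) are illustrative, not the paper's recommended settings):

import torch.nn.functional as F

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    r_w = beta * logp_w / len_w        # length-normalized implicit reward, no reference model
    r_l = beta * logp_l / len_l
    return -F.logsigmoid(r_w - r_l - gamma).mean()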


RLVR (Reinforcement Learning with Verifiable Rewards)

RLVR is one of the key technologies behind the success of DeepSeek-R1 (DeepSeek, 2025) and a core method connecting RL with reasoning capabilities.

Core Idea

In domains such as mathematics, programming, and logical reasoning, the correctness of answers is verifiable:

  • Mathematical problems can be verified by checking whether the final answer is correct
  • Code can be verified through test cases
  • Logic problems can be verified by checking whether the conclusion holds

This means we can completely skip "human annotation" and "Reward Model training" and directly use correctness as the RL reward signal.

Outcome-Based Reward (ORM)

The simplest form: only care about whether the final answer is correct.

\[ r_{\text{outcome}}(x, y) = \begin{cases} 1 & \text{if final answer is correct} \\ 0 & \text{if final answer is incorrect} \end{cases} \]

More nuanced rewards can also be designed, for example incorporating format compliance:

\[ r(x, y) = r_{\text{accuracy}}(x, y) + \alpha \cdot r_{\text{format}}(x, y) \]

where \(r_{\text{format}}\) checks whether the model follows a specified chain-of-thought format (e.g., placing the reasoning process inside <think> tags).
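A toy example of such a rule-based reward, assuming the final answer is wrapped in \boxed{...} and the reasoning in <think> tags (both conventions are assumptions for illustration):

import re

def verifiable_reward(response: str, gold_answer: str, alpha: float = 0.1) -> float:
    # Format bonus: reasoning must appear inside <think>...</think> tags.
    has_format = bool(re.search(r"<think>.*?</think>", response, flags=re.DOTALL))
    # Accuracy: compare the extracted \boxed{...} answer to the reference answer.
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    answer = m.group(1).strip() if m else ""
    accuracy = 1.0 if answer == gold_answer.strip() else 0.0
    return accuracy + alpha * (1.0 if has_format else 0.0)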

Process-Based Reward (PRM)

A more fine-grained reward approach: evaluate each step of the reasoning process.

Let the reasoning process consist of \(K\) steps: \(y = (s_1, s_2, \ldots, s_K)\). The PRM assigns a score to each step:

\[ r_{\text{process}}(x, y) = \sum_{k=1}^{K} r_k(x, s_1, \ldots, s_k) \]

Or take the minimum (the weakest link determines overall quality):

\[ r_{\text{process}}(x, y) = \min_{1 \le k \le K} r_k(x, s_1, \ldots, s_k) \]

PRM's advantage is providing denser reward signals (dense reward), helping to mitigate the sparse reward problem. However, training a PRM itself requires step-level annotation data, which is more costly.

DeepSeek-R1's Training Methodology

DeepSeek-R1 demonstrated an exciting technical path: without any human-annotated chain-of-thought data, relying solely on GRPO + verifiable rewards, complex reasoning capabilities can emerge in an LLM.

Training pipeline:

DeepSeek-R1 Training Pipeline:

Stage 1: Cold Start
    - Collect a small amount of long chain-of-thought data for SFT
    - Teach the model the basic "think then answer" format

Stage 2: Reasoning RL
    - Train with GRPO + verifiable rewards on math/code tasks
    - Reward signal: answer correctness + format compliance
    - Reflection, verification, backtracking, and self-correction emerge spontaneously

Stage 3: Rejection Sampling + SFT
    - Use the Stage 2 model to generate large amounts of reasoning data
    - Keep only correct, high-quality reasoning trajectories
    - Mix with general SFT data for further fine-tuning

Stage 4: Second Round of RL
    - Combine reasoning rewards + helpfulness/safety rewards
    - Final alignment

A striking finding from DeepSeek-R1 is that during Stage 2, the following reasoning behaviors emerged spontaneously through pure RL training, none of which were explicitly taught in the training data:

  • Reflection: "Let me re-examine this step..."
  • Backtracking: "The approach above doesn't work, let me try a different angle..."
  • Self-verification: "Substituting back into the original equation to verify..."
  • Decomposition: "First solve sub-problem A, then..."

This paradigm of eliciting reasoning capabilities through RL is highly consistent with the technical direction of OpenAI's o1/o3 series models, representing an important direction for LLM post-training.


Comparative Summary of Methods

| Method | Reward Model | Value Function | Reference Model | Data Requirements | Training Complexity | Representative Models | Core Advantage |
|---|---|---|---|---|---|---|---|
| RLHF (PPO) | Required | Required | Required | Pairwise preferences | Very high | ChatGPT, Claude | Strong performance, mature theory |
| DPO | Not required | Not required | Required | Pairwise preferences | Low | Zephyr, many open-source models | Simple and efficient, theoretically equivalent to RLHF |
| GRPO | Optional | Not required | Required | Verifiable rewards | Medium | DeepSeek-R1 | No Value function needed, suited for reasoning tasks |
| KTO | Not required | Not required | Required | Binary labels | Low | - | Lowest data requirements |
| IPO | Not required | Not required | Required | Pairwise preferences | Low | - | Robust to noisy preference data |
| ORPO | Not required | Not required | Not required | Pairwise preferences | Low | - | No reference model, unified with SFT |
| SimPO | Not required | Not required | Not required | Pairwise preferences | Low | - | Resolves length bias |
| RLVR | Not required | Not required | Required | Verifiable labels | Medium | DeepSeek-R1, o1 | No human annotation needed, suited for reasoning |

A clear evolutionary trend emerges: from complex to simple, from human-dependent to automated.

\[ \underbrace{\text{RLHF}}_{\text{4 models, human annotation}} \to \underbrace{\text{DPO}}_{\text{2 models, preference data}} \to \underbrace{\text{ORPO/SimPO}}_{\text{1 model, preference data}} \to \underbrace{\text{RLVR}}_{\text{0 human annotation, verifiable rewards}} \]

Frontier Directions

Online DPO / Iterative DPO

Standard DPO uses offline preference data, which means the distribution of preference data may not match the current policy's distribution (distribution shift). The idea behind Online DPO is:

  1. Generate new responses using the current policy \(\pi_\theta\)
  2. Perform preference annotation on the new responses using a Reward Model (or AI/humans)
  3. Update with DPO using the new preference data
  4. Repeat

This essentially transforms DPO from off-policy to on-policy, bridging the theoretical gap between DPO and RLHF. Experiments show that Online DPO typically outperforms standard offline DPO and, on certain benchmarks, approaches or even surpasses PPO-based RLHF.
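A schematic of one round of this loop; the sampling, judging, and update steps are passed in as callables because their concrete implementations vary (all names here are placeholders, not a specific library API):

from typing import Callable, List, Tuple

def online_dpo_round(
    prompts: List[str],
    sample: Callable[[str], Tuple[str, str]],                 # current policy: prompt -> two responses
    judge: Callable[[str, str, str], Tuple[str, str]],        # RM / AI / human: returns (winner, loser)
    dpo_update: Callable[[List[Tuple[str, str, str]]], None], # one DPO step on (x, y_w, y_l) triples
) -> None:
    preference_batch = []
    for x in prompts:
        y1, y2 = sample(x)                  # on-policy generation
        y_w, y_l = judge(x, y1, y2)         # fresh preference labels
        preference_batch.append((x, y_w, y_l))
    dpo_update(preference_batch)            # standard DPO update on the new data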

Self-Play (SPIN)

SPIN (Self-Play fIne-tuNing, Chen et al., 2024) has the core idea of letting the model play against its own previous version:

  • Treat the previous-round model's outputs as "dispreferred" responses
  • Treat ground truth as "preferred" responses
  • Train using a DPO-style loss function
\[ \mathcal{L}_{\text{SPIN}}(\theta) = -\mathbb{E} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_{\text{gt}}|x)}{\pi_{\text{ref}}(y_{\text{gt}}|x)} - \beta \log \frac{\pi_\theta(y_{\text{old}}|x)}{\pi_{\text{ref}}(y_{\text{old}}|x)} \right) \right] \]

where \(y_{\text{gt}}\) is the ground truth response and \(y_{\text{old}} \sim \pi_{\theta_{\text{old}}}\) is the response generated by the old model.

SPIN's advantage is that it requires no preference annotation data at all — only ground truth responses are needed. However, as training progresses and the model's output increasingly approximates the ground truth, the learning signal gradually weakens until convergence.

RL for Reasoning (o1-style)

OpenAI's o1/o3 series and DeepSeek-R1 mark the emergence of a new paradigm: using RL to train LLM reasoning capabilities, rather than merely for alignment.

In this paradigm, the role of RL undergoes a fundamental shift:

  • Traditional RLHF: RL is used for alignment — making the model's outputs conform to human preferences
  • RL for Reasoning: RL is used for capability enhancement — teaching the model deeper reasoning

Key technical elements:

  1. Long Chain-of-Thought (Long CoT): Allow the model to perform extended internal reasoning before answering
  2. Verifiable rewards: The correctness of math/code provides a natural RL reward signal
  3. Test-time Compute Scaling: Improve performance by investing more computation during inference

This yields a profound insight: scaling is not limited to parameter count and data volume during training — computation at inference time can also be traded for better performance. This is the so-called "test-time compute scaling."

Multi-Turn RLHF

Traditional RLHF considers only single-turn dialogues \((x, y)\), but in real applications LLMs need to maintain consistency across multi-turn conversations. Multi-Turn RLHF extends the RL state to the full dialogue history:

\[ s_t = (x_1, y_1, x_2, y_2, \ldots, x_t) \]

The challenges include:

  • The credit assignment problem in multi-turn dialogues (which turn's response deserves credit for a good final outcome?)
  • Rapid expansion of the dialogue state space
  • Greater difficulty of human preference annotation (requiring evaluation of the entire conversation's quality)

Constitutional AI at Scale

Taking the Constitutional AI philosophy to its ultimate conclusion: using a comprehensive value system to guide AI behavior, rather than relying on limited human preference annotations. This touches on deeper philosophical questions: What constitutes "good" AI behavior? How do we find balance across different cultures and different value systems?


References

