LLM Post-Training
LLM post-training is a cutting-edge field that applies reinforcement learning to large language model alignment. A pretrained LLM is essentially a "next-token predictor" — it learns to predict the probability distribution of the next token over massive text corpora, but this does not mean it will behave in accordance with human expectations. The goal of post-training is to use various reinforcement learning methods to transform the model from "being able to talk" to "talking sensibly."
These notes approach the topic from an RL perspective, providing in-depth coverage of post-training methods ranging from RLHF to GRPO and from DPO to RLVR, with an emphasis on mathematical derivations and algorithmic principles.
Why Post-Training Is Needed
Limitations of Pretraining
The objective function of pretraining is the standard language modeling loss:
\[ \mathcal{L}_{\text{LM}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{|x|} \log P_\theta(x_t \mid x_{<t})\right] \]
This objective teaches the model "given the preceding context, which token is most likely to appear." However, there is a fundamental gap between "the most likely token" and "the token a human would want to see":
- Internet text is rife with harmful, false, and biased content, and the model faithfully learns these distributions
- A language model has no concept of "refusal" — it simply continues completing text
- Among multiple reasonable answers, the model cannot judge which one better aligns with human preferences
The Necessity and Limitations of SFT
SFT (Supervised Fine-Tuning) is the first step of post-training, performing supervised learning on high-quality (instruction, response) data pairs:
\[ \mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{SFT}}}\left[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\right] \]
SFT teaches the model the basic format and behavioral patterns for "answering questions," but it has fundamental limitations:
- Scarcity of supervision signal: High-quality labeled data is extremely expensive, especially answers that require domain experts to write
- Limited generalization: SFT can only imitate answer patterns present in the training data and cannot generalize to new scenarios
- Inability to distinguish quality: SFT treats all training data equally and cannot learn "which answer is better"
- Exposure Bias: During training the model only sees correct prefixes, but during inference it may generate incorrect prefixes, leading to error accumulation
The RL Perspective on Post-Training
Framing LLM post-training as a reinforcement learning problem is an extremely natural formulation:
| RL Concept | Counterpart in LLM Post-Training |
|---|---|
| Agent | LLM |
| Policy \(\pi_\theta\) | The LLM's parameterized conditional distribution \(P_\theta(y \mid x)\) |
| State \(s_t\) | Prompt \(x\) + previously generated tokens \(y_{<t}\) |
| Action \(a_t\) | Next token \(y_t\) |
| Trajectory \(\tau\) | Complete generated sequence \(y = (y_1, y_2, \ldots, y_T)\) |
| Reward \(r\) | Human preference score / Reward Model output / Verifiable reward |
| Environment | Dialogue context + evaluation mechanism |
Note several distinctive characteristics of this RL problem:
- Enormous action space: Vocabulary size is typically \(32{,}000 \sim 150{,}000\), far exceeding that of typical RL problems
- Sparse Reward: Rewards are usually given only at the end of generation, with no reward signal for intermediate steps
- Deterministic environment: Given a state and action, the next state is deterministic (autoregressive concatenation)
- Policy is the generation process: All stochasticity in the policy comes from token sampling
RLHF (Reinforcement Learning from Human Feedback)
RLHF is the alignment method systematically introduced by InstructGPT (Ouyang et al., 2022) and the core training technique behind ChatGPT. Its central idea is: use human preference feedback to construct reward signals, then optimize the LLM's policy with RL algorithms.
The complete RLHF pipeline consists of three stages. The first two stages (SFT and Reward Model training) prepare for the RL optimization in the third stage.
Stage 1: SFT (Supervised Fine-Tuning)
Starting from the pretrained model, standard supervised fine-tuning is performed on high-quality (instruction, response) data pairs to obtain an initial policy \(\pi_{\text{SFT}}\). The purpose of this step is to give the model a reasonable starting point so that it at least learns the basic format of "answering questions."
Stage 2: Reward Model Training
This is the most critical step in RLHF: training a Reward Model to simulate human preference judgments.
Data Collection
Given a prompt \(x\), the SFT model generates multiple responses, and human annotators perform preference ranking. The most common form is pairwise comparison: given two responses \(y_w\) (preferred/winner) and \(y_l\) (dispreferred/loser), annotate \(y_w \succ y_l\).
Bradley-Terry Model
To convert discrete preference annotations into a continuous reward function, RLHF adopts the classical Bradley-Terry preference model. This model assumes that the probability of a human choosing \(y_w\) over \(y_l\) is given by the sigmoid of their reward difference:
\[ P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) = \frac{\exp\big(r_\phi(x, y_w)\big)}{\exp\big(r_\phi(x, y_w)\big) + \exp\big(r_\phi(x, y_l)\big)} \]
where \(r_\phi(x, y) \in \mathbb{R}\) is the Reward Model's scalar score for a (prompt, response) pair.
Intuition: if response A is much better than response B (\(r(A) \gg r(B)\)), the probability that a human chooses A approaches 1; if the two are similar (\(r(A) \approx r(B)\)), the choice probability approaches 0.5. This captures the probabilistic nature of human preferences: even when the quality gap between two responses is large, annotators occasionally "make mistakes."
Reward Model Training Objective
Maximizing the likelihood of the annotated data is equivalent to minimizing the negative log-likelihood:
\[ \mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big] \]
Reward Model Architecture: Typically, the language model head of a pretrained LLM (or SFT model) is replaced with a linear layer that outputs a scalar score:
\[ r_\phi(x, y) = \mathbf{w}^\top \mathbf{h}_{\text{last}}(x, y) \]
where \(\mathbf{h}_{\text{last}}\) is the hidden state at the last token position. This way, the Reward Model inherits the pretrained LLM's language understanding capabilities and only needs to learn the mapping from semantic understanding to preference scores.
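The Bradley-Terry probability and the resulting pairwise loss can be sketched numerically. This is a minimal pure-Python illustration (the function names are ours; a real implementation would operate on batched tensors):

```python
import math

def sigmoid(z):
    # logistic function
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(r_w, r_l):
    # Bradley-Terry: P(y_w > y_l) = sigmoid(r(x, y_w) - r(x, y_l))
    return sigmoid(r_w - r_l)

def rm_pairwise_loss(r_w, r_l):
    # negative log-likelihood of the annotated preference y_w > y_l
    return -math.log(bt_preference_prob(r_w, r_l))
```

Equal scores give probability 0.5, and the loss shrinks as the score margin grows, matching the intuition above.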
From Rankings to Pairs
In practice, human annotators may perform a complete ranking of \(K\) responses: \(y_{\sigma(1)} \succ y_{\sigma(2)} \succ \cdots \succ y_{\sigma(K)}\). This can be decomposed into \(\binom{K}{2}\) pairwise comparisons for training. InstructGPT used \(K=4\) to \(K=9\), with each ranking producing 6 to 36 training samples.
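Decomposing a \(K\)-way ranking into training pairs is a one-liner; a sketch, assuming the input list is ordered best to worst:

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    # every earlier (better) response beats every later one,
    # yielding C(K, 2) (winner, loser) training pairs
    return list(combinations(ranked, 2))
```

For \(K = 4\) this yields 6 pairs and for \(K = 9\) it yields 36, matching the InstructGPT numbers above.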
Stage 3: PPO Optimization
This is the most central and complex part of RLHF. We need to use the PPO algorithm to optimize the LLM's policy so that it achieves high scores under the Reward Model while not deviating too far from the SFT model.
Optimization Objective
\[ \max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{SFT}}(y \mid x)\big] \]
This objective function has two terms:
- Reward maximization: Encourage the model to generate responses that receive high Reward Model scores
- KL penalty: Prevent the policy from deviating too far from the SFT model
The KL divergence is computed as:
\[ \mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{SFT}}\big] = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}\right] \]
For autoregressive models, the log-probability of the entire sequence can be decomposed into a token-level sum:
\[ \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}) \]
Therefore, the KL divergence can be accumulated incrementally at the token level.
The Importance of the KL Penalty
\(\beta\) is a critical hyperparameter that controls the trade-off between reward maximization (exploiting the Reward Model for high scores) and staying close to the SFT model (preserving the language competence acquired earlier).
What happens without a KL constraint? The model quickly finds exploits in the Reward Model (Reward Hacking):
- Generating excessively long responses (the Reward Model may assign higher scores to longer responses)
- Repeatedly using certain phrases that "please" the Reward Model
- Generating grammatically correct but substantively empty responses
- In extreme cases, the model may even generate gibberish that receives high scores
The KL penalty ensures the model does not sacrifice basic language competence in order to "please" the Reward Model. In practice, \(\beta\) is often dynamically adjusted through an adaptive KL controller: if the KL divergence exceeds a target value, \(\beta\) is increased, and vice versa.
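The adaptive controller mentioned above can be sketched as follows. This loosely follows the proportional-update scheme used in PPO-for-LLM implementations; the class and parameter names are ours:

```python
class AdaptiveKLController:
    # Raises beta when the measured KL exceeds the target, lowers it otherwise.
    def __init__(self, init_beta, target_kl, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # proportional error, clipped to avoid abrupt swings in beta
        error = min(max(current_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
```

The clipping of the error term and the `horizon` divisor both keep \(\beta\) changing slowly, so the penalty strength drifts rather than oscillates.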
PPO Implementation for LLMs
Applying PPO to LLM training requires maintaining four models (or copies of models):
Four-model architecture:
1. Actor (π_θ) -- The LLM policy being trained, with parameters continuously updated
2. Critic (V_ψ) -- Value function, estimates state values, assists in computing Advantage
3. Reference (π_SFT) -- Frozen copy of the SFT model, used for computing KL penalty
4. Reward Model (r_φ) -- Frozen Reward Model, provides reward signals
This means training a single LLM requires simultaneously loading four (nearly) equally sized large models on GPUs, which is one of the primary reasons RLHF is computationally expensive.
PPO single-step update process:
Step 1 — Data Collection (Rollout): Sample a batch of prompts \(x\) from the dataset and generate responses \(y\) using the current policy \(\pi_\theta\).
Step 2 — Compute Rewards: For each \((x, y)\) pair, compute the reward \(r_\phi(x, y)\) using the Reward Model. This reward is given only at the last token (sparse reward), with zero reward for intermediate tokens. However, the KL penalty can be computed token by token:
\[ r_t = -\beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{SFT}}(y_t \mid x, y_{<t})} + \mathbb{1}[t = T]\, r_\phi(x, y) \]
That is, a per-token KL penalty is applied at every step, with the Reward Model score added at the final step.
Step 3 — Compute Advantage: Use GAE (Generalized Advantage Estimation) to compute the Advantage value at each token position:
\[ \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l\, \delta_{t+l} \]
where the TD residual is:
\[ \delta_t = r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t) \]
\(\gamma\) is the discount factor (typically set to 1.0) and \(\lambda\) is the GAE trade-off parameter (typically set to 0.95).
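The backward GAE recursion in plain Python (list-based for clarity; production code vectorizes this over the batch):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 past the end;
    # A_t = delta_t + gamma * lam * A_{t+1}, computed right to left
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With \(\gamma = \lambda = 1\), the deltas telescope and the advantage at each position reduces to the reward-to-go minus the current value estimate.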
Step 4 — PPO Clipping Update: For each token position, compute the importance ratio and clipped objective:
\[ r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}, \qquad \mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t\big)\Big] \]
where \(\epsilon\) is typically set to \(0.2\). The Critic's value function is updated simultaneously:
\[ \mathcal{L}^{V}(\psi) = \mathbb{E}_t\big[\big(V_\psi(s_t) - \hat{R}_t\big)^2\big] \]
where \(\hat{R}_t = \hat{A}_t + V_{\psi_{\text{old}}}(s_t)\) is the return estimated by GAE.
Step 5 — Multiple Mini-Batch Updates: Multiple epochs of mini-batch updates (typically 2–4 epochs) are performed on the same batch of rollout data to improve data efficiency. This is precisely the core advantage of PPO over vanilla policy gradient methods.
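The per-token clipped surrogate from Step 4, as a scalar sketch (the helper name is ours; since the objective is maximized, an implementation would negate it for a gradient-descent optimizer):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    # pessimistic min of the unclipped and clipped surrogate terms
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, pushing the ratio above \(1 + \epsilon\) earns no extra credit; with a negative advantage, the objective gains nothing further once the ratio drops below \(1 - \epsilon\). Either way, large policy updates are discouraged.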
Limitations of RLHF
Despite the enormous success of RLHF (ChatGPT, Claude, etc.), it suffers from several fundamental issues that have motivated various alternative methods:
- High cost of human annotation: Large amounts of high-quality human preference annotation data are required, and annotation in specialized domains (medicine, law, etc.) is particularly expensive
- The Reward Model is a bottleneck: The Reward Model itself is a finite-capacity neural network whose modeling of human preferences is imperfect. When policy optimization is pushed to the extreme, the process is less about optimizing human preferences and more about exploiting the Reward Model's weaknesses
- PPO training instability: PPO requires careful hyperparameter tuning (\(\beta\), \(\epsilon\), learning rate, batch size, etc.), the memory footprint of four models is enormous, and training is prone to various instabilities
- Mode Collapse: The model may converge to a monotonous, "safe" but uninteresting response style
- High engineering complexity: Managing forward passes, backward passes, and communication for four large models simultaneously makes distributed training implementation extremely complex
RLAIF (RL from AI Feedback)
Core Idea
Constitutional AI (CAI, Bai et al., 2022), proposed by Anthropic, is a representative RLAIF method. The core idea is remarkably simple: use the AI itself to replace human annotators, evaluating and improving the model's outputs according to an explicit set of "constitutional" rules.
Method Pipeline
The Constitutional AI pipeline consists of two phases:
Phase 1 — Self-Critique and Revision (Critique-Revision):
- Have the model generate an initial response to a potentially harmful prompt
- Have the model critique itself according to constitutional rules ("Does this response violate rule X?")
- Have the model revise its response based on the critique
- Use the (original response, revised response) pairs as preference data
Phase 2 — RLAIF Training:
- Train a Reward Model using AI-generated preference data
- Optimize the LLM using RL (PPO)
Scalable Oversight
The deeper significance of RLAIF lies in the concept of scalable oversight: as AI capabilities grow, it becomes increasingly difficult for humans to directly evaluate AI output quality (especially in areas like complex reasoning and code generation). Using stronger AI to supervise weaker AI, forming a recursive supervision chain, is an important approach toward achieving superintelligence alignment.
DPO (Direct Preference Optimization)
DPO, proposed by Rafailov et al. (2023), is the most important simplification of RLHF in recent years. Its core finding is: we do not need to explicitly train a Reward Model or run PPO at all — preference optimization can be performed directly on the policy.
Complete Derivation from RLHF to DPO
This is the most elegant part of DPO. Starting from the RLHF optimization objective, we derive the DPO loss function step by step.
Step 1: The RLHF Optimization Objective
Recall the RLHF Stage 3 objective (with KL penalty):
\[ \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big] \]
Here \(\pi_{\text{ref}}\) is the reference model (typically the SFT model). Expanding the KL divergence:
\[ \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right] \]
Step 2: Closed-Form Solution for the Optimal Policy
This is a KL-constrained optimization problem. For a fixed \(x\), we want to maximize over the distribution \(\pi_\theta(\cdot|x)\):
\[ \max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right] \]
This is a classical variational problem. Taking the functional derivative with respect to \(\pi(y|x)\) and setting it to zero (with the normalization constraint \(\sum_y \pi(y|x) = 1\)), we obtain the closed-form solution for the optimal policy:
\[ \pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right) \]
where the partition function is:
\[ Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right) \]
Intuition: The optimal policy is a "reweighted version" of the reference model. For responses with high reward, the probability is exponentially amplified; for responses with low reward, the probability is exponentially suppressed. \(\beta\) controls the degree of this amplification/suppression — smaller \(\beta\) concentrates the optimal policy on high-reward responses; larger \(\beta\) keeps it closer to the reference model.
Step 3: Recovering the Reward from the Optimal Policy
From the closed-form solution above, we can solve for the reward:
\[ \exp\!\left(\frac{1}{\beta} r(x, y)\right) = Z(x)\, \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \]
Taking the logarithm of both sides:
\[ \frac{1}{\beta} r(x, y) = \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \log Z(x) \]
Rearranging, we obtain the closed-form expression for the reward:
\[ r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \]
This is the most critical step in the DPO derivation. It tells us: the reward can be expressed entirely in terms of the log-probability ratio between the optimal policy and the reference policy. The partition function \(Z(x)\) depends only on the prompt \(x\) and is independent of the response \(y\).
Step 4: Substituting into the Bradley-Terry Model
Substituting the reward expression into the preference model:
\[ P(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \]
The key observation: \(Z(x)\) cancels perfectly when taking the difference! This means we do not need to compute this intractable partition function.
Step 5: The DPO Loss Function
Now, we parameterize the optimal policy \(\pi^*\) with the trainable policy \(\pi_\theta\) and maximize the likelihood of the preference data:
\[ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \]
Defining the implicit reward:
\[ \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \]
The DPO loss can be written concisely as:
\[ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\big] \]
This is formally identical to the Reward Model training objective! The only difference is that the reward is no longer the output of a separate network but rather the log-probability ratio of the policy itself.
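A scalar sketch of the DPO loss for one preference pair, assuming sequence-level log-probabilities are precomputed (the function and argument names are ours):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # implicit rewards: beta * log(pi_theta / pi_ref) for each response
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # -log sigmoid(r_w - r_l): same shape as the RM training loss
    margin = r_w - r_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both responses have equal implicit rewards the loss is \(\log 2\), and it falls as the policy widens the margin in favor of the preferred response.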
Gradient Analysis of DPO
The gradient of the DPO loss with respect to parameters \(\theta\) is:
\[ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big] \]
Intuition: The gradient direction increases the probability of preferred responses and decreases the probability of dispreferred responses. The gradient magnitude is controlled by \(\sigma(-\hat{r}_\theta(x, y_w) + \hat{r}_\theta(x, y_l))\) — when the model already correctly assigns a higher implicit reward to the preferred response, this weight approaches 0 and the gradient is small; when the model "errs," the weight approaches 1 and the gradient is large. This is a natural adaptive learning rate mechanism.
DPO vs RLHF
| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Training stages | Three stages (SFT + RM + RL) | Two stages (SFT + DPO) |
| Requires Reward Model | Yes (separately trained) | No (implicit reward) |
| Requires online sampling | Yes (PPO needs rollouts) | No (directly uses offline preference data) |
| GPU memory | Very high (4 models loaded simultaneously) | Lower (2 models: \(\pi_\theta\) + \(\pi_{\text{ref}}\)) |
| Training stability | Poor, requires careful tuning | Good, similar to standard supervised learning |
| Theoretical equivalence | Approximately optimal | Equivalent to RLHF's closed-form optimum |
| Empirical performance | Generally stronger (due to online exploration) | Close to RLHF, on par or better in some settings |
| Engineering complexity | Very high | Low |
Limitations of DPO
- Reference model dependency: DPO requires a frozen reference model to compute probability ratios. This increases memory overhead, and the quality of the reference model directly affects DPO's performance
- Sensitivity to offline data quality: DPO relies entirely on offline preference data; if the data distribution diverges significantly from the current policy, learning performance degrades
- Length Bias: DPO-trained models tend to generate longer responses: sequence log-probabilities are sums over tokens, so length differences can dominate the implicit reward margin
- Lack of online exploration: Unlike PPO, DPO does not perform online sampling and exploration, potentially missing good responses outside the coverage of the preference data
- Overfitting risk: On small-scale preference data, DPO is prone to overfitting, leading to reduced generalization
GRPO (Group Relative Policy Optimization)
GRPO was proposed by the DeepSeek team (Shao et al., 2024) and is the core training algorithm behind DeepSeek-R1. GRPO's design philosophy is: completely eliminate the Reward Model and Value function, and estimate Advantage through within-group relative ranking.
Motivation
The main pain points of PPO in LLM training:
- Difficulty of training the Value function: The state space of LLMs is extremely complex (prompt + generated tokens), making training an accurate Value function a significant challenge in itself. Moreover, the Value function's parameter count is typically comparable to the LLM's, further increasing memory overhead
- Limitations of the Reward Model: Training a good Reward Model requires large amounts of high-quality preference data, and the Reward Model is susceptible to exploitation
GRPO's core idea: since it is difficult to judge the "absolute quality" of a single response, it is better to let multiple responses "compete" within a group and use relative ranking to determine the optimization direction.
Algorithm Pipeline
For each prompt \(x\), GRPO performs the following steps:
Step 1 — Group Sampling: Use the current policy \(\pi_{\theta_{\text{old}}}\) to generate a group of \(G\) responses for the same prompt:
\[ \{y_1, y_2, \ldots, y_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x) \]
Typical values for \(G\) range from \(8\) to \(64\).
Step 2 — Compute Rewards: Score each response using a reward function (which can be a Reward Model or a verifiable reward):
\[ r_i = r(x, y_i), \quad i = 1, \ldots, G \]
Step 3 — Within-Group Normalized Advantage: This is GRPO's core innovation. The rewards are z-score normalized within the group to obtain Advantage estimates:
\[ \hat{A}_i = \frac{r_i - \bar{r}}{\sigma_r} \]
That is:
\[ \bar{r} = \frac{1}{G} \sum_{i=1}^{G} r_i, \qquad \sigma_r = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (r_i - \bar{r})^2} \]
Intuition: Within a group of responses, those above average receive positive Advantage (and are encouraged), while those below average receive negative Advantage (and are suppressed). Normalization ensures that the Advantage scale is consistent across different prompts, preventing prompts with larger absolute reward values from dominating training.
This method is essentially an instance of REINFORCE with baseline, where the baseline \(b(x) = \bar{r}\) is the within-group mean reward. Compared to a traditional learned baseline (i.e., a Value function), this baseline has slightly higher variance but requires no additional training whatsoever.
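The within-group normalization is just a z-score; a sketch (the `eps` guard is an implementation detail to avoid division by zero when all rewards in a group are identical):

```python
def group_advantages(rewards, eps=1e-8):
    # z-score each reward against its own group's statistics
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

The advantages sum to (approximately) zero within each group: above-average responses are pushed up exactly as hard as below-average ones are pushed down.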
Step 4 — Policy Optimization: A PPO-clip-style objective function is used, but with a sequence-level Advantage shared across all tokens of a response:
\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\big(r_{i,t}(\theta)\, \hat{A}_i,\; \text{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i\big)\right] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big] \]
where \(r_{i,t}(\theta)\) is the importance ratio for the \(t\)-th token of the \(i\)-th response:
\[ r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \]
Note that \(\hat{A}_i\) is a sequence-level Advantage (the entire response shares the same Advantage value), while clipping is performed at the token level.
KL Regularization: GRPO also requires a KL penalty to prevent the policy from drifting too far. GRPO uses an approximate per-token KL divergence estimate:
\[ \mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big] \approx \frac{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - \log \frac{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - 1 \]
This per-token estimator is unbiased under samples from \(\pi_\theta\) and always non-negative, and it closely tracks the standard KL divergence when the two distributions are close.
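The estimator in code (a sketch; per-token log-probabilities assumed given):

```python
import math

def kl_estimate(logp_theta, logp_ref):
    # r - log(r) - 1 with r = pi_ref / pi_theta; always >= 0,
    # and unbiased for KL(pi_theta || pi_ref) under samples from pi_theta
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0
```

Because \(r - \log r - 1 \ge 0\) for all \(r > 0\), every per-token estimate is non-negative, unlike the naive estimator \(\log(\pi_\theta / \pi_{\text{ref}})\), which can go negative on individual samples.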
Connection Between GRPO and REINFORCE
From a theoretical perspective, GRPO can be viewed as a variant of REINFORCE with baseline. The REINFORCE gradient estimator is:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[(r(x, y) - b)\, \nabla_\theta \log \pi_\theta(y \mid x)\big] \]
GRPO's key improvements are:
- Within-group mean as baseline: \(b = \bar{r}\), eliminating the need to learn a Value function
- Normalization: Dividing by the within-group standard deviation \(\sigma_r\) adaptively adjusts the gradient scale
- PPO-clip for training stability: Using clipping instead of raw policy gradients to prevent large updates
Summary of GRPO's Advantages
- No Reward Model required (can use one, but it is not necessary — rule-based or verifiable rewards can be used directly)
- No Value function needed (completely eliminates the Critic network)
- Memory efficient: only needs to load Actor + Reference, two models
- Simple implementation: far less code than PPO-based RLHF
- Demonstrated strong reasoning capability improvement in DeepSeek-R1
Other Preference Optimization Methods
IPO (Identity Preference Optimization)
IPO, proposed by Azar et al. (2023), aims to address a theoretical flaw of DPO: DPO assumes that preference data perfectly follows the Bradley-Terry model, but actual human preference data is full of noise.
During training, DPO may overconfidently fit the preference data, causing the implicit reward magnitudes to grow without bound. IPO modifies the loss function to regularize this behavior:
\[ \mathcal{L}_{\text{IPO}}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\beta}\right)^2\right] \]
Intuition: IPO replaces DPO's log-sigmoid loss with an MSE loss. The MSE loss "pulls" the implicit reward difference toward a target value of \(\frac{1}{2\beta}\), rather than allowing it to grow without bound as in DPO. This effectively applies implicit regularization to the implicit reward.
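A scalar sketch of the IPO loss for one pair, using precomputed sequence log-probabilities (names are ours):

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # squared distance between the log-ratio margin and the
    # fixed target 1 / (2 * beta), instead of a log-sigmoid
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2.0 * beta)
    return (h - target) ** 2
```

The loss is zero exactly when the margin equals the target and grows quadratically on either side, so the implicit reward gap cannot drift toward infinity.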
KTO (Kahneman-Tversky Optimization)
KTO, proposed by Ethayarajh et al. (2024), addresses a very practical problem: in many scenarios, obtaining pairwise preference annotations (\(y_w\) vs \(y_l\)) is difficult, but obtaining independent good/bad labels (thumbs up / thumbs down) is relatively easy.
KTO's name derives from Kahneman and Tversky's Prospect Theory, which posits that humans are more sensitive to losses than to gains.
KTO's loss function handles "good" and "bad" samples separately:
\[ \mathcal{L}_{\text{KTO}}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[w(y)\big(1 - v(x, y)\big)\big] \]
where:
\[ v(x, y) = \begin{cases} \sigma\big(\hat{r}_\theta(x, y) - z_{\text{ref}}\big), & y \text{ is desirable} \\ \sigma\big(z_{\text{ref}} - \hat{r}_\theta(x, y)\big), & y \text{ is undesirable} \end{cases} \]
Here \(z_{\text{ref}}\) is a baseline estimated by the expected KL divergence across the dataset. \(w(y)\) is a weighting term that assigns higher weight to undesirable samples (reflecting Prospect Theory's "loss aversion").
KTO's core advantage: Lower data requirements — only binary labels (good/bad) are needed, not pairwise comparisons. This substantially reduces data collection costs.
ORPO (Odds Ratio Preference Optimization)
ORPO, proposed by Hong et al. (2024), goes a step further: it merges SFT and preference optimization into a single training stage and does not require a reference model.
ORPO's loss function:
\[ \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}, \qquad \mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right) \]
where the Odds Ratio term is built from:
\[ \text{odds}_\theta(y \mid x) = \frac{\pi_\theta(y \mid x)}{1 - \pi_\theta(y \mid x)} \]
Core idea: By replacing DPO's log-probability ratio with an odds ratio, the reference model becomes unnecessary. The SFT term simultaneously provides a regularization effect, preventing the policy from drifting too far.
SimPO (Simple Preference Optimization)
SimPO, proposed by Meng et al. (2024), further simplifies DPO with two key modifications:
- Length-normalized log-probabilities replace probability ratios: \(\hat{r}_\theta(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)\), eliminating the need for a reference model
- Introduction of a margin term: A fixed margin \(\gamma\) is added between the preferred and dispreferred rewards:
\[ \mathcal{L}_{\text{SimPO}}(\theta) = -\mathbb{E}\big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) - \gamma\big)\big] \]
Length normalization elegantly resolves DPO's length bias problem.
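The two modifications combine into a short scalar sketch (the names and default hyperparameter values here are illustrative, not the paper's):

```python
import math

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    # length-normalized implicit rewards: no reference model needed
    r_w = beta * logp_w / len_w
    r_l = beta * logp_l / len_l
    # the margin gamma must be cleared before the loss gets small
    z = r_w - r_l - gamma
    return -math.log(1.0 / (1.0 + math.exp(-z)))
```

Dividing by length means a response cannot look "preferred" merely by being longer; only a higher average per-token log-probability lowers the loss.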
RLVR (Reinforcement Learning with Verifiable Rewards)
RLVR is one of the key technologies behind the success of DeepSeek-R1 (DeepSeek, 2025) and a core method connecting RL with reasoning capabilities.
Core Idea
In domains such as mathematics, programming, and logical reasoning, the correctness of answers is verifiable:
- Mathematical problems can be verified by checking whether the final answer is correct
- Code can be verified through test cases
- Logic problems can be verified by checking whether the conclusion holds
This means we can completely skip "human annotation" and "Reward Model training" and directly use correctness as the RL reward signal.
Outcome-Based Reward (ORM)
The simplest form: only care about whether the final answer is correct.
\[ r(x, y) = \begin{cases} 1, & \text{if the final answer is correct} \\ 0, & \text{otherwise} \end{cases} \]
More nuanced rewards can also be designed, for example incorporating format compliance:
\[ r(x, y) = r_{\text{correct}}(x, y) + \alpha \cdot r_{\text{format}}(y) \]
where \(r_{\text{format}}\) checks whether the model follows a specified chain-of-thought format (e.g., placing the reasoning process inside <think> tags).
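A toy verifiable reward in this spirit. The tag format, the answer-extraction pattern, and the 0.2 format weight are all illustrative assumptions, not any particular paper's specification:

```python
import re

def verifiable_reward(response, gold_answer, fmt_weight=0.2):
    # correctness: exact match of the text after "Answer:"
    m = re.search(r"Answer:\s*(.+)", response)
    correct = m is not None and m.group(1).strip() == str(gold_answer)
    # format: reasoning wrapped in <think>...</think> tags
    format_ok = bool(re.search(r"<think>.*?</think>", response, re.S))
    return (1.0 if correct else 0.0) + (fmt_weight if format_ok else 0.0)
```

Real systems use much more robust checkers (symbolic math equivalence, sandboxed test execution), but the principle is the same: the reward comes from a program, not from a learned model or a human.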
Process-Based Reward (PRM)
A more fine-grained reward approach: evaluate each step of the reasoning process.
Let the reasoning process consist of \(K\) steps: \(y = (s_1, s_2, \ldots, s_K)\). The PRM assigns a score to each step, which can be aggregated by averaging:
\[ r_{\text{PRM}}(x, y) = \frac{1}{K} \sum_{k=1}^{K} r_\phi(x, s_1, \ldots, s_k) \]
Or take the minimum (the weakest link determines overall quality):
\[ r_{\text{PRM}}(x, y) = \min_{1 \le k \le K} r_\phi(x, s_1, \ldots, s_k) \]
PRM's advantage is providing denser reward signals (dense reward), helping to mitigate the sparse reward problem. However, training a PRM itself requires step-level annotation data, which is more costly.
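Aggregating per-step scores into a sequence-level reward, for both conventions mentioned above (a sketch; real PRMs produce the step scores with a learned model):

```python
def prm_aggregate(step_scores, mode="min"):
    # "min": the weakest step bounds the whole chain's quality;
    # "mean": average step quality across the chain
    if mode == "min":
        return min(step_scores)
    return sum(step_scores) / len(step_scores)
```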
DeepSeek-R1's Training Methodology
DeepSeek-R1 demonstrated an exciting technical path: without any human-annotated chain-of-thought data, relying solely on GRPO + verifiable rewards, complex reasoning capabilities can emerge in an LLM.
Training pipeline:
DeepSeek-R1 Training Pipeline:
Stage 1: Cold Start
- Collect a small amount of long chain-of-thought data for SFT
- Teach the model the basic "think then answer" format
Stage 2: Reasoning RL
- Train with GRPO + verifiable rewards on math/code tasks
- Reward signal: answer correctness + format compliance
- The model spontaneously develops reflection, verification, backtracking, and self-correction
Stage 3: Rejection Sampling + SFT
- Use the Stage 2 model to generate large amounts of reasoning data
- Keep only correct, high-quality reasoning trajectories
- Mix with general SFT data for further fine-tuning
Stage 4: Second Round of RL
- Combine reasoning rewards + helpfulness/safety rewards
- Final alignment
A striking finding from DeepSeek-R1 is that during Stage 2, the model spontaneously developed the following reasoning behaviors through pure RL training, none of which were explicitly taught in the training data:
- Reflection: "Let me re-examine this step..."
- Backtracking: "The approach above doesn't work, let me try a different angle..."
- Self-verification: "Substituting back into the original equation to verify..."
- Decomposition: "First solve sub-problem A, then..."
This paradigm of emerging reasoning capabilities through RL is highly consistent with the technical direction of OpenAI's o1/o3 series models, representing an important direction for LLM post-training.
Comparative Summary of Methods
| Method | Reward Model | Value Function | Reference Model | Data Requirements | Training Complexity | Representative Models | Core Advantage |
|---|---|---|---|---|---|---|---|
| RLHF (PPO) | Required | Required | Required | Pairwise preferences | Very high | ChatGPT, Claude | Strong performance, mature theory |
| DPO | Not required | Not required | Required | Pairwise preferences | Low | Zephyr, many open-source | Simple and efficient, theoretically equivalent to RLHF |
| GRPO | Optional | Not required | Required | Verifiable rewards | Medium | DeepSeek-R1 | No Value function needed, suited for reasoning tasks |
| KTO | Not required | Not required | Required | Binary labels | Low | - | Lowest data requirements |
| IPO | Not required | Not required | Required | Pairwise preferences | Low | - | Robust to noisy preference data |
| ORPO | Not required | Not required | Not required | Pairwise preferences | Low | - | No reference model, unified with SFT |
| SimPO | Not required | Not required | Not required | Pairwise preferences | Low | - | Resolves length bias |
| RLVR | Not required | Not required | Required | Verifiable labels | Medium | DeepSeek-R1, o1 | No human annotation needed, suited for reasoning |
A clear evolutionary trend emerges: from complex to simple, from human-dependent to automated.
Frontier Directions
Online DPO / Iterative DPO
Standard DPO uses offline preference data, which means the distribution of preference data may not match the current policy's distribution (distribution shift). The idea behind Online DPO is:
- Generate new responses using the current policy \(\pi_\theta\)
- Perform preference annotation on the new responses using a Reward Model (or AI/humans)
- Update with DPO using the new preference data
- Repeat
This essentially transforms DPO from off-policy to on-policy, bridging the theoretical gap between DPO and RLHF. Experiments show that Online DPO typically outperforms standard offline DPO and, on certain benchmarks, approaches or even surpasses PPO-based RLHF.
Self-Play (SPIN)
SPIN (Self-Play fIne-tuNing, Chen et al., 2024) has the core idea of letting the model play against its own previous version:
- Treat the previous-round model's outputs as "dispreferred" responses
- Treat ground truth as "preferred" responses
- Train using a DPO-style loss function
\[ \mathcal{L}_{\text{SPIN}}(\theta) = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_{\text{gt}} \mid x)}{\pi_{\theta_{\text{old}}}(y_{\text{gt}} \mid x)} - \beta \log \frac{\pi_\theta(y_{\text{old}} \mid x)}{\pi_{\theta_{\text{old}}}(y_{\text{old}} \mid x)}\right)\right] \]
where \(y_{\text{gt}}\) is the ground truth response and \(y_{\text{old}} \sim \pi_{\theta_{\text{old}}}\) is the response generated by the old model.
SPIN's advantage is that it requires no preference annotation data at all — only ground truth responses are needed. However, as training progresses and the model's output increasingly approximates the ground truth, the learning signal gradually weakens until convergence.
RL for Reasoning (o1-style)
OpenAI's o1/o3 series and DeepSeek-R1 mark the emergence of a new paradigm: using RL to train LLM reasoning capabilities, rather than merely for alignment.
In this paradigm, the role of RL undergoes a fundamental shift:
- Traditional RLHF: RL is used for alignment — making the model's outputs conform to human preferences
- RL for Reasoning: RL is used for capability enhancement — teaching the model deeper reasoning
Key technical elements:
- Long Chain-of-Thought (Long CoT): Allow the model to perform extended internal reasoning before answering
- Verifiable rewards: The correctness of math/code provides a natural RL reward signal
- Test-time Compute Scaling: Improve performance by investing more computation during inference
This yields a profound insight: scaling is not limited to parameter count and data volume during training — computation at inference time can also be traded for better performance. This is the so-called "test-time compute scaling."
Multi-Turn RLHF
Traditional RLHF considers only single-turn dialogues \((x, y)\), but in real applications LLMs need to maintain consistency across multi-turn conversations. Multi-Turn RLHF extends the RL state to the full dialogue history:
\[ s = (x_1, y_1, x_2, y_2, \ldots, x_k, y_{k,<t}) \]
The challenges include:
- The credit assignment problem in multi-turn dialogues (which turn's response deserves credit for a good final outcome?)
- Rapid expansion of the dialogue state space
- Greater difficulty of human preference annotation (requiring evaluation of the entire conversation's quality)
Constitutional AI at Scale
Taking the Constitutional AI philosophy to its ultimate conclusion: using a comprehensive value system to guide AI behavior, rather than relying on limited human preference annotations. This touches on deeper philosophical questions: What constitutes "good" AI behavior? How do we find balance across different cultures and different value systems?