
Safety & Alignment

What Is Alignment?

The core question of alignment is:

How can we ensure that AI systems behave in accordance with human values, intentions, and expectations?

A pretrained language model is fundamentally a "text completer" — it has learned to predict the next token, but that does not mean it will behave the way humans expect. An unaligned LLM may:

  • Generate harmful, toxic, or discriminatory content
  • Provide plausible-sounding but factually incorrect information (hallucination)
  • Ignore user intent and produce irrelevant responses
  • Be manipulated by adversarial prompts into producing dangerous outputs

The goals of alignment are commonly summarized by the HHH principles:

| Principle | Meaning |
|-----------|---------|
| Helpful | The model should do its best to assist users in completing their tasks |
| Honest | The model should provide accurate information and express uncertainty when unsure |
| Harmless | The model should not generate harmful content or assist in dangerous activities |

RLHF: Reinforcement Learning from Human Feedback

RLHF (Reinforcement Learning from Human Feedback) is currently the most widely adopted alignment method, systematically applied to LLMs by InstructGPT (Ouyang et al., 2022).

Full Pipeline

RLHF Three-Stage Pipeline:

Stage 1: SFT (Supervised Fine-Tuning)
    ┌─────────────────────────────────────────────┐
    │  Collect high-quality (prompt, response) pairs│
    │  Supervised fine-tuning on a pretrained LLM   │
    │  Output: SFT Model (π_SFT)                   │
    └─────────────────────────────────────────────┘
                        ↓
Stage 2: Reward Model Training
    ┌─────────────────────────────────────────────┐
    │  For the same prompt, have the SFT model     │
    │  generate multiple responses                  │
    │  Human annotators rank responses (y_w > y_l)  │
    │  Train a Reward Model to learn human prefs    │
    │  Output: Reward Model (r_φ)                   │
    └─────────────────────────────────────────────┘
                        ↓
Stage 3: RL Optimization (PPO)
    ┌─────────────────────────────────────────────┐
    │  Use Reward Model scores as reward signal     │
    │  Optimize LLM policy via PPO algorithm        │
    │  KL penalty prevents drifting too far from SFT│
    │  Output: Aligned LLM (π_RLHF)                │
    └─────────────────────────────────────────────┘

Stage 1: SFT

Collect high-quality (instruction, response) data written by human annotators and perform standard supervised fine-tuning on the pretrained model:

\[ \mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P_\theta(y_t | x, y_{<t}) \]

The role of SFT is to teach the model the basic format and behavioral patterns for "answering questions."
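As a minimal illustration, the loss above can be computed as token-level cross-entropy over the response, with prompt tokens masked out. The sketch below assumes the common convention of marking masked positions with -100; it is not tied to any particular training framework.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over response tokens only.

    logits: (batch, seq_len, vocab) output of the language model
    labels: (batch, seq_len) target token ids, with prompt/padding
            positions set to -100 so they are ignored
    """
    # Shift so that the logits at position t predict the token at t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip masked (prompt/padding) positions
    )
```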

Stage 2: Reward Model

Given a prompt \(x\), the SFT model generates two candidate responses; human annotators indicate which one they prefer, yielding a preferred response \(y_w\) and a dispreferred response \(y_l\).

The Reward Model training objective (Bradley-Terry Model):

\[ \mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right] \]

Here \(\sigma\) is the sigmoid function. This objective ensures that the Reward Model assigns higher scores to responses preferred by humans.
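A minimal PyTorch sketch of this objective, assuming the Reward Model has already produced a scalar score for each (prompt, response) pair; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss.

    chosen_scores:   (batch,) scores r_phi(x, y_w) for the preferred responses
    rejected_scores: (batch,) scores r_phi(x, y_l) for the dispreferred responses
    """
    # -log sigma(r_w - r_l); logsigmoid is the numerically stable form
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```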

Stage 3: PPO Optimization

Using the Reward Model's output as the reward signal, the LLM policy is optimized via the PPO algorithm:

\[ \max_\theta \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \cdot \text{KL}\left( \pi_\theta(y|x) \| \pi_{\text{SFT}}(y|x) \right) \right] \]
  • \(r_\phi(x, y)\): The reward score assigned by the Reward Model
  • \(\beta \cdot \text{KL}(\cdot \| \cdot)\): KL divergence penalty term, preventing the policy from drifting too far from the SFT model
  • \(\beta\) controls the trade-off between maximizing the reward and staying close to the SFT policy

Importance of the KL penalty: Without KL constraints, the model may exploit vulnerabilities in the Reward Model (reward hacking), generating responses that score highly but are actually low quality.
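To make the role of the KL term concrete, here is a sketch of how the per-response reward fed to PPO is often assembled, with the KL divergence approximated from per-token log-probabilities; the tensor names are placeholders for quantities the training loop would provide:

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        sft_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward Model score minus a KL penalty against the SFT policy.

    rm_score:        (batch,) r_phi(x, y) for each sampled response
    policy_logprobs: (batch, resp_len) log pi_theta(y_t | x, y_<t)
    sft_logprobs:    (batch, resp_len) log pi_SFT(y_t | x, y_<t)
    """
    # Simple per-sequence KL estimate: sum of (log pi_theta - log pi_SFT)
    # over the generated tokens; a larger beta keeps the policy closer to SFT.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    return rm_score - beta * kl
```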


RLAIF: Constitutional AI

Constitutional AI (CAI), proposed by Anthropic, replaces part of the human feedback with AI feedback.

Core Idea

Use a set of explicit "constitution" rules to guide AI self-correction, reducing dependence on human annotation.

Pipeline

Constitutional AI Pipeline:

1. Have the model generate a response (which may contain harmful content)
2. Have the model critique and revise its response based on constitutional rules
3. Train a preference model using (original, revised) pairs
4. Further optimize with RL

Example constitutional rules:
- "Please revise the response to remove any racially discriminatory content"
- "Please revise the response to be more honest and accurate"
- "Please revise the response so it does not help users engage in illegal activities"

Advantages of RLAIF:

  • Reduced human annotation costs
  • Explicit, auditable rules
  • Easier to scale

DPO: Direct Preference Optimization

Rafailov et al. (2023) proposed DPO as a simplified alternative to RLHF.

Core Insight

The RL stage (PPO) of RLHF is unstable and complex to train. DPO shows that the RL problem can be reformulated as a simple classification problem.

Mathematical Derivation

Starting from the optimal policy in RLHF, one can derive the relationship between the reward function and the policy:

\[ r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) \]

Substituting this into the Bradley-Terry preference model and canceling \(Z(x)\) yields the DPO loss:

\[ \mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right] \]
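A minimal PyTorch sketch of this loss, assuming the summed log-probability of each full response under the trainable policy \(\pi_\theta\) and the frozen reference \(\pi_{\text{ref}}\) has already been computed; the argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-response log-probabilities (summed over tokens).

    policy_* come from the trainable policy pi_theta,
    ref_*    come from the frozen reference policy pi_ref.
    """
    # log [pi_theta / pi_ref] for the preferred and dispreferred responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigma(beta * margin); no reward model or sampling loop needed
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```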

DPO vs. RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
|--------|------------|-----|
| Requires Reward Model | Yes | No |
| Training stability | Poor; requires careful hyperparameter tuning | Good; similar to standard supervised learning |
| Computational cost | High (multiple models must be online simultaneously) | Low |
| Theoretical optimality | Approximately optimal | Equivalent to the closed-form solution of RLHF |
| Practical performance | Generally better (but harder to train) | Close to RLHF; better in some scenarios |

The success of DPO has made alignment training much more accessible, driving the widespread adoption of alignment in open-source LLMs.

DPO Variants

  • IPO (Identity Preference Optimization): Addresses DPO's sensitivity to the preference data distribution
  • KTO (Kahneman-Tversky Optimization): Requires only binary feedback (good/bad) rather than pairwise comparisons
  • ORPO (Odds Ratio Preference Optimization): Merges SFT and preference optimization into a single step

Hallucination

Hallucination refers to models generating content that appears plausible but contradicts factual reality. It is one of the core challenges facing LLMs.

Types of Hallucination

| Type | Description | Example |
|------|-------------|---------|
| Factual hallucination | Generating content that contradicts known facts | "Einstein invented the telephone" |
| Faithfulness hallucination | Generating content that contradicts the input context | A summary containing information absent from the source text |
| Reasoning hallucination | Logical errors during the reasoning process | Incorrect steps in a mathematical calculation |

Causes of Hallucination

  1. Noisy training data: Internet-sourced data inherently contains misinformation
  2. Training objective bias: Next-token prediction optimizes for fluency, not factual accuracy
  3. Knowledge cutoff: The model only knows information available before its training cutoff date
  4. Overconfidence: Models tend to give definitive answers even when uncertain

Mitigation Strategies

  • Retrieval-Augmented Generation (RAG): Retrieve facts from external knowledge bases to reduce hallucination (see the sketch after this list)
  • Chain-of-Verification: Have the model self-verify its generated content
  • Confidence calibration: Train the model to express uncertainty when it is unsure
  • Factuality reward: Incorporate factual accuracy as a reward signal in RLHF
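A minimal sketch of the first strategy, retrieval-augmented generation. Here `embed`, `search`, and `generate` are placeholders for an embedding model, a vector-index lookup, and an LLM call; they are assumptions for illustration rather than a specific library's API:

```python
def answer_with_rag(question: str, embed, search, generate, k: int = 3) -> str:
    """Ground the answer in retrieved passages instead of parametric memory."""
    # 1. Retrieve the k passages most relevant to the question
    query_vector = embed(question)
    passages = search(query_vector, top_k=k)

    # 2. Put the retrieved evidence into the prompt and constrain the model to it
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the passages below. "
        "If the passages are insufficient, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```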

Red Teaming and Safety Evaluation

Red Teaming

Red teaming is an adversarial evaluation method that simulates attackers to uncover safety vulnerabilities in models.

Common attack vectors:

| Attack Type | Description |
|-------------|-------------|
| Jailbreak | Bypassing safety guardrails through carefully crafted prompts |
| Prompt Injection | Embedding malicious instructions in the input to override system prompts |
| Multilingual attacks | Exploiting low-resource languages to circumvent safety filters |
| Encoding attacks | Hiding malicious requests using encodings such as base64 or ROT13 |
| Multi-step attacks | Gradually inducing harmful outputs through multi-turn conversations |

Safety Evaluation Benchmarks

| Benchmark | Evaluation Focus |
|-----------|------------------|
| TruthfulQA | Factual accuracy; resistance to common misconceptions |
| ToxiGen | Toxic content generation |
| BBQ | Social bias |
| HarmBench | Comprehensive safety evaluation |
| WMDP | Risk of leaking knowledge related to weapons of mass destruction |

Current Challenges and Open Questions

1. Reward Hacking

Models may learn to "game" the Reward Model, generating responses that receive high reward scores but are actually low quality.

\[ \text{Reward Hacking}: \quad \arg\max_y r_\phi(x, y) \neq \arg\max_y r_{\text{human}}(x, y) \]

2. Superalignment

When AI systems surpass human capabilities, how can we ensure alignment? Humans cannot reliably evaluate AI outputs that exceed their own abilities.

OpenAI's Superalignment initiative proposed the research direction of "weak-to-strong generalization": using weaker models to supervise stronger ones.

3. Alignment Tax

Alignment training often comes at the cost of some model capabilities. Finding the optimal balance between safety and helpfulness remains an ongoing challenge.

4. Value Pluralism

Different cultures and communities have different definitions of "good behavior." Whose values should the model be aligned to?

5. Interpretable Alignment

Current alignment methods (RLHF, DPO) are essentially "black-box" — we cannot precisely understand which internal mechanisms of the model are altered by alignment training.

6. Evaluation Difficulty

For open-ended generation tasks, there is still no consensus on how to objectively and comprehensively evaluate model safety.


Summary of Alignment Methods

| Method | Core Idea | Strengths | Weaknesses |
|--------|-----------|-----------|------------|
| SFT | Supervised learning to imitate human responses | Simple and effective | Can only imitate; hard to surpass human quality |
| RLHF (PPO) | Human preferences + reinforcement learning | Strong results; can surpass SFT | Complex and unstable training |
| RLAIF (CAI) | AI self-feedback + constitutional rules | Scalable; transparent rules | AI evaluation may be biased |
| DPO | Direct preference optimization | Simple and stable | Sensitive to data quality |
| ORPO | SFT + preference optimization in one step | Simplified training pipeline | Effectiveness requires further validation |

Alignment is a continuously evolving field of research. Current methods are all approximate solutions, and there is still a long way to go before the alignment problem is truly solved.

