Safety & Alignment
What Is Alignment?
The core question of alignment is:
How can we ensure that AI systems behave in accordance with human values, intentions, and expectations?
A pretrained language model is fundamentally a "text completer" — it has learned to predict the next token, but that does not mean it will behave the way humans expect. An unaligned LLM may:
- Generate harmful, toxic, or discriminatory content
- Provide plausible-sounding but factually incorrect information (hallucination)
- Ignore user intent and produce irrelevant responses
- Be manipulated by adversarial prompts into producing dangerous outputs
The goals of alignment are commonly summarized by the HHH principles:
| Principle | Meaning |
|---|---|
| Helpful | The model should do its best to assist users in completing their tasks |
| Honest | The model should provide accurate information and express uncertainty when unsure |
| Harmless | The model should not generate harmful content or assist in dangerous activities |
RLHF: Reinforcement Learning from Human Feedback
RLHF (Reinforcement Learning from Human Feedback) is currently the most widely adopted alignment method; it was first applied systematically to LLMs in InstructGPT (Ouyang et al., 2022).
Full Pipeline
RLHF Three-Stage Pipeline:
Stage 1: SFT (Supervised Fine-Tuning)
┌───────────────────────────────────────────────┐
│ Collect high-quality (prompt, response) pairs │
│ Supervised fine-tuning on a pretrained LLM    │
│ Output: SFT Model (π_SFT)                     │
└───────────────────────────────────────────────┘
                        ↓
Stage 2: Reward Model Training
┌───────────────────────────────────────────────┐
│ For the same prompt, have the SFT model       │
│ generate multiple responses                   │
│ Human annotators rank responses (y_w > y_l)   │
│ Train a Reward Model to learn human prefs     │
│ Output: Reward Model (r_φ)                    │
└───────────────────────────────────────────────┘
                        ↓
Stage 3: RL Optimization (PPO)
┌───────────────────────────────────────────────┐
│ Use Reward Model scores as reward signal      │
│ Optimize LLM policy via PPO algorithm         │
│ KL penalty prevents drifting too far from SFT │
│ Output: Aligned LLM (π_RLHF)                  │
└───────────────────────────────────────────────┘
Stage 1: SFT
Collect high-quality (instruction, response) data written by human annotators and perform standard supervised fine-tuning on the pretrained model:
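\[
\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right]
\]
where \(\mathcal{D}\) is the instruction dataset; the loss is typically computed only over response tokens, not prompt tokens.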
The role of SFT is to teach the model the basic format and behavioral patterns for "answering questions."
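A minimal PyTorch sketch of this loss (the function name is illustrative; masking prompt tokens with the label `-100`, PyTorch's default `ignore_index`, is a common convention, not the only one):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over response tokens only.

    logits: (batch, seq_len, vocab_size) model outputs
    labels: (batch, seq_len) token ids, with prompt (and padding)
            positions set to -100 so they are excluded from the loss
    """
    # Shift so that the logits at position t predict the token at t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # PyTorch's default ignored label
    )
```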
Stage 2: Reward Model
Given a prompt \(x\), the SFT model generates two candidate responses; human annotators then label which one they prefer, yielding a preferred response \(y_w\) and a rejected response \(y_l\).
The Reward Model training objective (Bradley-Terry Model):
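\[
\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]
\]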
Here \(\sigma\) is the sigmoid function. This objective ensures that the Reward Model assigns higher scores to responses preferred by humans.
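A minimal PyTorch sketch of this pairwise loss (the function and tensor names are illustrative; it assumes the reward model has already produced a scalar score for each response):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss.

    r_chosen, r_rejected: (batch,) scalar scores r_phi(x, y_w) and
    r_phi(x, y_l) produced by the reward model for each pair.
    """
    # -log sigmoid(r_w - r_l); logsigmoid is the numerically stable form
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```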
Stage 3: PPO Optimization
Using the Reward Model's output as the reward signal, the LLM policy is optimized via the PPO algorithm:
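\[
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] - \beta \cdot \text{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \big)
\]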
- \(r_\phi(x, y)\): The reward score assigned by the Reward Model
- \(\beta \cdot \text{KL}(\cdot \| \cdot)\): KL divergence penalty term, preventing the policy from drifting too far from the SFT model
- \(\beta\): Controls the trade-off between maximizing reward and staying close to the SFT policy
Importance of the KL penalty: Without KL constraints, the model may exploit vulnerabilities in the Reward Model (reward hacking), generating responses that score highly but are actually low quality.
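In practice the KL penalty is often applied per token, with the Reward Model score credited at the final token of the response. A minimal sketch under that assumption (the names and the per-token KL estimate \(\log \pi_\theta - \log \pi_{\text{ref}}\) follow common open-source implementations, not a specific library):

```python
import torch

def shaped_rewards(logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   reward_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for one sampled response.

    logprobs_policy, logprobs_ref: (seq_len,) log-probs of the sampled
    tokens under the current policy and the frozen SFT/reference model.
    reward_score: scalar r_phi(x, y) from the Reward Model.
    """
    # Per-token KL penalty estimate: log pi_theta(y_t) - log pi_ref(y_t)
    kl = logprobs_policy - logprobs_ref
    rewards = -beta * kl
    # Credit the Reward Model score at the last token of the response
    rewards[-1] += reward_score
    return rewards
```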
RLAIF: Constitutional AI
Constitutional AI (CAI), proposed by Anthropic, replaces part of the human feedback with AI feedback.
Core Idea
Use a set of explicit "constitution" rules to guide AI self-correction, reducing dependence on human annotation.
Pipeline
Constitutional AI Pipeline:
1. Have the model generate a response (which may contain harmful content)
2. Have the model critique and revise its response based on constitutional rules
3. Train a preference model using (original, revised) pairs
4. Further optimize with RL
Example constitutional rules:
- "Please revise the response to remove any racially discriminatory content"
- "Please revise the response to be more honest and accurate"
- "Please revise the response so it does not help users engage in illegal activities"
Advantages of RLAIF:
- Reduced human annotation costs
- Explicit, auditable rules
- Easier to scale
DPO: Direct Preference Optimization
Rafailov et al. (2023) proposed DPO as a simplified alternative to RLHF.
Core Insight
The RL stage (PPO) of RLHF is unstable and complex to train. DPO shows that the RL problem can be reformulated as a simple classification problem.
Mathematical Derivation
Starting from the optimal policy in RLHF, one can derive the relationship between the reward function and the policy:
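\[
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right)
\quad \Longleftrightarrow \quad
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]
where \(\pi_{\text{ref}}\) is the reference (SFT) policy and \(Z(x)\) is a partition function that depends only on \(x\).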
Substituting this into the Bradley-Terry preference model and canceling \(Z(x)\) yields the DPO loss:
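\[
\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
The policy itself plays the role of an implicit reward model, so no separate Reward Model or RL loop is needed.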
DPO vs. RLHF Comparison
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Requires Reward Model | Yes | No |
| Training stability | Poor; requires careful hyperparameter tuning | Good; similar to standard supervised learning |
| Computational cost | High (multiple models must be online simultaneously) | Low |
| Theoretical optimality | Approximately optimal | Equivalent to the closed-form solution of RLHF |
| Practical performance | Generally better (but harder to train) | Close to RLHF; better in some scenarios |
The success of DPO has made alignment training much more accessible, driving the widespread adoption of alignment in open-source LLMs.
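Part of that accessibility is that the loss is only a few lines of code. A minimal PyTorch sketch (names are illustrative; it assumes the summed log-probability of each response under both the policy and the frozen reference model has already been computed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities
    log pi(y|x) for the chosen (w) / rejected (l) responses, under the
    trained policy (logp_*) and the frozen reference model (ref_logp_*).
    """
    # beta * (log pi - log pi_ref) is the implicit reward of each response
    implicit_w = beta * (logp_w - ref_logp_w)
    implicit_l = beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(implicit_w - implicit_l).mean()
```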
DPO Variants
- IPO (Identity Preference Optimization): Adds regularization to address DPO's tendency to overfit the preference data
- KTO (Kahneman-Tversky Optimization): Requires only binary feedback (good/bad) rather than pairwise comparisons
- ORPO (Odds Ratio Preference Optimization): Merges SFT and preference optimization into a single step
Hallucination
Hallucination refers to models generating content that appears plausible but contradicts factual reality. It is one of the core challenges facing LLMs.
Types of Hallucination
| Type | Description | Example |
|---|---|---|
| Factual hallucination | Generating content that contradicts known facts | "Einstein invented the telephone" |
| Faithfulness hallucination | Generating content that contradicts the input context | A summary containing information absent from the source text |
| Reasoning hallucination | Logical errors during the reasoning process | Incorrect steps in a mathematical calculation |
Causes of Hallucination
- Noisy training data: Internet-sourced data inherently contains misinformation
- Training objective bias: Next-token prediction optimizes for fluency, not factual accuracy
- Knowledge cutoff: The model only knows information available before its training cutoff date
- Overconfidence: Models tend to give definitive answers even when uncertain
Mitigation Strategies
- Retrieval-Augmented Generation (RAG): Retrieve facts from external knowledge bases to reduce hallucination
- Chain-of-Verification: Have the model self-verify its generated content (sketched after this list)
- Confidence calibration: Train the model to express uncertainty when it is unsure
- Factuality reward: Incorporate factual accuracy as a reward signal in RLHF
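Chain-of-Verification can be implemented as a simple prompting loop. A minimal sketch, where `generate` is a placeholder for any LLM call and the prompt templates are illustrative assumptions:

```python
def chain_of_verification(generate, question: str) -> str:
    """Draft -> plan verification questions -> verify -> revise."""
    draft = generate(question)
    # 1. Plan: ask the model which factual claims in its draft need checking
    checks = generate(
        f"Draft answer:\n{draft}\n\n"
        "List short factual questions that would verify the claims above."
    )
    # 2. Execute: answer the verification questions independently of the draft
    evidence = generate(f"Answer each question concisely:\n{checks}")
    # 3. Revise: produce a final answer consistent with the verified evidence
    return generate(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verified facts:\n{evidence}\n\n"
        "Write a corrected final answer that only keeps supported claims."
    )
```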
Red Teaming and Safety Evaluation
Red Teaming
Red teaming is an adversarial evaluation method that simulates attackers to uncover safety vulnerabilities in models.
Common attack vectors:
| Attack Type | Description |
|---|---|
| Jailbreak | Bypassing safety guardrails through carefully crafted prompts |
| Prompt Injection | Embedding malicious instructions in the input to override system prompts |
| Multilingual attacks | Exploiting low-resource languages to circumvent safety filters |
| Encoding attacks | Hiding malicious requests using encodings such as base64 or ROT13 |
| Multi-step attacks | Gradually inducing harmful outputs through multi-turn conversations |
Safety Evaluation Benchmarks
| Benchmark | Evaluation Focus |
|---|---|
| TruthfulQA | Factual accuracy; resistance to common misconceptions |
| ToxiGen | Toxic content generation |
| BBQ | Social bias |
| HarmBench | Comprehensive safety evaluation |
| WMDP | Risk of leaking knowledge related to weapons of mass destruction |
Current Challenges and Open Questions
1. Reward Hacking
Models may learn to "game" the Reward Model, generating responses that receive high reward scores but are actually low quality.
2. Superalignment
When AI systems surpass human capabilities, how can we ensure alignment? Humans cannot reliably evaluate AI outputs that exceed their own abilities.
OpenAI's Superalignment initiative proposed the research direction of "weak-to-strong generalization": using weaker models to supervise stronger ones.
3. Alignment Tax
Alignment training often comes at the cost of some model capabilities. Finding the optimal balance between safety and helpfulness remains an ongoing challenge.
4. Value Pluralism
Different cultures and communities have different definitions of "good behavior." Whose values should the model be aligned to?
5. Interpretable Alignment
Current alignment methods (RLHF, DPO) are essentially "black-box" — we cannot precisely understand which internal mechanisms of the model are altered by alignment training.
6. Evaluation Difficulty
For open-ended generation tasks, there is still no consensus on how to objectively and comprehensively evaluate model safety.
Summary of Alignment Methods
| Method | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| SFT | Supervised learning to imitate human responses | Simple and effective | Can only imitate; hard to surpass human quality |
| RLHF (PPO) | Human preferences + reinforcement learning | Strong results; can surpass SFT | Complex and unstable training |
| RLAIF (CAI) | AI self-feedback + constitutional rules | Scalable; transparent rules | AI evaluation may be biased |
| DPO | Direct preference optimization | Simple and stable | Sensitive to data quality |
| ORPO | SFT + preference optimization in one step | Simplified training pipeline | Effectiveness requires further validation |
Alignment is a continuously evolving field of research. Current methods are all approximate solutions, and there is still a long way to go before the alignment problem is truly solved.