Safety & Alignment
What Is Alignment?
The core question of alignment is:
How can we ensure that AI systems behave in accordance with human values, intentions, and expectations?
A pretrained language model is fundamentally a "text completer" — it has learned to predict the next token, but that does not mean it will behave the way humans expect. An unaligned LLM may:
- Generate harmful, toxic, or discriminatory content
- Provide plausible-sounding but factually incorrect information (hallucination)
- Ignore user intent and produce irrelevant responses
- Be manipulated by adversarial prompts into producing dangerous outputs
The goals of alignment are commonly summarized by the HHH principles:
| Principle | Meaning |
|---|---|
| Helpful | The model should do its best to assist users in completing their tasks |
| Honest | The model should provide accurate information and express uncertainty when unsure |
| Harmless | The model should not generate harmful content or assist in dangerous activities |
RLHF: Reinforcement Learning from Human Feedback
RLHF (Reinforcement Learning from Human Feedback) is currently the most widely adopted alignment method; it was first applied systematically to LLMs in InstructGPT (Ouyang et al., 2022).
Full Pipeline
RLHF Three-Stage Pipeline:
Stage 1: SFT (Supervised Fine-Tuning)
┌───────────────────────────────────────────────┐
│ Collect high-quality (prompt, response) pairs │
│ Supervised fine-tuning on a pretrained LLM    │
│ Output: SFT Model (π_SFT)                     │
└───────────────────────────────────────────────┘
                        ↓
Stage 2: Reward Model Training
┌───────────────────────────────────────────────┐
│ For the same prompt, have the SFT model       │
│ generate multiple responses                   │
│ Human annotators rank responses (y_w > y_l)   │
│ Train a Reward Model to learn human prefs     │
│ Output: Reward Model (r_φ)                    │
└───────────────────────────────────────────────┘
                        ↓
Stage 3: RL Optimization (PPO)
┌───────────────────────────────────────────────┐
│ Use Reward Model scores as reward signal      │
│ Optimize LLM policy via PPO algorithm         │
│ KL penalty prevents drifting too far from SFT │
│ Output: Aligned LLM (π_RLHF)                  │
└───────────────────────────────────────────────┘
Stage 1: SFT
Collect high-quality (instruction, response) data written by human annotators and perform standard supervised fine-tuning on the pretrained model:
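\[
\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \right]
\]
where \(\mathcal{D}\) is the instruction dataset; the loss is typically computed only over response tokens, not prompt tokens.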
The role of SFT is to teach the model the basic format and behavioral patterns for "answering questions."
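A minimal PyTorch sketch of this loss (the function name is illustrative; masking prompt tokens with the label `-100`, PyTorch's default `ignore_index`, is a common convention, not the only one):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over response tokens only.

    logits: (batch, seq_len, vocab_size) model outputs
    labels: (batch, seq_len) token ids, with prompt (and padding)
            positions set to -100 so they are excluded from the loss
    """
    # Shift so that the logits at position t predict the token at t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # PyTorch's default ignored label
    )
```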
Stage 2: Reward Model
Given a prompt \(x\), the SFT model generates two candidate responses; human annotators then label which one they prefer, yielding a preferred response \(y_w\) and a rejected response \(y_l\).
The Reward Model training objective (Bradley-Terry Model):
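\[
\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]
\]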
Here \(\sigma\) is the sigmoid function. This objective ensures that the Reward Model assigns higher scores to responses preferred by humans.
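A minimal PyTorch sketch of this pairwise loss (the function and tensor names are illustrative; it assumes the reward model has already produced a scalar score for each response):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss.

    r_chosen, r_rejected: (batch,) scalar scores r_phi(x, y_w) and
    r_phi(x, y_l) produced by the reward model for each pair.
    """
    # -log sigmoid(r_w - r_l); logsigmoid is the numerically stable form
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```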
Stage 3: PPO Optimization
Using the Reward Model's output as the reward signal, the LLM policy is optimized via the PPO algorithm:
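\[
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] - \beta \cdot \text{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \big)
\]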
- \(r_\phi(x, y)\): The reward score assigned by the Reward Model
- \(\beta \cdot \text{KL}(\cdot \| \cdot)\): KL divergence penalty term, preventing the policy from drifting too far from the SFT model
- \(\beta\): Controls the trade-off between maximizing reward and staying close to the SFT policy
Importance of the KL penalty: Without KL constraints, the model may exploit vulnerabilities in the Reward Model (reward hacking), generating responses that score highly but are actually low quality.
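In practice the KL penalty is often applied per token, with the Reward Model score credited at the final token of the response. A minimal sketch under that assumption (the names and the per-token KL estimate \(\log \pi_\theta - \log \pi_{\text{ref}}\) follow common open-source implementations, not a specific library):

```python
import torch

def shaped_rewards(logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   reward_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards for one sampled response.

    logprobs_policy, logprobs_ref: (seq_len,) log-probs of the sampled
    tokens under the current policy and the frozen SFT/reference model.
    reward_score: scalar r_phi(x, y) from the Reward Model.
    """
    # Per-token KL penalty estimate: log pi_theta(y_t) - log pi_ref(y_t)
    kl = logprobs_policy - logprobs_ref
    rewards = -beta * kl
    # Credit the Reward Model score at the last token of the response
    rewards[-1] += reward_score
    return rewards
```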
RLAIF: Constitutional AI
Constitutional AI (CAI), proposed by Anthropic, replaces part of the human feedback with AI feedback.
Core Idea
Use a set of explicit "constitution" rules to guide AI self-correction, reducing dependence on human annotation.
Pipeline
Constitutional AI Pipeline:
1. Have the model generate a response (which may contain harmful content)
2. Have the model critique and revise its response based on constitutional rules
3. Train a preference model using (original, revised) pairs
4. Further optimize with RL
Example constitutional rules:
- "Please revise the response to remove any racially discriminatory content"
- "Please revise the response to be more honest and accurate"
- "Please revise the response so it does not help users engage in illegal activities"
Advantages of RLAIF:
- Reduced human annotation costs
- Explicit, auditable rules
- Easier to scale
DPO: Direct Preference Optimization
Rafailov et al. (2023) proposed DPO as a simplified alternative to RLHF.
Core Insight
The RL stage (PPO) of RLHF is unstable and complex to train. DPO shows that the RL problem can be reformulated as a simple classification problem.
Mathematical Derivation
Starting from the optimal policy in RLHF, one can derive the relationship between the reward function and the policy:
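\[
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right)
\quad \Longleftrightarrow \quad
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]
where \(\pi_{\text{ref}}\) is the reference (SFT) policy and \(Z(x)\) is a partition function that depends only on \(x\).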
Substituting this into the Bradley-Terry preference model and canceling \(Z(x)\) yields the DPO loss:
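\[
\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
The policy itself plays the role of an implicit reward model, so no separate Reward Model or RL loop is needed.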
DPO vs. RLHF Comparison
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Requires Reward Model | Yes | No |
| Training stability | Poor; requires careful hyperparameter tuning | Good; similar to standard supervised learning |
| Computational cost | High (multiple models must be online simultaneously) | Low |
| Theoretical optimality | Approximately optimal | Equivalent to the closed-form solution of RLHF |
| Practical performance | Generally better (but harder to train) | Close to RLHF; better in some scenarios |
The success of DPO has made alignment training much more accessible, driving the widespread adoption of alignment in open-source LLMs.
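Part of that accessibility is that the loss is only a few lines of code. A minimal PyTorch sketch (names are illustrative; it assumes the summed log-probability of each response under both the policy and the frozen reference model has already been computed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities
    log pi(y|x) for the chosen (w) / rejected (l) responses, under the
    trained policy (logp_*) and the frozen reference model (ref_logp_*).
    """
    # beta * (log pi - log pi_ref) is the implicit reward of each response
    implicit_w = beta * (logp_w - ref_logp_w)
    implicit_l = beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(implicit_w - implicit_l).mean()
```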
DPO Variants
- IPO (Identity Preference Optimization): Adds regularization to address DPO's tendency to overfit the preference data
- KTO (Kahneman-Tversky Optimization): Requires only binary feedback (good/bad) rather than pairwise comparisons
- ORPO (Odds Ratio Preference Optimization): Merges SFT and preference optimization into a single step
Hallucination
Hallucination refers to models generating content that appears plausible but contradicts factual reality. It is one of the core challenges facing LLMs.
Types of Hallucination
| Type | Description | Example |
|---|---|---|
| Factual hallucination | Generating content that contradicts known facts | "Einstein invented the telephone" |
| Faithfulness hallucination | Generating content that contradicts the input context | A summary containing information absent from the source text |
| Reasoning hallucination | Logical errors during the reasoning process | Incorrect steps in a mathematical calculation |
Causes of Hallucination
- Noisy training data: Internet-sourced data inherently contains misinformation
- Training objective bias: Next-token prediction optimizes for fluency, not factual accuracy
- Knowledge cutoff: The model only knows information available before its training cutoff date
- Overconfidence: Models tend to give definitive answers even when uncertain
Mitigation Strategies
- Retrieval-Augmented Generation (RAG): Retrieve facts from external knowledge bases to reduce hallucination
- Chain-of-Verification: Have the model self-verify its generated content (sketched after this list)
- Confidence calibration: Train the model to express uncertainty when it is unsure
- Factuality reward: Incorporate factual accuracy as a reward signal in RLHF
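Chain-of-Verification can be implemented as a simple prompting loop. A minimal sketch, where `generate` is a placeholder for any LLM call and the prompt templates are illustrative assumptions:

```python
def chain_of_verification(generate, question: str) -> str:
    """Draft -> plan verification questions -> verify -> revise."""
    draft = generate(question)
    # 1. Plan: ask the model which factual claims in its draft need checking
    checks = generate(
        f"Draft answer:\n{draft}\n\n"
        "List short factual questions that would verify the claims above."
    )
    # 2. Execute: answer the verification questions independently of the draft
    evidence = generate(f"Answer each question concisely:\n{checks}")
    # 3. Revise: produce a final answer consistent with the verified evidence
    return generate(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verified facts:\n{evidence}\n\n"
        "Write a corrected final answer that only keeps supported claims."
    )
```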
Red Teaming and Safety Evaluation
Red Teaming
Red teaming is an adversarial evaluation method that simulates attackers to uncover safety vulnerabilities in models.
Common attack vectors:
| Attack Type | Description |
|---|---|
| Jailbreak | Bypassing safety guardrails through carefully crafted prompts |
| Prompt Injection | Embedding malicious instructions in the input to override system prompts |
| Multilingual attacks | Exploiting low-resource languages to circumvent safety filters |
| Encoding attacks | Hiding malicious requests using encodings such as base64 or ROT13 |
| Multi-step attacks | Gradually inducing harmful outputs through multi-turn conversations |
Safety Evaluation Benchmarks
| Benchmark | Evaluation Focus |
|---|---|
| TruthfulQA | Factual accuracy; resistance to common misconceptions |
| ToxiGen | Toxic content generation |
| BBQ | Social bias |
| HarmBench | Comprehensive safety evaluation |
| WMDP | Risk of leaking knowledge related to weapons of mass destruction |
Current Challenges and Open Questions
1. Reward Hacking
Models may learn to "game" the Reward Model, generating responses that receive high reward scores but are actually low quality.
2. Superalignment
When AI systems surpass human capabilities, how can we ensure alignment? Humans cannot reliably evaluate AI outputs that exceed their own abilities.
OpenAI's Superalignment initiative proposed the research direction of "weak-to-strong generalization": using weaker models to supervise stronger ones.
3. Alignment Tax
Alignment training often comes at the cost of some model capabilities. Finding the optimal balance between safety and helpfulness remains an ongoing challenge.
4. Value Pluralism
Different cultures and communities have different definitions of "good behavior." Whose values should the model be aligned to?
5. Interpretable Alignment
Current alignment methods (RLHF, DPO) are essentially "black-box" — we cannot precisely understand which internal mechanisms of the model are altered by alignment training.
6. Evaluation Difficulty
For open-ended generation tasks, there is still no consensus on how to objectively and comprehensively evaluate model safety.
Summary of Alignment Methods
| Method | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| SFT | Supervised learning to imitate human responses | Simple and effective | Can only imitate; hard to surpass human quality |
| RLHF (PPO) | Human preferences + reinforcement learning | Strong results; can surpass SFT | Complex and unstable training |
| RLAIF (CAI) | AI self-feedback + constitutional rules | Scalable; transparent rules | AI evaluation may be biased |
| DPO | Direct preference optimization | Simple and stable | Sensitive to data quality |
| ORPO | SFT + preference optimization in one step | Simplified training pipeline | Effectiveness requires further validation |
Alignment is a continuously evolving field of research. Current methods are all approximate solutions, and there is still a long way to go before the alignment problem is truly solved.