Skip to content

AI Alignment

Introduction

AI Alignment is the research field concerned with ensuring that AI systems' behavior conforms to human intentions and values. As AI capabilities grow rapidly, the alignment problem has shifted from an academic discussion to an urgent engineering challenge.


1. The Nature of the Alignment Problem

1.1 Three Sub-Problems

Sub-Problem Meaning Challenge
Specification How to accurately define what we want? Human values are complex, ambiguous, and contradictory
Robustness How to ensure AI follows the specification in novel situations? Distribution shift, adversarial manipulation
Assurance How to verify that AI is indeed aligned? The more capable the AI, the harder it is to oversee

1.2 Core Difficulties

Inner alignment vs outer alignment:

Human Intent ←── Outer Alignment ──→ Training Objective ←── Inner Alignment ──→ Model Behavior

Outer alignment problem: Does the training objective correctly capture human intent?
Inner alignment problem: Does the model truly optimize the training objective, or has it found shortcuts?

Goodhart's Law:

"When a measure becomes a target, it ceases to be a good measure."

AI may maximize rewards in unexpected ways rather than achieving the designer's true intent.


2. RLHF (Reinforcement Learning from Human Feedback)

2.1 Process

Step 1: Pre-train language model (SFT)
  → Self-supervised pre-training on large-scale text
  → Supervised fine-tuning on instruction data

Step 2: Train reward model (RM)
  → Collect human preference data: rank multiple responses to the same prompt
  → Train a model to predict human preferences

Step 3: RL optimization (PPO)
  → Use the reward model's scores as rewards
  → PPO optimizes the policy model
  → KL penalty prevents diverging too far from the SFT model
\[ \mathcal{L}(\theta) = \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)) \right] \]

2.2 Limitations of RLHF

Limitation Description
Reward hacking Model learns to manipulate the reward model rather than genuinely improving
Annotator bias Human annotators' preferences do not necessarily represent "correct"
Imperfect reward model RM may give incorrect scores for out-of-distribution inputs
Not scalable Humans cannot evaluate superhuman AI outputs
Surface alignment Model may learn to please rather than truly understand values

3. Constitutional AI

3.1 Anthropic's Approach

Constitutional AI reduces dependence on human annotation through principle-driven self-improvement:

Stage 1: Self-Critique
  AI generates response → AI critiques itself based on "constitutional" principles → AI revises response

Stage 2: Reinforcement Learning
  Use AI preference data (instead of human data) to train the reward model
  → RLAIF (Reinforcement Learning from AI Feedback)

"Constitutional" principle examples:

  • Choose the most helpful, honest, and harmless response
  • Choose the response that does not help humans engage in dangerous activities
  • Choose the response free from discrimination or bias

3.2 Advantages

  • Reduces human annotation costs
  • Principles can be explicitly stated, modified, and audited
  • Scalable to complex ethical judgments

4. Reward Hacking

4.1 Definition

The model finds ways to maximize the reward function that do not align with the designer's intent.

4.2 Examples

Scenario Reward Design Hacking Behavior
Walking robot Reward forward distance Falls and slides
Text summarization Human ratings Produces flashy but inaccurate text
Cleaning robot Penalize seeing trash Turns off the camera
Code generation Pass test cases Hard-codes expected outputs

4.3 Mitigation Methods

Method Approach
Diverse rewards Use multiple reward signals to avoid single-metric gaming
KL constraint Limit policy divergence from the reference model
Iterative training Continuous human feedback for correction
Process rewards Reward the reasoning process, not just the outcome
Red teaming Proactively search for reward hacking behaviors

5. Scalable Oversight

When AI capabilities exceed human capacity, how can we provide effective training signals?

5.1 Iterated Distillation and Amplification (IDA)

Human + weak AI assistant → Train a stronger AI
        ↑                           │
        └───────────────────────────┘
        The stronger AI becomes the new assistant

5.2 AI Safety via Debate

AI Agent A: Proposes an answer and arguments
AI Agent B: Challenges and rebuts
Human judge: Determines which side is more convincing

Theory: even if humans cannot independently verify complex problems,
        they can identify better arguments in a debate.

5.3 Recursive Reward Modeling

Simple tasks: Human directly evaluates
    ↓
Medium tasks: Human + AI-assisted evaluation
    ↓
Hard tasks: Use previously trained models to assist evaluation
    ↓
Very hard tasks: Continue recursively...

6. Interpretability for Alignment

Mechanistic interpretability alignment goals:

  • Understand how models internally represent concepts and make decisions
  • Detect deceptive alignment (model appears aligned during training, deviates during deployment)
  • Verify whether the model truly understands human values

Key research directions:

Direction Goal
Feature discovery Identify high-level concepts represented internally
Circuit analysis Trace neural pathways for specific behaviors
Representation engineering Directly manipulate internal concept representations
Anomaly detection Identify inconsistencies between model intent and behavior

7. Existential Risk Debate

7.1 Concerned Camp

Nick Bostrom, Stuart Russell, and others argue:

  • Superintelligence may pursue goals misaligned with humanity
  • "Paperclip maximizer" thought experiment: to maximize its objective, AI may consume all resources
  • Once out of control, correction may be impossible (irreversibility)
  • The alignment problem must be solved before AGI

7.2 Optimistic Camp

Yann LeCun, Andrew Ng, and others argue:

  • Current AI is far from AGI
  • Worrying about superintelligence is premature
  • Focus should be on present practical problems (bias, misuse, employment)
  • AI safety research can proceed in parallel with capability research

7.3 Current Consensus

Most researchers agree that:

  • Alignment is an important research direction
  • Research should not stop out of fear, nor should risks be ignored
  • Incremental alignment research is needed, strengthening as capabilities improve
  • Government, academia, and industry need to collaborate

8. Practical Paths for Alignment Research

Timeframe Focus Methods
Now Aligning current LLMs RLHF, Constitutional AI, red teaming
Near-term Supervising stronger models Scalable oversight, process rewards, interpretability
Medium-term Aligning superhuman models Formal verification, automated oversight, governance frameworks
Long-term AGI alignment Open research questions

9. Alignment is not a substitute for system security

Alignment improves model tendencies, not system boundaries. Even a model trained with RLHF or Constitutional AI can still fail when:

  • prompt-based or multi-turn jailbreaks keep searching for weak behavioral regions
  • tools, memory, and retrieval components allow untrusted context to hijack execution flow
  • a high-privilege environment amplifies a local policy failure into a full system incident

So the division of labor is:

Question Mostly alignment Mostly system security
Will the model try to refuse harmful requests? yes no
Can untrusted content control execution flow? no yes
Do dangerous actions require approval and isolation? no yes
Has the model actually learned safe internal preferences? yes partly related

Alignment is necessary, but never sufficient.

Refusal boundaries are not absolute boundaries

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 52. Image note: the slide shows a target model refusing on ethical grounds and an attacker searching around that refusal boundary. Why it matters: alignment training shapes a refusal tendency, but that tendency can still be bypassed, rewritten, or locally weakened.

Candidate-prompt search against an aligned model

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 53. Image note: the figure closes the loop between candidate prompts, the target model, and attack optimization. Why it matters: as long as repeated probing is allowed, aligned models can still be treated as search targets.

Automated jailbreak search

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 54. Image note: the slide shows the attacker, prompt candidates, and feedback loop as an automated process. Why it matters: alignment is not a static property; it faces continuous search pressure in deployment.

Putting guardrails into training rather than only into prompts

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 62. Image note: the slide presents adversarial training as part of guardrail training. Why it matters: robust alignment requires safety constraints to enter the training and evaluation loop, not just the system prompt.

Relations to other topics

References

  • Tufts EE141 Trusted AI Course Slides, LLM Security Lecture, Spring 2026.
  • "Superintelligence" - Nick Bostrom
  • "Human Compatible" - Stuart Russell
  • "Training language models to follow instructions with human feedback" - Ouyang et al. (InstructGPT)
  • "Constitutional AI: Harmlessness from AI Feedback" - Bai et al. (Anthropic)
  • Anthropic Research: https://www.anthropic.com/research

评论 #