AI Alignment

Introduction

AI Alignment is the research field concerned with ensuring that AI systems' behavior conforms to human intentions and values. As AI capabilities grow rapidly, the alignment problem has shifted from an academic discussion to an urgent engineering challenge.

1. The Nature of the Alignment Problem

1.1 Three Sub-Problems

Sub-Problem	Meaning	Challenge
Specification	How to accurately define what we want?	Human values are complex, ambiguous, and contradictory
Robustness	How to ensure AI follows the specification in novel situations?	Distribution shift, adversarial manipulation
Assurance	How to verify that AI is indeed aligned?	The more capable the AI, the harder it is to oversee

1.2 Core Difficulties

Inner alignment vs outer alignment:

Human Intent ←── Outer Alignment ──→ Training Objective ←── Inner Alignment ──→ Model Behavior

Outer alignment problem: Does the training objective correctly capture human intent?
Inner alignment problem: Does the model truly optimize the training objective, or has it found shortcuts?

Goodhart's Law:

"When a measure becomes a target, it ceases to be a good measure."

AI may maximize rewards in unexpected ways rather than achieving the designer's true intent.

2. RLHF (Reinforcement Learning from Human Feedback)

2.1 Process

Step 1: Pre-train language model (SFT)
  → Self-supervised pre-training on large-scale text
  → Supervised fine-tuning on instruction data

Step 2: Train reward model (RM)
  → Collect human preference data: rank multiple responses to the same prompt
  → Train a model to predict human preferences

Step 3: RL optimization (PPO)
  → Use the reward model's scores as rewards
  → PPO optimizes the policy model
  → KL penalty prevents diverging too far from the SFT model

\[ \mathcal{L}(\theta) = \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta(y|x) \| \pi_{\text{ref}}(y|x)) \right] \]

2.2 Limitations of RLHF

Limitation	Description
Reward hacking	Model learns to manipulate the reward model rather than genuinely improving
Annotator bias	Human annotators' preferences do not necessarily represent "correct"
Imperfect reward model	RM may give incorrect scores for out-of-distribution inputs
Not scalable	Humans cannot evaluate superhuman AI outputs
Surface alignment	Model may learn to please rather than truly understand values

3. Constitutional AI

3.1 Anthropic's Approach

Constitutional AI reduces dependence on human annotation through principle-driven self-improvement:

Stage 1: Self-Critique
  AI generates response → AI critiques itself based on "constitutional" principles → AI revises response

Stage 2: Reinforcement Learning
  Use AI preference data (instead of human data) to train the reward model
  → RLAIF (Reinforcement Learning from AI Feedback)

"Constitutional" principle examples:

Choose the most helpful, honest, and harmless response
Choose the response that does not help humans engage in dangerous activities
Choose the response free from discrimination or bias

3.2 Advantages

Reduces human annotation costs
Principles can be explicitly stated, modified, and audited
Scalable to complex ethical judgments

4. Reward Hacking

4.1 Definition

The model finds ways to maximize the reward function that do not align with the designer's intent.

4.2 Examples

Scenario	Reward Design	Hacking Behavior
Walking robot	Reward forward distance	Falls and slides
Text summarization	Human ratings	Produces flashy but inaccurate text
Cleaning robot	Penalize seeing trash	Turns off the camera
Code generation	Pass test cases	Hard-codes expected outputs

4.3 Mitigation Methods

Method	Approach
Diverse rewards	Use multiple reward signals to avoid single-metric gaming
KL constraint	Limit policy divergence from the reference model
Iterative training	Continuous human feedback for correction
Process rewards	Reward the reasoning process, not just the outcome
Red teaming	Proactively search for reward hacking behaviors

5. Scalable Oversight

When AI capabilities exceed human capacity, how can we provide effective training signals?

5.1 Iterated Distillation and Amplification (IDA)

Human + weak AI assistant → Train a stronger AI
        ↑                           │
        └───────────────────────────┘
        The stronger AI becomes the new assistant

5.2 AI Safety via Debate

AI Agent A: Proposes an answer and arguments
AI Agent B: Challenges and rebuts
Human judge: Determines which side is more convincing

Theory: even if humans cannot independently verify complex problems,
        they can identify better arguments in a debate.

5.3 Recursive Reward Modeling

Simple tasks: Human directly evaluates
    ↓
Medium tasks: Human + AI-assisted evaluation
    ↓
Hard tasks: Use previously trained models to assist evaluation
    ↓
Very hard tasks: Continue recursively...

6. Interpretability for Alignment

Mechanistic interpretability alignment goals:

Understand how models internally represent concepts and make decisions
Detect deceptive alignment (model appears aligned during training, deviates during deployment)
Verify whether the model truly understands human values

Key research directions:

Direction	Goal
Feature discovery	Identify high-level concepts represented internally
Circuit analysis	Trace neural pathways for specific behaviors
Representation engineering	Directly manipulate internal concept representations
Anomaly detection	Identify inconsistencies between model intent and behavior

7. Existential Risk Debate

7.1 Concerned Camp

Nick Bostrom, Stuart Russell, and others argue:

Superintelligence may pursue goals misaligned with humanity
"Paperclip maximizer" thought experiment: to maximize its objective, AI may consume all resources
Once out of control, correction may be impossible (irreversibility)
The alignment problem must be solved before AGI

7.2 Optimistic Camp

Yann LeCun, Andrew Ng, and others argue:

Current AI is far from AGI
Worrying about superintelligence is premature
Focus should be on present practical problems (bias, misuse, employment)
AI safety research can proceed in parallel with capability research

7.3 Current Consensus

Most researchers agree that:

Alignment is an important research direction
Research should not stop out of fear, nor should risks be ignored
Incremental alignment research is needed, strengthening as capabilities improve
Government, academia, and industry need to collaborate

8. Practical Paths for Alignment Research

Timeframe	Focus	Methods
Now	Aligning current LLMs	RLHF, Constitutional AI, red teaming
Near-term	Supervising stronger models	Scalable oversight, process rewards, interpretability
Medium-term	Aligning superhuman models	Formal verification, automated oversight, governance frameworks
Long-term	AGI alignment	Open research questions

9. Alignment is not a substitute for system security

Alignment improves model tendencies, not system boundaries. Even a model trained with RLHF or Constitutional AI can still fail when:

prompt-based or multi-turn jailbreaks keep searching for weak behavioral regions
tools, memory, and retrieval components allow untrusted context to hijack execution flow
a high-privilege environment amplifies a local policy failure into a full system incident

So the division of labor is:

Question	Mostly alignment	Mostly system security
Will the model try to refuse harmful requests?	yes	no
Can untrusted content control execution flow?	no	yes
Do dangerous actions require approval and isolation?	no	yes
Has the model actually learned safe internal preferences?	yes	partly related

Alignment is necessary, but never sufficient.

Refusal boundaries are not absolute boundaries

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 52. Image note: the slide shows a target model refusing on ethical grounds and an attacker searching around that refusal boundary. Why it matters: alignment training shapes a refusal tendency, but that tendency can still be bypassed, rewritten, or locally weakened.

Candidate-prompt search against an aligned model

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 53. Image note: the figure closes the loop between candidate prompts, the target model, and attack optimization. Why it matters: as long as repeated probing is allowed, aligned models can still be treated as search targets.

Automated jailbreak search

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 54. Image note: the slide shows the attacker, prompt candidates, and feedback loop as an automated process. Why it matters: alignment is not a static property; it faces continuous search pressure in deployment.

Putting guardrails into training rather than only into prompts

Source: Tufts EE141 Trusted AI, Lecture 6, Slide 62. Image note: the slide presents adversarial training as part of guardrail training. Why it matters: robust alignment requires safety constraints to enter the training and evaluation loop, not just the system prompt.

Relations to other topics

For behavior-level attacks, see LLM Jailbreaking
For interpretability-based validation, see Explainability & Robustness
For engineering isolation and permissions, see AI Engineering Safety & Governance and LLM & Agent System Security
For the overall trustworthiness frame, see AI Safety Overview

References

Tufts EE141 Trusted AI Course Slides, LLM Security Lecture, Spring 2026.
"Superintelligence" - Nick Bostrom
"Human Compatible" - Stuart Russell
"Training language models to follow instructions with human feedback" - Ouyang et al. (InstructGPT)
"Constitutional AI: Harmlessness from AI Feedback" - Bai et al. (Anthropic)
Anthropic Research: https://www.anthropic.com/research