Reinforcement Learning Milestones
Overview
Reinforcement learning has undergone decades of development from theoretical foundations to industrial deployment. This article traces the key milestones in RL history, charting the technological evolution from TD-Gammon to o1.
Timeline Overview
```mermaid
timeline
    title RL Milestones (1992-2024)
    1992 : TD-Gammon
         : Backgammon
    2013 : DQN
         : Atari Games
    2016 : AlphaGo
         : Go
    2017 : AlphaZero
         : General Board Games
    2019 : OpenAI Five
         : Dota 2
    2019 : AlphaStar
         : StarCraft II
    2020 : MuZero
         : No Rules Needed
    2022 : ChatGPT
         : RLHF
    2023 : RT-2
         : Robotics
    2024 : o1
         : Reasoning
```
1. TD-Gammon (1992)
Achievement
Gerald Tesauro's TD-Gammon, developed at IBM, was the first RL system to reach expert human-level play in backgammon, trained almost entirely through self-play.
Core Algorithm
- TD(\(\lambda\)) temporal difference learning
- Neural network as value function approximator (3-layer feedforward, ~160 hidden units)
- Self-play for training data generation (~1.5 million games)
Key Formula
TD(\(\lambda\)) updates the value function after every step, crediting recently visited states through exponentially decaying eligibility traces:
\[
\Delta \theta_t = \alpha\, \delta_t \sum_{k=1}^{t} (\gamma\lambda)^{t-k}\, \nabla_\theta V(s_k)
\]
where \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.
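A tabular sketch of this backward-view update using eligibility traces (TD-Gammon applied the same rule to neural-network weights; all names here are illustrative):

```python
import numpy as np

def td_lambda_episode(V, states, rewards, alpha=0.1, gamma=1.0, lam=0.7):
    """Backward-view TD(lambda) over one episode.

    V: np.ndarray of state values (a tabular stand-in for TD-Gammon's net).
    states: indices s_0..s_T (one longer than rewards); rewards: r_1..r_T.
    """
    e = np.zeros_like(V)                               # eligibility trace per state
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        delta = rewards[t] + gamma * V[s_next] - V[s]  # TD error
        e *= gamma * lam                               # decay all traces
        e[s] += 1.0                                    # bump trace for current state
        V += alpha * delta * e                         # credit recently visited states
    return V
```

In the neural-network case the one-hot trace bump is replaced by the gradient \(\nabla_\theta V(s_t)\), which is exactly the sum inside the formula above.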
Historical Significance
- First demonstration that RL + neural networks is viable in complex games
- Inspired the direction of subsequent deep RL research
- Pioneer of the self-play training paradigm
2. DQN: Deep Q-Network (2013/2015)
Achievement
DeepMind's DQN, introduced in a 2013 workshop paper and published in Nature in 2015, reached performance comparable to or surpassing a professional human tester across 49 Atari 2600 games, using a single algorithm and network architecture.
Core Algorithm
- Deep Q-Network: Convolutional neural network approximating \(Q(s,a;\theta)\)
- Experience Replay: Breaking sample correlations
- Target Network: Stabilizing training
Key Innovation
The network minimizes the temporal-difference loss against a periodically frozen copy of itself:
\[
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^{2}\right]
\]
where \(\theta^-\) are the target network parameters, periodically copied from \(\theta\), and \(\mathcal{D}\) is the experience replay buffer.
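Both mechanisms appear directly in the loss computation; a minimal PyTorch-style sketch (tensor shapes and names are assumptions, not DeepMind's code):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN update on a replay batch; a: long tensor of action indices."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():                                   # no grad into the target
        max_q_next = target_net(s_next).max(dim=1).values   # max_a' Q(s', a'; theta^-)
        target = r + gamma * (1.0 - done) * max_q_next      # done masks terminal s'
    return F.smooth_l1_loss(q_sa, target)                   # Huber ~= error clipping

# Every C steps: target_net.load_state_dict(q_net.state_dict())
```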
Historical Significance
- Inaugurated the deep reinforcement learning era
- Proved end-to-end pixel-to-action learning is feasible
- Sparked widespread academic and industrial interest in deep RL
- Subsequent extensions: Double DQN, Dueling DQN, Prioritized ER, Rainbow
3. AlphaGo (2016)
Achievement
DeepMind's AlphaGo defeated world champion Lee Sedol 4:1, a historic breakthrough in the game of Go. With roughly \(10^{170}\) legal positions, Go's state space is far too large for the brute-force search that powered earlier board game AI systems.
Core Algorithm
- Policy network \(p_\sigma(a|s)\): Supervised learning from expert human games
- Value network \(v_\theta(s)\): Evaluating board position win rates
- Monte Carlo Tree Search (MCTS): Search combining policy and value networks
- Self-play RL: Policy gradient for further improvement
System Architecture
```
AlphaGo System:
├── SL Policy Network (trained on human games)
├── RL Policy Network (self-play reinforcement)
├── Value Network (position evaluation)
└── MCTS (search and decision-making)
```
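Inside MCTS, moves are selected by a PUCT rule that balances the value network's evaluation against the policy network's prior; a small sketch over an illustrative edge-statistics structure:

```python
import math

def select_action(edges, c_puct=1.0):
    """PUCT: argmax_a Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).

    edges: dict action -> {"N": visit count, "Q": mean value, "P": policy prior}
    (an illustrative stand-in for AlphaGo's tree-edge statistics).
    """
    total_n = sum(e["N"] for e in edges.values())
    return max(edges, key=lambda a: edges[a]["Q"]
               + c_puct * edges[a]["P"] * math.sqrt(total_n) / (1 + edges[a]["N"]))

# e.g. select_action({"d4": {"N": 10, "Q": 0.52, "P": 0.3},
#                     "q16": {"N": 3, "Q": 0.48, "P": 0.5}})
```

Rarely visited moves with a high prior get a large exploration bonus, so the policy network steers the search while the value estimates gradually take over.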
Historical Significance
- AI surpassed humans in the most complex board game
- Demonstrated the power of deep learning + RL + search
- Triggered global reassessment of AI capabilities
4. AlphaZero (2017)
Achievement
AlphaZero used no human game data whatsoever, learning purely through self-play from random initialization, given only the game rules, to surpass the strongest specialized AI systems in Go, chess, and shogi.
Core Improvements
- Eliminated human knowledge: No supervised learning phase, pure RL
- Unified architecture: Same algorithm solves three different games
- Simplified MCTS: Single neural network replaces rollouts
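The training objective behind these improvements fits in a few lines: the network is fit jointly to the MCTS visit distributions and the game outcomes. A minimal PyTorch-style sketch (the net interface is an assumption, not DeepMind's code):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(net, states, target_pi, target_z):
    """Joint AlphaZero loss on self-play data: (z - v)^2 - pi^T log p.

    target_pi: MCTS visit-count distributions over moves [B, A];
    target_z: final game outcomes in [-1, 1] from the mover's view [B].
    (Illustrative interface: net(states) -> (move log-probs, value).)
    """
    log_p, v = net(states)
    value_loss = F.mse_loss(v.squeeze(-1), target_z)       # (z - v)^2
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()   # cross-entropy vs. pi
    return value_loss + policy_loss                        # L2 via optimizer weight decay
```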
Key Results
| Game | Opponent | Result | Training Time |
|---|---|---|---|
| Go | AlphaGo Zero | 60:40 | 34 hours |
| Chess | Stockfish | 64:36 (28 wins, 72 draws, 0 losses) | 9 hours |
| Shogi | Elmo | 91:9 (90 wins, 2 draws, 8 losses) | 12 hours |
Historical Significance
- Proved pure self-play can surpass human knowledge
- Success of the "tabula rasa" learning paradigm
- Important validation of algorithmic generality
5. OpenAI Five (2019)
Achievement
OpenAI Five defeated the world champion team OG in full 5v5 Dota 2 matches. Dota 2's complexity far exceeds board games: real-time decisions, imperfect information, long time horizons, and team coordination.
Core Algorithm
- Large-scale PPO: ~800 petaflop/s-days of total training compute, with the system playing roughly 180 years' worth of games against itself every day
- Self-play: Opponent pool + historical versions
- Long time horizons: ~45 minutes per game, ~20,000 decision steps
- Distributed training: Thousands of GPUs in parallel
Technical Details
- Observation space: ~20,000 dimensional vector (not pixels)
- Action space: ~170,000 possible actions
- LSTM as policy network for temporal information
- Carefully crafted reward shaping
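One documented piece of this reward shaping is the "team spirit" coefficient \(\tau\), which anneals each hero from selfish to fully shared rewards over training; a minimal sketch (the individual shaping terms themselves are omitted):

```python
import numpy as np

def team_spirit_rewards(individual_rewards, tau):
    """Blend each hero's shaped reward with the team mean.

    tau = 0 -> each hero optimizes its own reward;
    tau = 1 -> fully shared team reward (annealed upward during training).
    """
    r = np.asarray(individual_rewards, dtype=float)
    return (1.0 - tau) * r + tau * r.mean()

# e.g. team_spirit_rewards([1.0, 0.2, -0.5, 0.0, 0.3], tau=0.8)
```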
Historical Significance
- First time RL reached top level in complex real-time strategy games
- Demonstrated the power of massive computation in RL
- Breakthrough in multi-agent cooperation
6. AlphaStar (2019)
Achievement
DeepMind's AlphaStar reached Grandmaster level (top 0.2% of players) in StarCraft II, using the full game interface with no simplifications.
Core Algorithm
- Multi-agent League Training: Maintaining a league of diverse strategies
- Imitation learning + RL: First learn from human replays, then improve through self-play
- Transformer architecture: Handling multi-entity attention in-game
- Autoregressive policy: Processing structured action spaces
League Training Architecture
```
League Training:
├── Main Agents (primary training)
├── Main Exploiters (counter-strategies against main agents)
└── League Exploiters (counter-strategies against entire league)
```
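Within the league, training matches are not sampled uniformly: prioritized fictitious self-play (PFSP) weights opponents by how often they beat the learner. A small sketch of that matchmaking idea (names and the weighting exponent are illustrative):

```python
import random

def pfsp_sample(opponents, win_rates, p=2.0):
    """Prioritized fictitious self-play: prefer opponents we often lose to.

    win_rates[i]: learner's estimated win rate against opponents[i];
    weight (1 - w)^p focuses games on the hardest league members.
    """
    weights = [max((1.0 - w) ** p, 1e-6) for w in win_rates]  # floor avoids all-zero
    return random.choices(opponents, weights=weights, k=1)[0]

# e.g. pfsp_sample(["main_v1", "exploiter_3"], [0.9, 0.4]) usually picks exploiter_3
```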
Historical Significance
- Breakthrough in imperfect-information real-time strategy games
- League Training became a classic paradigm for multi-agent training
- Demonstrated RL's ability to handle extremely complex decision spaces
7. MuZero (2020)
Achievement
MuZero achieved superhuman performance in Go, chess, shogi, and Atari without knowing the game rules, by learning an environment model.
Core Algorithm
MuZero learns three functions:
- Representation function \(h_\theta\): Maps observations to hidden states \(s = h_\theta(o)\)
- Dynamics function \(g_\theta\): Predicts next hidden state and reward \((r, s') = g_\theta(s, a)\)
- Prediction function \(f_\theta\): Predicts policy and value on hidden states \((p, v) = f_\theta(s)\)
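Planning then happens entirely in latent space: encode the observation once, and unroll the learned dynamics for each candidate action sequence. A minimal sketch (the function interfaces mirror the list above; everything else is illustrative):

```python
def muzero_unroll(h, g, f, observation, actions):
    """Unroll MuZero's learned model along a hypothetical action sequence.

    h: representation o -> s; g: dynamics (s, a) -> (r, s');
    f: prediction s -> (p, v).
    """
    s = h(observation)        # real observation -> latent state
    rewards = []
    for a in actions:         # plan entirely in latent space
        r, s = g(s, a)        # no game rules or simulator needed
        rewards.append(r)
    p, v = f(s)               # evaluate where the plan ends up
    return rewards, p, v
```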
Comparison with AlphaZero
| Dimension | AlphaZero | MuZero |
|---|---|---|
| Environment rules | Requires perfect simulator | Not needed |
| Model | None (uses simulator) | Learned latent model |
| Applicability | Perfect information games | Broader (including Atari) |
| MCTS | Search on real states | Search in latent space |
Historical Significance
- Major milestone for model-based RL
- Proved learned world models can replace perfect simulators
- Unified model-based and model-free approaches
8. RLHF and ChatGPT (2022)
Achievement
OpenAI's ChatGPT used RLHF (Reinforcement Learning from Human Feedback) to align large language model outputs with human preferences, sparking an AI revolution.
Core Algorithm
RLHF three stages:
- SFT: Supervised fine-tuning of the base model
- Reward modeling: Train reward model \(R_\phi(x,y)\)
- PPO optimization: Maximize the KL-regularized reward
\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[R_\phi(x,y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]
\]
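In code, the scalar this objective assigns to each sampled response is just the reward-model score minus a KL penalty; a minimal sketch with illustrative names:

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """KL-penalized reward for a sampled response y given prompt x.

    rm_score: reward model output R_phi(x, y).
    logp_*: summed token log-probs of y under the policy / frozen reference.
    beta: KL coefficient trading preference reward against drift.
    """
    kl_estimate = logp_policy - logp_ref   # per-sample estimate of log(pi/pi_ref)
    return rm_score - beta * kl_estimate

# e.g. rlhf_reward(rm_score=1.2, logp_policy=-45.0, logp_ref=-42.0) -> 1.5
```

The KL term is what keeps the fine-tuned policy from drifting into degenerate text that games the reward model.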
Key Papers
- InstructGPT (Ouyang et al., 2022): RLHF methodology
- Constitutional AI (Anthropic, 2022): Principle-based alignment
- DPO (Rafailov et al., 2023): Alternative without explicit reward model
Historical Significance
- Most widespread practical application of RL
- From academic research to products with hundreds of millions of users
- Opened a new era of AI alignment research
- Proved RL can effectively control generative model behavior
9. RT-2: Robotic Transformer (2023)
Achievement
Google DeepMind's RT-2 combined vision-language models (VLMs) with robot control, enabling end-to-end learning from natural language instructions to robot actions.
Core Algorithm
- Vision-Language-Action model (VLA): Representing robot actions as text tokens
- Large-scale pre-training: Leveraging internet-scale vision-language data
- Policy fine-tuning: Fine-tuning on robot manipulation data
Key Innovation
```
Input: Visual observation + Language instruction
  → VLM encoder (PaLM-E / PaLI-X)
  → Action token decoding
Output: Robot end-effector actions
```
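The core trick is making actions look like text: each continuous action dimension is discretized into a fixed number of bins whose ids the VLM emits as tokens. A hedged sketch of that tokenization (bin count and ranges are illustrative, not RT-2's exact scheme):

```python
import numpy as np

def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Discretize a continuous robot action into integer token ids.

    action: e.g. a 7-D end-effector command (position, rotation, gripper).
    Each dimension maps to one of n_bins ids, emitted as text by the VLM.
    """
    a = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((a - low) / (high - low) * (n_bins - 1)).astype(int)
    return bins.tolist()

# e.g. action_to_tokens([0.1, -0.3, 0.8, 0.0, 0.0, 0.2, 1.0])
```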
Historical Significance
- Fusion of foundation models with robot RL
- Demonstrated generalization capabilities enabled by language understanding
- Important progress in embodied AI
10. o1: Reasoning Enhancement (2024)
Achievement
OpenAI's o1 model used large-scale reinforcement learning to train its chain of thought, achieving major breakthroughs on mathematics, programming, and scientific reasoning tasks.
Core Approach
- Process-level reward signals: Rewarding intermediate reasoning steps rather than only final answers (widely assumed, e.g. via process reward models; OpenAI has not disclosed o1's exact recipe)
- Test-time Compute Scaling: Models can "think longer" during inference
- RL-trained reasoning: Using reinforcement learning to optimize chain-of-thought quality
Key Insight
o1's accuracy improves smoothly with both more RL training compute and more test-time "thinking" compute, establishing inference-time computation as a scaling axis alongside model and data size.
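o1's training recipe is undisclosed, but the scaling idea can be illustrated in its simplest form, best-of-N sampling against a scorer: more samples means more test-time compute (everything below is illustrative, not OpenAI's method):

```python
def best_of_n(generate, score, prompt, n=8):
    """Best-of-N test-time scaling: spend more compute, keep the best trace.

    generate(prompt) -> one candidate chain-of-thought + answer (stochastic).
    score(candidate) -> scalar from a verifier or reward model.
    Increasing n trades inference compute for answer quality.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```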
Subsequent Developments
- DeepSeek-R1: Open-source reasoning model trained with GRPO
- QwQ, Gemini Thinking: Reasoning-enhanced models from various organizations
- Inference-time search: Exploration of MCTS + LLM integration
Historical Significance
- Pioneered the test-time compute scaling paradigm
- RL expanded from games/robotics to cognitive reasoning
- Connected classical search/planning with modern LLMs
Milestone Summary
| Year | Milestone | Core Algorithm | Key Significance |
|---|---|---|---|
| 1992 | TD-Gammon | TD(λ) + NN | RL+NN feasibility proof |
| 2013 | DQN | DQN + Experience Replay | Inaugurated deep RL era |
| 2016 | AlphaGo | MCTS + Policy/Value Net | AI surpassed humans in Go |
| 2017 | AlphaZero | Self-play + MCTS | General board game AI without human knowledge |
| 2019 | OpenAI Five | Large-scale PPO | Complex real-time strategy game |
| 2019 | AlphaStar | League Training | Imperfect information strategy game |
| 2020 | MuZero | Learned world model | No environment rules needed |
| 2022 | ChatGPT | RLHF (PPO) | Most widespread RL application |
| 2023 | RT-2 | VLA model | Foundation models + robotics |
| 2024 | o1 | RL-trained reasoning | Test-time compute scaling |
Development Trends
Through these milestones, several clear trends emerge:
- Simple to complex environments: Board games → Video games → Real-time strategy → Open worlds
- Specialized to general: Single task → Multi-task → General capabilities
- Virtual to real: Simulated environments → Real robots
- Games to cognition: Playing games → Language alignment → Reasoning enhancement
- Scale effects: Greater computation consistently yields performance improvements
References
- Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature
- Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature
- Silver, D. et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science
- Berner, C. et al. (2019). Dota 2 with Large Scale Deep Reinforcement Learning
- Vinyals, O. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature
- Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback
- Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Further Reading
- RL Landscape — Methodological overview
- Deep RL Introduction — DQN in detail
- PPO Algorithm — PPO in detail
- RL in LLM Post-Training — RLHF and DPO