
Reinforcement Learning Milestones

Overview

Reinforcement learning has undergone decades of development from theoretical foundations to industrial deployment. This article traces the key milestones in RL history, charting the technological evolution from TD-Gammon to o1.


Timeline Overview

```mermaid
timeline
    title RL Milestones (1992-2024)
    1992 : TD-Gammon
         : Backgammon
    2013 : DQN
         : Atari Games
    2016 : AlphaGo
         : Go
    2017 : AlphaZero
         : General Board Games
    2019 : OpenAI Five
         : Dota 2
    2019 : AlphaStar
         : StarCraft II
    2020 : MuZero
         : No Rules Needed
    2022 : ChatGPT
         : RLHF
    2023 : RT-2
         : Robotics
    2024 : o1
         : Reasoning
```

1. TD-Gammon (1992)

Achievement

Gerald Tesauro's TD-Gammon, developed at IBM, was the first RL system to reach expert-level human play in backgammon, trained entirely through self-play.

Core Algorithm

  • TD(\(\lambda\)) temporal difference learning
  • Neural network as value function approximator (3-layer feedforward, ~160 hidden units)
  • Self-play for training data generation (~1.5 million games)

Key Formula

\[V(s_t) \leftarrow V(s_t) + \alpha \sum_{k=t}^{T-1} \lambda^{k-t} \delta_k\]

where \(\delta_k = r_{k+1} + \gamma V(s_{k+1}) - V(s_k)\) is the TD error.
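The update above can be sketched directly in code. This is a minimal tabular version, a stand-in for TD-Gammon's neural network value function, applying the forward-view TD(\(\lambda\)) update once over a finished episode:

```python
# Forward-view TD(lambda) value update over one episode, matching the
# formula above. A value table stands in for TD-Gammon's neural network.
def td_lambda_update(V, states, rewards, alpha=0.1, gamma=1.0, lam=0.7):
    """V: dict state -> value; states: s_0..s_T visited; rewards: r_1..r_T."""
    T = len(rewards)
    # TD errors: delta_k = r_{k+1} + gamma*V(s_{k+1}) - V(s_k), from the old V
    deltas = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
              for k in range(T)]
    for t in range(T):
        # V(s_t) += alpha * sum_{k=t}^{T-1} lambda^{k-t} * delta_k
        V[states[t]] += alpha * sum(lam ** (k - t) * deltas[k]
                                    for k in range(t, T))
    return V
```

With \(\lambda = 1\) this reduces to a Monte Carlo return; with \(\lambda = 0\) it is one-step TD.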

Historical Significance

  • First demonstration that RL + neural networks is viable in complex games
  • Inspired the direction of subsequent deep RL research
  • Pioneer of the self-play training paradigm

2. DQN: Deep Q-Network (2013/2015)

Achievement

DeepMind's DQN matched or surpassed human-level performance on a large fraction of 49 Atari 2600 games using a single algorithm and network architecture, published in Nature in 2015 (the initial arXiv version appeared in 2013).

Core Algorithm

  • Deep Q-Network: Convolutional neural network approximating \(Q(s,a;\theta)\)
  • Experience Replay: Breaking sample correlations
  • Target Network: Stabilizing training

Key Innovation

\[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \right)^2 \right]\]

where \(\theta^-\) are target network parameters, periodically copied from \(\theta\).
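A minimal sketch of this loss for one sampled minibatch, using plain arrays of Q-values in place of the convolutional network (the array shapes are illustrative assumptions):

```python
import numpy as np

# DQN loss for one minibatch sampled from the replay buffer D, matching
# the formula above. q_sa plays the role of Q(s,a;theta) and
# q_target_next the role of Q(s',a';theta^-).
def dqn_loss(q_sa, q_target_next, rewards, dones, gamma=0.99):
    """q_sa: Q(s,a;theta) for the taken actions, shape (B,)
       q_target_next: Q(s',a';theta^-) for all actions, shape (B, A)
       dones: 1.0 at terminal transitions (no bootstrapping there)."""
    # y = r + gamma * max_a' Q(s',a';theta^-)
    targets = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    return np.mean((targets - q_sa) ** 2)
```

Note that the targets are computed with the frozen parameters \(\theta^-\); gradients flow only through \(Q(s,a;\theta)\), which is what stabilizes training.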

Historical Significance

  • Inaugurated the deep reinforcement learning era
  • Proved end-to-end pixel-to-action learning is feasible
  • Sparked widespread academic and industrial interest in deep RL
  • Subsequent extensions: Double DQN, Dueling DQN, Prioritized ER, Rainbow

3. AlphaGo (2016)

Achievement

DeepMind's AlphaGo defeated world Go champion Lee Sedol 4:1, a historic breakthrough in the game of Go. Go's state space of approximately \(10^{170}\) far exceeds the capacity of all previous board game AI systems.

Core Algorithm

  • Policy network \(p_\sigma(a|s)\): Supervised learning from expert human games
  • Value network \(v_\theta(s)\): Evaluating board position win rates
  • Monte Carlo Tree Search (MCTS): Search combining policy and value networks
  • Self-play RL: Policy gradient for further improvement

System Architecture

AlphaGo System:
  ├── SL Policy Network (trained on human games)
  ├── RL Policy Network (self-play reinforcement)
  ├── Value Network (position evaluation)
  └── MCTS (search and decision-making)
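Inside MCTS, AlphaGo combines the value and policy networks through a PUCT-style selection rule: each simulation follows the action maximizing the value estimate \(Q(s,a)\) plus an exploration bonus driven by the policy prior \(P(s,a)\). A minimal sketch (the constant and array representation are illustrative assumptions):

```python
import math

# PUCT-style action selection used inside AlphaGo's MCTS: combine the
# value estimate Q with an exploration bonus driven by the prior P.
def select_action(Q, N, P, c_puct=1.0):
    """Q, N, P: per-action value, visit count, and prior (same length)."""
    total_visits = sum(N)
    def score(a):
        # Bonus shrinks as action a accumulates visits, encouraging
        # exploration of promising but under-visited moves.
        u = c_puct * P[a] * math.sqrt(total_visits) / (1 + N[a])
        return Q[a] + u
    return max(range(len(Q)), key=score)
```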

Historical Significance

  • AI surpassed humans in the most complex board game
  • Demonstrated the power of deep learning + RL + search
  • Triggered global reassessment of AI capabilities

4. AlphaZero (2017)

Achievement

AlphaZero used no human game data, learning purely through self-play (given only the game rules) to surpass the strongest specialized AI systems in Go, chess, and shogi.

Core Improvements

  • Eliminated human knowledge: No supervised learning phase, pure RL
  • Unified architecture: Same algorithm solves three different games
  • Simplified MCTS: Single neural network replaces rollouts

Key Results

| Game  | Opponent    | Result      | Training Time |
|-------|-------------|-------------|---------------|
| Go    | AlphaGo Lee | 100:0       | 34 hours      |
| Chess | Stockfish   | 155.5:44.5  | 9 hours       |
| Shogi | Elmo        | 91.2:8.8    | 12 hours      |

Historical Significance

  • Proved pure self-play can surpass human knowledge
  • Success of the "tabula rasa" learning paradigm
  • Important validation of algorithmic generality

5. OpenAI Five (2019)

Achievement

OpenAI Five defeated the world champion team OG in full 5v5 Dota 2 matches. Dota 2's complexity far exceeds board games: real-time decisions, imperfect information, long time horizons, and team coordination.

Core Algorithm

  • Large-scale PPO: roughly 800 petaflop/s-days of compute over the full training run
  • Self-play: Opponent pool + historical versions
  • Long time horizons: ~45 minutes per game, ~20,000 decision steps
  • Distributed training: Thousands of GPUs in parallel

Technical Details

  • Observation space: ~20,000 dimensional vector (not pixels)
  • Action space: ~170,000 possible actions
  • LSTM as policy network for temporal information
  • Carefully crafted reward shaping

Historical Significance

  • First time RL reached top level in complex real-time strategy games
  • Demonstrated the power of massive computation in RL
  • Breakthrough in multi-agent cooperation

6. AlphaStar (2019)

Achievement

DeepMind's AlphaStar reached Grandmaster level (top 0.2% of players) in StarCraft II, using the full game interface with no simplifications.

Core Algorithm

  • Multi-agent League Training: Maintaining a league of diverse strategies
  • Imitation learning + RL: First learn from human replays, then improve through self-play
  • Transformer architecture: Handling multi-entity attention in-game
  • Autoregressive policy: Processing structured action spaces

League Training Architecture

League Training:
  ├── Main Agents (primary training)
  ├── Main Exploiters (counter-strategies against main agents)
  └── League Exploiters (counter-strategies against entire league)

Historical Significance

  • Breakthrough in imperfect-information real-time strategy games
  • League Training became a classic paradigm for multi-agent training
  • Demonstrated RL's ability to handle extremely complex decision spaces

7. MuZero (2020)

Achievement

MuZero achieved superhuman performance in Go, chess, shogi, and Atari without knowing the game rules, by learning an environment model.

Core Algorithm

MuZero learns three functions:

  • Representation function \(h_\theta\): Maps observations to hidden states \(s = h_\theta(o)\)
  • Dynamics function \(g_\theta\): Predicts next hidden state and reward \((r, s') = g_\theta(s, a)\)
  • Prediction function \(f_\theta\): Predicts policy and value on hidden states \((p, v) = f_\theta(s)\)
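The three functions above can be composed to "imagine" a trajectory entirely in latent space, which is what MuZero's MCTS searches over. A minimal sketch, where `h`, `g`, and `f` are placeholders for the trained networks \(h_\theta\), \(g_\theta\), \(f_\theta\):

```python
# Unroll MuZero's three learned functions: encode the observation once,
# then roll the dynamics model forward on a candidate action sequence
# without ever touching the real environment.
def imagine(h, g, f, observation, actions):
    s = h(observation)           # representation: o -> s
    rewards, values = [], []
    for a in actions:
        r, s = g(s, a)           # dynamics: (s, a) -> (r, s')
        p, v = f(s)              # prediction: s -> (policy, value)
        rewards.append(r)
        values.append(v)
    return rewards, values
```

The environment is only queried at the root; everything deeper in the search tree uses \(g_\theta\), which is why no game rules or simulator are needed.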

Comparison with AlphaZero

| Dimension         | AlphaZero                  | MuZero                      |
|-------------------|----------------------------|-----------------------------|
| Environment rules | Requires perfect simulator | Not needed                  |
| Model             | None (uses simulator)      | Learned latent model        |
| Applicability     | Perfect information games  | Broader (including Atari)   |
| MCTS              | Search on real states      | Search in latent space      |

Historical Significance

  • Major milestone for model-based RL
  • Proved learned world models can replace perfect simulators
  • Unified model-based and model-free approaches

8. RLHF and ChatGPT (2022)

Achievement

OpenAI's ChatGPT used RLHF (Reinforcement Learning from Human Feedback) to align large language model outputs with human preferences, sparking an AI revolution.

Core Algorithm

RLHF three stages:

  1. SFT: Supervised fine-tuning of the base model
  2. Reward modeling: Train reward model \(R_\phi(x,y)\)
  3. PPO optimization:
\[\max_\theta \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ R_\phi(x,y) \right] - \beta D_{KL}(\pi_\theta \| \pi_{ref})\]
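In practice the KL term is often estimated per sample from the tokens actually generated. A minimal sketch of the penalized reward for one completion (the function and argument names are illustrative, not any library's API):

```python
# KL-penalized reward from the PPO objective above: the reward model
# score minus beta times a Monte Carlo estimate of KL(pi_theta || pi_ref)
# computed from the sampled tokens.
def rlhf_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """policy_logprobs / ref_logprobs: per-token log pi_theta(y_t|x, y_<t)
    and log pi_ref(y_t|x, y_<t) for one sampled completion y."""
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl
```

The KL penalty keeps the tuned policy close to the reference (SFT) model, preventing it from drifting into degenerate text that games the reward model.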

Key Papers

  • InstructGPT (Ouyang et al., 2022): RLHF methodology
  • Constitutional AI (Anthropic, 2022): Principle-based alignment
  • DPO (Rafailov et al., 2023): Alternative without explicit reward model

Historical Significance

  • Most widespread practical application of RL
  • From academic research to products with hundreds of millions of users
  • Opened a new era of AI alignment research
  • Proved RL can effectively control generative model behavior

9. RT-2: Robotic Transformer (2023)

Achievement

Google DeepMind's RT-2 combined vision-language models (VLMs) with robot control, enabling end-to-end learning from natural language instructions to robot actions.

Core Algorithm

  • Vision-Language-Action model (VLA): Representing robot actions as text tokens
  • Large-scale pre-training: Leveraging internet-scale vision-language data
  • Policy fine-tuning: Fine-tuning on robot manipulation data

Key Innovation

Input: Visual observation + Language instruction
  → VLM encoder (PaLM-E / PaLI-X)
  → Action token decoding
Output: Robot end-effector actions
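The core trick is treating actions as text: each continuous action dimension is discretized into a fixed number of bins so it can be emitted as tokens by the language model. A minimal sketch of that idea (the bin count and ranges here are illustrative assumptions, not RT-2's exact scheme):

```python
# Represent a continuous robot action as discrete token ids by uniform
# binning, so a language model can output actions as "text".
def action_to_tokens(action, low, high, n_bins=256):
    """Map each action dimension, clipped to [low, high], to a bin id."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        frac = (min(max(a, lo), hi) - lo) / (hi - lo)  # clip, normalize to [0,1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens
```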

Historical Significance

  • Fusion of foundation models with robot RL
  • Demonstrated generalization capabilities enabled by language understanding
  • Important progress in embodied AI

10. o1: Reasoning Enhancement (2024)

Achievement

OpenAI's o1 model used reinforcement learning to train the model's chain-of-thought reasoning, achieving major breakthroughs in mathematics, programming, and scientific reasoning tasks.

Core Approach

  • Process Reward Model (PRM): Rewarding intermediate reasoning steps rather than only final answers
  • Test-time Compute Scaling: Models can "think longer" during inference
  • RL-trained reasoning: Using reinforcement learning to optimize chain-of-thought quality

Key Insight

\[\text{Traditional Scaling: } \text{Performance} \propto \text{Training Compute}\]
\[\text{o1 Scaling: } \text{Performance} \propto \text{Training Compute} \times \text{Inference Compute}\]
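o1's training and inference procedures are not public, but the simplest illustration of spending inference compute for quality is best-of-n sampling: generate several candidate solutions and keep the highest-scoring one. Here `generate` and `score` are hypothetical stand-ins for an LLM sampler and a process/outcome reward model:

```python
# Best-of-n sampling: the simplest form of test-time compute scaling.
# Increasing n spends more inference compute and (given a good scorer)
# yields better answers, with no change to training.
def best_of_n(generate, score, prompt, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```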

Subsequent Developments

  • DeepSeek-R1: Open-source reasoning model trained with GRPO
  • QwQ, Gemini Thinking: Reasoning-enhanced models from various organizations
  • Inference-time search: Exploration of MCTS + LLM integration

Historical Significance

  • Pioneered the test-time compute scaling paradigm
  • RL expanded from games/robotics to cognitive reasoning
  • Connected classical search/planning with modern LLMs

Milestone Summary

| Year | Milestone   | Core Algorithm          | Key Significance                           |
|------|-------------|-------------------------|--------------------------------------------|
| 1992 | TD-Gammon   | TD(λ) + NN              | RL + NN feasibility proof                  |
| 2013 | DQN         | DQN + Experience Replay | Inaugurated deep RL era                    |
| 2016 | AlphaGo     | MCTS + Policy/Value Net | AI surpassed humans in Go                  |
| 2017 | AlphaZero   | Self-play + MCTS        | General board game AI without human data   |
| 2019 | OpenAI Five | Large-scale PPO         | Complex real-time strategy game            |
| 2019 | AlphaStar   | League Training         | Imperfect information strategy game        |
| 2020 | MuZero      | Learned world model     | No environment rules needed                |
| 2022 | ChatGPT     | RLHF (PPO)              | Most widespread RL application             |
| 2023 | RT-2        | VLA model               | Foundation models + robotics               |
| 2024 | o1          | RL-trained reasoning    | Test-time compute scaling                  |

Through these milestones, several clear trends emerge:

  1. Simple to complex environments: Board games → Video games → Real-time strategy → Open worlds
  2. Specialized to general: Single task → Multi-task → General capabilities
  3. Virtual to real: Simulated environments → Real robots
  4. Games to cognition: Playing games → Language alignment → Reasoning enhancement
  5. Scale effects: Greater computation consistently yields performance improvements

References

  • Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon
  • Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature
  • Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature
  • Silver, D. et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science
  • Berner, C. et al. (2019). Dota 2 with Large Scale Deep Reinforcement Learning
  • Vinyals, O. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature
  • Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature
  • Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback
  • Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models
