Reinforcement Learning Milestones
Overview
Reinforcement learning has undergone decades of development from theoretical foundations to industrial deployment. This article traces the key milestones in RL history, charting the technological evolution from TD-Gammon to o1.
Timeline Overview
```mermaid
timeline
    title RL Milestones (1992-2024)
    1992 : TD-Gammon
         : Backgammon
    2013 : DQN
         : Atari Games
    2016 : AlphaGo
         : Go
    2017 : AlphaZero
         : General Board Games
    2019 : OpenAI Five
         : Dota 2
    2019 : AlphaStar
         : StarCraft II
    2020 : MuZero
         : No Rules Needed
    2022 : ChatGPT
         : RLHF
    2023 : RT-2
         : Robotics
    2024 : o1
         : Reasoning
```
1. TD-Gammon (1992)
Achievement
Gerald Tesauro's TD-Gammon, developed at IBM, was the first RL system to reach expert human-level play in backgammon, trained almost entirely through self-play.
Core Algorithm
- TD(\(\lambda\)) temporal difference learning
- Neural network as value function approximator (3-layer feedforward, ~160 hidden units)
- Self-play for training data generation (~1.5 million games)
Key Formula
TD(\(\lambda\)) updates the value function after every step, crediting recently visited states through exponentially decaying eligibility traces:
\[
\Delta \theta_t = \alpha\, \delta_t \sum_{k=1}^{t} (\gamma\lambda)^{t-k}\, \nabla_\theta V(s_k)
\]
where \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.
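A tabular sketch of this backward-view update using eligibility traces (TD-Gammon applied the same rule to neural-network weights; all names here are illustrative):

```python
import numpy as np

def td_lambda_episode(V, states, rewards, alpha=0.1, gamma=1.0, lam=0.7):
    """Backward-view TD(lambda) over one episode.

    V: np.ndarray of state values (a tabular stand-in for TD-Gammon's net).
    states: indices s_0..s_T (one longer than rewards); rewards: r_1..r_T.
    """
    e = np.zeros_like(V)                               # eligibility trace per state
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        delta = rewards[t] + gamma * V[s_next] - V[s]  # TD error
        e *= gamma * lam                               # decay all traces
        e[s] += 1.0                                    # bump trace for current state
        V += alpha * delta * e                         # credit recently visited states
    return V
```

In the neural-network case the one-hot trace bump is replaced by the gradient \(\nabla_\theta V(s_t)\), which is exactly the sum inside the formula above.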
Historical Significance
- First demonstration that RL + neural networks is viable in complex games
- Inspired the direction of subsequent deep RL research
- Pioneer of the self-play training paradigm
2. DQN: Deep Q-Network (2013/2015)
Achievement
DeepMind's DQN, introduced in a 2013 workshop paper and published in Nature in 2015, reached performance comparable to or surpassing a professional human tester across 49 Atari 2600 games, using a single algorithm and network architecture.
Core Algorithm
- Deep Q-Network: Convolutional neural network approximating \(Q(s,a;\theta)\)
- Experience Replay: Breaking sample correlations
- Target Network: Stabilizing training
Key Innovation
The network minimizes the temporal-difference loss against a periodically frozen copy of itself:
\[
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\right)^{2}\right]
\]
where \(\theta^-\) are the target network parameters, periodically copied from \(\theta\), and \(\mathcal{D}\) is the experience replay buffer.
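Both mechanisms appear directly in the loss computation; a minimal PyTorch-style sketch (tensor shapes and names are assumptions, not DeepMind's code):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN update on a replay batch; a: long tensor of action indices."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():                                   # no grad into the target
        max_q_next = target_net(s_next).max(dim=1).values   # max_a' Q(s', a'; theta^-)
        target = r + gamma * (1.0 - done) * max_q_next      # done masks terminal s'
    return F.smooth_l1_loss(q_sa, target)                   # Huber ~= error clipping

# Every C steps: target_net.load_state_dict(q_net.state_dict())
```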
Historical Significance
- Inaugurated the deep reinforcement learning era
- Proved end-to-end pixel-to-action learning is feasible
- Sparked widespread academic and industrial interest in deep RL
- Subsequent extensions: Double DQN, Dueling DQN, Prioritized ER, Rainbow
3. AlphaGo (2016)
Achievement
DeepMind's AlphaGo defeated world champion Lee Sedol 4:1, a historic breakthrough in the game of Go. With roughly \(10^{170}\) legal positions, Go's state space is far too large for the brute-force search that powered earlier board game AI systems.
Core Algorithm
- Policy network \(p_\sigma(a|s)\): Supervised learning from expert human games
- Value network \(v_\theta(s)\): Evaluating board position win rates
- Monte Carlo Tree Search (MCTS): Search combining policy and value networks
- Self-play RL: Policy gradient for further improvement
System Architecture
```
AlphaGo System:
├── SL Policy Network (trained on human games)
├── RL Policy Network (self-play reinforcement)
├── Value Network (position evaluation)
└── MCTS (search and decision-making)
```
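Inside MCTS, moves are selected by a PUCT rule that balances the value network's evaluation against the policy network's prior; a small sketch over an illustrative edge-statistics structure:

```python
import math

def select_action(edges, c_puct=1.0):
    """PUCT: argmax_a Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).

    edges: dict action -> {"N": visit count, "Q": mean value, "P": policy prior}
    (an illustrative stand-in for AlphaGo's tree-edge statistics).
    """
    total_n = sum(e["N"] for e in edges.values())
    return max(edges, key=lambda a: edges[a]["Q"]
               + c_puct * edges[a]["P"] * math.sqrt(total_n) / (1 + edges[a]["N"]))

# e.g. select_action({"d4": {"N": 10, "Q": 0.52, "P": 0.3},
#                     "q16": {"N": 3, "Q": 0.48, "P": 0.5}})
```

Rarely visited moves with a high prior get a large exploration bonus, so the policy network steers the search while the value estimates gradually take over.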
Historical Significance
- AI surpassed humans in the most complex board game
- Demonstrated the power of deep learning + RL + search
- Triggered global reassessment of AI capabilities
4. AlphaZero (2017)
Achievement
AlphaZero used no human game data whatsoever, learning purely through self-play from random initialization, given only the game rules, to surpass the strongest specialized AI systems in Go, chess, and shogi.
Core Improvements
- Eliminated human knowledge: No supervised learning phase, pure RL
- Unified architecture: Same algorithm solves three different games
- Simplified MCTS: Single neural network replaces rollouts
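The training objective behind these improvements fits in a few lines: the network is fit jointly to the MCTS visit distributions and the game outcomes. A minimal PyTorch-style sketch (the net interface is an assumption, not DeepMind's code):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(net, states, target_pi, target_z):
    """Joint AlphaZero loss on self-play data: (z - v)^2 - pi^T log p.

    target_pi: MCTS visit-count distributions over moves [B, A];
    target_z: final game outcomes in [-1, 1] from the mover's view [B].
    (Illustrative interface: net(states) -> (move log-probs, value).)
    """
    log_p, v = net(states)
    value_loss = F.mse_loss(v.squeeze(-1), target_z)       # (z - v)^2
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()   # cross-entropy vs. pi
    return value_loss + policy_loss                        # L2 via optimizer weight decay
```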
Key Results
| Game | Opponent | Result | Training Time |
|---|---|---|---|
| Go | AlphaGo Zero | 60:40 | 34 hours |
| Chess | Stockfish | 64:36 (28 wins, 72 draws, 0 losses) | 9 hours |
| Shogi | Elmo | 91:9 (90 wins, 2 draws, 8 losses) | 12 hours |
Historical Significance
- Proved pure self-play can surpass human knowledge
- Success of the "tabula rasa" learning paradigm
- Important validation of algorithmic generality
5. OpenAI Five (2019)
Achievement
OpenAI Five defeated the world champion team OG in full 5v5 Dota 2 matches. Dota 2's complexity far exceeds board games: real-time decisions, imperfect information, long time horizons, and team coordination.
Core Algorithm
- Large-scale PPO: ~800 petaflop/s-days of total training compute, with the system playing roughly 180 years' worth of games against itself every day
- Self-play: Opponent pool + historical versions
- Long time horizons: ~45 minutes per game, ~20,000 decision steps
- Distributed training: Thousands of GPUs in parallel
Technical Details
- Observation space: ~20,000 dimensional vector (not pixels)
- Action space: ~170,000 possible actions
- LSTM as policy network for temporal information
- Carefully crafted reward shaping
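One documented piece of this reward shaping is the "team spirit" coefficient \(\tau\), which anneals each hero from selfish to fully shared rewards over training; a minimal sketch (the individual shaping terms themselves are omitted):

```python
import numpy as np

def team_spirit_rewards(individual_rewards, tau):
    """Blend each hero's shaped reward with the team mean.

    tau = 0 -> each hero optimizes its own reward;
    tau = 1 -> fully shared team reward (annealed upward during training).
    """
    r = np.asarray(individual_rewards, dtype=float)
    return (1.0 - tau) * r + tau * r.mean()

# e.g. team_spirit_rewards([1.0, 0.2, -0.5, 0.0, 0.3], tau=0.8)
```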
Historical Significance
- First time RL reached top level in complex real-time strategy games
- Demonstrated the power of massive computation in RL
- Breakthrough in multi-agent cooperation
6. AlphaStar (2019)
Achievement
DeepMind's AlphaStar reached Grandmaster level (top 0.2% of players) in StarCraft II, using the full game interface with no simplifications.
Core Algorithm
- Multi-agent League Training: Maintaining a league of diverse strategies
- Imitation learning + RL: First learn from human replays, then improve through self-play
- Transformer architecture: Handling multi-entity attention in-game
- Autoregressive policy: Processing structured action spaces
League Training Architecture
```
League Training:
├── Main Agents (primary training)
├── Main Exploiters (counter-strategies against main agents)
└── League Exploiters (counter-strategies against entire league)
```
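Within the league, training matches are not sampled uniformly: prioritized fictitious self-play (PFSP) weights opponents by how often they beat the learner. A small sketch of that matchmaking idea (names and the weighting exponent are illustrative):

```python
import random

def pfsp_sample(opponents, win_rates, p=2.0):
    """Prioritized fictitious self-play: prefer opponents we often lose to.

    win_rates[i]: learner's estimated win rate against opponents[i];
    weight (1 - w)^p focuses games on the hardest league members.
    """
    weights = [max((1.0 - w) ** p, 1e-6) for w in win_rates]  # floor avoids all-zero
    return random.choices(opponents, weights=weights, k=1)[0]

# e.g. pfsp_sample(["main_v1", "exploiter_3"], [0.9, 0.4]) usually picks exploiter_3
```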
Historical Significance
- Breakthrough in imperfect-information real-time strategy games
- League Training became a classic paradigm for multi-agent training
- Demonstrated RL's ability to handle extremely complex decision spaces
7. MuZero (2020)
Achievement
MuZero achieved superhuman performance in Go, chess, shogi, and Atari without knowing the game rules, by learning an environment model.
Core Algorithm
MuZero learns three functions:
- Representation function \(h_\theta\): Maps observations to hidden states \(s = h_\theta(o)\)
- Dynamics function \(g_\theta\): Predicts next hidden state and reward \((r, s') = g_\theta(s, a)\)
- Prediction function \(f_\theta\): Predicts policy and value on hidden states \((p, v) = f_\theta(s)\)
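Planning then happens entirely in latent space: encode the observation once, and unroll the learned dynamics for each candidate action sequence. A minimal sketch (the function interfaces mirror the list above; everything else is illustrative):

```python
def muzero_unroll(h, g, f, observation, actions):
    """Unroll MuZero's learned model along a hypothetical action sequence.

    h: representation o -> s; g: dynamics (s, a) -> (r, s');
    f: prediction s -> (p, v).
    """
    s = h(observation)        # real observation -> latent state
    rewards = []
    for a in actions:         # plan entirely in latent space
        r, s = g(s, a)        # no game rules or simulator needed
        rewards.append(r)
    p, v = f(s)               # evaluate where the plan ends up
    return rewards, p, v
```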
Comparison with AlphaZero
| Dimension | AlphaZero | MuZero |
|---|---|---|
| Environment rules | Requires perfect simulator | Not needed |
| Model | None (uses simulator) | Learned latent model |
| Applicability | Perfect information games | Broader (including Atari) |
| MCTS | Search on real states | Search in latent space |
Historical Significance
- Major milestone for model-based RL
- Proved learned world models can replace perfect simulators
- Unified model-based and model-free approaches
8. RLHF and ChatGPT (2022)
Achievement
OpenAI's ChatGPT used RLHF (Reinforcement Learning from Human Feedback) to align large language model outputs with human preferences, sparking an AI revolution.
Core Algorithm
RLHF three stages:
- SFT: Supervised fine-tuning of the base model
- Reward modeling: Train reward model \(R_\phi(x,y)\)
- PPO optimization: Maximize the KL-regularized reward
\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[R_\phi(x,y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]
\]
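In code, the scalar this objective assigns to each sampled response is just the reward-model score minus a KL penalty; a minimal sketch with illustrative names:

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """KL-penalized reward for a sampled response y given prompt x.

    rm_score: reward model output R_phi(x, y).
    logp_*: summed token log-probs of y under the policy / frozen reference.
    beta: KL coefficient trading preference reward against drift.
    """
    kl_estimate = logp_policy - logp_ref   # per-sample estimate of log(pi/pi_ref)
    return rm_score - beta * kl_estimate

# e.g. rlhf_reward(rm_score=1.2, logp_policy=-45.0, logp_ref=-42.0) -> 1.5
```

The KL term is what keeps the fine-tuned policy from drifting into degenerate text that games the reward model.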
Key Papers
- InstructGPT (Ouyang et al., 2022): RLHF methodology
- Constitutional AI (Anthropic, 2022): Principle-based alignment
- DPO (Rafailov et al., 2023): Alternative without explicit reward model
Historical Significance
- Most widespread practical application of RL
- From academic research to products with hundreds of millions of users
- Opened a new era of AI alignment research
- Proved RL can effectively control generative model behavior
9. RT-2: Robotic Transformer (2023)
Achievement
Google DeepMind's RT-2 combined vision-language models (VLMs) with robot control, enabling end-to-end learning from natural language instructions to robot actions.
Core Algorithm
- Vision-Language-Action model (VLA): Representing robot actions as text tokens
- Large-scale pre-training: Leveraging internet-scale vision-language data
- Policy fine-tuning: Fine-tuning on robot manipulation data
Key Innovation
```
Input: Visual observation + Language instruction
  → VLM encoder (PaLM-E / PaLI-X)
  → Action token decoding
Output: Robot end-effector actions
```
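The core trick is making actions look like text: each continuous action dimension is discretized into a fixed number of bins whose ids the VLM emits as tokens. A hedged sketch of that tokenization (bin count and ranges are illustrative, not RT-2's exact scheme):

```python
import numpy as np

def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Discretize a continuous robot action into integer token ids.

    action: e.g. a 7-D end-effector command (position, rotation, gripper).
    Each dimension maps to one of n_bins ids, emitted as text by the VLM.
    """
    a = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((a - low) / (high - low) * (n_bins - 1)).astype(int)
    return bins.tolist()

# e.g. action_to_tokens([0.1, -0.3, 0.8, 0.0, 0.0, 0.2, 1.0])
```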
Historical Significance
- Fusion of foundation models with robot RL
- Demonstrated generalization capabilities enabled by language understanding
- Important progress in embodied AI
10. o1: Reasoning Enhancement (2024)
Achievement
OpenAI's o1 model used large-scale reinforcement learning to train its chain of thought, achieving major breakthroughs on mathematics, programming, and scientific reasoning tasks.
Core Approach
- Process-level reward signals: Rewarding intermediate reasoning steps rather than only final answers (widely assumed, e.g. via process reward models; OpenAI has not disclosed o1's exact recipe)
- Test-time Compute Scaling: Models can "think longer" during inference
- RL-trained reasoning: Using reinforcement learning to optimize chain-of-thought quality
Key Insight
o1's accuracy improves smoothly with both more RL training compute and more test-time "thinking" compute, establishing inference-time computation as a scaling axis alongside model and data size.
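o1's training recipe is undisclosed, but the scaling idea can be illustrated in its simplest form, best-of-N sampling against a scorer: more samples means more test-time compute (everything below is illustrative, not OpenAI's method):

```python
def best_of_n(generate, score, prompt, n=8):
    """Best-of-N test-time scaling: spend more compute, keep the best trace.

    generate(prompt) -> one candidate chain-of-thought + answer (stochastic).
    score(candidate) -> scalar from a verifier or reward model.
    Increasing n trades inference compute for answer quality.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```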
Subsequent Developments
- DeepSeek-R1: Open-source reasoning model trained with GRPO
- QwQ, Gemini Thinking: Reasoning-enhanced models from various organizations
- Inference-time search: Exploration of MCTS + LLM integration
Historical Significance
- Pioneered the test-time compute scaling paradigm
- RL expanded from games/robotics to cognitive reasoning
- Connected classical search/planning with modern LLMs
Milestone Summary
| Year | Milestone | Core Algorithm | Key Significance |
|---|---|---|---|
| 1992 | TD-Gammon | TD(λ) + NN | RL+NN feasibility proof |
| 2013 | DQN | DQN + Experience Replay | Inaugurated deep RL era |
| 2016 | AlphaGo | MCTS + Policy/Value Net | AI surpassed humans in Go |
| 2017 | AlphaZero | Self-play + MCTS | General board game AI without human knowledge |
| 2019 | OpenAI Five | Large-scale PPO | Complex real-time strategy game |
| 2019 | AlphaStar | League Training | Imperfect information strategy game |
| 2020 | MuZero | Learned world model | No environment rules needed |
| 2022 | ChatGPT | RLHF (PPO) | Most widespread RL application |
| 2023 | RT-2 | VLA model | Foundation models + robotics |
| 2024 | o1 | RL-trained reasoning | Test-time compute scaling |
Development Trends
Through these milestones, several clear trends emerge:
- Simple to complex environments: Board games → Video games → Real-time strategy → Open worlds
- Specialized to general: Single task → Multi-task → General capabilities
- Virtual to real: Simulated environments → Real robots
- Games to cognition: Playing games → Language alignment → Reasoning enhancement
- Scale effects: Greater computation consistently yields performance improvements
References
- Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature
- Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature
- Silver, D. et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science
- Berner, C. et al. (2019). Dota 2 with Large Scale Deep Reinforcement Learning
- Vinyals, O. et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature
- Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback
- Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Further Reading
- RL Landscape — Methodological overview
- Deep RL Introduction — DQN in detail
- PPO Algorithm — PPO in detail
- RL in LLM Post-Training — RLHF and DPO