# Reinforcement Learning Landscape

## Overview

Reinforcement Learning (RL) is one of the three major paradigms of machine learning, alongside supervised and unsupervised learning. It studies how agents learn optimal behavioral policies through trial-and-error interaction with an environment. From Bellman's introduction of dynamic programming in the 1950s to RLHF driving large language model alignment in the 2020s, RL has grown into a broad and deep research field.

This article provides a panoramic map of reinforcement learning, helping readers navigate its methodological landscape, algorithm taxonomy, and frontier directions.
## 1. Markov Decision Process (MDP)

### 1.1 Basic Framework

The mathematical foundation of reinforcement learning is the Markov Decision Process (MDP), defined as a five-tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):

- \(\mathcal{S}\): state space
- \(\mathcal{A}\): action space
- \(P(s'|s,a)\): state transition probability
- \(R(s,a,s')\): reward function
- \(\gamma \in [0,1)\): discount factor
### 1.2 Core Objective

The agent's goal is to find an optimal policy \(\pi^*\) that maximizes the expected cumulative discounted return:

\[
\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1})\right]
\]
### 1.3 Bellman Equations

State value function Bellman equation:

\[
V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^{\pi}(s')\right]
\]

Action value function Bellman equation:

\[
Q^{\pi}(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a')\right]
\]

Bellman optimality equation:

\[
V^{*}(s) = \max_{a} \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^{*}(s')\right]
\]
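To make the optimality equation concrete, here is a minimal value-iteration sketch on a toy tabular MDP. The transition tensor `P` and reward tensor `R` are randomly generated placeholders, not from any particular environment:

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[s, a, s'] and R[s, a, s'] are hypothetical.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # rows sum to 1
R = rng.normal(size=(n_states, n_actions, n_states))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = sum_s' P[s,a,s'] * (R + gamma * V(s'))
    Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the backup is a fixed point
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy with respect to the converged values
print("V*:", V, "pi*:", pi)
```

Value iteration applies the optimality backup until \(V\) stops changing; the greedy policy extracted from the fixed point is optimal for this toy MDP.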
### 1.4 MDP Extensions
| Extension | Characteristic | Application |
|---|---|---|
| POMDP | Partial observability | Robot navigation, dialogue systems |
| Dec-POMDP | Decentralized partial observability | Multi-agent cooperation |
| CMDP | Constrained MDP | Safe reinforcement learning |
| Semi-MDP | Temporal abstraction | Hierarchical reinforcement learning |
## 2. RL Taxonomy

### 2.1 Overall Taxonomy Tree
```mermaid
graph TD
    RL[Reinforcement Learning] --> MF[Model-Free]
    RL --> MB[Model-Based]
    MF --> VB[Value-Based]
    MF --> PG[Policy-Based]
    MF --> AC[Actor-Critic]
    VB --> DQN[DQN Family]
    VB --> TD[TD Learning]
    PG --> REINFORCE[REINFORCE]
    PG --> TRPO[TRPO]
    PG --> PPO[PPO]
    AC --> A2C[A2C/A3C]
    AC --> SAC[SAC]
    AC --> DDPG[DDPG/TD3]
    MB --> Dyna[Dyna Architecture]
    MB --> MBPO[MBPO]
    MB --> MuZero[MuZero]
    MB --> WorldModel[World Models]
    RL --> Offline[Offline RL]
    RL --> MARL[Multi-Agent RL]
    RL --> HRL[Hierarchical RL]
    style RL fill:#e1f5fe
    style MF fill:#fff3e0
    style MB fill:#e8f5e9
```
### 2.2 Model-Free vs Model-Based

**Model-Free Methods**

Learn directly from interaction experience, without requiring a model of the environment's dynamics:

- Value-Based: learn value functions and derive the policy indirectly (Q-Learning, SARSA, DQN, Double DQN, Dueling DQN, Rainbow)
- Policy-Based: directly parameterize and optimize the policy (REINFORCE, TRPO, PPO)
- Actor-Critic: learn a policy (Actor) and a value function (Critic) simultaneously (A2C, A3C, SAC, DDPG, TD3)
**Model-Based Methods**

Learn or leverage a model of the environment for planning:

- Learned dynamics models: Dyna, MBPO, Dreamer
- Planning and search: AlphaGo, MuZero
- World models: World Models, IRIS

**Selection Guide**

- Sample efficiency is the priority → model-based methods
- Simplicity of implementation and asymptotic performance are the priority → model-free methods
- Combining both → the Dyna architecture
### 2.3 On-Policy vs Off-Policy

| Dimension | On-Policy | Off-Policy |
|---|---|---|
| Definition | Behavior policy = target policy | Behavior policy ≠ target policy |
| Data utilization | Low (samples discarded after use) | High (samples reusable) |
| Stability | Generally better | Requires additional stabilization techniques |
| Representative algorithms | SARSA, PPO, A2C | Q-Learning, DQN, SAC |
| Experience replay | Typically not used | Commonly used |
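The distinction is easiest to see in the one-step TD targets. Below is a minimal sketch contrasting the SARSA (on-policy) and Q-Learning (off-policy) tabular updates; the Q-table and transition values are hypothetical:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses a_next, the action the behavior policy actually took."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action, regardless of what was taken."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((5, 2))  # hypothetical 5-state, 2-action table
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```

Because Q-Learning's target does not depend on the behavior policy, its transitions can be replayed from a buffer; SARSA's target is only valid for the policy that generated the data.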
### 2.4 Offline Reinforcement Learning

Learn entirely from a static dataset, without any environment interaction:

- Core challenges: distribution shift, extrapolation error
- Representative methods: BCQ, CQL (sketched below), IQL, Decision Transformer
- Applications: medical decision-making, autonomous driving, recommendation systems
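As one concrete example of how conservatism counters extrapolation error, here is a minimal sketch of the CQL regularizer for a discrete action space; the Q-value array is a hypothetical placeholder:

```python
import numpy as np

def cql_regularizer(q_values, dataset_action):
    """CQL penalty for one state: a log-sum-exp over all actions minus the
    Q-value of the logged action. Added to the TD loss, it pushes Q down on
    out-of-distribution actions and up on actions seen in the dataset."""
    logsumexp = np.log(np.sum(np.exp(q_values)))  # soft maximum over actions
    return logsumexp - q_values[dataset_action]

q = np.array([1.2, 0.3, -0.5])                # hypothetical Q(s, .) estimates
penalty = cql_regularizer(q, dataset_action=0)
# total_loss = td_loss + alpha * penalty      # alpha trades off conservatism
```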
### 2.5 Single-Agent vs Multi-Agent
| Dimension | Single-Agent | Multi-Agent (MARL) |
|---|---|---|
| Environment | Static/stochastic | Non-stationary (other agents also learning) |
| Objective | Maximize own return | Cooperative/competitive/mixed |
| Challenges | Exploration-exploitation tradeoff | Credit assignment, communication, scalability |
| Representatives | DQN, PPO, SAC | QMIX, MAPPO, MADDPG |
## 3. Key Algorithm Map

### 3.1 By Development Timeline
| Year | Algorithm | Category | Key Contribution |
|---|---|---|---|
| 1989 | Q-Learning | Value-Based | Off-policy TD control |
| 1992 | REINFORCE | Policy Gradient | Policy gradient theorem |
| 2013 | DQN | Deep Value-Based | Deep networks + experience replay |
| 2015 | DDPG | Actor-Critic | Deterministic policy for continuous actions |
| 2015 | TRPO | Policy Gradient | Trust region optimization |
| 2016 | A3C | Actor-Critic | Asynchronous parallel training |
| 2017 | PPO | Policy Gradient | Clipped objective, simple and efficient |
| 2018 | SAC | Actor-Critic | Maximum entropy framework |
| 2018 | TD3 | Actor-Critic | Clipped double Q, delayed updates |
| 2020 | CQL | Offline RL | Conservative Q-learning |
| 2021 | Decision Transformer | Offline RL | Sequence modeling perspective |
### 3.2 By Application Scenario

```text
Discrete Action Spaces:
├── Simple tasks → Q-Learning / SARSA
├── High-dimensional observations → DQN / Rainbow
└── Multi-agent → QMIX / VDN

Continuous Action Spaces:
├── Deterministic policy → DDPG / TD3
├── Stochastic policy → SAC
├── Stable training → PPO / TRPO
└── Multi-agent → MADDPG / MAPPO

Special Scenarios:
├── Static data → CQL / IQL / Decision Transformer
├── Planning needed → MuZero / Dreamer
└── LLM alignment → RLHF (PPO) / DPO / GRPO
```
## 4. Core Components of Deep RL

### 4.1 Function Approximation

- Value network: approximates \(V(s)\) or \(Q(s,a)\) with a neural network
- Policy network: parameterizes the policy \(\pi_\theta(a|s)\) with a neural network (see the sketch after this list)
- Model network: approximates the environment dynamics \(P(s'|s,a)\) with a neural network
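A minimal PyTorch sketch of the first two approximators, assuming a flat observation vector and a discrete action space; the layer sizes are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Approximates V(s) for a flat observation vector."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)  # scalar value per state

class PolicyNetwork(nn.Module):
    """Parameterizes a categorical policy pi_theta(a|s) over discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

pi = PolicyNetwork(obs_dim=4, n_actions=2)
dist = pi(torch.zeros(1, 4))
action = dist.sample()  # dist.log_prob(action) gives log pi(a|s) for gradients
```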
### 4.2 Key Techniques for Stable Training
| Technique | Problem Solved | Used In |
|---|---|---|
| Experience Replay | Sample correlation, data efficiency | DQN, DDPG, SAC |
| Target Network | Training instability | DQN, DDPG, TD3 |
| Clipping | Excessively large policy updates | PPO |
| Trust Region | Policy update step control | TRPO |
| Entropy Regularization | Premature convergence, insufficient exploration | SAC, A3C |
| Prioritized Experience Replay (PER) | Sample utilization efficiency | Rainbow |
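Two of these techniques fit in a few lines each. Below is a minimal sketch of a uniform replay buffer and a Polyak (soft) target-network update, assuming PyTorch modules; the names and coefficients are illustrative:

```python
import random
from collections import deque
import torch

class ReplayBuffer:
    """Uniform experience replay: breaks temporal correlation between samples
    and lets each transition be reused many times."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):  # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.
    A slowly tracking target network stabilizes bootstrapped TD targets."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```

DQN's original target network was a hard copy refreshed every fixed number of steps; the soft update above is the variant used in DDPG, TD3, and SAC.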
### 4.3 Exploration Strategies

- \(\epsilon\)-greedy: simple and effective, suitable for discrete spaces (sketched below)
- Boltzmann exploration: sample actions with probability proportional to exponentiated value estimates
- UCB: upper confidence bound, optimism in the face of uncertainty
- Intrinsic motivation: curiosity-driven bonuses (ICM, RND)
- Posterior sampling: Thompson sampling
- Maximum entropy: exploration arises automatically in the SAC framework
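A minimal sketch of the first two strategies, with a hypothetical Q-value array:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly at random, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample with probability proportional to exp(Q / T): higher-value actions
    are preferred, but every action keeps some probability mass."""
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())  # shift for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

q = [0.5, 1.5, 0.0]  # hypothetical Q(s, .) estimates
print(epsilon_greedy(q), boltzmann(q))
```

Lowering `epsilon` or the temperature shifts both strategies from exploration toward exploitation.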
## 5. Connection to LLM Post-Training

### 5.1 RLHF Pipeline

Reinforcement learning fine-tuning of large language models is currently one of the most prominent applications of RL. The standard pipeline has three stages:

1. Supervised Fine-Tuning (SFT): fine-tune the pre-trained model on high-quality demonstration data
2. Reward Model Training: train a reward model \(R_\phi(x, y)\) on human preference data
3. RL Optimization: optimize the policy with PPO under a KL-divergence constraint against the reference model (a reward-shaping sketch follows below)
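A minimal sketch of the KL-shaped per-token reward commonly used in stage 3, assuming per-token log-probabilities from the current policy and a frozen reference model are available; `beta` and all tensor values are illustrative:

```python
import torch

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token reward for RLHF-style PPO: the penalty
    -beta * (log pi(a|s) - log pi_ref(a|s)) keeps the policy close to the
    reference model; the scalar reward-model score is added at the final
    token, since the reward model scores the complete response."""
    reward = -beta * (logp_policy - logp_ref)  # shape: (seq_len,)
    reward[-1] += rm_score
    return reward

logp_policy = torch.tensor([-1.2, -0.8, -2.0])  # hypothetical token log-probs
logp_ref = torch.tensor([-1.0, -0.9, -1.5])
r = kl_shaped_reward(rm_score=0.7, logp_policy=logp_policy, logp_ref=logp_ref)
```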
### 5.2 Beyond RLHF

| Method | Characteristics |
|---|---|
| DPO | No explicit reward model; optimizes the policy directly on preference pairs (sketched below) |
| GRPO | Group Relative Policy Optimization: critic-free, advantages computed relative to a group of samples; used for mathematical reasoning |
| RLAIF | AI feedback replaces human feedback |
| Constitutional AI | Principle-based self-improvement |
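The DPO objective is compact enough to sketch directly. Below, a minimal version assuming sequence-level log-probabilities of the chosen (`w`) and rejected (`l`) responses under the policy and a frozen reference model; the batch values are hypothetical:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: widen the policy's implicit reward margin between the chosen and
    rejected responses, measured as log-ratios against the reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Hypothetical sequence log-probs for a batch of two preference pairs
loss = dpo_loss(
    logp_w=torch.tensor([-12.0, -9.5]),
    logp_l=torch.tensor([-14.0, -9.0]),
    ref_logp_w=torch.tensor([-12.5, -10.0]),
    ref_logp_l=torch.tensor([-13.0, -9.5]),
)
```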
### 5.3 LLM from an RL Perspective

Viewing LLM generation through the RL lens:
- State: Generated token sequence so far
- Action: Selection of the next token
- Policy: Language model \(\pi_\theta(a_t | s_t)\)
- Reward: Human preferences / AI judgments / verifier feedback
- Environment: Task context + evaluation mechanism
## 6. Frontier Directions

### 6.1 Current Hot Topics
- LLM reasoning enhancement: Training model reasoning capabilities with RL (o1, DeepSeek-R1)
- Embodied intelligence: RL in robot manipulation and navigation (RT-2, Mobile ALOHA)
- World models: Learning predictive models of environments (Dreamer, IRIS)
- Safe RL: Constrained optimization, robust policies
- Offline-to-online: pretrain on static datasets, then fine-tune with limited online interaction
### 6.2 Open Challenges
- Sample efficiency: How to reduce required interactions?
- Generalization: How to transfer to new tasks/environments?
- Long-term credit assignment: Learning under sparse rewards
- Safety: Safety guarantees during training and deployment
- Scalability: Efficient training in large-scale environments
- Alignment: Ensuring agent behavior aligns with human intent
## 7. Suggested Learning Path

**Beginner:** MDP basics → dynamic programming → Monte Carlo methods → TD learning → Q-Learning

**Intermediate:** DQN → policy gradients → Actor-Critic → PPO → SAC

**Advanced:** model-based RL → offline RL → multi-agent RL → hierarchical RL

**Applications:** RLHF → robot RL → game AI → reasoning enhancement
## References

- Sutton, R. S., & Barto, A. G. *Reinforcement Learning: An Introduction*, 2nd ed. MIT Press, 2018.
- Levine, S. *CS285: Deep Reinforcement Learning*. UC Berkeley.
- Silver, D. *Reinforcement Learning* course. UCL.
- OpenAI. *Spinning Up in Deep RL*.
## Further Reading
- Classic RL Introduction — MDP and foundational algorithms in detail
- Deep Reinforcement Learning — Deep methods: DQN, PPO, SAC
- RL in LLM Post-Training — RLHF, DPO, and more
- Multi-Agent RL — MARL methods and applications