
Reinforcement Learning Landscape

Overview

Reinforcement Learning (RL) is one of the three major paradigms of machine learning; it studies how agents learn optimal behavioral policies through trial-and-error interaction with an environment. From Bellman's introduction of dynamic programming in the 1950s to RLHF driving large language model alignment in the 2020s, RL has grown into a broad and active research field.

This article aims to provide a panoramic map of reinforcement learning, helping readers navigate the methodological landscape, algorithm taxonomy, and frontier directions.


1. Markov Decision Process (MDP)

1.1 Basic Framework

The mathematical foundation of reinforcement learning is the Markov Decision Process (MDP), defined as a five-tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):

  • \(\mathcal{S}\): State space
  • \(\mathcal{A}\): Action space
  • \(P(s'|s,a)\): State transition probability
  • \(R(s,a,s')\): Reward function
  • \(\gamma \in [0,1)\): Discount factor
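
To make the tuple concrete, here is a minimal sketch (not a standard API) of how a small finite MDP could be represented in code; the `MDP` container and the toy two-state example are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    states: List[str]                                      # S: state space
    actions: List[str]                                     # A: action space
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    rewards: Dict[Tuple[str, str, str], float]             # R(s, a, s')
    gamma: float                                           # discount factor

# Toy two-state example: "stay" keeps the current state, "move" switches it.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={
        ("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
        ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0},
    },
    rewards={("s0", "move", "s1"): 1.0},  # unspecified (s, a, s') triples default to 0
    gamma=0.9,
)
```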

1.2 Core Objective

The agent's goal is to find an optimal policy \(\pi^*\) that maximizes the expected cumulative discounted return:

\[V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]\]
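
As a quick numeric illustration of this objective, the snippet below accumulates a discounted return over a made-up reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * r_t by accumulating backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```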

1.3 Bellman Equations

State value function Bellman equation:

\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]\]

Action value function Bellman equation:

\[Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]\]

Optimal Bellman equation:

\[V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]\]
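
Iterating the optimality equation as an update rule yields value iteration. Below is a minimal sketch run on a hypothetical two-state MDP; the transition and reward dictionaries are invented for illustration.

```python
# Value iteration: sweep V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')] until convergence.
states = ["s0", "s1"]
actions = ["stay", "move"]
P = {  # P[(s, a)] = {s': probability}
    ("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0},
}
R = {("s0", "move", "s1"): 1.0}   # all other transitions give reward 0
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        backup = max(
            sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2, p in P[(s, a)].items())
            for a in actions
        )
        delta = max(delta, abs(backup - V[s]))
        V[s] = backup
    if delta < theta:
        break

print(V)  # optimal state values for the toy MDP
```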

1.4 MDP Extensions

| Extension | Characteristic | Application |
|---|---|---|
| POMDP | Partial observability | Robot navigation, dialogue systems |
| Dec-POMDP | Decentralized partial observability | Multi-agent cooperation |
| CMDP | Constrained MDP | Safe reinforcement learning |
| Semi-MDP | Temporal abstraction | Hierarchical reinforcement learning |

2. RL Taxonomy

2.1 Overall Taxonomy Tree

```mermaid
graph TD
    RL[Reinforcement Learning] --> MF[Model-Free]
    RL --> MB[Model-Based]

    MF --> VB[Value-Based]
    MF --> PG[Policy-Based]
    MF --> AC[Actor-Critic]

    VB --> DQN[DQN Family]
    VB --> TD[TD Learning]

    PG --> REINFORCE[REINFORCE]
    PG --> TRPO[TRPO]
    PG --> PPO[PPO]

    AC --> A2C[A2C/A3C]
    AC --> SAC[SAC]
    AC --> DDPG[DDPG/TD3]

    MB --> Dyna[Dyna Architecture]
    MB --> MBPO[MBPO]
    MB --> MuZero[MuZero]
    MB --> WorldModel[World Models]

    RL --> Offline[Offline RL]
    RL --> MARL[Multi-Agent RL]
    RL --> HRL[Hierarchical RL]

    style RL fill:#e1f5fe
    style MF fill:#fff3e0
    style MB fill:#e8f5e9
```

2.2 Model-Free vs Model-Based

Model-Free Methods

Learn directly from interaction experience without requiring an environment dynamics model:

  • Value-Based: Learn value functions and derive the policy indirectly - Q-Learning, SARSA, DQN, Double DQN, Dueling DQN, Rainbow (a tabular Q-Learning loop is sketched just after this list)
  • Policy-Based: Directly parameterize and optimize the policy - REINFORCE, TRPO, PPO
  • Actor-Critic: Simultaneously learn policy (Actor) and value function (Critic) - A2C, A3C, SAC, DDPG, TD3
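
As a concrete instance of the value-based family, here is a minimal tabular Q-Learning loop; the `env` object with `reset()`/`step()` is a simplified, hypothetical Gym-style interface, not a real library API.

```python
import random
from collections import defaultdict

def q_learning(env, num_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: off-policy TD control with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: [0.0] * num_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)          # assumed to return (next_state, reward, done)
            # Q-Learning target uses the greedy next action (hence off-policy)
            target = r + gamma * max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```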

Model-Based Methods

Learn or leverage environment models for planning:

  • Learned dynamics models: Dyna, MBPO, Dreamer
  • Planning and search: AlphaGo, MuZero
  • World models: World Models, IRIS

Selection Guide

  • Sample efficiency priority → Model-based methods
  • Simple implementation, asymptotic performance priority → Model-free methods
  • Combining both → Dyna architecture

2.3 On-Policy vs Off-Policy

| Dimension | On-Policy | Off-Policy |
|---|---|---|
| Definition | Behavior policy = target policy | Behavior policy ≠ target policy |
| Data utilization | Low (discard after use) | High (reusable) |
| Stability | Better | Requires additional techniques |
| Representative algorithms | SARSA, PPO, A2C | Q-Learning, DQN, SAC |
| Experience replay | Not used | Used |

2.4 Offline Reinforcement Learning

Learn entirely from static datasets without environment interaction:

  • Core challenge: Distribution shift, extrapolation error
  • Representative methods: BCQ, CQL, IQL, Decision Transformer (a CQL-style penalty is sketched after this list)
  • Applications: Medical decision-making, autonomous driving, recommendation systems
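
To illustrate one way offline methods fight extrapolation error, the sketch below adds a CQL-style conservative penalty to an ordinary TD loss for discrete actions. PyTorch is assumed, and `q_net`, `target_q_net`, and the `batch` layout are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """TD error plus a conservative penalty that pushes down Q-values for
    actions not seen in the dataset (CQL-style, discrete actions)."""
    s, a, r, s_next, done = batch            # a: LongTensor of action indices; done: float 0/1 flags
    q_all = q_net(s)                         # Q(s, .), shape [batch, num_actions]
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                    # standard target, frozen target network
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_taken, target)
    # Conservative term: log-sum-exp over all actions minus the dataset action's value.
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative
```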

2.5 Single-Agent vs Multi-Agent

| Dimension | Single-Agent | Multi-Agent (MARL) |
|---|---|---|
| Environment | Static/stochastic | Non-stationary (other agents are also learning) |
| Objective | Maximize own return | Cooperative/competitive/mixed |
| Challenges | Exploration-exploitation tradeoff | Credit assignment, communication, scalability |
| Representatives | DQN, PPO, SAC | QMIX, MAPPO, MADDPG |

3. Key Algorithm Map

3.1 By Development Timeline

| Year | Algorithm | Category | Key Contribution |
|---|---|---|---|
| 1989 | Q-Learning | Value-Based | Off-policy TD control |
| 1992 | REINFORCE | Policy Gradient | Monte Carlo policy gradient |
| 2013 | DQN | Deep Value-Based | Deep networks + experience replay |
| 2015 | DDPG | Actor-Critic | Deterministic policy for continuous actions |
| 2015 | TRPO | Policy Gradient | Trust region optimization |
| 2016 | A3C | Actor-Critic | Asynchronous parallel training |
| 2017 | PPO | Policy Gradient | Clipped objective, simple and efficient |
| 2018 | SAC | Actor-Critic | Maximum entropy framework |
| 2018 | TD3 | Actor-Critic | Clipped double Q, delayed updates |
| 2020 | CQL | Offline RL | Conservative Q-learning |
| 2021 | Decision Transformer | Offline RL | Sequence modeling perspective |

3.2 By Application Scenario

Discrete Action Spaces:
  ├── Simple tasks → Q-Learning / SARSA
  ├── High-dimensional observations → DQN / Rainbow
  └── Multi-agent → QMIX / VDN

Continuous Action Spaces:
  ├── Deterministic policy → DDPG / TD3
  ├── Stochastic policy → SAC
  ├── Stable training → PPO / TRPO
  └── Multi-agent → MADDPG / MAPPO

Special Scenarios:
  ├── Static data → CQL / IQL / Decision Transformer
  ├── Planning needed → MuZero / Dreamer
  └── LLM alignment → RLHF (PPO) / DPO / GRPO

4. Core Components of Deep RL

4.1 Function Approximation

  • Value network: Approximate \(V(s)\) or \(Q(s,a)\) with neural networks
  • Policy network: Parameterize policy \(\pi_\theta(a|s)\) with neural networks (see the sketch after this list)
  • Model network: Approximate environment dynamics \(P(s'|s,a)\) with neural networks
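
A minimal sketch of the policy-network item, assuming PyTorch: a network that outputs a categorical distribution over discrete actions. The layer sizes and dimensions are arbitrary illustrations.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Parameterizes pi_theta(a|s) as a categorical distribution over discrete actions."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),        # outputs action logits
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

# Sampling an action and its log-probability (the quantities used by policy-gradient methods):
policy = PolicyNetwork(state_dim=4, num_actions=2)
dist = policy(torch.randn(1, 4))
action = dist.sample()
log_prob = dist.log_prob(action)
```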

4.2 Key Techniques for Stable Training

| Technique | Problem Solved | Used In |
|---|---|---|
| Experience Replay | Sample correlation, data efficiency | DQN, DDPG, SAC |
| Target Network | Training instability | DQN, DDPG, TD3 |
| Clipping | Excessively large policy updates | PPO |
| Trust Region | Policy update step control | TRPO |
| Entropy Regularization | Premature convergence, insufficient exploration | SAC, A3C |
| Prioritized Experience Replay (PER) | Sample utilization efficiency | Rainbow |
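
Two of these techniques fit in a few lines: a uniform replay buffer and a Polyak-style soft target-network update. This is a generic sketch rather than the exact form used in any particular paper; `soft_update` assumes PyTorch-style modules.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample decorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: the target network slowly tracks the online network,
    which stabilizes the TD targets (expects PyTorch nn.Module parameters)."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * o_param.data)
```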

4.3 Exploration Strategies

  • \(\epsilon\)-greedy: Simple and effective, suitable for discrete action spaces
  • Boltzmann (softmax) exploration: Sample actions with probability proportional to exponentiated action values (both are sketched after this list)
  • UCB: Upper confidence bound, optimism in the face of uncertainty
  • Intrinsic motivation: Curiosity-driven (ICM, RND)
  • Posterior sampling: Thompson Sampling
  • Maximum entropy: Automatic exploration in the SAC framework
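
The first two strategies are simple enough to sketch directly; the action values and temperature below are made up for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())     # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

q = [0.2, 1.5, -0.3]
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```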

5. Connection to LLM Post-Training

5.1 RLHF Pipeline

Reinforcement learning fine-tuning of large language models is one of the most important current applications of RL:

  1. Supervised Fine-Tuning (SFT): Fine-tune pre-trained model with high-quality data
  2. Reward Model Training: Train reward model \(R_\phi(x, y)\) from human preference data
  3. RL Optimization: Optimize the policy with PPO under a KL-divergence penalty toward a reference model (a common per-token form is sketched below the objective):
\[\max_\theta \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ R_\phi(x,y) - \beta D_{KL}(\pi_\theta \| \pi_{ref}) \right]\]
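
In practice the KL term is often applied per token during PPO rollouts. The sketch below computes such a shaped reward from per-token log-probabilities; PyTorch is assumed, and the function name, tensor shapes, and β value are illustrative assumptions rather than a specific library's API.

```python
import torch

def rlhf_shaped_rewards(reward_model_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token KL penalty with the reward model score added at the final token.

    reward_model_score: scalar R_phi(x, y) for the whole response
    policy_logprobs, ref_logprobs: [T] log pi_theta(y_t | ...) and log pi_ref(y_t | ...)
    """
    kl_per_token = policy_logprobs - ref_logprobs      # estimate of log(pi_theta / pi_ref)
    rewards = -beta * kl_per_token                     # penalize drifting from the reference model
    rewards[-1] = rewards[-1] + reward_model_score     # sequence-level reward at the last token
    return rewards

r = rlhf_shaped_rewards(torch.tensor(0.8), torch.randn(5), torch.randn(5))
```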

5.2 Beyond RLHF

| Method | Characteristics |
|---|---|
| DPO | No explicit reward model needed; optimize directly from preferences |
| GRPO | Group relative policy optimization; used for mathematical reasoning |
| RLAIF | AI feedback replaces human feedback |
| Constitutional AI | Principle-based self-improvement |
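
To illustrate the first row, here is the standard DPO loss for a batch of preference pairs, assuming PyTorch; the inputs are sequence-level summed log-probabilities and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: increase the margin by which the policy prefers the chosen response
    over the rejected one, measured relative to the reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```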

5.3 LLM from an RL Perspective

Viewing LLMs through the RL lens:

  • State: Generated token sequence so far
  • Action: Selection of the next token
  • Policy: Language model \(\pi_\theta(a_t | s_t)\)
  • Reward: Human preferences / AI judgments / verifier feedback
  • Environment: Task context + evaluation mechanism

6. Frontier Directions

6.1 Current Hot Topics

  • LLM reasoning enhancement: Training model reasoning capabilities with RL (o1, DeepSeek-R1)
  • Embodied intelligence: RL in robot manipulation and navigation (RT-2, Mobile ALOHA)
  • World models: Learning predictive models of environments (Dreamer, IRIS)
  • Safe RL: Constrained optimization, robust policies
  • Offline-to-online: Fine-tuning policies pretrained on offline data with subsequent online interaction

6.2 Open Challenges

  1. Sample efficiency: How to reduce required interactions?
  2. Generalization: How to transfer to new tasks/environments?
  3. Long-term credit assignment: Learning under sparse rewards
  4. Safety: Safety guarantees during training and deployment
  5. Scalability: Efficient training in large-scale environments
  6. Alignment: Ensuring agent behavior aligns with human intent

7. Suggested Learning Path

Beginner:
  MDP Basics → Dynamic Programming → MC Methods → TD Learning → Q-Learning

Intermediate:
  DQN → Policy Gradient → Actor-Critic → PPO → SAC

Advanced:
  Model-Based RL → Offline RL → Multi-Agent RL → Hierarchical RL

Applications:
  RLHF → Robot RL → Game AI → Reasoning Enhancement

References

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018)
  • Sergey Levine, UC Berkeley CS285: Deep Reinforcement Learning
  • David Silver, UCL RL Course
  • OpenAI Spinning Up in Deep RL


