
Reinforcement Learning Landscape

Overview

Reinforcement Learning (RL) is one of the three major paradigms of machine learning; it studies how agents learn optimal behavioral policies through trial-and-error interaction with an environment. From Bellman's introduction of dynamic programming in the 1950s to RLHF driving large language model alignment in the 2020s, RL has grown into a broad and active research field.

This article aims to provide a panoramic map of reinforcement learning, helping readers navigate the methodological landscape, algorithm taxonomy, and frontier directions.


1. Markov Decision Process (MDP)

1.1 Basic Framework

The mathematical foundation of reinforcement learning is the Markov Decision Process (MDP), defined as a five-tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):

  • \(\mathcal{S}\): State space
  • \(\mathcal{A}\): Action space
  • \(P(s'|s,a)\): State transition probability
  • \(R(s,a,s')\): Reward function
  • \(\gamma \in [0,1)\): Discount factor
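
To make the tuple concrete, here is a minimal sketch (not a standard API) of how a small finite MDP could be represented in code; the `MDP` container and the toy two-state example are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    states: List[str]                                      # S: state space
    actions: List[str]                                     # A: action space
    transitions: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    rewards: Dict[Tuple[str, str, str], float]             # R(s, a, s')
    gamma: float                                           # discount factor

# Toy two-state example: "stay" keeps the current state, "move" switches it.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={
        ("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
        ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0},
    },
    rewards={("s0", "move", "s1"): 1.0},  # unspecified (s, a, s') triples default to 0
    gamma=0.9,
)
```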

1.2 Core Objective

The agent's goal is to find an optimal policy \(\pi^*\) that maximizes the expected cumulative discounted return:

\[V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]\]
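
As a quick numeric illustration of this objective, the snippet below accumulates a discounted return over a made-up reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * r_t by accumulating backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```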

1.3 Bellman Equations

State value function Bellman equation:

\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]\]

Action value function Bellman equation:

\[Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right]\]

Optimal Bellman equation:

\[V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]\]
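
Iterating the optimality equation as an update rule yields value iteration. Below is a minimal sketch run on a hypothetical two-state MDP; the transition and reward dictionaries are invented for illustration.

```python
# Value iteration: sweep V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')] until convergence.
states = ["s0", "s1"]
actions = ["stay", "move"]
P = {  # P[(s, a)] = {s': probability}
    ("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0},
}
R = {("s0", "move", "s1"): 1.0}   # all other transitions give reward 0
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        backup = max(
            sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2, p in P[(s, a)].items())
            for a in actions
        )
        delta = max(delta, abs(backup - V[s]))
        V[s] = backup
    if delta < theta:
        break

print(V)  # optimal state values for the toy MDP
```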

1.4 MDP Extensions

| Extension | Characteristic | Application |
|---|---|---|
| POMDP | Partial observability | Robot navigation, dialogue systems |
| Dec-POMDP | Decentralized partial observability | Multi-agent cooperation |
| CMDP | Constrained MDP | Safe reinforcement learning |
| Semi-MDP | Temporal abstraction | Hierarchical reinforcement learning |

2. RL Taxonomy

2.1 Overall Taxonomy Tree

```mermaid
graph TD
    RL[Reinforcement Learning] --> MF[Model-Free]
    RL --> MB[Model-Based]

    MF --> VB[Value-Based]
    MF --> PG[Policy-Based]
    MF --> AC[Actor-Critic]

    VB --> DQN[DQN Family]
    VB --> TD[TD Learning]

    PG --> REINFORCE[REINFORCE]
    PG --> TRPO[TRPO]
    PG --> PPO[PPO]

    AC --> A2C[A2C/A3C]
    AC --> SAC[SAC]
    AC --> DDPG[DDPG/TD3]

    MB --> Dyna[Dyna Architecture]
    MB --> MBPO[MBPO]
    MB --> MuZero[MuZero]
    MB --> WorldModel[World Models]

    RL --> Offline[Offline RL]
    RL --> MARL[Multi-Agent RL]
    RL --> HRL[Hierarchical RL]

    style RL fill:#e1f5fe
    style MF fill:#fff3e0
    style MB fill:#e8f5e9
```

2.2 Model-Free vs Model-Based

Model-Free Methods

Learn directly from interaction experience without requiring an environment dynamics model:

  • Value-Based: Learn value functions and derive the policy indirectly - Q-Learning, SARSA, DQN, Double DQN, Dueling DQN, Rainbow (a tabular Q-Learning loop is sketched just after this list)
  • Policy-Based: Directly parameterize and optimize the policy - REINFORCE, TRPO, PPO
  • Actor-Critic: Simultaneously learn policy (Actor) and value function (Critic) - A2C, A3C, SAC, DDPG, TD3
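
As a concrete instance of the value-based family, here is a minimal tabular Q-Learning loop; the `env` object with `reset()`/`step()` is a simplified, hypothetical Gym-style interface, not a real library API.

```python
import random
from collections import defaultdict

def q_learning(env, num_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: off-policy TD control with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: [0.0] * num_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)          # assumed to return (next_state, reward, done)
            # Q-Learning target uses the greedy next action (hence off-policy)
            target = r + gamma * max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```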

Model-Based Methods

Learn or leverage environment models for planning:

  • Learned dynamics models: Dyna, MBPO, Dreamer
  • Planning and search: AlphaGo, MuZero
  • World models: World Models, IRIS

Selection Guide

  • Sample efficiency priority → Model-based methods
  • Simple implementation, asymptotic performance priority → Model-free methods
  • Combining both → Dyna architecture

2.3 On-Policy vs Off-Policy

| Dimension | On-Policy | Off-Policy |
|---|---|---|
| Definition | Behavior policy = target policy | Behavior policy ≠ target policy |
| Data utilization | Low (discard after use) | High (reusable) |
| Stability | Better | Requires additional techniques |
| Representative algorithms | SARSA, PPO, A2C | Q-Learning, DQN, SAC |
| Experience replay | Not used | Used |

2.4 Offline Reinforcement Learning

Learn entirely from static datasets without environment interaction:

  • Core challenge: Distribution shift, extrapolation error
  • Representative methods: BCQ, CQL, IQL, Decision Transformer (a CQL-style penalty is sketched after this list)
  • Applications: Medical decision-making, autonomous driving, recommendation systems
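
To illustrate one way offline methods fight extrapolation error, the sketch below adds a CQL-style conservative penalty to an ordinary TD loss for discrete actions. PyTorch is assumed, and `q_net`, `target_q_net`, and the `batch` layout are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """TD error plus a conservative penalty that pushes down Q-values for
    actions not seen in the dataset (CQL-style, discrete actions)."""
    s, a, r, s_next, done = batch            # a: LongTensor of action indices; done: float 0/1 flags
    q_all = q_net(s)                         # Q(s, .), shape [batch, num_actions]
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                    # standard target, frozen target network
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_taken, target)
    # Conservative term: log-sum-exp over all actions minus the dataset action's value.
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative
```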

2.5 Single-Agent vs Multi-Agent

| Dimension | Single-Agent | Multi-Agent (MARL) |
|---|---|---|
| Environment | Static/stochastic | Non-stationary (other agents are also learning) |
| Objective | Maximize own return | Cooperative/competitive/mixed |
| Challenges | Exploration-exploitation tradeoff | Credit assignment, communication, scalability |
| Representatives | DQN, PPO, SAC | QMIX, MAPPO, MADDPG |

3. Key Algorithm Map

3.1 By Development Timeline

| Year | Algorithm | Category | Key Contribution |
|---|---|---|---|
| 1989 | Q-Learning | Value-Based | Off-policy TD control |
| 1992 | REINFORCE | Policy Gradient | Monte Carlo policy gradient |
| 2013 | DQN | Deep Value-Based | Deep networks + experience replay |
| 2015 | DDPG | Actor-Critic | Deterministic policy for continuous actions |
| 2015 | TRPO | Policy Gradient | Trust region optimization |
| 2016 | A3C | Actor-Critic | Asynchronous parallel training |
| 2017 | PPO | Policy Gradient | Clipped objective, simple and efficient |
| 2018 | SAC | Actor-Critic | Maximum entropy framework |
| 2018 | TD3 | Actor-Critic | Clipped double Q, delayed updates |
| 2020 | CQL | Offline RL | Conservative Q-learning |
| 2021 | Decision Transformer | Offline RL | Sequence modeling perspective |

3.2 By Application Scenario

Discrete Action Spaces:
  ├── Simple tasks → Q-Learning / SARSA
  ├── High-dimensional observations → DQN / Rainbow
  └── Multi-agent → QMIX / VDN

Continuous Action Spaces:
  ├── Deterministic policy → DDPG / TD3
  ├── Stochastic policy → SAC
  ├── Stable training → PPO / TRPO
  └── Multi-agent → MADDPG / MAPPO

Special Scenarios:
  ├── Static data → CQL / IQL / Decision Transformer
  ├── Planning needed → MuZero / Dreamer
  └── LLM alignment → RLHF (PPO) / DPO / GRPO

4. Core Components of Deep RL

4.1 Function Approximation

  • Value network: Approximate \(V(s)\) or \(Q(s,a)\) with neural networks
  • Policy network: Parameterize policy \(\pi_\theta(a|s)\) with neural networks (see the sketch after this list)
  • Model network: Approximate environment dynamics \(P(s'|s,a)\) with neural networks
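
A minimal sketch of the policy-network item, assuming PyTorch: a network that outputs a categorical distribution over discrete actions. The layer sizes and dimensions are arbitrary illustrations.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Parameterizes pi_theta(a|s) as a categorical distribution over discrete actions."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),        # outputs action logits
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

# Sampling an action and its log-probability (the quantities used by policy-gradient methods):
policy = PolicyNetwork(state_dim=4, num_actions=2)
dist = policy(torch.randn(1, 4))
action = dist.sample()
log_prob = dist.log_prob(action)
```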

4.2 Key Techniques for Stable Training

| Technique | Problem Solved | Used In |
|---|---|---|
| Experience Replay | Sample correlation, data efficiency | DQN, DDPG, SAC |
| Target Network | Training instability | DQN, DDPG, TD3 |
| Clipping | Excessively large policy updates | PPO |
| Trust Region | Policy update step control | TRPO |
| Entropy Regularization | Premature convergence, insufficient exploration | SAC, A3C |
| Prioritized Experience Replay (PER) | Sample utilization efficiency | Rainbow |
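
Two of these techniques fit in a few lines: a uniform replay buffer and a Polyak-style soft target-network update. This is a generic sketch rather than the exact form used in any particular paper; `soft_update` assumes PyTorch-style modules.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample decorrelated minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: the target network slowly tracks the online network,
    which stabilizes the TD targets (expects PyTorch nn.Module parameters)."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * o_param.data)
```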

4.3 Exploration Strategies

  • \(\epsilon\)-greedy: Simple and effective, suitable for discrete action spaces
  • Boltzmann (softmax) exploration: Sample actions with probability proportional to exponentiated action values (both are sketched after this list)
  • UCB: Upper confidence bound, optimism in the face of uncertainty
  • Intrinsic motivation: Curiosity-driven (ICM, RND)
  • Posterior sampling: Thompson Sampling
  • Maximum entropy: Automatic exploration in the SAC framework
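
The first two strategies are simple enough to sketch directly; the action values and temperature below are made up for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(prefs - prefs.max())     # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

q = [0.2, 1.5, -0.3]
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```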

5. Connection to LLM Post-Training

5.1 RLHF Pipeline

Reinforcement learning fine-tuning of large language models is one of the most important current applications of RL:

  1. Supervised Fine-Tuning (SFT): Fine-tune pre-trained model with high-quality data
  2. Reward Model Training: Train reward model \(R_\phi(x, y)\) from human preference data
  3. RL Optimization: Optimize the policy with PPO under a KL-divergence penalty toward a reference model (a common per-token form is sketched below the objective):
\[\max_\theta \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ R_\phi(x,y) - \beta D_{KL}(\pi_\theta \| \pi_{ref}) \right]\]
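
In practice the KL term is often applied per token during PPO rollouts. The sketch below computes such a shaped reward from per-token log-probabilities; PyTorch is assumed, and the function name, tensor shapes, and β value are illustrative assumptions rather than a specific library's API.

```python
import torch

def rlhf_shaped_rewards(reward_model_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token KL penalty with the reward model score added at the final token.

    reward_model_score: scalar R_phi(x, y) for the whole response
    policy_logprobs, ref_logprobs: [T] log pi_theta(y_t | ...) and log pi_ref(y_t | ...)
    """
    kl_per_token = policy_logprobs - ref_logprobs      # estimate of log(pi_theta / pi_ref)
    rewards = -beta * kl_per_token                     # penalize drifting from the reference model
    rewards[-1] = rewards[-1] + reward_model_score     # sequence-level reward at the last token
    return rewards

r = rlhf_shaped_rewards(torch.tensor(0.8), torch.randn(5), torch.randn(5))
```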

5.2 Beyond RLHF

| Method | Characteristics |
|---|---|
| DPO | No explicit reward model needed; optimize directly from preferences |
| GRPO | Group relative policy optimization; used for mathematical reasoning |
| RLAIF | AI feedback replaces human feedback |
| Constitutional AI | Principle-based self-improvement |
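
To illustrate the first row, here is the standard DPO loss for a batch of preference pairs, assuming PyTorch; the inputs are sequence-level summed log-probabilities and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: increase the margin by which the policy prefers the chosen response
    over the rejected one, measured relative to the reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```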

5.3 LLM from an RL Perspective

Viewing LLMs through the RL lens:

  • State: Generated token sequence so far
  • Action: Selection of the next token
  • Policy: Language model \(\pi_\theta(a_t | s_t)\)
  • Reward: Human preferences / AI judgments / verifier feedback
  • Environment: Task context + evaluation mechanism

6. Frontier Directions

6.1 Current Hot Topics

  • LLM reasoning enhancement: Training model reasoning capabilities with RL (o1, DeepSeek-R1)
  • Embodied intelligence: RL in robot manipulation and navigation (RT-2, Mobile ALOHA)
  • World models: Learning predictive models of environments (Dreamer, IRIS)
  • Safe RL: Constrained optimization, robust policies
  • Offline-to-online: Fine-tuning policies pretrained on offline data with subsequent online interaction

6.2 Open Challenges

  1. Sample efficiency: How to reduce required interactions?
  2. Generalization: How to transfer to new tasks/environments?
  3. Long-term credit assignment: Learning under sparse rewards
  4. Safety: Safety guarantees during training and deployment
  5. Scalability: Efficient training in large-scale environments
  6. Alignment: Ensuring agent behavior aligns with human intent

7. Suggested Learning Path

Beginner:
  MDP Basics → Dynamic Programming → MC Methods → TD Learning → Q-Learning

Intermediate:
  DQN → Policy Gradient → Actor-Critic → PPO → SAC

Advanced:
  Model-Based RL → Offline RL → Multi-Agent RL → Hierarchical RL

Applications:
  RLHF → Robot RL → Game AI → Reasoning Enhancement

References

  • Sutton & Barto, Reinforcement Learning: An Introduction (2018)
  • Sergey Levine, UC Berkeley CS285: Deep Reinforcement Learning
  • David Silver, UCL RL Course
  • OpenAI Spinning Up in Deep RL


