Model-Based Reinforcement Learning

All the deep reinforcement learning algorithms we have discussed so far (DQN, PPO, SAC, etc.) fall under Model-Free RL — they learn value functions or policies directly from experience gathered by interacting with the environment, without explicitly modeling the environment's dynamics. Model-Based RL (MBRL) takes a fundamentally different approach: first learn a model of the environment (a World Model), then use that model to assist decision-making.

This philosophy is much closer to how humans think — before making a decision, we typically "simulate" the consequences of different actions in our minds, rather than blindly trying every possibility.

Model-Free vs Model-Based

The Essential Difference Between the Two Paradigms

Model-Free RL:
+-----------+   action a    +-----------+     r, s'     +--------------+
|           | ------------> |           | ------------> |              |
|   Agent   |               |    Env    |               | Value/Policy |
|  (Policy) | <------------ |  (real)   | <------------ |   (update)   |
+-----------+   obs s, r    +-----------+               +--------------+

    Learns policy/value function directly from (s, a, r, s') experience
    Does not understand "why," only knows "what works"


Model-Based RL:
+-----------+   action a    +-----------+     r, s'
|           | ------------> |           | ------------>  collect data
|   Agent   |               |    Env    |               +-------------+
|           | <------------ |  (real)   |               | learn model |
+-----------+   obs s, r    +-----------+               |  p(s'|s,a)  |
      ^                                                 +------+------+
      |                                                        |
      |              +-----------------+   simulated r, s'     |
      +------------- |  plan/simulate  | <---------------------+
 policy optimization |  inside model   |
                     +-----------------+

    First learns an environment model, then "imagines" experience within it to optimize the policy
    Understands "how the world works," then "runs experiments in its head"

Core Trade-offs

Advantages of Model-Free:

  • No need to learn an environment model, avoiding problems caused by model error
  • Asymptotic performance is not limited by model bias; tabular variants even come with convergence guarantees given sufficient data
  • Relatively simple to implement

Disadvantages of Model-Free:

  • Extremely low sample efficiency: Each policy update requires a large number of real environment interactions. Typical Atari games require tens of millions of frames; MuJoCo continuous control tasks require millions of steps
  • Cannot perform counterfactual reasoning — unable to answer "what would have happened if I had made a different choice"

Advantages of Model-Based:

  • High sample efficiency: Once an environment model is learned, it can generate unlimited "simulated experience" without real interactions
  • Enables planning — before executing an action, the agent can simulate multiple possible futures in the model and choose the best one
  • Closer to how humans make decisions

Disadvantages of Model-Based:

  • Model Error: The learned model can never be perfect, and prediction errors compound over long planning horizons — this is known as Compounding Error
  • Learning a good environment model is itself a very challenging problem
  • High computational cost (must both learn the model and plan within it)

Mathematical Analysis of Compounding Error

Suppose the total variation distance between the learned model \(\hat{p}(s'|s, a)\) and the true dynamics \(p(s'|s, a)\) is bounded by \(\epsilon_m\):

\[ \max_{s, a} D_{\text{TV}}(\hat{p}(\cdot|s, a) \| p(\cdot|s, a)) \leq \epsilon_m \]

Then the gap between the policy's expected return under the learned model and under the true dynamics is bounded by:

\[ |\eta(\pi) - \hat{\eta}(\pi)| \leq \frac{2 r_{\max} \gamma \epsilon_m}{(1 - \gamma)^2} \]

where \(\eta(\pi)\) and \(\hat{\eta}(\pi)\) are the expected returns of the policy in the real environment and the learned model, respectively. The key insight is: even if the single-step model error \(\epsilon_m\) is small, the accumulated error over many steps can become very large. This is why a core design principle in MBRL is "short-horizon rollouts."
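
As a back-of-the-envelope illustration, plugging hypothetical numbers into this bound shows how the \(1/(1-\gamma)^2\) factor amplifies even a tiny single-step error (all values below are made up for illustration):

# Illustrative only: evaluate the performance-gap bound for assumed values
gamma = 0.99   # discount factor
r_max = 1.0    # assumed maximum per-step reward
eps_m = 0.01   # assumed single-step model error (total variation)

bound = 2 * r_max * gamma * eps_m / (1 - gamma) ** 2
print(bound)   # ~198: a 1% per-step error can cost on the order of 198 reward units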

World Model

What Is a World Model

A world model is an approximation of the environment's dynamics. A complete world model typically consists of three components:

\[ \text{World Model} = \begin{cases} \text{Transition Model:} & \hat{p}(s_{t+1} | s_t, a_t) \\ \text{Reward Model:} & \hat{r}(s_t, a_t) \\ \text{Done Model:} & \hat{d}(s_t, a_t) \in \{0, 1\} \end{cases} \]
  • Transition Model \(\hat{p}(s_{t+1} | s_t, a_t)\): Given the current state and action, predicts the next state
  • Reward Model \(\hat{r}(s_t, a_t)\): Given a state and action, predicts the immediate reward
  • Done Model \(\hat{d}(s_t, a_t)\): Predicts whether a terminal state has been reached

Model Learning

Learning the environment model is fundamentally a supervised learning problem — learning from collected real transition data:

\[ \mathcal{D} = \{(s_i, a_i, r_i, s_i', d_i)\}_{i=1}^{N} \]

Deterministic model:

\[ \hat{s}_{t+1} = f_\theta(s_t, a_t), \quad L(\theta) = \mathbb{E}_{\mathcal{D}} \left[ \| f_\theta(s_t, a_t) - s_{t+1} \|^2 \right] \]

Stochastic model (Gaussian):

\[ \hat{p}_\theta(s_{t+1} | s_t, a_t) = \mathcal{N}(\mu_\theta(s_t, a_t), \Sigma_\theta(s_t, a_t)) \]

Trained by maximizing the likelihood:

\[ L(\theta) = -\mathbb{E}_{\mathcal{D}} \left[ \log \hat{p}_\theta(s_{t+1} | s_t, a_t) \right] \]

Compared to deterministic models, stochastic models can better capture the inherent randomness (aleatoric uncertainty) of the environment.
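
A minimal PyTorch sketch of such a Gaussian dynamics model and its negative log-likelihood loss (layer sizes and the log-variance clamp are arbitrary illustrative choices, not taken from any specific paper):

import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """Predicts a diagonal Gaussian over the next state: N(mu, diag(sigma^2))."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, state_dim)
        self.logvar_head = nn.Linear(hidden, state_dim)

    def forward(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        mu = self.mu_head(h)
        logvar = self.logvar_head(h).clamp(-10.0, 2.0)  # keep the variance in a sane range
        return mu, logvar

def gaussian_nll(model, s, a, s_next):
    # Negative log-likelihood of the observed next state (up to an additive constant)
    mu, logvar = model(s, a)
    return (((s_next - mu) ** 2) * torch.exp(-logvar) + logvar).sum(-1).mean()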

Ensemble Models

A single neural network model cannot tell us "how uncertain it is about its own predictions" — this is critical in MBRL because we need to know when the model's predictions are unreliable.

The ensemble method is the most practical solution to this problem: train \(B\) independently initialized models \(\{f_{\theta_1}, f_{\theta_2}, \ldots, f_{\theta_B}\}\) (typically \(B = 5 \sim 7\)), each trained on the same data but with different initializations.

Uncertainty estimation: Given \((s, a)\), the \(B\) models produce \(B\) different predictions. Their agreement reflects how confident the model is:

\[ \text{Uncertainty}(s, a) = \text{Var}_{b=1}^{B} \left[ f_{\theta_b}(s, a) \right] \]
  • If all models' predictions are consistent → low uncertainty → the model's predictions can be trusted
  • If the models' predictions diverge significantly → high uncertainty → the predictions should not be trusted (typically indicating that this \((s, a)\) is out of the training data distribution)
         +--------+
(s, a) → | Model 1| → s'_1 = [1.2, 3.4]
         +--------+                         Low uncertainty:
(s, a) → | Model 2| → s'_2 = [1.3, 3.3]    All models agree
         +--------+                         → Trust predictions
(s, a) → | Model 3| → s'_3 = [1.2, 3.5]
         +--------+

         +--------+
(s, a) → | Model 1| → s'_1 = [1.2, 3.4]
         +--------+                         High uncertainty:
(s, a) → | Model 2| → s'_2 = [5.1, 0.2]    Models strongly disagree
         +--------+                         → Do not trust
(s, a) → | Model 3| → s'_3 = [-0.8, 7.1]
         +--------+
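
A small sketch of turning ensemble disagreement into an uncertainty score, assuming a list of models with the same interface as the Gaussian dynamics sketch above (i.e. returning the predicted mean as the first output):

import torch

@torch.no_grad()
def ensemble_uncertainty(models, s, a):
    # Variance of the mean predictions across ensemble members; a high value
    # suggests that (s, a) lies outside the training distribution.
    preds = torch.stack([m(s, a)[0] for m in models])  # (B, batch, state_dim)
    return preds.var(dim=0).mean(dim=-1)               # one scalar per sample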

The Dyna Architecture

Dyna was proposed by Sutton in 1991 and is the most classic framework in Model-Based RL. Its core idea is simple yet profound: learn simultaneously from both real and simulated experience.

Dyna-Q Algorithm

Dyna-Q seamlessly combines model-free Q-learning with model-based planning:

Dyna-Q Algorithm:

while not done:
    1. Interact with the real environment:
       a = ε-greedy(Q, s)
       s', r = env.step(a)

    2. Update Q using real experience (Direct RL):
       Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

    3. Update model using real experience (Model Learning):
       Model(s,a) ← (s', r)

    4. Generate simulated experience from the model and update Q (Planning):
       repeat k times:
           s̃, ã ← randomly sample from previously visited (s,a) pairs
           s̃', r̃ = Model(s̃, ã)
           Q(s̃,ã) ← Q(s̃,ã) + α[r̃ + γ max_a' Q(s̃',a') - Q(s̃,ã)]

    s ← s'

Key insight: Steps 2 and 4 perform exactly the same Q-learning update; the only difference is the data source — Step 2 uses real experience, while Step 4 uses model-simulated experience. The parameter \(k\) controls "how many simulation steps correspond to each real interaction step" and can be flexibly adjusted based on the computational budget.

The value of Dyna lies in its elegant conceptual framework: Planning and learning can be unified as the same operation — both are value updates based on experience samples, differing only in whether the experience is "real" or "imagined."
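
The pseudocode above translates almost line-for-line into a runnable tabular implementation. The sketch below assumes a Gymnasium-style environment with small discrete state and action spaces; hyperparameters are illustrative:

import random
from collections import defaultdict

def dyna_q(env, episodes=200, k=10, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)   # Q[(s, a)]
    model = {}               # model[(s, a)] = (r, s', terminal)
    nA = env.action_space.n

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # 1. act epsilon-greedily in the real environment
            a = (random.randrange(nA) if random.random() < eps
                 else max(range(nA), key=lambda x: Q[(s, x)]))
            s2, r, term, trunc, _ = env.step(a)
            done = term or trunc

            # 2. direct RL: Q-learning update from real experience
            target = r + (0.0 if term else gamma * max(Q[(s2, x)] for x in range(nA)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # 3. model learning: deterministic tabular model
            model[(s, a)] = (r, s2, term)

            # 4. planning: k extra Q-learning updates from simulated experience
            for _ in range(k):
                (ps, pa), (pr, ps2, pterm) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pterm else gamma * max(Q[(ps2, x)] for x in range(nA)))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q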

MBPO (Model-Based Policy Optimization)

MBPO was proposed by Janner et al. in 2019 ("When to Trust Your Model: Model-Based Policy Optimization") and is a milestone work in successfully combining model-based methods with modern deep RL. Its core innovation is: using short-horizon model rollouts to achieve an optimal balance between sample efficiency and model error.

Motivation

Performing long-horizon planning directly in the learned model (e.g., running many simulation steps as in Dyna) faces a fundamental tension:

  • Rollouts too long: Compounding errors accumulate, simulated trajectories become increasingly inaccurate, and the resulting garbage data poisons policy learning
  • Rollouts too short: Although accurate, the diversity of simulated experience is insufficient, and sample efficiency gains are limited

MBPO's key insight is: there exists an optimal rollout length \(k\) at which the sample efficiency gains from the model just outweigh the performance loss caused by model error.

Algorithm Design

The core procedure of MBPO:

  1. Collect data from the real environment → update the environment model (Ensemble)
  2. Starting from real states, perform \(k\)-step branched rollouts using the model → generate large amounts of simulated data
  3. Train the policy using simulated data + real data (using SAC as the underlying algorithm)
MBPO Branched Rollout:

  Real trajectory:  s1 ---a1--→ s2 ---a2--→ s3 ---a3--→ s4 ---a4--→ s5
                          |                        |
              Model       |           Model        |
              rollout     |           rollout      |
              (k=3 steps) v           (k=3 steps)  v
                         s̃2_1                     s̃4_1
                          |                        |
                         s̃2_2                     s̃4_2
                          |                        |
                         s̃2_3                     s̃4_3

  Starting from states along the real trajectory, use the learned model
  to roll forward k steps, generating large amounts of "branched" simulated experience

Why branch from real states? Because the model starts its short rollouts from states drawn from the real distribution, so each step's input is relatively reliable. In contrast, if the model were to generate entire trajectories from scratch, later states would increasingly drift away from the real distribution.
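
A sketch of the branched-rollout step, assuming an ensemble of Gaussian dynamics models as above, a separately learned reward_model, a SAC-style policy with a sample method, and two replay buffers; all of these names are placeholders rather than a reference implementation:

import random
import torch

@torch.no_grad()
def branched_rollout(ensemble, reward_model, policy, env_buffer, model_buffer,
                     k=3, batch=400):
    s = env_buffer.sample_states(batch)      # start from states on the real distribution
    for _ in range(k):
        a = policy.sample(s)                 # SAC-style stochastic policy
        m = random.choice(ensemble)          # pick a random ensemble member each step
        mu, logvar = m(s, a)
        s_next = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        r = reward_model(s, a)
        model_buffer.add(s, a, r, s_next)    # imagined transitions used for SAC updates
        s = s_next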

Theoretical Guarantee

MBPO's key theoretical result is a monotonic improvement guarantee. Let \(\eta[\pi]\) be the expected return of policy \(\pi\) in the real environment and \(\hat{\eta}[\pi]\) be the expected return in the model. Then:

\[ \eta[\pi] \geq \hat{\eta}[\pi] - C \cdot \left[ \epsilon_m + \epsilon_\pi \right] \]

where \(\epsilon_m\) is the model error, \(\epsilon_\pi\) is the magnitude of policy change (controllable via SAC's KL constraint), and \(C\) is a constant related to the rollout length \(k\).

This inequality tells us: as long as the model error \(\epsilon_m\) and policy change \(\epsilon_\pi\) are sufficiently small, we can guarantee that policy improvement in the model transfers to the real environment. Short rollouts are the key mechanism for controlling the impact of \(\epsilon_m\) on final performance.

MBPO Pseudocode

Algorithm: MBPO
-----------------------------------
Initialize: policy π, environment model ensemble {f_θ1, ..., f_θB},
            real data buffer D_env, simulated data buffer D_model

for each iteration:
    // 1. Collect real data
    Interact with the real environment using π → add new data to D_env

    // 2. Update the environment model
    Train ensemble model {f_θ1, ..., f_θB} on D_env

    // 3. Model rollout (branched)
    for each state s sampled from D_env:
        Starting from s, use a randomly selected ensemble member
        to perform a k-step rollout → add simulated data to D_model

    // 4. Policy optimization
    for G gradient steps:
        Sample a mini-batch from D_env ∪ D_model
        Update policy π using SAC

The Dreamer Series

The Dreamer series, proposed by Hafner et al. starting in 2020, represents another important direction in Model-Based RL: performing complete "imagination" training in a learned latent space — not just using the model to generate simulated data, but having the policy learn entirely within imagination.

Dreamer v1: Learning in Imagination

Dreamer v1 ("Dream to Control: Learning Behaviors by Latent Imagination", 2020) introduced two core innovations: RSSM and pure imagination-based training.

RSSM (Recurrent State-Space Model)

RSSM is the core "world model" of Dreamer. It splits the latent state into two parts:

\[ \text{RSSM State} = \underbrace{h_t}_{\text{Deterministic}} \oplus \underbrace{z_t}_{\text{Stochastic}} \]
  • Deterministic component \(h_t\): A hidden state maintained by a GRU, capturing long-term dependencies
  • Stochastic component \(z_t\): A randomly sampled latent variable, capturing the environment's inherent randomness

RSSM consists of four components:

\[ \begin{aligned} \text{Sequence Model (deterministic path):} \quad & h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \\ \text{Prior:} \quad & \hat{z}_t \sim p_\phi(\hat{z}_t | h_t) \\ \text{Posterior:} \quad & z_t \sim q_\phi(z_t | h_t, o_t) \\ \text{Decoder (reconstruction):} \quad & \hat{o}_t \sim p_\phi(\hat{o}_t | h_t, z_t) \end{aligned} \]

where \(o_t\) is the raw observation (e.g., image pixels).

RSSM Information Flow:

          o_t (observation)    o_{t+1} (observation)
          |                  |
          v                  v
     +---------+        +---------+
     |Posterior |        |Posterior |
     |q(z|h,o) |        |q(z|h,o) |
     +----+----+        +----+----+
          |                  |
          v                  v
         z_t                z_{t+1}
          |                  |
    h_t --+---- GRU -----> h_{t+1} --+---- GRU ----> ...
          |                  |
          v                  v
     +---------+        +---------+
     |  Prior  |        |  Prior  |
     |p(ẑ|h)  |        |p(ẑ|h)  |
     +---------+        +---------+
          |                  |
          v                  v
     +---------+        +---------+
     | Decoder |        | Decoder |
     |p(ô|h,z)|        |p(ô|h,z)|
     +---------+        +---------+
          |                  |
          v                  v
         ô_t               ô_{t+1}

The Distinction Between Prior and Posterior:

  • Posterior \(q(z_t | h_t, o_t)\): Has access to both the deterministic state \(h_t\) and the real observation \(o_t\); used during training
  • Prior \(p(\hat{z}_t | h_t)\): Only sees the deterministic state \(h_t\), without the real observation; used during imagination rollouts (since no real observations are available during imagination)

During training, the KL divergence between the prior and posterior is minimized, teaching the prior to "make reasonable predictions about the stochastic state even without seeing the observation":

\[ L_{\text{KL}} = D_{\text{KL}}(q(z_t | h_t, o_t) \| p(\hat{z}_t | h_t)) \]
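
A compact single-step RSSM sketch in PyTorch (sizes and network shapes are placeholders; the real Dreamer models use CNN encoders/decoders for image observations):

import torch
import torch.nn as nn
import torch.distributions as td

class RSSM(nn.Module):
    def __init__(self, obs_dim, action_dim, h_dim=200, z_dim=30):
        super().__init__()
        self.gru = nn.GRUCell(z_dim + action_dim, h_dim)          # deterministic path
        self.prior_net = nn.Linear(h_dim, 2 * z_dim)              # p(z_hat | h)
        self.post_net = nn.Linear(h_dim + obs_dim, 2 * z_dim)     # q(z | h, o)
        self.decoder = nn.Linear(h_dim + z_dim, obs_dim)          # p(o_hat | h, z)

    def _dist(self, stats):
        mu, logstd = stats.chunk(2, dim=-1)
        return td.Normal(mu, logstd.clamp(-5, 2).exp())

    def step(self, h, z, a, o=None):
        h = self.gru(torch.cat([z, a], -1), h)
        prior = self._dist(self.prior_net(h))
        if o is None:                      # imagination: no observation available
            z = prior.rsample()
            return h, z, prior, None
        post = self._dist(self.post_net(torch.cat([h, o], -1)))   # training: uses o_t
        z = post.rsample()
        return h, z, prior, post

With this interface, the KL term above is simply td.kl_divergence(post, prior).sum(-1).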

Pure Imagination Training

Dreamer's policy learning takes place entirely within RSSM's latent space:

  1. Sample initial states \((h_t, z_t)\) from real data
  2. Roll forward \(H\) steps using RSSM's prior (not the posterior, since no real observations are available)
  3. Train the policy on these "imagined" trajectories using an Actor-Critic method

This is more thorough than MBPO — MBPO still requires partial training on real data, whereas Dreamer's policy learns entirely in imagination. Real data is used solely to train the world model.
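
A sketch of this imagination loop, assuming the RSSM step above and an actor network that maps the latent \((h, z)\) to an action distribution:

import torch

def imagine(rssm, actor, h, z, horizon=15):
    # Roll the latent model forward using only the prior (no observations).
    trajectory = []
    for _ in range(horizon):
        a = actor(torch.cat([h, z], -1)).rsample()
        h, z, _, _ = rssm.step(h, z, a, o=None)   # prior transition
        trajectory.append((h, z, a))
    return trajectory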

Dreamer v2: Discrete Latent Representations

Dreamer v2 ("Mastering Atari with Discrete World Models", 2021) introduced the following key improvements:

1. Discrete Latent Space

The continuous latent variable \(z_t\) is replaced with a discrete one: \(z_t\) consists of 32 categorical variables with 32 classes each, represented as the concatenation of 32 one-hot vectors, i.e. a \(32 \times 32 = 1024\)-dimensional sparse binary vector.

Advantages of discrete representations:

  • Better suited for representing discrete environmental features (e.g., presence/absence of objects)
  • More stable training (continuous VAEs are prone to posterior collapse)
  • More compatible with discrete sequence models such as Transformers
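
A sketch of sampling such a discrete latent with straight-through gradients (shapes follow the 32 x 32 layout described above; this is an illustration, not Dreamer's exact implementation):

import torch
import torch.nn.functional as F

def sample_discrete_latent(logits):
    # logits: (..., 32, 32) — 32 categorical variables with 32 classes each
    probs = F.softmax(logits, dim=-1)
    idx = torch.distributions.Categorical(probs=probs).sample()
    one_hot = F.one_hot(idx, num_classes=logits.shape[-1]).float()
    # straight-through: forward uses the hard sample, backward uses the soft probs
    z = one_hot + probs - probs.detach()
    return z.flatten(-2)   # concatenated one-hot vector of size 32 * 32 = 1024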

2. KL Balancing

In standard VAE training, the KL divergence \(D_{\text{KL}}(q \| p)\) simultaneously pushes the posterior \(q\) toward the prior \(p\) and the prior \(p\) toward the posterior \(q\). However, in Dreamer these two forces should be asymmetric:

\[ L_{\text{KL}} = \alpha \cdot D_{\text{KL}}(\text{sg}(q) \| p) + (1 - \alpha) \cdot D_{\text{KL}}(q \| \text{sg}(p)) \]

where \(\text{sg}\) denotes the stop-gradient operation and \(\alpha = 0.8\). This causes the prior to be pushed harder (to "catch up" with the posterior), while the posterior is less constrained and can encode information more freely.
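
A sketch of this loss for categorical latents, using logits of shape (..., 32, 32) and implementing the stop-gradient via detach (an illustration of the formula, not Dreamer's exact code):

import torch.distributions as td

def kl_balance(post_logits, prior_logits, alpha=0.8):
    # alpha-weighted term: pull the prior toward the (stop-gradient) posterior
    lhs = td.kl_divergence(td.Categorical(logits=post_logits.detach()),
                           td.Categorical(logits=prior_logits))
    # (1 - alpha)-weighted term: lightly regularize the posterior toward the prior
    rhs = td.kl_divergence(td.Categorical(logits=post_logits),
                           td.Categorical(logits=prior_logits.detach()))
    return (alpha * lhs + (1 - alpha) * rhs).sum(-1)   # sum over the 32 variables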

Dreamer v3: A Universal World Model

Dreamer v3 ("Mastering Diverse Domains through World Models", 2023) had an extremely ambitious goal: a single set of fixed hyperparameters that works across all domains — from Atari to continuous control to 3D games.

Core technical innovations of Dreamer v3:

1. Symlog Predictions

Reward magnitudes can differ by several orders of magnitude across tasks (rewards in Atari may range from 0 to 1000, while some control tasks may have rewards between -0.01 and 0.01). Dreamer v3 uses the symlog transform to handle these scale differences:

\[ \text{symlog}(x) = \text{sign}(x) \cdot \ln(|x| + 1) \]
\[ \text{symexp}(x) = \text{sign}(x) \cdot (\exp(|x|) - 1) \]

The model predicts values in symlog space, and the symexp transform is applied at inference time to convert back to the original scale. This is more flexible than simple reward clipping or normalization.
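
The two transforms are one-liners; a minimal PyTorch sketch:

import torch

def symlog(x):
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x):
    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1.0)

# e.g. train a reward head against symlog(reward) and read it out via symexp(prediction)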

2. Free Bits

No gradient is produced when the KL divergence falls below a certain threshold, preventing posterior collapse:

\[ L_{\text{KL}} = \max(D_{\text{KL}}(q \| p), \text{free\_nats}) \]

3. Percentile Scaling

Returns are normalized using percentiles rather than fixed scaling:

\[ \hat{R}_t = \frac{R_t - \text{Per}_5(R)}{\text{Per}_{95}(R) - \text{Per}_5(R)} \]

where \(\text{Per}_5\) and \(\text{Per}_{95}\) are the 5th and 95th percentiles of the return distribution.
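
A sketch of this normalization (the clamp on the denominator is a common safeguard against amplifying tiny ranges, added here as an assumption rather than part of the formula above):

import torch

def scale_returns(returns):
    lo = torch.quantile(returns, 0.05)
    hi = torch.quantile(returns, 0.95)
    return (returns - lo) / (hi - lo).clamp(min=1.0)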

Landmark achievement: Dreamer v3 was the first MBRL algorithm to learn to collect diamonds in Minecraft from pixel inputs without any prior knowledge. This task requires the agent to complete a long sequence of dozens of sub-goals (chop tree → craft crafting table → craft wooden pickaxe → mine stone → craft stone pickaxe → mine iron ore → smelt iron → craft iron pickaxe → mine diamond ore), demanding extremely strong long-horizon planning capabilities.

MuZero

MuZero was proposed by Schrittwieser et al. in 2020 ("Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model") and is the latest evolution of DeepMind's AlphaGo lineage. It represents another extreme philosophy in Model-Based RL: rather than learning a complete environment model (e.g., predicting the next frame of pixels), learn only what is "useful for decision-making."

The Evolution from AlphaGo to MuZero

\[ \text{AlphaGo} \to \text{AlphaGo Zero} \to \text{AlphaZero} \to \text{MuZero} \]
Algorithm     | Environment Model   | Human Knowledge               | Applicability
--------------+---------------------+-------------------------------+------------------
AlphaGo       | Known (Go rules)    | Required (human game records) | Go only
AlphaGo Zero  | Known (Go rules)    | Not required                  | Go only
AlphaZero     | Known (game rules)  | Not required                  | Go, Chess, Shogi
MuZero        | Learned model       | Not required                  | Any environment

MuZero's revolutionary contribution is: it does not need to know the rules of the environment. AlphaZero required calling the real game simulator during planning (e.g., "if I place a stone here, what will the board look like"), which demands perfect knowledge of environment rules. MuZero replaces the real simulator with a learned model, making it applicable to any environment.

Three Networks

MuZero's architecture consists of three tightly integrated neural networks:

1. Representation Function \(h_\theta\):

\[ s^0 = h_\theta(o_1, \ldots, o_t) \]

Encodes the raw observation sequence \((o_1, \ldots, o_t)\) into a hidden state \(s^0\). Note that this hidden state does not need to correspond to the real environment state in any way — it only needs to contain "information useful for decision-making."

2. Dynamics Function \(g_\theta\):

\[ s^{k+1}, r^{k+1} = g_\theta(s^k, a^{k+1}) \]

Given the current hidden state \(s^k\) and action \(a^{k+1}\), predicts the next hidden state \(s^{k+1}\) and immediate reward \(r^{k+1}\). This is MuZero's "world model," but it operates in hidden space and does not predict pixel-level observations.

3. Prediction Function \(f_\theta\):

\[ p^k, v^k = f_\theta(s^k) \]

Given a hidden state \(s^k\), predicts the policy (action probability distribution) \(p^k\) and value \(v^k\). This directly corresponds to the policy-value network in AlphaZero.

MuZero Architecture:

Real observations          Planning in hidden space
                         (MCTS in latent space)
o1, ..., ot
    |                    s^0 ---a1--→ s^1 ---a2--→ s^2 ---a3--→ s^3
    v                     |     r^1    |     r^2    |     r^3    |
+--------+               v            v            v            v
| h(obs) | ----→ s^0   (p^0,v^0)   (p^1,v^1)   (p^2,v^2)   (p^3,v^3)
+--------+  Repr.
             func.      Prediction f   Dynamics g + Prediction f
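
A minimal sketch of the three functions bundled in one PyTorch module (MLPs with placeholder sizes; the real MuZero uses residual convolutional networks for visual domains):

import torch
import torch.nn as nn

class MuZeroNets(nn.Module):
    def __init__(self, obs_dim, action_dim, latent_dim=64):
        super().__init__()
        self.action_dim = action_dim
        self.represent = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, latent_dim))            # h_theta
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent_dim + 1))         # g_theta
        self.predict = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, action_dim + 1))          # f_theta

    def initial(self, obs):
        s0 = self.represent(obs)
        logits, v = self.predict(s0).split([self.action_dim, 1], dim=-1)
        return s0, logits, v

    def recurrent(self, s, a_onehot):
        out = self.dynamics(torch.cat([s, a_onehot], dim=-1))
        s_next, r = out[..., :-1], out[..., -1:]
        logits, v = self.predict(s_next).split([self.action_dim, 1], dim=-1)
        return s_next, r, logits, v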

MCTS Planning

At each decision step, MuZero uses MCTS (Monte Carlo Tree Search) to plan in hidden space. The MCTS procedure is as follows:

  1. Selection: Starting from the root node \(s^0\), select actions according to the UCB formula, traversing down the tree until reaching an unexpanded leaf node
  2. Expansion: At the leaf node, expand a new node using the dynamics function \(g_\theta\) and the prediction function \(f_\theta\)
  3. Backup: Propagate the leaf node's value estimate \(v^k\) back up along the path, updating the statistics of all nodes along the way
  4. Action Selection: Repeat the above steps many times (e.g., 800 simulations), then select the final action based on the visit counts of each action at the root node

The UCB formula balances exploration and exploitation:

\[ a^k = \arg\max_a \left[ Q(s^k, a) + c \cdot P(s^k, a) \cdot \frac{\sqrt{\sum_b N(s^k, b)}}{1 + N(s^k, a)} \right] \]

where \(Q\) is the estimated action value, \(P\) is the prior policy given by the prediction function, \(N\) is the visit count, and \(c\) is the exploration coefficient.
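
A sketch of this selection rule at a single tree node, assuming each child stores its visit count N, mean value Q, and prior probability P (the node structure is hypothetical):

import math

def select_action(node, c=1.25):
    total_n = sum(child.N for child in node.children.values())
    def ucb(child):
        return child.Q + c * child.P * math.sqrt(total_n) / (1 + child.N)
    return max(node.children.items(), key=lambda kv: ucb(kv[1]))[0]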

Training

MuZero's training objective aligns the predictions of the three networks with actual game outcomes:

\[ L(\theta) = \sum_{k=0}^{K} \left[ \underbrace{l^r(u_{t+k}, r^k)}_{\text{Reward loss}} + \underbrace{l^v(z_{t+k}, v^k)}_{\text{Value loss}} + \underbrace{l^p(\pi_{t+k}, p^k)}_{\text{Policy loss}} \right] + c \|\theta\|^2 \]
  • Reward loss: The predicted reward \(r^k\) should match the real reward \(u_{t+k}\)
  • Value loss: The predicted value \(v^k\) should match the bootstrapped value target \(z_{t+k}\)
  • Policy loss: The predicted policy \(p^k\) should match the improved policy \(\pi_{t+k}\) produced by MCTS search

A key detail: during training, the dynamics function \(g_\theta\) is unrolled for \(K\) steps (typically \(K = 5\)), and gradients are backpropagated through the entire unrolled chain. This means the hidden state \(s^k\) is trained end-to-end to be a "representation useful for decision-making."

Core Philosophy: Model Only What Matters

MuZero and Dreamer represent two different world model philosophies:

  • Dreamer: Learn a world model that is as accurate as possible (capable of reconstructing observations), then perform policy learning within the model
  • MuZero: Learn only the information "useful for decision-making" (predicting rewards, values, and policies), without caring about observation reconstruction

MuZero's philosophy has advantages in certain scenarios. For example, in Atari games, the screen contains many visual details irrelevant to decision-making (e.g., background textures, particle effects). Dreamer must spend substantial model capacity reconstructing these irrelevant details, while MuZero ignores them entirely and focuses only on information that affects decisions.

MBRL Method Comparison

Dimension                  | Dyna-Q                           | MBPO                                   | Dreamer v3                          | MuZero
---------------------------+----------------------------------+----------------------------------------+-------------------------------------+-----------------------------
Model Type                 | Tabular / simple NN              | Ensemble NN                            | RSSM (latent space)                 | Hidden-space dynamics model
Planning Method            | Simulated experience + Q update  | Short-horizon rollouts                 | Latent-space Actor-Critic           | MCTS
State Space                | Discrete / low-dim               | Continuous (low/mid-dim)               | Continuous (incl. high-dim pixels)  | Any
Observation Reconstruction | N/A                              | Not required                           | Required (decoder)                  | Not required
Sample Efficiency          | Medium                           | High                                   | High                                | Very high
Computational Cost         | Low                              | Medium                                 | High                                | Very high (MCTS)
Application Domain         | Teaching / simple tasks          | Continuous control                     | General-purpose (incl. pixels)      | Board games / Atari
Key Innovation             | Unified learning + planning      | Short horizon + monotonic improvement  | RSSM + pure imagination training    | Hidden-space planning + MCTS

Selection guidelines:

  • Low-dimensional continuous control, prioritizing sample efficiency: MBPO
  • High-dimensional pixel inputs, generality: Dreamer v3
  • Discrete action spaces, precise planning required: MuZero
  • Learning and teaching: Dyna-Q (conceptually clean; the best starting point for understanding MBRL)

Summary and Outlook

The core value of Model-Based RL lies in sample efficiency — by learning an environment model, the agent can acquire vast amounts of experience through "imagination," dramatically reducing the number of real environment interactions required. This is critical in scenarios where real interaction is expensive or dangerous.

From Dyna to MBPO to Dreamer to MuZero, a clear line of development emerges:

  1. Dyna (1991): Proposed the framework unifying learning and planning
  2. MBPO (2019): Mitigated compounding error via short-horizon rollouts, matching the asymptotic performance of strong model-free baselines on continuous control with far fewer samples
  3. Dreamer (2020-2023): Performed complete imagination-based training in latent space, enabling efficient learning from pixel-level inputs
  4. MuZero (2020): Modeled only decision-relevant information; achieved superhuman performance in board games and Atari

The future trend points toward Foundation World Models — just as language models have learned general language understanding from massive text corpora, world models have the potential to learn general physical intuition and environmental understanding from massive amounts of video and interaction data. The successes of Dreamer v3 and MuZero have already demonstrated that learned world models can support decision-making in complex tasks. The next step is to extend this capability to more open-ended and diverse real-world scenarios.

