World Models and Video Generation

World Models are a core capability of embodied intelligence: by internally simulating future states, they enable robots to "imagine" the consequences of their actions. Starting from mathematical definitions, this article reviews the applications of world models in robotics and their convergence with video generation technologies.

Related notes: World Models (General) | Introduction to Robot Foundation Models


1. Mathematical Definition of World Models

1.1 Basic Form

The core of a world model is learning the environment's state transition function:

\[p_\theta(s_{t+1} | s_t, a_t)\]

where \(s_t\) is the environment state at time \(t\), and \(a_t\) is the agent's action.

However, in practical robot scenarios, we typically cannot directly access the complete environment state \(s_t\); we can only obtain observations \(o_t\) (e.g., images). Therefore, a latent state space must be introduced:

\[z_t = f_\phi(o_t) \quad \text{(Encoder: observation → latent state)}\]
\[\hat{z}_{t+1} = g_\theta(z_t, a_t) \quad \text{(Transition model: predict next latent state)}\]
\[\hat{o}_{t+1} = d_\psi(\hat{z}_{t+1}) \quad \text{(Decoder: latent state → predicted observation)}\]
\[\hat{r}_t = r_\xi(z_t, a_t) \quad \text{(Reward predictor)}\]
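
The four components can be sketched as a toy numpy pipeline. Linear maps stand in for the learned networks \(f_\phi\), \(g_\theta\), \(d_\psi\), \(r_\xi\); all dimensions and weights here are illustrative, not taken from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: flattened 8x8 grayscale "image", 4-D latent, 2-D action.
OBS_DIM, LATENT_DIM, ACT_DIM = 64, 4, 2

# Random linear parameters standing in for the learned networks.
W_enc = rng.normal(size=(LATENT_DIM, OBS_DIM)) * 0.1              # encoder f_phi
W_dyn = rng.normal(size=(LATENT_DIM, LATENT_DIM + ACT_DIM)) * 0.1 # transition g_theta
W_dec = rng.normal(size=(OBS_DIM, LATENT_DIM)) * 0.1              # decoder d_psi
w_rew = rng.normal(size=(LATENT_DIM + ACT_DIM,)) * 0.1            # reward head r_xi

def encode(o):                 # z_t = f_phi(o_t)
    return W_enc @ o

def transition(z, a):          # z_hat_{t+1} = g_theta(z_t, a_t)
    return W_dyn @ np.concatenate([z, a])

def decode(z):                 # o_hat = d_psi(z)
    return W_dec @ z

def predict_reward(z, a):      # r_hat_t = r_xi(z_t, a_t)
    return w_rew @ np.concatenate([z, a])

o_t = rng.normal(size=OBS_DIM)
a_t = np.array([0.5, -0.2])
z_t = encode(o_t)              # observation -> latent state
z_next = transition(z_t, a_t)  # predicted next latent state
o_next_pred = decode(z_next)   # predicted next observation
r_pred = predict_reward(z_t, a_t)
```

In a real system each linear map is a deep network (CNN encoder/decoder, recurrent or transformer dynamics), but the data flow is exactly this.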

1.2 Complete Objective

Training a world model typically involves jointly optimizing multiple objectives:

\[\mathcal{L} = \underbrace{\mathcal{L}_{\text{recon}}}_{\text{reconstruction loss}} + \beta \underbrace{\mathcal{L}_{\text{KL}}}_{\text{regularization}} + \gamma \underbrace{\mathcal{L}_{\text{reward}}}_{\text{reward prediction}} + \delta \underbrace{\mathcal{L}_{\text{dyn}}}_{\text{dynamics consistency}}\]

where:

  • \(\mathcal{L}_{\text{recon}} = \|o_t - \hat{o}_t\|^2\): Reconstruction loss
  • \(\mathcal{L}_{\text{KL}} = D_{\text{KL}}(q(z_t|o_t) \| p(z_t|z_{t-1}, a_{t-1}))\): Consistency between posterior and prior
  • \(\mathcal{L}_{\text{reward}} = \|r_t - \hat{r}_t\|^2\): Reward prediction loss
  • \(\mathcal{L}_{\text{dyn}}\): Dynamics prediction accuracy
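
A minimal sketch of how these terms combine, assuming diagonal-Gaussian posterior and prior (function names and weights are ours; in RSSM-style models the dynamics-consistency term is largely folded into the KL between posterior and prior, so it is omitted here):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    # Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians.
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def world_model_loss(o, o_hat, r, r_hat,
                     mu_q, var_q, mu_p, var_p, beta=1.0, gamma=1.0):
    recon = np.sum((o - o_hat) ** 2)            # L_recon
    kl = gaussian_kl(mu_q, var_q, mu_p, var_p)  # L_KL: posterior vs. prior
    reward = (r - r_hat) ** 2                   # L_reward
    return recon + beta * kl + gamma * reward
```

Note that when posterior and prior coincide the KL term vanishes, leaving only reconstruction and reward errors.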

1.3 Uses of World Models in Robotics

```mermaid
graph TB
    WM[World Model<br/>p(s_t+1 | s_t, a_t)]

    WM --> U1[Model Predictive Control MPC<br/>Online planning of optimal action sequences]
    WM --> U2[Imagination Training<br/>Generating virtual experiences inside the model]
    WM --> U3[Safety Checking<br/>Predicting whether action consequences are safe]
    WM --> U4[Sim2Real<br/>Learned model bridges the simulation gap]
    WM --> U5[Video Prediction<br/>Generating future scenes to aid decision-making]

    U1 --> A1["Select action sequence that maximizes Σ r(s_t, a_t)"]
    U2 --> A2["Dreamer: Train RL policy in latent space"]
    U5 --> A3["Visual planning: Imagine goal scene first, then act"]
```

2. Dreamer Series: From Simulation to Real Robots

2.1 Dreamer Architecture Evolution

The Dreamer series is one of the most successful applications of world models in RL. Its core idea is "imagination" training within a learned latent world model:

Dreamer v1 (2020):

  • Learns a latent dynamics model (RSSM: Recurrent State-Space Model)
  • Trains Actor-Critic through imagined trajectories in latent space
  • Dramatically improves sample efficiency

Dreamer v2 (2021):

  • Discretized latent states (categorical latents)
  • Improved value function estimation
  • First to surpass model-free methods on Atari

Dreamer v3 (2023):

  • Universal hyperparameters, no per-task tuning needed
  • Symlog encoding handles rewards of different magnitudes
  • Excellent performance across 150+ tasks

Mathematical form of RSSM:

\[\begin{aligned} \text{Deterministic path:} \quad h_t &= f_\theta(h_{t-1}, z_{t-1}, a_{t-1}) \\ \text{Stochastic state prior:} \quad \hat{z}_t &\sim p_\theta(z_t | h_t) \\ \text{Stochastic state posterior:} \quad z_t &\sim q_\phi(z_t | h_t, o_t) \end{aligned}\]

where \(h_t\) is the deterministic recurrent state and \(z_t\) is the stochastic state.
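
One RSSM step can be sketched in numpy as follows. Linear maps and a `tanh` recurrence stand in for the learned networks (real implementations use a GRU for the deterministic path and MLP heads for the distributions); all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z, A, O = 8, 4, 2, 16   # deterministic, stochastic, action, observation dims

W_gru  = rng.normal(size=(H, H + Z + A)) * 0.1   # stands in for f_theta
W_pri  = rng.normal(size=(2 * Z, H)) * 0.1       # prior head p_theta(z_t | h_t)
W_post = rng.normal(size=(2 * Z, H + O)) * 0.1   # posterior head q_phi(z_t | h_t, o_t)

def rssm_step(h_prev, z_prev, a_prev, o_t=None):
    # Deterministic path: h_t = f_theta(h_{t-1}, z_{t-1}, a_{t-1})
    h_t = np.tanh(W_gru @ np.concatenate([h_prev, z_prev, a_prev]))
    if o_t is None:
        # Prior: sample z_t from h_t alone (pure imagination, no observation).
        mu, log_std = np.split(W_pri @ h_t, 2)
    else:
        # Posterior: also condition on the current observation o_t.
        mu, log_std = np.split(W_post @ np.concatenate([h_t, o_t]), 2)
    z_t = mu + np.exp(log_std) * rng.normal(size=Z)
    return h_t, z_t

h, z, a = np.zeros(H), np.zeros(Z), np.zeros(A)
h, z = rssm_step(h, z, a, o_t=rng.normal(size=O))  # posterior step (with obs)
h, z = rssm_step(h, z, a)                          # prior step (imagination)
```

The posterior is used while fitting the model to data; the prior alone drives long imagined rollouts for policy training.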

2.2 DayDreamer (2022): Dreamer on Real Robots

DayDreamer (Wu et al., 2022) was the first to successfully apply Dreamer to real robots:

| Platform | Task | Training Time | Notes |
|---|---|---|---|
| A1 Quadruped | Standing, walking | 1 hour of real interaction | Learning gaits from scratch |
| UR5 Robot Arm | Object manipulation | ~30 minutes | Tabletop grasping |
| Shadow Hand | Dexterous rotation | ~40 minutes | In-hand manipulation |

Key implementation details:

  1. Collect a small amount of data on the real robot
  2. Train the RSSM world model
  3. Train the policy through imagination within the world model (no real interaction needed)
  4. Deploy the policy on the real robot, collect more data
  5. Repeat the above process
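
The loop above can be sketched as follows. Everything here is a toy stand-in: `policy` replaces the learned actor, the random perturbation replaces actor-critic training in imagination, and the commented-out `world_model.fit` marks where RSSM training would run:

```python
import numpy as np

rng = np.random.default_rng(0)
replay = []                              # real-robot experience buffer

def policy(obs, w):
    # Toy linear policy standing in for the learned actor.
    return np.tanh(w @ obs)

w_policy = rng.normal(size=(2, 8)) * 0.1

for iteration in range(3):
    # Steps 1-2: collect a little real data, then (re)fit the world model.
    obs_batch = rng.normal(size=(50, 8))          # placeholder observations
    act_batch = np.array([policy(o, w_policy) for o in obs_batch])
    replay.extend(zip(obs_batch, act_batch))
    # world_model.fit(replay)                     # RSSM training would go here

    # Step 3: improve the policy purely in imagination (no real interaction);
    # a random perturbation stands in for actor-critic updates here.
    w_policy = w_policy + 0.01 * rng.normal(size=w_policy.shape)

    # Steps 4-5: the updated policy is deployed on the next pass of the loop.
```

The key property is that steps 2-3 consume no robot time, so almost all learning happens between short bursts of real interaction.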

3. Video Generation as World Models

3.1 Predicting the Future from Pixel Space

A major recent trend is treating video generation models as world models. The core insight: If a model can accurately predict future video frames given actions, then it has implicitly learned the physical laws of the environment.

Formal expression:

\[p_\theta(o_{t+1:t+H} | o_{1:t}, a_{t:t+H-1})\]

where \(o\) are image frames, \(a\) are actions, and \(H\) is the prediction time horizon.

3.2 Key Models

UniSim (Google DeepMind, 2023)

Positioning: Universal interactive simulator

Core idea: Use a video diffusion model to simulate the dynamic changes of any environment, supporting multiple types of "action" inputs:

  • Robot end-effector motion
  • Free-form text descriptions ("open the drawer")
  • Camera motion trajectories

Architecture: Based on a Video Diffusion Model, conditioned on previous frames and action descriptions

Applications:

  • Training robot policies (simulator replacement)
  • Data augmentation: Generating training data under different conditions
  • Evaluation: Testing policy robustness within the model

Genie (Google DeepMind, 2024)

Positioning: Generative interactive environment

Core idea: Unsupervised learning of controllable interactive environments from internet videos

Architecture:

  1. Video Tokenizer: Encodes video frames into discrete tokens
  2. Latent Action Model (LAM): Infers latent actions from consecutive frames
  3. Dynamics Model: Given current frame and latent action, predicts the next frame

```mermaid
graph LR
    V1[Video Frame t] --> VT[Video Tokenizer]
    V2[Video Frame t+1] --> VT
    VT --> LAM[Latent Action Model]
    LAM --> LA[Latent Action â_t]

    VT --> DM[Dynamics Model]
    LA --> DM
    DM --> VT2[Predicted Frame t+1 Tokens]
    VT2 --> DEC[Decoder]
    DEC --> PRED[Predicted Video Frame]
```

Key innovation: No action annotations required! Learns an interactive world model from pure video data.

Scale: 11B parameters, trained on 200K hours of internet video
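
The latent-action inference at the heart of Genie can be sketched as a VQ-style nearest-code lookup. This is a toy stand-in: the real LAM is a learned transformer encoder with a vector-quantization bottleneck, whereas here a simple frame-embedding difference plays the role of the encoder, and the codebook size is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, D = 8, 16                       # small discrete action vocabulary

codebook = rng.normal(size=(N_ACTIONS, D)) # learnable latent-action codes

def infer_latent_action(emb_t, emb_t1):
    # Map a pair of consecutive frame embeddings to a change vector, then
    # quantize it to the nearest codebook entry (VQ-style bottleneck).
    delta = emb_t1 - emb_t                 # toy stand-in for the LAM encoder
    idx = int(np.argmin(np.sum((codebook - delta) ** 2, axis=1)))
    return idx

a_hat = infer_latent_action(rng.normal(size=D), rng.normal(size=D))
```

Because the action vocabulary is small and discrete, a user can later drive the dynamics model interactively by choosing codes, even though no action labels ever existed in the training videos.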

Cosmos (NVIDIA, 2025)

Positioning: Physical AI world foundation model

Core architecture:

  • Cosmos Tokenizer: Compresses video into both continuous and discrete token forms
  • Cosmos World Foundation Model (WFM): Based on both diffusion Transformer and autoregressive Transformer architectures
  • Post-training: Supports fine-tuning for specific robot scenarios

Two architectures:

| Feature | Diffusion WFM | Autoregressive WFM |
|---|---|---|
| Generation method | Diffusion denoising | Token-by-token generation |
| Quality | High | Medium–High |
| Speed | Slower | Faster |
| Controllability | Via conditioning | Via prompting |
| Parameters | 7B–14B | 4B–12B |

Key contributions:

  • Largest-scale open-source physical world video generation model
  • Supports generation from text, images, actions, and other conditions
  • Focused on physical accuracy (gravity, collisions, fluids)

Genesis (2024)

Positioning: Differentiable physics simulator

Core idea: Unlike the above learning-based world models, Genesis takes the differentiable physics simulation route:

\[\frac{\partial s_{t+1}}{\partial a_t} = \frac{\partial f_{\text{physics}}(s_t, a_t)}{\partial a_t}\]

Through a differentiable physics engine, control policies can be optimized directly via gradient backpropagation.

Features:

  • Supports rigid bodies, soft bodies, fluids, cloth, and other physical materials
  • 10–80x faster than traditional physics simulators (GPU parallel)
  • Supports automatic generation of robot training scenarios
  • Differentiability enables direct gradient-based policy optimization
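
A minimal illustration of optimizing an action through differentiable dynamics, using a 1-D point mass. For simplicity we compute the gradient by finite differences; a real engine such as Genesis provides analytic gradients through its physics step, but the optimization loop is the same:

```python
DT, STEPS = 0.1, 10

def rollout(a):
    """Point mass under constant force a; returns final position.
    x_{t+1} = x_t + DT * v_t,  v_{t+1} = v_t + DT * a  (differentiable in a)."""
    x, v = 0.0, 0.0
    for _ in range(STEPS):
        x, v = x + DT * v, v + DT * a
    return x

def grad_rollout(a, eps=1e-6):
    # Central finite difference standing in for the engine's analytic gradient.
    return (rollout(a + eps) - rollout(a - eps)) / (2 * eps)

# Gradient descent on the action so the final position hits a target.
target, a = 1.0, 0.0
for _ in range(200):
    err = rollout(a) - target
    a -= 0.5 * err * grad_rollout(a)   # d/da of 0.5 * err^2
```

Because gradients flow directly from the task error back to the action, no sampling-based search is needed, which is exactly the advantage differentiability buys.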

4. Technical Comparison Analysis

4.1 Latent Space Models vs. Pixel Space Models

```mermaid
graph TB
    subgraph LatentSpaceWM["Latent Space World Model"]
        L1[Observation] --> L2[Encoder]
        L2 --> L3[Latent State z]
        L3 --> L4[Latent Dynamics Model]
        L4 --> L5[Predicted Latent State]
        L5 --> L6[Decoder]
        L6 --> L7[Predicted Observation]
    end

    subgraph PixelSpaceWM["Pixel Space World Model"]
        P1[Observation] --> P2[Video Diffusion / Autoregressive Model]
        P2 --> P3[Direct Prediction of Future Frames]
    end

    style LatentSpaceWM fill:#e3f2fd
    style PixelSpaceWM fill:#fff3e0
```


| Dimension | Latent Space (Dreamer Series) | Pixel Space (UniSim, Cosmos) |
|---|---|---|
| Representatives | Dreamer v3, TD-MPC | UniSim, Genie, Cosmos |
| Compression | Encoder compression | Tokenization / Diffusion |
| Prediction accuracy | Medium (lossy reconstruction) | High (direct pixel prediction) |
| Computational cost | Low | High |
| Training data | Small (a few hours) | Large (tens of thousands of hours) |
| Physical accuracy | Data-dependent | Data-dependent |
| Planning suitability | Direct planning in latent space | Requires extra modules to extract state |

4.2 Learning-Based World Models vs. Differentiable Physics

| Dimension | Learning-Based (Dreamer, UniSim) | Differentiable Physics (Genesis) |
|---|---|---|
| Physical accuracy | Data-driven, may be non-physical | Based on physics equations, accurate |
| Generalization | Difficult to extrapolate beyond data | Physical laws naturally generalize |
| Training data needs | Large amounts of interaction data | No training data needed |
| Flexibility | Can learn arbitrary dynamics | Limited to engine-supported materials |
| Gradient access | Requires reparameterization tricks | Natively differentiable |
| Sim-to-Real gap | Can be reduced with real data | Parameter calibration is key |

5. World Models for Robot Planning

5.1 Model Predictive Control (MPC) with World Models

Given a world model \(p_\theta\), MPC selects the optimal action sequence through online optimization:

\[\mathbf{a}_{t:t+H}^* = \arg\max_{\mathbf{a}_{t:t+H}} \sum_{k=0}^{H} \gamma^k \hat{r}(\hat{s}_{t+k}, a_{t+k})\]
\[\text{s.t.} \quad \hat{s}_{t+k+1} = g_\theta(\hat{s}_{t+k}, a_{t+k}), \quad \hat{s}_t = s_t\]

Common optimization methods:

  • CEM (Cross-Entropy Method): Sample-evaluate-resample loop
  • MPPI: Model Predictive Path Integral, weighted averaging with temperature parameter
  • Gradient optimization: If the world model is differentiable, backpropagate gradients directly
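
A self-contained CEM planner over a toy 1-D world model illustrates the sample-evaluate-refit loop; the dynamics, reward, and all hyperparameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, N, ELITE, ITERS = 5, 200, 20, 5   # horizon, samples, elites, CEM iterations

def dynamics(s, a):                  # toy world model: damped 1-D state
    return 0.9 * s + a

def reward(s, a):                    # prefer s near 1 with small actions
    return -(s - 1.0) ** 2 - 0.01 * a ** 2

def cem_plan(s0):
    mu, std = np.zeros(H), np.ones(H)
    for _ in range(ITERS):
        # 1. Sample candidate action sequences around the current mean.
        seqs = mu + std * rng.normal(size=(N, H))
        # 2. Evaluate each sequence by rolling it out through the model.
        returns = np.empty(N)
        for i, seq in enumerate(seqs):
            s, ret = s0, 0.0
            for a in seq:
                ret += reward(s, a)
                s = dynamics(s, a)
            returns[i] = ret
        # 3. Refit the sampling distribution to the top-ELITE sequences.
        elite = seqs[np.argsort(returns)[-ELITE:]]
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]                     # execute only the first action (MPC)

a0 = cem_plan(s0=0.0)
```

In true MPC fashion, only the first action of the optimized sequence is executed; the plan is then recomputed from the next observed state.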

5.2 Visual Planning

An emerging direction is using video generation models for "visual planning":

  1. Given the current observation and goal description
  2. The world model generates an "imagined" video showing how to reach the goal
  3. Intermediate sub-goals are extracted from the generated video
  4. Low-level policies sequentially execute each sub-goal

The advantage of this approach is leveraging the rich visual priors of video generation models to plan complex long-horizon tasks.
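
The four steps can be sketched as a loop. Every function below is a hypothetical stand-in: in a real system `imagine_video` would be a video generation model and `low_level_policy` a learned controller, while here simple linear interpolation and differencing keep the sketch runnable:

```python
import numpy as np

def imagine_video(obs, goal, n_frames=4):
    # Stand-in: interpolate from the current observation toward the goal.
    return [obs + (goal - obs) * k / n_frames for k in range(1, n_frames + 1)]

def extract_subgoals(frames, stride=2):
    return frames[stride - 1::stride]       # keep every `stride`-th frame

def low_level_policy(obs, subgoal):
    return subgoal - obs                    # toy "move toward subgoal" action

obs = np.zeros(3)
goal = np.ones(3)
for sub in extract_subgoals(imagine_video(obs, goal)):
    obs = obs + low_level_policy(obs, sub)  # execute each sub-goal in turn
```

The structure, imagine once, then track the imagined frames, is what lets the visual prior of the generator handle long-horizon reasoning while the controller only solves short-horizon tracking.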


6. Summary and Outlook

World models in robotics are undergoing a transition from "small models + small data" to "large models + large data":

| Phase | Representatives | Characteristics |
|---|---|---|
| Early | PlaNet, Dreamer v1 | Small RSSM, latent space, task-specific |
| Middle | Dreamer v3, DayDreamer | Generalized, real robot validation |
| Current | UniSim, Cosmos, Genie | Large-scale video generation, physical AI |
| Future? | Unified physical world model | Accurate physics + cross-scene generalization + real-time inference |

Core open questions:

  1. Physical accuracy: Do video generation models truly "understand" physics? Or are they only imitating surface pixel patterns?
  2. Controllability: How to precisely control world model outputs to serve robot planning?
  3. Real-time performance: Large-scale video generation model inference speeds are far from meeting real-time control needs
  4. Evaluation: How to systematically evaluate the physical accuracy and utility of world models?

References:

  • Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination", ICLR 2020
  • Hafner et al., "Mastering Diverse Domains through World Models" (Dreamer v3), 2023
  • Wu et al., "DayDreamer: World Models for Physical Robot Learning", CoRL 2022
  • Yang et al., "Learning Interactive Real-World Simulators" (UniSim), 2023
  • Bruce et al., "Genie: Generative Interactive Environments", ICML 2024
  • NVIDIA, "Cosmos World Foundation Model Platform for Physical AI", 2025
  • Genesis contributors, "Genesis: A Universal and Generative Physics Engine for Robotics and Beyond", 2024
