World Models and Video Generation
World Models are a core capability of embodied intelligence: by internally simulating future states, they enable robots to "imagine" the consequences of their actions. Starting from mathematical definitions, this article reviews the applications of world models in robotics and their convergence with video generation technologies.
Related notes: World Models (General) | Introduction to Robot Foundation Models
1. Mathematical Definition of World Models
1.1 Basic Form
The core of a world model is learning the environment's state transition function:

\[
s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)
\]
where \(s_t\) is the environment state at time \(t\), and \(a_t\) is the agent's action.
However, in practical robot scenarios, we typically cannot directly access the complete environment state \(s_t\); we can only obtain observations \(o_t\) (e.g., images). Therefore, a latent state space must be introduced:

\[
\text{posterior: } z_t \sim q(z_t \mid o_t), \qquad \text{prior: } z_t \sim p(z_t \mid z_{t-1}, a_{t-1})
\]
1.2 Complete Objective
Training a world model typically involves jointly optimizing multiple objectives (a code sketch follows the list of terms):

\[
\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{reward}} + \mathcal{L}_{\text{dyn}}
\]

where:
- \(\mathcal{L}_{\text{recon}} = \|o_t - \hat{o}_t\|^2\): Reconstruction loss
- \(\mathcal{L}_{\text{KL}} = D_{\text{KL}}(q(z_t|o_t) \| p(z_t|z_{t-1}, a_{t-1}))\): Consistency between posterior and prior
- \(\mathcal{L}_{\text{reward}} = \|r_t - \hat{r}_t\|^2\): Reward prediction loss
- \(\mathcal{L}_{\text{dyn}}\): Dynamics prediction accuracy
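As a concrete illustration, here is a minimal PyTorch-style sketch of how these terms might be combined into a single training objective. The module definitions and dimensions are illustrative assumptions, not the implementation of any specific paper; for brevity the dynamics term is folded into the KL between posterior and prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

# Illustrative sizes -- assumptions for the sketch, not values from any paper
OBS_DIM, ACT_DIM, LATENT_DIM = 64, 4, 16

encoder = nn.Linear(OBS_DIM, 2 * LATENT_DIM)                   # posterior q(z_t | o_t)
prior_net = nn.Linear(LATENT_DIM + ACT_DIM, 2 * LATENT_DIM)    # prior p(z_t | z_{t-1}, a_{t-1})
decoder = nn.Linear(LATENT_DIM, OBS_DIM)                       # reconstruction o_hat_t
reward_head = nn.Linear(LATENT_DIM, 1)                         # reward prediction r_hat_t

def world_model_loss(o_prev, a_prev, o_t, r_t):
    # Posterior from the current observation
    mu_q, logstd_q = encoder(o_t).chunk(2, dim=-1)
    posterior = Normal(mu_q, logstd_q.exp())
    z_t = posterior.rsample()                                  # reparameterized sample

    # Prior from the previous latent (here: encoded o_prev) and previous action
    mu_prev, _ = encoder(o_prev).chunk(2, dim=-1)
    mu_p, logstd_p = prior_net(torch.cat([mu_prev, a_prev], -1)).chunk(2, dim=-1)
    prior = Normal(mu_p, logstd_p.exp())

    # The loss terms from the list above; L_dyn is folded into the KL term here,
    # and in practice the KL term is usually down-weighted or clipped.
    recon = F.mse_loss(decoder(z_t), o_t)                      # L_recon
    kl = kl_divergence(posterior, prior).sum(-1).mean()        # L_KL
    reward = F.mse_loss(reward_head(z_t).squeeze(-1), r_t)     # L_reward
    return recon + kl + reward

# Toy batch showing the call signature
o_prev, o_t = torch.randn(8, OBS_DIM), torch.randn(8, OBS_DIM)
a_prev, r_t = torch.randn(8, ACT_DIM), torch.randn(8)
world_model_loss(o_prev, a_prev, o_t, r_t).backward()
```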
1.3 Uses of World Models in Robotics
```mermaid
graph TB
WM["World Model<br/>p(s_t+1 | s_t, a_t)"]
WM --> U1[Model Predictive Control MPC<br/>Online planning of optimal action sequences]
WM --> U2[Imagination Training<br/>Generating virtual experiences inside the model]
WM --> U3[Safety Checking<br/>Predicting whether action consequences are safe]
WM --> U4[Sim2Real<br/>Learned model bridges the simulation gap]
WM --> U5[Video Prediction<br/>Generating future scenes to aid decision-making]
U1 --> A1["Select action sequence that maximizes Σ r(s_t, a_t)"]
U2 --> A2["Dreamer: Train RL policy in latent space"]
U5 --> A3["Visual planning: Imagine goal scene first, then act"]
```
2. Dreamer Series: From Simulation to Real Robots
2.1 Dreamer Architecture Evolution
The Dreamer series is one of the most successful applications of world models in RL. Its core idea is "imagination" training within a learned latent world model:
Dreamer v1 (2020):
- Learns a latent dynamics model (RSSM: Recurrent State-Space Model)
- Trains Actor-Critic through imagined trajectories in latent space
- Dramatically improves sample efficiency
Dreamer v2 (2021):
- Discretized latent states (categorical latents)
- Improved value function estimation
- First world-model agent to surpass strong model-free methods on Atari
Dreamer v3 (2023):
- Universal hyperparameters, no per-task tuning needed
- Symlog encoding handles rewards of different magnitudes
- Excellent performance across 150+ tasks
Mathematical form of RSSM:

\[
\begin{aligned}
\text{Recurrent state: } & h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Prior: } & \hat{z}_t \sim p(\hat{z}_t \mid h_t) \\
\text{Posterior: } & z_t \sim q(z_t \mid h_t, o_t)
\end{aligned}
\]
where \(h_t\) is the deterministic recurrent state and \(z_t\) is the stochastic state.
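To make the RSSM structure concrete, here is a minimal single-step sketch in PyTorch; the GRU cell, Gaussian latents, and all layer sizes are illustrative assumptions rather than the exact Dreamer implementation:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

H_DIM, Z_DIM, A_DIM, OBS_DIM = 32, 16, 4, 64          # illustrative sizes

gru = nn.GRUCell(Z_DIM + A_DIM, H_DIM)                 # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
prior_net = nn.Linear(H_DIM, 2 * Z_DIM)                # prior p(z_t | h_t)
posterior_net = nn.Linear(H_DIM + OBS_DIM, 2 * Z_DIM)  # posterior q(z_t | h_t, o_t)

def rssm_step(h_prev, z_prev, a_prev, o_t=None):
    # Deterministic recurrent state
    h_t = gru(torch.cat([z_prev, a_prev], -1), h_prev)
    # Prior over the stochastic state (enough for imagination, no observation needed)
    mu_p, logstd_p = prior_net(h_t).chunk(2, -1)
    prior = Normal(mu_p, logstd_p.exp())
    if o_t is None:
        return h_t, prior.rsample()                    # imagination rollout
    # Posterior when an observation is available (used during training)
    mu_q, logstd_q = posterior_net(torch.cat([h_t, o_t], -1)).chunk(2, -1)
    posterior = Normal(mu_q, logstd_q.exp())
    return h_t, posterior.rsample()

# One observed step followed by one purely imagined step
h, z = torch.zeros(1, H_DIM), torch.zeros(1, Z_DIM)
a, o = torch.randn(1, A_DIM), torch.randn(1, OBS_DIM)
h, z = rssm_step(h, z, a, o)   # posterior update from a real observation
h, z = rssm_step(h, z, a)      # imagination: prior only
```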
2.2 DayDreamer (2022): Dreamer on Real Robots
DayDreamer (Wu et al., 2022) was the first work to apply Dreamer successfully on real robots:
| Platform | Task | Training Time | Notes |
|---|---|---|---|
| A1 Quadruped | Standing, walking | 1 hour of real interaction | Learning gaits from scratch |
| UR5 Robot Arm | Object manipulation | ~30 minutes | Tabletop grasping |
| Shadow Hand | Dexterous rotation | ~40 minutes | In-hand manipulation |
Key implementation details (the training loop is sketched in code below):
1. Collect a small amount of data on the real robot
2. Train the RSSM world model
3. Train the policy through imagination within the world model (no real interaction needed)
4. Deploy the policy on the real robot and collect more data
5. Repeat the above process
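A hedged sketch of this loop, assuming placeholder `Robot`, `WorldModel`, and `Policy` interfaces; these stubs are illustrative stand-ins, not the DayDreamer codebase:

```python
import random

class Robot:                                  # stand-in for the real-robot interface
    def step(self, action):
        return {"obs": random.random(), "reward": random.random()}

class WorldModel:                             # stand-in for the RSSM
    def fit(self, buffer): pass
    def imagine(self, policy, horizon=15):
        return [random.random() for _ in range(horizon)]

class Policy:                                 # stand-in for the actor-critic
    def act(self, obs): return random.random()
    def update_from_imagination(self, rollout): pass

robot, world_model, policy = Robot(), WorldModel(), Policy()
replay_buffer = []

for iteration in range(3):                    # on hardware this loop runs continuously
    # 1) Collect a small amount of real data with the current policy
    obs = {"obs": 0.0}
    for _ in range(100):
        obs = robot.step(policy.act(obs["obs"]))
        replay_buffer.append(obs)
    # 2) Train the RSSM world model on everything collected so far
    world_model.fit(replay_buffer)
    # 3) Train the policy purely by imagination inside the model (no real interaction)
    for _ in range(50):
        policy.update_from_imagination(world_model.imagine(policy, horizon=15))
    # 4) The updated policy is deployed again in the next iteration of the loop
```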
3. Video Generation as World Models
3.1 Predicting the Future from Pixel Space
A major recent trend is treating video generation models as world models. The core insight: If a model can accurately predict future video frames given actions, then it has implicitly learned the physical laws of the environment.
Formal expression:

\[
p(o_{t+1:t+H} \mid o_{1:t}, a_{t:t+H-1})
\]
where \(o\) are image frames, \(a\) are actions, and \(H\) is the prediction time horizon.
3.2 Key Models
UniSim (Google DeepMind, 2023)
Positioning: Universal interactive simulator
Core idea: Use a video diffusion model to simulate the dynamic changes of any environment, supporting multiple types of "action" inputs:
- Robot end-effector motion
- Free-form text descriptions ("open the drawer")
- Camera motion trajectories
Architecture: Based on a Video Diffusion Model, conditioned on previous frames and action descriptions
Applications:
- Training robot policies (simulator replacement)
- Data augmentation: Generating training data under different conditions
- Evaluation: Testing policy robustness within the model
Genie (Google DeepMind, 2024)
Positioning: Generative interactive environment
Core idea: Unsupervised learning of controllable interactive environments from internet videos
Architecture:
- Video Tokenizer: Encodes video frames into discrete tokens
- Latent Action Model (LAM): Infers latent actions from consecutive frames
- Dynamics Model: Given current frame and latent action, predicts the next frame
```mermaid
graph LR
V1[Video Frame t] --> VT[Video Tokenizer]
V2[Video Frame t+1] --> VT
VT --> LAM[Latent Action Model]
LAM --> LA[Latent Action â_t]
VT --> DM[Dynamics Model]
LA --> DM
DM --> VT2[Predicted Frame t+1 Tokens]
VT2 --> DEC[Decoder]
DEC --> PRED[Predicted Video Frame]
```
Key innovation: No action annotations required! Learns an interactive world model from pure video data.
Scale: 11B parameters, trained on 200K hours of internet video
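To make the latent-action idea concrete, here is a toy sketch that quantizes the change between consecutive frame embeddings against a small codebook of latent actions. All module names and dimensions are assumptions; Genie itself uses a spatiotemporal video tokenizer and a VQ objective rather than this nearest-neighbor shortcut.

```python
import torch
import torch.nn as nn

FRAME_DIM, CODE_DIM, NUM_LATENT_ACTIONS = 256, 32, 8          # illustrative sizes

frame_encoder = nn.Linear(FRAME_DIM, CODE_DIM)                 # stand-in for the video tokenizer
action_codebook = nn.Embedding(NUM_LATENT_ACTIONS, CODE_DIM)   # small discrete set of latent actions
dynamics = nn.Linear(2 * CODE_DIM, CODE_DIM)                   # frame embedding + action -> next embedding

def infer_latent_action(frame_t, frame_t1):
    """Assign the change between consecutive frames to the nearest codebook entry."""
    delta = frame_encoder(frame_t1) - frame_encoder(frame_t)
    dists = torch.cdist(delta, action_codebook.weight)         # (batch, NUM_LATENT_ACTIONS)
    return dists.argmin(dim=-1)                                # index of the inferred latent action

def predict_next(frame_t, latent_action_idx):
    """Dynamics model: current frame embedding + latent action -> next-frame embedding."""
    a = action_codebook(latent_action_idx)
    return dynamics(torch.cat([frame_encoder(frame_t), a], dim=-1))

frames_t, frames_t1 = torch.randn(4, FRAME_DIM), torch.randn(4, FRAME_DIM)
a_hat = infer_latent_action(frames_t, frames_t1)               # "actions" labeled without supervision
pred_next = predict_next(frames_t, a_hat)                       # predicted next-frame embedding
```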
Cosmos (NVIDIA, 2025)
Positioning: Physical AI world foundation model
Core architecture:
- Cosmos Tokenizer: Compresses video into both continuous and discrete token forms
- Cosmos World Foundation Model (WFM): Based on both diffusion Transformer and autoregressive Transformer architectures
- Post-training: Supports fine-tuning for specific robot scenarios
Two architectures:
| Feature | Diffusion WFM | Autoregressive WFM |
|---|---|---|
| Generation method | Diffusion denoising | Token-by-token generation |
| Quality | High | Medium–High |
| Speed | Slower | Faster |
| Controllability | Via conditioning | Via prompting |
| Parameters | 7B–14B | 4B–12B |
Key contributions:
- Largest-scale open-source physical world video generation model
- Supports generation from text, images, actions, and other conditions
- Focused on physical accuracy (gravity, collisions, fluids)
Genesis (2024)
Positioning: Differentiable physics simulator
Core idea: Unlike the above learning-based world models, Genesis takes the differentiable physics simulation route:
Through a differentiable physics engine, control policies can be optimized directly via gradient backpropagation.
Features:
- Supports rigid bodies, soft bodies, fluids, cloth, and other physical materials
- 10–80x faster than existing GPU-accelerated simulators
- Supports automatic generation of robot training scenarios
- Differentiability enables direct gradient-based policy optimization
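The last point can be illustrated with a toy 1-D point mass written in PyTorch. This is only a sketch of backpropagating a task loss through differentiable dynamics; it does not use the Genesis API, whose engine exposes gradients through far richer physics.

```python
import torch

# Toy differentiable "physics": a 1-D point mass (unit mass) pushed by per-step forces.
dt, horizon, target = 0.1, 20, 1.0
forces = torch.zeros(horizon, requires_grad=True)              # the "policy" being optimized
optimizer = torch.optim.Adam([forces], lr=0.1)

for step in range(200):
    pos, vel = torch.tensor(0.0), torch.tensor(0.0)
    for t in range(horizon):                                   # differentiable rollout
        vel = vel + forces[t] * dt                             # a = F / m with m = 1
        pos = pos + vel * dt
    loss = (pos - target) ** 2 + 1e-3 * (forces ** 2).sum()    # reach the target with little effort
    optimizer.zero_grad()
    loss.backward()                                            # gradients flow through the dynamics
    optimizer.step()

print(f"final position: {pos.item():.3f} (target {target})")
```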
4. Technical Comparison Analysis
4.1 Latent Space Models vs. Pixel Space Models
```mermaid
graph TB
subgraph LatentSpaceWM["Latent Space World Model"]
L1[Observation] --> L2[Encoder]
L2 --> L3[Latent State z]
L3 --> L4[Latent Dynamics Model]
L4 --> L5[Predicted Latent State]
L5 --> L6[Decoder]
L6 --> L7[Predicted Observation]
end
subgraph PixelSpaceWM["Pixel Space World Model"]
P1[Observation] --> P2[Video Diffusion / Autoregressive Model]
P2 --> P3[Direct Prediction of Future Frames]
end
style LatentSpaceWM fill:#e3f2fd
style PixelSpaceWM fill:#fff3e0
```
| Dimension | Latent Space (Dreamer Series) | Pixel Space (UniSim, Cosmos) |
|---|---|---|
| Representatives | Dreamer v3, TD-MPC | UniSim, Genie, Cosmos |
| Compression | Encoder compression | Tokenization / Diffusion |
| Prediction accuracy | Medium (lossy reconstruction) | High (direct pixel prediction) |
| Computational cost | Low | High |
| Training data | Small (a few hours) | Large (tens of thousands of hours) |
| Physical accuracy | Data-dependent | Data-dependent |
| Plannability | Direct planning in latent space | Requires additional modules to extract planning-relevant state |
4.2 Learning-Based World Models vs. Differentiable Physics
| Dimension | Learning-Based (Dreamer, UniSim) | Differentiable Physics (Genesis) |
|---|---|---|
| Physical accuracy | Data-driven, may be non-physical | Based on physics equations, accurate |
| Generalization | Difficult to extrapolate beyond data | Physical laws naturally generalize |
| Training data needs | Large amounts of interaction data | No training data needed |
| Flexibility | Can learn arbitrary dynamics | Limited to physics engine-supported materials |
| Gradient access | Requires reparameterization tricks | Natively differentiable |
| Sim-to-Real gap | Data-driven can reduce it | Parameter calibration is key |
5. World Models for Robot Planning
5.1 Model Predictive Control (MPC) with World Models
Given a world model \(p_\theta\), MPC selects the optimal action sequence through online optimization:

\[
a_{t:t+H}^{*} = \arg\max_{a_{t:t+H}} \; \mathbb{E}_{p_\theta}\!\left[\sum_{k=t}^{t+H} r(s_k, a_k)\right]
\]
Common optimization methods (a CEM-based sketch follows this list):
- CEM (Cross-Entropy Method): Sample-evaluate-resample loop
- MPPI: Model Predictive Path Integral, weighted averaging with temperature parameter
- Gradient optimization: If the world model is differentiable, backpropagate gradients directly
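A minimal sketch of CEM-based MPC on top of a learned world model; `world_model_rollout` is a hypothetical stand-in for evaluating an action sequence under \(p_\theta\), and all hyperparameters are illustrative:

```python
import numpy as np

def world_model_rollout(state, action_seq):
    """Hypothetical stand-in for p_theta: predicted return of an action sequence."""
    return -np.sum((state + np.cumsum(action_seq) - 1.0) ** 2)   # placeholder dynamics/reward

def cem_plan(state, horizon=10, pop=64, elites=8, iters=5):
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        # 1) Sample candidate action sequences from the current distribution
        candidates = np.random.randn(pop, horizon) * std + mean
        # 2) Evaluate every sequence inside the world model
        returns = np.array([world_model_rollout(state, c) for c in candidates])
        # 3) Refit the sampling distribution to the elite sequences
        elite = candidates[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]   # execute only the first action, then replan (receding horizon)

action = cem_plan(state=0.0)
```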
5.2 Visual Planning
An emerging direction is using video generation models for "visual planning":
1. Given the current observation and goal description
2. The world model generates an "imagined" video showing how to reach the goal
3. Intermediate sub-goals are extracted from the generated video
4. Low-level policies sequentially execute each sub-goal
The advantage of this approach is leveraging the rich visual priors of video generation models to plan complex long-horizon tasks.
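A hedged sketch of this pipeline, in which every function is a hypothetical placeholder standing in for a video generation model, a sub-goal extractor, and a goal-conditioned low-level policy:

```python
def generate_imagined_video(observation, goal_text, num_frames=32):
    return [observation] * num_frames               # placeholder for a video generation model

def extract_subgoals(video, every=8):
    return video[every - 1::every]                  # keyframes serve as intermediate sub-goals

def goal_conditioned_policy(observation, subgoal):
    return "action"                                 # placeholder for a low-level policy

def visual_plan_and_execute(observation, goal_text):
    video = generate_imagined_video(observation, goal_text)     # steps 1-2: imagine reaching the goal
    for subgoal in extract_subgoals(video):                      # step 3: pull out sub-goals
        action = goal_conditioned_policy(observation, subgoal)   # step 4: execute each sub-goal
        # on a real system: observation = env.step(action)
    return action

visual_plan_and_execute("current_camera_image", "open the drawer")
```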
6. Summary and Outlook
World models in robotics are undergoing a transition from "small models + small data" to "large models + large data":
| Phase | Representatives | Characteristics |
|---|---|---|
| Early | PlaNet, Dreamer v1 | Small RSSM, latent space, task-specific |
| Middle | Dreamer v3, DayDreamer | Generalized, real robot validation |
| Current | UniSim, Cosmos, Genie | Large-scale video generation, physical AI |
| Future? | Unified physical world model | Accurate physics + cross-scene generalization + real-time inference |
Core open questions:
- Physical accuracy: Do video generation models truly "understand" physics? Or are they only imitating surface pixel patterns?
- Controllability: How to precisely control world model outputs to serve robot planning?
- Real-time performance: Large-scale video generation model inference speeds are far from meeting real-time control needs
- Evaluation: How to systematically evaluate the physical accuracy and utility of world models?
References:
- Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination", ICLR 2020
- Hafner et al., "Mastering Diverse Domains through World Models" (Dreamer v3), 2023
- Hafner et al., "DayDreamer: World Models for Physical Robot Learning", CoRL 2022
- Yang et al., "Learning Interactive Real-World Simulators" (UniSim), 2023
- Bruce et al., "Genie: Generative Interactive Environments", ICML 2024
- NVIDIA, "Cosmos World Foundation Model Platform for Physical AI", 2025
- Genesis contributors, "Genesis: A Universal and Generative Physics Engine for Robotics and Beyond", 2024