World Models and Video Generation
World Models are a core capability of embodied intelligence: by internally simulating future states, they enable robots to "imagine" the consequences of their actions. Starting from mathematical definitions, this article reviews the applications of world models in robotics and their convergence with video generation technologies.
Related notes: World Models (General) | Introduction to Robot Foundation Models
1. Mathematical Definition of World Models
1.1 Basic Form
The core of a world model is learning the environment's state transition function:

\[
s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)
\]
where \(s_t\) is the environment state at time \(t\), and \(a_t\) is the agent's action.
However, in practical robot scenarios, we typically cannot directly access the complete environment state \(s_t\); we can only obtain observations \(o_t\) (e.g., images). Therefore, a latent state space must be introduced:

\[
\text{posterior: } z_t \sim q(z_t \mid o_t), \qquad \text{prior: } z_t \sim p(z_t \mid z_{t-1}, a_{t-1})
\]
1.2 Complete Objective
Training a world model typically involves jointly optimizing multiple objectives (a code sketch follows the list of terms):

\[
\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{reward}} + \mathcal{L}_{\text{dyn}}
\]

where:
- \(\mathcal{L}_{\text{recon}} = \|o_t - \hat{o}_t\|^2\): Reconstruction loss
- \(\mathcal{L}_{\text{KL}} = D_{\text{KL}}(q(z_t|o_t) \| p(z_t|z_{t-1}, a_{t-1}))\): Consistency between posterior and prior
- \(\mathcal{L}_{\text{reward}} = \|r_t - \hat{r}_t\|^2\): Reward prediction loss
- \(\mathcal{L}_{\text{dyn}}\): Dynamics prediction accuracy
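As a concrete illustration, here is a minimal PyTorch-style sketch of how these terms might be combined into a single training objective. The module definitions and dimensions are illustrative assumptions, not the implementation of any specific paper; for brevity the dynamics term is folded into the KL between posterior and prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

# Illustrative sizes -- assumptions for the sketch, not values from any paper
OBS_DIM, ACT_DIM, LATENT_DIM = 64, 4, 16

encoder = nn.Linear(OBS_DIM, 2 * LATENT_DIM)                   # posterior q(z_t | o_t)
prior_net = nn.Linear(LATENT_DIM + ACT_DIM, 2 * LATENT_DIM)    # prior p(z_t | z_{t-1}, a_{t-1})
decoder = nn.Linear(LATENT_DIM, OBS_DIM)                       # reconstruction o_hat_t
reward_head = nn.Linear(LATENT_DIM, 1)                         # reward prediction r_hat_t

def world_model_loss(o_prev, a_prev, o_t, r_t):
    # Posterior from the current observation
    mu_q, logstd_q = encoder(o_t).chunk(2, dim=-1)
    posterior = Normal(mu_q, logstd_q.exp())
    z_t = posterior.rsample()                                  # reparameterized sample

    # Prior from the previous latent (here: encoded o_prev) and previous action
    mu_prev, _ = encoder(o_prev).chunk(2, dim=-1)
    mu_p, logstd_p = prior_net(torch.cat([mu_prev, a_prev], -1)).chunk(2, dim=-1)
    prior = Normal(mu_p, logstd_p.exp())

    # The loss terms from the list above; L_dyn is folded into the KL term here,
    # and in practice the KL term is usually down-weighted or clipped.
    recon = F.mse_loss(decoder(z_t), o_t)                      # L_recon
    kl = kl_divergence(posterior, prior).sum(-1).mean()        # L_KL
    reward = F.mse_loss(reward_head(z_t).squeeze(-1), r_t)     # L_reward
    return recon + kl + reward

# Toy batch showing the call signature
o_prev, o_t = torch.randn(8, OBS_DIM), torch.randn(8, OBS_DIM)
a_prev, r_t = torch.randn(8, ACT_DIM), torch.randn(8)
world_model_loss(o_prev, a_prev, o_t, r_t).backward()
```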
1.3 Uses of World Models in Robotics
```mermaid
graph TB
WM["World Model<br/>p(s_t+1 | s_t, a_t)"]
WM --> U1[Model Predictive Control MPC<br/>Online planning of optimal action sequences]
WM --> U2[Imagination Training<br/>Generating virtual experiences inside the model]
WM --> U3[Safety Checking<br/>Predicting whether action consequences are safe]
WM --> U4[Sim2Real<br/>Learned model bridges the simulation gap]
WM --> U5[Video Prediction<br/>Generating future scenes to aid decision-making]
U1 --> A1["Select action sequence that maximizes Σ r(s_t, a_t)"]
U2 --> A2["Dreamer: Train RL policy in latent space"]
U5 --> A3["Visual planning: Imagine goal scene first, then act"]
```
2. Dreamer Series: From Simulation to Real Robots
2.1 Dreamer Architecture Evolution
The Dreamer series is one of the most successful applications of world models in RL. Its core idea is "imagination" training within a learned latent world model:
Dreamer v1 (2020):
- Learns a latent dynamics model (RSSM: Recurrent State-Space Model)
- Trains Actor-Critic through imagined trajectories in latent space
- Dramatically improves sample efficiency
Dreamer v2 (2021):
- Discretized latent states (categorical latents)
- Improved value function estimation
- First world-model agent to surpass strong model-free methods on Atari
Dreamer v3 (2023):
- Universal hyperparameters, no per-task tuning needed
- Symlog encoding handles rewards of different magnitudes
- Excellent performance across 150+ tasks
Mathematical form of RSSM:

\[
\begin{aligned}
\text{Recurrent state: } & h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Prior: } & \hat{z}_t \sim p(\hat{z}_t \mid h_t) \\
\text{Posterior: } & z_t \sim q(z_t \mid h_t, o_t)
\end{aligned}
\]
where \(h_t\) is the deterministic recurrent state and \(z_t\) is the stochastic state.
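To make the RSSM structure concrete, here is a minimal single-step sketch in PyTorch; the GRU cell, Gaussian latents, and all layer sizes are illustrative assumptions rather than the exact Dreamer implementation:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

H_DIM, Z_DIM, A_DIM, OBS_DIM = 32, 16, 4, 64          # illustrative sizes

gru = nn.GRUCell(Z_DIM + A_DIM, H_DIM)                 # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
prior_net = nn.Linear(H_DIM, 2 * Z_DIM)                # prior p(z_t | h_t)
posterior_net = nn.Linear(H_DIM + OBS_DIM, 2 * Z_DIM)  # posterior q(z_t | h_t, o_t)

def rssm_step(h_prev, z_prev, a_prev, o_t=None):
    # Deterministic recurrent state
    h_t = gru(torch.cat([z_prev, a_prev], -1), h_prev)
    # Prior over the stochastic state (enough for imagination, no observation needed)
    mu_p, logstd_p = prior_net(h_t).chunk(2, -1)
    prior = Normal(mu_p, logstd_p.exp())
    if o_t is None:
        return h_t, prior.rsample()                    # imagination rollout
    # Posterior when an observation is available (used during training)
    mu_q, logstd_q = posterior_net(torch.cat([h_t, o_t], -1)).chunk(2, -1)
    posterior = Normal(mu_q, logstd_q.exp())
    return h_t, posterior.rsample()

# One observed step followed by one purely imagined step
h, z = torch.zeros(1, H_DIM), torch.zeros(1, Z_DIM)
a, o = torch.randn(1, A_DIM), torch.randn(1, OBS_DIM)
h, z = rssm_step(h, z, a, o)   # posterior update from a real observation
h, z = rssm_step(h, z, a)      # imagination: prior only
```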
2.2 DayDreamer (2022): Dreamer on Real Robots
DayDreamer (Wu et al., 2022) was the first work to apply Dreamer successfully on real robots:
| Platform | Task | Training Time | Notes |
|---|---|---|---|
| A1 Quadruped | Standing, walking | 1 hour of real interaction | Learning gaits from scratch |
| UR5 Robot Arm | Object manipulation | ~30 minutes | Tabletop grasping |
| Shadow Hand | Dexterous rotation | ~40 minutes | In-hand manipulation |
Key implementation details (the training loop is sketched in code below):
1. Collect a small amount of data on the real robot
2. Train the RSSM world model
3. Train the policy through imagination within the world model (no real interaction needed)
4. Deploy the policy on the real robot and collect more data
5. Repeat the above process
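A hedged sketch of this loop, assuming placeholder `Robot`, `WorldModel`, and `Policy` interfaces; these stubs are illustrative stand-ins, not the DayDreamer codebase:

```python
import random

class Robot:                                  # stand-in for the real-robot interface
    def step(self, action):
        return {"obs": random.random(), "reward": random.random()}

class WorldModel:                             # stand-in for the RSSM
    def fit(self, buffer): pass
    def imagine(self, policy, horizon=15):
        return [random.random() for _ in range(horizon)]

class Policy:                                 # stand-in for the actor-critic
    def act(self, obs): return random.random()
    def update_from_imagination(self, rollout): pass

robot, world_model, policy = Robot(), WorldModel(), Policy()
replay_buffer = []

for iteration in range(3):                    # on hardware this loop runs continuously
    # 1) Collect a small amount of real data with the current policy
    obs = {"obs": 0.0}
    for _ in range(100):
        obs = robot.step(policy.act(obs["obs"]))
        replay_buffer.append(obs)
    # 2) Train the RSSM world model on everything collected so far
    world_model.fit(replay_buffer)
    # 3) Train the policy purely by imagination inside the model (no real interaction)
    for _ in range(50):
        policy.update_from_imagination(world_model.imagine(policy, horizon=15))
    # 4) The updated policy is deployed again in the next iteration of the loop
```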
3. Video Generation as World Models
3.1 Predicting the Future from Pixel Space
A major recent trend is treating video generation models as world models. The core insight: If a model can accurately predict future video frames given actions, then it has implicitly learned the physical laws of the environment.
Formal expression:

\[
p(o_{t+1:t+H} \mid o_{1:t}, a_{t:t+H-1})
\]
where \(o\) are image frames, \(a\) are actions, and \(H\) is the prediction time horizon.
3.2 Key Models
UniSim (Google DeepMind, 2023)
Positioning: Universal interactive simulator
Core idea: Use a video diffusion model to simulate the dynamic changes of any environment, supporting multiple types of "action" inputs:
- Robot end-effector motion
- Free-form text descriptions ("open the drawer")
- Camera motion trajectories
Architecture: Based on a Video Diffusion Model, conditioned on previous frames and action descriptions
Applications:
- Training robot policies (simulator replacement)
- Data augmentation: Generating training data under different conditions
- Evaluation: Testing policy robustness within the model
Genie (Google DeepMind, 2024)
Positioning: Generative interactive environment
Core idea: Unsupervised learning of controllable interactive environments from internet videos
Architecture:
- Video Tokenizer: Encodes video frames into discrete tokens
- Latent Action Model (LAM): Infers latent actions from consecutive frames
- Dynamics Model: Given current frame and latent action, predicts the next frame
```mermaid
graph LR
V1[Video Frame t] --> VT[Video Tokenizer]
V2[Video Frame t+1] --> VT
VT --> LAM[Latent Action Model]
LAM --> LA[Latent Action â_t]
VT --> DM[Dynamics Model]
LA --> DM
DM --> VT2[Predicted Frame t+1 Tokens]
VT2 --> DEC[Decoder]
DEC --> PRED[Predicted Video Frame]
```
Key innovation: No action annotations required! Learns an interactive world model from pure video data.
Scale: 11B parameters, trained on 200K hours of internet video
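To make the latent-action idea concrete, here is a toy sketch that quantizes the change between consecutive frame embeddings against a small codebook of latent actions. All module names and dimensions are assumptions; Genie itself uses a spatiotemporal video tokenizer and a VQ objective rather than this nearest-neighbor shortcut.

```python
import torch
import torch.nn as nn

FRAME_DIM, CODE_DIM, NUM_LATENT_ACTIONS = 256, 32, 8          # illustrative sizes

frame_encoder = nn.Linear(FRAME_DIM, CODE_DIM)                 # stand-in for the video tokenizer
action_codebook = nn.Embedding(NUM_LATENT_ACTIONS, CODE_DIM)   # small discrete set of latent actions
dynamics = nn.Linear(2 * CODE_DIM, CODE_DIM)                   # frame embedding + action -> next embedding

def infer_latent_action(frame_t, frame_t1):
    """Assign the change between consecutive frames to the nearest codebook entry."""
    delta = frame_encoder(frame_t1) - frame_encoder(frame_t)
    dists = torch.cdist(delta, action_codebook.weight)         # (batch, NUM_LATENT_ACTIONS)
    return dists.argmin(dim=-1)                                # index of the inferred latent action

def predict_next(frame_t, latent_action_idx):
    """Dynamics model: current frame embedding + latent action -> next-frame embedding."""
    a = action_codebook(latent_action_idx)
    return dynamics(torch.cat([frame_encoder(frame_t), a], dim=-1))

frames_t, frames_t1 = torch.randn(4, FRAME_DIM), torch.randn(4, FRAME_DIM)
a_hat = infer_latent_action(frames_t, frames_t1)               # "actions" labeled without supervision
pred_next = predict_next(frames_t, a_hat)                       # predicted next-frame embedding
```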
Cosmos (NVIDIA, 2025)
Positioning: Physical AI world foundation model
Core architecture:
- Cosmos Tokenizer: Compresses video into both continuous and discrete token forms
- Cosmos World Foundation Model (WFM): Based on both diffusion Transformer and autoregressive Transformer architectures
- Post-training: Supports fine-tuning for specific robot scenarios
Two architectures:
| Feature | Diffusion WFM | Autoregressive WFM |
|---|---|---|
| Generation method | Diffusion denoising | Token-by-token generation |
| Quality | High | Medium–High |
| Speed | Slower | Faster |
| Controllability | Via conditioning | Via prompting |
| Parameters | 7B–14B | 4B–12B |
Key contributions:
- Largest-scale open-source physical world video generation model
- Supports generation from text, images, actions, and other conditions
- Focused on physical accuracy (gravity, collisions, fluids)
Genesis (2024)
Positioning: Differentiable physics simulator
Core idea: Unlike the above learning-based world models, Genesis takes the differentiable physics simulation route:
Through a differentiable physics engine, control policies can be optimized directly via gradient backpropagation.
Features:
- Supports rigid bodies, soft bodies, fluids, cloth, and other physical materials
- 10–80x faster than existing GPU-accelerated simulators
- Supports automatic generation of robot training scenarios
- Differentiability enables direct gradient-based policy optimization
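The last point can be illustrated with a toy 1-D point mass written in PyTorch. This is only a sketch of backpropagating a task loss through differentiable dynamics; it does not use the Genesis API, whose engine exposes gradients through far richer physics.

```python
import torch

# Toy differentiable "physics": a 1-D point mass (unit mass) pushed by per-step forces.
dt, horizon, target = 0.1, 20, 1.0
forces = torch.zeros(horizon, requires_grad=True)              # the "policy" being optimized
optimizer = torch.optim.Adam([forces], lr=0.1)

for step in range(200):
    pos, vel = torch.tensor(0.0), torch.tensor(0.0)
    for t in range(horizon):                                   # differentiable rollout
        vel = vel + forces[t] * dt                             # a = F / m with m = 1
        pos = pos + vel * dt
    loss = (pos - target) ** 2 + 1e-3 * (forces ** 2).sum()    # reach the target with little effort
    optimizer.zero_grad()
    loss.backward()                                            # gradients flow through the dynamics
    optimizer.step()

print(f"final position: {pos.item():.3f} (target {target})")
```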
4. Technical Comparison Analysis
4.1 Latent Space Models vs. Pixel Space Models
```mermaid
graph TB
subgraph LatentSpaceWM["Latent Space World Model"]
L1[Observation] --> L2[Encoder]
L2 --> L3[Latent State z]
L3 --> L4[Latent Dynamics Model]
L4 --> L5[Predicted Latent State]
L5 --> L6[Decoder]
L6 --> L7[Predicted Observation]
end
subgraph PixelSpaceWM["Pixel Space World Model"]
P1[Observation] --> P2[Video Diffusion / Autoregressive Model]
P2 --> P3[Direct Prediction of Future Frames]
end
style LatentSpaceWM fill:#e3f2fd
style PixelSpaceWM fill:#fff3e0
```
| Dimension | Latent Space (Dreamer Series) | Pixel Space (UniSim, Cosmos) |
|---|---|---|
| Representatives | Dreamer v3, TD-MPC | UniSim, Genie, Cosmos |
| Compression | Encoder compression | Tokenization / Diffusion |
| Prediction accuracy | Medium (lossy reconstruction) | High (direct pixel prediction) |
| Computational cost | Low | High |
| Training data | Small (a few hours) | Large (tens of thousands of hours) |
| Physical accuracy | Data-dependent | Data-dependent |
| Plannability | Direct planning in latent space | Requires additional modules to extract planning-relevant state |
4.2 Learning-Based World Models vs. Differentiable Physics
| Dimension | Learning-Based (Dreamer, UniSim) | Differentiable Physics (Genesis) |
|---|---|---|
| Physical accuracy | Data-driven, may be non-physical | Based on physics equations, accurate |
| Generalization | Difficult to extrapolate beyond data | Physical laws naturally generalize |
| Training data needs | Large amounts of interaction data | No training data needed |
| Flexibility | Can learn arbitrary dynamics | Limited to physics engine-supported materials |
| Gradient access | Requires reparameterization tricks | Natively differentiable |
| Sim-to-Real gap | Data-driven can reduce it | Parameter calibration is key |
5. World Models for Robot Planning
5.1 Model Predictive Control (MPC) with World Models
Given a world model \(p_\theta\), MPC selects the optimal action sequence through online optimization:

\[
a_{t:t+H}^{*} = \arg\max_{a_{t:t+H}} \; \mathbb{E}_{p_\theta}\!\left[\sum_{k=t}^{t+H} r(s_k, a_k)\right]
\]
Common optimization methods (a CEM-based sketch follows this list):
- CEM (Cross-Entropy Method): Sample-evaluate-resample loop
- MPPI: Model Predictive Path Integral, weighted averaging with temperature parameter
- Gradient optimization: If the world model is differentiable, backpropagate gradients directly
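A minimal sketch of CEM-based MPC on top of a learned world model; `world_model_rollout` is a hypothetical stand-in for evaluating an action sequence under \(p_\theta\), and all hyperparameters are illustrative:

```python
import numpy as np

def world_model_rollout(state, action_seq):
    """Hypothetical stand-in for p_theta: predicted return of an action sequence."""
    return -np.sum((state + np.cumsum(action_seq) - 1.0) ** 2)   # placeholder dynamics/reward

def cem_plan(state, horizon=10, pop=64, elites=8, iters=5):
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        # 1) Sample candidate action sequences from the current distribution
        candidates = np.random.randn(pop, horizon) * std + mean
        # 2) Evaluate every sequence inside the world model
        returns = np.array([world_model_rollout(state, c) for c in candidates])
        # 3) Refit the sampling distribution to the elite sequences
        elite = candidates[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]   # execute only the first action, then replan (receding horizon)

action = cem_plan(state=0.0)
```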
5.2 Visual Planning
An emerging direction is using video generation models for "visual planning":
1. Given the current observation and goal description
2. The world model generates an "imagined" video showing how to reach the goal
3. Intermediate sub-goals are extracted from the generated video
4. Low-level policies sequentially execute each sub-goal
The advantage of this approach is leveraging the rich visual priors of video generation models to plan complex long-horizon tasks.
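A hedged sketch of this pipeline, in which every function is a hypothetical placeholder standing in for a video generation model, a sub-goal extractor, and a goal-conditioned low-level policy:

```python
def generate_imagined_video(observation, goal_text, num_frames=32):
    return [observation] * num_frames               # placeholder for a video generation model

def extract_subgoals(video, every=8):
    return video[every - 1::every]                  # keyframes serve as intermediate sub-goals

def goal_conditioned_policy(observation, subgoal):
    return "action"                                 # placeholder for a low-level policy

def visual_plan_and_execute(observation, goal_text):
    video = generate_imagined_video(observation, goal_text)     # steps 1-2: imagine reaching the goal
    for subgoal in extract_subgoals(video):                      # step 3: pull out sub-goals
        action = goal_conditioned_policy(observation, subgoal)   # step 4: execute each sub-goal
        # on a real system: observation = env.step(action)
    return action

visual_plan_and_execute("current_camera_image", "open the drawer")
```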
6. Summary and Outlook
World models in robotics are undergoing a transition from "small models + small data" to "large models + large data":
| Phase | Representatives | Characteristics |
|---|---|---|
| Early | PlaNet, Dreamer v1 | Small RSSM, latent space, task-specific |
| Middle | Dreamer v3, DayDreamer | Generalized, real robot validation |
| Current | UniSim, Cosmos, Genie | Large-scale video generation, physical AI |
| Future? | Unified physical world model | Accurate physics + cross-scene generalization + real-time inference |
Core open questions:
- Physical accuracy: Do video generation models truly "understand" physics? Or are they only imitating surface pixel patterns?
- Controllability: How to precisely control world model outputs to serve robot planning?
- Real-time performance: Large-scale video generation model inference speeds are far from meeting real-time control needs
- Evaluation: How to systematically evaluate the physical accuracy and utility of world models?
References:
- Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination", ICLR 2020
- Hafner et al., "Mastering Diverse Domains through World Models" (Dreamer v3), 2023
- Hafner et al., "DayDreamer: World Models for Physical Robot Learning", CoRL 2022
- Yang et al., "Learning Interactive Real-World Simulators" (UniSim), 2023
- Bruce et al., "Genie: Generative Interactive Environments", ICML 2024
- NVIDIA, "Cosmos World Foundation Model Platform for Physical AI", 2025
- Genesis contributors, "Genesis: A Universal and Generative Physics Engine for Robotics and Beyond", 2024