World Models
1. What Is a World Model?
A World Model is an internal representation of the world that can simulate how the world's state evolves over time and in response to actions.
This definition carries several layers of meaning:
- Internal representation: Not an external database or knowledge graph, but a structure encoded within the model itself
- World state: Not a loose collection of facts, but a structured, trackable state
- Temporal evolution: The state is not a static snapshot but a dynamic process that can be propagated forward
- Action-responsiveness: State changes are not merely spontaneous — they can be intervened upon by an agent's actions
The essence of a world model is not "knowing many things," but "being able to run an internal simulator of the world."
2. Core Components
A complete world model comprises at least the following five components:
| Component | Meaning | Role |
|---|---|---|
| State representation | Internal encoding of the current world | Provides the basis for reasoning and prediction |
| Dynamics / transition model | \(P(s_{t+1} \mid s_t, a_t)\) | Predicts state changes caused by actions |
| Observation model | Generates observable signals from states | Bridges internal representation and external perception |
| Reward / value | Evaluates the quality of states | Provides a basis for decision-making |
| Causal structure | Causal relationships among variables | Supports intervention and counterfactual reasoning |
The most central of these is the dynamics model: given the current state \(s_t\) and action \(a_t\), predict the next state \(s_{t+1}\). If a model truly masters \(P(s_{t+1} \mid s_t, a_t)\), it must have internalized a vast body of world regularities — continuity, object permanence, collision dynamics, causal propagation, and more.
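To make the interface concrete, here is a minimal sketch of a dynamics model with the signature \(P(s_{t+1} \mid s_t, a_t)\), using a linear-Gaussian transition as a stand-in for the neural network a real system would learn; all names and shapes here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the dynamics-model interface: a linear-Gaussian
# transition s_{t+1} ~ N(A s_t + B a_t, sigma^2 I). A real world model
# replaces the linear map with a learned network; the interface is the same.
class LinearGaussianDynamics:
    def __init__(self, state_dim, action_dim, noise_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0, 0.1, (state_dim, state_dim))   # state -> state
        self.B = rng.normal(0, 0.1, (state_dim, action_dim))  # action -> state
        self.noise_std = noise_std

    def predict_mean(self, s, a):
        # Deterministic part of P(s_{t+1} | s_t, a_t).
        return self.A @ s + self.B @ a

    def sample(self, s, a, rng):
        # Draw one plausible next state.
        return self.predict_mean(s, a) + self.noise_std * rng.normal(size=s.shape)

rng = np.random.default_rng(1)
model = LinearGaussianDynamics(state_dim=4, action_dim=2)
s, a = rng.normal(size=4), rng.normal(size=2)
print(model.sample(s, a, rng))  # one imagined next state
```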
3. Why World Models Matter So Much
World models enable four critical capabilities:
- Prediction: Foreseeing consequences without executing an action — "If I push this cup, it will fall off the table"
- Planning: Searching over action sequences in imagination — "To reach that room, I should open the door first and then turn the corner"
- Imagination: Generating scenarios never previously experienced — "What would the world look like if gravity were twice as strong?"
- Counterfactual reasoning: Evaluating alternatives to decisions already made — "What would have happened if I had taken the other road?"
Without a world model, an agent is limited to reactive, stimulus–response behavior; with a world model, it can simulate outcomes "in its head" before acting.
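The planning capability above can be sketched as random-shooting model-predictive control: sample candidate action sequences, roll each one forward through the dynamics model, and keep the best. Everything below, including the hand-written point-mass dynamics and reward, is an illustrative assumption rather than any particular system's implementation.

```python
import numpy as np

def dynamics(s, a):
    # Toy point mass: state = [position, velocity], action = acceleration.
    pos, vel = s
    return np.array([pos + vel, vel + a])

def reward(s):
    return -abs(s[0] - 10.0)  # drive the position toward 10

def plan(s0, horizon=5, n_candidates=256, seed=0):
    rng = np.random.default_rng(seed)
    best_seq, best_return = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)  # candidate action sequence
        s, ret = s0.copy(), 0.0
        for a in seq:              # imagined rollout: no real environment steps
            s = dynamics(s, a)
            ret += reward(s)
        if ret > best_return:
            best_seq, best_return = seq, ret
    return best_seq

print(plan(np.array([0.0, 0.0]))[0])  # execute the first action, then replan
```

In practice only the first action of the best sequence is executed before replanning, which keeps the plan responsive to prediction errors.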
4. The Difference Between "Knowing Facts" and "Having a Simulator"
This distinction is easy to blur, yet it is fundamental.
An LLM is more like a "world commentator" that has read vast amounts of text, rather than a "world runner" equipped with an internal simulator.
What LLMs can do: State that "a ball will fall when released" or "a dropped cup might shatter" — these are world knowledge compressed from text.
Where LLMs are unreliable: Multi-step spatial tracking, latent variable maintenance, continuous-time processes, and the precise consequences of actions — because their training objective is "predict the next token," not "maintain a world state that evolves over time."
The core difference lies in:
| Dimension | Knowing facts (LLM) | Having a simulator (World Model) |
|---|---|---|
| Representation | Semantic associations in text | Evolvable latent states |
| Prediction method | Statistical co-occurrence | State transitions |
| Multi-step reasoning | Degrades with increasing steps | Can be stably rolled forward |
| Interventionability | Weak — changing the input does not guarantee consistency | Strong — changing an action systematically alters the future |
| Physical consistency | Not guaranteed | Ensured by dynamics constraints |
A model may perform well on benchmarks by relying on "statistical shortcuts in language" without ever forming a robust world dynamics module. This is precisely the source of shortcuts and spurious causality (see the discussion of shortcuts in The Brain's Prior Knowledge).
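A toy contrast makes the table concrete: a simulator maintains an explicit state and applies each action to it, so a ten-step question is no harder than a one-step question. The box-swapping example below is purely illustrative.

```python
# Maintaining a world state: track which box holds which object through a
# sequence of swap actions. A simulator applies each action to an explicit
# state, so accuracy does not degrade with the number of steps; a pure
# next-token predictor must recover the answer from surface statistics.
state = {"box_a": "key", "box_b": "coin", "box_c": "ring"}

def swap(state, x, y):
    state[x], state[y] = state[y], state[x]

for x, y in [("box_a", "box_b"), ("box_b", "box_c"), ("box_a", "box_c")]:
    swap(state, x, y)

print(state)  # {'box_a': 'key', 'box_b': 'ring', 'box_c': 'coin'}
```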
5. Two Classic Approaches: Dreamer and MuZero
Dreamer (Danijar Hafner)
Dreamer is a representative latent world model approach:
- Learns a latent state space from raw observations
- Learns a dynamics model in the latent space
- Plans by unrolling future trajectories in imagination (imagination rollout)
- The entire training process requires minimal interaction with the real environment
Dreamer's philosophy leans toward "modeling the world" — it tries to learn as faithful a world representation as possible, then makes decisions on top of that representation.
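The shape of that pipeline can be sketched as follows. This is a schematic of the encode-then-imagine loop, not Hafner's actual RSSM; every weight matrix here is a random, untrained stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(0, 0.1, (8, 64))     # observation -> latent (learned in practice)
W_dyn = rng.normal(0, 0.1, (8, 8 + 2))  # (latent, action) -> next latent
w_rew = rng.normal(0, 0.1, 8)           # latent -> predicted reward

def encode(obs):  return np.tanh(W_enc @ obs)
def step(z, a):   return np.tanh(W_dyn @ np.concatenate([z, a]))
def reward(z):    return w_rew @ z

obs = rng.normal(size=64)           # a single real observation
z = encode(obs)
imagined_return = 0.0
for t in range(15):                 # imagination rollout: zero env interaction
    a = rng.uniform(-1, 1, size=2)  # in Dreamer this comes from the policy
    z = step(z, a)
    imagined_return += reward(z)
print(imagined_return)              # the policy is trained on such returns
```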
MuZero (David Silver / DeepMind)
MuZero is a more "task-oriented" world model:
- Does not attempt to reconstruct raw observations (e.g., pixels)
- Learns only what is useful for decision-making: policy, value, reward, and hidden-state dynamics
- The state representation is driven entirely by task performance, with no concern for interpretability
MuZero's philosophy leans toward "serving decisions" — it retains only the hidden-state evolution most useful for action, regardless of whether the representation "looks like" the real world.
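MuZero's structure reduces to three learned functions, sketched schematically below with random untrained weights: a representation function h, a dynamics function g, and a prediction function f. Note the absence of any decoder back to observations.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, H = 4, 16
W_h = rng.normal(0, 0.1, (H, 32))                 # h: observation -> hidden
W_g = rng.normal(0, 0.1, (H + 1, H + N_ACTIONS))  # g: (hidden, a) -> (r, hidden')
W_f = rng.normal(0, 0.1, (N_ACTIONS + 1, H))      # f: hidden -> (policy, value)

def represent(obs):
    return np.tanh(W_h @ obs)

def dynamics(hidden, action):
    out = np.tanh(W_g @ np.concatenate([hidden, np.eye(N_ACTIONS)[action]]))
    return out[0], out[1:]          # predicted reward, next hidden state

def predict(hidden):
    out = W_f @ hidden
    return out[:N_ACTIONS], out[N_ACTIONS]  # policy logits, value estimate

hidden = represent(rng.normal(size=32))
for action in [0, 2, 1]:            # unrolled inside the search, never decoded
    r, hidden = dynamics(hidden, action)
    logits, value = predict(hidden)
print(value)
```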
Comparison
| Dimension | Dreamer | MuZero |
|---|---|---|
| Modeling objective | Represent the world as faithfully as possible | Retain only decision-relevant information |
| Reconstructs observations? | Yes (decodes latents back into observations) | No (ignores raw observations) |
| Degree of decision-orientation | Moderate | Strong |
| Philosophical leaning | Modeling realism | Instrumentalism |
| Typical applications | Continuous control, robotics | Board games, Atari, and other discrete-decision tasks |
| Interpretability | Higher (latent states can be decoded) | Lower (hidden states do not correspond to real quantities) |
Both belong to the world model paradigm; they simply differ in style. One is more like "understand the world first, then decide," while the other is more like "understand the world only insofar as it helps make better decisions."
6. The Fundamental Debate: Where Does Structure Come From?
This is the most central point of contention in world model research.
Route A: End-to-End Pure Neural Networks
A large enough model + enough data → causality / physics / structure will "emerge."
- Pros: Extremely general; no hand-designed structure required
- Cons: Low sample efficiency; prone to learning shortcuts and spurious causality; unreliable physical consistency
Route B: Explicit Structural Priors
Actively introduce object-centric representations, causal graphs, physical conservation laws, 3D consistency, temporal continuity, etc.
- Pros: High sample efficiency; stronger generalization; physical consistency guaranteed
- Cons: Difficult to design; may limit expressiveness; incorrect structural assumptions can be harmful
The Fusion Route: The Current Mainstream Consensus
Neural networks learn representations; structural priors impose constraints.
This is neither a purely hand-crafted rule system nor a completely unstructured black box. Instead, the right inductive biases are encoded into the model architecture — much as CNNs encode locality and translation equivariance into visual processing, except that the priors a world model needs are far more complex than those of a CNN.
The key challenge on this fusion route is: Which priors must be explicitly designed, and which can be learned from data? This question remains open.
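One concrete form of the fusion route is an auxiliary loss: the network fits the data, and the prior is imposed as a penalty on predictions that violate a known constraint. The sketch below uses momentum conservation for a frictionless two-body collision; both the linear "network" and the specific penalty are illustrative assumptions, not a standard recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (2, 2))  # stand-in for a learned transition network
# State layout (assumed): velocities [v1, v2] of two unit-mass bodies.

def predict(s):
    return W @ s

def loss(s, s_next_true, lam=1.0):
    s_pred = predict(s)
    data_term = np.sum((s_pred - s_next_true) ** 2)  # fit observed transitions
    prior_term = (s_pred.sum() - s.sum()) ** 2       # total momentum conserved
    return data_term + lam * prior_term

# Elastic collision of equal masses: the true outcome swaps the velocities.
print(loss(np.array([1.0, -1.0]), s_next_true=np.array([-1.0, 1.0])))
```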
7. 2025–2026: World Models as the Most Prominent Paradigm Shift
Starting in 2025, world models have evolved from an academic concept into one of the most important research directions in industry. At least five major competing paths exist:
| Path | Core Idea | Representative |
|---|---|---|
| JEPA | Prediction in abstract representation spaces | Yann LeCun / AMI Labs |
| Spatial Intelligence | 3D spatial structure understanding and generation | Fei-Fei Li / World Labs |
| Learned Simulation | Learning physical laws from data | Google DeepMind / Genie 3 |
| Physical AI Infrastructure | Physics-aware video prediction | NVIDIA Cosmos |
| Active Inference | Minimizing free energy / surprise | Karl Friston / VERSES |
These five paths approach the same problem from different angles: how to endow AI with a runnable internal model of the world. Detailed analyses can be found in JEPA Architecture and in Spatial Intelligence and Learned Simulation.
8. What a Unified World Model Should Look Like
If we set aside any single path and instead consider what properties an "ideal" unified world model should possess, it would need at least:
- Shared state space: Images, text, actions, and audio all map into the same latent state
- State persistence: Objects continue to exist in the internal state even when occluded
- Dynamics: \(s_t, a_t \rightarrow s_{t+1}\) — states can be rolled forward in response to actions
- Constraint compliance: Adherence to physical rules and causal laws
- Interventionability: Changing an action systematically alters the future
- Compositionality: Multiple objects and relations can be compositionally generalized to novel scenes
This is fundamentally different from "multimodal alignment." Multimodal alignment addresses "Are this image and this sentence talking about the same thing?" — it is a cross-modal dictionary. A unified world model must also answer "What happens next? Why does it happen? If I take an action, how will the future change systematically?" — it is a physics simulator combined with a causal generator.
The core distinction: multimodal alignment solves "what is this"; a unified world model must also solve "what will happen next."
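The contrast can be stated as an interface. Below is a purely illustrative sketch, not any existing system's API: all modalities encode into one shared state, and a single dynamics function rolls that state forward under actions. An alignment model would stop after the encode step; a world model is defined by what happens afterwards.

```python
import numpy as np

class UnifiedWorldModel:
    # Illustrative interface only: random untrained weights, assumed shapes.
    def __init__(self, latent_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W_img = rng.normal(0, 0.1, (latent_dim, 256))  # image -> state
        self.W_txt = rng.normal(0, 0.1, (latent_dim, 128))  # text -> state
        self.W_dyn = rng.normal(0, 0.1, (latent_dim, latent_dim + 4))

    def encode_image(self, pixels):
        return np.tanh(self.W_img @ pixels)  # shared latent state space

    def encode_text(self, tokens):
        return np.tanh(self.W_txt @ tokens)  # same space as images

    def step(self, s, a):
        # Dynamics over the shared state: s_t, a_t -> s_{t+1}.
        return np.tanh(self.W_dyn @ np.concatenate([s, a]))

rng = np.random.default_rng(1)
wm = UnifiedWorldModel()
s = wm.encode_image(rng.normal(size=256))
s_a = wm.step(s, np.array([1.0, 0.0, 0.0, 0.0]))  # two different interventions
s_b = wm.step(s, np.array([0.0, 1.0, 0.0, 0.0]))
print(np.linalg.norm(s_a - s_b))  # changing the action changes the future
```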
9. Relationship to Prior Knowledge
World models are closely linked to the topic of The Brain's Prior Knowledge.
The human brain comes equipped with a low-resolution yet remarkably generalizable world model — physical continuity, object permanence, rigidity, spatial consistency, causal expectations, gravitational intuition, and more. These priors are not learned; they are "pre-trained" into the nervous system by hundreds of millions of years of evolution. It is precisely these priors that enable humans to learn efficiently from very little experience.
The central dilemma facing current AI systems is that they lack this suite of priors. LLMs have indirectly acquired vast world knowledge from text, but this knowledge exists in the form of statistical associations rather than causal mechanisms. The ultimate goal of world model research is to equip AI systems with a comparable "internal world simulator" — whether that simulator emerges from massive data or is explicitly designed in through structural priors.
10. Open Questions
- Granularity of priors: The human brain's priors form a hierarchical structure, from low-level physical continuity to high-level social intention perception. At which level should priors be introduced into AI world models?
- Balancing generality and specialization: Dreamer leans general-purpose; MuZero leans specialized. Where should an ideal world model strike this balance?
- Evaluation criteria: How can we determine whether a model "truly understands the world" versus merely "memorizing statistical patterns"? Reliable evaluation methods are currently lacking.
- Will the five paths converge?: JEPA, spatial intelligence, learned simulation, physical AI, and active inference — will they ultimately merge into a unified framework?
True human intelligence is the result of coupling "understanding the world" with "acting in the world." The ultimate question of world model research is: can we enable machines to achieve the same coupling?