
World Models

1. What Is a World Model?

A World Model is an internal representation of the world that can simulate how the world's state evolves over time and in response to actions.

This definition carries several layers of meaning:

  1. Internal representation: Not an external database or knowledge graph, but a structure encoded within the model itself
  2. World state: Not a loose collection of facts, but a structured, trackable state
  3. Temporal evolution: The state is not a static snapshot but a dynamic process that can be propagated forward
  4. Action-responsiveness: State changes are not merely spontaneous — they can be intervened upon by an agent's actions

The essence of a world model is not "knowing many things," but "being able to run an internal simulator of the world."
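To make the "internal simulator" idea concrete, here is a deliberately tiny sketch: an agent foresees the outcome of pushing a cup by rolling a world state forward in its head rather than by acting. The cup world, its state tuple, and the one-unit-per-tick gravity are all invented for illustration, not taken from any real system.

```python
def step(state, action):
    """One tick of imagined dynamics for a cup on a table.
    State: (x position, height above the floor, is it falling?)."""
    x, height, falling = state
    if action == "push":
        falling = True                 # pushing the cup off the edge starts a fall
    if falling and height > 0:
        height -= 1                    # crude gravity: descend one unit per tick
    return (x, height, falling)

def imagine(state, action, ticks=5):
    """Foresee the consequence of an action by rolling the state forward
    internally, without ever touching the real environment."""
    state = step(state, action)
    for _ in range(ticks - 1):
        state = step(state, "wait")
    return state

cup = (0, 3, False)                    # on the table, 3 units above the floor
print(imagine(cup, "push"))            # → (0, 0, True): the cup ends up on the floor
print(imagine(cup, "wait"))            # → (0, 3, False): left alone, nothing changes
```

The point of the sketch is the separation of concerns: `step` encodes the world's regularities once, and `imagine` reuses them to answer "what if?" questions for free.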


2. Core Components

A complete world model comprises at least the following five components:

| Component | Meaning | Role |
| --- | --- | --- |
| State representation | Internal encoding of the current world | Provides the basis for reasoning and prediction |
| Dynamics / transition model | \(P(s_{t+1} \mid s_t, a_t)\) | Predicts state changes caused by actions |
| Observation model | Generates observable signals from states | Bridges internal representation and external perception |
| Reward / value | Evaluates the quality of states | Provides a basis for decision-making |
| Causal structure | Causal relationships among variables | Supports intervention and counterfactual reasoning |

The most central of these is the dynamics model: given the current state \(s_t\) and action \(a_t\), predict the next state \(s_{t+1}\). If a model truly masters \(P(s_{t+1} \mid s_t, a_t)\), it must have internalized a vast body of world regularities — continuity, object permanence, collision dynamics, causal propagation, and more.
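In its most explicit form, the dynamics model is just a conditional distribution table. The sketch below samples from a hand-written \(P(s_{t+1} \mid s_t, a_t)\); the states, actions, and probabilities are all made up for illustration.

```python
import random

# A toy transition model: (state, action) -> distribution over next states.
P = {
    ("cup_on_table", "push"): {"cup_falling": 0.9, "cup_on_table": 0.1},
    ("cup_on_table", "wait"): {"cup_on_table": 1.0},
    ("cup_falling", "wait"): {"cup_shattered": 0.7, "cup_intact_on_floor": 0.3},
}

def sample_next(state, action):
    """Draw s_{t+1} ~ P(. | s_t, a_t)."""
    dist = P[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs)[0]

def rollout(state, actions):
    """Propagate the world state forward through a sequence of actions."""
    for a in actions:
        state = sample_next(state, a)
    return state

print(rollout("cup_on_table", ["push", "wait"]))
```

Real systems replace the lookup table with a learned network over latent states, but the contract is the same: given \(s_t\) and \(a_t\), produce a distribution over \(s_{t+1}\).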


3. Why World Models Matter So Much

World models enable four critical capabilities:

  1. Prediction: Foreseeing consequences without executing an action — "If I push this cup, it will fall off the table"
  2. Planning: Searching over action sequences in imagination — "To reach that room, I should open the door first and then turn the corner"
  3. Imagination: Generating scenarios never previously experienced — "What would the world look like if gravity were twice as strong?"
  4. Counterfactual reasoning: Evaluating alternatives to decisions already made — "What would have happened if I had taken the other road?"

Without a world model, an agent is limited to reactive, stimulus–response behavior; with a world model, it can simulate outcomes "in its head" before acting.
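Capability 2, planning, can be sketched as search over action sequences inside a model. The 1-D corridor below and its dynamics are assumptions chosen to keep the example exhaustive and tiny; practical planners sample or search rather than enumerate.

```python
from itertools import product

GOAL = 4                                   # rightmost cell of a 5-cell corridor

def model_step(pos, action):
    """Imagined dynamics of a 1-D corridor with positions 0..GOAL."""
    return max(0, min(GOAL, pos + (1 if action == "right" else -1)))

def plan(pos, horizon=4):
    """Search in imagination: evaluate every action sequence inside the
    model and return the one with the best predicted outcome."""
    best_seq, best_dist = None, float("inf")
    for seq in product(["left", "right"], repeat=horizon):
        p = pos
        for a in seq:
            p = model_step(p, a)           # simulate, never execute
        dist = abs(GOAL - p)               # predicted distance to the goal
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq

print(plan(0))                             # → ('right', 'right', 'right', 'right')
```

Note that the real environment is never queried inside `plan`: every candidate future is evaluated purely in the model, which is exactly what "simulating outcomes in its head" means.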


4. The Difference Between "Knowing Facts" and "Having a Simulator"

This distinction is easy to blur, yet it is absolutely fundamental.

An LLM is more like a "world commentator" that has read vast amounts of text, rather than a "world runner" equipped with an internal simulator.

What LLMs can do: State that "a ball will fall when released" or "a dropped cup might shatter" — these are world knowledge compressed from text.

Where LLMs are unreliable: Multi-step spatial tracking, latent variable maintenance, continuous-time processes, and the precise consequences of actions — because their training objective is "predict the next token," not "maintain a world state that evolves over time."

The core difference lies in:

| Dimension | Knowing facts (LLM) | Having a simulator (world model) |
| --- | --- | --- |
| Representation | Semantic associations in text | Evolvable latent states |
| Prediction method | Statistical co-occurrence | State transitions |
| Multi-step reasoning | Degrades with increasing steps | Can be stably rolled forward |
| Interventionability | Weak: changing the input does not guarantee consistency | Strong: changing an action systematically alters the future |
| Physical consistency | Not guaranteed | Ensured by dynamics constraints |

A model may perform well on benchmarks by relying on "statistical shortcuts in language" without ever forming a robust world dynamics module. This is precisely the source of shortcuts and spurious causality (see the discussion on shortcuts in 大脑的先验知识).


5. Two Classic Approaches: Dreamer and MuZero

Dreamer (Danijar Hafner)

Dreamer is a representative latent world model approach:

  • Learns a latent state space from raw observations
  • Learns a dynamics model in the latent space
  • Plans by unrolling future trajectories in imagination (imagination rollout)
  • The entire training process requires minimal interaction with the real environment

Dreamer's philosophy leans toward "modeling the world" — it tries to learn as faithful a world representation as possible, then makes decisions on top of that representation.
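The bullets above can be sketched structurally. The integer maps below stand in for Dreamer's learned encoder, latent dynamics, and decoder, so only the shape of the loop, not the arithmetic, reflects the real system.

```python
def encode(obs):
    """Observation -> latent state (a learned encoder in the real system)."""
    return obs // 10

def latent_step(z, action):
    """Latent dynamics: predict the next latent from state and action."""
    return z + action

def decode(z):
    """Latent state -> reconstructed observation (the decoder)."""
    return z * 10

def imagine_trajectory(obs, actions):
    """Imagination rollout: unroll entirely in latent space, never
    querying the real environment, then decode the imagined futures."""
    z = encode(obs)
    futures = []
    for a in actions:
        z = latent_step(z, a)
        futures.append(decode(z))
    return futures

print(imagine_trajectory(50, [1, 1, -1]))  # → [60, 70, 60]
```

The decode step is what marks this as the "modeling the world" philosophy: the latent trajectory is held accountable to the observation space.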

MuZero (David Silver / DeepMind)

MuZero is a more "task-oriented" world model:

  • Does not attempt to reconstruct raw observations (e.g., pixels)
  • Learns only what is useful for decision-making: policy, value, reward, and hidden-state dynamics
  • The state representation is driven entirely by task performance, with no concern for interpretability

MuZero's philosophy leans toward "serving decisions" — it retains only the hidden-state evolution most useful for action, regardless of whether the representation "looks like" the real world.
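The contrast with Dreamer shows up directly in the interface. The toy integer stand-ins below mimic the shape of MuZero's three learned functions (representation, dynamics, prediction); every formula here is invented for illustration. The key structural point is the absence of a decoder: the hidden state is never mapped back to observations.

```python
def representation(obs):
    """h: observation -> hidden state; need not be human-interpretable."""
    return obs % 7

def dynamics(hidden, action):
    """g: (hidden, action) -> (next hidden state, predicted reward)."""
    nxt = (hidden + action) % 7
    return nxt, (1 if nxt == 0 else 0)

def prediction(hidden):
    """f: hidden -> value estimate used to guide search."""
    return -hidden

def plan_one_step(obs, actions=(1, 2, 4)):
    """Pick the action whose imagined successor scores best (reward + value)."""
    h = representation(obs)
    def score(a):
        nxt, r = dynamics(h, a)
        return r + prediction(nxt)
    return max(actions, key=score)

print(plan_one_step(10))  # → 4: the action that reaches the rewarding hidden state
```

Nothing constrains `representation` to resemble the real world; it is shaped only by how well reward, value, and policy can be predicted from it, which is the instrumentalist philosophy in miniature.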

Comparison

| Dimension | Dreamer | MuZero |
| --- | --- | --- |
| Modeling objective | Represent the world as faithfully as possible | Retain only decision-relevant information |
| Reconstructs observations? | Yes (a decoder reconstructs observations from latent states) | No (ignores raw observations) |
| Degree of decision-orientation | Moderate | Strong |
| Philosophical leaning | Modeling realism | Instrumentalism |
| Typical applications | Continuous control, robotics | Board games, Atari, and other discrete-decision tasks |
| Interpretability | Higher (latent states can be decoded) | Lower (hidden states do not correspond to real quantities) |

Both belong to the world model paradigm; they simply differ in style. One is more like "understand the world first, then decide," while the other is more like "understand the world only insofar as it helps make better decisions."


6. The Fundamental Debate: Where Does Structure Come From?

This is the most central point of contention in world model research.

Route A: End-to-End Pure Neural Networks

A large enough model + enough data → causality / physics / structure will "emerge."

  • Pros: Extremely general; no hand-designed structure required
  • Cons: Low sample efficiency; prone to learning shortcuts and spurious causality; unreliable physical consistency

Route B: Explicit Structural Priors

Actively introduce object-centric representations, causal graphs, physical conservation laws, 3D consistency, temporal continuity, etc.

  • Pros: High sample efficiency; stronger generalization; physical consistency guaranteed
  • Cons: Difficult to design; may limit expressiveness; incorrect structural assumptions can be harmful

The Fusion Route: The Current Mainstream Consensus

Neural networks learn representations; structural priors impose constraints.

This is neither a purely hand-crafted rule system nor a completely unstructured black box. Instead, the right inductive biases are encoded into the model architecture — much as CNNs encode locality and translation equivariance into visual processing, except that the priors a world model needs are far more complex than those of a CNN.

The key challenge on this fusion route is: Which priors must be explicitly designed, and which can be learned from data? This question remains open.


7. 2025–2026: World Models as the Most Prominent Paradigm Shift

Starting in 2025, world models have evolved from an academic concept into one of the most important research directions in industry. At least five major competing paths exist:

| Path | Core idea | Representative |
| --- | --- | --- |
| JEPA | Prediction in abstract representation spaces | Yann LeCun / AMI Labs |
| Spatial Intelligence | 3D spatial structure understanding and generation | Fei-Fei Li / World Labs |
| Learned Simulation | Learning physical laws from data | Google DeepMind / Genie 3 |
| Physical AI Infrastructure | Physics-aware video prediction | NVIDIA Cosmos |
| Active Inference | Minimizing free energy / surprise | Karl Friston / VERSES |

These five paths approach the same problem from different angles: how to endow AI with a runnable internal model of the world. Detailed analyses can be found in JEPA架构 and 空间智能与学习式仿真.


8. What a Unified World Model Should Look Like

If we set aside any single path and instead consider what properties an "ideal" unified world model should possess, it would need at least:

  1. Shared state space: Images, text, actions, and audio all map into the same latent state
  2. State persistence: Objects continue to exist in the internal state even when occluded
  3. Dynamics: \(s_t, a_t \rightarrow s_{t+1}\) — states can be rolled forward in response to actions
  4. Constraint compliance: Adherence to physical rules and causal laws
  5. Interventionability: Changing an action systematically alters the future
  6. Compositionality: Multiple objects and relations can be compositionally generalized to novel scenes

This is fundamentally different from "multimodal alignment." Multimodal alignment addresses "Are this image and this sentence talking about the same thing?" — it is a cross-modal dictionary. A unified world model must also answer "What happens next? Why does it happen? If I take an action, how will the future change systematically?" — it is a physics simulator combined with a causal generator.

The core distinction: multimodal alignment solves "what is this"; a unified world model must also solve "what will happen next."
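Of the six requirements above, state persistence is the easiest to illustrate in code: the observation model may hide an object, but the internal state keeps tracking it, so perception and world state are decoupled. All names in this sketch are assumptions, not an existing API.

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Internal world state: a registry of tracked objects and positions."""
    objects: dict = field(default_factory=dict)   # name -> (x, y) position

def observe(state, occluded):
    """Observation model: occluded objects vanish from view, not from state."""
    return {k: v for k, v in state.objects.items() if k not in occluded}

state = WorldState({"ball": (2, 3), "cup": (0, 0)})
view = observe(state, occluded={"ball"})

print(view)             # → {'cup': (0, 0)}: the ball is invisible...
print(state.objects)    # ...but still exists internally, position intact
```

A system that only maps observations to labels has nowhere to keep the occluded ball; a world model does, and that stored state is what makes "what happens next?" answerable.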


9. Relationship to Prior Knowledge

World models are closely linked to the topic of 大脑的先验知识.

The human brain comes equipped with a low-resolution yet remarkably generalizable world model — physical continuity, object permanence, rigidity, spatial consistency, causal expectations, gravitational intuition, and more. These priors are not learned; they are "pre-trained" into the nervous system by hundreds of millions of years of evolution. It is precisely these priors that enable humans to learn efficiently from very little experience.

The central dilemma facing current AI systems is that they lack this suite of priors. LLMs have indirectly acquired vast world knowledge from text, but this knowledge exists in the form of statistical associations rather than causal mechanisms. The ultimate goal of world model research is to equip AI systems with a comparable "internal world simulator" — whether that simulator emerges from massive data or is explicitly designed in through structural priors.


10. Open Questions

  1. Granularity of priors: The human brain's priors form a hierarchical structure, from low-level physical continuity to high-level social intention perception. At which level should priors be introduced into AI world models?
  2. Balancing generality and specialization: Dreamer leans general-purpose; MuZero leans specialized. Where should an ideal world model strike this balance?
  3. Evaluation criteria: How can we determine whether a model "truly understands the world" versus merely "memorizing statistical patterns"? Reliable evaluation methods are currently lacking.
  4. Will the five paths converge?: JEPA, spatial intelligence, learned simulation, physical AI, and active inference — will they ultimately merge into a unified framework?

True human intelligence is the result of coupling "understanding the world" with "acting in the world." The ultimate question of world model research is: can we enable machines to achieve the same coupling?

