
Prior Knowledge in the Brain

1. The Core Question: Why Does the Human Brain Learn Fast While Models Learn Slowly?

Why can humans learn from just a few examples, while current models typically require massive amounts of data?

The key lies in the fact that learning efficiency depends on priors / inductive bias.

Here, "prior" does not refer strictly to a Bayesian probability prior, but rather in a broader sense:

The model's presupposed tendencies about the structure of the world, established before seeing any data.

Its role is to:

  • Help the model narrow down the hypothesis space
  • Enable the model to extract correct patterns from limited data more easily
  • Improve sample efficiency

Without sufficiently appropriate inductive biases, a model can still learn in theory, but it will be extremely inefficient.


2. A Classic Example of Priors: CNNs

CNNs can learn from images effectively because they have several structural assumptions built in:

  • Locality: neighboring pixels are more correlated
  • Weight sharing
  • Translation equivariance / invariance: an object appearing in the top-left corner is essentially the same as one appearing in the bottom-right

CNNs do not "discover image patterns from scratch" — instead, part of the visual world's regularities are baked into the architecture.

This is a reasonable compression of the physical world — but it is only a very weak prior.
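
To make these assumptions concrete, here is a minimal numpy sketch of the prior itself, using a toy 1-D convolution (the names are illustrative, not from any framework): each output depends only on a local window, one shared kernel is reused at every position, and shifting the input simply shifts the output.

```python
# Locality + weight sharing + translation equivariance in a toy 1-D convolution.
import numpy as np

def conv1d(signal, kernel):
    """Valid 1-D convolution: each output depends only on a local window (locality),
    and the same kernel is reused at every position (weight sharing)."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

signal = np.zeros(10)
signal[2] = 1.0                      # a "feature" at position 2
kernel = np.array([1.0, -1.0, 0.5])  # one shared local filter

shifted = np.roll(signal, 3)         # the same feature, moved 3 steps to the right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# The response to the shifted input is (up to boundary effects) the shifted response,
# so a pattern learned at one location transfers to every other location.
print(out)
print(out_shifted)
```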


3. Priors in the Human Brain: Far More Powerful Than CNNs

The human brain does not merely carry a local prior similar to that of a CNN. Rather, it comes equipped with an entire suite of structural biases matched to the real world:

  • Physical continuity: the world state usually changes continuously rather than jumping erratically
  • Object permanence: being occluded does not mean ceasing to exist
  • Rigidity: many objects maintain a stable shape over short time scales
  • Spatial coherence / 3D structure: the world has three-dimensional structure
  • Local interaction: interactions typically occur between nearby regions
  • Causal expectation: pushing something makes it move; striking something changes it
  • Gravitational intuition: released objects fall downward
  • Agent / object distinction
  • Social intention perception: inferring what others "intend to do"

More precisely:

The human brain is not simply "born able to learn" — it is natively pre-configured as a system well-suited for learning about the real world.


4. Beyond Priors: The Brain's Learning Mechanisms and Goal Systems

The human brain comes with a low-resolution but highly generalizable world model. This model has several key characteristics:

  • Generative: the brain can imagine the future, fill in missing information, and perform counterfactual reasoning (e.g., "what would have happened if I had done that instead?")
  • Compositional: it can generalize from limited experience, combining a small number of concepts to produce new ones
  • Cross-modal unification: vision, hearing, touch, and language are coupled and modeled in a unified manner

In addition to powerful priors, the brain also has several important mechanisms:

Learning Mechanisms

  • Active learning: humans actively explore and generate data
  • Online learning: humans can perform one-shot / few-shot learning and continuously self-correct

Goal Systems

  • Curiosity / information gain maximization
  • Surprise / prediction error
  • Sense of control / agency

These are collectively referred to as intrinsic objectives.

So the complete picture is:

The brain's advantage = strong priors + active learning + intrinsic objectives + continual updating

By contrast, modern models typically do the opposite: they consume data passively, train offline against fixed objectives, and do not continually self-correct.
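
As a concrete (and heavily simplified) illustration of one intrinsic objective, the sketch below treats curiosity as the prediction error of the agent's own forward model. The class name, the linear model, and the toy loop are assumptions made for illustration, not a description of the brain or of any specific algorithm.

```python
import numpy as np

class CuriosityModule:
    """Intrinsic reward = prediction error ("surprise") of a learned forward model."""

    def __init__(self, dim, lr=0.1):
        self.W = np.zeros((dim, dim))  # tiny linear forward model: s_next ~ W @ s
        self.lr = lr

    def intrinsic_reward(self, s, s_next):
        """States the model cannot yet predict yield high reward, i.e. are worth exploring."""
        return float(np.sum((s_next - self.W @ s) ** 2))

    def update(self, s, s_next):
        """Online learning: reduce future surprise by fitting the forward model."""
        error = s_next - self.W @ s
        self.W += self.lr * np.outer(error, s)

# Toy loop: as the forward model improves, familiar transitions stop being rewarding.
rng = np.random.default_rng(0)
curiosity = CuriosityModule(dim=2)
for step in range(5):
    s = rng.normal(size=2)
    s_next = 0.5 * s                      # the (unknown to the agent) true dynamics
    print(step, round(curiosity.intrinsic_reward(s, s_next), 4))
    curiosity.update(s, s_next)
```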


5. What Is a World Model?

A world model is not simply "knowing many facts." Rather, it is:

An internal representation of the world state, along with the ability to simulate how it changes over time and in response to actions.

A complete world model involves at least the following:

  1. State representation
  2. Dynamics / transition model
  3. Observation model
  4. Reward or value — especially important for decision-making tasks
  5. (Ideally) Causal structure
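
A minimal interface sketch of these components might look like the following; the class and method names are hypothetical, not taken from any library.

```python
from typing import Protocol
import numpy as np

class WorldModel(Protocol):
    """Interface sketch for the five components above (names are illustrative)."""

    def encode(self, observation: np.ndarray) -> np.ndarray:
        """1. State representation: compress raw observations into a latent state s_t."""
        ...

    def transition(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        """2. Dynamics: predict s_{t+1} (or a distribution over it) from (s_t, a_t)."""
        ...

    def decode(self, state: np.ndarray) -> np.ndarray:
        """3. Observation model: map a latent state back to expected observations."""
        ...

    def reward(self, state: np.ndarray, action: np.ndarray) -> float:
        """4. Reward / value: what planning and decision-making optimize."""
        ...

    # 5. (Ideally) causal structure: which state variables directly influence which.
    #    In most current systems this remains implicit in the learned dynamics.
```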

What World Models Add

  1. Predicting the future
  2. Causal structure
  3. Actionability

The central question for world models is: Which priors must be explicitly designed, and which can be learned from data?


6. The Core of a World Model: State Transitions

\[P(s_{t+1} \mid s_t, a_t)\]

This means: given the current state \(s_t\) and the current action \(a_t\), predict the next state \(s_{t+1}\).

This is the most critical component of a world model — action-conditioned dynamics. It answers:

  • What is the current state of the world?
  • If I take an action, what will the next state be?

If a model can reliably learn \(P(s_{t+1} \mid s_t, a_t)\), then it must internalize many real-world regularities (continuity, object persistence, local interaction, collision dynamics, etc.).

However, it is important to distinguish:

  • Priors are structural preferences that exist before learning
  • World dynamics are patterns of world evolution learned from experience

The former helps the latter be learned more efficiently.
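
A minimal sketch of what "learning world dynamics from experience" can mean in the simplest (deterministic) case: collect transitions \((s_t, a_t, s_{t+1})\) from a toy linear system and fit the transition map by least squares. Real world models replace this with neural networks and predictive distributions; all names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])   # position += 0.1 * velocity
B_true = np.array([[0.0], [0.1]])             # action pushes the velocity

# Collect transitions (s_t, a_t, s_{t+1}) by acting randomly in the toy system.
states, actions, next_states = [], [], []
s = np.zeros(2)
for _ in range(200):
    a = rng.uniform(-1, 1, size=1)
    s_next = A_true @ s + B_true @ a
    states.append(s); actions.append(a); next_states.append(s_next)
    s = s_next

X = np.hstack([np.array(states), np.array(actions)])   # inputs  [s_t, a_t]
Y = np.array(next_states)                               # targets  s_{t+1}

# Fit s_{t+1} ~ [A | B] @ [s_t, a_t] by least squares: the learned transition model.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("learned [A | B]^T:\n", W.round(3))
```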


7. Two Approaches to World Models: Dreamer vs. MuZero

Dreamer (Representative: Danijar Hafner)

A typical latent world model approach:

  • Learns a latent state from observations
  • Learns the dynamics of this latent state
  • Rolls out imagined future trajectories
  • Uses imagination for planning

It leans toward "modeling the world" and is closer to a modeling-oriented philosophy.

MuZero (Representative: David Silver)

A more "task-oriented" world model:

  • Does not explicitly predict observations (e.g., pixels)
  • Only learns what is useful for decision-making: policy, value, reward, hidden dynamics
  • Retains only the hidden-state evolution that is most useful for choosing actions

It leans toward "serving decisions" and is closer to an instrumentalist philosophy.

Comparison

  • Dreamer: reconstructs the world (generates latent trajectories); moderately decision-oriented
  • MuZero: does not reconstruct the world (ignores raw observations); strongly decision-oriented

Both belong to the world model paradigm; they simply differ in style.
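
Schematically, the difference comes down to which quantities each style's latent dynamics are trained to predict. The sketch below is a sketch of that contrast only, not the actual Dreamer or MuZero implementations; `encoder`, `dynamics`, `decoder`, and the prediction heads are assumed callables named for illustration.

```python
def dreamer_style_targets(obs_seq, action_seq, encoder, dynamics, decoder, reward_head):
    """Dreamer-style: the latent state must also explain (reconstruct) observations."""
    targets = []
    s = encoder(obs_seq[0])
    for obs, act in zip(obs_seq[1:], action_seq):
        s = dynamics(s, act)
        targets.append(("reconstruction", decoder(s), obs))  # predict the raw observation
        targets.append(("reward", reward_head(s)))
    return targets

def muzero_style_targets(obs_0, action_seq, encoder, dynamics,
                         reward_head, value_head, policy_head):
    """MuZero-style: the latent state only has to predict reward, value, and policy."""
    targets = []
    s = encoder(obs_0)          # observations enter once and are never reconstructed
    for act in action_seq:
        s = dynamics(s, act)
        targets.append(("reward", reward_head(s)))
        targets.append(("value", value_head(s)))
        targets.append(("policy", policy_head(s)))
    return targets
```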


8. Multimodal Alignment ≠ Unified World Model

Existing multimodal large models (e.g., GPT-4V, Gemini) are indeed building "unified representations," but two things must be distinguished:

What Has Been Achieved: Multimodal Alignment

Different modalities (text / image / audio) are mapped into a shared representation space, enabling cross-modal alignment — image captioning, text-to-image generation, and so on.

What it solves: "Are this image and this sentence describing the same thing?"

It is more like a cross-modal dictionary.

What Has Not Been Achieved: A Unified World Model

What has not yet been truly realized:

  • Unified world dynamics
  • Consistent physical constraints
  • Cross-modal causal consistency

A unified world model is more like a physics simulator + causal generator, and should at least possess:

  • Shared state space: images, text, and actions all map to the same latent state
  • State persistence: objects continue to exist after being occluded
  • Dynamics: \(s_t, a_t \rightarrow s_{t+1}\)
  • Constraints: adherence to physical / causal rules
  • Interventionality: changing an action systematically alters future outcomes
  • Compositionality: multiple objects and relations can be combined for generalization

The Core Distinction

"Being in the same vector space" only means that different modalities can correspond to one another; A "unified world model" requires the model to internally maintain a latent state that evolves over time, is influenced by actions, and is constrained by world regularities.

  • Multimodal alignment addresses: What is this?
  • A unified world model must also address: What will happen next? Why will it happen? If I take an action, how will the future systematically change?

Why Image-Text Alignment Is Not Enough

Because a single static vector can encode "correlation," but not necessarily "generative mechanism."

For example: a model may know that "a glass falling" frequently co-occurs with "shards," but this does not mean it has learned gravity, collision dynamics, material brittleness, or the relationship between velocity and impact force. It may have only learned statistical co-occurrence.

The key is not "whether there is a unified vector space," but rather:

Whether the internal representation is a state that can be advanced, intervened upon, and used for prediction.
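
A toy sketch of that distinction, with purely illustrative functions: a world-model-style state can be rolled forward by a dynamics function and intervened on through actions, so changing one action changes the whole future; a static embedding supports similarity comparisons but has no such notion of "advance" or "intervene".

```python
import numpy as np

def dynamics(state, action):
    """Toy dynamics: position advances by velocity, the action changes the velocity."""
    position, velocity = state
    return np.array([position + velocity, velocity + action])

def rollout(state, actions):
    trajectory = [state]
    for a in actions:
        state = dynamics(state, a)
        trajectory.append(state)
    return np.array(trajectory)

s0 = np.array([0.0, 1.0])
plan_a = rollout(s0, [0.0, 0.0, 0.0])   # do nothing
plan_b = rollout(s0, [-1.0, 0.0, 0.0])  # intervene: brake at the first step

# The futures diverge systematically as a function of the intervention,
# which a similarity score between static embeddings cannot express.
print(plan_a[-1], plan_b[-1])
```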


9. LLMs Appear to Understand the World, but Lack Dynamics

Strengths of LLMs

The primary training objective of LLMs is next-token prediction. This objective forces them to learn a vast amount of statistical structure:

  • Linguistic structure, facts, semantic relationships
  • Event co-occurrence patterns, narrative templates
  • Common-sense expressions, compressed world knowledge embedded in human writing

As a result, LLMs seem to "understand the world well," because the text they were trained on inherently contains a great deal of world knowledge.

What LLMs "Understand"

LLMs can state: a ball released from the hand will fall, a dropped cup may shatter, and keys placed in a drawer will most likely still be there. These all resemble common sense.

What LLMs "Don't Fully Understand"

Whenever a task requires precise, sustained state evolution, LLMs tend to struggle:

  • Multi-step spatial tracking
  • Latent variable maintenance
  • Continuous-time processes
  • Precise consequences of actions
  • Long-term multi-entity interactions

This is because their training objective is "predict the next token," not "maintain a world state that evolves over time and advance it based on actions."
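
Written as objectives, the contrast is roughly the following (a schematic comparison, not the exact loss of any particular system):

\[\mathcal{L}_{\text{LM}} = -\sum_t \log p_\theta(x_t \mid x_{<t}) \qquad \text{vs.} \qquad \mathcal{L}_{\text{dyn}} = -\sum_t \log p_\theta(s_{t+1} \mid s_t, a_t)\]

The left objective is satisfied by any continuation that is statistically plausible as text; only the right one explicitly requires maintaining a state \(s_t\) and advancing it under actions \(a_t\).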

An Intuitive Analogy

An LLM is more like a well-read "world commentator" than an internally simulating "world engine."

It can talk about the world very well, but it cannot necessarily advance the world reliably.

Why Next-Token Prediction Does Not Naturally Give Rise to Dynamics

Because this objective most directly optimizes for correct text continuation, narrative plausibility, and statistical naturalness, rather than:

  • Explicitly maintaining world state
  • Multi-step object identity tracking
  • Stably advancing latent state based on actions
  • Guaranteeing physical consistency

A model can achieve high performance through "linguistic statistical shortcuts" without ever forming a robust world dynamics module.


10. Shortcuts and Spurious Causality

What Are Shortcuts?

A model finds a path that scores highly on the training data but does not correspond to the true underlying mechanism.

It appears to have learned the task, but has only captured some superficial correlation.

What Is Spurious Causality?

Spurious causality is a typical form of shortcut: the model mistakes "correlation" for "causation" (correlation ≠ causation).

Example 1: Cows and Grasslands

In the training set, cows frequently appear against grassland backgrounds and camels against desert backgrounds. The model may learn "green background → cow." It has not learned what a cow looks like — only the background.

Example 2: Dropping a Cup Causes It to Shatter

In the data, most instances of "cup falling" result in "shattering," so the model memorizes "falling → shattering." But it has not learned the truly causal factors: height, material, surface hardness, impact velocity, etc. Given a plastic cup or a drop onto a foam pad, it may still predict "shattering."
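
A toy illustration of this kind of shortcut on synthetic data (everything below is made up for the example): when background perfectly predicts the label in training, a "classifier" that ignores the animal entirely still scores perfectly, and then collapses the moment the correlation breaks.

```python
# Each example: (animal_shape, background); the true task is to recognize the shape.
train = [("cow_shape", "grass"), ("cow_shape", "grass"),
         ("camel_shape", "desert"), ("camel_shape", "desert")]
test = [("cow_shape", "desert"), ("camel_shape", "grass")]   # distribution shift

def shortcut_classifier(example):
    """Ignores the animal entirely; predicts from the background alone."""
    _, background = example
    return "cow" if background == "grass" else "camel"

def true_label(example):
    shape, _ = example
    return "cow" if shape == "cow_shape" else "camel"

def accuracy(data):
    return sum(shortcut_classifier(x) == true_label(x) for x in data) / len(data)

print("train accuracy:", accuracy(train))   # 1.0 -- looks like the task was learned
print("test accuracy:", accuracy(test))     # 0.0 -- the shortcut collapses under shift
```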

Why Shortcuts Severely Impede World Models

A world model needs to learn stable mechanisms, interventionable structure, and cross-distribution generalization. Shortcuts learn the laziest opportunistic rules in the current dataset. This leads to:

  • Failure upon environment change
  • Collapse in multi-step reasoning
  • Incorrect action-consequence predictions

Moreover, many datasets are themselves full of shortcuts, and a model only needs to capture these surface signals to achieve high scores — so "good performance" does not equal "having learned the correct mechanism."


11. Object-Centric World Models

What Is Object-Centric?

Instead of treating input as a monolithic signal, the model decomposes it into "objects + object attributes + object relations + object evolution."

This is closer to how humans understand the world. For example, when viewing a desktop scene, we do not simply see pixel blocks — we see a cup, a book, a phone, a table, along with their positions, materials, and relationships.

Why Object-Centric Matters

Many real-world regularities are naturally organized "per object": objects move, collide, become occluded, and maintain identity continuity. If a model has object-level representations, it can more easily learn compositional generalization and more stable causal relationships.
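
As a sketch of what an object-centric ("slot"-style) representation looks like in contrast to one monolithic feature vector, consider the following; the field names and the relation format are illustrative assumptions, not any specific model's API.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectSlot:
    identity: int                  # stable ID so the same object can be tracked over time
    position: tuple[float, float]
    attributes: dict = field(default_factory=dict)   # e.g. material, size, color

@dataclass
class Scene:
    slots: list[ObjectSlot]                 # variable number of objects per scene
    relations: list[tuple[int, int, str]]   # sparse relations, e.g. (cup, table, "on")

scene = Scene(
    slots=[
        ObjectSlot(identity=0, position=(0.2, 0.5), attributes={"kind": "cup"}),
        ObjectSlot(identity=1, position=(0.5, 0.5), attributes={"kind": "table"}),
    ],
    relations=[(0, 1, "on")],
)
```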

Why Object-Centric World Models Are Extremely Difficult

1. Object boundaries are not naturally clear

What counts as an object is inherently ambiguous: Is a cloud an object? A shadow? Is a stream of water one object or many?

2. Occlusion, deformation, merging, and splitting

Objects can be half-occluded, ropes can bend, water droplets can split, two people standing close together can visually merge — maintaining identity continuity is very hard for a model.

3. Variable number of objects

A scene might contain 1 cup, 5 people, or 200 leaves. Using slots or object lists requires handling variable-length sets with dynamic addition and deletion.

4. Combinatorial explosion of relations

As the number of objects grows, interaction relations explode rapidly: who touches whom, who constrains whom, which interactions matter. The model must be object-centric, relational, and sparse all at once.

5. Extremely weak training supervision

There are typically no ready-made labels telling the model "this is object X" or "it corresponds to that object across frames." The model must discover "objectness" from unsupervised or weakly supervised data.

6. The world is not only about objects — there are also fields

Lighting, fluids, temperature, wind, sound waves, and other field-centric phenomena do not naturally fit a pure object-based representation. So while object-centric modeling is important, it is not the whole story.


12. Explicit Structure vs. Pure Neural Network Learning

This is one of the most central debates in the field today.

Approach A: End-to-End Pure Neural Networks

Viewpoint: a sufficiently large model + data → causality / physics will "emerge"

  • Pros: strong generality, no need for manual modeling
  • Cons: sample inefficiency, risk of learning spurious causality (shortcuts)

Approach B: Explicitly Incorporating Structure

Introduce object-centric representations, causal graphs, physical constraints (e.g., conservation laws), 3D consistency, temporal continuity, causal modularity, etc.

  • Pros: higher data efficiency, stronger generalization
  • Cons: hard to design, may limit expressiveness, and incorrect priors can be harmful

The Hybrid Approach (The More Likely Direction)

Neural networks learn representations; structural priors provide constraints.

Not a purely hand-crafted rule system, nor a completely unstructured black box, but rather embedding the right inductive biases into the model.

Human priors are not just a few rules — they form an entire hierarchical structure. Too few priors offer limited help; overly strong priors can lock the model into incorrect assumptions (the world is not always rigid, object boundaries are not always clear, social systems are far more complex than physical ones). So the practical approach is typically:

Use structural priors to constrain neural networks, rather than to replace them.
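
One concrete form this hybrid can take is to keep the dynamics model free-form but add penalty terms to the training loss for violating known structure, as in the sketch below. The choice of constraints (temporal continuity, a conserved quantity) and all names are illustrative, not a prescription.

```python
import numpy as np

def prediction_loss(pred_next_state, true_next_state):
    """Ordinary data-fitting term for a learned dynamics model."""
    return float(np.sum((pred_next_state - true_next_state) ** 2))

def continuity_penalty(state, pred_next_state):
    """Prior: the world usually changes smoothly, so penalize large jumps."""
    return float(np.sum((pred_next_state - state) ** 2))

def conservation_penalty(state, pred_next_state, energy_fn):
    """Prior: a conserved quantity (e.g. energy) should not change across a step."""
    return float((energy_fn(pred_next_state) - energy_fn(state)) ** 2)

def total_loss(state, pred_next_state, true_next_state, energy_fn,
               lambda_cont=0.1, lambda_cons=0.1):
    """Neural network fits the data; structural priors enter as soft constraints."""
    return (prediction_loss(pred_next_state, true_next_state)
            + lambda_cont * continuity_penalty(state, pred_next_state)
            + lambda_cons * conservation_penalty(state, pred_next_state, energy_fn))
```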


13. Biological Evolution = Ultra-Large-Scale Meta-Learning

Why Evolution Can Be Analogized to "Learning"

Evolution involves an optimization-like process:

  • Mutation ≈ parameter perturbation
  • Selection pressure ≈ loss function
  • Survival of the fittest ≈ optimization

It can be viewed as reinforcement learning / black-box optimization over the space of genomes.
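
A minimal sketch of that analogy as code: mutate (perturb parameters), select (keep whatever scores better under a fitness function), repeat. This is a toy (1+1)-style evolutionary search, illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(genome):
    """Stand-in for selection pressure: higher is better (here, closeness to a target)."""
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((genome - target) ** 2)

genome = np.zeros(3)
for generation in range(200):
    offspring = genome + rng.normal(scale=0.1, size=genome.shape)   # mutation
    if fitness(offspring) > fitness(genome):                        # selection
        genome = offspring                                          # survival of the fittest

print(genome.round(2))   # drifts toward the target over many generations
```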

But Evolution Is Not the Same as Ordinary Machine Learning

  • Not individual online learning: evolution operates at the population and intergenerational level
  • No gradients: there is no fine-grained credit assignment like backpropagation
  • Extremely coarse feedback: only crude signals such as survival / reproductive success
  • Unstable objective function: the environment keeps changing; it is not a fixed loss

A More Precise Characterization

Evolution = cross-generational, ultra-long-term, extremely low-efficiency but ultra-large-scale meta-learning

What it learns is not any specific task, but rather:

  • Which body structures are effective
  • Which perceptual systems are effective
  • Which inductive biases are effective
  • Which learning mechanisms are effective

Therefore:

The human brain is not "smart out of nowhere" — it has been pre-trained by evolution over an extended period to become a system that excels at learning in the real world.

Current models remain at the stage of "directly learning tasks," without having completed "learning how to learn + how to model the world."


14. The Complete Logical Chain

  1. Learning efficiency depends on inductive bias.
  2. CNNs possess only weak visual priors; the human brain carries extensive priors adapted to the real world.
  3. The brain's advantage is not priors alone — it also includes active learning, intrinsic objectives, and continual updating.
  4. Truly understanding the world is not merely sharing image-text semantics — it requires a world state that can be advanced over time and in response to actions. This is what a world model is.
  5. A core component of world models is learning \(P(s_{t+1} \mid s_t, a_t)\).
  6. Dreamer and MuZero both belong to the world model paradigm, leaning toward "modeling the world" and "serving decisions," respectively.
  7. Modern multimodal large models have achieved multimodal alignment, but multimodal alignment ≠ unified world model.
  8. LLMs have acquired extensive world knowledge, but next-token prediction does not naturally give rise to stable dynamics.
  9. Models easily take shortcuts, treating correlation as causation and forming spurious causal beliefs.
  10. Object-centric world models are important but extremely difficult due to challenges in object discovery, occlusion, deformation, and relational combinatorics.
  11. The future will more likely follow a hybrid approach of "neural networks + structural priors."
  12. The human brain is powerful because evolution effectively performed hundreds of millions of years of meta-learning, writing effective biases into the system.

15. A Sharper Question

Should we first build "models that can act" (agents), or first build "models that understand the world"?

  • Dreamer / MuZero → leans toward agent
  • LLM → leans toward world knowledge (but without action)

True human intelligence is the result of coupling both.

