Spatial Intelligence and Learned Simulation
I. Introduction
Beyond JEPA, the period of 2025--2026 has seen the emergence of multiple distinct paths toward world models. This article focuses on three of them: spatial intelligence, learned simulation, and physical AI infrastructure. Each approaches the same core question from a different angle:
How can we give AI a runnable internal model of the physical world?
Behind each path lies a different assumption about what a world model should look like. Understanding the differences and complementarities among these assumptions is key to grasping the current landscape of world model research.
II. Spatial Intelligence -- Fei-Fei Li / World Labs
Core Thesis
The core thesis of spatial intelligence is:
The foundation of intelligence is an understanding of three-dimensional spatial structure. To truly understand the world is to understand how things exist, relate, and change in 3D space.
This thesis stems from a deep insight by Fei-Fei Li's team: humans understand the world in three dimensions, not two. We do not live inside images -- we live in a three-dimensional world with depth, volume, and spatial relationships.
World Labs and Marble
World Labs is a company founded by Fei-Fei Li, valued at approximately $5 billion, dedicated to spatial intelligence research.
Its flagship product, Marble, can generate persistent, explorable 3D environments from text, images, or video.
The modifiers "persistent" and "explorable" are crucial here:
- Persistence: The generated 3D world is not a one-off render but a continuously existing environment. You can leave an area and come back -- it is still there.
- Explorability (navigability): Users can move freely and change viewpoints within the generated 3D world. This is not about producing a pretty picture -- it is about producing a space.
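To make the notion of persistence concrete, here is a minimal sketch (all names are illustrative, not World Labs' actual API): the world's state is stored independently of the current viewpoint, so a region generated once is cached and returned unchanged when the user comes back to it.

```python
# Toy illustration of "persistence": the world is state keyed by location,
# not a one-off render, so regions survive the camera leaving.
# Class and method names here are hypothetical, not Marble's real interface.

class PersistentWorld:
    def __init__(self):
        self._chunks = {}  # (x, y, z) grid cell -> generated content

    def visit(self, cell):
        """Generate a cell on first visit; return the cached version afterwards."""
        if cell not in self._chunks:
            self._chunks[cell] = f"generated-content@{cell}"  # stand-in for a 3D asset
        return self._chunks[cell]

world = PersistentWorld()
first = world.visit((0, 0, 0))   # generate the starting cell
world.visit((5, 0, 0))           # wander away
again = world.visit((0, 0, 0))   # come back
assert first == again            # the world did not change behind our back
```

A 2D image generator has no analogue of `_chunks`: each output is independent, which is exactly the "Persistence: None" cell in the table below.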
Why Three Dimensions Matter
This question deserves careful thought. Two-dimensional image generation is already highly mature (e.g., DALL-E, Midjourney) -- so why do we need 3D?
| Dimension | 2D Image Generation | 3D World Generation |
|---|---|---|
| Output | A flat image | A navigable space |
| Viewpoint | Fixed, single viewpoint | Arbitrary viewpoints |
| Occlusion handling | Not addressed -- occluded objects are simply invisible | Must model complete 3D structure |
| Object understanding | Surface textures | Volume, depth, spatial relationships |
| Persistence | None -- each generation is independent | Yes -- the world persists |
2D generation can get by with "making things look real." 3D generation cannot rely on this -- you must genuinely understand spatial structure, or inconsistencies will be exposed the moment a user shifts the viewpoint.
3D world generation is a far more rigorous test of a model's spatial understanding.
Connection to Human Cognition
Human spatial cognition is an extraordinarily fundamental ability -- infants demonstrate rudimentary understanding of 3D space within months of birth (object permanence, depth perception, etc.). The philosophical assumption behind the spatial intelligence approach is:
If we can make AI understand 3D space the way humans do, other forms of world understanding will rest on a solid foundation.
III. Learned Simulation -- Google DeepMind / Genie
Core Thesis
The core thesis of learned simulation is:
There is no need to hand-code a physics engine -- let the model learn physical laws directly from data.
Traditional simulators (such as game engines and physics simulators) rely on manually written physical rules -- gravitational acceleration, collision detection, friction coefficients, and so on. The idea behind learned simulation is: can a neural network learn these laws on its own from observational data?
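The contrast can be shown with a deliberately tiny example: instead of hard-coding gravitational acceleration, fit a dynamics model to observed trajectories and read the constant off the fit. Here the "model" is ordinary least squares rather than a neural network, and the data is synthetic, but the principle is the same: the physics lives in the learned parameters, not in the code.

```python
import numpy as np

# Minimal sketch of "learned simulation": recover gravity from data
# rather than hard-coding it. Real systems use large neural networks on
# video; here a least-squares fit on a synthetic free-fall trajectory.

rng = np.random.default_rng(0)
g_true = 9.81
t = np.linspace(0, 2, 200)
# Observed heights of a dropped object with sensor noise: h(t) = h0 - 0.5*g*t^2
h = 100.0 - 0.5 * g_true * t**2 + rng.normal(0, 0.05, t.shape)

# Fit h(t) = a + b*t + c*t^2 and read gravity off the quadratic term.
A = np.stack([np.ones_like(t), t, t**2], axis=1)
coef, *_ = np.linalg.lstsq(A, h, rcond=None)
g_learned = -2.0 * coef[2]

print(f"learned g = {g_learned:.2f} m/s^2")  # close to 9.81, never hand-coded
```

Nothing in the fitting code knows about gravity; the value emerges from the observations, which is the learned-simulation bet scaled down to one parameter.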
Genie 3
Genie 3 (released August 2025) is Google DeepMind's third-generation interactive world model and the most important milestone on the learned simulation path.
Key technical specifications:
- Real-time interaction: The first world model capable of real-time interactive generation

- 720p / 24fps: Generates navigable 3D environments at usable image quality and frame rate
- No hard-coded physics: All physical behaviors are learned from data, with no preset physics equations
Reportedly, OpenAI initiated an internal "code red" after seeing Genie 3's demo -- an indication that the industry views learned simulation as a potentially disruptive technical paradigm.
The Philosophy of "Emergent Physics"
Genie 3 represents a radical philosophical stance:
Physical laws do not need to be explicitly programmed -- given enough data and a sufficiently large model, physics will emerge from the data.
This is consistent with "Path A: End-to-end pure neural network" discussed in World Models. Its appeal lies in generality -- no need to write separate rules for each physical phenomenon; its risk lies in reliability -- is the "physics" that emerges truly stable and consistent?
Open Questions
Learned simulation faces several key challenges:
- Physical consistency: Is the "physics" learned by the model self-consistent across all situations, or only approximately correct within the training distribution?
- Long-term stability: As simulation time progresses, do errors accumulate and lead to unrealistic behavior?
- Controllability: Can users precisely control physical parameters in the simulation (e.g., changing gravity), or can the model only reproduce the physics it has seen?
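The long-term stability concern can be made concrete with a toy rollout. Suppose a learned one-step model has a tiny systematic bias, and, as in any autoregressive world model, each prediction feeds the next. The gap to ground truth then accumulates with the rollout length (the dynamics and the bias below are illustrative stand-ins):

```python
# Sketch of the long-term stability problem: a one-step model that is off
# by 0.1% per step, rolled out autoregressively, drifts steadily from the
# ground-truth trajectory.

v_true = 1.000    # true dynamics: position advances by 1.0 per step
v_model = 1.001   # learned dynamics, off by 0.1%

pos_true = pos_model = 0.0
errors = []
for step in range(500):
    pos_true += v_true
    pos_model += v_model        # the model predicts from its OWN last prediction
    errors.append(abs(pos_model - pos_true))

print(f"error at step 10:  {errors[9]:.4f}")
print(f"error at step 500: {errors[499]:.4f}")  # 50x larger than at step 10
```

In this linear toy the error grows proportionally to the horizon; in a nonlinear learned simulator, small per-step errors can compound much faster, which is why short clips can look flawless while long rollouts drift into implausible behavior.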
IV. Physical AI Infrastructure -- NVIDIA Cosmos
Positioning
NVIDIA Cosmos occupies a different position from the approaches above -- it is not an end-user application but an infrastructure layer.
Cosmos provides foundational physics-aware world modeling capabilities for other AI systems -- especially robotics and autonomous driving -- to call upon.
Core Capabilities
At the heart of Cosmos is a foundation world model capable of generating physics-aware video predictions:
- Given a current scene and an action, it predicts the visual changes in future scenes
- Predictions comply with basic physical constraints (objects do not vanish into thin air, motion trajectories are continuous, etc.)
- With over 2 million downloads, it has already become a critical foundational component in robotics and embodied AI
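A hypothetical sketch of what such an infrastructure-layer interface looks like to a robotics stack: given the current scene and a candidate action, the model returns a predicted next scene that a downstream planner can score. The class names, the `predict` signature, and the toy dynamics are all illustrative assumptions, not NVIDIA's actual Cosmos API.

```python
from dataclasses import dataclass

# Illustrative interface of a physics-aware world model used as infrastructure.
# Everything here is a toy stand-in, not the real Cosmos API.

@dataclass
class Scene:
    objects: dict  # object id -> (x, y) position

class WorldModel:
    dt = 0.1  # prediction timestep in seconds

    def predict(self, scene: Scene, action: dict) -> Scene:
        """Roll the scene forward one step; `action` maps object id -> velocity."""
        moved = {}
        for obj, (x, y) in scene.objects.items():
            vx, vy = action.get(obj, (0.0, 0.0))
            moved[obj] = (x + vx * self.dt, y + vy * self.dt)
        return Scene(objects=moved)

model = WorldModel()
before = Scene(objects={"cup": (0.0, 0.0), "gripper": (1.0, 0.0)})
after = model.predict(before, action={"gripper": (0.5, 0.0)})

# The physical constraints from the bullets above, expressed as checks:
assert set(after.objects) == set(before.objects)       # nothing vanishes
assert abs(after.objects["gripper"][0] - 1.05) < 1e-9  # motion is continuous
```

The point of the infrastructure framing is that many downstream systems call `predict` (or its real-world equivalent) without each having to learn physics from scratch.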
The Infrastructure Mindset
The philosophy behind Cosmos is:
Rather than having every application train its own world model from scratch, provide a general-purpose, physics-aware world model as shared infrastructure.
This mirrors the foundation model approach in the LLM domain -- first train a large, general-purpose model, then fine-tune it for specific tasks. Cosmos aims to do the same for world models.
V. A Panoramic Comparison of Five Paths
Together with JEPA and Karl Friston's Active Inference, current world model research spans at least five major paths:
| Path | Core Idea | Key Feature | Leading Institution | Core Strength |
|---|---|---|---|---|
| JEPA | Predict in abstract representation space | Extreme sample efficiency | LeCun / AMI Labs | Does not waste capacity on irrelevant details |
| Spatial Intelligence | 3D spatial structure understanding and generation | Persistent, navigable 3D worlds | Fei-Fei Li / World Labs | Closely mirrors human spatial cognition |
| Learned Simulation | Learn physical laws from data | Real-time interactive world model | DeepMind / Genie 3 | High generality; no hand-crafted physics engine needed |
| Physical AI | Physics-aware video prediction | Infrastructure layer | NVIDIA Cosmos | General-purpose component callable by other systems |
| Active Inference | Minimize free energy / surprise | Biological plausibility | Karl Friston / VERSES | Deep connections to neuroscience |
Philosophical Assumptions of Each Path
A deeper comparison of the five paths reveals that their fundamental disagreement lies in the question: "What is the most important property of a world model?"
- JEPA holds that the most important factor is the level of abstraction in representation -- predicting at the right level of abstraction is more efficient than predicting at the raw signal level
- Spatial Intelligence holds that the most important factor is 3D spatial structure -- understanding three dimensions is a prerequisite for understanding the world
- Learned Simulation holds that the most important factors are interactivity and real-time performance -- a world model must be usable in real-time, interactive settings
- Physical AI holds that the most important factor is reusability -- a world model should be infrastructure, not a bespoke application
- Active Inference holds that the most important factor is consistency with biological systems -- a world model should follow the fundamental principles of how the brain operates
VI. Convergence or Divergence?
A natural question arises: will these five paths ultimately converge into a unified framework, or will they continue to diverge?
Signs of Convergence
- All paths are constructing some form of "internal world representation"
- All paths agree that text/token prediction alone is insufficient for understanding the world
- Multiple paths are beginning to cross-pollinate -- for instance, spatial intelligence borrows techniques from learned simulation, and JEPA is starting to incorporate 3D structure
Possibilities for Divergence
- Different paths optimize different objective functions and may arrive at different local optima
- A "universal world model" may simply not exist -- different domains may require different types of world models
- Commercial interests may drive each path to develop independently rather than merge
A Possible Unifying Perspective
If we return to the six elements of a unified world model proposed in World Models -- shared state space, state persistence, dynamics, constraints, interventionability, and compositionality -- we can observe that:
- JEPA focuses primarily on shared state space and dynamics
- Spatial Intelligence focuses primarily on state persistence and constraints (3D consistency)
- Learned Simulation focuses primarily on dynamics and interventionability
- Physical AI provides constraints at the infrastructure level
- Active Inference contributes an overarching theoretical framework (free-energy minimization) that ties perception, action, and learning together
The five paths may not be competing -- they may be solving different sub-problems of a unified world model.
VII. The Choice of Inductive Biases
Returning to the core question from The Brain's Prior Knowledge: which inductive biases matter most?
The five paths have, in effect, chosen different inductive biases:
| Path | Core Inductive Bias |
|---|---|
| JEPA | Semantic abstraction matters more than pixel-level detail |
| Spatial Intelligence | 3D spatial structure is the fundamental skeleton of the world |
| Learned Simulation | Physical laws can emerge from sufficiently large amounts of data |
| Physical AI | Physical consistency is an indispensable constraint |
| Active Inference | The essence of intelligence is minimizing prediction error |
The human brain's answer is: all of the above. The brain possesses abstract representation, 3D spatial understanding, learned physical intuition, and built-in physical-consistency constraints, and it continuously minimizes prediction error.
This perhaps hints at the ultimate direction:
A true world model does not take a single inductive bias and push it to its extreme -- rather, it integrates multiple inductive biases into a unified architecture.
VIII. Open Questions
- Data sources: Both spatial intelligence and learned simulation require vast amounts of 3D/video data. High-quality 3D data is far harder to obtain than text data -- will this become a bottleneck?
- Evaluation criteria: How should we assess the quality of a world model? By generation quality (FID)? By downstream task performance (robot success rate)? By physical consistency? Different evaluation criteria may steer research in different directions.
- Computational cost: The computational demands of running a high-fidelity world model in real time are enormous. Genie 3 achieved 720p/24fps, but this falls far short of the simulation fidelity of the human brain.
- From simulation to action: Even with a perfect world model, what else is needed to bridge the gap from "understanding how the world works" to "acting effectively in the world"? Planning algorithms, value functions, exploration strategies -- how should the interface between these components and the world model be designed?
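One common answer to the simulation-to-action question is model-predictive control: treat the world model as a black-box step function, let the planner imagine candidate action sequences inside it, score each imagined outcome with a value function, and execute only the best first action before replanning. The sketch below shows random-shooting MPC on a 1-D point mass; the dynamics, cost, and all parameters are illustrative stand-ins, not any particular system's design.

```python
import numpy as np

# Toy random-shooting MPC: planner + value function layered on a world model.
# All dynamics and parameters below are illustrative stand-ins.

def world_model_step(state, action):
    """Stand-in learned dynamics: a 1-D point mass; `action` is acceleration."""
    pos, vel = state
    return np.array([pos + 0.1 * vel, vel + 0.1 * action])

def plan(state, goal, rng, horizon=10, n_candidates=200):
    """Sample action sequences, roll each out in imagination, keep the best."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_cost, best_first = np.inf, 0.0
    for seq in candidates:
        s = state
        for a in seq:                              # rollout inside the model
            s = world_model_step(s, a)
        cost = abs(s[0] - goal) + 0.5 * abs(s[1])  # reach the goal, then stop
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

rng = np.random.default_rng(0)
state = np.array([0.0, 0.0])
for _ in range(50):                                # MPC loop: replan every step
    state = world_model_step(state, plan(state, goal=1.0, rng=rng))
print(f"final position: {state[0]:.2f}")           # should settle near the goal
```

The interface question in the bullet above is visible even in this toy: the planner only ever needs a step function and a cost, so any of the five paths' world models could in principle slot in behind `world_model_step`.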
World model research is at an exhilarating stage. Five paths are advancing in parallel, each achieving breakthroughs. Whether they will ultimately converge into a unified paradigm is one of the most important questions to watch in AI over the coming years.