Spatial Intelligence and Learned Simulation
I. Introduction
Beyond JEPA, the period of 2025--2026 has seen the emergence of multiple distinct paths toward world models. This article focuses on three of them: spatial intelligence, learned simulation, and physical AI infrastructure. Each approaches the same core question from a different angle:
How can we give AI a runnable internal model of the physical world?
Behind each path lies a different assumption about what a world model should look like. Understanding the differences and complementarities among these assumptions is key to grasping the current landscape of world model research.
II. Spatial Intelligence -- Fei-Fei Li / World Labs
Core Thesis
The core thesis of spatial intelligence is:
The foundation of intelligence is an understanding of three-dimensional spatial structure. To truly understand the world is to understand how things exist, relate, and change in 3D space.
This thesis stems from a deep insight by Fei-Fei Li's team: humans understand the world in three dimensions, not two. We do not live inside images -- we live in a three-dimensional world with depth, volume, and spatial relationships.
World Labs and Marble
World Labs is a company founded by Fei-Fei Li, valued at approximately $5 billion, dedicated to spatial intelligence research.
Its flagship product, Marble, can generate persistent, explorable 3D environments from text, images, or video.
The modifiers "persistent" and "explorable" are crucial here:
- Persistence: The generated 3D world is not a one-off render but a continuously existing environment. You can leave an area and come back -- it is still there.
- Explorability (navigability): Users can move freely and change viewpoints within the generated 3D world. This is not about producing a pretty picture -- it is about producing a space.
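To make the notion of persistence concrete, here is a minimal sketch (all names are illustrative, not World Labs' actual API): the world's state is stored independently of the current viewpoint, so a region generated once is cached and returned unchanged when the user comes back to it.

```python
# Toy illustration of "persistence": the world is state keyed by location,
# not a one-off render, so regions survive the camera leaving.
# Class and method names here are hypothetical, not Marble's real interface.

class PersistentWorld:
    def __init__(self):
        self._chunks = {}  # (x, y, z) grid cell -> generated content

    def visit(self, cell):
        """Generate a cell on first visit; return the cached version afterwards."""
        if cell not in self._chunks:
            self._chunks[cell] = f"generated-content@{cell}"  # stand-in for a 3D asset
        return self._chunks[cell]

world = PersistentWorld()
first = world.visit((0, 0, 0))   # generate the starting cell
world.visit((5, 0, 0))           # wander away
again = world.visit((0, 0, 0))   # come back
assert first == again            # the world did not change behind our back
```

A 2D image generator has no analogue of `_chunks`: each output is independent, which is exactly the "Persistence: None" cell in the table below.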
Why Three Dimensions Matter
This question deserves careful thought. Two-dimensional image generation is already highly mature (e.g., DALL-E, Midjourney) -- so why do we need 3D?
| Dimension | 2D Image Generation | 3D World Generation |
|---|---|---|
| Output | A flat image | A navigable space |
| Viewpoint | Fixed, single viewpoint | Arbitrary viewpoints |
| Occlusion handling | Not addressed -- occluded objects are simply invisible | Must model complete 3D structure |
| Object understanding | Surface textures | Volume, depth, spatial relationships |
| Persistence | None -- each generation is independent | Yes -- the world persists |
2D generation can get by with "making things look real." 3D generation cannot rely on this -- you must genuinely understand spatial structure, or inconsistencies will be exposed the moment a user shifts the viewpoint.
3D world generation is a far more rigorous test of a model's spatial understanding.
Connection to Human Cognition
Human spatial cognition is an extraordinarily fundamental ability -- infants demonstrate rudimentary understanding of 3D space within months of birth (object permanence, depth perception, etc.). The philosophical assumption behind the spatial intelligence approach is:
If we can make AI understand 3D space the way humans do, other forms of world understanding will rest on a solid foundation.
III. Learned Simulation -- Google DeepMind / Genie
Core Thesis
The core thesis of learned simulation is:
There is no need to hand-code a physics engine -- let the model learn physical laws directly from data.
Traditional simulators (such as game engines and physics simulators) rely on manually written physical rules -- gravitational acceleration, collision detection, friction coefficients, and so on. The idea behind learned simulation is: can a neural network learn these laws on its own from observational data?
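The contrast can be shown with a deliberately tiny example: instead of hard-coding gravitational acceleration, fit a dynamics model to observed trajectories and read the constant off the fit. Here the "model" is ordinary least squares rather than a neural network, and the data is synthetic, but the principle is the same: the physics lives in the learned parameters, not in the code.

```python
import numpy as np

# Minimal sketch of "learned simulation": recover gravity from data
# rather than hard-coding it. Real systems use large neural networks on
# video; here a least-squares fit on a synthetic free-fall trajectory.

rng = np.random.default_rng(0)
g_true = 9.81
t = np.linspace(0, 2, 200)
# Observed heights of a dropped object with sensor noise: h(t) = h0 - 0.5*g*t^2
h = 100.0 - 0.5 * g_true * t**2 + rng.normal(0, 0.05, t.shape)

# Fit h(t) = a + b*t + c*t^2 and read gravity off the quadratic term.
A = np.stack([np.ones_like(t), t, t**2], axis=1)
coef, *_ = np.linalg.lstsq(A, h, rcond=None)
g_learned = -2.0 * coef[2]

print(f"learned g = {g_learned:.2f} m/s^2")  # close to 9.81, never hand-coded
```

Nothing in the fitting code knows about gravity; the value emerges from the observations, which is the learned-simulation bet scaled down to one parameter.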
Genie 3
Genie 3 (released August 2025) is Google DeepMind's third-generation interactive world model and the most important milestone on the learned simulation path.
Key technical specifications:
- Real-time interaction: The first world model capable of real-time interactive generation

- 720p / 24fps: Generates navigable 3D environments at usable image quality and frame rate
- No hard-coded physics: All physical behaviors are learned from data, with no preset physics equations
Reportedly, OpenAI initiated an internal "code red" after seeing Genie 3's demo -- an indication that the industry views learned simulation as a potentially disruptive technical paradigm.
The Philosophy of "Emergent Physics"
Genie 3 represents a radical philosophical stance:
Physical laws do not need to be explicitly programmed -- given enough data and a sufficiently large model, physics will emerge from the data.
This is consistent with "Path A: End-to-end pure neural network" discussed in World Models. Its appeal lies in generality -- no need to write separate rules for each physical phenomenon; its risk lies in reliability -- is the "physics" that emerges truly stable and consistent?
Open Questions
Learned simulation faces several key challenges:
- Physical consistency: Is the "physics" learned by the model self-consistent across all situations, or only approximately correct within the training distribution?
- Long-term stability: As simulation time progresses, do errors accumulate and lead to unrealistic behavior?
- Controllability: Can users precisely control physical parameters in the simulation (e.g., changing gravity), or can the model only reproduce the physics it has seen?
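The long-term stability concern can be made concrete with a toy rollout. Suppose a learned one-step model has a tiny systematic bias, and, as in any autoregressive world model, each prediction feeds the next. The gap to ground truth then accumulates with the rollout length (the dynamics and the bias below are illustrative stand-ins):

```python
# Sketch of the long-term stability problem: a one-step model that is off
# by 0.1% per step, rolled out autoregressively, drifts steadily from the
# ground-truth trajectory.

v_true = 1.000    # true dynamics: position advances by 1.0 per step
v_model = 1.001   # learned dynamics, off by 0.1%

pos_true = pos_model = 0.0
errors = []
for step in range(500):
    pos_true += v_true
    pos_model += v_model        # the model predicts from its OWN last prediction
    errors.append(abs(pos_model - pos_true))

print(f"error at step 10:  {errors[9]:.4f}")
print(f"error at step 500: {errors[499]:.4f}")  # 50x larger than at step 10
```

In this linear toy the error grows proportionally to the horizon; in a nonlinear learned simulator, small per-step errors can compound much faster, which is why short clips can look flawless while long rollouts drift into implausible behavior.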
IV. Physical AI Infrastructure -- NVIDIA Cosmos
Positioning
NVIDIA Cosmos occupies a different position from the approaches above -- it is not an end-user application but an infrastructure layer.
Cosmos provides foundational physics-aware world modeling capabilities for other AI systems -- especially robotics and autonomous driving -- to call upon.
Core Capabilities
At the heart of Cosmos is a foundation world model capable of generating physics-aware video predictions:
- Given a current scene and an action, it predicts the visual changes in future scenes
- Predictions comply with basic physical constraints (objects do not vanish into thin air, motion trajectories are continuous, etc.)
- With over 2 million downloads, it has already become a critical foundational component in robotics and embodied AI
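A hypothetical sketch of what such an infrastructure-layer interface looks like to a robotics stack: given the current scene and a candidate action, the model returns a predicted next scene that a downstream planner can score. The class names, the `predict` signature, and the toy dynamics are all illustrative assumptions, not NVIDIA's actual Cosmos API.

```python
from dataclasses import dataclass

# Illustrative interface of a physics-aware world model used as infrastructure.
# Everything here is a toy stand-in, not the real Cosmos API.

@dataclass
class Scene:
    objects: dict  # object id -> (x, y) position

class WorldModel:
    dt = 0.1  # prediction timestep in seconds

    def predict(self, scene: Scene, action: dict) -> Scene:
        """Roll the scene forward one step; `action` maps object id -> velocity."""
        moved = {}
        for obj, (x, y) in scene.objects.items():
            vx, vy = action.get(obj, (0.0, 0.0))
            moved[obj] = (x + vx * self.dt, y + vy * self.dt)
        return Scene(objects=moved)

model = WorldModel()
before = Scene(objects={"cup": (0.0, 0.0), "gripper": (1.0, 0.0)})
after = model.predict(before, action={"gripper": (0.5, 0.0)})

# The physical constraints from the bullets above, expressed as checks:
assert set(after.objects) == set(before.objects)       # nothing vanishes
assert abs(after.objects["gripper"][0] - 1.05) < 1e-9  # motion is continuous
```

The point of the infrastructure framing is that many downstream systems call `predict` (or its real-world equivalent) without each having to learn physics from scratch.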
The Infrastructure Mindset
The philosophy behind Cosmos is:
Rather than having every application train its own world model from scratch, provide a general-purpose, physics-aware world model as shared infrastructure.
This mirrors the foundation model approach in the LLM domain -- first train a large, general-purpose model, then fine-tune it for specific tasks. Cosmos aims to do the same for world models.
V. A Panoramic Comparison of Five Paths
Together with JEPA and Karl Friston's Active Inference, current world model research spans at least five major paths:
| Path | Core Idea | Key Feature | Leading Institution | Core Strength |
|---|---|---|---|---|
| JEPA | Predict in abstract representation space | Extreme sample efficiency | LeCun / AMI Labs | Does not waste capacity on irrelevant details |
| Spatial Intelligence | 3D spatial structure understanding and generation | Persistent, navigable 3D worlds | Fei-Fei Li / World Labs | Closely mirrors human spatial cognition |
| Learned Simulation | Learn physical laws from data | Real-time interactive world model | DeepMind / Genie 3 | High generality; no hand-crafted physics engine needed |
| Physical AI | Physics-aware video prediction | Infrastructure layer | NVIDIA Cosmos | General-purpose component callable by other systems |
| Active Inference | Minimize free energy / surprise | Biological plausibility | Karl Friston / VERSES | Deep connections to neuroscience |
Philosophical Assumptions of Each Path
A deeper comparison of the five paths reveals that their fundamental disagreement lies in the question: "What is the most important property of a world model?"
- JEPA holds that the most important factor is the level of abstraction in representation -- predicting at the right level of abstraction is more efficient than predicting at the raw signal level
- Spatial Intelligence holds that the most important factor is 3D spatial structure -- understanding three dimensions is a prerequisite for understanding the world
- Learned Simulation holds that the most important factors are interactivity and real-time performance -- a world model must be usable in real-time, interactive settings
- Physical AI holds that the most important factor is reusability -- a world model should be infrastructure, not a bespoke application
- Active Inference holds that the most important factor is consistency with biological systems -- a world model should follow the fundamental principles of how the brain operates
VI. Convergence or Divergence?
A natural question arises: will these five paths ultimately converge into a unified framework, or will they continue to diverge?
Signs of Convergence
- All paths are constructing some form of "internal world representation"
- All paths agree that text/token prediction alone is insufficient for understanding the world
- Multiple paths are beginning to cross-pollinate -- for instance, spatial intelligence borrows techniques from learned simulation, and JEPA is starting to incorporate 3D structure
Possibilities for Divergence
- Different paths optimize different objective functions and may arrive at different local optima
- A "universal world model" may simply not exist -- different domains may require different types of world models
- Commercial interests may drive each path to develop independently rather than merge
A Possible Unifying Perspective
If we return to the six elements of a unified world model proposed in World Models -- shared state space, state persistence, dynamics, constraints, interventionability, and compositionality -- we can observe that:
- JEPA focuses primarily on shared state space and dynamics
- Spatial Intelligence focuses primarily on state persistence and constraints (3D consistency)
- Learned Simulation focuses primarily on dynamics and interventionability
- Physical AI provides constraints at the infrastructure level
- Active Inference contributes an overarching theoretical framework (free-energy minimization) that ties perception, action, and learning together
The five paths may not be competing -- they may be solving different sub-problems of a unified world model.
VII. The Choice of Inductive Biases
Returning to the core question from The Brain's Prior Knowledge: which inductive biases matter most?
The five paths have, in effect, chosen different inductive biases:
| Path | Core Inductive Bias |
|---|---|
| JEPA | Semantic abstraction matters more than pixel-level detail |
| Spatial Intelligence | 3D spatial structure is the fundamental skeleton of the world |
| Learned Simulation | Physical laws can emerge from sufficiently large amounts of data |
| Physical AI | Physical consistency is an indispensable constraint |
| Active Inference | The essence of intelligence is minimizing prediction error |
The human brain's answer is: all of the above. The brain possesses abstract representation, 3D spatial understanding, learned physical intuition, and built-in physical-consistency constraints, and it continuously minimizes prediction error.
This perhaps hints at the ultimate direction:
A true world model does not take a single inductive bias and push it to its extreme -- rather, it integrates multiple inductive biases into a unified architecture.
VIII. Open Questions
- Data sources: Both spatial intelligence and learned simulation require vast amounts of 3D/video data. High-quality 3D data is far harder to obtain than text data -- will this become a bottleneck?
- Evaluation criteria: How should we assess the quality of a world model? By generation quality (FID)? By downstream task performance (robot success rate)? By physical consistency? Different evaluation criteria may steer research in different directions.
- Computational cost: The computational demands of running a high-fidelity world model in real time are enormous. Genie 3 achieved 720p/24fps, but this falls far short of the simulation fidelity of the human brain.
- From simulation to action: Even with a perfect world model, what else is needed to bridge the gap from "understanding how the world works" to "acting effectively in the world"? Planning algorithms, value functions, exploration strategies -- how should the interface between these components and the world model be designed?
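One common answer to the simulation-to-action question is model-predictive control: treat the world model as a black-box step function, let the planner imagine candidate action sequences inside it, score each imagined outcome with a value function, and execute only the best first action before replanning. The sketch below shows random-shooting MPC on a 1-D point mass; the dynamics, cost, and all parameters are illustrative stand-ins, not any particular system's design.

```python
import numpy as np

# Toy random-shooting MPC: planner + value function layered on a world model.
# All dynamics and parameters below are illustrative stand-ins.

def world_model_step(state, action):
    """Stand-in learned dynamics: a 1-D point mass; `action` is acceleration."""
    pos, vel = state
    return np.array([pos + 0.1 * vel, vel + 0.1 * action])

def plan(state, goal, rng, horizon=10, n_candidates=200):
    """Sample action sequences, roll each out in imagination, keep the best."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_cost, best_first = np.inf, 0.0
    for seq in candidates:
        s = state
        for a in seq:                              # rollout inside the model
            s = world_model_step(s, a)
        cost = abs(s[0] - goal) + 0.5 * abs(s[1])  # reach the goal, then stop
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

rng = np.random.default_rng(0)
state = np.array([0.0, 0.0])
for _ in range(50):                                # MPC loop: replan every step
    state = world_model_step(state, plan(state, goal=1.0, rng=rng))
print(f"final position: {state[0]:.2f}")           # should settle near the goal
```

The interface question in the bullet above is visible even in this toy: the planner only ever needs a step function and a cost, so any of the five paths' world models could in principle slot in behind `world_model_step`.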
World model research is at an exhilarating stage. Five paths are advancing in parallel, each achieving breakthroughs. Whether they will ultimately converge into a unified paradigm is one of the most important questions to watch in AI over the coming years.