Intuitive Physics
I. What Is Intuitive Physics?
Intuitive physics refers to the capacity of humans (and many animals) to grasp basic physical regularities without any formal training in physics.
Intuitive physics is not a precise calculation of Newtonian mechanics, but rather a rapid, approximate, yet sufficiently accurate capacity for physical reasoning in everyday contexts.
A two-year-old child knows that a released ball will fall, that a toy hidden behind a barrier has not ceased to exist, and that a block stacked on top of another needs enough contact area to remain stable. These judgments are not learned from textbooks — they are core competencies that the human cognitive system either possesses innately or develops very early in life.
The fundamental principles encompassed by intuitive physics include:
| Principle | Meaning | Age of Onset in Infants |
|---|---|---|
| Object Permanence | Objects continue to exist after being occluded | ~3–5 months |
| Gravity Intuition | Unsupported objects fall | ~5–7 months |
| Solidity | Objects cannot pass through one another | ~3–4 months |
| Inertia | Moving objects tend to keep moving | ~6 months |
| Support Relations | Objects need support to maintain their position | ~5–6 months |
| Contact Causality | Collisions and contact produce changes in motion | ~6 months |
The remarkably early and universal emergence of these abilities strongly suggests that they are not purely learned from experience, but rather reflect some form of structural prior built into the human brain.
II. Relationship to the Brain's Prior Knowledge
In the article on the brain's prior knowledge, we discussed the core source of the human brain's learning efficiency:
The human brain is not a blank slate — it comes pre-configured as a system suited for learning about the real world.
Intuitive physics is the most concrete and measurable component of this prior knowledge system. The locality assumption in CNNs constitutes a very weak prior, whereas intuitive physics involves object permanence, gravity, solidity, spatiotemporal continuity, and more — an entire suite of strong priors about the structure of the physical world.
What is the origin of these priors? The answer points to evolution. The fundamental laws of the physical world have remained stable over hundreds of millions of years — gravity has always existed, objects have always been solid, and space has always been three-dimensional. Natural selection eliminated individuals who could not quickly grasp these regularities, encoding effective physical intuitions into the brain's initial architecture.
From an evolutionary perspective, intuitive physics is not learned knowledge, but a survival prior "hardcoded" into the nervous system.
III. Josh Tenenbaum: Modeling Human Cognition with Probabilistic Programs
Josh Tenenbaum is a professor in MIT's Department of Brain and Cognitive Sciences and one of the most central figures in research on intuitive physics and intuitive psychology. His core question is:
How do humans learn such rich world knowledge from so little data?
Probabilistic Programs as Cognitive Models
Tenenbaum's central claim is that human cognitive processes can be modeled using probabilistic programs. Specifically, regarding intuitive physics:
- The human brain internally runs an approximate physics engine
- When confronted with a physical scene, the brain predicts object behavior by running mental simulations through this engine
- These simulations are not precise Newtonian mechanics, but noisy, approximate, and probabilistic
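The "noisy physics engine" idea can be caricatured in a few lines of code: run many simulations of the same scene with perceptual noise injected, and read off a probability. Everything below is a toy sketch, not Tenenbaum's actual model; the tower geometry, noise level, and support rule are invented for illustration.

```python
import random

def p_tower_falls(block_xs, noise_sd=0.2, n_samples=200):
    """Estimate P(tower falls) by noisy mental simulation.

    block_xs: horizontal centers of unit-width blocks, bottom to top.
    Each sample perturbs the perceived positions with Gaussian noise,
    then applies a crude support rule: the tower falls if the center
    of mass of the blocks above any level hangs past the edge of the
    block beneath them.
    """
    falls = 0
    for _ in range(n_samples):
        noisy = [x + random.gauss(0.0, noise_sd) for x in block_xs]
        for i in range(1, len(noisy)):
            com_above = sum(noisy[i:]) / len(noisy[i:])
            if abs(com_above - noisy[i - 1]) > 0.5:  # past the half-width edge
                falls += 1
                break
    return falls / n_samples

random.seed(0)
p_aligned = p_tower_falls([0.0, 0.05, 0.10])  # nearly aligned stack
p_offset = p_tower_falls([0.0, 0.40, 0.80])   # strongly offset stack
```

Under such a model, "surprise" falls out naturally: outcomes the simulations assign low probability are exactly the ones that violate expectations.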
This framework accounts for a range of classic findings:
- Why infants show "surprise" (prolonged looking time) at events that violate physical laws
- Why human physical judgments systematically deviate from exact physics in certain situations (because the simulation engine itself is approximate)
- Why humans can generalize from very few examples (because the prior structure dramatically compresses the hypothesis space)
The AI2050 Project
The AI2050 project, in which Tenenbaum is involved, aims to build AI systems with human-level commonsense understanding. Its core philosophy is:
True AI common sense should not depend on massive amounts of data, but should instead emerge — as it does in infants — from limited experience combined with the right prior structure.
The project attempts to address a fundamental question: can we build an "initial program" for AI, analogous to the human brain's, that equips it with core knowledge similar to that of an infant, enabling it to learn efficiently from minimal interaction on that foundation?
IV. The IntPhys 2 Benchmark: How Poor Is AI's Intuitive Physics?
To quantify whether AI systems possess intuitive physics, rigorous benchmarks are needed. IntPhys 2 is an evaluation framework designed specifically for this purpose, drawing on the Violation of Expectation (VoE) paradigm used in developmental psychology to test infant cognition.
Four Fundamental Principles
IntPhys 2 tests AI models' understanding of the following four physical principles:
| Principle | Normal Event | Violation Event |
|---|---|---|
| Permanence | An object reappears after leaving the field of view | An object appears out of thin air or vanishes without a trace |
| Immutability | An object retains its properties | An object spontaneously changes shape or color |
| Spatiotemporal Continuity | An object moves along a continuous path | An object teleports to a different location |
| Solidity | Objects bounce off each other upon collision | Objects pass through one another |
Testing Method
The core logic of the test mirrors that of developmental psychology:
- Present the model with a video of a normal physical scenario
- Present the model with a video that violates a physical principle
- Observe whether the model can distinguish between the two — that is, whether it is "surprised" by the physically impossible event
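That relative-comparison logic is simple enough to write down. The sketch below assumes the model exposes some scalar "surprise" signal (e.g. a prediction error or negative log-likelihood); the pairing and scoring scheme is a simplified rendering of the VoE idea, not IntPhys 2's exact protocol.

```python
def voe_score(surprise_fn, paired_clips):
    """Fraction of matched pairs where the model is more 'surprised'
    by the impossible clip than by the possible one.

    surprise_fn: maps a clip to a scalar (higher = more surprising),
    e.g. a prediction error or negative log-likelihood.
    paired_clips: (possible_clip, impossible_clip) pairs; under this
    relative scoring, chance level is 0.5.
    """
    hits = sum(
        surprise_fn(impossible) > surprise_fn(possible)
        for possible, impossible in paired_clips
    )
    return hits / len(paired_clips)

# Toy usage: each "clip" stands in for a precomputed surprise scalar,
# so the surprise function is just the identity.
pairs = [(0.2, 0.9), (0.3, 0.8), (0.5, 0.4)]  # the last pair fools the model
score = voe_score(lambda clip: clip, pairs)
```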
Key Findings
State-of-the-art vision models perform reasonably well in simple scenarios, but in complex situations involving occlusion and multi-object interactions, their performance approaches chance level. Humans score near perfect on the same tests.
The implications of this finding are profound: even models that match or exceed human performance on visual recognition tasks exhibit fundamental deficiencies in the most basic forms of physical reasoning. A model can accurately identify what objects are in an image, yet fail to understand how those objects should move and interact.
V. Why Do Current AI Models Lack Intuitive Physics?
This deficiency is not accidental — it reflects deep limitations of the prevailing AI paradigm.
The Predicament of Language Models
LLMs can say "the ball fell to the ground," but this is merely statistical co-occurrence at the linguistic level, not genuine physical understanding. As discussed in the article on prior knowledge:
LLMs are more like a "world commentator" that has read vast amounts of text, rather than a "world simulator" with an internal physics engine.
An LLM can "talk about" gravity, but it lacks an internal model capable of actually simulating gravitational effects.
The Predicament of Vision Models
Vision models (such as ViT, CLIP, etc.) excel at extracting static features — color, shape, texture, and spatial relations. But physical reasoning demands understanding of dynamic processes: how forces propagate, how motion changes, and how collisions unfold. At best, this information is only implicit in a static image; and although it can in principle be observed in video, models tend to learn superficial visual patterns (shortcuts) rather than the underlying physical mechanisms.
Root Causes
The fundamental reasons why current models lack intuitive physics can be summarized as follows:
- Lack of appropriate prior structure: Models are not endowed with inductive biases for object permanence, spatiotemporal continuity, etc.
- Misaligned training objectives: Predicting the next token or the next pixel frame is not the same as understanding physical causality
- Lack of embodied interaction: No opportunity to act in the physical world and receive feedback
- Physical information is implicit in the data: Physical knowledge in text and images is highly compressed — insufficient for models to spontaneously develop a physics engine
VI. DeepMind's Approach: From Reasoning to Physical Manipulation
Faced with the absence of intuitive physics, DeepMind is exploring a path that combines the reasoning capabilities of large models with physical manipulation.
Gemini Robotics
In 2025, DeepMind released Gemini Robotics 1.5 and Gemini Robotics-ER (Embodied Reasoning). The design philosophy behind the latter is particularly noteworthy:
- Gemini Robotics-ER serves as the "high-level brain" of the robotic system, responsible for scene understanding, task planning, and reasoning
- Lower-level controllers handle concrete motor execution
- The ER model integrates the reasoning capabilities of vision-language models with physical scene understanding
The core assumption underlying this architecture is:
Even if the model itself does not possess a complete physics engine, combining linguistic reasoning with embodied feedback can, to some extent, compensate for the lack of intuitive physics.
However, whether this truly equates to having acquired intuitive physics remains an open question. Relying on linguistic reasoning to "compensate" for absent physical intuition may be fundamentally different from the rapid, automatic, language-independent physical reasoning that humans employ.
VII. Tsinghua Survey: From LLMs to World Models for Embodied AI
A 2025 survey paper from Tsinghua University, titled "From LLMs to World Models," proposes a Three-Loop Architecture for embodied AI:
Loop 1: Active Perception
The agent does not passively receive sensory data but actively selects what to perceive and from what angle, guided by its current task and internal model. This aligns with the active inference philosophy of predictive coding.
Loop 2: Embodied Cognition
Building on perception, the agent constructs and updates internal representations of its environment. These representations are not a static knowledge base, but a world model that can be propagated forward — capable of predicting the consequences of actions and simulating future states.
Loop 3: Dynamic Interaction
The agent plans and makes decisions based on its internal world model, executes actions in the physical environment, and then uses the resulting feedback to update its perception and cognition. The three loops form a continuous cycle.
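As a toy illustration only (the survey's loops are far richer), the three loops can be collapsed into a 1-D control cycle: observe, update the belief fed to the world model, plan one step ahead with that model, act, and repeat. The environment, action set, and goal below are invented for the sketch.

```python
def run_three_loops(env_state, world_model, goal, steps=5):
    """Toy 1-D perception-cognition-interaction cycle.

    env_state: the true scalar position of the agent.
    world_model: callable (belief, action) -> predicted next position.
    Each iteration: perceive the environment, update the belief that
    feeds the world model, plan one step ahead with the model, act,
    and let the new environment state feed the next perception.
    """
    actions = [-1.0, 0.0, 1.0]
    for _ in range(steps):
        obs = env_state                               # Loop 1: perception
        belief = obs                                  # Loop 2: update internal state
        action = min(actions,                         # Loop 3: plan with the model,
                     key=lambda a: abs(world_model(belief, a) - goal))
        env_state = env_state + action                # ...then act in the world
    return env_state

final = run_three_loops(0.0, lambda s, a: s + a, goal=3.0)
```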
The key insight of this architecture is:
The world model is not a standalone module but is embedded in the continuous cycle of perception, cognition, and interaction. Only through embodied interaction can the world model be effectively learned and calibrated.
VIII. The Relationship Between Intuitive Physics and World Models
Revisiting the discussion of world models from the article on prior knowledge:
A world model is not about "knowing many facts" — it is an internal representation of the world's state that can simulate how it changes over time and in response to actions.
From this definition, intuitive physics is precisely a domain-specific world model — a mental simulator specialized for the dynamics of the physical world.
| Concept | Core Question | Scope |
|---|---|---|
| World Model | How does the world state change over time and in response to actions? | General (physical, social, abstract) |
| Intuitive Physics | How do objects move and interact under physical laws? | Physical world |
| Intuitive Psychology | What beliefs, intentions, and goals do others have? | Social world |
Tenenbaum's research spans both intuitive physics and intuitive psychology because they share the same cognitive framework: using internal simulation to predict and explain the external world. The only difference is that one simulates physical processes while the other simulates the mental states of others.
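To make the definition concrete, here is a deliberately tiny domain-specific world model: a state that rolls forward under gravity with a solid floor. The physics is a crude Euler step with invented constants; the point is only the interface, a state plus a step function that predicts the next state.

```python
class BallModel:
    """Intuitive physics as a tiny domain-specific world model: the
    state is (height, velocity) of a ball, and step() rolls it forward
    one tick under constant gravity, with a solid floor at height 0."""
    G, DT = -9.8, 0.1  # gravity and timestep, arbitrary toy values

    def step(self, state):
        h, v = state
        v = v + self.G * self.DT
        h = h + v * self.DT
        if h <= 0.0:   # solidity: the ball cannot pass through the floor
            h, v = 0.0, 0.0
        return (h, v)

def rollout(model, state, n_steps):
    """Mental simulation in miniature: predict n_steps into the future."""
    for _ in range(n_steps):
        state = model.step(state)
    return state

h_final, v_final = rollout(BallModel(), (2.0, 0.0), 50)
```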
IX. Why Intuitive Physics Is Critical for AI
Safe Navigation and Manipulation
A robot without intuitive physics does not know that a glass near the edge of a table risks falling, does not understand the stability conditions for stacking heavy objects, and cannot anticipate what a fast-moving object will collide with. These judgments, self-evident to humans, are blind spots for AI systems that lack physical intuition.
Foundation of Commonsense Reasoning
Everyday human conversation is replete with physical metaphors and default assumptions. When someone says "set this down securely," it implies an intuitive understanding of gravity, support surfaces, and friction. AI systems lacking these intuitions will make repeated errors when interpreting and executing natural language instructions.
Key to Few-Shot Learning
As a strong prior, intuitive physics dramatically compresses the amount of data needed for learning. As argued in the article on prior knowledge: a system equipped with built-in physical intuition needs only minimal interaction to understand object behavior in novel scenes. Without such priors, a system requires massive data to learn each physical regularity from scratch.
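The data-compression effect of a prior is easy to demonstrate with a toy Bayesian update over a discrete hypothesis space. The hypotheses, prior weights, and likelihoods below are all made up for illustration; the point is that a concentrated prior lets a single noisy observation yield a near-certain posterior, while a flat prior leaves far more residual uncertainty.

```python
def posterior(prior, likelihoods):
    """Bayes' rule on a discrete hypothesis space: normalize prior x likelihood."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy hypothesis space: which way do unsupported objects accelerate?
# Hypotheses: ['down', 'up', 'left', 'right'] (an invented discretization).
structured_prior = [0.97, 0.01, 0.01, 0.01]  # strong evolution-style prior
flat_prior = [0.25, 0.25, 0.25, 0.25]        # blank-slate prior

# Likelihood of one noisy observation of a falling object under each
# hypothesis (numbers made up for illustration):
lik = [0.8, 0.05, 0.075, 0.075]

with_prior = posterior(structured_prior, lik)
without_prior = posterior(flat_prior, lik)
```

After a single observation, the structured prior already concentrates about 99.7% of posterior mass on "down," while the flat prior reaches only 80%; the prior did most of the compression.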
X. Philosophical Implications: The Necessity of Prior Knowledge
The existence of intuitive physics provides important evidence for a central debate in AI research.
Pure Learning vs. Structural Priors
| Position | Core Claim | Proponents |
|---|---|---|
| Pure Learning | Sufficiently large models + sufficient data can learn everything | Advocates of extreme Scaling-Law extrapolation |
| Structural Priors | Certain world knowledge must be explicitly injected via architecture or priors | LeCun, Tenenbaum, Bengio |
Developmental psychology evidence on intuitive physics strongly supports the latter:
- Infants demonstrate physical reasoning abilities at a stage when they have almost no active manipulation experience
- The developmental timetable for these abilities is highly consistent across cultures
- Certain physical principles (e.g., object permanence) can even be observed in non-human primates
This implies:
Certain knowledge about the physical world is very likely not learned from scratch, but rather pre-encoded into the cognitive system by evolution as inductive biases. If AI systems aspire to similar physical reasoning ability and sample efficiency, a purely "data-driven" approach may be insufficient — certain structural priors need to be explicitly injected.
This does not deny the role of learning, but rather emphasizes that priors and learning are not opposed but complementary. The right priors make learning more efficient, while rich learning experience allows priors to be flexibly applied in novel situations.
XI. Summary
Intuitive physics is one of the most fundamental, earliest-developing, and most easily overlooked capabilities in human cognition. It is not advanced physics knowledge, but rather a set of low-level priors about how the physical world operates — a concrete instantiation of a world model in the physical domain.
The complete logical chain:
- Humans innately possess intuitive physics, enabling rapid reasoning about object permanence, gravity, solidity, spatiotemporal continuity, and more.
- This ability is a structural prior endowed by evolution, continuous with the brain's broader system of prior knowledge.
- Tenenbaum models intuitive physics using probabilistic programs, demonstrating that the human brain can be understood as a reasoning system running approximate physical simulations.
- Benchmarks such as IntPhys 2 reveal a fundamental deficiency in current AI: models with powerful visual capabilities perform near chance on basic physical reasoning.
- Teams at DeepMind, Tsinghua, and elsewhere are exploring remedial paths: combining reasoning capabilities with embodied interaction to build perception-cognition-interaction loop architectures.
- Intuitive physics is essentially a world model for the physical domain — a mental simulator capable of modeling physical dynamics.
- This carries profound philosophical implications for AI: certain world knowledge may need to be explicitly injected as structural priors, rather than relying entirely on data-driven learning.