Overview of Embodied Intelligence

Introduction

Embodied Intelligence (Embodied AI) studies how to enable agents to perceive, understand, plan, and act through a body in physical or virtual environments. It spans multiple fields including AI, robotics, cognitive science, control theory, computer vision, and neuroscience, and has grown into an independent, expansive discipline with a complete theoretical framework, technology stack, and industrial ecosystem.

In earlier research frameworks, embodied intelligence was often simply classified as "a subset of AI Agents." This categorization had some validity in the early stages of the field — at that time, embodied intelligence research was limited in scale and often represented just one application scenario within the general AI Agent framework. However, with the explosive growth of foundation models, robot hardware, simulation technologies, and large-scale data collection, embodied intelligence has evolved into a first-class discipline with its own unique problem definitions, methodologies, and evaluation systems. Viewing embodied intelligence merely as a subset of AI Agents is like viewing computer vision as a subset of signal processing — there are historical connections, but they no longer reflect the actual scale and independence of the discipline.

A more accurate understanding is that embodied intelligence and AI Agents are two intersecting but independently standing research directions. They share some foundational concepts (such as planning, memory, tool use), but differ fundamentally in core problems, technology stacks, and evaluation methods:

| Dimension | AI Agent | Embodied Intelligence |
| --- | --- | --- |
| Core Medium | Software programs | Physical entities with bodies (robots/virtual characters) |
| Interaction Space | Digital environments (APIs, web, code) | Physical/virtual 3D space |
| Core Challenges | Reasoning, planning, tool orchestration | Perception-action loops, continuous control, physical interaction |
| Time Scale | Discrete task steps | Continuous real-time control (millisecond-level) |
| Safety Constraints | Software-level (reversible) | Physical-level (irreversible, safety risks) |
| Data Modalities | Primarily text and structured data | Vision, tactile, torque, proprioception, and other multimodal data |
| Evaluation Methods | Task completion rate, accuracy | Success rate + physical metrics (force, precision, speed) |

For detailed discussion of AI Agents (task-oriented software agents), please refer to the AI Agents section of this site.


What is Embodied Intelligence

Definition

Embodied intelligence refers to an agent that exists within an environment through some form of body, is capable of perception and action through that body, and generates cognition and behavior through dynamic coupling with the environment.

This definition contains three indispensable elements:

  1. Body: The agent must possess an entity that exists in space, whether physical or virtual
  2. Environment: The body must be situated in a rule-governed environment (physical world or simulated world)
  3. Coupling: There must be a bidirectional causal relationship between the body and the environment — the agent influences the environment through its body, and the environment constrains and influences the agent

Without any one of these elements, it does not constitute embodied intelligence. ChatGPT lacks a body — it is not embodied intelligence. A stationary camera has a "body" but lacks action capability and causal coupling — it is not embodied intelligence either. An NPC in a game that can perceive, move, and interact, even though its body is virtual, is embodied intelligence.

The Essence of the Body

The concept of body is what fundamentally distinguishes embodied intelligence from other AI directions. Varela offered a profound insight in The Embodied Mind:

The body is not an input device; it is the "mode" through which cognition occurs.

This means the body is not merely a shell for sensors or a carrier for actuators, but a necessary condition for intelligence to emerge. Specifically, a "body" should possess the following core characteristics:

  • The body must exist in some space (real or virtual)
  • The body has spatial attributes such as size and position
  • The body can perceive (Perception) — obtaining information from the environment
  • The body can act (Action) — exerting influence on the environment
  • The body is constrained by environmental rules (physical laws, energy consumption, collisions, dynamics)
  • The body can affect the environment and be affected by it — i.e., causal coupling

Based on these characteristics, we can clearly determine the nature of different entities:

| Entity | Spatial Existence | Perception | Action | Causal Coupling | Embodied? |
| --- | --- | --- | --- | --- | --- |
| ChatGPT | No | No | No | No | No |
| Web Agent | No | Limited | Limited | No | No |
| Game NPC | Yes (virtual) | Yes | Yes | Yes | Yes |
| Industrial Arm | Yes | Yes | Yes | Yes | Yes |
| Humanoid Robot | Yes | Yes | Yes | Yes | Yes |
| AR Glasses Assistant | Borderline | Yes | Limited | Limited | Borderline |

The Embodied Nature of Intelligence

Embodied cognition is the philosophical foundation of embodied intelligence, with core claims originating from the work of Merleau-Ponty and Varela:

Intelligence is not a computational process occurring in isolation within the brain; it emerges through the continuous coupling of body and world.

This idea has profound implications for artificial intelligence. Traditional AI (symbolicism, pure statistical learning) views intelligence as an input-compute-output information processing pipeline, with the body as an optional peripheral. But the embodied cognition perspective holds that the form, capabilities, and limitations of the body itself shape the nature of intelligence — the "hardware fact" that a hand has five fingers determines how humans manipulate objects, which in turn influences human cognition about tools and invention.

This is known as the "4E Cognition" framework:

  • Embodied: Cognition depends on the physical structure of the body
  • Embedded: Cognition is situated in specific environmental contexts
  • Enacted: Cognition arises through active interaction with the environment
  • Extended: Cognition can extend to tools and environments beyond the body

For detailed theoretical exploration of 4E cognition, see Embodied Cognition Theory.


Historical Development of Embodied Intelligence

The development of embodied intelligence can be traced to the convergence of multiple disciplines, evolving from philosophical speculation to engineering practice:

Philosophical and Cognitive Science Foundations (1940s-1990s)

  • Birth of Cybernetics (1948): Norbert Wiener proposed cybernetics, placing feedback loops as the core framework for understanding biological and machine behavior, laying the intellectual foundation for the perception-action loop
  • Ecological Psychology (1960s-1970s): Gibson proposed ecological psychology and affordance theory, arguing that perception is not passive reception but active exploration in service of action
  • Embodied Cognition Philosophy (1991): Varela et al. published The Embodied Mind, formally proposing the theoretical framework of embodied cognition and challenging the traditional "brain-centric" view

Behavior-Based Robotics (1986-2000s)

  • Brooks' Subsumption Architecture (1986): Rodney Brooks at MIT proposed a behavior-based robot architecture that required no internal representations, arguing that "the world is its own best model," directly mapping perception to action. This was the first engineering realization of embodied intelligence
  • Early Humanoid Robots: Honda's ASIMO (2000) and other early humanoid robots demonstrated the possibility of physical embodiment, but relied on manual programming rather than learning

The Deep Learning Turning Point (2013-2020)

  • Deep Reinforcement Learning Breakthroughs: DQN (2013), AlphaGo (2016) demonstrated the power of end-to-end learning
  • Simulation Platform Maturation: MuJoCo, OpenAI Gym/Gymnasium, Isaac Gym and other simulation environments enabled large-scale robot learning
  • Sim2Real Transfer: Domain randomization and other techniques enabled policies trained in simulation to transfer to the real world
  • Imitation Learning Revival: DAgger, GAIL and other algorithms enabled efficient robot learning from human demonstrations

The Foundation Model Era (2022-Present)

  • Vision-Language-Action Models (VLA): RT-1 (2022), RT-2 (2023) brought the capabilities of large language models and vision models into robot control
  • Diffusion Policy: Applying diffusion models to action generation, achieving breakthrough progress on manipulation tasks
  • General Robot Foundation Models: Octo, OpenVLA, pi0 and others exploring cross-robot, cross-task general policies
  • Large-Scale Datasets: Open X-Embodiment, DROID and other projects driving the scaling of robot data
  • Humanoid Robot Wave: Figure, Tesla Optimus, Unitree and others pushing humanoid robots from labs toward industrialization
  • World Models and Video Generation: Sora and other video generation models demonstrating potential as world simulators, providing new paths for planning and imagination capabilities in embodied intelligence

For a more detailed timeline of milestones, see Embodied Intelligence Milestones.


Core Pillars of Embodied Intelligence

As a complete system, embodied intelligence can be decomposed into six core pillars. Understanding these six pillars is key to understanding the entire discipline:

┌──────────────────────────────────────────────────────────┐
│               Embodied Intelligence System               │
│                                                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐          │
│  │ Perception │  │ World Model│  │  Planning  │          │
│  │            │  │            │  │ & Decision │          │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘          │
│        │               │               │                 │
│        └───────────────┼───────────────┘                 │
│                        │                                 │
│  ┌────────────┐  ┌─────┴──────┐  ┌────────────┐          │
│  │   Action   │  │  Learning  │  │   Memory   │          │
│  │ & Control  │  │            │  │            │          │
│  └────────────┘  └────────────┘  └────────────┘          │
│                                                          │
│  ──────── Body + Environment (Physical/Virtual) ──────── │
└──────────────────────────────────────────────────────────┘

1. Perception

Perception is the first step in an embodied agent's interaction with the environment — transforming raw sensor data into structured environmental representations. Unlike pure visual AI, embodied perception must serve action and must be real-time, multimodal, and three-dimensional.

Core technologies of multimodal perception:

  • Visual Perception: RGB cameras and depth cameras provide rich environmental information. Current mainstream visual encoders include foundation models like CLIP, DINOv2, and SigLIP, as well as detection and segmentation models like SAM and Grounding DINO
  • 3D Spatial Perception: Reconstructing 3D understanding from 2D images, including technologies like NeRF, 3D Gaussian Splatting, and point cloud processing. Critical for grasping, navigation, and other tasks
  • Tactile Perception: Tactile sensors (e.g., GelSight) provide contact force, deformation, and texture information, indispensable for fine manipulation
  • Proprioception: Joint angles, torques, IMUs and other sensors provide information about the robot's own state
  • Video Understanding: Understanding motion, causal relationships, and temporal changes in dynamic scenes is a key capability distinguishing embodied perception from static image understanding
  • Multimodal Fusion: How to unify the encoding of vision, tactile, auditory, proprioceptive and other modalities is an important current research direction
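
How fusion is done varies widely; the simplest baseline is late fusion, i.e., normalizing and concatenating per-modality feature vectors into one policy observation. The sketch below assumes visual features have already been extracted by some encoder (so the encoder choice is abstracted away), and all dimensions are made up for illustration:

```python
import numpy as np

def fuse_observation(rgb_features: np.ndarray,
                     tactile: np.ndarray,
                     proprioception: np.ndarray) -> np.ndarray:
    """Concatenate per-modality features into one observation vector.

    rgb_features   : output of a frozen visual encoder (e.g. a CLIP/DINOv2
                     embedding), shape (d_vis,)
    tactile        : flattened tactile reading, shape (d_tac,)
    proprioception : joint angles / velocities / IMU, shape (d_prop,)
    """
    parts = []
    for x in (rgb_features, tactile, proprioception):
        x = np.asarray(x, dtype=np.float32).ravel()
        # Normalize each modality separately so no single sensor dominates.
        parts.append(x / (np.linalg.norm(x) + 1e-8))
    return np.concatenate(parts)

# Hypothetical usage: 512-d visual embedding, 3x3 tactile patch, 7-DoF arm state.
obs = fuse_observation(np.random.randn(512), np.random.randn(3, 3), np.random.randn(14))
print(obs.shape)  # (512 + 9 + 14,) = (535,)
```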

The current trend is vision-dominant. Visual input has the highest bandwidth and richest information, which is why many autonomous driving companies have been able to smoothly transition into humanoid robotics — they have deep expertise in visual processing.

2. World Model

The world model is the core mechanism by which embodied agents create internal representations of their environment. It is not passive recording, but an active internal simulator that serves reasoning, prediction, planning, and action.

Embodied intelligence must rely on world models to understand the environment, plan actions, and predict consequences.

Humans innately possess rich world models: infants as young as three months can begin to understand object permanence and know that walls cannot be passed through; soon after, they can predict that objects will fall and infer others' intentions. Embodied agents similarly need this capability.

World models can be divided into two major categories:

Physical World Model

Focuses on the structure and physical laws of the environment, containing four core elements:

  • Objects: Properties of objects — shape, size, color, material, mass
  • Space: Spatial relationships between objects — position, distance, adjacency, containment
  • Dynamics: Laws of change in the environment — motion, collision, fluid, deformation
  • Causality: Causal relationships between actions and outcomes — pushing a cup will topple it, pressing a switch turns on the light

The physical world model enables embodied agents to:

  • Roll forward: Imagine how the environment will change based on candidate actions
  • Evaluate: Score imagined futures
  • Execute: Select the optimal plan and execute, then re-plan based on new observations (i.e., MPC, Model Predictive Control)
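
A minimal sketch of this imagine-evaluate-execute loop, using random-shooting MPC; the `dynamics` and `reward` callables are placeholders for a learned world model and a task-specific scoring function:

```python
import numpy as np

def plan_with_world_model(state, dynamics, reward,
                          horizon=10, n_candidates=256, action_dim=7):
    """Random-shooting MPC: imagine rollouts with a world model, score them,
    and return the first action of the best candidate.

    dynamics(state, action) -> next_state   (learned transition model, assumed given)
    reward(state)           -> float        (task-specific score, assumed given)
    """
    best_score, best_plan = -np.inf, None
    for _ in range(n_candidates):
        plan = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))  # candidate action sequence
        s, score = state, 0.0
        for a in plan:              # "roll forward": imagine how the environment changes
            s = dynamics(s, a)
            score += reward(s)      # "evaluate": score the imagined future
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan[0]             # "execute": apply only the first action, then re-plan
```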

Mental World Model

Focuses on human mental states and social interactions. For embodied agents that need to collaborate with humans (such as household service robots), this layer is crucial:

  • Intentions: Understanding human goals and motivations
  • Emotions: Perceiving users' emotional states
  • Beliefs: Understanding others' cognition about the world (which may differ from reality)
  • Social norms: Understanding cultural customs, etiquette, and interpersonal relationships

The core capability of the mental world model is Theory of Mind (ToM) — the ability to understand that others have independent thoughts, beliefs, and intentions. A household service robot needs to understand not only that "a cup dropped on the floor will break" (physics) but also that "the owner may be sad that the cup broke, or perhaps deliberately smashed it to vent frustration" (mental). Without this understanding, a robot is merely a cold actuator.

Technical Implementations of World Models

Current technical approaches to world models include:

| Approach | Core Idea | Representative Work |
| --- | --- | --- |
| Learned Dynamics Models | Neural networks learning state transition functions | RSSM, Dreamer series |
| Video Prediction Models | Generating future frames as imagination | Sora, UniSim |
| Physics Simulators | White-box physics engines | MuJoCo, Isaac Sim |
| Neural Implicit Representations | Representing scenes with continuous functions | NeRF, 3DGS |
| JEPA Framework | Predicting future representations in abstract space | V-JEPA (Yann LeCun) |
| LLM-based Common Sense | Leveraging language model world knowledge | SayCan, Inner Monologue |

For detailed theoretical discussion of world models, see Representations and World Models. For model and algorithm implementations, see World Models.

3. Planning and Decision Making

Planning is the process of decomposing high-level goals into executable action sequences. Embodied agents face unique planning challenges: they must make decisions in continuous physical space, under uncertainty, and with real-time constraints.

Planning in embodied intelligence typically operates at multiple hierarchical levels, analogous to the hierarchical structure of the human brain:

| Level | Analogy | Focus | Time Scale | Typical Technologies |
| --- | --- | --- | --- | --- |
| Task Planning | Cerebral cortex | "What to do" — task decomposition, logical sequencing | Seconds to minutes | LLM/VLM, PDDL, HTN |
| Motion Planning | Cerebellum | "How to move" — trajectory generation, obstacle avoidance | Milliseconds to seconds | RRT*, trajectory optimization, MPC |
| Low-level Control | Spinal cord/muscles | "How much torque" — joint actuation | Milliseconds | PID, impedance control, inverse dynamics |
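
The interplay of the three levels can be sketched as nested loops running at very different rates; every callable below (`llm_plan`, `motion_plan`, `low_level_torque`, `get_obs`, `send_torque`) is a placeholder standing in for a real component:

```python
def run_task(goal, llm_plan, motion_plan, low_level_torque, get_obs, send_torque,
             control_hz=1000, replan_hz=10):
    """Sketch of the three planning levels running at different rates.

    llm_plan(goal)                -> list of subtask strings     (task level, seconds)
    motion_plan(subtask, obs)     -> desired joint trajectory    (motion level, ~replan_hz)
    low_level_torque(obs, target) -> joint torques               (control level, ~control_hz)
    """
    for subtask in llm_plan(goal):                          # task level: "what to do"
        done = False
        while not done:
            obs = get_obs()
            traj = motion_plan(subtask, obs)                # motion level: "how to move"
            steps_per_replan = control_hz // replan_hz
            for target in traj[:steps_per_replan]:          # track a short horizon, then re-plan
                tau = low_level_torque(get_obs(), target)   # control level: "how much torque"
                send_torque(tau)
            done = len(traj) <= steps_per_replan            # crude termination check, for the sketch only
```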

The core challenge of high-level planning (task level) is that real-world activities are endlessly varied. Even "stir-frying" alone involves chopping, seasoning, heat control, ingredient states, and kitchen space utilization. The introduction of LLMs has brought breakthroughs in task planning — using language model common sense knowledge for task decomposition and high-level reasoning (e.g., SayCan, Code as Policies). But how to reliably translate language-level abstract plans into physical-level precise actions remains an open problem.

Progress in low-level planning (motion level) has been more significant. Traditional robots relied primarily on proprioception (joint angles, etc.), but the current trend is to use visual input. Vision-driven motion planning enables robots to handle more complex and uncertain environments.

Unifying task and motion planning (TAMP) is an important current research direction — how to make high-level symbolic reasoning and low-level geometric constraints work together within a single framework.

For detailed planning theory, see Task and Motion Planning Theory.

4. Action and Control

Action is the core characteristic that distinguishes embodied intelligence from all other AI directions. Embodied agents must not only "think" but also "do" — translating decisions into actual physical movements.

Manipulation is one of the core tasks in current embodied intelligence research, involving:

  • Grasping: From simple parallel-jaw grasps to dexterous hand manipulation
  • Placement and Assembly: Precise pose alignment and force control
  • Tool Use: Understanding and using human tools
  • Deformable Object Manipulation: Cloth folding, rope manipulation, and other non-rigid tasks

Locomotion is another core task:

  • Bipedal Walking: Balance and walking control for humanoid robots
  • Quadruped Motion: Robust movement over complex terrain
  • Whole-body Coordination: Simultaneous locomotion and manipulation (loco-manipulation)

Control theory provides the mathematical foundation for action:

  • Classical control: PID, impedance control, hybrid force/position control
  • Optimal control: LQR, MPC
  • Learned control policies: End-to-end control based on RL or imitation learning
  • Whole-Body Control (WBC): Multi-task balancing and coordination for humanoid robots
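
As a concrete example of the classical layer, a minimal single-joint PID controller might look like the following; the gains and the toy plant model are purely illustrative:

```python
class PID:
    """Minimal joint-space PID controller (one of the classical options listed above)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical usage: drive one joint toward 0.5 rad at a 1 kHz control rate.
pid = PID(kp=50.0, ki=0.5, kd=2.0, dt=0.001)
q, dq = 0.0, 0.0
for _ in range(1000):
    torque = pid.step(target=0.5, measured=q)
    dq += torque * 0.001          # toy unit-inertia dynamics, for illustration only
    q += dq * 0.001
print(round(q, 3))                # q is converging toward the 0.5 rad target
```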

For robotics fundamentals (kinematics, dynamics, control), see Robotics Fundamentals.

5. Learning

Learning capability is what transforms embodied intelligence from "programmed robots" to "intelligent robots." The main learning paradigms include:

Imitation Learning: Learning policies from human demonstrations

  • Behavioral Cloning (BC): Directly imitating actions
  • DAgger: Interactive correction
  • ACT, Diffusion Policy: Current state-of-the-art imitation learning methods
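
At its core, behavioral cloning is supervised learning on demonstration pairs. A deliberately simple sketch using a linear policy fit by ridge regression (real systems such as ACT or Diffusion Policy use neural networks, but the data flow is the same):

```python
import numpy as np

def behavioral_cloning(observations, actions, l2=1e-3):
    """Fit a linear policy a = W·obs by ridge regression on demonstrations.

    observations : (N, d_obs) array of demo observations
    actions      : (N, d_act) array of the expert's actions
    """
    X, Y = np.asarray(observations), np.asarray(actions)
    # Closed-form ridge solution: W = (XᵀX + λI)⁻¹ XᵀY
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return lambda obs: np.asarray(obs) @ W   # the cloned policy

# Hypothetical usage with synthetic "demonstrations".
demo_obs = np.random.randn(500, 10)
demo_act = demo_obs @ np.random.randn(10, 7)          # pretend expert
policy = behavioral_cloning(demo_obs, demo_act)
print(policy(np.random.randn(10)).shape)              # (7,)
```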

Reinforcement Learning (RL): Learning through trial-and-error and reward signals

  • Large-scale parallel simulation training (e.g., thousands of robots training simultaneously in Isaac Gym)
  • Reward engineering: Designing appropriate reward functions
  • Sim2Real transfer: Deploying simulation-trained policies

Foundation Model-Driven Learning: Leveraging knowledge from pre-trained large models

  • VLA models (Vision-Language-Action): Unifying visual understanding, language understanding, and action generation in a single model
  • Diffusion Policy: Using diffusion models to generate action sequences
  • Language-conditioned learning: Driving robot behavior with natural language instructions

Data is the core bottleneck. Unlike NLP and CV fields that can obtain massive data from the internet, robot data collection costs are extremely high (requiring real physical interactions). Current main solutions:

  • Teleoperation: ALOHA, UMI, and other systems
  • Large-scale datasets: Open X-Embodiment, DROID, Bridge Data V2
  • Simulation data + domain randomization
  • Video data utilization: Learning physical common sense and manipulation skills from internet videos
  • Synthetic data augmentation: Using video generation models to expand training data

See the Robot Learning section for details.

6. Memory

Memory capability is the foundation of advanced intelligence. A robot without memory can only respond to the current moment's stimuli — it cannot perform long-horizon tasks, improve from experience, or provide personalized service.

Memory in embodied agents can be organized at the following levels:

  • Fixed Memory: Neural network weights. Acquired through pre-training, constant during inference. Stores general world knowledge and skills. Difficult to update — requires fine-tuning and faces catastrophic forgetting
  • Working Memory: Activation states during model inference. In sequence models, the KV-cache of attention mechanisms serves as working memory. Capacity is limited by context window
  • External Memory: Information stored outside the model architecture, accessed through retrieval mechanisms. Includes vector databases (RAG), knowledge graphs, experience replay buffers, etc.
  • Episodic Memory: Records the agent's specific experiences and interaction episodes. This is the memory form closest to human "recollection," crucial for personalization and lifelong learning

Episodic memory is considered the future direction for embodied intelligence memory systems. It must satisfy:

  • Personalization: Building customized memories for specific users and environments
  • Lifelong Learning: Memory that can grow continuously, but at a rate slower than the growth of interactions (i.e., requiring effective compression and forgetting mechanisms)
  • Retrievability: The ability to quickly retrieve relevant memories based on current context
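
A toy sketch of such a store: episodes are embedded, retrieved by cosine similarity, and forgotten beyond a capacity cap (a crude stand-in for real compression mechanisms); the `embed` function is an assumed placeholder for any encoder:

```python
import numpy as np

class EpisodicMemory:
    """Toy episodic store: embed each episode, retrieve by cosine similarity,
    keep a hard capacity cap as a stand-in for compression/forgetting."""

    def __init__(self, embed, capacity=1000):
        self.embed = embed            # embed(episode) -> 1-D feature vector (placeholder)
        self.capacity = capacity
        self.keys, self.episodes = [], []

    def store(self, episode):
        self.keys.append(self.embed(episode))
        self.episodes.append(episode)
        if len(self.episodes) > self.capacity:     # forget the oldest entry
            self.keys.pop(0)
            self.episodes.pop(0)

    def retrieve(self, query, k=3):
        q = self.embed(query)
        sims = [float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-8))
                for key in self.keys]
        top = np.argsort(sims)[::-1][:k]           # k most similar past episodes
        return [self.episodes[i] for i in top]
```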

Current Transformer architectures have fundamental limitations in memory — the linear growth of KV-cache cannot support lifelong interaction. The future direction is not simply scaling up context windows, but developing dynamic, compressible internal representations that enable agents to continuously learn and evolve over long lifespans.


Technical Paradigms of Embodied Intelligence

Paradigm 1: From Modular to End-to-End

The architecture of embodied intelligence systems has evolved from purely modular to end-to-end, and then to hybrid architectures:

Modular Architecture: Splits the system into independent modules such as perception, planning, and control, each developed independently. Pros: interpretable, debuggable, easy to incorporate safety constraints. Cons: error accumulation and information bottlenecks between modules.

End-to-End Architecture: Directly maps raw observations to actions, with a single neural network handling all processing. Pros: avoids information loss, stronger generalization. Cons: poor interpretability, large data requirements, difficult to guarantee safety.

Hybrid Architecture (Current Mainstream):

  • High level: LLM/VLM for task understanding and decomposition
  • Mid level: Learned policies or traditional planners for trajectory generation
  • Low level: Classical controllers for safety and precision

For detailed architecture comparison, see Embodied Intelligence Roadmap.

Paradigm 2: Foundation Model-Driven Embodied Intelligence

The success of large language models and vision-language models has profoundly changed the research paradigm of embodied intelligence. Core technical directions include:

Vision-Language-Action Models (VLA):

VLA unifies visual understanding, language understanding, and action generation in a single model, and is currently the most watched technical direction. Representative work includes:

| Model | Institution | Core Innovation |
| --- | --- | --- |
| RT-1 | Google | First large-scale robotics Transformer |
| RT-2 | Google | VLM directly outputting action tokens |
| Octo | UC Berkeley | Cross-robot general policy |
| OpenVLA | Stanford et al. | Open-source VLA |
| pi0 | Physical Intelligence | VLM + Flow Matching action head |
| RoboCasa | UT Austin | Large-scale simulation data training |

Diffusion Policy:

Applies diffusion generative models to action sequence generation. Diffusion policies naturally support multimodal action distributions (the same task may have multiple reasonable completion methods), achieving excellent performance on manipulation tasks.
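
Structurally, inference works by iteratively denoising a noise-initialized action sequence conditioned on the current observation. A DDPM-style sampling sketch, where `eps_model` is a placeholder for the trained noise-prediction network and the schedule values are illustrative:

```python
import numpy as np

def sample_action_chunk(obs, eps_model, horizon=16, action_dim=7, n_steps=50):
    """Reverse-diffusion sampling of an action sequence.

    eps_model(noisy_actions, obs, t) -> predicted noise, same shape as actions.
    The trained noise predictor and the exact noise schedule are abstracted away;
    this only shows the iterative denoising structure.
    """
    betas = np.linspace(1e-4, 2e-2, n_steps)          # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = np.random.randn(horizon, action_dim)          # start from pure Gaussian noise
    for t in reversed(range(n_steps)):
        eps = eps_model(a, obs, t)                    # network predicts the injected noise
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                     # re-inject noise except at the final step
            a += np.sqrt(betas[t]) * np.random.randn(*a.shape)
    return a                                          # denoised action chunk to execute
```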

LLM as High-Level Planner:

Leveraging LLM common sense knowledge and reasoning capabilities for task decomposition and high-level planning:

  • SayCan: "Grounding" LLM knowledge with robot capabilities
  • Code as Policies: Having LLMs directly generate robot control code
  • Inner Monologue: LLM adjusting plans through multi-round feedback
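
The grounding idea in SayCan can be summarized as multiplying two scores per candidate skill: the LLM's estimate of how useful the skill is for the instruction, and an affordance estimate of whether it can succeed from the current state. A schematic sketch with both scorers left as placeholders:

```python
import numpy as np

def saycan_select(instruction, state, skills, llm_score, affordance_score):
    """SayCan-style skill selection: combine what the LLM thinks is useful
    with what the robot thinks is currently feasible.

    llm_score(instruction, skill)  -> P(skill is a useful next step | instruction)
    affordance_score(state, skill) -> P(skill succeeds from the current state)
    Both scoring functions are placeholders for the language model and the
    learned value/affordance function described in the SayCan paper.
    """
    combined = [llm_score(instruction, s) * affordance_score(state, s) for s in skills]
    return skills[int(np.argmax(combined))]
```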

See the Models and Algorithms section for details.

Paradigm 3: Sim-to-Real (Sim2Real)

Due to the high cost and safety risks of real-world data collection, training in simulation environments and transferring to the real world is a core paradigm of embodied intelligence:

  • Domain Randomization: Randomizing physical parameters, visual appearance, etc. in simulation to make policies robust to the sim-real gap
  • Domain Adaptation: Explicitly reducing distributional differences between simulation and reality
  • Teacher-Student Distillation: Training teacher policies with privileged information in simulation, then distilling to student policies using only real-world-available information
  • Simulator Advances: NVIDIA Isaac Sim/Lab, MuJoCo, Habitat and other simulation platforms continuously improving physical fidelity
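
The core of domain randomization is simply perturbing simulator parameters every episode so the policy never overfits to one set of physics or visuals. A schematic sketch; the parameter names, ranges, and the `train_policy_step` callable are illustrative assumptions rather than any particular simulator's API:

```python
import numpy as np

def randomized_episode(base_params, train_policy_step, rng=np.random.default_rng()):
    """One domain-randomized training episode: perturb simulation parameters,
    then run the usual policy-update step in that perturbed world.

    base_params       : nominal simulation parameters (dict)
    train_policy_step : callable that runs/updates the policy given sim params
    """
    params = dict(base_params)
    params["friction"]   = base_params["friction"]   * rng.uniform(0.5, 1.5)   # physics
    params["mass"]       = base_params["mass"]       * rng.uniform(0.8, 1.2)
    params["motor_gain"] = base_params["motor_gain"] * rng.uniform(0.9, 1.1)
    params["light_hue"]  = rng.uniform(0.0, 1.0)        # visual appearance
    params["camera_dx"]  = rng.normal(0.0, 0.01)        # small camera pose jitter (meters)
    return train_policy_step(params)
```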

See Sim2Real and Real-World Deployment for details.

Paradigm 4: Data Scaling

Data is one of the biggest bottlenecks in current embodied intelligence development. Core progress around data includes:

  • Large-scale Robot Datasets: Open X-Embodiment (aggregating over a million episodes from 22 robot embodiments), DROID (distributed teleoperation)
  • Teleoperation Systems: ALOHA (low-cost bimanual teleoperation), UMI (Universal Manipulation Interface), GELLO — these have lowered data collection barriers
  • Internet Video Mining: Extracting skill knowledge from manipulation videos on YouTube and similar platforms
  • Synthetic Data: Using video generation models to synthesize training data
  • Data Flywheel: Deploy-collect-train-improve closed loops enabling continuous data scale growth

See Teleoperation and Data Collection for details.


Classification of Embodied Agents

Physical Robots (Robotic Agents)

Physical robots are the most direct carriers of embodied intelligence, possessing definitive physical bodies. In the Chinese AI industry, "embodied intelligence" is almost synonymous with robotics. Classification by form factor:

| Form | Characteristics | Typical Scenarios |
| --- | --- | --- |
| Humanoid Robot | Human-like form, most versatile but hardest to control | General household/industrial service |
| Robot Arm | Fixed base, high precision | Manufacturing, laboratories |
| Quadruped Robot | Strong terrain adaptability | Inspection, rescue, exploration |
| Dexterous Hand | Fine manipulation capability | Manipulation research, service |
| Drone | Aerial perspective, rapid deployment | Logistics, surveying, inspection |
| Mobile Manipulation | Mobile base + arm | Warehouse logistics, home service |

Humanoid robots are currently the biggest focus of the industry. Between 2024 and 2026, companies such as Figure, Tesla (Optimus), Unitree, AGIBOT, and Fourier Intelligence have released humanoid robot products in quick succession, backed by enormous capital investment. The core appeal of humanoid robots is versatility — human environments are designed for human bodies, and humanoid robots are naturally compatible with them.

See Robot Forms and Hardware for details.

Virtual Embodied Agents (VEA)

VEAs are agents with virtual bodies that perceive and act in virtual environments. Although they lack physical entities, they possess complete embodied characteristics in virtual space — spatial existence, perception, action, and causal coupling.

By purpose, VEAs can be classified into three categories:

  • Social Agents: Virtual humans in metaverse, virtual assistant, and AI companion scenarios
  • Game-based Agents: NPCs that can autonomously survive and maintain social ecosystems in open-world games
  • Simulation Training Agents: Serving as digital twins of physical robots for policy validation before Sim2Real transfer

VEAs hold important value in embodied intelligence research: they provide a low-cost, safe, massively parallelizable experimental environment. Many embodied intelligence algorithms are first validated in virtual environments before being transferred to the physical world.

See AI Agents — Virtual Embodied Agents section for details.

Boundary Discussion on Wearable Agents

Wearable agents (such as AR glasses, smart gloves) hold great potential for augmenting human capabilities. However, based on our strict definition of "body," wearable agents are closer to intelligent assistive devices rather than embodied agents with independent bodies. They augment the human body rather than possessing their own body. Classifying them as "Quasi-Embodied AI" may be more appropriate.


Core Challenges of Embodied Intelligence

Despite remarkable progress in recent years, embodied intelligence still faces a series of deep challenges:

1. Data Bottleneck

The cost of collecting robot data is far higher than for text and image data. NLP training data can simply be scraped from the web, but robot manipulation data requires real physical interaction — equipment, scenes, personnel, and time are all indispensable. Even the largest robot datasets (like Open X-Embodiment) are minuscule compared to ImageNet or Common Crawl.

2. Insufficient Generalization

Current embodied intelligence systems degrade sharply when facing out-of-distribution scenarios. Changing a cup, a table, or a lighting condition can cause failure. From "being able to grasp 100 objects in a lab" to "being able to cook anything in any kitchen," there is an enormous gap.

3. Long-Horizon Task Planning

The robot's "cerebellum" (low-level motor control) has progressed significantly, but the "cerebrum" (high-level task planning) has been slow. How to reliably decompose high-level, ambiguous instructions like "clean the room" into hundreds of precise physical actions remains without mature solutions. LLMs provide new possibilities, but their reliability and predictability are far from meeting practical needs.

4. Safety and Robustness

Embodied agents act in the physical world, where errors cannot be rolled back. A software bug at most causes a program crash, but a robot control bug can cause physical harm. How to ensure safety while maintaining exploration and learning capabilities is a core challenge for engineering deployment.

5. Hardware Limitations

Despite rapid hardware progress, current robots are far from biological organisms in flexibility, strength, energy efficiency, and cost. The human hand has over 20 degrees of freedom and extremely fine force control capability — no mechanical hand currently comes close to this level.

6. Lack of Evaluation Standards

Unlike NLP (with MMLU, HELM, etc.) and CV (with ImageNet, COCO, etc.), embodied intelligence lacks a widely accepted unified evaluation system. Different labs use different robots, different tasks, and different metrics, making research results difficult to compare laterally.

7. Real-Time Requirements

Embodied agents must complete the full loop from perception to action within millisecond-level timescales. This places extremely high demands on model inference speed, creating tension with the current trend of "bigger is better" in large models. How to balance model capability and inference speed is a core issue in engineering practice.

8. From Lab to Real World

There is a vast gap between the carefully controlled conditions of the laboratory and the chaotic real world. Lighting changes, occlusion, distractors, unpredictable user behavior, equipment wear... the combinatorial explosion of these factors makes real-world deployment extremely challenging.


Relationship with AI Agents

Embodied intelligence and AI Agents share some foundational concepts: planning, memory, tool use, and environment interaction. At a high architectural level, both can be abstracted as the "perceive-reason-act" cycle. Research on cognitive architectures, multi-agent collaboration, and reasoning strategies from the AI Agent field is also inspirational for embodied intelligence.

But the core difference lies in: an AI Agent's "action" is calling APIs and operating software; embodied intelligence's "action" is moving and manipulating objects in physical space. This difference brings fundamentally different technical challenges — continuous control, real-time requirements, safety, multimodal perception, and physical constraints.

There exists an intersection zone — Virtual Embodied Agents (VEA) possess both the software characteristics of AI Agents and the spatial interaction characteristics of embodied intelligence.

Relationship with Traditional Robotics

Traditional robotics focuses on kinematics, dynamics, control theory, motion planning, and other "hardware + control" level problems. Embodied intelligence inherits these foundations but adds the "intelligence" layer on top — learning, reasoning, generalization, and collaboration with humans. One could say that Embodied Intelligence = Robotics + AI, representing the deep fusion of traditional robotics with modern AI methods.

Relationship with Cognitive Science

The theoretical foundation of embodied intelligence is deeply influenced by cognitive science. Concepts like the 4E cognition framework, intuitive physics, and Theory of Mind come directly from cognitive science research. Conversely, embodied intelligence also provides computational experimental platforms for cognitive science — testing hypotheses about human cognition by building and testing embodied intelligence models.

Relationship with Computer Vision

3D vision, scene understanding, object detection, pose estimation, and other CV technologies are core to embodied perception. But embodied vision has unique requirements: it must serve action (not for "seeing" but for "doing"), must be real-time, and must handle occlusion and viewpoint changes during interaction.

Relationship with Autonomous Driving

Autonomous driving can be viewed as a specialized instance of embodied intelligence — the vehicle is the "body," the road is the "environment," and driving is the "task." The two fields share perception, planning, and control technology stacks, but differ significantly in manipulation complexity (driving vs. dexterous manipulation) and environment structure. Many autonomous driving companies (like Tesla) are transitioning toward general embodied intelligence (humanoid robots), with technology transfer serving as an important driving force.


The embodied intelligence section of this site is organized as follows, with each subsection containing independent detailed notes:

| Section | Content Summary | Target Audience |
| --- | --- | --- |
| Overview | Survey, roadmap, milestones, key conferences | Everyone |
| Theory | Embodied cognition, perception-action loops, world model theory, paper readings | Researchers |
| Robotics Fundamentals | Kinematics, dynamics, motion planning, control, SLAM | Learners |
| Robot Learning | Imitation learning, RL, Sim2Real, diffusion policy, data collection | Algorithm researchers/engineers |
| Models and Algorithms | VLA, world models, foundation models, open-source models | Algorithm researchers |
| Simulation & Software Development | Simulation platforms, simulation assets, world building, ROS2, NVIDIA ecosystem, dev frameworks | Engineers |
| Hardware | Sensors, actuators, compute platforms, open-source hardware | Hardware engineers |
| Robot Forms | Humanoid, quadruped, dexterous hands, drones | Everyone |
| Deployment | Sim2Real deployment, safety, calibration, real-time systems | Engineers |
| Industry | Company landscape, market research, industry trends | Industry observers |

Additionally, this site has sections closely related to embodied intelligence:

  • AI Agents: Software-level agent technologies, intersecting with embodied intelligence's high-level planning
  • Reinforcement Learning: One of the core learning paradigms for embodied intelligence
  • Deep Learning: Foundation models, visual encoders, and other underlying technologies
  • Robot Engineering: More engineering-practice-oriented robotics technical details

References

  • Varela, F. J., Thompson, E., & Rosch, E. (1991). The Embodied Mind: Cognitive Science and Human Experience. MIT Press.
  • Brooks, R. A. (1986). A Robust Layered Control System for a Mobile Robot. IEEE Journal on Robotics and Automation.
  • Gibson, J. J. (1979). The Ecological Approach to Visual Perception. Houghton Mifflin.
  • Merleau-Ponty, M. (1945). Phénoménologie de la perception. Gallimard.
  • Meta AI Research. (2025). Embodied AI Agents: Modeling the World.
  • Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale.
  • Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
  • Chi, C. et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
  • Black, K. et al. (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control.
  • Open X-Embodiment Collaboration. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
  • Ahn, M. et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan).
  • LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence (JEPA framework).
  • Sclar, M. et al. (2024). ExploreToM: A Benchmark for Theory of Mind Reasoning.
