Milestones in Embodied Intelligence
Overview
The development of embodied intelligence spans more than half a century, from early symbolic-reasoning robots to today's foundation-model-driven general-purpose robots. This article traces the key milestones along a timeline, analyzing the technical innovations behind each breakthrough and their impact on the field.
Timeline Overview
```mermaid
timeline
    title History of Embodied Intelligence
    section Early Period (1960s-1990s)
        1969 : Shakey - First general-purpose mobile robot
        1973 : WABOT-1 - First full-scale humanoid robot
        1979 : Stanford Cart - Vision-based navigation pioneer
    section Growth Period (2000s-2010s)
        2000 : ASIMO - Humanoid bipedal walking
        2005 : BigDog - Dynamic quadruped balancing
        2015 : DRC - Disaster response robot competition
    section Explosion Period (2019-Present)
        2019 : OpenAI Rubik's Cube - Dexterous manipulation + Sim2Real
        2022 : RT-1 - Large-scale robot learning
        2023 : RT-2 - VLM-to-VLA transfer
        2024 : Open X-Embodiment + pi0
```
1. Shakey (1969) -- The Dawn of General-Purpose Mobile Robots
Background
Developed by SRI International, Shakey was the world's first general-purpose mobile robot capable of reasoning about its own actions.
Technical Innovations
- STRIPS Planner: The first automated planning system, introducing the precondition-effect formalism for actions
- Perception-Reasoning-Action Loop: Combined AI planning with physical world execution
- Vision-Based Navigation: Used a television camera and bump sensors for environment perception
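The precondition-effect formalism at the heart of STRIPS can be sketched in a few lines. This is an illustrative toy domain, not Shakey's actual code; the operator and predicate names are invented.

```python
# Minimal STRIPS-style operator: a set of preconditions, an add list,
# and a delete list, applied to a state represented as a set of facts.

def applicable(state, op):
    """An operator applies when all its preconditions hold in the state."""
    return op["pre"] <= state

def apply_op(state, op):
    """Successor state: remove the delete list, then add the add list."""
    return (state - op["del"]) | op["add"]

# Toy domain: push a box from room A to room B.
push = {
    "pre": {"robot@A", "box@A"},
    "add": {"robot@B", "box@B"},
    "del": {"robot@A", "box@A"},
}

state = frozenset({"robot@A", "box@A"})
if applicable(state, push):
    state = apply_op(state, push)
print(sorted(state))  # ['box@B', 'robot@B']
```

A planner like STRIPS searches over sequences of such operator applications until the goal facts hold; this snippet shows only the state-transition core.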
Historical Significance
Shakey demonstrated that symbolic reasoning can drive actions in the physical world. The STRIPS planning formalism remains the theoretical foundation of PDDL to this day.
2. WABOT-1 (1973) -- The First Full-Scale Humanoid Robot
Background
Developed at Waseda University in Japan, WABOT-1 was the world's first full-scale humanoid robot.
Technical Innovations
- Bipedal Walking System: Achieved static balance walking, albeit at extremely slow speeds
- Vision System: Used two external cameras for object recognition and distance measurement
- Hand Grasping: Simple grasping driven by tactile sensors
- Language Interaction: Capable of simple conversation in Japanese
Historical Significance
WABOT-1 pioneered the humanoid robot research paradigm, demonstrating the feasibility of building full-scale humanoid systems and laying the groundwork for subsequent research such as ASIMO.
3. Stanford Cart (1979) -- Vision-Based Autonomous Navigation
Background
Developed by Hans Moravec at Stanford University, the Stanford Cart was a representative work in early vision-based navigation.
Technical Innovations
- Stereo Vision: Obtained depth information by capturing images from different positions with a single camera
- Obstacle Detection: Vision-based obstacle avoidance
- Path Planning: Autonomous path planning in obstacle-laden environments
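The Cart's single-camera stereo reduces to the pinhole triangulation relation, where the camera's displacement plays the role of a stereo baseline. The focal length and baseline values below are made up for illustration.

```python
# Depth from two views of a sliding camera (the Cart's "slider stereo" idea):
# Z = f * B / d, with d the pixel shift of the same feature between views.

def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Pinhole triangulation: focal length in pixels, baseline in metres."""
    if disparity_px <= 0:
        raise ValueError("feature must shift between views")
    return f_px * baseline_m / disparity_px

# A feature that shifts 20 px between views 0.5 m apart, with f = 800 px:
z = depth_from_disparity(800.0, 0.5, 20.0)
print(z)  # 20.0 (metres)
```

Nearby obstacles produce large disparities and small depths, which is exactly the signal the Cart used for obstacle detection.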
Historical Significance
Although extremely slow (it took approximately 5 hours to traverse a 20-meter room), the Stanford Cart demonstrated that pure visual information can support autonomous navigation -- an idea that blossomed again 40 years later in Tesla FSD and embodied navigation systems.
4. ASIMO (2000) -- Breakthrough in Humanoid Bipedal Walking
Background
ASIMO (Advanced Step in Innovative Mobility) was a humanoid robot developed by Honda over 14 years of research.
Technical Innovations
- Dynamic Walking: Dynamic balance walking based on the ZMP (Zero Moment Point) criterion: $x_{zmp} = \frac{\sum_i m_i(\ddot{z}_i + g)x_i - \sum_i m_i \ddot{x}_i z_i}{\sum_i m_i(\ddot{z}_i + g)}$
- Stair Climbing: Capable of ascending and descending stairs
- Gesture Recognition: Recognized simple gesture commands
- Autonomous Obstacle Avoidance: Real-time path adjustment
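The ZMP formula above can be evaluated directly for a set of point masses. The masses and positions below are made-up values for illustration, not ASIMO parameters.

```python
# x-coordinate of the Zero Moment Point for point masses m_i at (x_i, z_i)
# with accelerations (x''_i, z''_i), matching the formula in the text.

def zmp_x(masses, xs, zs, ax, az, g=9.81):
    """x_zmp = [sum_i m_i (z''_i + g) x_i - sum_i m_i x''_i z_i]
               / [sum_i m_i (z''_i + g)]"""
    num = sum(m * (azi + g) * xi - m * axi * zi
              for m, xi, zi, axi, azi in zip(masses, xs, zs, ax, az))
    den = sum(m * (azi + g) for m, azi in zip(masses, az))
    return num / den

# Static case (all accelerations zero): the ZMP reduces to the ground
# projection of the centre of mass.
x = zmp_x([30.0, 20.0], [0.1, 0.3], [0.8, 1.2], [0.0, 0.0], [0.0, 0.0])
print(round(x, 4))  # 0.18
```

A ZMP-based controller plans trajectories that keep this point inside the support polygon of the feet, which is what makes the gait dynamically stable.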
Historical Significance
ASIMO demonstrated that humanoid robots can achieve dynamic, stable locomotion in human environments. The ZMP method became the dominant paradigm for humanoid locomotion control for over a decade.
5. BigDog (2005) -- Dynamic Quadruped Locomotion
Background
A quadruped robot developed by Boston Dynamics for the U.S. military.
Technical Innovations
- Dynamic Balancing: Hydraulically driven, capable of maintaining balance on rough terrain
- Disturbance Recovery: Able to recover balance after being kicked (the iconic demonstration video)
- Terrain Adaptation: Adapted to ice, slopes, gravel, and various other terrains
- Load Capacity: Could carry approximately 150 kg of payload
Historical Significance
BigDog demonstrated that robots can achieve near-animal-level dynamic locomotion capabilities, pioneering modern dynamic legged locomotion research and eventually evolving into iconic products like Spot and Atlas.
6. DARPA Robotics Challenge (2015) -- Disaster Response Robots
Background
A robotics competition initiated by DARPA in response to the Fukushima nuclear disaster aftermath, requiring robots to perform tasks such as driving, opening doors, traversing rubble, and closing valves in disaster environments.
Technical Innovations
- Whole-Body Motion Planning: Locomotion in complex unstructured environments
- Human-Robot Collaborative Teleoperation: Combining remote control with autonomous decision-making
- Multimodal Perception Fusion: LiDAR + vision + force sensing
- Multi-Task General Platform: A single platform completing multiple heterogeneous tasks
Key Findings
Most robots frequently failed at simple tasks (such as opening doors), exposing the severe lack of robustness in robot systems at the time -- directly driving the subsequent adoption of learning-based methods.
Historical Significance
DRC demonstrated the limitations of traditional engineering approaches in unstructured environments, marking a critical turning point in robotics from pure engineering toward learning-driven methods.
7. OpenAI Rubik's Cube (2019) -- Sim-to-Real and Dexterous Manipulation
Background
OpenAI used reinforcement learning to train a dexterous hand (Shadow Hand) to solve a Rubik's cube in the real world.
Technical Innovations
- Large-Scale Domain Randomization: Randomized over 100 physical parameters in simulation, optimizing $\pi^* = \arg\max_\pi \mathbb{E}_{\xi \sim P(\xi)} \left[ \sum_t r(s_t, a_t) \right]$, where $\xi$ is the vector of randomized parameters
- Automatic Domain Randomization (ADR): Automatically adjusted randomization ranges
- Memory-Augmented Policy: LSTM policy network to handle partial observability
- Fingertip Manipulation: Fine control of 24 degrees of freedom
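The core loop of domain randomization, with an ADR-flavored range update, can be sketched as below. Parameter names, ranges, and the success threshold are invented for illustration; OpenAI randomized far more quantities with more elaborate curricula.

```python
import random

# Domain randomization sketch: each training episode samples physics
# parameters from ranges, and ADR-style logic widens a range once the
# policy handles it reliably.

ranges = {"friction": [0.9, 1.1], "cube_mass_kg": [0.08, 0.12]}

def sample_params(rng):
    """Draw one set of physics parameters for an episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}

def adr_update(name, success_rate, step=0.05, threshold=0.8):
    """Widen a parameter's range when the policy succeeds often enough."""
    if success_rate >= threshold:
        ranges[name][0] -= step
        ranges[name][1] += step

rng = random.Random(0)
params = sample_params(rng)
assert ranges["friction"][0] <= params["friction"] <= ranges["friction"][1]
adr_update("friction", success_rate=0.9)  # range grows to roughly [0.85, 1.15]
```

The point of the automatic update is that the simulation gets harder exactly as fast as the policy improves, so training never stalls on a fixed distribution.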
Historical Significance
This work demonstrated that Sim-to-Real transfer can solve extremely fine manipulation tasks, and domain randomization became a standard technique for robot RL thereafter. It also revealed a limitation: the computational resources required for training were enormous.
8. RT-1 (2022) -- Large-Scale Robot Learning
Background
The Robotics Transformer, released by Google Research's robotics team, trained on 130k real-world demonstrations.
Technical Innovations
- Tokenized Actions: Discretized continuous actions into tokens
- FiLM-Conditioned EfficientNet: Visual encoder fusing language instructions through FiLM layers: $\text{FiLM}(x) = \gamma(l) \odot x + \beta(l)$
- Large-Scale Real Data: 13 robots, 17 months, 130k+ trajectories
- Multi-Task Learning: A single model handling 700+ tasks
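RT-1's action tokenization discretizes each continuous action dimension into 256 uniform bins, so an arm command becomes a short sequence of integer tokens. The value ranges below are illustrative, not RT-1's actual limits.

```python
# RT-1-style action tokenization: map each continuous action dimension
# in [lo, hi] to one of 256 uniform bins, and back (lossy inverse).

N_BINS = 256

def tokenize(value, lo, hi):
    """Bin index in [0, N_BINS - 1] for a value in [lo, hi]."""
    frac = (value - lo) / (hi - lo)
    return min(N_BINS - 1, max(0, int(frac * N_BINS)))

def detokenize(token, lo, hi):
    """Centre of the bin, back in continuous space."""
    return lo + (token + 0.5) / N_BINS * (hi - lo)

tok = tokenize(0.0, -1.0, 1.0)            # mid-range value
print(tok, round(detokenize(tok, -1.0, 1.0), 4))  # 128 0.0039
```

Once actions are tokens, a Transformer can emit them with the same autoregressive decoding used for text, which is what makes the architecture reusable across 700+ tasks.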
Historical Significance
RT-1 demonstrated the effectiveness of scaling data and model capacity for robot policies, pioneering the study of "Scaling Laws for Robot Learning."
9. RT-2 (2023) -- From VLM to VLA
Background
Google DeepMind fine-tuned a Vision-Language Model (VLM) directly into a Vision-Language-Action model (VLA).
Technical Innovations
- Actions as Text Tokens: Encoded robot actions as natural language token sequences
- VLM Knowledge Transfer: Directly transferred internet-pretrained vision-language knowledge to robot control
- Emergent Reasoning Abilities: Could understand semantic instructions never seen before (e.g., "throw the trash in the trash can")
- Symbolic Reasoning + Physical Manipulation: Unified symbolic reasoning and physical control in a single model
Historical Significance
RT-2 demonstrated that internet knowledge in VLMs can be grounded in the physical world, establishing the VLA paradigm that became the foundational framework for subsequent models like Octo and pi0.
10. Open X-Embodiment (2024) -- Cross-Embodiment Transfer
Background
Jointly released by dozens of research institutions, comprising a dataset covering 22 robot embodiments, more than 1 million real trajectories, and the RT-X models.
Technical Innovations
- Unified Data Format: RLDS (Reinforcement Learning Datasets) standard
- Cross-Robot Transfer: Sharing training data across robots of different morphologies
- Positive Transfer Validation: Experiments demonstrated that cross-embodiment data improves individual robot performance
- Open Ecosystem: Open-source datasets and models
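The unifying idea of the data format is that every episode, whatever the robot, becomes a sequence of (observation, action, reward, done) steps plus embodiment metadata. The sketch below conveys that shape with plain dictionaries; the field names are illustrative, not the exact RLDS schema.

```python
# Episode record in the spirit of a unified cross-embodiment format:
# per-step observation/action/reward/done, plus robot-level metadata.

def make_step(image, instruction, action, reward=0.0, is_last=False):
    return {
        "observation": {"image": image, "instruction": instruction},
        "action": action,
        "reward": reward,
        "is_last": is_last,
    }

episode = {
    "metadata": {"robot_type": "franka", "dataset": "example_lab_2024"},
    "steps": [
        make_step("img_0.png", "pick up the cup", [0.1, 0.0, -0.2]),
        make_step("img_1.png", "pick up the cup", [0.0, 0.0, 0.0], 1.0, True),
    ],
}
assert episode["steps"][-1]["is_last"]
```

Once every lab's data fits one schema, a single training pipeline can mix trajectories from arms, quadrupeds, and mobile bases, which is the precondition for the positive-transfer result above.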
Historical Significance
Open X-Embodiment pioneered the open data ecosystem for embodied intelligence, demonstrating the feasibility of cross-embodiment transfer learning, analogous to the significance of Common Crawl for language models in NLP.
11. pi0 (2024) -- General-Purpose Robot Foundation Model
Background
A general-purpose robot policy model launched by Physical Intelligence.
Technical Innovations
- VLM Backbone: Based on a pretrained VLM as the perception and reasoning foundation
- Flow Matching Action Head: Uses flow matching instead of diffusion for action generation, learning a velocity field $v_\theta(x_t, t) = \frac{dx_t}{dt}$ and integrating $x_1 = x_0 + \int_0^1 v_\theta(x_t, t)\, dt$
- Multi-Task Generalization: A single model performing tasks such as folding clothes, tidying tables, and packing boxes
- Zero-Shot Transfer: Works on unseen scenarios and objects
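The sampling step behind the flow-matching head is just numerical integration of the learned velocity field from noise at $t=0$ to an action at $t=1$. In the sketch below the "network" is a hand-written constant field whose flow is known in closed form; pi0's real action head is a transformer, so this shows only the integration.

```python
# Flow-matching sampling, reduced to its core: Euler-integrate
# dx/dt = v(x, t) from t = 0 to t = 1, as in the formula above.

def integrate(v, x0, n_steps=100):
    """Euler integration of the velocity field v over [0, 1]."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Straight-line flow: constant velocity (target - x0) reaches the target
# exactly at t = 1.
x0, target = 0.0, 0.7
x1 = integrate(lambda x, t: target - x0, x0)
print(round(x1, 6))  # 0.7
```

Compared with diffusion, the field can be integrated in few steps along a near-straight path, which keeps action generation fast enough for real-time control.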
Historical Significance
pi0 represents a new paradigm for general-purpose robot foundation models, successfully bringing the large-scale pretraining + flexible fine-tuning paradigm from language to robotics.
12. Milestone Comparison Summary
| Milestone | Year | What It Proved | Core Methodology |
|---|---|---|---|
| Shakey | 1969 | Symbolic reasoning can drive physical actions | STRIPS planning |
| WABOT-1 | 1973 | Full-scale humanoid robots are feasible | Engineering integration |
| Stanford Cart | 1979 | Vision can support autonomous navigation | Stereo vision |
| ASIMO | 2000 | Humanoid dynamic walking | ZMP control |
| BigDog | 2005 | Animal-level dynamic locomotion | Hydraulics + feedback control |
| DRC | 2015 | Insufficient robustness of traditional methods | Teleoperation + autonomy |
| Rubik's Cube | 2019 | Sim2Real + dexterous manipulation | RL + domain randomization |
| RT-1 | 2022 | Data scaling laws | Transformer + large data |
| RT-2 | 2023 | VLM to VLA transfer | Actions as tokens |
| Open X-Embodiment | 2024 | Cross-embodiment transfer | Open data ecosystem |
| pi0 | 2024 | General robot foundation model | VLM + Flow Matching |
13. Future Outlook
Based on current trends, likely next milestones include:
- Truly General-Purpose Home Robots: Capable of completing various daily tasks in open home environments
- Self-Learning Robots: Acquiring skills through exploration and interaction without human demonstrations
- Multi-Robot Collaboration: Multiple heterogeneous robots cooperatively completing complex tasks
- Long-Term Autonomous Operation: Robots operating continuously in real environments for months without human intervention
References
- Nilsson, N. J., "Shakey the Robot," SRI International Technical Note 323, 1984
- Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances," 2022
- Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale," 2022
- Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," 2023
- Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models," 2024
- Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control," 2024