Introduction to Robot Foundation Models
Foundation Models have achieved tremendous success in NLP and vision, which naturally raises a question: Can the same paradigm be used to build general-purpose robot intelligence? This article reviews the three major paradigms of current robot foundation models, the exploration of Scaling Laws, and the latest progress in cross-embodiment transfer.
Related notes: Model Roadmap | VLA Models | LLM-Driven Robotics | Open-Source Model Summary
If you want the full evolution map of this section before diving into the three paradigms, start with Model Roadmap.
1. Why Robot Foundation Models Are Needed
1.1 Bottlenecks of Traditional Methods
Traditional robot learning methods have several core problems:
- Task specificity: Each task requires training a separate policy, with no cross-task generalization
- Low data efficiency: Each new environment requires collecting data from scratch
- Embodiment binding: Policies trained for a specific robot cannot transfer to other hardware platforms
1.2 Advantages of Foundation Models
The core hypothesis of Foundation Models is that a single model, pretrained on large-scale and diverse data, can generalize to a wide range of downstream tasks with little or no task-specific training.
Specifically for robotics, foundation models can bring:
| Feature | Traditional Methods | Foundation Model Approach |
|---|---|---|
| Generalization | Single task, single environment | Cross-task, cross-environment |
| Data utilization | Only uses current task data | Joint training on multi-source data |
| Cross-embodiment transfer | Not supported | Partially supported |
| Language understanding | Not available | Natural language instruction following |
| New task adaptation | Train from scratch | Few-shot / zero-shot |
2. Three Major Paradigms
Current robot foundation model research can be summarized into three major paradigms. Each paradigm differs fundamentally in abstraction level, modality fusion depth, and action output method.
Paradigm Overview Architecture
```mermaid
graph TB
    subgraph ParadigmA["Paradigm A: LLM as High-Level Planner"]
        A1[Natural Language Instruction] --> A2[LLM/VLM Planner]
        A2 --> A3[Sub-task Sequence]
        A3 --> A4[Low-Level Skill Policies]
        A4 --> A5[Robot Actions]
        A6[Environment Feedback] --> A2
    end
    subgraph ParadigmB["Paradigm B: VLM Fine-tuned for Action Output"]
        B1[Image + Language Instruction] --> B2[Pretrained VLM Backbone]
        B2 --> B3[Action Token Decoding]
        B3 --> B5[Robot Actions]
    end
    subgraph ParadigmC["Paradigm C: Dedicated Robot Foundation Model"]
        C1[Multi-modal Sensor Input] --> C2[Dedicated Encoder]
        C2 --> C3[Unified Transformer Backbone]
        C3 --> C4[Action Head / Diffusion Head]
        C4 --> C5[Continuous Robot Actions]
    end
    style ParadigmA fill:#e1f5fe
    style ParadigmB fill:#f3e5f5
    style ParadigmC fill:#e8f5e9
```
2.1 Paradigm A: LLM as High-Level Planner
Core Idea: Use a large language model as the "brain," responsible for task understanding, reasoning, and sub-task decomposition, while delegating low-level motion control to pretrained skill policies.
Representative Works:
- SayCan (Google, 2022): Multiplies the LLM's language probability with the robot's affordance score to select executable sub-tasks
- Code as Policies (Liang et al., 2023): LLM directly generates Python code to call robot APIs
- Inner Monologue (Google, 2022): Introduces perceptual feedback loops, allowing the LLM to dynamically adjust plans based on execution results
Mathematical Formulation:
In SayCan, given language instruction \(l\) and candidate skill set \(\{c_i\}\), the next skill to execute is selected as:

\[
c^* = \arg\max_{c_i} \; p_{\text{LLM}}(c_i \mid l) \cdot p_{\text{affordance}}(c_i \mid s)
\]

where \(s\) is the current environment state, \(p_{\text{LLM}}(c_i \mid l)\) is the probability the LLM assigns to skill \(c_i\) given the instruction, and \(p_{\text{affordance}}(c_i \mid s)\) is the learned affordance (value) score that skill \(c_i\) can succeed from state \(s\).
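This selection rule can be sketched numerically. The candidate counts, probabilities, and affordance scores below are made-up illustrations, not values from the SayCan paper:

```python
import numpy as np

def select_skill(llm_probs, affordance_scores):
    """SayCan-style selection: elementwise product of the LLM's language
    likelihood and the affordance (value) score, then argmax."""
    combined = np.asarray(llm_probs) * np.asarray(affordance_scores)
    return int(np.argmax(combined))

# Made-up example with 3 candidate skills: skill 0 scores highest
# linguistically, but the affordance term vetoes it because it is
# not feasible from the current state.
llm_probs   = [0.6, 0.3, 0.1]   # p_LLM(c_i | l)
affordances = [0.2, 0.9, 0.5]   # p_affordance(c_i | s)
best = select_skill(llm_probs, affordances)
print(best)  # → 1  (0.3 * 0.9 = 0.27 beats 0.6 * 0.2 = 0.12)
```

The product structure is the whole point: neither term alone decides; a skill must be both linguistically relevant and physically executable.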
Advantages:
- Leverages the powerful reasoning and commonsense knowledge of LLMs
- No end-to-end training needed; modular design
- Easy to incorporate human feedback
Disadvantages:
- Depends on a predefined low-level skill library
- A "grounding" gap exists between LLMs and the physical world
- High inference latency, unsuitable for real-time control
More details: LLM-Driven Robots
2.2 Paradigm B: VLM Fine-tuned for Action Output
Core Idea: Fine-tune a pretrained Vision-Language Model (VLM) to directly output robot actions. Treat actions as a form of "language," represented as tokens.
Representative Works:
- RT-2 (Google DeepMind, 2023): Builds on PaLI-X (55B) and PaLM-E (12B) backbones, co-fine-tuned on web and robot data, discretizing each action dimension into 256 bins represented as tokens
- OpenVLA (Stanford/Berkeley, 2024): Based on Prismatic VLM + Llama 2 7B, open-source VLA
Action Tokenization:
RT-2 discretizes the continuous action space. Each action dimension \(a_d \in [a_{\min}, a_{\max}]\) is uniformly divided into \(K = 256\) bins, and the bin index serves as the token:

\[
\text{token}(a_d) = \min\left( \left\lfloor \frac{a_d - a_{\min}}{a_{\max} - a_{\min}} \cdot K \right\rfloor,\; K - 1 \right)
\]

Output format: "1 128 91 241 5 101 127" — seven tokens corresponding to xyz translation, rpy rotation, and gripper open/close.
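A minimal sketch of this binning scheme and its inverse (the function names are mine, not RT-2's):

```python
def action_to_token(a, a_min, a_max, K=256):
    """Map a continuous action value to a bin index in [0, K-1]."""
    a = min(max(a, a_min), a_max)            # clip to the valid range
    idx = int((a - a_min) / (a_max - a_min) * K)
    return min(idx, K - 1)                   # a == a_max falls in the last bin

def token_to_action(idx, a_min, a_max, K=256):
    """Decode a bin index back to the bin-center value."""
    return a_min + (idx + 0.5) * (a_max - a_min) / K

# Round-trip error is bounded by half a bin width — this is the
# precision loss the "Disadvantages" list below refers to.
tok = action_to_token(0.1, -1.0, 1.0)        # one dimension, range [-1, 1]
approx = token_to_action(tok, -1.0, 1.0)     # ≈ 0.0977, off by < 0.004
```

With \(K = 256\) over a \([-1, 1]\) range, the bin width is about 0.008, which is why fine-grained tasks like insertion suffer under this representation.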
Advantages:
- Inherits VLM's visual understanding and language reasoning capabilities
- End-to-end training without manually designed intermediate representations
- Can leverage knowledge from large-scale internet data pretraining
Disadvantages:
- Action discretization loses precision
- Large model size (billions of parameters), slow inference
- Limited effectiveness on fine-grained manipulation tasks (e.g., insertion, suturing)
2.3 Paradigm C: Dedicated Robot Foundation Model
Core Idea: Instead of following general VLM architectures, design specialized model architectures and training pipelines based on the characteristics of robot data.
Representative Works:
- Octo (Berkeley, 2023): Open-source multi-embodiment Transformer supporting various sensor inputs and action spaces
- pi0 (Physical Intelligence, 2024): Uses a flow matching action head for continuous action output
- HPT (MIT, 2024): Unified pretraining for heterogeneous sensor inputs
Mathematical Formulation of the Action Head:
pi0 uses Flow Matching as the action head. Given condition \(c\) (visual + language features), it learns a vector field \(v_\theta\) that transports a noise distribution to the action distribution by integrating the ODE

\[
\frac{d\mathbf{a}_t}{dt} = v_\theta(\mathbf{a}_t, t \mid c), \quad t \in [0, 1],
\]

where \(\mathbf{a}_0 \sim \mathcal{N}(0, I)\) is the initial noise and \(\mathbf{a}_1\) is the predicted action sequence.

Training objective (Conditional Flow Matching Loss):

\[
\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\, \mathbf{a}_0,\, \mathbf{a}_1} \left[ \left\| v_\theta(\mathbf{a}_t, t \mid c) - (\mathbf{a}_1 - \mathbf{a}_0) \right\|^2 \right]
\]

where \(\mathbf{a}_t = (1-t)\mathbf{a}_0 + t\mathbf{a}_1\) is the linear interpolation path, whose time derivative \(\mathbf{a}_1 - \mathbf{a}_0\) is the regression target.
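The loss and the Euler-integration sampler can be sketched in NumPy. This is a toy illustration of the math, not pi0's implementation; `v_theta` stands in for the conditioned network:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, a1, cond):
    """Conditional flow matching loss for one batch of target actions a1."""
    a0 = rng.standard_normal(a1.shape)        # noise a_0 ~ N(0, I)
    t = rng.uniform(size=(a1.shape[0], 1))    # t ~ U[0, 1]
    at = (1 - t) * a0 + t * a1                # linear path a_t
    target = a1 - a0                          # d a_t / dt along that path
    return float(np.mean((v_theta(at, t, cond) - target) ** 2))

def sample(v_theta, cond, dim, steps=10):
    """Generate an action by Euler-integrating the learned ODE from noise."""
    a = rng.standard_normal((1, dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((a.shape[0], 1), i * dt)
        a = a + dt * v_theta(a, t, cond)
    return a

# Sanity check: for a point target a1, the exact conditional field
# (a1 - a_t) / (1 - t) transports any noise sample onto the target.
a1 = np.array([[0.5, -0.3]])
exact_field = lambda a, t, c: (a1 - a) / (1 - t)
print(sample(exact_field, None, dim=2))  # ≈ [[0.5, -0.3]]
```

Note the contrast with Paradigm B: sampling yields continuous action vectors directly, at the cost of a few integration steps per control cycle.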
Advantages:
- Can output continuous actions with high precision
- Model architecture can be optimized for robot data characteristics
- Supports multimodal sensor inputs
Disadvantages:
- Requires large amounts of robot data for pretraining
- Lacks the rich semantic understanding of general VLMs
3. Scaling Laws: Data Volume and Performance
3.1 Evolution of the RT Series
Google DeepMind's RT series pioneered the exploration of robot Scaling Laws:
```mermaid
graph LR
    A["RT-1 (2022)"] -->|"Scale up"| B["RT-2 (2023)"]
    B -->|"Data diversification"| C["RT-X (2023)"]
    A1["130K episodes<br/>700+ tasks<br/>13 robots"] --> A
    B1["PaLI-X 55B + PaLM-E 12B<br/>Web data co-training<br/>Emergent reasoning"] --> B
    C1["Open X-Embodiment<br/>1M+ episodes<br/>22 robot types<br/>160K+ tasks"] --> C
```
3.2 Key Findings
Leap from RT-1 to RT-2:
| Dimension | RT-1 | RT-2 |
|---|---|---|
| Model scale | 35M parameters | 12B–55B parameters |
| Training data | 130K robot episodes | Robot data + Web data |
| Number of tasks | 700+ | 700+ (same) |
| Generalization | Seen objects/scenes | Unseen objects (emergent) |
| Reasoning ability | None | Can understand "move apple to bowl of matching color" |
| Control frequency | 3Hz | 1–3Hz |
The most important finding of RT-2 is emergent capabilities: through pretraining on web-scale vision-language data, the model acquired reasoning abilities never present in the robot data, such as understanding "move the bottle next to the flag of Taylor Swift's country."
3.3 Is the Data Scale Sufficient?
The enormous gap between current robot data scale and language model data scale:
| Domain | Data Scale | Token/Sample Count |
|---|---|---|
| GPT-4 | ~13T tokens | ~1.3 × 10¹³ |
| ImageNet | 14M images | - |
| Open X-Embodiment | 1M+ episodes | ~millions |
| Single lab | 1K–100K episodes | - |
The gap is 3–4 orders of magnitude or more. This raises several key questions:
- Do Scaling Laws similar to language models exist in robotics? — Preliminary evidence suggests increasing data improves generalization, but a power-law relationship has not yet been established
- Can web data compensate for the lack of robot data? — RT-2's experiments suggest yes, but the precision of physical interactions remains limited
- Is simulation data effective? — The Sim-to-Real gap remains a major challenge
4. Open X-Embodiment: Cross-Embodiment Dataset
4.1 Overview
Open X-Embodiment (led by Google DeepMind, 2023) is currently the largest cross-embodiment robot dataset:
- Data scale: Over 1 million episodes
- Robot types: 22 different robot morphologies
- Task count: 160,000+ different task descriptions
- Contributing institutions: 33 datasets from 21 institutions
4.2 Data Diversity Architecture
```mermaid
graph TB
    OXE[Open X-Embodiment Dataset]
    OXE --> R1[Single-arm Tabletop Robots]
    OXE --> R2[Bimanual Manipulation Platforms]
    OXE --> R3[Mobile Manipulation Robots]
    OXE --> R4[Dexterous Hands]
    R1 --> D1[Bridge V2<br/>60K episodes]
    R1 --> D2[RT-1 Data<br/>130K episodes]
    R2 --> D3[ALOHA Data]
    R3 --> D4[Kuka Data]
    R4 --> D5[DROID Data]
    OXE --> Format[Unified RLDS Format]
    Format --> F1[observation: Image + Proprioception]
    Format --> F2[action: End-effector Pose]
    Format --> F3[language_instruction: Text]
```
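In code, one step of this unified format looks roughly like the dictionary below. Field names follow the diagram; the exact keys and shapes vary per dataset, so treat this as an illustrative sketch:

```python
import numpy as np

# One RLDS-style step: observation, action, and language instruction.
# Shapes are illustrative; real datasets differ per robot embodiment.
step = {
    "observation": {
        "image": np.zeros((256, 256, 3), dtype=np.uint8),  # RGB camera frame
        "proprio": np.zeros(8, dtype=np.float32),          # joint / gripper state
    },
    "action": np.zeros(7, dtype=np.float32),  # EE delta pose (6) + gripper (1)
    "language_instruction": "pick up the red block",
    "is_terminal": False,
}
```

Packing every contributing dataset into one step schema like this is what lets a single data loader feed 22 embodiments into the same training run.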
4.3 Key Conclusions
Open X-Embodiment experiments revealed several important conclusions:
- Positive Transfer: RT-X models trained on mixed multi-embodiment data outperform models trained solely on single-embodiment data for most individual embodiments
- Data diversity > Data volume: Compared to simply increasing data from the same robot, adding data from different robots is more helpful for generalization
- Challenge of unified action spaces: The action spaces of different robots differ greatly (joint space vs. end-effector space, different degrees of freedom), requiring unified action representations
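One common mitigation for the action-space mismatch, sketched here with hypothetical quantile parameters, is per-dataset, per-dimension quantile normalization that maps every robot's actions into a shared \([-1, 1]\) range before mixing:

```python
import numpy as np

def fit_normalizer(actions, low_q=1.0, high_q=99.0):
    """Compute per-dimension quantile bounds from one dataset's actions."""
    lo = np.percentile(actions, low_q, axis=0)
    hi = np.percentile(actions, high_q, axis=0)
    return lo, hi

def normalize(actions, lo, hi, eps=1e-8):
    """Map actions into [-1, 1], clipping outliers beyond the quantiles."""
    scaled = 2.0 * (actions - lo) / (hi - lo + eps) - 1.0
    return np.clip(scaled, -1.0, 1.0)

# Two datasets with wildly different action scales end up in the same range:
small = np.random.default_rng(1).normal(0.0, 0.01, size=(1000, 7))  # e.g. meters
large = np.random.default_rng(2).normal(0.0, 10.0, size=(1000, 7))  # e.g. millimeters
for data in (small, large):
    lo, hi = fit_normalizer(data)
    norm = normalize(data, lo, hi)
    assert norm.min() >= -1.0 and norm.max() <= 1.0
```

Quantiles rather than min/max keep a single outlier episode from compressing the useful part of the range; the per-dataset bounds must be stored so actions can be de-normalized at deployment.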
5. Future Outlook
5.1 Open Questions
- Universal action representation: How to design a unified action space so that robots of different morphologies can share the same model?
- Real-time performance: Current large model inference speeds (1–10Hz) are far below robot control requirements (100–1000Hz) — how to address this?
- Safety: How to combine the black-box nature of foundation models with robot safety constraints?
- Data flywheel: How to build an automated loop of data collection → model training → deployment → data collection?
5.2 Three-Paradigm Convergence Trend
```mermaid
graph TB
    T1["2022: Independent development of each paradigm"] --> T2["2023-2024: Paradigm B+C convergence"]
    T2 --> T3["2025+: Three-paradigm convergence"]
    T3 --> F1["High level: LLM reasoning and planning"]
    T3 --> F2["Mid level: VLM understanding and grounding"]
    T3 --> F3["Low level: Dedicated action model"]
    F1 <--> F2
    F2 <--> F3
```
pi0.5 (Physical Intelligence, 2025) already shows this trend: a high-level language model for task decomposition, a mid-level VLM for scene understanding, and a low-level flow matching model for fine-grained action output. This may become the mainstream architecture for future robot foundation models.
6. Summary
| Comparison Dimension | Paradigm A (LLM Planning) | Paradigm B (VLM Fine-tuning) | Paradigm C (Dedicated Foundation Model) |
|---|---|---|---|
| Representative models | SayCan, Code as Policies | RT-2, OpenVLA | Octo, pi0, HPT |
| Action output | Calls low-level APIs | Discrete tokens | Continuous values / Diffusion sampling |
| Control precision | Low (depends on low-level) | Medium | High |
| Reasoning ability | Strong | Medium–Strong | Weak–Medium |
| Control frequency | <1Hz | 1–3Hz | 5–50Hz |
| Data requirements | Low (leverages pretraining) | Medium (fine-tuning) | Large (pretraining) |
| Open-source degree | High | Medium | Medium–High |
Robot foundation models are still in their early stages, but the three paradigms are rapidly developing and converging. Key driving forces include: larger-scale cross-embodiment datasets, more efficient model architectures, and simulation-to-real transfer technologies.
References:
- Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", 2022
- Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", 2023
- Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", 2023
- Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control", 2024
- Team et al., "Octo: An Open-Source Generalist Robot Policy", 2023
- Bommasani et al., "On the Opportunities and Risks of Foundation Models", 2021