Introduction to Robot Foundation Models

Foundation Models have achieved tremendous success in NLP and vision, which naturally raises a question: Can the same paradigm be used to build general-purpose robot intelligence? This article reviews the three major paradigms of current robot foundation models, the exploration of Scaling Laws, and the latest progress in cross-embodiment transfer.

Related notes: Model Roadmap | VLA Models | LLM-Driven Robotics | Open-Source Model Summary

If you want the full evolution map of this area before diving into the three paradigms, start with Model Roadmap.


1. Why Robot Foundation Models Are Needed

1.1 Bottlenecks of Traditional Methods

Traditional robot learning methods have several core problems:

  • Task specificity: Each task requires training a separate policy, with no cross-task generalization
  • Low data efficiency: Each new environment requires collecting data from scratch
  • Embodiment binding: Policies trained for a specific robot cannot transfer to other hardware platforms

1.2 Advantages of Foundation Models

The core hypothesis of Foundation Models is:

\[\text{Large-scale pretraining} + \text{Small-scale fine-tuning} \rightarrow \text{Downstream task generalization}\]

Specifically for robotics, foundation models can bring:

| Feature | Traditional Methods | Foundation Model Approach |
|---|---|---|
| Generalization | Single task, single environment | Cross-task, cross-environment |
| Data utilization | Only uses current task data | Joint training on multi-source data |
| Cross-embodiment transfer | Not supported | Partially supported |
| Language understanding | Not available | Natural language instruction following |
| New task adaptation | Train from scratch | Few-shot / zero-shot |

2. Three Major Paradigms

Current robot foundation model research can be grouped into three major paradigms, which differ fundamentally in abstraction level, depth of modality fusion, and how actions are produced.

Paradigm Overview Architecture

graph TB
    subgraph ParadigmA["Paradigm A: LLM as High-Level Planner"]
        A1[Natural Language Instruction] --> A2[LLM/VLM Planner]
        A2 --> A3[Sub-task Sequence]
        A3 --> A4[Low-Level Skill Policies]
        A4 --> A5[Robot Actions]
        A6[Environment Feedback] --> A2
    end

    subgraph ParadigmB["Paradigm B: VLM Fine-tuned for Action Output"]
        B1[Image + Language Instruction] --> B2[Pretrained VLM Backbone]
        B2 --> B3[Action Token Decoding]
        B3 --> B5[Robot Actions]
    end

    subgraph ParadigmC["Paradigm C: Dedicated Robot Foundation Model"]
        C1[Multi-modal Sensor Input] --> C2[Dedicated Encoder]
        C2 --> C3[Unified Transformer Backbone]
        C3 --> C4[Action Head / Diffusion Head]
        C4 --> C5[Continuous Robot Actions]
    end

    style ParadigmA fill:#e1f5fe
    style ParadigmB fill:#f3e5f5
    style ParadigmC fill:#e8f5e9

2.1 Paradigm A: LLM as High-Level Planner

Core Idea: Use a large language model as the "brain," responsible for task understanding, reasoning, and sub-task decomposition, while delegating low-level motion control to pretrained skill policies.

Representative Works:

  • SayCan (Google, 2022): Multiplies the LLM's language probability with the robot's affordance score to select executable sub-tasks
  • Code as Policies (Liang et al., 2023): LLM directly generates Python code to call robot APIs
  • Inner Monologue (Google, 2022): Introduces perceptual feedback loops, allowing the LLM to dynamically adjust plans based on execution results

Mathematical Formulation:

In SayCan, given language instruction \(l\) and candidate skill set \(\{c_i\}\), the next skill to execute is selected as:

\[c^* = \arg\max_{c_i} \underbrace{p(c_i | l)}_{\text{language model score}} \cdot \underbrace{p(\text{success} | c_i, s)}_{\text{affordance score}}\]

where \(s\) is the current environment state.
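
Sketched below is a minimal Python version of this selection rule. The `llm_logprob` and `affordance` callables are hypothetical placeholders for models the user supplies; they are not part of any published SayCan codebase.

```python
import math

def select_skill(instruction, skills, state, llm_logprob, affordance):
    """SayCan-style skill selection: combine the LLM's score for a skill
    given the instruction with a learned affordance (success) estimate
    for that skill in the current state."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # p(c_i | l): plausibility of the skill as the next step of the plan
        language_score = math.exp(llm_logprob(instruction, skill))
        # p(success | c_i, s): probability the skill succeeds in state s
        value_score = affordance(skill, state)
        score = language_score * value_score
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```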

Advantages:

  • Leverages the powerful reasoning and commonsense knowledge of LLMs
  • No end-to-end training needed; modular design
  • Easy to incorporate human feedback

Disadvantages:

  • Depends on a predefined low-level skill library
  • A "grounding" gap exists between LLMs and the physical world
  • High inference latency, unsuitable for real-time control

More details: LLM-Driven Robotics

2.2 Paradigm B: VLM Fine-tuned for Action Output

Core Idea: Fine-tune a pretrained Vision-Language Model (VLM) to directly output robot actions. Treat actions as a form of "language," represented as tokens.

Representative Works:

  • RT-2 (Google DeepMind, 2023): Built on the PaLI-X (55B) and PaLM-E (12B) backbones and co-fine-tuned on web and robot data, discretizing each action dimension into 256 bins
  • OpenVLA (Stanford/Berkeley, 2024): Based on Prismatic VLM + Llama 2 7B, open-source VLA

Action Tokenization:

RT-2 discretizes the continuous action space. For each action dimension \(a_d \in [a_{\min}, a_{\max}]\), it is uniformly divided into \(K=256\) bins:

\[\text{token}(a_d) = \left\lfloor \frac{a_d - a_{\min}}{a_{\max} - a_{\min}} \cdot (K-1) \right\rfloor\]

Output format: "1 128 91 241 5 101 127" — corresponding to xyz translation, rpy rotation, and gripper open/close.
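
The mapping in both directions is a few lines of NumPy. A minimal sketch, assuming illustrative per-dimension bounds of \([-1, 1]\) (RT-2's actual per-dimension calibration is not reproduced here):

```python
import numpy as np

# Illustrative bounds; RT-2's real per-dimension calibration differs.
A_MIN, A_MAX, K = -1.0, 1.0, 256

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to one of K integer bins."""
    a = np.clip(action, A_MIN, A_MAX)
    return np.floor((a - A_MIN) / (A_MAX - A_MIN) * (K - 1)).astype(np.int64)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Recover continuous values from bin indices (the inverse is lossy)."""
    return A_MIN + tokens / (K - 1) * (A_MAX - A_MIN)

action = np.array([0.1, -0.2, 0.3, 0.0, 0.0, 0.5, 1.0])  # xyz, rpy, gripper
tokens = tokenize(action)
print(tokens, detokenize(tokens))
```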

Advantages:

  • Inherits VLM's visual understanding and language reasoning capabilities
  • End-to-end training without manually designed intermediate representations
  • Can leverage knowledge from large-scale internet data pretraining

Disadvantages:

  • Action discretization loses precision
  • Large model size (billions of parameters), slow inference
  • Limited effectiveness on fine-grained manipulation tasks (e.g., insertion, suturing)

2.3 Paradigm C: Dedicated Robot Foundation Model

Core Idea: Instead of following general VLM architectures, design specialized model architectures and training pipelines based on the characteristics of robot data.

Representative Works:

  • Octo (Berkeley, 2023): Open-source multi-embodiment Transformer supporting various sensor inputs and action spaces
  • pi0 (Physical Intelligence, 2024): Uses a flow matching action head for continuous action output
  • HPT (MIT, 2024): Unified pretraining for heterogeneous sensor inputs

Mathematical Formulation of the Action Head:

pi0 uses Flow Matching as the action head. Given condition \(c\) (visual + language features), it learns a vector field \(v_\theta\) that maps a noise distribution to an action distribution:

\[\frac{d\mathbf{a}_t}{dt} = v_\theta(\mathbf{a}_t, t, c), \quad t \in [0, 1]\]

where \(\mathbf{a}_0 \sim \mathcal{N}(0, I)\) is the initial noise and \(\mathbf{a}_1\) is the predicted action sequence.

Training objective (Conditional Flow Matching Loss):

\[\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, \mathbf{a}_0, \mathbf{a}_1}\left[\| v_\theta(\mathbf{a}_t, t, c) - (\mathbf{a}_1 - \mathbf{a}_0) \|^2\right]\]

where \(\mathbf{a}_t = (1-t)\mathbf{a}_0 + t\mathbf{a}_1\) is the linear interpolation path.
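
A compact PyTorch sketch of this objective and of sampling by Euler integration of the ODE. Here `v_theta` stands in for whatever conditioned network produces the velocity field; pi0's actual architecture, action horizon, and step schedule differ.

```python
import torch

def cfm_loss(v_theta, a1, cond):
    """Conditional flow matching loss.
    a1:   ground-truth action chunk, shape (B, D)
    cond: vision + language conditioning features, shape (B, d)"""
    a0 = torch.randn_like(a1)                  # a_0 ~ N(0, I)
    t = torch.rand(a1.shape[0], 1, device=a1.device)
    a_t = (1 - t) * a0 + t * a1                # linear interpolation path
    target = a1 - a0                           # constant velocity along the path
    return ((v_theta(a_t, t, cond) - target) ** 2).mean()

@torch.no_grad()
def sample_action(v_theta, cond, dim, steps=10):
    """Integrate da/dt = v_theta(a, t, c) from t = 0 to 1 with Euler steps."""
    a = torch.randn(cond.shape[0], dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt, device=cond.device)
        a = a + dt * v_theta(a, t, cond)
    return a
```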

Advantages:

  • Can output continuous actions with high precision
  • Model architecture can be optimized for robot data characteristics
  • Supports multimodal sensor inputs

Disadvantages:

  • Requires large amounts of robot data for pretraining
  • Lacks the rich semantic understanding of general VLMs

3. Scaling Laws: Data Volume and Performance

3.1 Evolution of the RT Series

Google DeepMind's RT series pioneered the exploration of robot Scaling Laws:

graph LR
    A["RT-1 (2022)"] -->|"Scale up"| B["RT-2 (2023)"]
    B -->|"Data diversification"| C["RT-X (2023)"]

    A1["130K episodes<br/>700+ tasks<br/>13 robots"] --> A
    B1["PaLI-X 55B + PaLM-E 12B<br/>Web data co-training<br/>Emergent reasoning"] --> B
    C1["Open X-Embodiment<br/>1M+ episodes<br/>22 robot types<br/>160K+ tasks"] --> C

3.2 Key Findings

Leap from RT-1 to RT-2:

| Dimension | RT-1 | RT-2 |
|---|---|---|
| Model scale | 35M parameters | 12B–55B parameters |
| Training data | 130K robot episodes | Robot data + web data |
| Number of tasks | 700+ | 700+ (same) |
| Generalization | Seen objects/scenes | Unseen objects (emergent) |
| Reasoning ability | None | Understands e.g. "move apple to bowl of matching color" |
| Control frequency | 3 Hz | 1–3 Hz |

The most important finding of RT-2 is emergent capability: through pretraining on web-scale vision-language data, the model acquired semantic concepts that never appeared in the robot data, such as moving an object onto a photo of a named celebrity or onto a particular country's flag.

3.3 Is the Data Scale Sufficient?

The enormous gap between current robot data scale and language model data scale:

| Domain | Data Scale | Token/Sample Count |
|---|---|---|
| GPT-4 | ~13T tokens (reported) | ~13,000,000M |
| ImageNet | 14M images | – |
| Open X-Embodiment | 1M+ episodes | ~millions |
| Single lab | 1K–100K episodes | – |

The gap is 3–4 orders of magnitude or more. This raises several key questions:

  1. Do Scaling Laws similar to language models exist in robotics? — Preliminary evidence suggests that more data improves generalization, but a clean power-law relationship has not yet been established (see the fitting sketch after this list)
  2. Can web data compensate for the lack of robot data? — RT-2's experiments suggest yes, but the precision of physical interactions remains limited
  3. Is simulation data effective? — The Sim-to-Real gap remains a major challenge
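
On the first question, testing for a power law amounts to fitting \(\text{error} \approx a \cdot N^{-b}\) against dataset size \(N\). The sketch below shows only the fitting mechanics; the data points are synthetic placeholders, not measurements from any paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic placeholder points: episode count vs. task failure rate.
episodes = np.array([1e3, 1e4, 1e5, 1e6])
failure_rate = np.array([0.62, 0.45, 0.33, 0.24])  # hypothetical values

def power_law(n, a, b):
    return a * n ** (-b)

(a, b), _ = curve_fit(power_law, episodes, failure_rate, p0=(1.0, 0.1))
print(f"fit: failure rate ~ {a:.2f} * N^(-{b:.3f})")
```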

4. Open X-Embodiment: Cross-Embodiment Dataset

4.1 Overview

Open X-Embodiment (led by Google DeepMind, 2023) is currently the largest cross-embodiment robot dataset:

  • Data scale: Over 1 million episodes
  • Robot types: 22 different robot morphologies
  • Task count: 160,000+ different task descriptions
  • Contributing institutions: 33 datasets from 21 institutions

4.2 Data Diversity Architecture

graph TB
    OXE[Open X-Embodiment Dataset]

    OXE --> R1[Single-arm Tabletop Robots]
    OXE --> R2[Bimanual Manipulation Platforms]
    OXE --> R3[Mobile Manipulation Robots]
    OXE --> R4[Dexterous Hands]

    R1 --> D1[Bridge V2<br/>60K episodes]
    R1 --> D2[Kuka Grasping Data]
    R1 --> D3[DROID Data]
    R2 --> D4[ALOHA Data]
    R3 --> D5[RT-1 Data<br/>130K episodes]

    OXE --> Format[Unified RLDS Format]
    Format --> F1[observation: Image + Proprioception]
    Format --> F2[action: End-effector Pose]
    Format --> F3[language_instruction: Text]
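
In code, each RLDS episode is a nested record of steps. A minimal sketch using `tensorflow_datasets` follows; the GCS path mirrors the Open X-Embodiment release convention, but exact paths, versions, and observation keys differ across constituent datasets and should be treated as assumptions.

```python
import tensorflow_datasets as tfds

# Read one OXE constituent dataset in RLDS format (path is illustrative).
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:        # steps is a nested dataset
        obs = step["observation"]        # images + proprioception
        action = step["action"]          # e.g. end-effector delta pose
        print(list(obs.keys()))          # inspect per-dataset keys
```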

4.3 Key Conclusions

Open X-Embodiment experiments revealed several important conclusions:

  1. Positive Transfer: RT-X models trained on mixed multi-embodiment data outperform models trained solely on single-embodiment data for most individual embodiments
  2. Data diversity > Data volume: Compared to simply increasing data from the same robot, adding data from different robots is more helpful for generalization
  3. Challenge of unified action spaces: The action spaces of different robots differ greatly (joint space vs. end-effector space, different degrees of freedom), requiring a unified action representation (illustrated in the sketch below)
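
One deliberately simplified way to picture such a unified representation: map every embodiment's native action into a shared 7-D end-effector delta (translation, rotation, gripper). The adapters below are hypothetical illustrations, not RT-X's actual converters.

```python
import numpy as np

def from_eef_delta(action: np.ndarray) -> np.ndarray:
    """Robots already acting in end-effector space pass through (7,)."""
    return action

def from_joint_velocity(qdot: np.ndarray, jacobian: np.ndarray,
                        gripper: float) -> np.ndarray:
    """Approximate an end-effector twist from joint velocities via the
    manipulator Jacobian, then append the gripper command."""
    twist = jacobian @ qdot                     # (6,) linear + angular
    return np.concatenate([twist, [gripper]])   # unified (7,) action

# Per-embodiment adapters into the shared action space (illustrative).
ADAPTERS = {
    "widowx_eef": from_eef_delta,
    "franka_joint": from_joint_velocity,
}
```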

5. Future Outlook

5.1 Open Questions

  • Universal action representation: How to design a unified action space so that robots of different morphologies can share the same model?
  • Real-time performance: Current large model inference speeds (1–10Hz) are far below robot control requirements (100–1000Hz). How can this gap be closed? One common mitigation, action chunking, is sketched after this list
  • Safety: How to combine the black-box nature of foundation models with robot safety constraints?
  • Data flywheel: How to build an automated loop of data collection → model training → deployment → data collection?
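
On the real-time question, one widely used mitigation (used by ACT and pi0, among others) is action chunking: a slow policy predicts a block of future actions per inference call, while a fast control loop consumes them at the robot's rate. A schematic sketch, with `policy`, `get_obs`, and `send_to_robot` as placeholder callables; real systems run inference asynchronously rather than inline.

```python
import time
from collections import deque

CONTROL_HZ = 100   # robot-side control rate
CHUNK = 50         # actions predicted per policy call (~0.5 s of motion)

def control_loop(policy, get_obs, send_to_robot):
    buffer = deque()
    while True:
        if len(buffer) < CHUNK // 2:           # refill before running dry
            buffer.extend(policy(get_obs()))   # one slow inference call
        send_to_robot(buffer.popleft())        # consume at the control rate
        time.sleep(1.0 / CONTROL_HZ)
```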

5.2 Three-Paradigm Convergence Trend

graph TB
    T1["2022: Independent development of each paradigm"] --> T2["2023-2024: Paradigm B+C convergence"]
    T2 --> T3["2025+: Three-paradigm convergence"]

    T3 --> F1["High level: LLM reasoning and planning"]
    T3 --> F2["Mid level: VLM understanding and grounding"]
    T3 --> F3["Low level: Dedicated action model"]

    F1 <--> F2
    F2 <--> F3

This trend is already visible in pi0.5 (Physical Intelligence, 2025): a high-level language model decomposes the task, a mid-level VLM grounds the scene, and a low-level flow matching model outputs fine-grained actions. This layered stack may become the mainstream architecture for future robot foundation models.


6. Summary

| Comparison Dimension | Paradigm A (LLM Planning) | Paradigm B (VLM Fine-tuning) | Paradigm C (Dedicated Foundation Model) |
|---|---|---|---|
| Representative models | SayCan, Code as Policies | RT-2, OpenVLA | Octo, pi0, HPT |
| Action output | Calls low-level APIs | Discrete tokens | Continuous values / diffusion sampling |
| Control precision | Low (depends on low-level skills) | Medium | High |
| Reasoning ability | Strong | Medium–strong | Weak–medium |
| Control frequency | <1Hz | 1–3Hz | 5–50Hz |
| Data requirements | Low (leverages pretraining) | Medium (fine-tuning) | Large (pretraining) |
| Open-source availability | High | Medium | Medium–high |

Robot foundation models are still in their early stages, but the three paradigms are rapidly developing and converging. Key driving forces include: larger-scale cross-embodiment datasets, more efficient model architectures, and simulation-to-real transfer technologies.


References:

  • Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", 2022
  • Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", 2023
  • Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", 2023
  • Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control", 2024
  • Octo Model Team, "Octo: An Open-Source Generalist Robot Policy", 2023
  • Bommasani et al., "On the Opportunities and Risks of Foundation Models", 2021
