Introduction to Robot Foundation Models

Foundation Models have achieved tremendous success in NLP and vision, which naturally raises a question: Can the same paradigm be used to build general-purpose robot intelligence? This article reviews the three major paradigms of current robot foundation models, the exploration of Scaling Laws, and the latest progress in cross-embodiment transfer.

Related notes: Model Roadmap | VLA Models | LLM-Driven Robotics | Open-Source Model Summary

If you want the full evolution map of this area before diving into the three paradigms, start with Model Roadmap.


1. Why Robot Foundation Models Are Needed

1.1 Bottlenecks of Traditional Methods

Traditional robot learning methods have several core problems:

  • Task specificity: Each task requires training a separate policy, with no cross-task generalization
  • Low data efficiency: Each new environment requires collecting data from scratch
  • Embodiment binding: Policies trained for a specific robot cannot transfer to other hardware platforms

1.2 Advantages of Foundation Models

The core hypothesis of Foundation Models is:

\[\text{Large-scale pretraining} + \text{Small-scale fine-tuning} \rightarrow \text{Downstream task generalization}\]

Specifically for robotics, foundation models can bring:

| Feature | Traditional Methods | Foundation Model Approach |
|---|---|---|
| Generalization | Single task, single environment | Cross-task, cross-environment |
| Data utilization | Only uses current task data | Joint training on multi-source data |
| Cross-embodiment transfer | Not supported | Partially supported |
| Language understanding | Not available | Natural language instruction following |
| New task adaptation | Train from scratch | Few-shot / zero-shot |

2. Three Major Paradigms

Current robot foundation model research can be grouped into three major paradigms, which differ fundamentally in abstraction level, depth of modality fusion, and how actions are produced.

Paradigm Overview Architecture

graph TB
    subgraph ParadigmA["Paradigm A: LLM as High-Level Planner"]
        A1[Natural Language Instruction] --> A2[LLM/VLM Planner]
        A2 --> A3[Sub-task Sequence]
        A3 --> A4[Low-Level Skill Policies]
        A4 --> A5[Robot Actions]
        A6[Environment Feedback] --> A2
    end

    subgraph ParadigmB["Paradigm B: VLM Fine-tuned for Action Output"]
        B1[Image + Language Instruction] --> B2[Pretrained VLM Backbone]
        B2 --> B3[Action Token Decoding]
        B3 --> B5[Robot Actions]
    end

    subgraph ParadigmC["Paradigm C: Dedicated Robot Foundation Model"]
        C1[Multi-modal Sensor Input] --> C2[Dedicated Encoder]
        C2 --> C3[Unified Transformer Backbone]
        C3 --> C4[Action Head / Diffusion Head]
        C4 --> C5[Continuous Robot Actions]
    end

    style ParadigmA fill:#e1f5fe
    style ParadigmB fill:#f3e5f5
    style ParadigmC fill:#e8f5e9

2.1 Paradigm A: LLM as High-Level Planner

Core Idea: Use a large language model as the "brain," responsible for task understanding, reasoning, and sub-task decomposition, while delegating low-level motion control to pretrained skill policies.

Representative Works:

  • SayCan (Google, 2022): Multiplies the LLM's language probability with the robot's affordance score to select executable sub-tasks
  • Code as Policies (Liang et al., 2023): LLM directly generates Python code to call robot APIs
  • Inner Monologue (Google, 2022): Introduces perceptual feedback loops, allowing the LLM to dynamically adjust plans based on execution results

Mathematical Formulation:

In SayCan, given language instruction \(l\) and candidate skill set \(\{c_i\}\), the next skill to execute is selected as:

\[c^* = \arg\max_{c_i} \underbrace{p(c_i | l)}_{\text{language model score}} \cdot \underbrace{p(\text{success} | c_i, s)}_{\text{affordance score}}\]

where \(s\) is the current environment state.
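
Sketched below is a minimal Python version of this selection rule. The `llm_logprob` and `affordance` callables are hypothetical placeholders for models the user supplies; they are not part of any published SayCan codebase.

```python
import math

def select_skill(instruction, skills, state, llm_logprob, affordance):
    """SayCan-style skill selection: combine the LLM's score for a skill
    given the instruction with a learned affordance (success) estimate
    for that skill in the current state."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # p(c_i | l): plausibility of the skill as the next step of the plan
        language_score = math.exp(llm_logprob(instruction, skill))
        # p(success | c_i, s): probability the skill succeeds in state s
        value_score = affordance(skill, state)
        score = language_score * value_score
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```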

Advantages:

  • Leverages the powerful reasoning and commonsense knowledge of LLMs
  • No end-to-end training needed; modular design
  • Easy to incorporate human feedback

Disadvantages:

  • Depends on a predefined low-level skill library
  • A "grounding" gap exists between LLMs and the physical world
  • High inference latency, unsuitable for real-time control

More details: LLM-Driven Robotics

2.2 Paradigm B: VLM Fine-tuned for Action Output

Core Idea: Fine-tune a pretrained Vision-Language Model (VLM) to directly output robot actions. Treat actions as a form of "language," represented as tokens.

Representative Works:

  • RT-2 (Google DeepMind, 2023): Built on the PaLI-X (55B) and PaLM-E (12B) backbones and co-fine-tuned on web and robot data, discretizing each action dimension into 256 bins
  • OpenVLA (Stanford/Berkeley, 2024): Based on Prismatic VLM + Llama 2 7B, open-source VLA

Action Tokenization:

RT-2 discretizes the continuous action space. For each action dimension \(a_d \in [a_{\min}, a_{\max}]\), it is uniformly divided into \(K=256\) bins:

\[\text{token}(a_d) = \left\lfloor \frac{a_d - a_{\min}}{a_{\max} - a_{\min}} \cdot (K-1) \right\rfloor\]

Output format: "1 128 91 241 5 101 127" — corresponding to xyz translation, rpy rotation, and gripper open/close.
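
The mapping in both directions is a few lines of NumPy. A minimal sketch, assuming illustrative per-dimension bounds of \([-1, 1]\) (RT-2's actual per-dimension calibration is not reproduced here):

```python
import numpy as np

# Illustrative bounds; RT-2's real per-dimension calibration differs.
A_MIN, A_MAX, K = -1.0, 1.0, 256

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to one of K integer bins."""
    a = np.clip(action, A_MIN, A_MAX)
    return np.floor((a - A_MIN) / (A_MAX - A_MIN) * (K - 1)).astype(np.int64)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Recover continuous values from bin indices (the inverse is lossy)."""
    return A_MIN + tokens / (K - 1) * (A_MAX - A_MIN)

action = np.array([0.1, -0.2, 0.3, 0.0, 0.0, 0.5, 1.0])  # xyz, rpy, gripper
tokens = tokenize(action)
print(tokens, detokenize(tokens))
```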

Advantages:

  • Inherits VLM's visual understanding and language reasoning capabilities
  • End-to-end training without manually designed intermediate representations
  • Can leverage knowledge from large-scale internet data pretraining

Disadvantages:

  • Action discretization loses precision
  • Large model size (billions of parameters), slow inference
  • Limited effectiveness on fine-grained manipulation tasks (e.g., insertion, suturing)

2.3 Paradigm C: Dedicated Robot Foundation Model

Core Idea: Instead of following general VLM architectures, design specialized model architectures and training pipelines based on the characteristics of robot data.

Representative Works:

  • Octo (Berkeley, 2023): Open-source multi-embodiment Transformer supporting various sensor inputs and action spaces
  • pi0 (Physical Intelligence, 2024): Uses a flow matching action head for continuous action output
  • HPT (MIT, 2024): Unified pretraining for heterogeneous sensor inputs

Mathematical Formulation of the Action Head:

pi0 uses Flow Matching as the action head. Given condition \(c\) (visual + language features), it learns a vector field \(v_\theta\) that maps a noise distribution to an action distribution:

\[\frac{d\mathbf{a}_t}{dt} = v_\theta(\mathbf{a}_t, t, c), \quad t \in [0, 1]\]

where \(\mathbf{a}_0 \sim \mathcal{N}(0, I)\) is the initial noise and \(\mathbf{a}_1\) is the predicted action sequence.

Training objective (Conditional Flow Matching Loss):

\[\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t, \mathbf{a}_0, \mathbf{a}_1}\left[\| v_\theta(\mathbf{a}_t, t, c) - (\mathbf{a}_1 - \mathbf{a}_0) \|^2\right]\]

where \(\mathbf{a}_t = (1-t)\mathbf{a}_0 + t\mathbf{a}_1\) is the linear interpolation path.
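
A compact PyTorch sketch of this objective and of sampling by Euler integration of the ODE. Here `v_theta` stands in for whatever conditioned network produces the velocity field; pi0's actual architecture, action horizon, and step schedule differ.

```python
import torch

def cfm_loss(v_theta, a1, cond):
    """Conditional flow matching loss.
    a1:   ground-truth action chunk, shape (B, D)
    cond: vision + language conditioning features, shape (B, d)"""
    a0 = torch.randn_like(a1)                  # a_0 ~ N(0, I)
    t = torch.rand(a1.shape[0], 1, device=a1.device)
    a_t = (1 - t) * a0 + t * a1                # linear interpolation path
    target = a1 - a0                           # constant velocity along the path
    return ((v_theta(a_t, t, cond) - target) ** 2).mean()

@torch.no_grad()
def sample_action(v_theta, cond, dim, steps=10):
    """Integrate da/dt = v_theta(a, t, c) from t = 0 to 1 with Euler steps."""
    a = torch.randn(cond.shape[0], dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i * dt, device=cond.device)
        a = a + dt * v_theta(a, t, cond)
    return a
```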

Advantages:

  • Can output continuous actions with high precision
  • Model architecture can be optimized for robot data characteristics
  • Supports multimodal sensor inputs

Disadvantages:

  • Requires large amounts of robot data for pretraining
  • Lacks the rich semantic understanding of general VLMs

3. Scaling Laws: Data Volume and Performance

3.1 Evolution of the RT Series

Google DeepMind's RT series pioneered the exploration of robot Scaling Laws:

graph LR
    A["RT-1 (2022)"] -->|"Scale up"| B["RT-2 (2023)"]
    B -->|"Data diversification"| C["RT-X (2023)"]

    A1["130K episodes<br/>700+ tasks<br/>13 robots"] --> A
    B1["PaLI-X 55B + PaLM-E 12B<br/>Web data co-training<br/>Emergent reasoning"] --> B
    C1["Open X-Embodiment<br/>1M+ episodes<br/>22 robot types<br/>160K+ tasks"] --> C

3.2 Key Findings

Leap from RT-1 to RT-2:

| Dimension | RT-1 | RT-2 |
|---|---|---|
| Model scale | 35M parameters | 12B–55B parameters |
| Training data | 130K robot episodes | Robot data + web data |
| Number of tasks | 700+ | 700+ (same) |
| Generalization | Seen objects/scenes | Unseen objects (emergent) |
| Reasoning ability | None | Understands e.g. "move apple to bowl of matching color" |
| Control frequency | 3 Hz | 1–3 Hz |

The most important finding of RT-2 is emergent capability: through pretraining on web-scale vision-language data, the model acquired semantic concepts that never appeared in the robot data, such as moving an object onto a photo of a named celebrity or onto a particular country's flag.

3.3 Is the Data Scale Sufficient?

The enormous gap between current robot data scale and language model data scale:

| Domain | Data Scale | Token/Sample Count |
|---|---|---|
| GPT-4 | ~13T tokens (reported) | ~13,000,000M |
| ImageNet | 14M images | – |
| Open X-Embodiment | 1M+ episodes | ~millions |
| Single lab | 1K–100K episodes | – |

The gap is 3–4 orders of magnitude or more. This raises several key questions:

  1. Do Scaling Laws similar to language models exist in robotics? — Preliminary evidence suggests that more data improves generalization, but a clean power-law relationship has not yet been established (see the fitting sketch after this list)
  2. Can web data compensate for the lack of robot data? — RT-2's experiments suggest yes, but the precision of physical interactions remains limited
  3. Is simulation data effective? — The Sim-to-Real gap remains a major challenge
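
On the first question, testing for a power law amounts to fitting \(\text{error} \approx a \cdot N^{-b}\) against dataset size \(N\). The sketch below shows only the fitting mechanics; the data points are synthetic placeholders, not measurements from any paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic placeholder points: episode count vs. task failure rate.
episodes = np.array([1e3, 1e4, 1e5, 1e6])
failure_rate = np.array([0.62, 0.45, 0.33, 0.24])  # hypothetical values

def power_law(n, a, b):
    return a * n ** (-b)

(a, b), _ = curve_fit(power_law, episodes, failure_rate, p0=(1.0, 0.1))
print(f"fit: failure rate ~ {a:.2f} * N^(-{b:.3f})")
```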

4. Open X-Embodiment: Cross-Embodiment Dataset

4.1 Overview

Open X-Embodiment (led by Google DeepMind, 2023) is currently the largest cross-embodiment robot dataset:

  • Data scale: Over 1 million episodes
  • Robot types: 22 different robot morphologies
  • Task count: 160,000+ different task descriptions
  • Contributing institutions: 33 datasets from 21 institutions

4.2 Data Diversity Architecture

graph TB
    OXE[Open X-Embodiment Dataset]

    OXE --> R1[Single-arm Tabletop Robots]
    OXE --> R2[Bimanual Manipulation Platforms]
    OXE --> R3[Mobile Manipulation Robots]
    OXE --> R4[Dexterous Hands]

    R1 --> D1[Bridge V2<br/>60K episodes]
    R1 --> D2[Kuka Grasping Data]
    R1 --> D3[DROID Data]
    R2 --> D4[ALOHA Data]
    R3 --> D5[RT-1 Data<br/>130K episodes]

    OXE --> Format[Unified RLDS Format]
    Format --> F1[observation: Image + Proprioception]
    Format --> F2[action: End-effector Pose]
    Format --> F3[language_instruction: Text]
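
In code, each RLDS episode is a nested record of steps. A minimal sketch using `tensorflow_datasets` follows; the GCS path mirrors the Open X-Embodiment release convention, but exact paths, versions, and observation keys differ across constituent datasets and should be treated as assumptions.

```python
import tensorflow_datasets as tfds

# Read one OXE constituent dataset in RLDS format (path is illustrative).
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:        # steps is a nested dataset
        obs = step["observation"]        # images + proprioception
        action = step["action"]          # e.g. end-effector delta pose
        print(list(obs.keys()))          # inspect per-dataset keys
```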

4.3 Key Conclusions

Open X-Embodiment experiments revealed several important conclusions:

  1. Positive Transfer: RT-X models trained on mixed multi-embodiment data outperform models trained solely on single-embodiment data for most individual embodiments
  2. Data diversity > Data volume: Compared to simply increasing data from the same robot, adding data from different robots is more helpful for generalization
  3. Challenge of unified action spaces: The action spaces of different robots differ greatly (joint space vs. end-effector space, different degrees of freedom), requiring a unified action representation (illustrated in the sketch below)
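
One deliberately simplified way to picture such a unified representation: map every embodiment's native action into a shared 7-D end-effector delta (translation, rotation, gripper). The adapters below are hypothetical illustrations, not RT-X's actual converters.

```python
import numpy as np

def from_eef_delta(action: np.ndarray) -> np.ndarray:
    """Robots already acting in end-effector space pass through (7,)."""
    return action

def from_joint_velocity(qdot: np.ndarray, jacobian: np.ndarray,
                        gripper: float) -> np.ndarray:
    """Approximate an end-effector twist from joint velocities via the
    manipulator Jacobian, then append the gripper command."""
    twist = jacobian @ qdot                     # (6,) linear + angular
    return np.concatenate([twist, [gripper]])   # unified (7,) action

# Per-embodiment adapters into the shared action space (illustrative).
ADAPTERS = {
    "widowx_eef": from_eef_delta,
    "franka_joint": from_joint_velocity,
}
```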

5. Future Outlook

5.1 Open Questions

  • Universal action representation: How to design a unified action space so that robots of different morphologies can share the same model?
  • Real-time performance: Current large model inference speeds (1–10Hz) are far below robot control requirements (100–1000Hz). How can this gap be closed? One common mitigation, action chunking, is sketched after this list
  • Safety: How to combine the black-box nature of foundation models with robot safety constraints?
  • Data flywheel: How to build an automated loop of data collection → model training → deployment → data collection?
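
On the real-time question, one widely used mitigation (used by ACT and pi0, among others) is action chunking: a slow policy predicts a block of future actions per inference call, while a fast control loop consumes them at the robot's rate. A schematic sketch, with `policy`, `get_obs`, and `send_to_robot` as placeholder callables; real systems run inference asynchronously rather than inline.

```python
import time
from collections import deque

CONTROL_HZ = 100   # robot-side control rate
CHUNK = 50         # actions predicted per policy call (~0.5 s of motion)

def control_loop(policy, get_obs, send_to_robot):
    buffer = deque()
    while True:
        if len(buffer) < CHUNK // 2:           # refill before running dry
            buffer.extend(policy(get_obs()))   # one slow inference call
        send_to_robot(buffer.popleft())        # consume at the control rate
        time.sleep(1.0 / CONTROL_HZ)
```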

5.2 Three-Paradigm Convergence Trend

graph TB
    T1["2022: Independent development of each paradigm"] --> T2["2023-2024: Paradigm B+C convergence"]
    T2 --> T3["2025+: Three-paradigm convergence"]

    T3 --> F1["High level: LLM reasoning and planning"]
    T3 --> F2["Mid level: VLM understanding and grounding"]
    T3 --> F3["Low level: Dedicated action model"]

    F1 <--> F2
    F2 <--> F3

This trend is already visible in pi0.5 (Physical Intelligence, 2025): a high-level language model decomposes the task, a mid-level VLM grounds the scene, and a low-level flow matching model outputs fine-grained actions. This layered stack may become the mainstream architecture for future robot foundation models.


6. Summary

| Comparison Dimension | Paradigm A (LLM Planning) | Paradigm B (VLM Fine-tuning) | Paradigm C (Dedicated Foundation Model) |
|---|---|---|---|
| Representative models | SayCan, Code as Policies | RT-2, OpenVLA | Octo, pi0, HPT |
| Action output | Calls low-level APIs | Discrete tokens | Continuous values / diffusion sampling |
| Control precision | Low (depends on low-level skills) | Medium | High |
| Reasoning ability | Strong | Medium–strong | Weak–medium |
| Control frequency | <1Hz | 1–3Hz | 5–50Hz |
| Data requirements | Low (leverages pretraining) | Medium (fine-tuning) | Large (pretraining) |
| Open-source availability | High | Medium | Medium–high |

Robot foundation models are still in their early stages, but the three paradigms are rapidly developing and converging. Key driving forces include: larger-scale cross-embodiment datasets, more efficient model architectures, and simulation-to-real transfer technologies.


References:

  • Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", 2022
  • Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", 2023
  • Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", 2023
  • Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control", 2024
  • Octo Model Team, "Octo: An Open-Source Generalist Robot Policy", 2023
  • Bommasani et al., "On the Opportunities and Risks of Foundation Models", 2021
