
Key Paper Deep Dives

Overview

This article provides in-depth analyses of six milestone papers in embodied intelligence. Each paper is examined across five dimensions (problem definition, method design, key formulas, experimental results, and historical significance), so that readers can systematically trace the technical evolution from LLM-driven robots to general-purpose foundation models.


1. SayCan -- When Language Models Meet Robot Affordances

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (Ahn et al., 2022, Google Research)

1.1 Problem

Large language models possess rich world knowledge and reasoning capabilities, but they do not understand what a specific robot can do in a specific scenario. How can we combine the semantic knowledge of LLMs with the physical capabilities of robots?

1.2 Core Idea

The LLM evaluates "what should be done," while the robot policy evaluates "what can be done"; multiplying the two yields the final decision.

1.3 Method

Affordance Scoring:

\[\text{score}(a_i) = p(\text{useful} | a_i, l) \cdot p(\text{possible} | a_i, s_t)\]

where:

  • \(p(\text{useful} | a_i, l)\): LLM language score. Given user instruction \(l\), the LLM evaluates the plausibility of candidate skill \(a_i\) as the next step. Concretely implemented as the LLM's token probability for "\(l\). The robot should: 1. \(a_i\)."
  • \(p(\text{possible} | a_i, s_t)\): Affordance score. Provided by a pretrained value function \(V^{a_i}(s_t)\), reflecting the success probability of executing skill \(a_i\) in current state \(s_t\).

Greedy Decoding:

At each planning step:

\[a_t^* = \arg\max_{a_i \in \mathcal{A}} \left[ p(\text{useful} | a_i, l_t) \cdot V^{a_i}(s_t) \right]\]

After executing \(a_t^*\), the result is appended to the LLM's context, and planning continues for the next step until the LLM outputs "done."
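The scoring-and-selection step above can be sketched in a few lines of Python. Here `llm_log_prob` and `value_fns` are hypothetical stand-ins for the LLM token-probability query and the pretrained per-skill value functions; the sketch only illustrates the multiplication-and-argmax logic, not the actual SayCan implementation.

```python
import math

def saycan_step(skills, context, state, llm_log_prob, value_fns):
    """Pick the next skill: LLM usefulness times affordance value.

    llm_log_prob(context, skill) -> log p(useful | skill, context)
    value_fns[skill](state)      -> p(possible | skill, state), in [0, 1]
    """
    best_skill, best_score = None, -math.inf
    for skill in skills:
        score = math.exp(llm_log_prob(context, skill)) * value_fns[skill](state)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill, best_score
```

Note how a skill the LLM loves but the robot cannot execute (high usefulness, near-zero value) loses to a feasible but less "obvious" skill, which is exactly the affordance-filtering behavior SayCan relies on.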

1.4 Skill Library

  • 551 skills: pick, place, go to, open, close, etc.
  • Each skill has an independent BC (behavioral cloning) policy and value function
  • Trained on real mobile manipulation robots (Everyday Robots)

1.5 Key Results

| Metric | SayCan | LLM Only | Affordance Only |
|---|---|---|---|
| Planning Success Rate | 84% | 14% | - |
| Execution Success Rate | 74% | 0% | - |
| Long-Horizon Tasks | Handled | Severe hallucination | No planning capability |

1.6 Significance and Limitations

Significance:

  • First systematic combination of LLMs with robot control
  • Proposed an elegant "affordance filtering" framework
  • Pioneered the LLM for Robotics research direction

Limitations:

  • The skill library is fixed and predefined
  • Requires training a separate policy and value function for each skill
  • Cannot handle tasks outside the skill library

2. RT-1 -- Large-Scale Robotics Transformer

RT-1: Robotics Transformer for Real-World Control at Scale (Brohan et al., 2022, Google/Everyday Robots)

2.1 Problem

Previous robot learning methods typically trained on small-scale data and struggled to generalize to new scenarios and instructions. Can we improve robot policy generalization by scaling data and model size, as was done with Transformers in NLP?

2.2 Method

Architecture:

Input:

  • 6 historical images (current frame + 5 history frames), encoded by EfficientNet-B3
  • Natural language instruction, encoded by Universal Sentence Encoder

FiLM Conditioning: Language embeddings modulate visual features through FiLM (Feature-wise Linear Modulation) layers:

\[\text{FiLM}(x; l) = \gamma(l) \odot x + \beta(l)\]

where \(\gamma(l)\) and \(\beta(l)\) are scaling and bias parameters mapped from language embedding \(l\).
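The FiLM operation itself is a single elementwise affine transform. The numpy sketch below uses fixed toy values for \(\gamma\) and \(\beta\); in RT-1 they are produced by learned linear maps of the instruction embedding, which is not shown here.

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift features
    channel-wise using parameters derived from the language embedding."""
    return gamma * x + beta

# Toy values for illustration only; real gamma/beta come from the
# instruction embedding through learned projections.
x = np.ones((2, 4))                   # (tokens, channels) visual features
gamma = np.array([2.0, 1.0, 0.5, 1.0])
beta = np.array([0.0, 1.0, 0.0, -1.0])
y = film(x, gamma, beta)              # each channel scaled then shifted
```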

Tokenized Action Space:

Continuous actions are discretized into 256 bins:

\[a_t = [x, y, z, \text{roll}, \text{pitch}, \text{yaw}, \text{gripper}]\]

Each dimension is discretized into 256 values, and the Transformer autoregressively predicts each action dimension token.
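The 256-bin discretization can be sketched as a uniform quantizer per dimension; this is a minimal illustration of the idea, not RT-1's exact code (bin boundaries and clipping ranges are assumptions here).

```python
import numpy as np

def discretize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins-1]."""
    a = np.clip(action, low, high)
    return ((a - low) / (high - low) * (n_bins - 1)).round().astype(int)

def undiscretize(bins, low, high, n_bins=256):
    """Recover an approximate continuous action from bin indices."""
    return low + bins / (n_bins - 1) * (high - low)
```

The Transformer then predicts one such bin index per action dimension, autoregressively, which turns continuous control into a sequence-modeling problem.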

TokenLearner: Uses the TokenLearner module to compress visual tokens from 81 to 8, significantly reducing computation.

2.3 Training Data

| Attribute | Value |
|---|---|
| Demonstration Trajectories | 130,000+ |
| Collection Robots | 13 |
| Collection Duration | 17 months |
| Task Types | 700+ |
| Object Types | Hundreds |

2.4 Key Results

| Evaluation Dimension | RT-1 | Gato | BC-Z |
|---|---|---|---|
| Seen Task Success Rate | 97% | 63% | 72% |
| Unseen Tasks (New Instructions) | 76% | 34% | 48% |
| Unseen Tasks (New Objects) | 53% | 24% | 29% |
| Long-Horizon Tasks | High | Low | Medium |

2.5 Key Findings

  1. Data scale is critical: Performance grows approximately logarithmically with data volume
  2. Multi-task training aids generalization: Joint training on 700+ tasks outperforms single-task training
  3. Real data > simulation data: At this scale, real data is more valuable than simulation data

2.6 Significance

RT-1 was the "GPT moment" for robot learning -- the first time a single Transformer policy was trained on large-scale real data, demonstrating strong multi-task generalization capabilities. It proved that Scaling Laws apply in the robotics domain as well.


3. RT-2 -- From Vision-Language Model to Vision-Language-Action Model

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., 2023, Google DeepMind)

3.1 Problem

The internet contains vast amounts of vision-language data rich in world knowledge. Can the knowledge within vision-language models (VLMs) be directly transferred to robot control?

3.2 Core Innovation: Actions as Text Tokens

Key Insight: Robot actions can be represented as text token sequences and processed uniformly with language tokens.

Action representation:

\[a_t = \underbrace{[x, y, z, \text{rx}, \text{ry}, \text{rz}, \text{gripper}]}_{\text{7-dimensional action}} \rightarrow \underbrace{[\text{token}_1, \text{token}_2, \ldots, \text{token}_7]}_{\text{7 text tokens}}\]

Each action dimension is discretized into 256 bins, mapped to special tokens: rt_000 through rt_255.
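Rendering discretized bins as text tokens is then a pure formatting step. The `rt_NNN` names below follow the text above and are illustrative; the actual RT-2 implementation reserves or reuses tokens already present in the VLM's vocabulary rather than literally emitting these strings.

```python
def action_to_tokens(bins):
    """Render discretized action bins as text tokens, so the VLM can
    emit an action the same way it emits any other token sequence.
    Token names are illustrative, per the rt_000..rt_255 scheme above."""
    return [f"rt_{b:03d}" for b in bins]
```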

3.3 Training Pipeline

  1. Pretraining Phase: PaLI-X (55B) or PaLM-E (12B) pretrained on internet-scale vision-language data
  2. Co-fine-tuning: Simultaneously fine-tuned on web data and robot data
    • Web data: Visual question answering, image captioning, etc.
    • Robot data: RT-1's data (with action tokens added)
\[\mathcal{L} = \mathcal{L}_{\text{web}}(\text{VQA, caption, ...}) + \lambda \cdot \mathcal{L}_{\text{robot}}(\text{action tokens})\]

3.4 Emergent Capabilities

RT-2 exhibited reasoning abilities absent from the training data:

| Emergent Capability | Example |
|---|---|
| Symbolic Reasoning | "Throw the trash in the correct bin" (requires judging recyclable/non-recyclable) |
| Mathematical Reasoning | "Move to next to the triangle" (requires shape recognition) |
| Language Generalization | Understanding instructions never seen in robot data |
| Visual Concept Transfer | Manipulating objects never seen in robot training |

3.5 Key Results

| Evaluation Dimension | RT-2 (PaLI-X) | RT-1 | VC-1 |
|---|---|---|---|
| Seen Tasks | 95% | 97% | 73% |
| Unseen Objects | 62% | 32% | 22% |
| Unseen Backgrounds | 72% | 36% | 29% |
| Semantic Reasoning Tasks | 62% | 0% | 0% |

3.6 Significance

RT-2 established the VLA (Vision-Language-Action) paradigm:

\[\text{VLM} \xrightarrow{\text{Action Token Fine-tuning}} \text{VLA}\]

It demonstrated that internet-pretrained vision-language knowledge can be effectively transferred to physical robot control. This means robots can leverage the entire internet's knowledge base.


4. Diffusion Policy -- Diffusion Model-Driven Robot Policies

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Chi et al., 2023, Columbia University/Toyota Research Institute)

4.1 Problem

Traditional behavioral cloning methods perform poorly when facing multimodal distributions. For example, when navigating around an obstacle, the agent can go left or right; mean regression causes the policy to collide directly with the obstacle:

\[a_{\text{mean}} = \frac{a_{\text{left}} + a_{\text{right}}}{2} = a_{\text{collision}}\]

How can we learn policies that express multimodal action distributions?

4.2 Method: DDPM for Action Generation

Core Idea: Model policy learning as a conditional denoising diffusion process (DDPM).

Forward Diffusion (Adding Noise):

\[q(a_t^k | a_t^{k-1}) = \mathcal{N}(a_t^k; \sqrt{1-\beta_k} a_t^{k-1}, \beta_k I)\]
\[q(a_t^K | a_t^0) = \mathcal{N}(a_t^K; \sqrt{\bar{\alpha}_K} a_t^0, (1-\bar{\alpha}_K) I)\]

where \(k\) is the diffusion step (not the timestep); after \(K\) steps, the action becomes pure noise.

Reverse Denoising (Generating Actions):

\[p_\theta(a_t^{k-1} | a_t^k, o_t) = \mathcal{N}(a_t^{k-1}; \mu_\theta(a_t^k, k, o_t), \sigma_k^2 I)\]

Network \(\epsilon_\theta\) predicts the noise:

\[\mu_\theta(a_t^k, k, o_t) = \frac{1}{\sqrt{\alpha_k}}\left(a_t^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(a_t^k, k, o_t)\right)\]

Training Objective:

\[\mathcal{L} = \mathbb{E}_{k, a_t^0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_k} a_t^0 + \sqrt{1-\bar{\alpha}_k}\epsilon, k, o_t)\|^2 \right]\]
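One Monte-Carlo sample of this objective is easy to write down. The numpy sketch below uses an illustrative linear beta schedule and a placeholder `eps_theta`; in practice `eps_theta` is the conditional U-Net or Transformer, and the schedule and step count are design choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                   # number of diffusion steps (toy value)
betas = np.linspace(1e-4, 0.02, K)        # illustrative noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # \bar{alpha}_k in the formulas above

def ddpm_loss(eps_theta, a0, obs):
    """One sample of the DDPM objective: corrupt the clean action a0 at a
    random step k, then measure how well eps_theta recovers the noise."""
    k = rng.integers(K)
    eps = rng.standard_normal(a0.shape)
    a_k = np.sqrt(alpha_bars[k]) * a0 + np.sqrt(1 - alpha_bars[k]) * eps
    return np.mean((eps - eps_theta(a_k, k, obs)) ** 2)
```

Averaging this quantity over a minibatch and backpropagating through `eps_theta` gives the training loop; at deployment the learned network is plugged into the reverse-denoising mean \(\mu_\theta\) above.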

4.3 Key Design Choices

Action Chunk Prediction:

Instead of predicting a single-step action, predict a sequence of \(T_a\) future actions:

\[A_t = [a_t, a_{t+1}, \ldots, a_{t+T_a-1}]\]

This provides temporal consistency, avoiding the jitter problem of step-by-step prediction.

Observation History:

Uses the most recent \(T_o\) steps of observations as conditioning:

\[O_t = [o_{t-T_o+1}, \ldots, o_t]\]

Two Architecture Variants:

| Variant | Conditioning Method | Characteristics |
|---|---|---|
| CNN-based | 1D temporal CNN processes action sequences; FiLM injects observations | Fast inference, suitable for real-time control |
| Transformer-based | Cross-attention fuses observations and actions | More flexible, slightly better performance |

4.4 Key Results

Performance across 11 manipulation tasks:

| Method | Average Success Rate | Multimodal Tasks |
|---|---|---|
| Diffusion Policy (CNN) | 86.8% | Excellent |
| Diffusion Policy (Transformer) | 83.5% | Excellent |
| LSTM-GMM | 62.7% | Fair |
| IBC (Implicit BC) | 52.3% | Fair |
| BeT | 50.1% | Fair |

4.5 Why Diffusion Models Suit Robot Policies

  1. Multimodal Expression: Naturally supports multimodal action distributions
  2. High-Dimensional Action Spaces: Diffusion models excel at high-dimensional distribution modeling
  3. Stable Training: More stable than GANs
  4. Flexible Conditioning: Easy to incorporate various conditioning information
  5. Temporal Consistency: Action chunk prediction provides smooth trajectories

4.6 Significance

Diffusion Policy introduced the generative model paradigm into robot policy learning, solving the core challenge of behavioral cloning (multimodal distributions). Since then, diffusion models have become one of the standard choices for robot manipulation policies.


5. Open X-Embodiment -- Cross-Embodiment Open Dataset

Open X-Embodiment: Robotic Learning Datasets and RT-X Models (Open X-Embodiment Collaboration, 2024, 33 institutions)

5.1 Problem

Robot learning faces severe data fragmentation:

  • Each lab collects its own data
  • Different robots, different formats, different tasks
  • Cannot leverage other robots' experience

How can we build the "ImageNet" of robot learning?

5.2 Dataset

Scale:

Attribute Value
Participating Institutions 33
Number of Datasets 60+
Robot Types 22
Total Trajectories 1,000,000+
Data Format RLDS (unified)

Data Format Standardization (RLDS):

Each trajectory is unified as:

```
{
  "steps": [
    {
      "observation": {
        "image": ...,           # RGB image
        "wrist_image": ...,     # Wrist camera (optional)
        "state": ...            # Proprioception
      },
      "action": ...,            # Standardized action
      "language_instruction": ...,
      "reward": ...,
      "is_terminal": ...
    },
    ...
  ]
}
```
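A trajectory in this schema can be consumed with a few lines of Python. Real RLDS datasets are TensorFlow datasets with this nested structure; the plain-dict version below is a simplified stand-in to show the access pattern.

```python
def iter_transitions(trajectory):
    """Yield (observation, action, language_instruction) tuples from one
    RLDS-style trajectory dict shaped like the schema above."""
    for step in trajectory["steps"]:
        yield step["observation"], step["action"], step["language_instruction"]
```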

Robot Type Coverage:

  • Single-arm tabletop manipulation (Franka, UR5, xArm, ...)
  • Bimanual manipulation (ALOHA, Baxter, ...)
  • Mobile manipulation (Everyday Robots, Stretch, ...)
  • Quadruped robots (A1, Spot, ...)
  • Dexterous hands (Allegro, LEAP, ...)

5.3 RT-X Models

Cross-embodiment models trained on Open X-Embodiment data:

RT-1-X: RT-1 architecture trained on mixed data

RT-2-X: RT-2 architecture trained on mixed data

5.4 Key Findings

Positive Transfer:

| Evaluation Target | RT-1-X vs RT-1 (Single-Dataset Training) | Improvement |
|---|---|---|
| Average Performance on Target Robot | Improved in ~50% of evaluated scenarios | Significant |
| Cross-Robot Generalization | Clearly better than single-dataset training | Significant |

Key experiment: a policy for robot A, after cross-embodiment training, outperformed the same policy trained on robot A's data alone. This showed that experience from different robots can benefit one another.

Data Scale Effect:

\[\text{Performance} \propto \log(\text{dataset size})\]

Even adding data from robots with completely different morphologies from the target still improved overall performance.

5.5 Significance

Open X-Embodiment represents the "ImageNet moment" for embodied intelligence:

  • Established standards and culture for cross-embodiment data sharing
  • Demonstrated the feasibility of cross-embodiment transfer learning
  • Provided the data foundation for subsequent general robot foundation models
  • Promoted the development of open-source data ecosystems

6. pi0 -- General-Purpose Robot Foundation Model

pi0: A Vision-Language-Action Flow Model for General Robot Control (Black et al., 2024, Physical Intelligence)

6.1 Problem

How can we build a truly general-purpose robot foundation model: one that works across multiple robots and tasks, and can be quickly adapted to new tasks with minimal data?

6.2 Architecture

pi0 adopts a dual-component architecture:

VLM Backbone:

Based on a pretrained vision-language model (a PaliGemma 3B variant), processing:

  • Multi-view image inputs
  • Natural language instructions
  • Proprioceptive state

Flow Matching Action Head:

Unlike RT-2's discrete token output, pi0 uses Flow Matching to generate continuous actions.

6.3 Flow Matching

Flow Matching is an alternative to diffusion models that learns a velocity field to transform a noise distribution into a data distribution:

Basic Idea:

Define a linear path from noise \(x_0 \sim \mathcal{N}(0, I)\) to data \(x_1 \sim p_{\text{data}}\):

\[x_t = (1-t) x_0 + t x_1, \quad t \in [0, 1]\]

The corresponding velocity field is:

\[u_t(x_t) = x_1 - x_0\]

Training Objective:

\[\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1} \left[ \|v_\theta(x_t, t, c) - (x_1 - x_0)\|^2 \right]\]

where \(c\) is the conditioning information (images, language, proprioception) and \(v_\theta\) is the neural network-parameterized velocity field.

Inference (Generating Actions):

\[x_1 = x_0 + \int_0^1 v_\theta(x_t, t, c) \, dt\]

Solved through numerical integration (e.g., Euler method):

\[x_{t+\Delta t} = x_t + v_\theta(x_t, t, c) \cdot \Delta t\]
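The training target and Euler sampler above fit in a few lines of numpy. This is a minimal sketch of the flow-matching mechanics, not pi0's implementation: `v_theta` stands in for the conditioned network, and the step count is arbitrary.

```python
import numpy as np

def fm_target(x0, x1):
    """Velocity target on the linear path x_t = (1-t) x0 + t x1:
    the constant displacement x1 - x0, per the formulas above."""
    return x1 - x0

def euler_sample(v_theta, x0, cond, n_steps=10):
    """Integrate dx/dt = v_theta(x, t, c) from t=0 to t=1 with Euler steps,
    turning a noise sample x0 into an action chunk."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + v_theta(x, i * dt, cond) * dt
    return x
```

Because the learned path is deterministic and nearly straight, a small number of Euler steps suffices, which is the source of the inference-speed advantage discussed next.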

6.4 Advantages Over Diffusion Models

| Dimension | DDPM Diffusion | Flow Matching |
|---|---|---|
| Path | Stochastic (SDE) | Deterministic (ODE) |
| Training Target | Predicts noise \(\epsilon\) | Predicts velocity \(v\) |
| Sampling Steps | Typically 50-1000 | Typically 10-50 |
| Inference Speed | Slower | Faster (suitable for real-time control) |
| Training Stability | Good | Better |

6.5 Training Pipeline

Stage 1: Pretraining

Pretrained on large-scale cross-embodiment data (similar to Open X-Embodiment + proprietary data)

Stage 2: Task Fine-tuning

Fine-tuned on task-specific data with a smaller learning rate

Stage 3: Online Fine-tuning

Rapid adaptation through a small amount of data collected in the deployment environment

6.6 Key Results

pi0 demonstrated strong generalization across multiple tasks and robots:

| Task | Robot Platform | Success Rate |
|---|---|---|
| Folding Clothes | Bimanual + Dexterous Hand | High |
| Table Cleanup | Single-Arm Manipulation | High |
| Box Packing | Bimanual | High |
| Zero-Shot Novel Object Manipulation | Various | Medium-High |

Comparison with Baselines (on manipulation tasks):

| Method | Average Success Rate |
|---|---|
| pi0 | Highest |
| Diffusion Policy | Second |
| RT-2-X | Medium |
| ACT | Lower |

6.7 Significance

pi0 represents the latest paradigm for robot foundation models:

  1. VLM as the "Brain": Leveraging internet-pretrained knowledge for understanding and reasoning
  2. Flow Matching as the "Motor System": Efficiently generating smooth continuous actions
  3. Pretrain-Fine-tune Paradigm: Large-scale pretraining + task-specific fine-tuning
  4. Generality: A single model adapting to multiple robots and tasks

7. Technical Evolution Across Papers

```mermaid
flowchart TB
    A[SayCan 2022<br/>LLM + Fixed Skills] --> B[RT-1 2022<br/>Large-Scale Learned Policies]
    B --> C[RT-2 2023<br/>VLM→VLA Transfer]
    D[Diffusion Policy 2023<br/>Generative Policies] --> F[pi0 2024<br/>VLM + Flow Matching]
    C --> E[Open X-Embodiment 2024<br/>Cross-Embodiment Data]
    E --> F
    C --> F
```

Main Line of Technical Evolution:

| Stage | Representative | Paradigm |
|---|---|---|
| LLM-Assisted | SayCan | LLM planning + predefined skills |
| Large-Scale Learning | RT-1 | Transformer + large data |
| Knowledge Transfer | RT-2 | VLM \(\rightarrow\) VLA |
| Generative Policies | Diffusion Policy | Diffusion models generate actions |
| Open Ecosystem | Open X-Embodiment | Cross-embodiment data sharing |
| Foundation Model | pi0 | VLM + Flow Matching + pretrain-fine-tune |

8. Summary and Outlook

8.1 Technical Trends

  1. Scaling Up: Data scale, model scale, and task scale continue to expand
  2. Unification: Perception, reasoning, and control are progressively unified into single models
  3. Transfer: Internet knowledge \(\rightarrow\) robots, robot A \(\rightarrow\) robot B
  4. Generation: From discriminative policies to generative policies

8.2 Unresolved Questions

  • Safety: How can the behavior of end-to-end models be guaranteed safe?
  • Interpretability: How can we understand the decision-making process of VLA models?
  • Data Efficiency: Can the same performance be achieved with less data?
  • Long Horizon: How to handle complex tasks requiring hundreds of steps?
  • Embodied Reasoning: Physical reasoning capabilities beyond pattern matching

References

  • Ahn, M. et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances"
  • Brohan, A. et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale"
  • Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control"
  • Chi, C. et al. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
  • Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models"
  • Black, K. et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control"
