
Key Paper Deep Dives

Overview

This article provides in-depth analyses of six milestone papers in embodied intelligence. Each paper is examined across five dimensions (problem definition, method design, key formulas, experimental results, and historical significance), so that readers can systematically trace the technical evolution from LLM-driven robots to general-purpose foundation models.


1. SayCan -- When Language Models Meet Robot Affordances

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (Ahn et al., 2022, Google Research)

1.1 Problem

Large language models possess rich world knowledge and reasoning capabilities, but they do not understand what a specific robot can do in a specific scenario. How can we combine the semantic knowledge of LLMs with the physical capabilities of robots?

1.2 Core Idea

The LLM evaluates "what should be done," while the robot policy evaluates "what can be done"; multiplying the two yields the final decision.

1.3 Method

Affordance Scoring:

\[\text{score}(a_i) = p(\text{useful} | a_i, l) \cdot p(\text{possible} | a_i, s_t)\]

where:

  • \(p(\text{useful} | a_i, l)\): LLM language score. Given user instruction \(l\), the LLM evaluates the plausibility of candidate skill \(a_i\) as the next step. Concretely implemented as the LLM's token probability for "\(l\). The robot should: 1. \(a_i\)."
  • \(p(\text{possible} | a_i, s_t)\): Affordance score. Provided by a pretrained value function \(V^{a_i}(s_t)\), reflecting the success probability of executing skill \(a_i\) in current state \(s_t\).

Greedy Decoding:

At each planning step:

\[a_t^* = \arg\max_{a_i \in \mathcal{A}} \left[ p(\text{useful} | a_i, l_t) \cdot V^{a_i}(s_t) \right]\]

After executing \(a_t^*\), the result is appended to the LLM's context, and planning continues for the next step until the LLM outputs "done."
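The scoring-and-selection step above can be sketched in a few lines of Python. Here `llm_log_prob` and `value_fns` are hypothetical stand-ins for the LLM token-probability query and the pretrained per-skill value functions; the sketch only illustrates the multiplication-and-argmax logic, not the actual SayCan implementation.

```python
import math

def saycan_step(skills, context, state, llm_log_prob, value_fns):
    """Pick the next skill: LLM usefulness times affordance value.

    llm_log_prob(context, skill) -> log p(useful | skill, context)
    value_fns[skill](state)      -> p(possible | skill, state), in [0, 1]
    """
    best_skill, best_score = None, -math.inf
    for skill in skills:
        score = math.exp(llm_log_prob(context, skill)) * value_fns[skill](state)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill, best_score
```

Note how a skill the LLM loves but the robot cannot execute (high usefulness, near-zero value) loses to a feasible but less "obvious" skill, which is exactly the affordance-filtering behavior SayCan relies on.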

1.4 Skill Library

  • 551 skills: pick, place, go to, open, close, etc.
  • Each skill has an independent BC (behavioral cloning) policy and value function
  • Trained on real mobile manipulation robots (Everyday Robots)

1.5 Key Results

| Metric | SayCan | LLM Only | Affordance Only |
|---|---|---|---|
| Planning Success Rate | 84% | 14% | - |
| Execution Success Rate | 74% | 0% | - |
| Long-Horizon Tasks | Handled | Severe hallucination | No planning capability |

1.6 Significance and Limitations

Significance:

  • First systematic combination of LLMs with robot control
  • Proposed an elegant "affordance filtering" framework
  • Pioneered the LLM for Robotics research direction

Limitations:

  • The skill library is fixed and predefined
  • Requires training a separate policy and value function for each skill
  • Cannot handle tasks outside the skill library

2. RT-1 -- Large-Scale Robotics Transformer

RT-1: Robotics Transformer for Real-World Control at Scale (Brohan et al., 2022, Google/Everyday Robots)

2.1 Problem

Previous robot learning methods typically trained on small-scale data and struggled to generalize to new scenarios and instructions. Can we improve robot policy generalization by scaling data and model size, as was done with Transformers in NLP?

2.2 Method

Architecture:

Input:

  • 6 historical images (current frame + 5 history frames), encoded by EfficientNet-B3
  • Natural language instruction, encoded by Universal Sentence Encoder

FiLM Conditioning: Language embeddings modulate visual features through FiLM (Feature-wise Linear Modulation) layers:

\[\text{FiLM}(x; l) = \gamma(l) \odot x + \beta(l)\]

where \(\gamma(l)\) and \(\beta(l)\) are scaling and bias parameters mapped from language embedding \(l\).
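The FiLM operation itself is a single elementwise affine transform. The numpy sketch below uses fixed toy values for \(\gamma\) and \(\beta\); in RT-1 they are produced by learned linear maps of the instruction embedding, which is not shown here.

```python
import numpy as np

def film(x, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift features
    channel-wise using parameters derived from the language embedding."""
    return gamma * x + beta

# Toy values for illustration only; real gamma/beta come from the
# instruction embedding through learned projections.
x = np.ones((2, 4))                   # (tokens, channels) visual features
gamma = np.array([2.0, 1.0, 0.5, 1.0])
beta = np.array([0.0, 1.0, 0.0, -1.0])
y = film(x, gamma, beta)              # each channel scaled then shifted
```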

Tokenized Action Space:

Continuous actions are discretized into 256 bins:

\[a_t = [x, y, z, \text{roll}, \text{pitch}, \text{yaw}, \text{gripper}]\]

Each dimension is discretized into 256 values, and the Transformer autoregressively predicts each action dimension token.
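The 256-bin discretization can be sketched as a uniform quantizer per dimension; this is a minimal illustration of the idea, not RT-1's exact code (bin boundaries and clipping ranges are assumptions here).

```python
import numpy as np

def discretize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins-1]."""
    a = np.clip(action, low, high)
    return ((a - low) / (high - low) * (n_bins - 1)).round().astype(int)

def undiscretize(bins, low, high, n_bins=256):
    """Recover an approximate continuous action from bin indices."""
    return low + bins / (n_bins - 1) * (high - low)
```

The Transformer then predicts one such bin index per action dimension, autoregressively, which turns continuous control into a sequence-modeling problem.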

TokenLearner: Uses the TokenLearner module to compress visual tokens from 81 to 8, significantly reducing computation.

2.3 Training Data

| Attribute | Value |
|---|---|
| Demonstration Trajectories | 130,000+ |
| Collection Robots | 13 |
| Collection Duration | 17 months |
| Task Types | 700+ |
| Object Types | Hundreds |

2.4 Key Results

| Evaluation Dimension | RT-1 | Gato | BC-Z |
|---|---|---|---|
| Seen Task Success Rate | 97% | 63% | 72% |
| Unseen Tasks (New Instructions) | 76% | 34% | 48% |
| Unseen Tasks (New Objects) | 53% | 24% | 29% |
| Long-Horizon Tasks | High | Low | Medium |

2.5 Key Findings

  1. Data scale is critical: Performance grows approximately logarithmically with data volume
  2. Multi-task training aids generalization: Joint training on 700+ tasks outperforms single-task training
  3. Real data > simulation data: At this scale, real data is more valuable than simulation data

2.6 Significance

RT-1 was the "GPT moment" for robot learning -- the first time a single Transformer policy was trained on large-scale real data, demonstrating strong multi-task generalization capabilities. It proved that Scaling Laws apply in the robotics domain as well.


3. RT-2 -- From Vision-Language Model to Vision-Language-Action Model

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Brohan et al., 2023, Google DeepMind)

3.1 Problem

The internet contains vast amounts of vision-language data rich in world knowledge. Can the knowledge within vision-language models (VLMs) be directly transferred to robot control?

3.2 Core Innovation: Actions as Text Tokens

Key Insight: Robot actions can be represented as text token sequences and processed uniformly with language tokens.

Action representation:

\[a_t = \underbrace{[x, y, z, \text{rx}, \text{ry}, \text{rz}, \text{gripper}]}_{\text{7-dimensional action}} \rightarrow \underbrace{[\text{token}_1, \text{token}_2, \ldots, \text{token}_7]}_{\text{7 text tokens}}\]

Each action dimension is discretized into 256 bins, mapped to special tokens: rt_000 through rt_255.
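Rendering discretized bins as text tokens is then a pure formatting step. The `rt_NNN` names below follow the text above and are illustrative; the actual RT-2 implementation reserves or reuses tokens already present in the VLM's vocabulary rather than literally emitting these strings.

```python
def action_to_tokens(bins):
    """Render discretized action bins as text tokens, so the VLM can
    emit an action the same way it emits any other token sequence.
    Token names are illustrative, per the rt_000..rt_255 scheme above."""
    return [f"rt_{b:03d}" for b in bins]
```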

3.3 Training Pipeline

  1. Pretraining Phase: PaLI-X (55B) or PaLM-E (12B) pretrained on internet-scale vision-language data
  2. Co-fine-tuning: Simultaneously fine-tuned on web data and robot data
    • Web data: Visual question answering, image captioning, etc.
    • Robot data: RT-1's data (with action tokens added)
\[\mathcal{L} = \mathcal{L}_{\text{web}}(\text{VQA, caption, ...}) + \lambda \cdot \mathcal{L}_{\text{robot}}(\text{action tokens})\]

3.4 Emergent Capabilities

RT-2 exhibited reasoning abilities absent from the training data:

| Emergent Capability | Example |
|---|---|
| Symbolic Reasoning | "Throw the trash in the correct bin" (requires judging recyclable/non-recyclable) |
| Mathematical Reasoning | "Move to next to the triangle" (requires shape recognition) |
| Language Generalization | Understanding instructions never seen in robot data |
| Visual Concept Transfer | Manipulating objects never seen in robot training |

3.5 Key Results

| Evaluation Dimension | RT-2 (PaLI-X) | RT-1 | VC-1 |
|---|---|---|---|
| Seen Tasks | 95% | 97% | 73% |
| Unseen Objects | 62% | 32% | 22% |
| Unseen Backgrounds | 72% | 36% | 29% |
| Semantic Reasoning Tasks | 62% | 0% | 0% |

3.6 Significance

RT-2 established the VLA (Vision-Language-Action) paradigm:

\[\text{VLM} \xrightarrow{\text{Action Token Fine-tuning}} \text{VLA}\]

It demonstrated that internet-pretrained vision-language knowledge can be effectively transferred to physical robot control. This means robots can leverage the entire internet's knowledge base.


4. Diffusion Policy -- Diffusion Model-Driven Robot Policies

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Chi et al., 2023, Columbia University/Toyota Research Institute)

4.1 Problem

Traditional behavioral cloning methods perform poorly when facing multimodal distributions. For example, when navigating around an obstacle, the agent can go left or right; mean regression causes the policy to collide directly with the obstacle:

\[a_{\text{mean}} = \frac{a_{\text{left}} + a_{\text{right}}}{2} = a_{\text{collision}}\]

How can we learn policies that express multimodal action distributions?

4.2 Method: DDPM for Action Generation

Core Idea: Model policy learning as a conditional denoising diffusion process (DDPM).

Forward Diffusion (Adding Noise):

\[q(a_t^k | a_t^{k-1}) = \mathcal{N}(a_t^k; \sqrt{1-\beta_k} a_t^{k-1}, \beta_k I)\]
\[q(a_t^K | a_t^0) = \mathcal{N}(a_t^K; \sqrt{\bar{\alpha}_K} a_t^0, (1-\bar{\alpha}_K) I)\]

where \(k\) is the diffusion step (not the timestep); after \(K\) steps, the action becomes pure noise.

Reverse Denoising (Generating Actions):

\[p_\theta(a_t^{k-1} | a_t^k, o_t) = \mathcal{N}(a_t^{k-1}; \mu_\theta(a_t^k, k, o_t), \sigma_k^2 I)\]

Network \(\epsilon_\theta\) predicts the noise:

\[\mu_\theta(a_t^k, k, o_t) = \frac{1}{\sqrt{\alpha_k}}\left(a_t^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}} \epsilon_\theta(a_t^k, k, o_t)\right)\]

Training Objective:

\[\mathcal{L} = \mathbb{E}_{k, a_t^0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_k} a_t^0 + \sqrt{1-\bar{\alpha}_k}\epsilon, k, o_t)\|^2 \right]\]
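One Monte-Carlo sample of this objective is easy to write down. The numpy sketch below uses an illustrative linear beta schedule and a placeholder `eps_theta`; in practice `eps_theta` is the conditional U-Net or Transformer, and the schedule and step count are design choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                   # number of diffusion steps (toy value)
betas = np.linspace(1e-4, 0.02, K)        # illustrative noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # \bar{alpha}_k in the formulas above

def ddpm_loss(eps_theta, a0, obs):
    """One sample of the DDPM objective: corrupt the clean action a0 at a
    random step k, then measure how well eps_theta recovers the noise."""
    k = rng.integers(K)
    eps = rng.standard_normal(a0.shape)
    a_k = np.sqrt(alpha_bars[k]) * a0 + np.sqrt(1 - alpha_bars[k]) * eps
    return np.mean((eps - eps_theta(a_k, k, obs)) ** 2)
```

Averaging this quantity over a minibatch and backpropagating through `eps_theta` gives the training loop; at deployment the learned network is plugged into the reverse-denoising mean \(\mu_\theta\) above.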

4.3 Key Design Choices

Action Chunk Prediction:

Instead of predicting a single-step action, predict a sequence of \(T_a\) future actions:

\[A_t = [a_t, a_{t+1}, \ldots, a_{t+T_a-1}]\]

This provides temporal consistency, avoiding the jitter problem of step-by-step prediction.

Observation History:

Uses the most recent \(T_o\) steps of observations as conditioning:

\[O_t = [o_{t-T_o+1}, \ldots, o_t]\]

Two Architecture Variants:

| Variant | Conditioning Method | Characteristics |
|---|---|---|
| CNN-based | 1D temporal CNN processes action sequences; FiLM injects observations | Fast inference, suitable for real-time control |
| Transformer-based | Cross-attention fuses observations and actions | More flexible, slightly better performance |

4.4 Key Results

Performance across 11 manipulation tasks:

| Method | Average Success Rate | Multimodal Tasks |
|---|---|---|
| Diffusion Policy (CNN) | 86.8% | Excellent |
| Diffusion Policy (Transformer) | 83.5% | Excellent |
| LSTM-GMM | 62.7% | Fair |
| IBC (Implicit BC) | 52.3% | Fair |
| BeT | 50.1% | Fair |

4.5 Why Diffusion Models Suit Robot Policies

  1. Multimodal Expression: Naturally supports multimodal action distributions
  2. High-Dimensional Action Spaces: Diffusion models excel at high-dimensional distribution modeling
  3. Stable Training: More stable than GANs
  4. Flexible Conditioning: Easy to incorporate various conditioning information
  5. Temporal Consistency: Action chunk prediction provides smooth trajectories

4.6 Significance

Diffusion Policy introduced the generative model paradigm into robot policy learning, solving the core challenge of behavioral cloning (multimodal distributions). Since then, diffusion models have become one of the standard choices for robot manipulation policies.


5. Open X-Embodiment -- Cross-Embodiment Open Dataset

Open X-Embodiment: Robotic Learning Datasets and RT-X Models (Open X-Embodiment Collaboration, 2024, 33 institutions)

5.1 Problem

Robot learning faces severe data fragmentation:

  • Each lab collects its own data
  • Different robots, different formats, different tasks
  • Cannot leverage other robots' experience

How can we build the "ImageNet" of robot learning?

5.2 Dataset

Scale:

Attribute Value
Participating Institutions 33
Number of Datasets 60+
Robot Types 22
Total Trajectories 1,000,000+
Data Format RLDS (unified)

Data Format Standardization (RLDS):

Each trajectory is unified as:

```
{
  "steps": [
    {
      "observation": {
        "image": ...,           # RGB image
        "wrist_image": ...,     # Wrist camera (optional)
        "state": ...            # Proprioception
      },
      "action": ...,            # Standardized action
      "language_instruction": ...,
      "reward": ...,
      "is_terminal": ...
    },
    ...
  ]
}
```
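A trajectory in this schema can be consumed with a few lines of Python. Real RLDS datasets are TensorFlow datasets with this nested structure; the plain-dict version below is a simplified stand-in to show the access pattern.

```python
def iter_transitions(trajectory):
    """Yield (observation, action, language_instruction) tuples from one
    RLDS-style trajectory dict shaped like the schema above."""
    for step in trajectory["steps"]:
        yield step["observation"], step["action"], step["language_instruction"]
```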

Robot Type Coverage:

  • Single-arm tabletop manipulation (Franka, UR5, xArm, ...)
  • Bimanual manipulation (ALOHA, Baxter, ...)
  • Mobile manipulation (Everyday Robots, Stretch, ...)
  • Quadruped robots (A1, Spot, ...)
  • Dexterous hands (Allegro, LEAP, ...)

5.3 RT-X Models

Cross-embodiment models trained on Open X-Embodiment data:

RT-1-X: RT-1 architecture trained on mixed data

RT-2-X: RT-2 architecture trained on mixed data

5.4 Key Findings

Positive Transfer:

| Evaluation Target | RT-1-X vs RT-1 (Single-Dataset Training) | Improvement |
|---|---|---|
| Average Performance on Target Robot | Improved in ~50% of evaluated scenarios | Significant |
| Cross-Robot Generalization | Clearly better than single-dataset training | Significant |

Key experiment: a policy for robot A, after cross-embodiment training, outperformed the same policy trained on robot A's data alone. This showed that experience from different robots can benefit one another.

Data Scale Effect:

\[\text{Performance} \propto \log(\text{dataset size})\]

Even adding data from robots with completely different morphologies from the target still improved overall performance.

5.5 Significance

Open X-Embodiment represents the "ImageNet moment" for embodied intelligence:

  • Established standards and culture for cross-embodiment data sharing
  • Demonstrated the feasibility of cross-embodiment transfer learning
  • Provided the data foundation for subsequent general robot foundation models
  • Promoted the development of open-source data ecosystems

6. pi0 -- General-Purpose Robot Foundation Model

pi0: A Vision-Language-Action Flow Model for General Robot Control (Black et al., 2024, Physical Intelligence)

6.1 Problem

How can we build a truly general-purpose robot foundation model: one that works across multiple robots and tasks, and can be quickly adapted to new tasks with minimal data?

6.2 Architecture

pi0 adopts a dual-component architecture:

VLM Backbone:

Based on a pretrained vision-language model (a PaliGemma 3B variant), processing:

  • Multi-view image inputs
  • Natural language instructions
  • Proprioceptive state

Flow Matching Action Head:

Unlike RT-2's discrete token output, pi0 uses Flow Matching to generate continuous actions.

6.3 Flow Matching

Flow Matching is an alternative to diffusion models that learns a velocity field to transform a noise distribution into a data distribution:

Basic Idea:

Define a linear path from noise \(x_0 \sim \mathcal{N}(0, I)\) to data \(x_1 \sim p_{\text{data}}\):

\[x_t = (1-t) x_0 + t x_1, \quad t \in [0, 1]\]

The corresponding velocity field is:

\[u_t(x_t) = x_1 - x_0\]

Training Objective:

\[\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1} \left[ \|v_\theta(x_t, t, c) - (x_1 - x_0)\|^2 \right]\]

where \(c\) is the conditioning information (images, language, proprioception) and \(v_\theta\) is the neural network-parameterized velocity field.

Inference (Generating Actions):

\[x_1 = x_0 + \int_0^1 v_\theta(x_t, t, c) \, dt\]

Solved through numerical integration (e.g., Euler method):

\[x_{t+\Delta t} = x_t + v_\theta(x_t, t, c) \cdot \Delta t\]
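The training target and Euler sampler above fit in a few lines of numpy. This is a minimal sketch of the flow-matching mechanics, not pi0's implementation: `v_theta` stands in for the conditioned network, and the step count is arbitrary.

```python
import numpy as np

def fm_target(x0, x1):
    """Velocity target on the linear path x_t = (1-t) x0 + t x1:
    the constant displacement x1 - x0, per the formulas above."""
    return x1 - x0

def euler_sample(v_theta, x0, cond, n_steps=10):
    """Integrate dx/dt = v_theta(x, t, c) from t=0 to t=1 with Euler steps,
    turning a noise sample x0 into an action chunk."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + v_theta(x, i * dt, cond) * dt
    return x
```

Because the learned path is deterministic and nearly straight, a small number of Euler steps suffices, which is the source of the inference-speed advantage discussed next.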

6.4 Advantages Over Diffusion Models

| Dimension | DDPM Diffusion | Flow Matching |
|---|---|---|
| Path | Stochastic (SDE) | Deterministic (ODE) |
| Training Target | Predicts noise \(\epsilon\) | Predicts velocity \(v\) |
| Sampling Steps | Typically 50-1000 | Typically 10-50 |
| Inference Speed | Slower | Faster (suitable for real-time control) |
| Training Stability | Good | Better |

6.5 Training Pipeline

Stage 1: Pretraining

Pretrained on large-scale cross-embodiment data (similar to Open X-Embodiment + proprietary data)

Stage 2: Task Fine-tuning

Fine-tuned on task-specific data with a smaller learning rate

Stage 3: Online Fine-tuning

Rapid adaptation through a small amount of data collected in the deployment environment

6.6 Key Results

pi0 demonstrated strong generalization across multiple tasks and robots:

| Task | Robot Platform | Success Rate |
|---|---|---|
| Folding Clothes | Bimanual + Dexterous Hand | High |
| Table Cleanup | Single-Arm Manipulation | High |
| Box Packing | Bimanual | High |
| Zero-Shot Novel Object Manipulation | Various | Medium-High |

Comparison with Baselines (on manipulation tasks):

| Method | Average Success Rate |
|---|---|
| pi0 | Highest |
| Diffusion Policy | Second |
| RT-2-X | Medium |
| ACT | Lower |

6.7 Significance

pi0 represents the latest paradigm for robot foundation models:

  1. VLM as the "Brain": Leveraging internet-pretrained knowledge for understanding and reasoning
  2. Flow Matching as the "Motor System": Efficiently generating smooth continuous actions
  3. Pretrain-Fine-tune Paradigm: Large-scale pretraining + task-specific fine-tuning
  4. Generality: A single model adapting to multiple robots and tasks

7. Technical Evolution Across Papers

```mermaid
flowchart TB
    A[SayCan 2022<br/>LLM + Fixed Skills] --> B[RT-1 2022<br/>Large-Scale Learned Policies]
    B --> C[RT-2 2023<br/>VLM→VLA Transfer]
    D[Diffusion Policy 2023<br/>Generative Policies] --> F[pi0 2024<br/>VLM + Flow Matching]
    C --> E[Open X-Embodiment 2024<br/>Cross-Embodiment Data]
    E --> F
    C --> F
```

Main Line of Technical Evolution:

| Stage | Representative | Paradigm |
|---|---|---|
| LLM-Assisted | SayCan | LLM planning + predefined skills |
| Large-Scale Learning | RT-1 | Transformer + large data |
| Knowledge Transfer | RT-2 | VLM \(\rightarrow\) VLA |
| Generative Policies | Diffusion Policy | Diffusion models generate actions |
| Open Ecosystem | Open X-Embodiment | Cross-embodiment data sharing |
| Foundation Model | pi0 | VLM + Flow Matching + pretrain-fine-tune |

8. Summary and Outlook

8.1 Technical Trends

  1. Scaling Up: Data scale, model scale, and task scale continue to expand
  2. Unification: Perception, reasoning, and control are progressively unified into single models
  3. Transfer: Internet knowledge \(\rightarrow\) robots, robot A \(\rightarrow\) robot B
  4. Generation: From discriminative policies to generative policies

8.2 Unresolved Questions

  • Safety: How can the behavior of end-to-end models be guaranteed safe?
  • Interpretability: How can we understand the decision-making process of VLA models?
  • Data Efficiency: Can the same performance be achieved with less data?
  • Long Horizon: How to handle complex tasks requiring hundreds of steps?
  • Embodied Reasoning: Physical reasoning capabilities beyond pattern matching

References

  • Ahn, M. et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances"
  • Brohan, A. et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale"
  • Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control"
  • Chi, C. et al. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
  • Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models"
  • Black, K. et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control"
