Key Paper Deep Dives
Overview
This article provides in-depth analysis of 6 milestone papers in embodied intelligence. Each paper is examined across five dimensions: problem definition, method design, key formulas, experimental results, and historical significance, helping readers systematically understand the technical evolution from LLM-driven robots to general-purpose foundation models.
1. SayCan -- When Language Models Meet Robot Affordances
*Do As I Can, Not As I Say: Grounding Language in Robotic Affordances*. Ahn et al., 2022 (Google Research)
1.1 Problem
Large language models possess rich world knowledge and reasoning capabilities, but they do not understand what a specific robot can do in a specific scenario. How can we combine the semantic knowledge of LLMs with the physical capabilities of robots?
1.2 Core Idea
The LLM evaluates "what should be done," while the robot policy evaluates "what can be done"; multiplying the two yields the final decision.
1.3 Method
Affordance Scoring: each candidate skill is scored by combining the two probabilities:

$$p(a_i \mid s_t, l) \;\propto\; p(\text{useful} \mid a_i, l) \cdot p(\text{possible} \mid a_i, s_t)$$

where:
- \(p(\text{useful} | a_i, l)\): LLM language score. Given user instruction \(l\), the LLM evaluates the plausibility of candidate skill \(a_i\) as the next step. Concretely implemented as the LLM's token probability for "\(l\). The robot should: 1. \(a_i\)."
- \(p(\text{possible} | a_i, s_t)\): Affordance score. Provided by a pretrained value function \(V^{a_i}(s_t)\), reflecting the success probability of executing skill \(a_i\) in current state \(s_t\).
Greedy Decoding:
At each planning step, the skill with the highest combined score is selected:

$$a_t^* = \arg\max_{a_i \in \mathcal{A}} \; p(\text{useful} \mid a_i, l)\, p(\text{possible} \mid a_i, s_t)$$
After executing \(a_t^*\), the result is appended to the LLM's context, and planning continues for the next step until the LLM outputs "done."
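The decision rule is compact enough to sketch directly. Below is a minimal Python illustration of one SayCan planning step, assuming a hypothetical `llm_logprob` scoring API and per-skill value functions; this is a sketch of the scoring logic, not the paper's implementation.

```python
import numpy as np

def saycan_step(llm_logprob, value_fns, skills, instruction, context, state):
    """One SayCan planning step: combine LLM usefulness with affordance.

    llm_logprob(prompt, completion) -> log p(completion | prompt)  (hypothetical API)
    value_fns[skill](state)         -> estimated success probability of the skill
    """
    prompt = f"{instruction}. {context} The robot should: "
    scores = []
    for skill in skills:
        p_useful = np.exp(llm_logprob(prompt, skill))   # LLM language score
        p_possible = value_fns[skill](state)            # affordance (value function)
        scores.append(p_useful * p_possible)            # combined SayCan score
    return skills[int(np.argmax(scores))]               # greedy decoding
```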
1.4 Skill Library
- 551 skills: pick, place, go to, open, close, etc.
- Each skill has an independent BC (behavioral cloning) policy and value function
- Trained on real mobile manipulation robots (Everyday Robots)
1.5 Key Results
| Metric | SayCan | LLM Only | Affordance Only |
|---|---|---|---|
| Planning Success Rate | 84% | 14% | - |
| Execution Success Rate | 74% | 0% | - |
| Long-Horizon Tasks | Handled | Severe hallucination | No planning capability |
1.6 Significance and Limitations
Significance:
- First systematic combination of LLMs with robot control
- Proposed an elegant "affordance filtering" framework
- Pioneered the LLM for Robotics research direction
Limitations:
- The skill library is fixed and predefined
- Requires training a separate policy and value function for each skill
- Cannot handle tasks outside the skill library
2. RT-1 -- Large-Scale Robotics Transformer
*RT-1: Robotics Transformer for Real-World Control at Scale*. Brohan et al., 2022 (Google/Everyday Robots)
2.1 Problem
Previous robot learning methods typically trained on small-scale data and struggled to generalize to new scenarios and instructions. Can we improve robot policy generalization by scaling data and model size, as was done with Transformers in NLP?
2.2 Method
Architecture:
Input:
- 6 images (the current frame + 5 history frames), encoded by EfficientNet-B3
- A natural language instruction, encoded by the Universal Sentence Encoder
FiLM Conditioning: Language embeddings modulate visual features through FiLM (Feature-wise Linear Modulation) layers:

$$\hat{F} = \gamma(l) \odot F + \beta(l)$$
where \(\gamma(l)\) and \(\beta(l)\) are scaling and bias parameters mapped from language embedding \(l\).
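The FiLM layer itself is only a few lines. Below is a minimal PyTorch-style sketch of the modulation described above; layer names and sizes are illustrative, not RT-1's exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: language embedding -> (gamma, beta)."""
    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.gamma = nn.Linear(lang_dim, num_channels)  # scaling gamma(l)
        self.beta = nn.Linear(lang_dim, num_channels)   # bias beta(l)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; lang: (B, lang_dim) embedding
        g = self.gamma(lang)[:, :, None, None]
        b = self.beta(lang)[:, :, None, None]
        return g * feat + b
```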
Tokenized Action Space:
Continuous actions are discretized into 256 uniform bins per dimension:

$$\text{token}(a_j) = \operatorname{clip}\!\left( \left\lfloor \frac{a_j - a_j^{\min}}{a_j^{\max} - a_j^{\min}} \cdot 256 \right\rfloor,\ 0,\ 255 \right)$$

The Transformer then autoregressively predicts one token per action dimension.
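A uniform-binning tokenizer matching the formula above can be sketched as follows (RT-1's exact bin boundaries are an assumption here):

```python
import numpy as np

def discretize(action, low, high, bins=256):
    """Map each continuous action dimension to an integer token in [0, bins-1].
    Uniform binning is assumed; RT-1's exact bin boundaries may differ."""
    norm = (np.asarray(action) - low) / (high - low)       # -> [0, 1]
    return np.clip((norm * bins).astype(int), 0, bins - 1)

def undiscretize(tokens, low, high, bins=256):
    """Recover the bin-center continuous value from integer tokens."""
    return low + (tokens + 0.5) / bins * (high - low)
```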
TokenLearner: Uses the TokenLearner module to compress visual tokens from 81 to 8, significantly reducing computation.
2.3 Training Data
| Attribute | Value |
|---|---|
| Demonstration Trajectories | 130,000+ |
| Collection Robots | 13 |
| Collection Duration | 17 months |
| Task Types | 700+ |
| Object Types | Hundreds |
2.4 Key Results
| Evaluation Dimension | RT-1 | Gato | BC-Z |
|---|---|---|---|
| Seen Task Success Rate | 97% | 63% | 72% |
| Unseen Tasks (New Instructions) | 76% | 34% | 48% |
| Unseen Tasks (New Objects) | 53% | 24% | 29% |
| Long-Horizon Tasks | High | Low | Medium |
2.5 Key Findings
- Data scale is critical: Performance grows approximately logarithmically with data volume
- Multi-task training aids generalization: Joint training on 700+ tasks outperforms single-task training
- Real data > simulation data: At this scale, real data is more valuable than simulation data
2.6 Significance
RT-1 was the "GPT moment" for robot learning -- the first time a single Transformer policy was trained on large-scale real data, demonstrating strong multi-task generalization capabilities. It proved that Scaling Laws apply in the robotics domain as well.
3. RT-2 -- From Vision-Language Model to Vision-Language-Action Model
*RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control*. Brohan et al., 2023 (Google DeepMind)
3.1 Problem
The internet contains vast amounts of vision-language data rich in world knowledge. Can the knowledge within vision-language models (VLMs) be directly transferred to robot control?
3.2 Core Innovation: Actions as Text Tokens
Key Insight: Robot actions can be represented as text token sequences and processed uniformly with language tokens.
Action representation:
An action is written as a short string of discretized values (a terminate flag, 6-DoF end-effector motion, and gripper state). Each action dimension is discretized into 256 bins, mapped to special tokens: rt_000 through rt_255.
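Under this scheme, emitting an action reduces to string formatting, and decoding to parsing. A small sketch, assuming the rt_XXX token naming given above:

```python
def action_to_token_string(action_bins):
    """Render an already-discretized action (ints in [0, 255]) as the
    rt_000..rt_255 token sequence described above."""
    return " ".join(f"rt_{b:03d}" for b in action_bins)

def token_string_to_action(text):
    """Invert the mapping: parse rt_XXX tokens back to integer bins."""
    return [int(tok.removeprefix("rt_")) for tok in text.split()]

# e.g. action_to_token_string([1, 128, 91]) -> "rt_001 rt_128 rt_091"
```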
3.3 Training Pipeline
- Pretraining Phase: PaLI-X (55B) or PaLM-E (12B) pretrained on internet-scale vision-language data
- Co-fine-tuning: Simultaneously fine-tuned on web data and robot data
- Web data: Visual question answering, image captioning, etc.
- Robot data: RT-1's data (with action tokens added)
3.4 Emergent Capabilities
RT-2 exhibited reasoning abilities absent from the training data:
| Emergent Capability | Example |
|---|---|
| Symbolic Reasoning | "Throw the trash in the correct bin" (requires judging recyclable/non-recyclable) |
| Mathematical Reasoning | "Move to next to the triangle" (requires shape recognition) |
| Language Generalization | Understanding instructions never seen in robot data |
| Visual Concept Transfer | Manipulating objects never seen in robot training |
3.5 Key Results
| Evaluation Dimension | RT-2 (PaLI-X) | RT-1 | VC-1 |
|---|---|---|---|
| Seen Tasks | 95% | 97% | 73% |
| Unseen Objects | 62% | 32% | 22% |
| Unseen Backgrounds | 72% | 36% | 29% |
| Semantic Reasoning Tasks | 62% | 0% | 0% |
3.6 Significance
RT-2 established the VLA (Vision-Language-Action) paradigm:
It demonstrated that internet-pretrained vision-language knowledge can be effectively transferred to physical robot control. This means robots can leverage the entire internet's knowledge base.
4. Diffusion Policy -- Diffusion Model-Driven Robot Policies
*Diffusion Policy: Visuomotor Policy Learning via Action Diffusion*. Chi et al., 2023 (Columbia University/Toyota Research Institute)
4.1 Problem
Traditional behavioral cloning methods perform poorly on multimodal action distributions. For example, when navigating around an obstacle, the agent can go either left or right; regressing to the mean of the two modes drives the policy straight into the obstacle.
How can we learn policies that express multimodal action distributions?
4.2 Method: DDPM for Action Generation
Core Idea: Model policy learning as a conditional denoising diffusion process (DDPM).
Forward Diffusion (Adding Noise):

$$q(a^k \mid a^{k-1}) = \mathcal{N}\!\left( a^k;\ \sqrt{1 - \beta_k}\, a^{k-1},\ \beta_k I \right)$$
where \(k\) is the diffusion step (not the timestep); after \(K\) steps, the action becomes pure noise.
Reverse Denoising (Generating Actions):
Starting from Gaussian noise \(a^K\), the network \(\epsilon_\theta\) predicts the noise and iteratively denoises:

$$a^{k-1} = \alpha_k \left( a^k - \gamma_k\, \epsilon_\theta(O_t, a^k, k) \right) + \sigma_k z, \quad z \sim \mathcal{N}(0, I)$$

where \(\alpha_k, \gamma_k, \sigma_k\) are determined by the noise schedule.
Training Objective:

$$\mathcal{L} = \operatorname{MSE}\!\left( \epsilon^k,\ \epsilon_\theta\!\left(O_t,\ a^0 + \epsilon^k,\ k\right) \right)$$
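A minimal training-step sketch of this objective, using the standard closed-form forward process with an assumed linear noise schedule and a hypothetical `eps_model(obs, noisy_actions, k)` signature:

```python
import torch

def diffusion_policy_loss(eps_model, obs_feat, actions, K=100):
    """One DDPM training step in epsilon-prediction form.

    eps_model(obs_feat, noisy_actions, k) -> predicted noise (hypothetical signature)
    actions: (B, T_a, action_dim) clean action chunks
    """
    B = actions.shape[0]
    k = torch.randint(0, K, (B,), device=actions.device)  # random diffusion step
    alpha_bar = 1.0 - (k.float() + 1) / K                 # assumed linear schedule
    a = alpha_bar.sqrt()[:, None, None]
    s = (1 - alpha_bar).sqrt()[:, None, None]
    eps = torch.randn_like(actions)                       # target noise
    noisy = a * actions + s * eps                         # closed-form forward diffusion
    pred = eps_model(obs_feat, noisy, k)                  # predict the noise
    return torch.nn.functional.mse_loss(pred, eps)
```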
4.3 Key Design Choices
Action Chunk Prediction:
Instead of predicting a single-step action, the policy predicts a sequence of \(T_a\) future actions:

$$A_t = \left( a_t, a_{t+1}, \ldots, a_{t+T_a-1} \right)$$
This provides temporal consistency, avoiding the jitter problem of step-by-step prediction.
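In deployment, chunk prediction is typically paired with receding-horizon execution: predict \(T_a\) actions, execute only a prefix, then replan. A sketch with a hypothetical `policy`/`env` interface:

```python
from collections import deque

def receding_horizon_control(policy, env, T_o=2, T_a=16, n_exec=8):
    """Predict a T_a-step action chunk, execute the first n_exec actions,
    then replan. Interface names are illustrative."""
    obs_hist = deque([env.reset()] * T_o, maxlen=T_o)  # T_o-step observation window
    while not env.done():
        chunk = policy.sample(list(obs_hist))          # (T_a, action_dim) chunk
        for action in chunk[:n_exec]:                  # execute a prefix, then replan
            obs_hist.append(env.step(action))
```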
Observation History:
The most recent \(T_o\) steps of observations are used as conditioning:

$$O_t = \left( o_{t-T_o+1}, \ldots, o_t \right)$$
Two Architecture Variants:
| Variant | Conditioning Method | Characteristics |
|---|---|---|
| CNN-based | 1D temporal CNN processes action sequences, FiLM injects observations | Fast inference, suitable for real-time control |
| Transformer-based | Cross-attention fuses observations and actions | More flexible, slightly better performance |
4.4 Key Results
Performance across 11 manipulation tasks:
| Method | Average Success Rate | Multimodal Tasks |
|---|---|---|
| Diffusion Policy (CNN) | 86.8% | Excellent |
| Diffusion Policy (Transformer) | 83.5% | Excellent |
| LSTM-GMM | 62.7% | Fair |
| IBC (Implicit BC) | 52.3% | Fair |
| BeT | 50.1% | Fair |
4.5 Why Diffusion Models Suit Robot Policies
- Multimodal Expression: Naturally supports multimodal action distributions
- High-Dimensional Action Spaces: Diffusion models excel at high-dimensional distribution modeling
- Stable Training: More stable than GANs
- Flexible Conditioning: Easy to incorporate various conditioning information
- Temporal Consistency: Action chunk prediction provides smooth trajectories
4.6 Significance
Diffusion Policy introduced the generative model paradigm into robot policy learning, solving the core challenge of behavioral cloning (multimodal distributions). Since then, diffusion models have become one of the standard choices for robot manipulation policies.
5. Open X-Embodiment -- Cross-Embodiment Open Dataset
*Open X-Embodiment: Robotic Learning Datasets and RT-X Models*. Open X-Embodiment Collaboration, 2024 (33 institutions)
5.1 Problem
Robot learning faces severe data fragmentation:
- Each lab collects its own data
- Different robots, different formats, different tasks
- Cannot leverage other robots' experience
How can we build the "ImageNet" of robot learning?
5.2 Dataset
Scale:
| Attribute | Value |
|---|---|
| Participating Institutions | 33 |
| Number of Datasets | 60+ |
| Robot Types | 22 |
| Total Trajectories | 1,000,000+ |
| Data Format | RLDS (unified) |
Data Format Standardization (RLDS):
Each trajectory is unified as:
```python
{
    "steps": [
        {
            "observation": {
                "image": ...,          # RGB image
                "wrist_image": ...,    # wrist camera (optional)
                "state": ...,          # proprioception
            },
            "action": ...,             # standardized action
            "language_instruction": ...,
            "reward": ...,
            "is_terminal": ...,
        },
        ...
    ]
}
```
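For illustration, one such dataset can be read with `tensorflow_datasets`. The sketch below assumes an example GCS path from the public release; field names follow the schema above and vary slightly across datasets.

```python
import tensorflow_datasets as tfds

# Illustrative dataset location; take actual names/paths from the
# Open X-Embodiment release.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0"
)
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:
        obs = step["observation"]
        image = obs["image"]                       # RGB frame
        action = step["action"]                    # standardized action
        instruction = step["language_instruction"] # field name varies per dataset
```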
Robot Type Coverage:
- Single-arm tabletop manipulation (Franka, UR5, xArm, ...)
- Bimanual manipulation (ALOHA, Baxter, ...)
- Mobile manipulation (Everyday Robots, Stretch, ...)
- Quadruped robots (A1, Spot, ...)
- Dexterous hands (Allegro, LEAP, ...)
5.3 RT-X Models
Cross-embodiment models trained on Open X-Embodiment data:
RT-1-X: RT-1 architecture trained on mixed data
RT-2-X: RT-2 architecture trained on mixed data
5.4 Key Findings
Positive Transfer:
| Evaluation Target | RT-1-X vs. RT-1 (Single-Dataset Training) |
|---|---|
| Average success rate on partner robots | ~50% higher than each lab's original model |
| Cross-robot generalization | Clearly better than single-dataset training |
Key experiment: a policy evaluated on robot A performed better after cross-embodiment training than after training on robot A's data alone. This demonstrated that different robots' experiences can benefit one another.
Data Scale Effect:
Even adding data from robots whose morphology differs completely from the target robot's still improved overall performance.
5.5 Significance
Open X-Embodiment represents the "ImageNet moment" for embodied intelligence:
- Established standards and culture for cross-embodiment data sharing
- Demonstrated the feasibility of cross-embodiment transfer learning
- Provided the data foundation for subsequent general robot foundation models
- Promoted the development of open-source data ecosystems
6. pi0 -- General-Purpose Robot Foundation Model
*pi0: A Vision-Language-Action Flow Model for General Robot Control*. Black et al., 2024 (Physical Intelligence)
6.1 Problem
How to build a truly general-purpose robot foundation model -- one that works across multiple robots and tasks, and can be quickly adapted to new tasks with minimal data?
6.2 Architecture
pi0 adopts a dual-component architecture:
VLM Backbone:
Based on a pretrained vision-language model (a PaliGemma 3B variant), processing:
- Multi-view image inputs
- Natural language instructions
- Proprioceptive state
Flow Matching Action Head:
Unlike RT-2's discrete token output, pi0 uses Flow Matching to generate continuous actions.
6.3 Flow Matching
Flow Matching is an alternative to diffusion models that learns a velocity field to transform a noise distribution into a data distribution:
Basic Idea:
Define a linear path from noise \(x_0 \sim \mathcal{N}(0, I)\) to data \(x_1 \sim p_{\text{data}}\):

$$x_t = (1 - t)\, x_0 + t\, x_1, \quad t \in [0, 1]$$

The corresponding velocity field is:

$$\frac{dx_t}{dt} = x_1 - x_0$$

Training Objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1} \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2$$
where \(c\) is the conditioning information (images, language, proprioception) and \(v_\theta\) is the neural network-parameterized velocity field.
Inference (Generating Actions):
Starting from noise \(x_0\), the learned ODE is solved by numerical integration (e.g., the Euler method):

$$x_{t+\Delta t} = x_t + \Delta t \cdot v_\theta(x_t, t, c), \quad t: 0 \to 1$$
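Both the training objective and the Euler sampler are short. A minimal PyTorch sketch, with a hypothetical `v_model(x_t, t, cond)` velocity network (not pi0's actual implementation):

```python
import torch

def flow_matching_loss(v_model, cond, actions):
    """Conditional flow matching loss on the linear noise -> data path.
    v_model(x_t, t, cond) -> velocity (hypothetical signature)."""
    x1 = actions                                   # data sample
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # t ~ U[0, 1]
    tt = t[:, None, None]
    x_t = (1 - tt) * x0 + tt * x1                  # point on the linear path
    target_v = x1 - x0                             # ground-truth velocity
    return torch.nn.functional.mse_loss(v_model(x_t, t, cond), target_v)

@torch.no_grad()
def sample_actions(v_model, cond, shape, steps=10):
    """Euler integration of the learned ODE from noise (t=0) to actions (t=1)."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * v_model(x, t, cond)           # x_{t+dt} = x_t + dt * v_theta
    return x
```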
6.4 Advantages Over Diffusion Models
| Dimension | DDPM Diffusion | Flow Matching |
|---|---|---|
| Path | Stochastic (SDE) | Deterministic (ODE) |
| Training | Predicts noise \(\epsilon\) | Predicts velocity \(v\) |
| Sampling Steps | Typically 50-1000 | Typically 10-50 |
| Inference Speed | Slower | Faster (suitable for real-time control) |
| Training Stability | Good | Better |
6.5 Training Pipeline
Stage 1: Pretraining
Pretrained on large-scale cross-embodiment data (similar to Open X-Embodiment + proprietary data)
Stage 2: Task Fine-tuning
Fine-tuned on task-specific data with a smaller learning rate
Stage 3: Online Fine-tuning
Rapid adaptation through a small amount of data collected in the deployment environment
6.6 Key Results
pi0 demonstrated strong generalization across multiple tasks and robots:
| Task | Robot Platform | Success Rate |
|---|---|---|
| Folding Clothes | Bimanual + Dexterous Hand | High |
| Table Cleanup | Single-Arm Manipulation | High |
| Box Packing | Bimanual | High |
| Zero-Shot Novel Object Manipulation | Various | Medium-High |
Comparison with Baselines (on manipulation tasks):
| Method | Average Success Rate |
|---|---|
| pi0 | Highest |
| Diffusion Policy | Second |
| RT-2-X | Medium |
| ACT | Lower |
6.7 Significance
pi0 represents the latest paradigm for robot foundation models:
- VLM as the "Brain": Leveraging internet-pretrained knowledge for understanding and reasoning
- Flow Matching as the "Motor System": Efficiently generating smooth continuous actions
- Pretrain-Fine-tune Paradigm: Large-scale pretraining + task-specific fine-tuning
- Generality: A single model adapting to multiple robots and tasks
7. Technical Evolution Across Papers
```mermaid
flowchart TB
    A[SayCan 2022<br/>LLM + Fixed Skills] --> B[RT-1 2022<br/>Large-Scale Learned Policies]
    B --> C[RT-2 2023<br/>VLM→VLA Transfer]
    D[Diffusion Policy 2023<br/>Generative Policies] --> F[pi0 2024<br/>VLM + Flow Matching]
    C --> E[Open X-Embodiment 2024<br/>Cross-Embodiment Data]
    E --> F
    C --> F
```
Main Line of Technical Evolution:
| Stage | Representative | Paradigm |
|---|---|---|
| LLM-Assisted | SayCan | LLM planning + predefined skills |
| Large-Scale Learning | RT-1 | Transformer + large data |
| Knowledge Transfer | RT-2 | VLM \(\rightarrow\) VLA |
| Generative Policies | Diffusion Policy | Diffusion models generate actions |
| Open Ecosystem | Open X-Embodiment | Cross-embodiment data sharing |
| Foundation Model | pi0 | VLM + Flow Matching + pretrain-fine-tune |
8. Summary and Outlook
8.1 Common Trends
- Scaling Up: Data scale, model scale, and task scale continue to expand
- Unification: Perception, reasoning, and control are progressively unified into single models
- Transfer: Internet knowledge \(\rightarrow\) robots, robot A \(\rightarrow\) robot B
- Generation: From discriminative policies to generative policies
8.2 Unresolved Questions
- Safety: How can the behavior of end-to-end models be guaranteed safe?
- Interpretability: How can we understand the decision-making process of VLA models?
- Data Efficiency: Can the same performance be achieved with less data?
- Long Horizon: How to handle complex tasks requiring hundreds of steps?
- Embodied Reasoning: Physical reasoning capabilities beyond pattern matching
References
- Ahn, M. et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances"
- Brohan, A. et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale"
- Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control"
- Chi, C. et al. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
- Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models"
- Black, K. et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control"