VLA Models (Vision-Language-Action Models)
VLA (Vision-Language-Action) models are one of the most important model paradigms in embodied intelligence today: they receive visual observations and language instructions, and directly output robot actions. This article systematically reviews VLA architecture design, action representation methods, and the development trajectory from RT-1 to pi0.5.
Related notes: Model Roadmap | ACT Model | Imitation Learning | Diffusion Policy | Foundation Models for Robotics | Open-Source Model Summary
If you want the broader model evolution before zooming into the VLA mainline, start with Model Roadmap.
1. VLA Model Definition
1.1 What Is a VLA
The core definition of a VLA model is a policy that maps visual observations and a language instruction directly to robot actions:

\[
\pi_\theta : (\mathbf{o}_{\text{visual}},\ \mathbf{l}_{\text{language}}) \mapsto \mathbf{a}_{\text{action}}
\]

where:
- \(\mathbf{o}_{\text{visual}}\): Visual observation (RGB images, depth maps, point clouds, etc.)
- \(\mathbf{l}_{\text{language}}\): Natural language task instruction (e.g., "pick up the red cup")
- \(\mathbf{a}_{\text{action}}\): Robot action (end-effector pose, joint angles, etc.)
What distinguishes VLA from other paradigms is that it does not merely use vision and language for task understanding, but directly outputs executable low-level actions, achieving end-to-end perception-to-action mapping.
1.2 Why VLA Is Needed
Traditional robot learning methods (such as behavioral cloning) typically only accept specific observation formats and lack language understanding capabilities. Pure LLMs/VLMs, on the other hand, cannot directly output robot actions. VLA unifies both:
- Inherited from VLM: Visual understanding, language reasoning, commonsense knowledge
- New capabilities: Action output, physical interaction, real-time control
2. General Architecture
2.1 Three Core Components
All VLA models follow a basic three-component architecture:
```mermaid
graph LR
    subgraph Input
        IMG[RGB Image] --> VE
        LANG[Language Instruction] --> LT
        PROP[Proprioception] --> PE
    end
    subgraph Encoding
        VE[Visual Encoder<br/>ViT / SigLIP / DINOv2]
        LT[Language Tokenizer<br/>SentencePiece / BPE]
        PE[Proprioceptive Encoder<br/>MLP]
    end
    subgraph Backbone
        VE --> TF[Transformer Backbone<br/>Llama / PaLM / Custom]
        LT --> TF
        PE --> TF
    end
    subgraph ActionOutput["Action Output"]
        TF --> AH[Action Head]
        AH --> ACT[Robot Action<br/>Δx,Δy,Δz,Δrx,Δry,Δrz,gripper]
    end
    style Encoding fill:#e3f2fd
    style Backbone fill:#fff3e0
    style ActionOutput fill:#e8f5e9
```
Visual encoder choices:
| Encoder | Pretraining Method | Parameters | Used by |
|---|---|---|---|
| ViT-B/16 | ImageNet-21K | 86M | RT-1 |
| ViT-G | JFT-4B | 1.8B | RT-2 (PaLI-X) |
| SigLIP | WebLI contrastive learning | 400M | OpenVLA, pi0 |
| DINOv2 | Self-supervised | 300M | HPT |
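To make the three components concrete, here is a minimal PyTorch sketch of the pipeline in the diagram above. Everything in it (the dimensions, the linear stand-ins for real encoders, the 32,000-entry vocabulary) is an illustrative assumption, not any particular model's implementation:

```python
import torch
import torch.nn as nn

class MinimalVLA(nn.Module):
    """Toy three-component VLA: encoders -> Transformer backbone -> action head."""

    def __init__(self, d_model: int = 512, action_dim: int = 7):
        super().__init__()
        # Stand-ins for a real visual encoder (ViT/SigLIP), tokenizer + embedding,
        # and proprioceptive MLP; each maps its modality into the shared d_model space.
        self.visual_proj = nn.Linear(768, d_model)
        self.lang_embed = nn.Embedding(32000, d_model)
        self.proprio_mlp = nn.Sequential(nn.Linear(14, d_model), nn.GELU())
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)  # continuous head for simplicity

    def forward(self, img_feats, lang_ids, proprio):
        # img_feats: (B, N_img, 768), lang_ids: (B, N_lang), proprio: (B, 14)
        tokens = torch.cat([
            self.visual_proj(img_feats),
            self.lang_embed(lang_ids),
            self.proprio_mlp(proprio).unsqueeze(1),
        ], dim=1)                          # one fused multimodal token sequence
        h = self.backbone(tokens)
        return self.action_head(h[:, -1])  # (B, action_dim): Δpose + gripper

model = MinimalVLA()
a = model(torch.randn(1, 196, 768), torch.randint(0, 32000, (1, 12)), torch.randn(1, 14))
print(a.shape)  # torch.Size([1, 7])
```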
2.2 Action Representation Methods
How VLA models output actions is a core design choice. There are currently three main approaches:
(a) Discrete Tokenization
Uniformly discretize each dimension of the continuous action space into tokens:

\[
\text{token}(a_i) = \operatorname{clip}\!\left( \left\lfloor 256 \cdot \frac{a_i - a_i^{\min}}{a_i^{\max} - a_i^{\min}} \right\rfloor,\ 0,\ 255 \right)
\]

Representatives: RT-2, OpenVLA
Advantages: Can directly reuse the language model's token prediction mechanism
Disadvantages: Discretization loses precision; difficult to express multimodal action distributions
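A minimal numpy sketch of this scheme, assuming illustrative fixed bounds of \([-1, 1]\) per dimension (real pipelines in the RT-2/OpenVLA style compute per-dimension bounds from training-data statistics):

```python
import numpy as np

# Uniform 256-bin discretization of a continuous action, and its inverse.
A_MIN, A_MAX, N_BINS = -1.0, 1.0, 256

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    norm = (action - A_MIN) / (A_MAX - A_MIN)            # map to [0, 1]
    return np.clip((norm * N_BINS).astype(int), 0, N_BINS - 1)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    centers = (tokens + 0.5) / N_BINS                    # bin centers in [0, 1]
    return centers * (A_MAX - A_MIN) + A_MIN

a = np.array([0.03, -0.12, 0.40, 0.0, 0.0, 0.1, 1.0])   # Δx..Δrz, gripper
print(action_to_tokens(a))                    # 7 integer tokens, one per dimension
print(tokens_to_action(action_to_tokens(a)))  # reconstruction shows quantization error
```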
(b) Continuous Regression
The action head directly outputs continuous values:

\[
\mathbf{a} = f_\theta(\mathbf{o}, \mathbf{l})
\]

Training loss is typically MSE:

\[
\mathcal{L}_{\text{MSE}} = \left\| f_\theta(\mathbf{o}, \mathbf{l}) - \mathbf{a}^{*} \right\|_2^2
\]

Representatives: Octo (regression head option), GR-1
Advantages: Simple, direct, high precision
Disadvantages: MSE loss assumes a unimodal Gaussian distribution; cannot model multimodal actions
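For completeness, a one-function sketch of a regression training step (PyTorch; `model` could be the MinimalVLA sketch from Section 2.1, and the batch field names are assumptions):

```python
import torch.nn.functional as F

def regression_step(model, optimizer, batch):
    """One gradient step for a continuous-regression action head (sketch)."""
    pred = model(batch["img_feats"], batch["lang_ids"], batch["proprio"])
    loss = F.mse_loss(pred, batch["action"])  # the unimodal-Gaussian assumption lives here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```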
(c) Diffusion / Flow Matching
Uses a generative model to represent the full action distribution:

\[
p_\theta(\mathbf{a} \mid \mathbf{o}, \mathbf{l})
\]

Samples actions from noise via iterative denoising or flow-matching integration, e.g., Euler steps along a learned velocity field starting from \(\mathbf{a}^{0} \sim \mathcal{N}(0, \mathbf{I})\):

\[
\mathbf{a}^{\tau + \Delta\tau} = \mathbf{a}^{\tau} + \Delta\tau \cdot v_\theta(\mathbf{a}^{\tau}, \tau, \mathbf{o}, \mathbf{l})
\]

Representatives: pi0, RDT, Octo (diffusion head option)
Advantages: Can model complex multimodal action distributions; highest precision
Disadvantages: Inference requires multiple denoising steps; slower
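The sampling loop itself is short. Below is a hedged sketch of flow-matching inference with plain Euler integration; `velocity_net` and its call signature are assumptions for illustration, not pi0's actual API:

```python
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, obs_emb, horizon=50, action_dim=7, n_steps=10):
    """Integrate a learned velocity field from Gaussian noise to an action chunk."""
    a = torch.randn(1, horizon, action_dim)   # a^0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        tau = torch.full((1,), i * dt)
        v = velocity_net(a, tau, obs_emb)     # predicted velocity at (a^tau, tau)
        a = a + dt * v                        # Euler step toward the data distribution
    return a                                  # (1, horizon, action_dim)
```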
More on diffusion policies: Diffusion Policy
3. Model Development Timeline
3.1 Timeline Overview
```mermaid
timeline
    title VLA Model Development Timeline
    2022 : RT-1 (Google)
         : First large-scale robot Transformer
    2023 : RT-2 (Google DeepMind)
         : VLM first directly outputs actions
         : Octo (Berkeley)
         : Open-source multi-embodiment foundation model
    2024 : OpenVLA (Stanford/Berkeley)
         : Open-source 7B VLA
         : pi0 (Physical Intelligence)
         : Flow matching action head
         : GR-1 (ByteDance)
         : Video generative pretraining + action prediction
         : HPT (MIT)
         : Heterogeneous sensor pretraining
         : RDT (Tsinghua)
         : Diffusion Transformer bimanual manipulation
         : GR-2 (ByteDance)
         : World model component
    2025 : pi0.5 (Physical Intelligence)
         : Hierarchical task decomposition
```
3.2 Detailed Model Cards
RT-1 (Google, 2022)
- Architecture: EfficientNet-B3 visual encoding + TokenLearner compression + Transformer decoder
- Data: 130K real robot episodes, 700+ tasks, 13 Everyday Robots
- Action space: Discretized tokens (256 bins per dimension), outputs 7DoF end-effector pose + termination signal + mobile base
- Control frequency: 3Hz
- Key contribution: Demonstrated that large-scale real data-trained Transformers can generalize to novel objects and instructions
- Limitation: Only supports a single robot platform; generalization limited to within training distribution
RT-2 (Google DeepMind, 2023)
- Architecture: PaLI-X (55B) or PaLM-E (12B) as backbone, co-fine-tuned
- Data: Robot data + Web-scale vision-language data
- Action representation: Actions encoded as text token strings, e.g., "1 128 91 241 5 101 127"
- Key contributions:
  - First demonstration that a VLM can be fine-tuned into a VLA
  - Emergent reasoning: understanding "move apple to bowl with matching color"
  - Symbolic reasoning: understanding logic and relations in language
- Limitation: Enormous model (up to 55B); inference speed only 1–3 Hz
Octo (Berkeley, 2023)
- Architecture: Pure Transformer, supports multiple observation and action spaces
- Data: Open X-Embodiment dataset (800K+ episodes)
- Action head: Supports both continuous regression and diffusion modes
- Key contributions:
  - First open-source general robot foundation model
  - Flexible architecture supporting different robots
  - Supports both language and goal-image task specification
- Parameters: 93M (Base)
OpenVLA (Stanford/Berkeley, 2024)
- Architecture: Prismatic VLM (SigLIP + DINOv2 dual visual encoders) + Llama 2 7B
- Data: Open X-Embodiment dataset
- Action representation: Discretized tokens (256 bins), reusing LLM token prediction
- Key contributions:
  - Open-source 7B-scale VLA, fine-tunable on consumer GPUs
  - Demonstrated that VLM architectures can effectively transfer to robot control
- Fine-tuning: LoRA efficient fine-tuning, requiring only small amounts of data on new robots
pi0 (Physical Intelligence, 2024)
- Architecture: 3B pretrained VLM + Flow Matching action head
- Data: Large-scale dataset across multiple robot platforms
- Action representation: Flow Matching generates continuous action sequences (action chunks)
- Key contributions:
  - Flow Matching action head can model multimodal action distributions
  - Cross-embodiment transfer: the same model controls multiple different robots
  - Action chunks (predicting multiple future steps at once) improve temporal consistency
- Control frequency: ~50Hz (GPU inference + action chunk)
pi0.5 (Physical Intelligence, 2025)
- Architecture: Dual-layer structure — high-level VLM for sub-task planning + low-level pi0 for fine-grained control
- Key contributions:
  - Hierarchical task decomposition
  - High-level model understands long-horizon complex tasks
  - Low-level model executes fine manipulation actions
  - Completes end-to-end laundry, cleaning, and other long-horizon tasks in real home environments
GR-1 (ByteDance, 2024)
- Architecture: GPT-style Transformer that jointly predicts future video frames and actions
- Data: Large-scale video data for generative pretraining + robot manipulation data
- Key contributions:
  - Showed that large-scale video generative pretraining transfers to robot manipulation
  - Multi-task learning of video prediction + action prediction
  - Can learn from human videos, then transfer to robot manipulation
GR-2 (ByteDance, 2024)
- Architecture: 3B+ parameters, includes a world-model-style video prediction component
- Key contributions:
  - Scale increased to 3B+ parameters
  - Introduces a world model component to predict future visual states
  - Supports more complex, longer-horizon manipulation tasks
HPT (MIT, 2024)
- Architecture: Heterogeneous Pretrained Transformer, unified processing of sensor inputs with different dimensions
- Key contributions:
  - Solves the problem of inconsistent sensor dimensions across robots
  - Uses modality-alignment layers (stems) to map heterogeneous inputs into a unified space
  - Pretrained on Open X-Embodiment + DROID
RDT (Tsinghua, 2024)
- Architecture: Diffusion Transformer (DiT-style), focused on bimanual manipulation
- Data: Bimanual manipulation datasets
- Key contributions:
  - Introduces the DiT architecture to robot action generation
  - Natively supports high-dimensional bimanual action spaces (14+ DoF)
  - The diffusion process can model the complex action distributions of bimanual coordination
4. Core Technical Deep Dive
4.1 Action Chunking
Action chunking is a key technique in VLA models. Instead of predicting a single action per step, the policy predicts a sequence of \(H\) future actions at once:

\[
\pi_\theta(\mathbf{o}_t, \mathbf{l}) = (\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+H-1})
\]

where \(H\) is the chunk size (typically 16–100 steps).
Benefits:
- Temporal consistency: Avoids jitter and incoherence in step-by-step prediction
- Reduced inference calls: Model inference needed only every \(H\) steps
- Implicit planning: The model learns short-term action planning
Execution strategy: Typically, instead of executing an entire chunk before re-predicting, the model re-predicts every \(k < H\) steps and fuses the overlapping old and new chunks via exponentially weighted averaging:

\[
\mathbf{a}_t = \frac{\sum_i w_i\, \mathbf{a}_t^{(i)}}{\sum_i w_i}, \qquad w_i = e^{-m \cdot i}
\]

where \(\mathbf{a}_t^{(i)}\) is the action that the \(i\)-th surviving chunk (with \(i = 0\) the oldest) predicted for the current timestep, and \(m\) sets how quickly the weights decay.
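A small numpy sketch of this fusion, in the spirit of ACT's temporal ensembling (the buffer protocol and the decay constant `m` are illustrative assumptions):

```python
import numpy as np

def ensembled_action(chunk_buffer, new_chunk, m=0.1):
    """Fuse overlapping action chunks with weights w_i = exp(-m * i), i = 0 oldest.
    Called once per control step (re-prediction every step, k = 1, for simplicity);
    chunk_buffer is a plain list, oldest chunk first, each chunk (H, action_dim)."""
    chunk_buffer.append(new_chunk)
    H = len(new_chunk)
    if len(chunk_buffer) > H:                # drop chunks too old to cover the current step
        chunk_buffer.pop(0)
    preds, weights = [], []
    for i, chunk in enumerate(chunk_buffer):
        age = len(chunk_buffer) - 1 - i      # steps since this chunk was predicted
        preds.append(chunk[age])             # the action it assigned to "now"
        weights.append(np.exp(-m * i))       # oldest chunk (i = 0) weighted highest
    w = np.array(weights)[:, None]
    return (np.array(preds) * w).sum(axis=0) / w.sum()
```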
This design line did not appear out of nowhere. The key bridge model is ACT Model: ACT turned chunked action prediction into a reproducible and interpretable policy paradigm, and many later VLA horizon, chunk execution, and temporal smoothing designs inherit intuition from that line.
4.2 Co-fine-tuning Strategy
Co-fine-tuning, proposed by RT-2, is a key training technique:
During the fine-tuning phase, the original VLM training data is not entirely discarded; instead, robot data and web data are mixed for training. This approach:
- Preserves the VLM's original visual understanding and language capabilities
- Prevents catastrophic forgetting
- Allows web knowledge to continuously influence robot policy learning
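A sketch of the data-mixing loop (the 2:1 robot-to-web ratio and the generator interface are illustrative assumptions; the actual mixture weights in RT-2-style training are tuned hyperparameters):

```python
import random

def cofinetune_batches(robot_data, web_data, batch_size=32, robot_frac=0.66):
    """Yield batches mixing robot-action examples with web vision-language examples,
    so gradient updates never see robot data in isolation."""
    n_robot = int(batch_size * robot_frac)
    while True:
        batch = random.sample(robot_data, n_robot) \
              + random.sample(web_data, batch_size - n_robot)
        random.shuffle(batch)
        yield batch
```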
4.3 Challenges of Cross-Embodiment Transfer
Key differences between robots:
| Difference Dimension | Example |
|---|---|
| Observation space | Single camera vs. dual cameras vs. wrist camera |
| Action space | 6DoF end-effector vs. 7DoF joint vs. 14DoF bimanual |
| Action range | Tabletop manipulation vs. whole-body movement |
| Control frequency | 3Hz vs. 50Hz vs. 200Hz |
| Dynamics | Light load vs. heavy load |
Handling strategies:
- Action space standardization (Octo): Normalize all actions to a unified range
- Modality alignment layers (HPT): Use learnable stems to map heterogeneous inputs to a unified space
- Task-specific fine-tuning heads: Share the backbone, fine-tune output heads for different robots
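As a concrete example of the first strategy, a per-dimension quantile normalizer (a sketch; the 1%/99% robust bounds follow the style of Octo/OpenVLA-type pipelines, and all names here are assumptions):

```python
import numpy as np

class ActionNormalizer:
    """Map one embodiment's actions into [-1, 1] per dimension using robust bounds."""

    def __init__(self, actions: np.ndarray, eps: float = 1e-8):
        # actions: (N, action_dim) from one robot's training set; 1%/99% quantiles
        # keep a few outlier teleoperation actions from stretching the range.
        self.lo = np.quantile(actions, 0.01, axis=0)
        self.hi = np.quantile(actions, 0.99, axis=0)
        self.eps = eps

    def normalize(self, a: np.ndarray) -> np.ndarray:
        return 2.0 * (a - self.lo) / (self.hi - self.lo + self.eps) - 1.0

    def denormalize(self, a: np.ndarray) -> np.ndarray:
        return (a + 1.0) / 2.0 * (self.hi - self.lo + self.eps) + self.lo
```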
5. Model Comparison Summary
| Model | Year | Institution | Parameters | Action Repr. | Data Scale | Open Source |
|---|---|---|---|---|---|---|
| RT-1 | 2022 | Google | 35M | Discrete Token | 130K ep | No |
| RT-2 | 2023 | Google DeepMind | 12–55B | Discrete Token | 130K ep + Web | No |
| Octo | 2023 | Berkeley | 93M | Continuous/Diffusion | 800K+ ep | Yes |
| OpenVLA | 2024 | Stanford/Berkeley | 7B | Discrete Token | 970K+ ep | Yes |
| pi0 | 2024 | Physical Intelligence | 3B | Flow Matching | Large-scale | Yes |
| pi0.5 | 2025 | Physical Intelligence | 3B+ | Flow Matching | Large-scale | No |
| GR-1 | 2024 | ByteDance | ~1B | Continuous Regression | Video + robot data | Partial |
| GR-2 | 2024 | ByteDance | 3B+ | Continuous Regression | Video + robot data | No |
| HPT | 2024 | MIT | ~300M | Continuous/Diffusion | Multi-source | Yes |
| RDT | 2024 | Tsinghua | ~1B | Diffusion | Bimanual data | Yes |
6. Future Trends
- Evolution of action heads: From discrete tokens → continuous regression → diffusion/Flow Matching, trending toward higher precision and multimodal modeling
- Hierarchical design: The high-level planning + low-level control paradigm of pi0.5 may become mainstream
- Training efficiency: LoRA, QLoRA, and other efficient fine-tuning methods reduce VLA adaptation costs
- Real-time performance: Model distillation, quantization, action chunking, and other techniques improve inference speed
- Data flywheel: Data collected from deployed VLAs feeds back into model training, forming a positive cycle
References:
- Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", RSS 2023
- Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL 2023
- Team et al., "Octo: An Open-Source Generalist Robot Policy", RSS 2024
- Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
- Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control", 2024
- Physical Intelligence, "pi0.5: a Vision-Language-Action Model with Open-World Generalization", 2025
- Wu et al., "GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", 2024
- Wang et al., "Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers", NeurIPS 2024 (HPT)
- Liu et al., "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", 2024