
VLA Models (Vision-Language-Action Models)

VLA (Vision-Language-Action) models are one of the most important model paradigms in embodied intelligence today: they receive visual observations and language instructions, and directly output robot actions. This article systematically reviews VLA architecture design, action representation methods, and the development trajectory from RT-1 to pi0.5.

Related notes: Model Roadmap | ACT Model | Imitation Learning | Diffusion Policy | Foundation Models for Robotics | Open-Source Model Summary

If you want the broader model evolution before zooming into the VLA mainline, start with Model Roadmap.


1. VLA Model Definition

1.1 What Is a VLA

The core definition of a VLA model:

\[\pi_\theta: (\mathbf{o}_{\text{visual}}, \mathbf{l}_{\text{language}}) \mapsto \mathbf{a}_{\text{action}}\]

where:

  • \(\mathbf{o}_{\text{visual}}\): Visual observation (RGB images, depth maps, point clouds, etc.)
  • \(\mathbf{l}_{\text{language}}\): Natural language task instruction (e.g., "pick up the red cup")
  • \(\mathbf{a}_{\text{action}}\): Robot action (end-effector pose, joint angles, etc.)

What distinguishes VLA from other paradigms is that it does not merely use vision and language for task understanding, but directly outputs executable low-level actions, achieving end-to-end perception-to-action mapping.
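
As a concrete sketch of this interface, here is a minimal Python stub (the class and method names are hypothetical, not tied to any specific framework):

import numpy as np

class VLAPolicy:
    """Hypothetical VLA interface: maps (visual observation, language instruction) to an action."""

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # image:       RGB observation, e.g. shape (224, 224, 3)
        # instruction: natural language task, e.g. "pick up the red cup"
        # returns:     action vector, e.g. (Δx, Δy, Δz, Δrx, Δry, Δrz, gripper)
        raise NotImplementedError

A trained policy is then called in a control loop, e.g. action = policy.predict(camera_image, "pick up the red cup").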

1.2 Why VLA Is Needed

Traditional robot learning methods (such as behavioral cloning) typically only accept specific observation formats and lack language understanding capabilities. Pure LLMs/VLMs, on the other hand, cannot directly output robot actions. VLA unifies both:

  • Inherited from VLM: Visual understanding, language reasoning, commonsense knowledge
  • New capabilities: Action output, physical interaction, real-time control

2. General Architecture

2.1 Three Core Components

All VLA models follow a basic three-component architecture:

graph LR
    subgraph Input
        IMG[RGB Image] --> VE
        LANG[Language Instruction] --> LT
        PROP[Proprioception] --> PE
    end

    subgraph Encoding
        VE[Visual Encoder<br/>ViT / SigLIP / DINOv2]
        LT[Language Tokenizer<br/>SentencePiece / BPE]
        PE[Proprioceptive Encoder<br/>MLP]
    end

    subgraph Backbone
        VE --> TF[Transformer Backbone<br/>Llama / PaLM / Custom]
        LT --> TF
        PE --> TF
    end

    subgraph ActionOutput["Action Output"]
        TF --> AH[Action Head]
        AH --> ACT[Robot Action<br/>Δx,Δy,Δz,Δrx,Δry,Δrz,gripper]
    end

    style Encoding fill:#e3f2fd
    style Backbone fill:#fff3e0
    style ActionOutput fill:#e8f5e9

Visual encoder choices:

Encoder | Pretraining Method | Parameters | Used by
EfficientNet-B3 | ImageNet | ~12M | RT-1
ViT-G | JFT-4B | 1.8B | RT-2 (PaLI-X)
SigLIP | WebLI contrastive learning | 400M | OpenVLA, pi0
DINOv2 | Self-supervised | 300M | OpenVLA, HPT
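
To make the three-component pipeline concrete, here is a minimal PyTorch sketch; the module sizes, toy patch projection, and single-token proprioception are illustrative assumptions, not the design of any specific model above:

import torch
import torch.nn as nn

class MiniVLA(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, action_dim=7):
        super().__init__()
        # Visual encoder: stand-in for ViT / SigLIP / DINOv2 (toy linear patch projection)
        self.visual_encoder = nn.Linear(3 * 16 * 16, d_model)
        # Language tokens: stand-in for a SentencePiece/BPE tokenizer + embedding table
        self.lang_embed = nn.Embedding(vocab_size, d_model)
        # Proprioceptive encoder: small MLP over joint/end-effector state
        self.proprio_encoder = nn.Sequential(
            nn.Linear(action_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Transformer backbone over the concatenated token sequence
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Action head: the continuous-regression variant from Section 2.2
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image_patches, lang_tokens, proprio):
        vis = self.visual_encoder(image_patches)            # (B, N_img, D)
        lang = self.lang_embed(lang_tokens)                  # (B, N_lang, D)
        prop = self.proprio_encoder(proprio).unsqueeze(1)    # (B, 1, D)
        h = self.backbone(torch.cat([vis, lang, prop], dim=1))
        return self.action_head(h[:, -1])                    # predict the action from the last token

# Example call with random inputs:
# MiniVLA()(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 12)), torch.randn(2, 7))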

2.2 Action Representation Methods

How VLA models output actions is a core design choice. There are currently three main approaches:

(a) Discrete Tokenization

Uniformly discretize the continuous action space into tokens:

\[a_d^{\text{token}} = \text{round}\left(\frac{a_d - a_{\min}}{a_{\max} - a_{\min}} \cdot (K-1)\right), \quad K=256\]

Representatives: RT-2, OpenVLA

Advantages: Can directly reuse the language model's token prediction mechanism

Disadvantages: Discretization loses precision; difficult to express multimodal action distributions
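
A minimal numpy sketch of this discretization and its inverse (K = 256 as in the formula; the action bounds and example values are illustrative):

import numpy as np

def tokenize_action(a, a_min, a_max, K=256):
    """Map each continuous action dimension to one of K discrete bins."""
    a = np.clip(a, a_min, a_max)
    return np.round((a - a_min) / (a_max - a_min) * (K - 1)).astype(int)

def detokenize_action(tokens, a_min, a_max, K=256):
    """Recover approximate continuous actions from bin indices (error at most half a bin width)."""
    return tokens / (K - 1) * (a_max - a_min) + a_min

a_min, a_max = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.13, -0.42, 0.0, 0.5, -0.99, 0.2, 1.0])
tokens = tokenize_action(a, a_min, a_max)         # e.g. [144, 74, 128, 191, 1, 153, 255]
a_rec = detokenize_action(tokens, a_min, a_max)   # close to a, up to quantization error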

(b) Continuous Regression

The action head directly outputs continuous values:

\[\hat{\mathbf{a}} = \text{MLP}(\mathbf{h}_{\text{transformer}})\]

Training loss is typically MSE:

\[\mathcal{L} = \|\hat{\mathbf{a}} - \mathbf{a}^*\|^2\]

Representatives: RT-1, Octo (optional)

Advantages: Simple, direct, high precision

Disadvantages: MSE loss assumes a unimodal Gaussian distribution; cannot model multimodal actions
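
A sketch of a regression action head and one MSE training step (the 512-dim hidden state and the batch are placeholders for real transformer features and demonstration actions):

import torch
import torch.nn as nn

action_head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 7),                       # 7-DoF action: Δpose (6) + gripper (1)
)
optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-4)

h = torch.randn(32, 512)                     # transformer hidden states (placeholder)
a_star = torch.randn(32, 7)                  # ground-truth demonstration actions (placeholder)

a_hat = action_head(h)
loss = ((a_hat - a_star) ** 2).mean()        # MSE: implicitly assumes a unimodal target
optimizer.zero_grad()
loss.backward()
optimizer.step()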

(c) Diffusion / Flow Matching

Uses generative models to model the action distribution:

\[\mathbf{a} \sim p_\theta(\mathbf{a} | \mathbf{o}, \mathbf{l})\]

Samples actions from noise via iterative denoising or flow matching:

\[\mathbf{a}_1 = \mathbf{a}_0 + \int_0^1 v_\theta(\mathbf{a}_t, t, c) \, dt, \quad \mathbf{a}_0 \sim \mathcal{N}(0, I)\]

Representatives: pi0, RDT, Octo (diffusion head option)

Advantages: Can model complex multimodal action distributions; highest precision

Disadvantages: Inference requires multiple denoising steps; slower
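
A minimal sketch of sampling an action chunk by Euler integration of the learned velocity field (v_theta stands for a trained flow-matching network; its call signature and the step count are assumptions):

import torch

def sample_action_chunk(v_theta, cond, horizon=50, action_dim=7, num_steps=10):
    """Integrate da/dt = v_theta(a_t, t, cond) from t = 0 to t = 1, starting from Gaussian noise."""
    a = torch.randn(1, horizon, action_dim)   # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        a = a + dt * v_theta(a, t, cond)      # one Euler step along the flow
    return a                                  # a_1: the generated action chunk

Fewer integration steps trade accuracy for inference speed, which is exactly the latency concern noted above.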

More on diffusion policies: Diffusion Policy


3. Model Development Timeline

3.1 Timeline Overview

timeline
    title VLA Model Development Timeline
    2022 : RT-1 (Google)
         : First large-scale robot Transformer
    2023 : RT-2 (Google DeepMind)
         : VLM first directly outputs actions
         : Octo (Berkeley)
         : Open-source multi-embodiment foundation model
    2024 : OpenVLA (Stanford/Berkeley)
         : Open-source 7B VLA
         : pi0 (Physical Intelligence)
         : Flow matching action head
         : GR-1 (Fourier Intelligence)
         : Humanoid-specific VLA
         : HPT (MIT)
         : Heterogeneous sensor pretraining
         : RDT (Tsinghua)
         : Diffusion Transformer bimanual manipulation
    2025 : pi0.5 (Physical Intelligence)
         : Hierarchical task decomposition
         : GR-2 (Fourier Intelligence)
         : World model component

3.2 Detailed Model Cards

RT-1 (Google, 2022)

  • Architecture: EfficientNet-B3 visual encoding + TokenLearner compression + Transformer decoder
  • Data: 130K real robot episodes, 700+ tasks, 13 Everyday Robots
  • Action space: Discretized tokens (256 bins per dimension), outputs 7DoF end-effector pose + termination signal + mobile base
  • Control frequency: 3Hz
  • Key contribution: Demonstrated that large-scale real data-trained Transformers can generalize to novel objects and instructions
  • Limitation: Only supports a single robot platform; generalization limited to within training distribution

RT-2 (Google DeepMind, 2023)

  • Architecture: PaLI-X (55B) or PaLM-E (12B) as backbone, co-fine-tuned
  • Data: Robot data + Web-scale vision-language data
  • Action representation: Actions encoded as text tokens "1 128 91 241 5 101 127"
  • Key contributions:
      - First demonstration that a VLM can be fine-tuned into a VLA
      - Emergent reasoning: understanding "move apple to bowl with matching color"
      - Symbolic reasoning: understanding logic and relations in language
  • Limitation: Enormous model (55B), inference speed only 1–3Hz

Octo (Berkeley, 2023)

  • Architecture: Pure Transformer, supports multiple observation and action spaces
  • Data: Open X-Embodiment dataset (800K+ episodes)
  • Action head: Supports both continuous regression and diffusion modes
  • Key contributions:
      - First open-source general robot foundation model
      - Flexible architecture supporting different robots
      - Supports both language and goal-image task specification
  • Parameters: 93M (Base)

OpenVLA (Stanford/Berkeley, 2024)

  • Architecture: Prismatic VLM (SigLIP + DINOv2 dual visual encoders) + Llama 2 7B
  • Data: Open X-Embodiment dataset
  • Action representation: Discretized tokens (256 bins), reusing LLM token prediction
  • Key contributions:
      - Open-source 7B-scale VLA, fine-tunable on consumer GPUs
      - Demonstrated that VLM architectures can effectively transfer to robot control
  • Fine-tuning: LoRA efficient fine-tuning, requiring only small amounts of data on new robots

pi0 (Physical Intelligence, 2024)

  • Architecture: 3B pretrained VLM + Flow Matching action head
  • Data: Large-scale dataset across multiple robot platforms
  • Action representation: Flow Matching generates continuous action sequences (action chunks)
  • Key contributions:
      - Flow Matching action head can model multimodal action distributions
      - Cross-embodiment transfer: the same model controls multiple different robots
      - Action chunks (predicting multiple future steps at once) improve temporal consistency
  • Control frequency: ~50Hz (GPU inference + action chunk)

pi0.5 (Physical Intelligence, 2025)

  • Architecture: Dual-layer structure — high-level VLM for sub-task planning + low-level pi0 for fine-grained control
  • Key contributions:
      - Hierarchical task decomposition
      - High-level model understands long-horizon complex tasks
      - Low-level model executes fine manipulation actions
      - Completes end-to-end laundry, cleaning, and other long-sequence tasks in real home environments

GR-1 (Fourier Intelligence, 2024)

  • Architecture: GPT-style Transformer, simultaneously predicts video frames and actions
  • Data: Humanoid robot manipulation data + human video data
  • Key contributions:
      - First humanoid robot-specific VLA model
      - Multi-task learning of video prediction + action prediction
      - Can learn from human videos, then transfer to humanoid robots

GR-2 (Fourier Intelligence, 2025)

  • Architecture: 3B+ parameters, includes world model component
  • Key contributions:
      - Scale increased to 3B+ parameters
      - Introduces a world model component to predict future visual states
      - Supports more complex humanoid robot whole-body manipulation

HPT (MIT, 2024)

  • Architecture: Heterogeneous Pretrained Transformer, unified processing of sensor inputs with different dimensions
  • Key contributions:
      - Solves the problem of inconsistent sensor dimensions across robots
      - Uses modality alignment layers (stems) to map heterogeneous inputs to a unified space
      - Pretrained on Open X-Embodiment + DROID

RDT (Tsinghua, 2024)

  • Architecture: Diffusion Transformer (DiT-style), focused on bimanual manipulation
  • Data: Bimanual manipulation datasets
  • Key contributions:
      - Introduces the DiT architecture to robot action generation
      - Natively supports high-dimensional bimanual action spaces (14+ DoF)
      - The diffusion process can model the complex action distributions of bimanual coordination

4. Core Technical Deep Dive

4.1 Action Chunking

Action chunking is a key technique in VLA models. Instead of predicting a single action per step, it predicts a sequence of \(H\) future actions at once:

\[\hat{\mathbf{a}}_{t:t+H} = \pi_\theta(\mathbf{o}_t, \mathbf{l})\]

where \(H\) is the chunk size (typically 16–100 steps).

Benefits:

  1. Temporal consistency: Avoids jitter and incoherence in step-by-step prediction
  2. Reduced inference calls: Model inference needed only every \(H\) steps
  3. Implicit planning: The model learns short-term action planning

Execution strategy: Typically, instead of executing the entire chunk before re-predicting, the model re-predicts every \(k < H\) steps, fusing old and new chunks via exponential weighted averaging:

\[\mathbf{a}_t^{\text{exec}} = w \cdot \hat{\mathbf{a}}_t^{\text{new}} + (1-w) \cdot \hat{\mathbf{a}}_t^{\text{old}}\]
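
A sketch of this chunked execution loop with the weighted blend above (policy.predict_chunk and env are hypothetical; H, k, and w are illustrative values):

import numpy as np

def run_chunked_policy(policy, env, H=50, k=25, w=0.7, steps=500):
    """Re-predict a length-H chunk every k steps; blend overlapping old/new predictions."""
    obs = env.reset()
    chunk, old_chunk, offset = policy.predict_chunk(obs), None, 0
    for _ in range(steps):
        if offset == k:                                   # time to re-predict
            old_chunk, chunk, offset = chunk, policy.predict_chunk(obs), 0
        a = chunk[offset]
        if old_chunk is not None and offset + k < H:      # old chunk still covers this timestep
            a = w * a + (1 - w) * old_chunk[offset + k]   # fuse new and stale predictions
        obs = env.step(a)
        offset += 1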

This design did not appear out of nowhere. The key bridge is the ACT Model: ACT turned chunked action prediction into a reproducible and interpretable policy paradigm, and many later VLA choices around horizon length, chunk execution, and temporal smoothing inherit their intuition from that line of work.

4.2 Co-fine-tuning Strategy

Co-fine-tuning, proposed by RT-2, is a key training technique:

\[\mathcal{L}_{\text{total}} = \lambda_{\text{robot}} \mathcal{L}_{\text{robot}} + \lambda_{\text{web}} \mathcal{L}_{\text{web}}\]

During the fine-tuning phase, the original VLM training data is not entirely discarded; instead, robot data and web data are mixed for training. This approach:

  • Preserves the VLM's original visual understanding and language capabilities
  • Prevents catastrophic forgetting
  • Allows web knowledge to continuously influence robot policy learning
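
A sketch of one co-fine-tuning step under this objective (the model, the two loss functions, and the mixture weights are placeholders):

import torch

def cofinetune_step(model, optimizer, robot_batch, web_batch,
                    action_loss, vlm_loss, lambda_robot=0.8, lambda_web=0.2):
    """Weighted sum of the robot-action loss and the original web VLM loss."""
    loss = (lambda_robot * action_loss(model, robot_batch)   # e.g. action-token prediction
            + lambda_web * vlm_loss(model, web_batch))       # e.g. VQA / captioning objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()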

4.3 Challenges of Cross-Embodiment Transfer

Key differences between robots:

Difference Dimension | Example
Observation space | Single camera vs. dual cameras vs. wrist camera
Action space | 6DoF end-effector vs. 7DoF joint vs. 14DoF bimanual
Action range | Tabletop manipulation vs. whole-body movement
Control frequency | 3Hz vs. 50Hz vs. 200Hz
Dynamics | Light load vs. heavy load

Handling strategies:

  1. Action space standardization (Octo): Normalize all actions to a unified range (see the sketch after this list)
  2. Modality alignment layers (HPT): Use learnable stems to map heterogeneous inputs to a unified space
  3. Task-specific fine-tuning heads: Share the backbone, fine-tune output heads for different robots
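
A sketch of the first strategy, per-embodiment action normalization to a shared range (percentile-based bounds follow common practice in open-source VLA codebases; the exact percentiles are an assumption):

import numpy as np

def fit_action_bounds(actions):
    """Per-dimension bounds from one robot's training actions; percentiles are robust to outliers."""
    return np.percentile(actions, 1, axis=0), np.percentile(actions, 99, axis=0)

def normalize_action(a, lo, hi):
    """Map this embodiment's native action range to a shared [-1, 1] range."""
    return np.clip(2.0 * (a - lo) / (hi - lo) - 1.0, -1.0, 1.0)

def denormalize_action(a_norm, lo, hi):
    """Map a model output in [-1, 1] back to the robot's native action range."""
    return (a_norm + 1.0) / 2.0 * (hi - lo) + lo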

5. Model Comparison Summary

Model | Year | Institution | Parameters | Action Repr. | Data Scale | Open Source
RT-1 | 2022 | Google | 35M | Discrete Token | 130K ep | No
RT-2 | 2023 | Google DeepMind | 12–55B | Discrete Token | 130K ep + Web | No
Octo | 2023 | Berkeley | 93M | Continuous/Diffusion | 800K+ ep | Yes
OpenVLA | 2024 | Stanford/Berkeley | 7B | Discrete Token | 970K+ ep | Yes
pi0 | 2024 | Physical Intelligence | 3B | Flow Matching | Large-scale | Yes
pi0.5 | 2025 | Physical Intelligence | 3B+ | Flow Matching | Large-scale | No
GR-1 | 2024 | Fourier | ~1B | Continuous Regression | Humanoid data | Partial
GR-2 | 2025 | Fourier | 3B+ | Continuous Regression | Humanoid data | No
HPT | 2024 | MIT | ~300M | Continuous/Diffusion | Multi-source | Yes
RDT | 2024 | Tsinghua | ~1B | Diffusion | Bimanual data | Yes

Key takeaways and trends:

  1. Evolution of action heads: From discrete tokens → continuous regression → diffusion/Flow Matching, trending toward higher precision and multimodal modeling
  2. Hierarchical design: The high-level planning + low-level control paradigm of pi0.5 may become mainstream
  3. Training efficiency: LoRA, QLoRA, and other efficient fine-tuning methods reduce VLA adaptation costs
  4. Real-time performance: Model distillation, quantization, action chunking, and other techniques improve inference speed
  5. Data flywheel: Data collected from deployed VLAs feeds back into model training, forming a positive cycle

References:

  • Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", RSS 2023
  • Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL 2023
  • Team et al., "Octo: An Open-Source Generalist Robot Policy", RSS 2024
  • Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
  • Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control", 2024
  • Physical Intelligence, "pi0.5: a Vision-Language-Action Model with Open-World Generalization", 2025
  • Wu et al., "GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", 2024
  • Liang et al., "HPT: Heterogeneous Pre-trained Transformers", 2024
  • Liu et al., "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", 2024
