
VLA Models (Vision-Language-Action Models)

VLA (Vision-Language-Action) models are one of the most important model paradigms in embodied intelligence today: they receive visual observations and language instructions, and directly output robot actions. This article systematically reviews VLA architecture design, action representation methods, and the development trajectory from RT-1 to pi0.5.

Related notes: Model Roadmap | ACT Model | Imitation Learning | Diffusion Policy | Foundation Models for Robotics | Open-Source Model Summary

If you want the broader model evolution before zooming into the VLA mainline, start with Model Roadmap.


1. VLA Model Definition

1.1 What Is a VLA

The core definition of a VLA model:

\[\pi_\theta: (\mathbf{o}_{\text{visual}}, \mathbf{l}_{\text{language}}) \mapsto \mathbf{a}_{\text{action}}\]

where:

  • \(\mathbf{o}_{\text{visual}}\): Visual observation (RGB images, depth maps, point clouds, etc.)
  • \(\mathbf{l}_{\text{language}}\): Natural language task instruction (e.g., "pick up the red cup")
  • \(\mathbf{a}_{\text{action}}\): Robot action (end-effector pose, joint angles, etc.)

What distinguishes VLA from other paradigms is that it does not merely use vision and language for task understanding, but directly outputs executable low-level actions, achieving end-to-end perception-to-action mapping.
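
As a concrete sketch of this interface, here is a minimal Python stub (the class and method names are hypothetical, not tied to any specific framework):

import numpy as np

class VLAPolicy:
    """Hypothetical VLA interface: maps (visual observation, language instruction) to an action."""

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # image:       RGB observation, e.g. shape (224, 224, 3)
        # instruction: natural language task, e.g. "pick up the red cup"
        # returns:     action vector, e.g. (Δx, Δy, Δz, Δrx, Δry, Δrz, gripper)
        raise NotImplementedError

A trained policy is then called in a control loop, e.g. action = policy.predict(camera_image, "pick up the red cup").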

1.2 Why VLA Is Needed

Traditional robot learning methods (such as behavioral cloning) typically only accept specific observation formats and lack language understanding capabilities. Pure LLMs/VLMs, on the other hand, cannot directly output robot actions. VLA unifies both:

  • Inherited from VLM: Visual understanding, language reasoning, commonsense knowledge
  • New capabilities: Action output, physical interaction, real-time control

2. General Architecture

2.1 Three Core Components

All VLA models follow a basic three-component architecture:

graph LR
    subgraph Input
        IMG[RGB Image] --> VE
        LANG[Language Instruction] --> LT
        PROP[Proprioception] --> PE
    end

    subgraph Encoding
        VE[Visual Encoder<br/>ViT / SigLIP / DINOv2]
        LT[Language Tokenizer<br/>SentencePiece / BPE]
        PE[Proprioceptive Encoder<br/>MLP]
    end

    subgraph Backbone
        VE --> TF[Transformer Backbone<br/>Llama / PaLM / Custom]
        LT --> TF
        PE --> TF
    end

    subgraph ActionOutput["Action Output"]
        TF --> AH[Action Head]
        AH --> ACT[Robot Action<br/>Δx,Δy,Δz,Δrx,Δry,Δrz,gripper]
    end

    style Encoding fill:#e3f2fd
    style Backbone fill:#fff3e0
    style ActionOutput fill:#e8f5e9

Visual encoder choices:

Encoder | Pretraining Method | Parameters | Used by
EfficientNet-B3 | ImageNet | ~12M | RT-1
ViT-G | JFT-4B | 1.8B | RT-2 (PaLI-X)
SigLIP | WebLI contrastive learning | 400M | OpenVLA, pi0
DINOv2 | Self-supervised | 300M | OpenVLA, HPT
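
To make the three-component pipeline concrete, here is a minimal PyTorch sketch; the module sizes, toy patch projection, and single-token proprioception are illustrative assumptions, not the design of any specific model above:

import torch
import torch.nn as nn

class MiniVLA(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, action_dim=7):
        super().__init__()
        # Visual encoder: stand-in for ViT / SigLIP / DINOv2 (toy linear patch projection)
        self.visual_encoder = nn.Linear(3 * 16 * 16, d_model)
        # Language tokens: stand-in for a SentencePiece/BPE tokenizer + embedding table
        self.lang_embed = nn.Embedding(vocab_size, d_model)
        # Proprioceptive encoder: small MLP over joint/end-effector state
        self.proprio_encoder = nn.Sequential(
            nn.Linear(action_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Transformer backbone over the concatenated token sequence
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Action head: the continuous-regression variant from Section 2.2
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image_patches, lang_tokens, proprio):
        vis = self.visual_encoder(image_patches)            # (B, N_img, D)
        lang = self.lang_embed(lang_tokens)                  # (B, N_lang, D)
        prop = self.proprio_encoder(proprio).unsqueeze(1)    # (B, 1, D)
        h = self.backbone(torch.cat([vis, lang, prop], dim=1))
        return self.action_head(h[:, -1])                    # predict the action from the last token

# Example call with random inputs:
# MiniVLA()(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 12)), torch.randn(2, 7))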

2.2 Action Representation Methods

How VLA models output actions is a core design choice. There are currently three main approaches:

(a) Discrete Tokenization

Uniformly discretize the continuous action space into tokens:

\[a_d^{\text{token}} = \text{round}\left(\frac{a_d - a_{\min}}{a_{\max} - a_{\min}} \cdot (K-1)\right), \quad K=256\]

Representatives: RT-2, OpenVLA

Advantages: Can directly reuse the language model's token prediction mechanism

Disadvantages: Discretization loses precision; difficult to express multimodal action distributions
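
A minimal numpy sketch of this discretization and its inverse (K = 256 as in the formula; the action bounds and example values are illustrative):

import numpy as np

def tokenize_action(a, a_min, a_max, K=256):
    """Map each continuous action dimension to one of K discrete bins."""
    a = np.clip(a, a_min, a_max)
    return np.round((a - a_min) / (a_max - a_min) * (K - 1)).astype(int)

def detokenize_action(tokens, a_min, a_max, K=256):
    """Recover approximate continuous actions from bin indices (error at most half a bin width)."""
    return tokens / (K - 1) * (a_max - a_min) + a_min

a_min, a_max = np.full(7, -1.0), np.full(7, 1.0)
a = np.array([0.13, -0.42, 0.0, 0.5, -0.99, 0.2, 1.0])
tokens = tokenize_action(a, a_min, a_max)         # e.g. [144, 74, 128, 191, 1, 153, 255]
a_rec = detokenize_action(tokens, a_min, a_max)   # close to a, up to quantization error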

(b) Continuous Regression

The action head directly outputs continuous values:

\[\hat{\mathbf{a}} = \text{MLP}(\mathbf{h}_{\text{transformer}})\]

Training loss is typically MSE:

\[\mathcal{L} = \|\hat{\mathbf{a}} - \mathbf{a}^*\|^2\]

Representatives: RT-1, Octo (optional)

Advantages: Simple, direct, high precision

Disadvantages: MSE loss assumes a unimodal Gaussian distribution; cannot model multimodal actions
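
A sketch of a regression action head and one MSE training step (the 512-dim hidden state and the batch are placeholders for real transformer features and demonstration actions):

import torch
import torch.nn as nn

action_head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 7),                       # 7-DoF action: Δpose (6) + gripper (1)
)
optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-4)

h = torch.randn(32, 512)                     # transformer hidden states (placeholder)
a_star = torch.randn(32, 7)                  # ground-truth demonstration actions (placeholder)

a_hat = action_head(h)
loss = ((a_hat - a_star) ** 2).mean()        # MSE: implicitly assumes a unimodal target
optimizer.zero_grad()
loss.backward()
optimizer.step()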

(c) Diffusion / Flow Matching

Uses generative models to model the action distribution:

\[\mathbf{a} \sim p_\theta(\mathbf{a} | \mathbf{o}, \mathbf{l})\]

Samples actions from noise via iterative denoising or flow matching:

\[\mathbf{a}_1 = \mathbf{a}_0 + \int_0^1 v_\theta(\mathbf{a}_t, t, c) \, dt, \quad \mathbf{a}_0 \sim \mathcal{N}(0, I)\]

Representatives: pi0, RDT, Octo (diffusion head option)

Advantages: Can model complex multimodal action distributions; highest precision

Disadvantages: Inference requires multiple denoising steps; slower
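
A minimal sketch of sampling an action chunk by Euler integration of the learned velocity field (v_theta stands for a trained flow-matching network; its call signature and the step count are assumptions):

import torch

def sample_action_chunk(v_theta, cond, horizon=50, action_dim=7, num_steps=10):
    """Integrate da/dt = v_theta(a_t, t, cond) from t = 0 to t = 1, starting from Gaussian noise."""
    a = torch.randn(1, horizon, action_dim)   # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        a = a + dt * v_theta(a, t, cond)      # one Euler step along the flow
    return a                                  # a_1: the generated action chunk

Fewer integration steps trade accuracy for inference speed, which is exactly the latency concern noted above.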

More on diffusion policies: Diffusion Policy


3. Model Development Timeline

3.1 Timeline Overview

timeline
    title VLA Model Development Timeline
    2022 : RT-1 (Google)
         : First large-scale robot Transformer
    2023 : RT-2 (Google DeepMind)
         : VLM first directly outputs actions
         : Octo (Berkeley)
         : Open-source multi-embodiment foundation model
    2024 : OpenVLA (Stanford/Berkeley)
         : Open-source 7B VLA
         : pi0 (Physical Intelligence)
         : Flow matching action head
         : GR-1 (Fourier Intelligence)
         : Humanoid-specific VLA
         : HPT (MIT)
         : Heterogeneous sensor pretraining
         : RDT (Tsinghua)
         : Diffusion Transformer bimanual manipulation
    2025 : pi0.5 (Physical Intelligence)
         : Hierarchical task decomposition
         : GR-2 (Fourier Intelligence)
         : World model component

3.2 Detailed Model Cards

RT-1 (Google, 2022)

  • Architecture: EfficientNet-B3 visual encoding + TokenLearner compression + Transformer decoder
  • Data: 130K real robot episodes, 700+ tasks, 13 Everyday Robots
  • Action space: Discretized tokens (256 bins per dimension), outputs 7DoF end-effector pose + termination signal + mobile base
  • Control frequency: 3Hz
  • Key contribution: Demonstrated that large-scale real data-trained Transformers can generalize to novel objects and instructions
  • Limitation: Only supports a single robot platform; generalization limited to within training distribution

RT-2 (Google DeepMind, 2023)

  • Architecture: PaLI-X (55B) or PaLM-E (12B) as backbone, co-fine-tuned
  • Data: Robot data + Web-scale vision-language data
  • Action representation: Actions encoded as text tokens "1 128 91 241 5 101 127"
  • Key contributions:
      - First demonstration that a VLM can be fine-tuned into a VLA
      - Emergent reasoning: understanding "move apple to bowl with matching color"
      - Symbolic reasoning: understanding logic and relations in language
  • Limitation: Enormous model (55B), inference speed only 1–3Hz

Octo (Berkeley, 2023)

  • Architecture: Pure Transformer, supports multiple observation and action spaces
  • Data: Open X-Embodiment dataset (800K+ episodes)
  • Action head: Supports both continuous regression and diffusion modes
  • Key contributions:
      - First open-source general robot foundation model
      - Flexible architecture supporting different robots
      - Supports both language and goal-image task specification
  • Parameters: 93M (Base)

OpenVLA (Stanford/Berkeley, 2024)

  • Architecture: Prismatic VLM (SigLIP + DINOv2 dual visual encoders) + Llama 2 7B
  • Data: Open X-Embodiment dataset
  • Action representation: Discretized tokens (256 bins), reusing LLM token prediction
  • Key contributions:
      - Open-source 7B-scale VLA, fine-tunable on consumer GPUs
      - Demonstrated that VLM architectures can effectively transfer to robot control
  • Fine-tuning: LoRA efficient fine-tuning, requiring only small amounts of data on new robots

pi0 (Physical Intelligence, 2024)

  • Architecture: 3B pretrained VLM + Flow Matching action head
  • Data: Large-scale dataset across multiple robot platforms
  • Action representation: Flow Matching generates continuous action sequences (action chunks)
  • Key contributions:
      - Flow Matching action head can model multimodal action distributions
      - Cross-embodiment transfer: the same model controls multiple different robots
      - Action chunks (predicting multiple future steps at once) improve temporal consistency
  • Control frequency: ~50Hz (GPU inference + action chunk)

pi0.5 (Physical Intelligence, 2025)

  • Architecture: Dual-layer structure — high-level VLM for sub-task planning + low-level pi0 for fine-grained control
  • Key contributions:
      - Hierarchical task decomposition
      - High-level model understands long-horizon complex tasks
      - Low-level model executes fine manipulation actions
      - Completes end-to-end laundry, cleaning, and other long-sequence tasks in real home environments

GR-1 (Fourier Intelligence, 2024)

  • Architecture: GPT-style Transformer, simultaneously predicts video frames and actions
  • Data: Humanoid robot manipulation data + human video data
  • Key contributions:
      - First humanoid robot-specific VLA model
      - Multi-task learning of video prediction + action prediction
      - Can learn from human videos, then transfer to humanoid robots

GR-2 (Fourier Intelligence, 2025)

  • Architecture: 3B+ parameters, includes world model component
  • Key contributions:
      - Scale increased to 3B+ parameters
      - Introduces a world model component to predict future visual states
      - Supports more complex humanoid robot whole-body manipulation

HPT (MIT, 2024)

  • Architecture: Heterogeneous Pretrained Transformer, unified processing of sensor inputs with different dimensions
  • Key contributions:
      - Solves the problem of inconsistent sensor dimensions across robots
      - Uses modality alignment layers (stems) to map heterogeneous inputs to a unified space
      - Pretrained on Open X-Embodiment + DROID

RDT (Tsinghua, 2024)

  • Architecture: Diffusion Transformer (DiT-style), focused on bimanual manipulation
  • Data: Bimanual manipulation datasets
  • Key contributions:
      - Introduces the DiT architecture to robot action generation
      - Natively supports high-dimensional bimanual action spaces (14+ DoF)
      - The diffusion process can model the complex action distributions of bimanual coordination

4. Core Technical Deep Dive

4.1 Action Chunking

Action chunking is a key technique in VLA models. Instead of predicting a single action per step, it predicts a sequence of \(H\) future actions at once:

\[\hat{\mathbf{a}}_{t:t+H} = \pi_\theta(\mathbf{o}_t, \mathbf{l})\]

where \(H\) is the chunk size (typically 16–100 steps).

Benefits:

  1. Temporal consistency: Avoids jitter and incoherence in step-by-step prediction
  2. Reduced inference calls: Model inference needed only every \(H\) steps
  3. Implicit planning: The model learns short-term action planning

Execution strategy: Typically, instead of executing the entire chunk before re-predicting, the model re-predicts every \(k < H\) steps, fusing old and new chunks via exponential weighted averaging:

\[\mathbf{a}_t^{\text{exec}} = w \cdot \hat{\mathbf{a}}_t^{\text{new}} + (1-w) \cdot \hat{\mathbf{a}}_t^{\text{old}}\]
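
A sketch of this chunked execution loop with the weighted blend above (policy.predict_chunk and env are hypothetical; H, k, and w are illustrative values):

import numpy as np

def run_chunked_policy(policy, env, H=50, k=25, w=0.7, steps=500):
    """Re-predict a length-H chunk every k steps; blend overlapping old/new predictions."""
    obs = env.reset()
    chunk, old_chunk, offset = policy.predict_chunk(obs), None, 0
    for _ in range(steps):
        if offset == k:                                   # time to re-predict
            old_chunk, chunk, offset = chunk, policy.predict_chunk(obs), 0
        a = chunk[offset]
        if old_chunk is not None and offset + k < H:      # old chunk still covers this timestep
            a = w * a + (1 - w) * old_chunk[offset + k]   # fuse new and stale predictions
        obs = env.step(a)
        offset += 1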

This design did not appear out of nowhere. The key bridge is the ACT Model: ACT turned chunked action prediction into a reproducible and interpretable policy paradigm, and many later VLA choices around horizon length, chunk execution, and temporal smoothing inherit their intuition from that line of work.

4.2 Co-fine-tuning Strategy

Co-fine-tuning, proposed by RT-2, is a key training technique:

\[\mathcal{L}_{\text{total}} = \lambda_{\text{robot}} \mathcal{L}_{\text{robot}} + \lambda_{\text{web}} \mathcal{L}_{\text{web}}\]

During the fine-tuning phase, the original VLM training data is not entirely discarded; instead, robot data and web data are mixed for training. This approach:

  • Preserves the VLM's original visual understanding and language capabilities
  • Prevents catastrophic forgetting
  • Allows web knowledge to continuously influence robot policy learning
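
A sketch of one co-fine-tuning step under this objective (the model, the two loss functions, and the mixture weights are placeholders):

import torch

def cofinetune_step(model, optimizer, robot_batch, web_batch,
                    action_loss, vlm_loss, lambda_robot=0.8, lambda_web=0.2):
    """Weighted sum of the robot-action loss and the original web VLM loss."""
    loss = (lambda_robot * action_loss(model, robot_batch)   # e.g. action-token prediction
            + lambda_web * vlm_loss(model, web_batch))       # e.g. VQA / captioning objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()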

4.3 Challenges of Cross-Embodiment Transfer

Key differences between robots:

Difference Dimension | Example
Observation space | Single camera vs. dual cameras vs. wrist camera
Action space | 6DoF end-effector vs. 7DoF joint vs. 14DoF bimanual
Action range | Tabletop manipulation vs. whole-body movement
Control frequency | 3Hz vs. 50Hz vs. 200Hz
Dynamics | Light load vs. heavy load

Handling strategies:

  1. Action space standardization (Octo): Normalize all actions to a unified range (see the sketch after this list)
  2. Modality alignment layers (HPT): Use learnable stems to map heterogeneous inputs to a unified space
  3. Task-specific fine-tuning heads: Share the backbone, fine-tune output heads for different robots
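
A sketch of the first strategy, per-embodiment action normalization to a shared range (percentile-based bounds follow common practice in open-source VLA codebases; the exact percentiles are an assumption):

import numpy as np

def fit_action_bounds(actions):
    """Per-dimension bounds from one robot's training actions; percentiles are robust to outliers."""
    return np.percentile(actions, 1, axis=0), np.percentile(actions, 99, axis=0)

def normalize_action(a, lo, hi):
    """Map this embodiment's native action range to a shared [-1, 1] range."""
    return np.clip(2.0 * (a - lo) / (hi - lo) - 1.0, -1.0, 1.0)

def denormalize_action(a_norm, lo, hi):
    """Map a model output in [-1, 1] back to the robot's native action range."""
    return (a_norm + 1.0) / 2.0 * (hi - lo) + lo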

5. Model Comparison Summary

Model | Year | Institution | Parameters | Action Repr. | Data Scale | Open Source
RT-1 | 2022 | Google | 35M | Discrete Token | 130K ep | No
RT-2 | 2023 | Google DeepMind | 12–55B | Discrete Token | 130K ep + Web | No
Octo | 2023 | Berkeley | 93M | Continuous/Diffusion | 800K+ ep | Yes
OpenVLA | 2024 | Stanford/Berkeley | 7B | Discrete Token | 970K+ ep | Yes
pi0 | 2024 | Physical Intelligence | 3B | Flow Matching | Large-scale | Yes
pi0.5 | 2025 | Physical Intelligence | 3B+ | Flow Matching | Large-scale | No
GR-1 | 2024 | Fourier | ~1B | Continuous Regression | Humanoid data | Partial
GR-2 | 2025 | Fourier | 3B+ | Continuous Regression | Humanoid data | No
HPT | 2024 | MIT | ~300M | Continuous/Diffusion | Multi-source | Yes
RDT | 2024 | Tsinghua | ~1B | Diffusion | Bimanual data | Yes

Key takeaways and trends:

  1. Evolution of action heads: From discrete tokens → continuous regression → diffusion/Flow Matching, trending toward higher precision and multimodal modeling
  2. Hierarchical design: The high-level planning + low-level control paradigm of pi0.5 may become mainstream
  3. Training efficiency: LoRA, QLoRA, and other efficient fine-tuning methods reduce VLA adaptation costs
  4. Real-time performance: Model distillation, quantization, action chunking, and other techniques improve inference speed
  5. Data flywheel: Data collected from deployed VLAs feeds back into model training, forming a positive cycle

References:

  • Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", RSS 2023
  • Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", CoRL 2023
  • Team et al., "Octo: An Open-Source Generalist Robot Policy", RSS 2024
  • Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
  • Black et al., "pi0: A Vision-Language-Action Flow Model for General Robot Control", 2024
  • Physical Intelligence, "pi0.5: a Vision-Language-Action Model with Open-World Generalization", 2025
  • Wu et al., "GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation", 2024
  • Liang et al., "HPT: Heterogeneous Pre-trained Transformers", 2024
  • Liu et al., "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation", 2024
