# Model Roadmap
Embodied AI models did not evolve along a single line. They advanced along two axes at the same time:
- a timeline axis: from classic imitation learning to sequence modeling, then VLAs, world models, and agentic planning
- a paradigm axis: from "learn actions" to "learn action distributions," then to "unify vision, language, and action," and finally to "learn a world and plan inside it"
This note does not replace single-model notes. Its job is to provide the missing map for the whole 05_Models section. After reading it, you should know:
- what stages robot models have gone through
- where ACT sits in the overall lineage
- which branches VLA, Diffusion Policy, world models, and LLM planning belong to
- what to read next
Related notes: Foundation Models for Robotics | VLA Models | ACT Model | LLM-Driven Robotics | World Models & Video Generation
## 1. Why a Model Roadmap Is Needed
If you only memorize models by publication year, two mistakes are common:
- assuming newer models fully replace older ones
- assuming all works compete on the same technical line
In practice:
- ACT is not a foundation model, but it is a key bridge on the action chunking line
- Diffusion Policy is not a VLA, but it defines a major branch of generative action modeling
- RT-2 / OpenVLA / pi0 belong to the VLA mainline, but use different action heads
- SayCan / Code as Policies mainly solve planning and interfaces, not low-level action generation
- Dreamer / UniSim / Cosmos are closer to world modeling and training in imagination
So the right organization is not just chronology and not just taxonomy. You need timeline + paradigm tree together.
## 2. Two Axes: Timeline and Paradigm Tree
### 2.1 What the timeline tells you
The timeline answers: when did the key turning points happen, and which later ideas inherit from them?
### 2.2 What the paradigm tree tells you
The paradigm tree answers: which layer of the robotics stack a model is mainly trying to solve.
| Dimension | Core Question | Representative Models |
|---|---|---|
| Classic policy learning | How do we learn actions from demonstrations? | BC, DAgger, GAIL |
| Sequence / generative policies | How do we model multi-step, multimodal actions? | Decision Transformer, BeT, ACT, Diffusion Policy |
| VLA / robot foundation models | How do we unify vision, language, and action? | RT-1, RT-2, Octo, OpenVLA, pi0 |
| World models | How do we predict futures and train inside a model? | Dreamer, UniSim, Genie, Cosmos |
| LLM planning | How do we do long-horizon reasoning, task decomposition, and tool use? | SayCan, Code as Policies, VoxPoser |
## 3. Stage One: Classic Robot Learning Models
The core goal here was simple: teach the robot to imitate actions from data.
Typical representatives:
- BC: cast policy learning as supervised learning
- DAgger: fix distribution shift at deployment time
- IRL / MaxEnt IRL: infer rewards from demonstrations
- GAIL: formulate imitation learning as adversarial training
The signature of this stage:
- usually narrow tasks
- relatively small datasets
- mostly single-step action mapping
- weak language and weak cross-task generalization
This material mainly lives in Imitation Learning.
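The BC framing above can be made concrete in a few lines: policy learning reduces to supervised regression over (state, action) demonstration pairs. The following toy sketch (entirely synthetic, with a hypothetical linear expert) recovers the expert policy by plain least squares:

```python
import numpy as np

# Toy behavior cloning: treat policy learning as supervised regression
# from observed states to expert actions. All data here is synthetic.
rng = np.random.default_rng(0)
expert_W = np.array([[1.0, -0.5],
                     [0.3,  2.0]])          # hidden expert policy a = W s
states = rng.normal(size=(200, 2))          # demonstration observations
actions = states @ expert_W.T               # noiseless expert actions

# BC objective: minimize ||S W^T - A||^2 over the demonstration dataset
W_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)
print(np.allclose(W_bc.T, expert_W, atol=1e-8))  # → True
```

The sketch also shows why the stage's weaknesses follow directly from the formulation: the regression only sees states the expert visited, which is exactly the distribution-shift gap DAgger closes by re-querying the expert on states the learned policy reaches.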
## 4. Stage Two: Sequence Modeling and Generative Policies
Once researchers realized that single-step action regression suffers from jitter, cannot represent multimodal action distributions, and accumulates error over long horizons, a second stage emerged.
### 4.1 The core shift
From: predict the single best action for the current step
To: predict an action sequence, or a full distribution over actions
In other words, models stopped predicting only the current action and started predicting an action sequence, or a full action distribution.
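The practical difference shows up in how often the policy is queried during a rollout: a chunking policy is called once per k steps instead of once per step. A minimal sketch (function names and the dummy policy/environment are made up for illustration):

```python
def rollout_with_chunks(policy, obs, env_step, horizon, chunk_size):
    """Execute `horizon` steps, querying the policy once per chunk."""
    actions, calls = [], 0
    while len(actions) < horizon:
        chunk = policy(obs)[:chunk_size]   # one forward pass -> k actions
        calls += 1
        for a in chunk:
            obs = env_step(obs, a)
            actions.append(a)
            if len(actions) == horizon:
                break
    return actions, calls

# Dummy policy and environment, just to count queries: 8 steps, chunks of 4
acts, calls = rollout_with_chunks(lambda o: [0.0] * 4,
                                  0.0, lambda o, a: o + a,
                                  horizon=8, chunk_size=4)
print(calls)  # → 2 (a single-step policy would need 8 queries)
```

Fewer queries per episode is one reason chunking reduces jitter: consecutive actions come from the same forward pass rather than from independent predictions.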
### 4.2 Representative models
- Decision Transformer: control as sequence modeling
- BeT: behavior modeling with discrete latent action tokens
- ACT: CVAE + Transformer for action chunk prediction
- Diffusion Policy: diffusion models for multimodal action distributions
### 4.3 Why ACT is a bridge node
ACT matters not because it is the biggest model, but because it made the following line explicit:
```mermaid
graph LR
    IL[Classic imitation learning] --> CHUNK[Action chunking]
    CHUNK --> ACT[ACT]
    ACT --> GEN[Generative action modeling]
    GEN --> DP[Diffusion Policy / RDT]
    ACT --> VLA_HEAD[Later VLA chunk / token / horizon design]
    VLA_HEAD --> OPENVLA[OpenVLA / pi0 / FAST]
    style ACT fill:#e8f5e9
    style CHUNK fill:#fff3e0
    style VLA_HEAD fill:#e3f2fd
```
So ACT links:
- the left side: Imitation Learning
- the right side: larger action-modeling lines such as VLA Models and later action tokenization work
If you read only VLA papers without understanding ACT, you miss a large part of the motivation behind chunked actions.
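One concrete idea chunking enables, which ACT made explicit, is temporal ensembling at inference time: several overlapping chunks predict the same timestep, and their predictions are averaged with exponential weights. A minimal scalar-action sketch, assuming the ACT paper's weighting w_j = exp(-m*j) with the oldest prediction first (the function name is mine):

```python
import numpy as np

def temporal_ensemble(chunks, t, chunk_size, m=0.1):
    """Average every chunk prediction that covers timestep t.

    chunks[i] is the action chunk predicted at timestep i; a chunk
    predicted at i covers timesteps i .. i + chunk_size - 1.
    """
    preds = [chunk[t - i] for i, chunk in enumerate(chunks)
             if i <= t < i + chunk_size]
    w = np.exp(-m * np.arange(len(preds)))   # oldest prediction weighted most
    return float(np.dot(w, preds) / w.sum())

# Two overlapping chunks disagree about timestep 1; the ensemble blends them.
a = temporal_ensemble([[1.0, 1.0], [3.0, 3.0]], t=1, chunk_size=2)
print(round(a, 3))  # → 1.95
```

The decay rate m trades smoothness against reactivity: larger m trusts the newest chunk more, smaller m keeps the executed trajectory closer to earlier commitments.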
## 5. Stage Three: VLAs and Robot Foundation Models
The central question became:
> Can we unify vision, language, and action in one model, and borrow web-scale pretraining for robot generalization?
### 5.1 Mainline representatives
- RT-1: large-scale real-robot Robotics Transformer
- RT-2: VLM-to-VLA transfer from web knowledge
- Octo: open-source, multi-embodiment, unified action interface
- OpenVLA: open 7B VLA for community reproduction
- HPT: unifying heterogeneous sensor inputs
- RDT: diffusion Transformer for high-dimensional bimanual action generation
- pi0: a flow-matching-style VLA mainline
### 5.2 What changed in this stage
| Change | Earlier policy models | VLA / foundation-model stage |
|---|---|---|
| Inputs | state, limited images | multi-view vision + language + proprioception |
| Data | single task or single robot | multi-task, multi-robot, multi-source |
| Objective | learn a specific skill | learn a general vision-language-action mapping |
| Output | single-step or short-horizon action | token / chunk / diffusion / flow |
| Source of generalization | demo coverage | pretraining + cross-embodiment data + fine-tuning |
For details, see Foundation Models for Robotics and VLA Models.
## 6. Stage Four: World Models and Training in Imagination
As action models became stronger, the research focus moved one level up:
> If a model can predict not only actions but also what will happen next, can we trial-and-error inside the model first?
This branch includes:
- Dreamer family: train policies inside a latent world model
- UniSim: replace large amounts of real interaction with generated worlds
- Genie / interactive video world models: learn controllable dynamics from video
- Cosmos / Genesis: combine large-scale video generation, simulation, and physical AI data
The emphasis is no longer only "give me an action," but also:
- predict future visual states
- predict latent dynamics
- evaluate or train policies with imagined rollouts
The matching note is World Models & Video Generation.
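"Trial-and-error inside the model" has a simple operational core: roll a policy forward through learned dynamics and reward heads, never touching the real environment. A Dreamer-style toy sketch, with hand-written scalar stand-ins for the learned components (all names and values are illustrative):

```python
def imagined_return(z0, policy, dynamics, reward, horizon):
    """Score a policy purely inside a (learned) latent model."""
    z, total = z0, 0.0
    for _ in range(horizon):
        a = policy(z)
        total += reward(z, a)
        z = dynamics(z, a)          # predicted next latent, no real env step
    return total

# Stand-ins: the latent is a scalar, dynamics decays it, reward is -z^2,
# and the "policy" pushes the latent toward zero.
ret = imagined_return(z0=2.0,
                      policy=lambda z: -0.5 * z,
                      dynamics=lambda z, a: 0.5 * z + a,
                      reward=lambda z, a: -z * z,
                      horizon=3)
print(ret)  # → -4.0
```

In a real world model the dynamics, reward, and policy are neural networks and the return's gradient (or a value estimate over imagined rollouts) drives policy improvement; the loop structure is the same.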
## 7. Stage Five: LLM Planning and Embodied Agents
There is another parallel line that does not primarily optimize low-level actions. Instead, it studies:
> How can large models do long-horizon task understanding, tool use, decomposition, and interface orchestration?
Representative works:
- SayCan: combine LLM likelihood with affordance scores
- Code as Policies: generate code that calls robot APIs
- VoxPoser: use 3D value maps and spatial representations for manipulation
This line is not a replacement for VLA. It sits at a different level:
- VLA is closer to the unified low-level policy
- LLM agents are closer to the high-level planner
See LLM-Driven Robotics.
## 8. Key Technical Turning Points
The following turning points largely define recent evolution:
| Turning Point | Problem Solved | Representative Work |
|---|---|---|
| Action discretization | Let LLMs/VLMs emit action tokens | RT-1, RT-2, OpenVLA |
| Action chunking | Reduce jitter and improve temporal consistency | ACT, pi0 |
| Diffusion / Flow Matching | Model multimodal continuous action distributions | Diffusion Policy, RDT, pi0 |
| Web pretraining | Import semantic knowledge and reasoning | RT-2, OpenVLA, pi0 |
| Cross-embodiment | One model across multiple robots | Octo, HPT, OpenVLA |
| Open-source training stack | Lower reproduction and fine-tuning cost | Octo, OpenVLA, LeRobot |
| High-frequency action tokenization | Make token-based control viable at higher rates | FAST, newer tokenized policies |
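The first turning point in the table, action discretization, is in its simplest form just uniform binning: each continuous action dimension is clipped to a range and mapped to one of N token ids (RT-style models commonly use 256 bins). A minimal sketch, with the bin count and function names as assumptions:

```python
import numpy as np

N_BINS = 256  # a common choice in RT-style models

def to_token(action, low, high, n_bins=N_BINS):
    """Map a continuous action in [low, high] to an integer token id."""
    frac = (np.clip(action, low, high) - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)

def to_action(token, low, high, n_bins=N_BINS):
    """Decode a token id back to the center of its bin."""
    return low + (token + 0.5) / n_bins * (high - low)

# Round-trip error is bounded by half a bin width.
tok = to_token(0.123, low=-1.0, high=1.0)
print(abs(to_action(tok, -1.0, 1.0) - 0.123) <= 1.0 / N_BINS)  # → True
```

The quantization error this introduces at high control rates is exactly what the last row of the table targets: FAST-style tokenization compresses action sequences instead of binning each step independently.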
The historical role of ACT can be summarized in one sentence:
> It is not the endpoint of robot foundation models, but it turned short-horizon action chunks into a clear modeling paradigm.
## 9. 2022-2026 Timeline Overview
```mermaid
timeline
    title Robot Model Roadmap (2022-2026)
    2022 : RT-1
         : SayCan
         : LLM planning starts entering robotics
    2023 : RT-2
         : Octo
         : ACT
         : Diffusion Policy
         : Sequence modeling, generative policies, and VLA all accelerate together
    2024 : OpenVLA
         : pi0
         : HPT
         : RDT-1B
         : Genie / Cosmos / Genesis directions heat up
    2025 : pi0.5
         : FAST
         : Hierarchical VLA and action tokenization continue to evolve
    2026 : Current focus
         : Better deployability
         : Faster action representations
         : Fusion of VLA + world model + agents
```
The last line is a trend summary as of 2026-04, not a list of one-off paper titles.
## 10. Paradigm Tree and What to Read First
```mermaid
graph TD
    ROOT[Embodied AI models] --> IL[Classic imitation learning]
    ROOT --> SEQ[Sequence and generative policies]
    ROOT --> VLA[VLA / robot foundation models]
    ROOT --> WM[World models]
    ROOT --> PLAN[LLM planning]
    IL --> BC[BC / DAgger / GAIL]
    SEQ --> ACT[ACT]
    SEQ --> DP[Diffusion Policy]
    VLA --> RT[RT-1 / RT-2]
    VLA --> OCTO[Octo / OpenVLA / HPT]
    VLA --> PI[pi0 / RDT]
    WM --> DREAMER[Dreamer / UniSim / Cosmos]
    PLAN --> SAYCAN[SayCan / Code as Policies / VoxPoser]
    ACT --> VLA
    style ACT fill:#e8f5e9
    style VLA fill:#e3f2fd
    style WM fill:#fff3e0
```
### Recommended reading order
| Your question | Read first | Then read |
|---|---|---|
| I do not know how robot models are organized | Model Roadmap | Foundation Models for Robotics |
| I only care about mainstream VLAs | VLA Models | Open-Source Model Summary |
| I want to understand why ACT keeps coming up | ACT Model | Imitation Learning |
| I work on bimanual or dexterous manipulation | ACT Model | Diffusion Policy |
| I care about world models and generated simulation | World Models & Video Generation | Simulation World Building & Physics Rules |
| I care about open-source reproduction and training entrypoints | Open-Source Model Summary | Open-Source Frameworks |
## 11. Which Models Deserve Standalone Pages
A model should usually get its own page only if it satisfies at least one of these:
- it opens a clear technical branch
- it is still frequently used in engineering practice
- it is a bridge for understanding later models
Using that rule, the current recommendation is:
| Model / Direction | Recommendation |
|---|---|
| ACT | Already promoted to a standalone page because its bridge role is unusually clear |
| VLA mainline | Already has its own page because the family and timeline are rich enough |
| Diffusion Policy | Keep in Diffusion Policy, because it still fits better as a method branch |
| RT-2 / OpenVLA / pi0 | Keep inside the VLA overview for now; split later if the content grows further |
| World models | Keep as a direction page first; no need to explode into many single-model pages yet |
## 12. References
- Brohan et al., RT-1: Robotics Transformer for Real-World Control at Scale, RSS 2023
- Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, CoRL 2023
- Octo Team, Octo: An Open-Source Generalist Robot Policy, RSS 2024
- Kim et al., OpenVLA: An Open-Source Vision-Language-Action Model, 2024
- Zhao et al., Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, RSS 2023
- Chi et al., Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023
- Black et al., pi0: A Vision-Language-Action Flow Model for General Robot Control, 2024
- Physical Intelligence, FAST: Efficient Robot Action Tokenization, 2025
- Hafner et al., DreamerV3, 2023
- Official pages:
- OpenVLA: https://openvla.github.io/
- Octo: https://octo-models.github.io/
- Tony Zhao / ALOHA + ACT: https://tonyzhaozh.github.io/
- pi0 PDF: https://www.physicalintelligence.company/download/pi0.pdf
- FAST: https://www.physicalintelligence.company/research/fast