
ACT Model

ACT (Action Chunking with Transformers) is one of the most representative models in embodied intelligence. It is not the biggest or newest foundation model, but it turned action chunking into a clear paradigm, which is why it still matters for understanding later VLA action heads, generative policies, and low-cost bimanual manipulation.

This note treats ACT not as "a subsection inside imitation learning," but as a model-level milestone: what problem it solved, why it mattered in 2023, and how it relates to Diffusion Policy, OpenVLA, and pi0.

Related notes: Model Roadmap | Imitation Learning | VLA Models | Open-Source Model Summary | Teleoperation and Data Collection


1. What ACT Is and Why It Deserves Its Own Page

One-sentence definition:

Instead of predicting a single action for the current step, ACT predicts a short future action chunk and executes it with temporal smoothing.

Formally, it rewrites classic behavioral cloning from

\[ \pi(o_t) \rightarrow a_t \]

to

\[ \pi(o_t, s_t) \rightarrow \hat{\mathbf{a}}_{t:t+H-1} \]

where:

  • \(o_t\) is the visual observation
  • \(s_t\) is the robot state
  • \(H\) is the chunk horizon
  • the output is a sequence of future actions rather than one action

It deserves a standalone note for three reasons:

  1. Its historical position is unusual: it is the bridge from classic imitation learning to chunk-based policies.
  2. Its engineering value is strong: few demonstrations, low-cost hardware, and bimanual precision tasks still make it highly reproducible.
  3. It influenced later model design: many later VLA, tokenized-action, and fast-inference lines keep the same "predict a chunk, not a single step" intuition.

2. Problem Setting

ACT was built for a very specific but difficult class of problems:

  • fine-grained bimanual manipulation
  • few demonstrations
  • low-cost hardware, especially ALOHA-style bimanual teleoperation
  • high-frequency control with temporal consistency

The main weaknesses of standard single-step BC are:

| Problem | Manifestation |
|---|---|
| Single-step jitter | Each action is predicted independently, making trajectories noisy |
| Multimodality | Multiple valid action paths may exist for the same observation |
| Long-horizon accumulation | Small step errors snowball over time |
| Hard bimanual coordination | The two arms need both temporal continuity and spatial synchronization |

ACT answers with:

  • CVAE for multimodality
  • action chunking for short-horizon prediction
  • temporal ensembling for smoothing overlapping predictions

3. Core Ideas

3.1 Action Chunking

Instead of predicting only \(a_t\), ACT predicts a short-horizon action block:

\[ \hat{\mathbf{a}}_{t:t+H-1} = \left(\hat{a}_t, \hat{a}_{t+1}, \ldots, \hat{a}_{t+H-1}\right) \]

This has three direct benefits:

  1. the model explicitly learns short-term temporal structure
  2. inference calls are reduced
  3. actions become more temporally coherent
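The "fewer inference calls" benefit is easy to see in a toy rollout. The sketch below is illustrative, not the real ACT policy: `policy` is a stand-in that returns a fake chunk, and the horizon `H = 8` is an assumed value.

```python
import numpy as np

H = 8  # chunk horizon (illustrative value; ACT often uses much longer chunks)

def policy(obs):
    """Stand-in for the ACT policy: one call returns a whole chunk of H
    future actions. A real policy would run the transformer decoder here."""
    return np.tile(obs, (H, 1))          # shape (H, action_dim)

def rollout_chunked(states):
    """Execute whole chunks, so the policy is queried once per H steps."""
    actions, calls, t = [], 0, 0
    while t < len(states):
        chunk = policy(states[t])
        calls += 1
        actions.extend(chunk[: len(states) - t])
        t += H
    return np.array(actions), calls

states = np.random.randn(64, 7)          # 64 control steps, 7-D state
acts, calls = rollout_chunked(states)    # 8 policy calls instead of 64
```

With single-step BC the same rollout would cost 64 policy calls; chunking reduces it to 64 / H.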

3.2 Temporal Ensembling

The policy predicts a fresh chunk at every control step, so a given time step typically has multiple overlapping predictions made at different past times. ACT fuses them with exponentially weighted averaging:

\[ a_t^{\text{exec}} = \frac{\sum_i w_i \hat{a}_t^{(i)}}{\sum_i w_i}, \quad w_i = \exp(-m\, i) \]

where \(i = 0\) indexes the oldest prediction. In the original ACT implementation the oldest predictions receive the largest weights; the decay rate \(m\) controls how quickly newer observations are incorporated, with smaller \(m\) incorporating them faster.
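A minimal sketch of this fusion, assuming predictions are stored oldest first and using \(m = 0.01\) (the default decay rate in the reference ACT code, to the best of my knowledge):

```python
import numpy as np

def temporal_ensemble(preds, m=0.01):
    """Fuse k overlapping predictions of the same action a_t.
    preds: (k, action_dim), ordered oldest first (index i = 0 is oldest).
    m: decay rate; 0.01 is assumed here as the reference-code default."""
    preds = np.asarray(preds, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))   # w_0 (oldest) is the largest
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# three overlapping predictions of a_t, made 2, 1, and 0 steps ago
a_exec = temporal_ensemble([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])
```

Setting \(m = 0\) recovers a plain average; larger \(m\) makes the executed action stickier to older predictions.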

3.3 CVAE Latent

ACT does not use purely deterministic regression. It introduces a latent variable \(z\) to encode an action style or mode:

\[ z \sim q_\phi(z \mid o_t, s_t, \mathbf{a}_{t:t+H-1}) \]

During training, the encoder sees the observation and the future action chunk and compresses that mode into the latent; during inference, the model falls back to a stable prior-based latent and outputs a chunk directly.
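This training/inference asymmetry in the latent can be sketched in a few lines. Only the sampling logic is shown; the encoder that would produce `mu` and `logvar` from the observation and expert chunk is omitted, and the latent size is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, logvar, training):
    """Latent sampling with the CVAE asymmetry ACT relies on.
    Training: reparameterization trick, z = mu + sigma * eps.
    Inference: no future expert actions exist, so fall back to the
    prior mean z = 0 as a deterministic default."""
    if training:
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps
    return np.zeros_like(mu)

mu, logvar = np.ones(32), np.zeros(32)          # toy posterior parameters
z_train = sample_z(mu, logvar, training=True)   # stochastic, style-conditioned
z_test = sample_z(mu, logvar, training=False)   # stable default style
```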


4. Model Architecture

4.1 Overall structure

```mermaid
graph LR
    subgraph Inputs["Inputs"]
        IMG[Multi-view images] --> ENC
        STATE[Robot state] --> ENC
    end

    subgraph TrainEncoder["CVAE encoder during training"]
        GT[Future action chunk] --> LAT
        ENC --> LAT
        LAT["Infer q(z | obs, actions)"] --> Z[z]
    end

    subgraph Decoder["Transformer decoder"]
        ENC[Vision + state encoding] --> TF[Transformer]
        Z --> TF
        TF --> CHUNK[Predicted action chunk]
    end

    style Inputs fill:#e3f2fd
    style TrainEncoder fill:#fff3e0
    style Decoder fill:#e8f5e9
```

4.2 Inputs and outputs

Typical ACT inputs include:

  • multi-view RGB images
  • arm joint state or end-effector state
  • gripper state

The output is:

  • a future action chunk of length \(H\)
  • often bimanual joint or end-effector control commands
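As a concrete shape sketch: the batch layout below is illustrative, with assumed values for batch size, horizon, and image resolution; the 14-dimensional state/action follows the common ALOHA convention of two arms with 6 joints plus 1 gripper each.

```python
import numpy as np

B, H = 8, 100                      # batch size, chunk horizon (assumed values)
n_cams, img_h, img_w = 4, 480, 640 # assumed camera count and resolution
state_dim = action_dim = 14        # 2 arms x (6 joints + 1 gripper), ALOHA-style

batch = {
    "images": np.zeros((B, n_cams, 3, img_h, img_w), dtype=np.float32),
    "state": np.zeros((B, state_dim), dtype=np.float32),
    # training target: the future expert action chunk
    "actions": np.zeros((B, H, action_dim), dtype=np.float32),
}
```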

4.3 Division of labor

| Component | Role |
|---|---|
| Visual encoder | Extract features from multiple camera views |
| State encoder | Represent proprioception |
| CVAE encoder | During training, infer a latent style variable from the future expert actions |
| Transformer decoder | Produce a future action chunk conditioned on observation and latent |

5. Mathematical Objective

5.1 Chunk prediction

ACT learns:

\[ p_\theta(\mathbf{a}_{t:t+H-1} \mid o_t, s_t, z) \]

instead of only:

\[ p_\theta(a_t \mid o_t) \]

5.2 Reconstruction loss

The decoded chunk should match the expert future chunk. The original ACT uses an L1 loss, which the authors report gives more precise action modeling than the more common L2:

\[ \mathcal{L}_{\text{recon}} = \sum_{j=0}^{H-1} \left\| \hat{a}_{t+j} - a^*_{t+j} \right\|_1 \]

5.3 KL regularization

The CVAE posterior is kept close to a standard normal prior:

\[ \mathcal{L}_{\text{KL}} = D_{\text{KL}}\left(q_\phi(z \mid o_t, s_t, \mathbf{a}^*) \,\|\, \mathcal{N}(0, I)\right) \]

5.4 Total loss

The training objective is typically:

\[ \mathcal{L}_{\text{ACT}} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}} \]

where \(\beta\) controls the strength of the latent regularization.
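Putting the two terms together, a minimal numerical sketch of the objective follows. The L1 reconstruction term and \(\beta = 10\) follow the reference ACT code as I understand it (both are assumptions, not prescriptions), and the KL term uses the standard closed form for a diagonal Gaussian against \(\mathcal{N}(0, I)\).

```python
import numpy as np

def act_loss(pred_chunk, expert_chunk, mu, logvar, beta=10.0):
    """L_ACT = L_recon + beta * L_KL.
    beta = 10 is assumed as the reference-code kl_weight default; the
    reference implementation uses an L1 reconstruction term, used here too."""
    recon = np.mean(np.abs(pred_chunk - expert_chunk))
    # closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon + beta * kl

chunk = np.zeros((16, 14))   # H = 16 steps of 14-D bimanual actions
loss = act_loss(chunk, chunk, mu=np.zeros(8), logvar=np.zeros(8))
```

A perfect reconstruction with a posterior exactly matching the prior gives zero loss, which is a quick sanity check when wiring up training.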


6. Inference Procedure

Training and inference are not symmetric in ACT, which is one of the most important things to understand about it.

```mermaid
graph TD
    T1[Training] --> T2[Observation + future expert actions]
    T2 --> T3[CVAE encoder infers z]
    T3 --> T4[Transformer outputs chunk]
    T4 --> T5[Reconstruction loss + KL]

    I1[Inference] --> I2[Current observation only]
    I2 --> I3[No future expert actions available]
    I3 --> I4[Use prior / fixed latent]
    I4 --> I5[Transformer outputs chunk]
    I5 --> I6[Temporal ensembling for smooth execution]

    style T1 fill:#fff3e0
    style I1 fill:#e8f5e9
```

6.1 Why inference often uses z = 0

Because training pulls the posterior toward a standard normal prior, inference can use the prior mean as a stable default:

\[ z_{\text{test}} = 0 \]

This gives the model a deterministic and stable default action style at test time.

6.2 Why temporal ensembling is still needed

Even with chunk prediction, contact-rich tasks still suffer from vision noise, slight state mismatch, and occlusions. Overlapping chunk fusion helps reduce:

  • jitter
  • arm desynchronization
  • small oscillations near contact boundaries

7. Why ACT Mattered in 2023

7.1 It showed that few high-quality demos plus the right structure can solve precision tasks

What made ACT memorable was not parameter count, but that it achieved strong success on fine-grained bimanual tasks on ALOHA with roughly 50 demonstrations.

7.2 It turned chunk-based policy learning into a clear line

After ACT, predicting a short future chunk stopped feeling exotic. More and more later work treated the following as natural:

  • predict several future actions at once
  • treat time horizon as an explicit design variable
  • represent actions as chunks, tokens, or generated trajectories

7.3 Together with ALOHA, it formed a low-cost embodied research stack

ACT's impact came not only from the paper, but from the fact that it sat inside a practical loop of low-cost hardware, teleoperation, demonstrations, and code. For many labs, it became one of the first bimanual manipulation systems they could realistically reproduce.


8. Limitations of ACT

ACT is important, but its boundaries are equally important:

| Limitation | Explanation |
|---|---|
| Narrower task scope | Best suited for tabletop, short-horizon manipulation-heavy tasks |
| Limited generalization | It is not a web-scale pretrained foundation model |
| Sensitive to the data distribution | Camera placement, action definitions, and teleop style matter a lot |
| Weak semantics | It does not inherit strong language or world knowledge like a VLA |
| Still requires control tuning | Chunk size, replanning frequency, and temporal weights matter |

9. Relationship to Neighboring Methods

9.1 ACT vs BC

  • BC usually predicts a single step
  • ACT predicts an action chunk and models multimodality with a latent variable

9.2 ACT vs Diffusion Policy

  • ACT uses CVAE + Transformer and is usually lighter at inference
  • Diffusion Policy models richer action distributions but pays more inference cost

9.3 ACT vs BeT / tokenized actions

  • ACT predicts continuous action chunks
  • BeT and later tokenizers push harder on turning actions into discrete tokens for Transformer-native modeling

9.4 ACT vs VLA / OpenVLA / pi0 / FAST

| Dimension | ACT | Diffusion Policy | OpenVLA | pi0 |
|---|---|---|---|---|
| Main role | Fine manipulation policy | Generative manipulation policy | Open VLA | Flow-based VLA |
| Inputs | image + state | image + state | image + language + state | image + language + state |
| Output | continuous action chunk | diffusion-generated action sequence | discrete action tokens | continuous chunk / flow |
| Language ability | weak | weak | strong | strong |
| Typical data scale | tens to hundreds of demos | medium offline imitation datasets | close to one million episodes | large multi-robot data |
| Core value | establishes the chunked policy paradigm | stronger multimodal action modeling | reproducible VLA mainline | stronger continuous action generation and cross-embodiment transfer |

The key relationship is not "who replaced whom," but:

  • ACT made chunked action prediction mainstream
  • Diffusion Policy pushed generative action modeling further
  • OpenVLA and pi0 moved action modeling into larger vision-language-action foundation models

10. Engineering Considerations

10.1 Data organization

ACT depends heavily on high-quality demonstrations. In practice you need:

  • synchronized multi-camera streams
  • tight alignment between robot state and image frames
  • a consistent bimanual action definition
  • stable teleoperation style

10.2 Key hyperparameters

| Hyperparameter | Purpose | If increased | If decreased |
|---|---|---|---|
| chunk_size / horizon | predicted future length | stronger short-term planning, more stability demands | closer to single-step control |
| beta | KL regularization strength | cleaner latent space, possible underfitting | more flexibility, more overfitting risk |
| batch_size | optimization stability | smoother gradients, more memory | noisier updates |
| camera_views | visual coverage | fewer occlusions, more engineering overhead | easier setup, less information |
| replan interval | how often to predict a new chunk | more responsive, more compute | cheaper, less adaptive |
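One convenient way to keep these knobs explicit is a small config object. The defaults below are illustrative starting points (e.g. `chunk_size = 100` and `kl_weight = 10` are commonly seen defaults around the reference ACT code), not prescriptions; `replan_every` is a hypothetical name for the replan interval.

```python
from dataclasses import dataclass

@dataclass
class ACTConfig:
    """Illustrative hyperparameter bundle; values are assumed common
    starting points, not prescriptions."""
    chunk_size: int = 100       # predicted horizon H
    kl_weight: float = 10.0     # beta in L_ACT = L_recon + beta * L_KL
    batch_size: int = 8
    camera_views: int = 4
    replan_every: int = 1       # predict a new chunk every step for ensembling

cfg = ACTConfig(chunk_size=50)  # shorter horizon for a more reactive policy
```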

10.3 Training cost

One practical strength of ACT is that it is much cheaper to train than 7B-scale VLAs. It is a strong choice as:

  • a first baseline for small labs doing bimanual precision manipulation
  • the first model on top of a LeRobot / ALOHA-style data stack
  • a diagnosis tool for separating data problems from model problems

11. Good and Bad Use Cases

Better fit

  • bimanual tabletop precision tasks
  • short- to medium-horizon manipulation
  • small but high-quality demonstration datasets
  • tasks where temporal consistency matters but diffusion inference is too expensive

Worse fit

  • strong language understanding and open-vocabulary instructions
  • direct zero-shot transfer across many robot platforms
  • long-horizon high-level planning
  • open-world foundation-model ambitions

12. Open-Source Ecosystem and Reproduction

A major reason ACT still deserves attention is that it still has a clear open-source path:

  • ALOHA / project page: low-cost bimanual teleoperation + ACT
  • official code repo: tonyzhaozh/act
  • LeRobot: keeps ACT as an important baseline inside a unified training framework

A practical engineering path is often:

  1. validate the pipeline first with LeRobot or simulation
  2. move to real ALOHA-style bimanual collection
  3. only then decide whether to upgrade to Diffusion Policy or a larger VLA

13. Why It Matters but Is Not the Endpoint

ACT's historical value is that it made "the action chunk" into a first-class prediction object;
but it did not solve the biggest foundation-model-era problems: large-scale cross-embodiment pretraining, strong language generalization, or open-world long-horizon planning.

So the most accurate positioning is:

  • not the endpoint
  • not an obsolete baseline
  • a bridge

If you are building a mental model of the field, read it together with the related notes linked at the top: Model Roadmap, Imitation Learning, and VLA Models.


14. References

  • Zhao et al., Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, RSS 2023
  • ACT repository: https://github.com/tonyzhaozh/act
  • Tony Zhao / ALOHA + ACT project page: https://tonyzhaozh.github.io/
  • LeRobot ACT docs: https://huggingface.co/docs/lerobot/act
  • Chi et al., Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023
  • Kim et al., OpenVLA: An Open-Source Vision-Language-Action Model, 2024
  • Black et al., pi0: A Vision-Language-Action Flow Model for General Robot Control, 2024
  • Physical Intelligence, FAST: Efficient Robot Action Tokenization, 2025
