Open-Source Model Summary

This article compiles the major open-source models in embodied intelligence, organizes them by category, and provides quick-start guides for core entries. For practitioners, choosing the right open-source starting point can significantly accelerate R&D.

Related notes: Model Roadmap | VLA Models | ACT Model | Foundation Models for Robotics | Open-Source Frameworks


1. VLA Models (Vision-Language-Action)

| Model | Year | Institution | Parameters | Supported Robots | Action Repr. | License | GitHub |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022 | Google | 35M | Everyday Robots | Discrete Token | Apache 2.0 | google-research/robotics_transformer |
| Octo | 2023 | UC Berkeley | 93M | Multi-embodiment | Continuous/Diffusion | MIT | octo-models/octo |
| OpenVLA | 2024 | Stanford/Berkeley | 7B | Multi-embodiment | Discrete Token | MIT | openvla/openvla |
| pi0 | 2024 | Physical Intelligence | 3B | Multi-embodiment | Flow Matching | Apache 2.0 | physical-intelligence/openpi |
| RoboFlamingo | 2023 | ByteDance | 3B | Tabletop manipulation | Continuous Regression | MIT | RoboFlamingo/RoboFlamingo |
| GR-1 | 2024 | ByteDance | ~1B | Tabletop manipulation | Continuous Regression | Partially open | bytedance/GR-1 |
| HPT | 2024 | MIT | ~300M | Multi-embodiment | Continuous/Diffusion | MIT | liruiw/HPT |
| RDT-1B | 2024 | Tsinghua | 1.2B | Bimanual manipulation | Diffusion | Apache 2.0 | thu-ml/RoboticsDiffusionTransformer |
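
The "Action Repr." column is the main design split in this table: RT-1 and OpenVLA discretize each action dimension into 256 bins and predict them as language-model tokens, while Octo, RDT-1B, and pi0 generate continuous actions via diffusion or flow matching. A minimal sketch of the discrete side, with hypothetical uniform bounds (real models derive bin edges from dataset statistics, e.g. 1st/99th percentiles):

import numpy as np

# Hypothetical per-dimension bounds; real models compute these from the dataset
ACTION_LOW, ACTION_HIGH = -1.0, 1.0
NUM_BINS = 256

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to one of 256 discrete bins."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    normalized = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return np.minimum((normalized * NUM_BINS).astype(np.int32), NUM_BINS - 1)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Recover the bin-center continuous action from discrete tokens."""
    centers = (tokens.astype(np.float32) + 0.5) / NUM_BINS
    return centers * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# A 7-DoF action round-trips with at most half a bin of quantization error
a = np.array([0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 1.0])
print(tokens_to_action(action_to_tokens(a)))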

2. Policy Models

| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| Diffusion Policy | 2023 | Columbia/MIT | Diffusion denoising | General manipulation | MIT | real-stanford/diffusion_policy |
| ACT | 2023 | Stanford/Google | CVAE + Transformer | Bimanual fine manipulation | MIT | tonyzhaozh/act |
| ACT++ | 2024 | Stanford | Improved ACT | Bimanual manipulation | MIT | Same as ACT repo |
| DP3 | 2024 | PKU | 3D Diffusion Policy | 3D manipulation | MIT | YanjieZe/3D-Diffusion-Policy |
| DexCap | 2024 | Stanford | Dexterous hand IL | Dexterous manipulation | MIT | j96w/DexCap |
| BeT | 2022 | NYU | Behavior Transformer | Multimodal actions | MIT | notmahi/bet |
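
Several rows in this table share one mechanism: generate an action chunk by iterative denoising. A toy DDPM sampling loop shows the skeleton; `noise_pred_net` stands in for the trained observation-conditioned noise predictor (here a zero stub so the snippet runs):

import torch

T = 100                                        # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def noise_pred_net(a, t, obs):                 # stand-in for the trained network
    return torch.zeros_like(a)

def sample_action_chunk(obs, horizon=16, action_dim=7):
    a = torch.randn(1, horizon, action_dim)    # start from pure noise
    for t in reversed(range(T)):
        eps = noise_pred_net(a, t, obs)
        # DDPM posterior mean: strip the predicted noise at step t
        a = (a - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a

print(sample_action_chunk(obs=None).shape)     # torch.Size([1, 16, 7])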

3. Planning and Reasoning Models

| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| SayCan | 2022 | Google | LLM + Affordance | Mobile manipulation | Apache 2.0 | google-research/saycan |
| VoxPoser | 2023 | Stanford | LLM → 3D Value Maps | Tabletop manipulation | MIT | huangwl18/VoxPoser |
| Code as Policies | 2023 | Google | LLM code generation | General | Apache 2.0 | google-research/google-research |
| ManipLLM | 2024 | PKU | Multimodal manipulation | Manipulation reasoning | MIT | clorislili/ManipLLM |
| SpatialVLM | 2024 | Google | Spatial reasoning VLM | Spatial understanding | Apache 2.0 | Paper only |
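
SayCan's selection rule is compact enough to show directly: score each candidate skill by the product of the LLM's probability of the skill given the instruction ("is it useful?") and a learned affordance value ("can the robot do it here?"). Both scoring functions below are toy stand-ins for the LLM and value function:

import math

def llm_log_prob(instruction, skill):
    # Toy stand-in for log P(skill | instruction) from a language model
    return 0.0 if skill.split()[-1] in instruction else -5.0

def affordance_value(skill, observation):
    # Toy stand-in for a learned value function: P(skill succeeds in this state)
    return observation.get(skill, 0.1)

def select_skill(instruction, skills, observation):
    # SayCan: argmax over p_LLM(skill | instruction) * p_affordance(skill | state)
    def score(skill):
        return math.exp(llm_log_prob(instruction, skill)) * affordance_value(skill, observation)
    return max(skills, key=score)

skills = ["pick up the sponge", "go to the table", "pick up the apple"]
obs = {"pick up the apple": 0.9, "go to the table": 0.8}
print(select_skill("bring me an apple", skills, obs))  # -> pick up the apple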

4. World Models

| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| Dreamer v3 | 2023 | DeepMind | RSSM + Actor-Critic | General RL | MIT | danijar/dreamerv3 |
| Cosmos | 2025 | NVIDIA | Video Diffusion/AR | Physical AI simulation | NVIDIA Open Model License | NVIDIA/Cosmos |
| Genesis | 2024 | Community | Differentiable physics | Robot simulation | Apache 2.0 | Genesis-Embodied-AI/Genesis |

5. Unified Frameworks and Tools

| Tool | Year | Institution | Purpose | License | GitHub |
|---|---|---|---|---|---|
| LeRobot | 2024 | Hugging Face | Unified robot learning framework | Apache 2.0 | huggingface/lerobot |
| droid_policy_learning | 2024 | DROID Team | DROID dataset policy learning | MIT | droid-dataset/droid_policy_learning |
| robomimic | 2021 | Stanford | Imitation learning framework | MIT | ARISE-Initiative/robomimic |

6. Quick-Start Guides

6.1 Octo: Open-Source Multi-Embodiment Foundation Model

Octo is one of the most beginner-friendly open-source VLAs, maintained by the UC Berkeley team.

Installation:

# Create conda environment
conda create -n octo python=3.10
conda activate octo

# Install Octo
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
git clone https://github.com/octo-models/octo.git
cd octo
pip install -e .

Load pretrained model:

from octo.model.octo_model import OctoModel
import jax
import numpy as np

# Load pretrained model (automatically downloads from Hugging Face)
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")

# Inference: observations carry a batch and a history-window dimension
observation = {
    "image_primary": np.random.randint(          # primary camera image,
        0, 256, (1, 1, 256, 256, 3), np.uint8),  # shape (batch, window, H, W, C)
    "timestep_pad_mask": np.array([[True]]),
}
task = model.create_tasks(texts=["pick up the red cup"])
action = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(action.shape)  # (batch, action_horizon, action_dim)
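
In deployment, the sampled chunk is typically executed open-loop for several steps before the model is queried again. A minimal control-loop sketch, assuming a Gym-style `env` and a `format_observation` helper (both hypothetical) that packages camera frames into the observation dict above:

import jax
import numpy as np

rng = jax.random.PRNGKey(0)
obs = env.reset()                        # hypothetical robot environment
for step in range(200):
    rng, key = jax.random.split(rng)
    actions = model.sample_actions(format_observation(obs), task, rng=key)
    for a in np.asarray(actions[0]):     # execute the whole predicted chunk
        obs = env.step(a)                # before sampling again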

Fine-tune on custom data:

# Using RLDS format dataset
python scripts/finetune.py \
    --config.pretrained_path="hf://rail-berkeley/octo-base-1.5" \
    --config.data.dataset_name="your_dataset" \
    --config.training.batch_size=128 \
    --config.training.num_steps=50000
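
Octo's fine-tuning pipeline expects RLDS (TensorFlow Datasets) episodes. The key names below follow the Open X-Embodiment convention and vary across datasets, so treat this as a sketch of the expected structure rather than an exact schema:

episode = {
    "steps": [  # one dict per timestep
        {
            "observation": {
                "image_primary": "... (H, W, 3) uint8 camera frame ...",
                "proprio": "... robot state, e.g. joint angles ...",
            },
            "action": "... e.g. 7-D end-effector delta + gripper ...",
            "language_instruction": "pick up the red cup",
            "is_first": True, "is_last": False, "is_terminal": False,
        },
    ]
}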

Key parameters:

| Parameter | Recommended Value | Notes |
|---|---|---|
| batch_size | 128–256 | As large as GPU memory allows |
| learning_rate | 3e-4 | Consider reducing for fine-tuning |
| num_steps | 50K–200K | Depends on data volume |
| Action head | diffusion | Diffusion head typically performs better |

6.2 OpenVLA: 7B-Scale VLA

Installation:

conda create -n openvla python=3.10
conda activate openvla

git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .

Inference:

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

# Load model (bfloat16 inference requires ~16GB GPU memory)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Inference
image = Image.open("robot_obs.png")
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# unnorm_key selects the dataset statistics used to un-normalize the action;
# match it to your robot/dataset
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# action is a 7-dimensional end-effector action vector
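
A common convention for the 7 dimensions is delta end-effector pose plus gripper, though the exact semantics depend on the dataset the unnorm_key refers to. The robot calls below are hypothetical placeholders:

# Unpack under the usual (dx, dy, dz, droll, dpitch, dyaw, gripper) convention
dx, dy, dz, droll, dpitch, dyaw, gripper = action
# robot.move_ee_delta((dx, dy, dz), (droll, dpitch, dyaw))  # hypothetical API
# robot.set_gripper(gripper)                                # e.g. 1.0 = open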

LoRA fine-tuning (consumer GPU):

torchrun --nproc_per_node=1 vla-scripts/finetune.py \
    --vla_path "openvla/openvla-7b" \
    --data_root_dir /path/to/data \
    --dataset_name "your_dataset" \
    --use_lora True \
    --lora_rank 32 \
    --batch_size 8 \
    --learning_rate 2e-5

Hardware requirements:

| Operation | Minimum GPU | Recommended GPU |
|---|---|---|
| Inference | 1x A100 40GB | 1x A100 80GB |
| LoRA fine-tuning | 1x A100 40GB | 2x A100 80GB |
| Full fine-tuning | 4x A100 80GB | 8x A100 80GB |

6.3 LeRobot: Unified Robot Learning Framework

LeRobot is a unified framework released by Hugging Face, supporting multiple policies (ACT, Diffusion Policy, VQ-BeT, etc.), and is one of the best choices for getting started with robot learning.

Installation:

conda create -n lerobot python=3.10
conda activate lerobot

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[all]"

Train ACT policy:

python lerobot/scripts/train.py \
    policy=act \
    env=aloha \
    dataset_repo_id=lerobot/aloha_sim_insertion_human \
    training.num_epochs=2000 \
    training.batch_size=8

Train Diffusion Policy:

python lerobot/scripts/train.py \
    policy=diffusion \
    env=pusht \
    dataset_repo_id=lerobot/pusht \
    training.num_epochs=200
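
To sanity-check a trained or Hub-hosted policy in simulation, LeRobot ships an evaluation script; flag names have shifted across LeRobot versions, so check --help on your install. A usage sketch:

python lerobot/scripts/eval.py \
    -p lerobot/diffusion_pusht \
    eval.n_episodes=10 \
    eval.batch_size=10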

Visualize training results:

python lerobot/scripts/visualize_dataset.py \
    --repo-id lerobot/aloha_sim_insertion_human \
    --episode-index 0

Policies supported by LeRobot:

| Policy | Action Repr. | Use Case | Features |
|---|---|---|---|
| ACT | CVAE generation | Bimanual manipulation | Good temporal consistency |
| Diffusion Policy | DDPM diffusion | General manipulation | Multimodal actions |
| VQ-BeT | VQ-VAE + GPT | General | Discretized actions |
| TDMPC | MPC + World Model | Tasks requiring planning | Online optimization |

7. Model Selection Guide

7.1 Selection by Scenario

graph TB
    START[Select Open-Source Model] --> Q1{Task Type?}

    Q1 -->|"Single-arm tabletop"| Q2a{Need language instructions?}
    Q1 -->|"Bimanual manipulation"| ACT_RDT["ACT / RDT-1B"]
    Q1 -->|"Humanoid robot"| GR1["GR-1"]
    Q1 -->|"Mobile manipulation"| Q2b{Need planning?}

    Q2a -->|"Yes"| Q3a{Budget?}
    Q2a -->|"No"| DP["Diffusion Policy"]

    Q3a -->|"Consumer GPU"| OCTO["Octo (93M)"]
    Q3a -->|"Multi-card A100"| OPENVLA["OpenVLA (7B)"]

    Q2b -->|"Yes"| SAYCAN["SayCan / Code as Policies"]
    Q2b -->|"No"| PI0["pi0"]

    style START fill:#e3f2fd
    style ACT_RDT fill:#e8f5e9
    style GR1 fill:#e8f5e9
    style OCTO fill:#e8f5e9
    style OPENVLA fill:#e8f5e9
    style DP fill:#e8f5e9
    style PI0 fill:#e8f5e9

7.2 Selection by Resources

| Available Resources | Recommended Model | Notes |
|---|---|---|
| 1x RTX 3090/4090 | Diffusion Policy, ACT | Policy models, no large model needed |
| 1x A100 40GB | Octo, HPT | Medium-scale VLA |
| 1x A100 80GB | OpenVLA (LoRA) | 7B model LoRA fine-tuning |
| 4x+ A100 | OpenVLA (full), pi0 | Full-parameter training |
| CPU only | LeRobot (simulation) | Validate in simulation first |

7.3 Rapid Prototyping vs. Production Deployment

| Phase | Recommended Approach | Rationale |
|---|---|---|
| Proof of concept | LeRobot + ACT / DP | Mature framework, works out of the box |
| Research experiments | Octo / OpenVLA | Reproducible, active community |
| Product prototype | pi0 / custom model | Best performance, but more engineering required |
| Mass production | Distilled/quantized small model | Real-time performance and cost |

If your target is "low-cost, fine-grained, bimanual, few-shot demonstrations," ACT is still one of the strongest baselines to try first: not because it is the newest model, but because its data loop, reproduction cost, and temporal-consistency design are unusually clear. See ACT Model.
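
For intuition, that temporal-consistency design (the temporal ensemble) fits in a few lines: at every timestep the policy predicts a chunk of the next K actions, and the action actually executed blends all still-valid predictions with exponential weights, w_i = exp(-m*i) with w0 the oldest prediction, as in the ACT paper. `predict_chunk` below is a hypothetical policy call:

import numpy as np

K, m = 100, 0.01   # chunk length and ensemble decay, following the ACT paper
history = []       # (time_predicted, chunk) pairs

def ensemble_action(t, predict_chunk):
    """Blend all still-valid chunk predictions for timestep t."""
    history.append((t, predict_chunk()))                # chunk: (K, action_dim)
    valid = [(t0, c[t - t0]) for t0, c in history if t - t0 < K]
    valid.sort(key=lambda x: x[0])                      # oldest prediction first
    w = np.exp(-m * np.arange(len(valid)))              # w0 = oldest, per ACT
    preds = np.stack([a for _, a in valid])
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# Usage sketch: at each control step, query the policy and execute the blend
# action = ensemble_action(t, lambda: policy(obs))      # policy is hypothetical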


8. Important Notes

8.1 Reproduction Tips

  1. Version pinning: Pin dependency versions (especially JAX, PyTorch, and transformers)
  2. Data format: Different models use different data formats (RLDS, HDF5, LeRobot format); plan for conversion
  3. Action space: Verify the action space definition used during training (absolute vs. relative, end-effector vs. joint); see the sketch after this list
  4. Camera calibration: Image resolution, cropping, and camera intrinsics/extrinsics must match the training-time setup
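
A minimal illustration of tip 3: converting between absolute and per-step delta end-effector actions, a frequent source of silent train/deploy mismatches. Positions only, for brevity; rotations need proper SO(3) composition:

import numpy as np

def absolute_to_delta(ee_positions: np.ndarray) -> np.ndarray:
    """(T, 3) absolute EE positions -> (T-1, 3) per-step deltas."""
    return np.diff(ee_positions, axis=0)

def delta_to_absolute(start: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Replay deltas from a start position to recover absolute positions."""
    return start + np.cumsum(deltas, axis=0)

traj = np.cumsum(np.random.randn(10, 3) * 0.01, axis=0)  # toy trajectory
deltas = absolute_to_delta(traj)
assert np.allclose(delta_to_absolute(traj[0], deltas), traj[1:])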

8.2 Common Issues

  • Unstable actions: Check that the action normalization statistics are correct (see the sketch after this list)
  • Poor generalization: Increase data diversity and try data augmentation
  • Slow inference: Use action chunking to reduce the number of inference calls, or quantize the model
  • Large sim-to-real gap: Increase domain randomization, or fine-tune on real data
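
A quick sanity check for the first issue: the statistics applied at inference must be the ones computed on the training data, or actions come out wrongly scaled. A self-contained sketch with toy data:

import numpy as np

def normalize(actions, stats):
    return (actions - stats["mean"]) / (stats["std"] + 1e-8)

def unnormalize(actions, stats):
    return actions * (stats["std"] + 1e-8) + stats["mean"]

train_actions = np.random.randn(1000, 7) * [5, 5, 5, 1, 1, 1, 1]  # toy dataset
stats = {"mean": train_actions.mean(0), "std": train_actions.std(0)}

normed = normalize(train_actions, stats)
print(normed.mean(0).round(3), normed.std(0).round(3))  # ~0 and ~1 per dimension
assert np.allclose(unnormalize(normed, stats), train_actions)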

References:

  • Octo Team, "Octo: An Open-Source Generalist Robot Policy", RSS 2024
  • Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
  • Cadene et al., "LeRobot: State-of-the-art Machine Learning for Real-World Robotics", 2024
  • Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", RSS 2023
  • Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (ACT), RSS 2023
