Open-Source Model Summary
This article compiles the major open-source models in embodied intelligence, organizes them by category, and provides quick-start guides for core entries. For practitioners, choosing the right open-source starting point can significantly accelerate R&D.
Related notes: Model Roadmap | VLA Models | ACT Model | Foundation Models for Robotics | Open-Source Frameworks
1. VLA Models (Vision-Language-Action)
| Model | Year | Institution | Parameters | Supported Robots | Action Repr. | License | GitHub |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022 | Google | 35M | Everyday Robots | Discrete Token | Apache 2.0 | google-research/robotics_transformer |
| Octo | 2023 | UC Berkeley | 93M | Multi-embodiment | Continuous/Diffusion | MIT | octo-models/octo |
| OpenVLA | 2024 | Stanford/Berkeley | 7B | Multi-embodiment | Discrete Token | MIT | openvla/openvla |
| pi0 | 2024 | Physical Intelligence | 3B | Multi-embodiment | Flow Matching | Apache 2.0 | physical-intelligence/openpi |
| RoboFlamingo | 2023 | ByteDance | 3B | Tabletop manipulation | Continuous Regression | MIT | RoboFlamingo/RoboFlamingo |
| GR-1 | 2024 | ByteDance | ~1B | Tabletop manipulation | Continuous Regression | Partially open | bytedance/GR-1 |
| HPT | 2024 | MIT | ~300M | Multi-embodiment | Continuous/Diffusion | MIT | liruiw/HPT |
| RDT-1B | 2024 | Tsinghua | 1.2B | Bimanual manipulation | Diffusion | Apache 2.0 | thu-ml/RoboticsDiffusionTransformer |
2. Policy Models
| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| Diffusion Policy | 2023 | Columbia/MIT | Diffusion denoising | General manipulation | MIT | real-stanford/diffusion_policy |
| ACT | 2023 | Stanford/Google | CVAE + Transformer | Bimanual fine manipulation | MIT | tonyzhaozh/act |
| ACT++ | 2024 | Stanford | Improved ACT | Bimanual manipulation | MIT | Same as ACT repo |
| DP3 | 2024 | PKU | 3D Diffusion Policy | 3D manipulation | MIT | YanjieZe/3D-Diffusion-Policy |
| DexCap | 2024 | Stanford | Dexterous hand IL | Dexterous manipulation | MIT | j96w/DexCap |
| BeT | 2022 | NYU | Behavior Transformer | Multimodal actions | MIT | notmahi/bet |
3. Planning and Reasoning Models
| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| SayCan | 2022 | Google | LLM + Affordance | Mobile manipulation | Apache 2.0 | google-research/saycan |
| VoxPoser | 2023 | Stanford | LLM → 3D Value Maps | Tabletop manipulation | MIT | huangwl18/VoxPoser |
| Code as Policies | 2023 | Google | LLM code generation | General | Apache 2.0 | google-research/google-research |
| ManipLLM | 2024 | PKU | Multimodal manipulation | Manipulation reasoning | MIT | clorislili/ManipLLM |
| SpatialVLM | 2024 | Google DeepMind | Spatial reasoning VLM | Spatial understanding | Apache 2.0 | Paper only |
4. World Models
| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| Dreamer v3 | 2023 | DeepMind/Danijar | RSSM + Actor-Critic | General RL | MIT | danijar/dreamerv3 |
| Cosmos | 2025 | NVIDIA | Video Diffusion/AR | Physical AI simulation | NVIDIA Open Model License | NVIDIA/Cosmos |
| Genesis | 2024 | Community | Differentiable physics | Robot simulation | Apache 2.0 | Genesis-Embodied-AI/Genesis |
5. Unified Frameworks and Tools
| Tool | Year | Institution | Purpose | License | GitHub |
|---|---|---|---|---|---|
| LeRobot | 2024 | Hugging Face | Unified robot learning framework | Apache 2.0 | huggingface/lerobot |
| droid_policy_learning | 2024 | DROID Team | DROID dataset policy learning | MIT | droid-dataset/droid_policy_learning |
| robomimic | 2021 | Stanford | Imitation learning framework | MIT | ARISE-Initiative/robomimic |
6. Quick-Start Guides
6.1 Octo: Open-Source Multi-Embodiment Foundation Model
Octo is one of the most beginner-friendly open-source VLAs, maintained by the UC Berkeley team.
Installation:
# Create conda environment
conda create -n octo python=3.10
conda activate octo
# Install Octo
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
git clone https://github.com/octo-models/octo.git
cd octo
pip install -e .
Load pretrained model:
import jax
import numpy as np
from octo.model.octo_model import OctoModel
# Load pretrained model (automatically downloads from Hugging Face)
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")
# Inference on a dummy observation (batch=1, window of 1 timestep)
observation = {
    "image_primary": np.random.randint(0, 256, (1, 1, 256, 256, 3), dtype=np.uint8),  # primary camera image
    "timestep_pad_mask": np.array([[True]]),
}
task = model.create_tasks(texts=["pick up the red cup"])
action = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(action.shape)  # (1, horizon, action_dim)
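Note that VLA policies like Octo are typically trained on normalized actions, so sampled actions generally have to be mapped back to the robot's native scale before execution. A minimal sketch of that unnormalization step, where the `stats` dict layout and values are illustrative placeholders (not Octo's actual statistics API):

```python
import numpy as np

# Hypothetical per-dimension action statistics saved at training time.
stats = {"mean": np.zeros(7), "std": np.full(7, 0.05)}

def unnormalize(action, stats):
    """Map a normalized action back to the robot's native units."""
    return action * stats["std"] + stats["mean"]

normalized = np.full(7, 2.0)          # "two standard deviations" on every dim
real = unnormalize(normalized, stats)
print(real[0])  # 0.1
```

Using statistics from the wrong dataset (or skipping this step entirely) is one of the most common causes of wildly out-of-range actions.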
Fine-tune on custom data:
# Using RLDS format dataset
python scripts/finetune.py \
--config.pretrained_path="hf://rail-berkeley/octo-base-1.5" \
--config.data.dataset_name="your_dataset" \
--config.training.batch_size=128 \
--config.training.num_steps=50000
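The fine-tuning script expects data in RLDS format, i.e. TFDS datasets of episodes, each holding a sequence of steps. The nested layout can be sketched as plain Python dicts; the step field names follow the RLDS convention, while the observation keys here are illustrative:

```python
# One RLDS-style episode: a sequence of steps with per-step flags.
episode = {
    "steps": [
        {
            "observation": {"image_primary": "<HxWx3 uint8>", "proprio": "<state vector>"},
            "action": "<action vector>",
            "language_instruction": "pick up the red cup",
            "is_first": t == 0,
            "is_last": t == 2,
            "is_terminal": t == 2,
        }
        for t in range(3)
    ],
}

print(len(episode["steps"]), episode["steps"][0]["is_first"])  # 3 True
```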
Key parameters:
| Parameter | Recommended Value | Notes |
|---|---|---|
| batch_size | 128–256 | As large as GPU memory allows |
| learning_rate | 3e-4 | May reduce for fine-tuning |
| num_steps | 50K–200K | Depends on data volume |
| Action head | diffusion | Diffusion head typically performs better |
6.2 OpenVLA: 7B-Scale VLA
Installation:
conda create -n openvla python=3.10
conda activate openvla
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .
Inference:
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
# Load model (requires ~16GB GPU memory in bfloat16)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# Inference
image = Image.open("robot_obs.png")
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)  # 7-dimensional action vector
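In a real deployment, the single prediction above sits inside a closed control loop: observe, predict, act, repeat. A minimal sketch of that loop, where `get_camera_image`, `predict_action`, and `execute` are hypothetical stand-ins for your camera driver, the VLA call, and the robot interface:

```python
import numpy as np

# Hypothetical stand-ins for the real camera and robot interfaces.
def get_camera_image():
    """Return the latest RGB frame from the wrist or scene camera."""
    return np.zeros((224, 224, 3), dtype=np.uint8)

def predict_action(image):
    """Placeholder for the VLA call shown above; returns a 7-D action
    (xyz delta, rotation delta, gripper)."""
    return np.zeros(7)

def execute(action):
    """Send one action to the robot controller."""
    pass

# Closed-loop control: observe, predict, act, at a fixed control rate.
for step in range(10):
    obs = get_camera_image()
    action = predict_action(obs)
    execute(action)
```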
LoRA fine-tuning (reduced memory footprint):
torchrun --nproc_per_node=1 vla-scripts/finetune.py \
--vla_path "openvla/openvla-7b" \
--data_root_dir /path/to/data \
--dataset_name "your_dataset" \
--use_lora True \
--lora_rank 32 \
--batch_size 8 \
--learning_rate 2e-5
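The memory savings behind the `--lora_rank 32` flag come from simple arithmetic: a rank-r adapter on a d×k weight matrix trains only r(d + k) parameters instead of d·k. A quick check for a hypothetical 4096×4096 projection layer:

```python
# LoRA replaces the update of a d x k weight with two low-rank factors:
# B (d x r) and A (r x k), so only r * (d + k) parameters are trained.
d, k, r = 4096, 4096, 32

full_params = d * k            # parameters updated by full fine-tuning
lora_params = r * (d + k)      # parameters in the LoRA factors

print(full_params)                  # 16777216
print(lora_params)                  # 262144
print(full_params // lora_params)   # 64  -> 64x fewer trainable params per layer
```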
Hardware requirements:
| Operation | Minimum GPU | Recommended GPU |
|---|---|---|
| Inference | 1x RTX 4090 24GB | 1x A100 40GB |
| LoRA fine-tuning | 1x A100 40GB | 2x A100 80GB |
| Full fine-tuning | 4x A100 80GB | 8x A100 80GB |
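These figures can be sanity-checked with a back-of-the-envelope estimate: weight memory is the parameter count times bytes per parameter, before activation and KV-cache overhead (which is why the ~13 GB of bfloat16 weights for the 7B model becomes roughly 16 GB in practice):

```python
params = 7e9  # OpenVLA-7B parameter count (approximate)

def weight_gb(params, bytes_per_param):
    """Memory for model weights alone, in GiB."""
    return params * bytes_per_param / 1024**3

print(round(weight_gb(params, 2), 1))  # bfloat16: ~13.0 GiB
print(round(weight_gb(params, 4), 1))  # float32:  ~26.1 GiB
```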
6.3 LeRobot: Unified Robot Learning Framework
LeRobot is a unified framework released by Hugging Face, supporting multiple policies (ACT, Diffusion Policy, VQ-BeT, etc.), and is one of the best choices for getting started with robot learning.
Installation:
conda create -n lerobot python=3.10
conda activate lerobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[all]"
Train ACT policy:
python lerobot/scripts/train.py \
policy=act \
env=aloha \
dataset_repo_id=lerobot/aloha_sim_insertion_human \
training.num_epochs=2000 \
training.batch_size=8
Train Diffusion Policy:
python lerobot/scripts/train.py \
policy=diffusion \
env=pusht \
dataset_repo_id=lerobot/pusht \
training.num_epochs=200
Visualize training results:
python lerobot/scripts/visualize_dataset.py \
--repo-id lerobot/aloha_sim_insertion_human \
--episode-index 0
Policies supported by LeRobot:
| Policy | Action Repr. | Use Case | Features |
|---|---|---|---|
| ACT | CVAE generation | Bimanual manipulation | Good temporal consistency |
| Diffusion Policy | DDPM diffusion | General manipulation | Multimodal actions |
| VQ-BeT | VQ-VAE + GPT | General | Discretized actions |
| TDMPC | MPC + World Model | Tasks requiring planning | Online optimization |
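The "good temporal consistency" listed for ACT comes from temporal ensembling: each forward pass predicts a chunk of future actions, and at execution time all chunks covering the current step are averaged with exponential weights favoring older predictions. A sketch of that ensembling step (the weighting scheme follows the ACT paper; the inputs are made up):

```python
import numpy as np

def ensemble(predictions, m=0.1):
    """Average overlapping chunk predictions for the current timestep.

    predictions: list of candidate actions for this step, oldest first.
    m: decay rate; larger m weights older predictions more heavily.
    """
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()
    return np.sum(weights[:, None] * np.asarray(predictions), axis=0)

preds = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two overlapping chunks
print(ensemble(preds, m=0.0))  # m=0 gives equal weights -> [0.5 0.5]
```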
7. Model Selection Guide
7.1 Selection by Scenario
graph TB
START[Select Open-Source Model] --> Q1{Task Type?}
Q1 -->|"Single-arm tabletop"| Q2a{Need language instructions?}
Q1 -->|"Bimanual manipulation"| ACT_RDT["ACT / RDT-1B"]
Q1 -->|"Humanoid robot"| GR1["GR-1"]
Q1 -->|"Mobile manipulation"| Q2b{Need planning?}
Q2a -->|"Yes"| Q3a{Budget?}
Q2a -->|"No"| DP["Diffusion Policy"]
Q3a -->|"Consumer GPU"| OCTO["Octo (93M)"]
Q3a -->|"Multi-card A100"| OPENVLA["OpenVLA (7B)"]
Q2b -->|"Yes"| SAYCAN["SayCan / Code as Policies"]
Q2b -->|"No"| PI0["pi0"]
style START fill:#e3f2fd
style ACT_RDT fill:#e8f5e9
style GR1 fill:#e8f5e9
style OCTO fill:#e8f5e9
style OPENVLA fill:#e8f5e9
style DP fill:#e8f5e9
style PI0 fill:#e8f5e9
7.2 Selection by Resources
| Available Resources | Recommended Model | Notes |
|---|---|---|
| 1x RTX 3090/4090 | Diffusion Policy, ACT | Policy models, no large model needed |
| 1x A100 40GB | Octo, HPT | Medium-scale VLA |
| 1x A100 80GB | OpenVLA (LoRA) | 7B model LoRA fine-tuning |
| 4x+ A100 | OpenVLA (full), pi0 | Full parameter training |
| CPU only | LeRobot (simulation) | Validate in simulation first |
7.3 Rapid Prototyping vs. Production Deployment
| Phase | Recommended Approach | Rationale |
|---|---|---|
| Proof of concept | LeRobot + ACT / DP | Well-developed framework, out of the box |
| Research experiments | Octo / OpenVLA | Reproducible, active community |
| Product prototype | pi0 / Custom model | Best performance, but requires more engineering |
| Mass production | Distilled/quantized small model | Real-time performance and cost |
If your target is "low-cost, fine-grained, bimanual, few-shot demonstrations," ACT is still one of the strongest baselines to try first. Not because it is the newest model, but because its data loop, reproduction cost, and temporal-consistency design are unusually clear. See ACT Model.
8. Important Notes
8.1 Reproduction Tips
- Version pinning: Pin exact dependency versions (especially JAX, PyTorch, and transformers)
- Data format: Different models use different data formats (RLDS, HDF5, LeRobot format); conversion is needed
- Action space: Make sure to verify the action space definition used during model training (absolute vs. relative, end-effector vs. joint)
- Camera calibration: Image resolution, cropping method, camera intrinsics/extrinsics must match training time
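The absolute-vs-relative pitfall in the action-space note can be made concrete: a policy trained on delta actions behaves nonsensically if fed absolute targets, and vice versa. A toy round-trip with made-up end-effector positions shows the two representations and how they interconvert:

```python
import numpy as np

# Absolute end-effector positions (x, y, z) over four timesteps.
abs_traj = np.array([
    [0.10, 0.00, 0.20],
    [0.12, 0.01, 0.20],
    [0.12, 0.03, 0.22],
    [0.15, 0.03, 0.25],
])

# Relative (delta) actions are frame-to-frame differences...
deltas = np.diff(abs_traj, axis=0)

# ...and integrating the deltas recovers the absolute trajectory.
recovered = abs_traj[0] + np.cumsum(deltas, axis=0)
assert np.allclose(recovered, abs_traj[1:])
```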
8.2 Common Issues
- Unstable actions: Check whether action normalization parameters are correct
- Poor generalization: Increase data diversity, try data augmentation
- Slow inference: Use action chunking to reduce inference calls, or model quantization
- Large Sim2Real gap: Increase domain randomization, or fine-tune on real data
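The "slow inference" tip is easy to quantify: executing a chunk of k actions per forward pass divides the number of model calls by k. A sketch with hypothetical numbers:

```python
import math

horizon = 300   # control steps in an episode (hypothetical)
chunk = 8       # actions executed per model call

calls_per_step = horizon                    # no chunking: one call per step
calls_chunked = math.ceil(horizon / chunk)  # with action chunking

print(calls_per_step, calls_chunked)  # 300 38
```

The trade-off is reactivity: a longer chunk means the policy responds to new observations less often, so chunk length is usually tuned against the task's dynamics.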
References:
- Octo Team, "Octo: An Open-Source Generalist Robot Policy", 2023
- Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
- Cadene et al., "LeRobot: State-of-the-art Machine Learning for Real-World Robotics", 2024
- Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", RSS 2023
- Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (ACT), RSS 2023