Open-Source Model Summary
This article compiles the major open-source models in embodied intelligence, organizes them by category, and provides quick-start guides for core entries. For practitioners, choosing the right open-source starting point can significantly accelerate R&D.
Related notes: Model Roadmap | VLA Models | ACT Model | Foundation Models for Robotics | Open-Source Frameworks
1. VLA Models (Vision-Language-Action)
| Model | Year | Institution | Parameters | Supported Robots | Action Repr. | License | GitHub |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022 | Google | 35M | Everyday Robots | Discrete Token | Apache 2.0 | google-research/robotics_transformer |
| Octo | 2023 | UC Berkeley | 93M | Multi-embodiment | Continuous/Diffusion | MIT | octo-models/octo |
| OpenVLA | 2024 | Stanford/Berkeley | 7B | Multi-embodiment | Discrete Token | MIT | openvla/openvla |
| pi0 | 2024 | Physical Intelligence | 3B | Multi-embodiment | Flow Matching | Apache 2.0 | physical-intelligence/openpi |
| RoboFlamingo | 2023 | ByteDance | 3B | Tabletop manipulation | Continuous Regression | MIT | RoboFlamingo/RoboFlamingo |
| GR-1 | 2024 | ByteDance | ~1B | Tabletop manipulation | Continuous Regression | Partially open | bytedance/GR-1 |
| HPT | 2024 | MIT | ~300M | Multi-embodiment | Continuous/Diffusion | MIT | liruiw/HPT |
| RDT-1B | 2024 | Tsinghua | 1.2B | Bimanual manipulation | Diffusion | Apache 2.0 | thu-ml/RoboticsDiffusionTransformer |
2. Policy Models
| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| Diffusion Policy | 2023 | Columbia/MIT | Diffusion denoising | General manipulation | MIT | real-stanford/diffusion_policy |
| ACT | 2023 | Stanford/Google | CVAE + Transformer | Bimanual fine manipulation | MIT | tonyzhaozh/act |
| ACT++ | 2024 | Stanford | Improved ACT | Bimanual manipulation | MIT | Same as ACT repo |
| DP3 | 2024 | PKU | 3D Diffusion Policy | 3D manipulation | MIT | YanjieZe/3D-Diffusion-Policy |
| DexCap | 2024 | Stanford | Dexterous hand IL | Dexterous manipulation | MIT | j96w/DexCap |
| BeT | 2022 | NYU | Behavior Transformer | Multimodal actions | MIT | notmahi/bet |
3. Planning and Reasoning Models
| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| SayCan | 2022 | Google | LLM + Affordance | Mobile manipulation | Apache 2.0 | google-research/saycan |
| VoxPoser | 2023 | Stanford | LLM → 3D Value Maps | Tabletop manipulation | MIT | huangwl18/VoxPoser |
| Code as Policies | 2023 | Google | LLM code generation | General | Apache 2.0 | google-research/google-research |
| ManipLLM | 2024 | PKU | Multimodal manipulation | Manipulation reasoning | MIT | clorislili/ManipLLM |
| SpatialVLM | 2024 | Google DeepMind | Spatial reasoning VLM | Spatial understanding | Apache 2.0 | Paper only |
4. World Models
| Model | Year | Institution | Method | Use Case | License | GitHub |
|---|---|---|---|---|---|---|
| Dreamer v3 | 2023 | DeepMind/Danijar | RSSM + Actor-Critic | General RL | MIT | danijar/dreamerv3 |
| Cosmos | 2025 | NVIDIA | Video Diffusion/AR | Physical AI simulation | NVIDIA Open Model License | NVIDIA/Cosmos |
| Genesis | 2024 | Community | Differentiable physics | Robot simulation | Apache 2.0 | Genesis-Embodied-AI/Genesis |
5. Unified Frameworks and Tools
| Tool | Year | Institution | Purpose | License | GitHub |
|---|---|---|---|---|---|
| LeRobot | 2024 | Hugging Face | Unified robot learning framework | Apache 2.0 | huggingface/lerobot |
| droid_policy_learning | 2024 | DROID Team | DROID dataset policy learning | MIT | droid-dataset/droid_policy_learning |
| robomimic | 2021 | Stanford | Imitation learning framework | MIT | ARISE-Initiative/robomimic |
6. Quick-Start Guides
6.1 Octo: Open-Source Multi-Embodiment Foundation Model
Octo is one of the most beginner-friendly open-source VLAs, maintained by the UC Berkeley team.
Installation:
# Create conda environment
conda create -n octo python=3.10
conda activate octo
# Install Octo
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
git clone https://github.com/octo-models/octo.git
cd octo
pip install -e .
Load pretrained model:
import jax
import numpy as np
from octo.model.octo_model import OctoModel
# Load pretrained model (automatically downloads from Hugging Face)
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")
# Inference on a dummy observation (batch=1, window of 1 timestep)
observation = {
    "image_primary": np.random.randint(0, 256, (1, 1, 256, 256, 3), dtype=np.uint8),  # primary camera image
    "timestep_pad_mask": np.array([[True]]),
}
task = model.create_tasks(texts=["pick up the red cup"])
action = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(action.shape)  # (1, horizon, action_dim)
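Note that VLA policies like Octo are typically trained on normalized actions, so sampled actions generally have to be mapped back to the robot's native scale before execution. A minimal sketch of that unnormalization step, where the `stats` dict layout and values are illustrative placeholders (not Octo's actual statistics API):

```python
import numpy as np

# Hypothetical per-dimension action statistics saved at training time.
stats = {"mean": np.zeros(7), "std": np.full(7, 0.05)}

def unnormalize(action, stats):
    """Map a normalized action back to the robot's native units."""
    return action * stats["std"] + stats["mean"]

normalized = np.full(7, 2.0)          # "two standard deviations" on every dim
real = unnormalize(normalized, stats)
print(real[0])  # 0.1
```

Using statistics from the wrong dataset (or skipping this step entirely) is one of the most common causes of wildly out-of-range actions.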
Fine-tune on custom data:
# Using RLDS format dataset
python scripts/finetune.py \
--config.pretrained_path="hf://rail-berkeley/octo-base-1.5" \
--config.data.dataset_name="your_dataset" \
--config.training.batch_size=128 \
--config.training.num_steps=50000
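The fine-tuning script expects data in RLDS format, i.e. TFDS datasets of episodes, each holding a sequence of steps. The nested layout can be sketched as plain Python dicts; the step field names follow the RLDS convention, while the observation keys here are illustrative:

```python
# One RLDS-style episode: a sequence of steps with per-step flags.
episode = {
    "steps": [
        {
            "observation": {"image_primary": "<HxWx3 uint8>", "proprio": "<state vector>"},
            "action": "<action vector>",
            "language_instruction": "pick up the red cup",
            "is_first": t == 0,
            "is_last": t == 2,
            "is_terminal": t == 2,
        }
        for t in range(3)
    ],
}

print(len(episode["steps"]), episode["steps"][0]["is_first"])  # 3 True
```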
Key parameters:
| Parameter | Recommended Value | Notes |
|---|---|---|
| batch_size | 128–256 | As large as GPU memory allows |
| learning_rate | 3e-4 | May reduce for fine-tuning |
| num_steps | 50K–200K | Depends on data volume |
| Action head | diffusion | Diffusion head typically performs better |
6.2 OpenVLA: 7B-Scale VLA
Installation:
conda create -n openvla python=3.10
conda activate openvla
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .
Inference:
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
# Load model (requires ~16GB GPU memory in bfloat16)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# Inference
image = Image.open("robot_obs.png")
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)  # 7-dimensional action vector
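In a real deployment, the single prediction above sits inside a closed control loop: observe, predict, act, repeat. A minimal sketch of that loop, where `get_camera_image`, `predict_action`, and `execute` are hypothetical stand-ins for your camera driver, the VLA call, and the robot interface:

```python
import numpy as np

# Hypothetical stand-ins for the real camera and robot interfaces.
def get_camera_image():
    """Return the latest RGB frame from the wrist or scene camera."""
    return np.zeros((224, 224, 3), dtype=np.uint8)

def predict_action(image):
    """Placeholder for the VLA call shown above; returns a 7-D action
    (xyz delta, rotation delta, gripper)."""
    return np.zeros(7)

def execute(action):
    """Send one action to the robot controller."""
    pass

# Closed-loop control: observe, predict, act, at a fixed control rate.
for step in range(10):
    obs = get_camera_image()
    action = predict_action(obs)
    execute(action)
```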
LoRA fine-tuning (reduced memory footprint):
torchrun --nproc_per_node=1 vla-scripts/finetune.py \
--vla_path "openvla/openvla-7b" \
--data_root_dir /path/to/data \
--dataset_name "your_dataset" \
--use_lora True \
--lora_rank 32 \
--batch_size 8 \
--learning_rate 2e-5
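The memory savings behind the `--lora_rank 32` flag come from simple arithmetic: a rank-r adapter on a d×k weight matrix trains only r(d + k) parameters instead of d·k. A quick check for a hypothetical 4096×4096 projection layer:

```python
# LoRA replaces the update of a d x k weight with two low-rank factors:
# B (d x r) and A (r x k), so only r * (d + k) parameters are trained.
d, k, r = 4096, 4096, 32

full_params = d * k            # parameters updated by full fine-tuning
lora_params = r * (d + k)      # parameters in the LoRA factors

print(full_params)                  # 16777216
print(lora_params)                  # 262144
print(full_params // lora_params)   # 64  -> 64x fewer trainable params per layer
```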
Hardware requirements:
| Operation | Minimum GPU | Recommended GPU |
|---|---|---|
| Inference | 1x RTX 4090 24GB | 1x A100 40GB |
| LoRA fine-tuning | 1x A100 40GB | 2x A100 80GB |
| Full fine-tuning | 4x A100 80GB | 8x A100 80GB |
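These figures can be sanity-checked with a back-of-the-envelope estimate: weight memory is the parameter count times bytes per parameter, before activation and KV-cache overhead (which is why the ~13 GB of bfloat16 weights for the 7B model becomes roughly 16 GB in practice):

```python
params = 7e9  # OpenVLA-7B parameter count (approximate)

def weight_gb(params, bytes_per_param):
    """Memory for model weights alone, in GiB."""
    return params * bytes_per_param / 1024**3

print(round(weight_gb(params, 2), 1))  # bfloat16: ~13.0 GiB
print(round(weight_gb(params, 4), 1))  # float32:  ~26.1 GiB
```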
6.3 LeRobot: Unified Robot Learning Framework
LeRobot is a unified framework released by Hugging Face, supporting multiple policies (ACT, Diffusion Policy, VQ-BeT, etc.), and is one of the best choices for getting started with robot learning.
Installation:
conda create -n lerobot python=3.10
conda activate lerobot
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[all]"
Train ACT policy:
python lerobot/scripts/train.py \
policy=act \
env=aloha \
dataset_repo_id=lerobot/aloha_sim_insertion_human \
training.num_epochs=2000 \
training.batch_size=8
Train Diffusion Policy:
python lerobot/scripts/train.py \
policy=diffusion \
env=pusht \
dataset_repo_id=lerobot/pusht \
training.num_epochs=200
Visualize training results:
python lerobot/scripts/visualize_dataset.py \
--repo-id lerobot/aloha_sim_insertion_human \
--episode-index 0
Policies supported by LeRobot:
| Policy | Action Repr. | Use Case | Features |
|---|---|---|---|
| ACT | CVAE generation | Bimanual manipulation | Good temporal consistency |
| Diffusion Policy | DDPM diffusion | General manipulation | Multimodal actions |
| VQ-BeT | VQ-VAE + GPT | General | Discretized actions |
| TDMPC | MPC + World Model | Tasks requiring planning | Online optimization |
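The "good temporal consistency" listed for ACT comes from temporal ensembling: each forward pass predicts a chunk of future actions, and at execution time all chunks covering the current step are averaged with exponential weights favoring older predictions. A sketch of that ensembling step (the weighting scheme follows the ACT paper; the inputs are made up):

```python
import numpy as np

def ensemble(predictions, m=0.1):
    """Average overlapping chunk predictions for the current timestep.

    predictions: list of candidate actions for this step, oldest first.
    m: decay rate; larger m weights older predictions more heavily.
    """
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()
    return np.sum(weights[:, None] * np.asarray(predictions), axis=0)

preds = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two overlapping chunks
print(ensemble(preds, m=0.0))  # m=0 gives equal weights -> [0.5 0.5]
```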
7. Model Selection Guide
7.1 Selection by Scenario
graph TB
START[Select Open-Source Model] --> Q1{Task Type?}
Q1 -->|"Single-arm tabletop"| Q2a{Need language instructions?}
Q1 -->|"Bimanual manipulation"| ACT_RDT["ACT / RDT-1B"]
Q1 -->|"Humanoid robot"| GR1["GR-1"]
Q1 -->|"Mobile manipulation"| Q2b{Need planning?}
Q2a -->|"Yes"| Q3a{Budget?}
Q2a -->|"No"| DP["Diffusion Policy"]
Q3a -->|"Consumer GPU"| OCTO["Octo (93M)"]
Q3a -->|"Multi-card A100"| OPENVLA["OpenVLA (7B)"]
Q2b -->|"Yes"| SAYCAN["SayCan / Code as Policies"]
Q2b -->|"No"| PI0["pi0"]
style START fill:#e3f2fd
style ACT_RDT fill:#e8f5e9
style GR1 fill:#e8f5e9
style OCTO fill:#e8f5e9
style OPENVLA fill:#e8f5e9
style DP fill:#e8f5e9
style PI0 fill:#e8f5e9
7.2 Selection by Resources
| Available Resources | Recommended Model | Notes |
|---|---|---|
| 1x RTX 3090/4090 | Diffusion Policy, ACT | Policy models, no large model needed |
| 1x A100 40GB | Octo, HPT | Medium-scale VLA |
| 1x A100 80GB | OpenVLA (LoRA) | 7B model LoRA fine-tuning |
| 4x+ A100 | OpenVLA (full), pi0 | Full parameter training |
| CPU only | LeRobot (simulation) | Validate in simulation first |
7.3 Rapid Prototyping vs. Production Deployment
| Phase | Recommended Approach | Rationale |
|---|---|---|
| Proof of concept | LeRobot + ACT / DP | Well-developed framework, out of the box |
| Research experiments | Octo / OpenVLA | Reproducible, active community |
| Product prototype | pi0 / Custom model | Best performance, but requires more engineering |
| Mass production | Distilled/quantized small model | Real-time performance and cost |
If your target is "low-cost, fine-grained, bimanual, few-shot demonstrations," ACT is still one of the strongest baselines to try first. Not because it is the newest model, but because its data loop, reproduction cost, and temporal-consistency design are unusually clear. See ACT Model.
8. Important Notes
8.1 Reproduction Tips
- Version pinning: Pin exact dependency versions (especially JAX, PyTorch, and transformers)
- Data format: Different models use different data formats (RLDS, HDF5, LeRobot format); conversion is needed
- Action space: Make sure to verify the action space definition used during model training (absolute vs. relative, end-effector vs. joint)
- Camera calibration: Image resolution, cropping method, camera intrinsics/extrinsics must match training time
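The absolute-vs-relative pitfall in the action-space note can be made concrete: a policy trained on delta actions behaves nonsensically if fed absolute targets, and vice versa. A toy round-trip with made-up end-effector positions shows the two representations and how they interconvert:

```python
import numpy as np

# Absolute end-effector positions (x, y, z) over four timesteps.
abs_traj = np.array([
    [0.10, 0.00, 0.20],
    [0.12, 0.01, 0.20],
    [0.12, 0.03, 0.22],
    [0.15, 0.03, 0.25],
])

# Relative (delta) actions are frame-to-frame differences...
deltas = np.diff(abs_traj, axis=0)

# ...and integrating the deltas recovers the absolute trajectory.
recovered = abs_traj[0] + np.cumsum(deltas, axis=0)
assert np.allclose(recovered, abs_traj[1:])
```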
8.2 Common Issues
- Unstable actions: Check whether action normalization parameters are correct
- Poor generalization: Increase data diversity, try data augmentation
- Slow inference: Use action chunking to reduce inference calls, or model quantization
- Large Sim2Real gap: Increase domain randomization, or fine-tune on real data
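The "slow inference" tip is easy to quantify: executing a chunk of k actions per forward pass divides the number of model calls by k. A sketch with hypothetical numbers:

```python
import math

horizon = 300   # control steps in an episode (hypothetical)
chunk = 8       # actions executed per model call

calls_per_step = horizon                    # no chunking: one call per step
calls_chunked = math.ceil(horizon / chunk)  # with action chunking

print(calls_per_step, calls_chunked)  # 300 38
```

The trade-off is reactivity: a longer chunk means the policy responds to new observations less often, so chunk length is usually tuned against the task's dynamics.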
References:
- Octo Team, "Octo: An Open-Source Generalist Robot Policy", 2023
- Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
- Cadene et al., "LeRobot: State-of-the-art Machine Learning for Real-World Robotics", 2024
- Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", RSS 2023
- Zhao et al., "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware" (ACT), RSS 2023