Datasets and Benchmarks
Progress in robot learning depends on high-quality datasets and standardized evaluation benchmarks. However, compared to NLP and computer vision, robot data is far more expensive to acquire and orders of magnitude smaller in scale. This article systematically reviews the major robot datasets, data format standards, and evaluation benchmarks.
Related notes: Teleoperation and Data Collection | VLA Models | Open-Source Model Summary
1. The Scarcity of Robot Data
1.1 Scale Comparison
Comparing robot data with data scales from other AI domains provides an intuitive sense of the gap:
| Domain | Representative Dataset | Data Scale | Acquisition Method |
|---|---|---|---|
| Language models | Common Crawl | ~15T tokens | Web crawling |
| Image recognition | LAION-5B | 5 billion image-text pairs | Web crawling |
| Video understanding | HD-VILA | 100 million video clips | Web crawling |
| Autonomous driving | nuScenes | 1.4 million frames | Vehicle-mounted sensors |
| Robot manipulation | Open X-Embodiment | ~1 million episodes | Teleoperation/RL |
| Single lab | Typical scale | 1K–100K episodes | Teleoperation |
The fundamental reason for robot data scarcity is cost:
\[\text{Data Cost} = \frac{\text{Hardware Cost} + \text{Labor Cost} + \text{Time Cost}}{\text{Collection Speed}}\]
- Hardware cost: A single robot arm costs approximately $5K–$50K
- Labor cost: Requires human operators for teleoperation (VR/teach pendant/force feedback)
- Time cost: A single demonstration typically takes 30 seconds to 5 minutes
- Collection speed: One person collects approximately 100–500 episodes per day
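To make the cost formula concrete, here is a back-of-the-envelope estimate in Python. All constants (hardware amortization, wage, collection rate) are illustrative assumptions, not measurements from any particular lab:

```python
# Rough per-episode cost estimate; every constant below is an assumption.
HARDWARE_COST = 20_000                 # USD: arm + gripper + cameras
HARDWARE_LIFETIME_EPISODES = 200_000   # episodes collected before replacement
OPERATOR_WAGE = 25                     # USD per hour
EPISODES_PER_HOUR = 40                 # ~90 s per episode incl. scene resets

hardware_per_episode = HARDWARE_COST / HARDWARE_LIFETIME_EPISODES  # $0.10
labor_per_episode = OPERATOR_WAGE / EPISODES_PER_HOUR              # $0.625

cost_per_episode = hardware_per_episode + labor_per_episode
print(f"~${cost_per_episode:.2f} per episode")                   # ~$0.73
print(f"~${cost_per_episode * 1_000_000:,.0f} for 1M episodes")  # ~$725,000
```

Even under these optimistic assumptions, matching a 15T-token web corpus in coverage is out of reach by teleoperation alone, which motivates the strategies below.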
1.2 Strategies for Addressing Data Scarcity
```mermaid
graph TB
    PROBLEM[Robot Data Scarcity] --> S1[More Efficient Collection]
    PROBLEM --> S2[Data Augmentation and Generation]
    PROBLEM --> S3[Cross-Source Data Aggregation]
    PROBLEM --> S4[Learning from Non-Robot Data]
    S1 --> S1a[Better Teleoperation Systems<br/>ALOHA, UMI, Bunny-VisionPro]
    S1 --> S1b[Autonomous Data Collection<br/>Reset-free RL]
    S2 --> S2a[Simulation Data Generation<br/>Randomization + Sim2Real]
    S2 --> S2b[Image Augmentation<br/>Color/Texture/Viewpoint Transforms]
    S2 --> S2c[Video Generation Models<br/>UniSim Data Augmentation]
    S3 --> S3a[Open X-Embodiment<br/>Multi-Institution Data Aggregation]
    S3 --> S3b[DROID<br/>Standardized Collection Protocol]
    S4 --> S4a[Learning from Human Videos<br/>R3M, VIP]
    S4 --> S4b[Web Data Pretraining<br/>RT-2]
```
2. Major Datasets
2.1 Large-Scale Aggregated Datasets
Open X-Embodiment (Google DeepMind, 2023)
| Attribute | Value |
|---|---|
| Scale | 1 million+ episodes |
| Sources | 33 sub-datasets from 21 institutions |
| Robot types | 22 different morphologies |
| Task descriptions | 160,000+ types |
| Data format | RLDS (TensorFlow Datasets) |
| Storage size | ~1.3TB |
Included sub-datasets (partial):
| Sub-dataset | Episodes | Robot | Tasks |
|---|---|---|---|
| RT-1 Robot Action | 130K | Everyday Robots | Tabletop manipulation |
| Bridge V2 | 60K | WidowX | Kitchen manipulation |
| Language Table | 442K | xArm | Language-guided block pushing |
| TACO-RL | 6K | Franka | Manipulation + RL |
| BC-Z | 26K | Google Robot | Multi-task |
| Cable Routing | 1K | UR5 | Cable routing |
DROID (2024)
| Attribute | Value |
|---|---|
| Scale | 76,000 episodes |
| Sources | Standardized collection across 13 institutions |
| Robot | Franka Emika Panda |
| Collection method | VR controller teleoperation (Oculus Quest 2) |
| Scenes | 564 distinct scenes |
| Annotations | Natural language instructions + manipulation type labels |
Core value of DROID:
- Unified collection protocol ensures data quality consistency
- Multi-scene data covers real-world diversity
- Standardized data format facilitates model training
2.2 Domain-Specific Datasets
Bridge V2 (UC Berkeley, 2023)
| Attribute | Value |
|---|---|
| Scale | 60,096 episodes |
| Robot | WidowX 250 (6-DoF) |
| Scenes | 24 kitchen/tabletop environments |
| Objects | 100+ everyday items |
| Control | End-effector pose control |
| Frequency | 5 Hz |
RH20T (Shanghai Jiao Tong University, 2023)
| Attribute | Value |
|---|---|
| Scale | 110,000+ episodes |
| Robots | Multiple (Franka, UR5, xArm, etc.) |
| Tasks | 147 manipulation tasks |
| Features | Rich multimodal annotations |
| Sensors | RGB + depth + force/torque + tactile |
AgiBot World (AgiBot, 2025)
| Attribute | Value |
|---|---|
| Scale | 1 million+ episodes (target) |
| Robot | AgiBot proprietary platform |
| Scenes | Industrial + home |
| Features | First large-scale open-source robot dataset from China |
| Data quality | Automated quality filtering pipeline |
RoboTurk (Stanford, 2018)
| Attribute | Value |
|---|---|
| Scale | 2,000+ demonstrations |
| Collection method | Cloud crowdsourcing (browser-based teleoperation) |
| Robot | Sawyer |
| Contribution | First exploration of crowdsourced robot data collection |
3. Data Format Standards
Different datasets use different storage formats; format conversion is a common pain point in practice.
| Format | Used by | Basis | Features | Suitable for |
|---|---|---|---|---|
| RLDS | Open X-Embodiment, Octo | TensorFlow Datasets | Standardized episode structure | Large-scale pretraining |
| LeRobot format | LeRobot, HuggingFace | Parquet + video | Small size, HuggingFace ecosystem | Rapid prototyping |
| HDF5 | robomimic, RH20T | HDF5 | Flexible nested structure | Research experiments |
| zarr | Diffusion Policy | Zarr | Chunked storage, parallelism-friendly | Large-scale training |
| rosbag | ROS ecosystem | ROS | Raw sensor recordings | Data collection |
RLDS (Reinforcement Learning Datasets) is the standard format adopted by Open X-Embodiment:
```python
import numpy as np

# Typical structure of an RLDS episode; np.zeros stands in for real data
episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((256, 256, 3), dtype=np.uint8),        # RGB image
                "wrist_image": np.zeros((128, 128, 3), dtype=np.uint8),  # Wrist camera (optional)
                "state": np.zeros(7, dtype=np.float32),                  # Proprioceptive state
            },
            "action": np.zeros(7, dtype=np.float32),  # 7DoF: dx,dy,dz,drx,dry,drz,gripper
            "reward": 0.0,
            "is_terminal": False,
            "is_first": True,
            "language_instruction": "pick up the red cup",
        },
        # ... subsequent steps
    ]
}
```
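For loading, RLDS datasets are read through TensorFlow Datasets. A minimal sketch, assuming a locally downloaded Open X-Embodiment sub-dataset at a placeholder path; observation key names vary across sub-datasets:

```python
import tensorflow_datasets as tfds

# Open an RLDS dataset from a local directory (placeholder path/version).
builder = tfds.builder_from_directory("/path/to/rlds_dataset/1.0.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # "steps" is itself a nested tf.data.Dataset of per-timestep dicts.
    for step in episode["steps"]:
        image = step["observation"]["image"]  # key names differ across sub-datasets
        action = step["action"]
```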
LeRobot adopts a more modern data format based on Parquet and video files:
```text
dataset/
├── meta/
│   ├── info.json                 # Dataset metadata
│   ├── episodes.jsonl            # Episode-level metadata
│   └── stats.json                # Statistics (mean, standard deviation)
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet   # Structured data
│       ├── episode_000001.parquet
│       └── ...
└── videos/
    └── chunk-000/
        ├── observation.images.top/
        │   ├── episode_000000.mp4
        │   └── ...
        └── observation.images.wrist/
            └── ...
```
Advantages of the LeRobot format:
- Video compression dramatically reduces storage space (compared to raw image frames)
- Seamless integration with HuggingFace Hub
- Parquet format supports efficient columnar queries
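A minimal loading sketch with the lerobot library, assuming the `LeRobotDataset` class and the public `lerobot/pusht` dataset on the Hub; import paths and attribute names have shifted between lerobot releases:

```python
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Downloads (and caches) the dataset from the HuggingFace Hub.
dataset = LeRobotDataset("lerobot/pusht")
print(dataset.num_episodes)  # episode count from meta/episodes.jsonl

# Items are dicts of tensors: decoded video frames, state, action, timestamps.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch["action"].shape)  # e.g. (32, action_dim)
```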
In practice, conversion between formats is frequently needed:
```bash
# RLDS → LeRobot (LeRobot provides official tools)
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir /path/to/rlds_dataset \
    --raw-format rlds \
    --repo-id your-hf-username/dataset-name

# HDF5 → LeRobot
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir /path/to/hdf5_data \
    --raw-format robomimic \
    --repo-id your-hf-username/dataset-name
```
4. Benchmarks and Evaluation
4.1 Simulation Benchmarks
SIMPLER (Google DeepMind, 2024)
| Attribute | Value |
|---|---|
| Positioning | Simulation alternative for evaluating real robot policies |
| Core value | Simulation evaluation scores highly correlate with real robot performance |
| Tasks | Tabletop manipulation based on Google Robot and WidowX |
| Features | Can evaluate VLA performance without real robots |
LIBERO (UT Austin, 2023)
| Attribute | Value |
|---|---|
| Positioning | Lifelong learning benchmark |
| Platform | MuJoCo simulation |
| Task suites | 5 suites: 10 tasks each, except LIBERO-100 with 100 tasks |
| Evaluation dimensions | Spatial generalization, object generalization, goal generalization, long-horizon tasks |
LIBERO's 5 task suites:
| Suite | Evaluation Dimension | Difficulty |
|---|---|---|
| LIBERO-Spatial | Same objects, different spatial relations | Low |
| LIBERO-Object | Same task, different objects | Medium |
| LIBERO-Goal | Same scene, different goals | Medium |
| LIBERO-Long | Long-horizon composite tasks | High |
| LIBERO-100 | 100 diverse tasks | High |
RLBench (Imperial College, 2020)
| Attribute | Value |
|---|---|
| Platform | CoppeliaSim + PyRep |
| Tasks | 100+ carefully designed manipulation tasks |
| Observations | RGB, depth, joint state, end-effector pose |
| Features | Each task provides variants for generalization testing |
Meta-World (Stanford, 2019)
| Attribute | Value |
|---|---|
| Platform | MuJoCo |
| Tasks | 50 tabletop manipulation tasks |
| Positioning | Multi-task learning and meta-learning evaluation |
| Evaluation modes | ML1 (single task), ML10 (10 tasks), ML45 (45 tasks), MT10, MT50 |
ManiSkill (UCSD/Hillbot, 2023)
| Attribute | Value |
|---|---|
| Platform | SAPIEN |
| Versions | ManiSkill2, ManiSkill3 |
| Tasks | 20+ manipulation task categories |
| Features | GPU-parallel environments, extremely fast |
| Objects | Uses interactable objects from PartNet-Mobility |
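A minimal environment-creation sketch, assuming ManiSkill3's gymnasium registration; the environment ID and keyword arguments are illustrative and may differ between ManiSkill versions:

```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401 -- importing registers the ManiSkill environments

# GPU-parallel simulation: 16 scenes stepped in one process (assumed kwargs).
env = gym.make("PickCube-v1", num_envs=16, obs_mode="rgbd",
               control_mode="pd_ee_delta_pose")
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # batched random actions
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```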
4.2 Benchmark Comparison
```mermaid
graph LR
    subgraph LowFidelityHighSpeed["Low Fidelity - High Speed"]
        MW[MetaWorld<br/>50 tasks<br/>MuJoCo]
        MS[ManiSkill<br/>20+ tasks<br/>SAPIEN/GPU]
    end
    subgraph MediumFidelity["Medium Fidelity"]
        LB[LIBERO<br/>5 suites<br/>MuJoCo]
        RB[RLBench<br/>100+ tasks<br/>CoppeliaSim]
    end
    subgraph HighFidelityLowSpeed["High Fidelity - Low Speed"]
        SP[SIMPLER<br/>Real-matched<br/>SAPIEN]
        REAL[Real Robot Evaluation<br/>Gold Standard]
    end
    MW --> LB
    MS --> LB
    LB --> SP
    RB --> SP
    SP --> REAL
```
4.3 Evaluation Metrics
| Metric | Definition | Applicable Scenario |
|---|---|---|
| Success Rate | Proportion of successfully completed tasks | Most commonly used primary metric |
| Partial Success | Partial completion (e.g., grasped but not placed correctly) | Long-horizon tasks |
| Generalization Gap | Difference in success rate between in-distribution and out-of-distribution | Generalization capability evaluation |
| Sample Efficiency | Data needed to reach a threshold success rate | Data efficiency evaluation |
| Inference Latency | Time for a single inference | Real-time performance evaluation |
| Cross-Embodiment Transfer | Zero-shot/few-shot performance on new robots | Transfer capability evaluation |
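To make the first three metrics concrete, a small sketch over per-trial results; the Wilson interval is included because real-robot evaluations often use only 20–50 trials per task (function names are my own):

```python
import math

def success_rate(results: list[bool]) -> float:
    """Fraction of trials marked successful."""
    return sum(results) / len(results)

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval -- meaningful error bars at small trial counts."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

in_dist = [True] * 17 + [False] * 3    # 20 in-distribution trials
out_dist = [True] * 11 + [False] * 9   # 20 out-of-distribution trials
gap = success_rate(in_dist) - success_rate(out_dist)  # generalization gap
print(f"in-dist {success_rate(in_dist):.0%}, OOD {success_rate(out_dist):.0%}, gap {gap:.0%}")
print("95% CI (in-dist):", wilson_interval(17, 20))
```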
5. Data Quality and Annotation
5.1 Key Dimensions of Data Quality
| Dimension | Description | Impact |
|---|---|---|
| Demonstration quality | Operator's proficiency level | Directly affects imitation learning ceiling |
| Diversity | Scene, object, lighting variations | Determines generalization capability |
| Annotation accuracy | Correspondence between language instructions and actions | Affects language-conditioned policies |
| Temporal alignment | Timestamp synchronization between images and actions | Critical for causal modeling |
| Calibration accuracy | Camera intrinsic/extrinsic parameter accuracy | Affects 3D-related tasks |
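For the temporal-alignment dimension above, a common preprocessing step is resampling asynchronous streams onto a shared clock. A minimal nearest-neighbor sketch; the 5 Hz camera / 100 Hz controller rates are illustrative:

```python
import numpy as np

def align_to_images(img_ts: np.ndarray, act_ts: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """For each image timestamp, select the nearest-in-time action sample."""
    idx = np.searchsorted(act_ts, img_ts)
    idx = np.clip(idx, 1, len(act_ts) - 1)
    # Pick whichever neighbor (before vs. after) is closer to the image timestamp.
    left_closer = (img_ts - act_ts[idx - 1]) < (act_ts[idx] - img_ts)
    idx = np.where(left_closer, idx - 1, idx)
    return actions[idx]

img_ts = np.arange(0.0, 10.0, 1 / 5)             # camera frames at 5 Hz
act_ts = np.arange(0.0, 10.0, 1 / 100) + 0.003   # control at 100 Hz, small clock offset
actions = np.random.randn(len(act_ts), 7)
aligned = align_to_images(img_ts, act_ts, actions)
print(aligned.shape)  # (50, 7): one action per camera frame
```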
5.2 Automated Quality Filtering
Recent work has begun exploring automated data quality assessment:
- Success rate filtering: Remove failed demonstrations
- Consistency checks: Detect temporal consistency of actions and observations
- VLM scoring: Use VLMs to evaluate the semantic correctness of demonstrations
- Outlier detection: Remove anomalous values in the action distribution (a minimal sketch follows below)
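As an illustration of the last filter, a simple per-dimension z-score check over an episode's actions; the threshold is an arbitrary choice for the sketch:

```python
import numpy as np

def flag_action_outliers(actions: np.ndarray, z_thresh: float = 4.0) -> np.ndarray:
    """Flag timesteps whose action deviates > z_thresh std devs in any dimension.

    actions: (T, action_dim) array. Returns a boolean mask of shape (T,).
    """
    mu = actions.mean(axis=0)
    sigma = actions.std(axis=0) + 1e-8  # guard against constant dimensions
    z = np.abs(actions - mu) / sigma
    return (z > z_thresh).any(axis=1)

actions = np.random.randn(1000, 7).astype(np.float32)
actions[123] = 50.0                     # inject an obviously corrupted timestep
mask = flag_action_outliers(actions)
print(f"flagged {mask.sum()} of {len(mask)} timesteps")  # expect 1 (the injected step)
```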
6. Future Directions
6.1 Paths for Data Scale Expansion
| Path | Representative | Feasibility | Scale Ceiling |
|---|---|---|---|
| More teleoperation | DROID, AgiBot World | High | Tens of millions of episodes |
| Simulation generation | ManiSkill, Isaac Gym | High | Hundreds of millions of episodes |
| Video generation models | UniSim | Medium | Theoretically unlimited |
| Autonomous exploration | Reset-free RL | Low (currently) | Depends on algorithmic progress |
| Internet video | RT-2 pretraining | High | Billions of videos |
6.2 Standardization Trends
- RLDS and LeRobot formats are becoming de facto standards
- HuggingFace Hub as a unified data distribution platform
- Dataset cards (Datasheets) recording collection conditions, biases, and usage limitations
6.3 Key Challenges
- Long-tail problem: Rare but important manipulation scenarios have very little data
- Negative samples: Most datasets contain only successful demonstrations, lacking failure cases
- Cross-embodiment standardization: Observation and action spaces differ greatly across robots
- Privacy and security: Data containing real environments may involve privacy concerns
- Evaluation fairness: Different models use different training data, making cross-comparison difficult
References:
- Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", 2023
- Khazatsky et al., "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset", RSS 2024
- Walke et al., "BridgeData V2: A Dataset for Robot Learning at Scale", CoRL 2023
- Fang et al., "RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot", 2023
- Li et al., "Evaluating Real-World Robot Manipulation Policies in Simulation" (SIMPLER), CoRL 2024
- Liu et al., "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning", NeurIPS 2023
- James et al., "RLBench: The Robot Learning Benchmark", RA-L 2020
- Yu et al., "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning", CoRL 2019
- Gu et al., "ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills", ICLR 2023