Datasets and Benchmarks
Progress in robot learning depends on high-quality datasets and standardized evaluation benchmarks. However, compared to NLP and computer vision, robot data is far more expensive to acquire and orders of magnitude smaller in scale. This article systematically reviews the major robot datasets, data format standards, and evaluation benchmarks.
Related notes: Teleoperation and Data Collection | VLA Models | Open-Source Model Summary
1. The Scarcity of Robot Data
1.1 Scale Comparison
Comparing robot data with data scales from other AI domains provides an intuitive sense of the gap:
| Domain | Representative Dataset | Data Scale | Acquisition Method |
|---|---|---|---|
| Language models | Common Crawl | ~15T tokens | Web crawling |
| Image recognition | LAION-5B | 5 billion image-text pairs | Web crawling |
| Video understanding | HD-VILA | 100 million video clips | Web crawling |
| Autonomous driving | nuScenes | 1.4 million frames | Vehicle-mounted sensors |
| Robot manipulation | Open X-Embodiment | ~1 million episodes | Teleoperation/RL |
| Single lab | Typical scale | 1K–100K episodes | Teleoperation |
The fundamental reason for robot data scarcity is cost:
\[\text{Data Cost} = \frac{\text{Hardware Cost} + \text{Labor Cost} + \text{Time Cost}}{\text{Collection Speed}}\]
- Hardware cost: A single robot arm costs approximately $5K–$50K
- Labor cost: Requires human operators for teleoperation (VR/teach pendant/force feedback)
- Time cost: A single demonstration typically takes 30 seconds to 5 minutes
- Collection speed: One person collects approximately 100–500 episodes per day
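To make the cost formula concrete, here is a back-of-the-envelope estimate in Python. All constants (hardware amortization, wage, collection rate) are illustrative assumptions, not measurements from any particular lab:

```python
# Rough per-episode cost estimate; every constant below is an assumption.
HARDWARE_COST = 20_000                 # USD: arm + gripper + cameras
HARDWARE_LIFETIME_EPISODES = 200_000   # episodes collected before replacement
OPERATOR_WAGE = 25                     # USD per hour
EPISODES_PER_HOUR = 40                 # ~90 s per episode incl. scene resets

hardware_per_episode = HARDWARE_COST / HARDWARE_LIFETIME_EPISODES  # $0.10
labor_per_episode = OPERATOR_WAGE / EPISODES_PER_HOUR              # $0.625

cost_per_episode = hardware_per_episode + labor_per_episode
print(f"~${cost_per_episode:.2f} per episode")                   # ~$0.73
print(f"~${cost_per_episode * 1_000_000:,.0f} for 1M episodes")  # ~$725,000
```

Even under these optimistic assumptions, matching a 15T-token web corpus in coverage is out of reach by teleoperation alone, which motivates the strategies below.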
1.2 Strategies for Addressing Data Scarcity
```mermaid
graph TB
    PROBLEM[Robot Data Scarcity] --> S1[More Efficient Collection]
    PROBLEM --> S2[Data Augmentation and Generation]
    PROBLEM --> S3[Cross-Source Data Aggregation]
    PROBLEM --> S4[Learning from Non-Robot Data]
    S1 --> S1a[Better Teleoperation Systems<br/>ALOHA, UMI, Bunny-VisionPro]
    S1 --> S1b[Autonomous Data Collection<br/>Reset-free RL]
    S2 --> S2a[Simulation Data Generation<br/>Randomization + Sim2Real]
    S2 --> S2b[Image Augmentation<br/>Color/Texture/Viewpoint Transforms]
    S2 --> S2c[Video Generation Models<br/>UniSim Data Augmentation]
    S3 --> S3a[Open X-Embodiment<br/>Multi-Institution Data Aggregation]
    S3 --> S3b[DROID<br/>Standardized Collection Protocol]
    S4 --> S4a[Learning from Human Videos<br/>R3M, VIP]
    S4 --> S4b[Web Data Pretraining<br/>RT-2]
```
2. Major Datasets
2.1 Large-Scale Aggregated Datasets
Open X-Embodiment (Google DeepMind, 2023)
| Attribute | Value |
|---|---|
| Scale | 1 million+ episodes |
| Sources | 33 sub-datasets from 21 institutions |
| Robot types | 22 different morphologies |
| Task descriptions | 160,000+ types |
| Data format | RLDS (TensorFlow Datasets) |
| Storage size | ~1.3TB |
Included sub-datasets (partial):
| Sub-dataset | Episodes | Robot | Tasks |
|---|---|---|---|
| RT-1 Robot Action | 130K | Everyday Robots | Tabletop manipulation |
| Bridge V2 | 60K | WidowX | Kitchen manipulation |
| Language Table | 442K | xArm | Language-guided block pushing |
| TACO-RL | 6K | Franka | Manipulation + RL |
| BC-Z | 26K | Google Robot | Multi-task |
| Cable Routing | 1K | UR5 | Cable routing |
DROID (2024)
| Attribute | Value |
|---|---|
| Scale | 76,000 episodes |
| Sources | Standardized collection across 13 institutions |
| Robot | Franka Emika Panda |
| Collection method | VR controller teleoperation (Oculus Quest 2) |
| Scenes | 564 distinct scenes |
| Annotations | Natural language instructions + manipulation type labels |
Core value of DROID:
- Unified collection protocol ensures data quality consistency
- Multi-scene data covers real-world diversity
- Standardized data format facilitates model training
2.2 Domain-Specific Datasets
Bridge V2 (UC Berkeley, 2023)
| Attribute | Value |
|---|---|
| Scale | 60,096 episodes |
| Robot | WidowX 250 (6-DoF) |
| Scenes | 24 kitchen/tabletop environments |
| Objects | 100+ everyday items |
| Control | End-effector pose control |
| Frequency | 5 Hz |
RH20T (Shanghai Jiao Tong University, 2023)
| Attribute | Value |
|---|---|
| Scale | 110,000+ episodes |
| Robots | Multiple (Franka, UR5, xArm, etc.) |
| Tasks | 147 manipulation tasks |
| Features | Rich multimodal annotations |
| Sensors | RGB + depth + force/torque + tactile |
AgiBot World (AgiBot, 2025)
| Attribute | Value |
|---|---|
| Scale | 1 million+ episodes (target) |
| Robot | AgiBot proprietary platform |
| Scenes | Industrial + home |
| Features | First large-scale open-source robot dataset from China |
| Data quality | Automated quality filtering pipeline |
RoboTurk (Stanford, 2018)
| Attribute | Value |
|---|---|
| Scale | 2,000+ demonstrations |
| Collection method | Cloud crowdsourcing (browser-based teleoperation) |
| Robot | Sawyer |
| Contribution | First exploration of crowdsourced robot data collection |
3. Data Format Standards
Different datasets use different storage formats; format conversion is a common pain point in practice.
| Format | Used by | Basis | Features | Suitable for |
|---|---|---|---|---|
| RLDS | Open X-Embodiment, Octo | TensorFlow Datasets | Standardized episode structure | Large-scale pretraining |
| LeRobot format | LeRobot, HuggingFace | Parquet + video | Small size, HuggingFace ecosystem | Rapid prototyping |
| HDF5 | robomimic, RH20T | HDF5 | Flexible nested structure | Research experiments |
| zarr | Diffusion Policy | Zarr | Chunked storage, parallelism-friendly | Large-scale training |
| rosbag | ROS ecosystem | ROS | Raw sensor recordings | Data collection |
RLDS (Reinforcement Learning Datasets) is the standard format adopted by Open X-Embodiment:
```python
import numpy as np

# Typical structure of an RLDS episode; np.zeros stands in for real data
episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((256, 256, 3), dtype=np.uint8),        # RGB image
                "wrist_image": np.zeros((128, 128, 3), dtype=np.uint8),  # Wrist camera (optional)
                "state": np.zeros(7, dtype=np.float32),                  # Proprioceptive state
            },
            "action": np.zeros(7, dtype=np.float32),  # 7DoF: dx,dy,dz,drx,dry,drz,gripper
            "reward": 0.0,
            "is_terminal": False,
            "is_first": True,
            "language_instruction": "pick up the red cup",
        },
        # ... subsequent steps
    ]
}
```
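For loading, RLDS datasets are read through TensorFlow Datasets. A minimal sketch, assuming a locally downloaded Open X-Embodiment sub-dataset at a placeholder path; observation key names vary across sub-datasets:

```python
import tensorflow_datasets as tfds

# Open an RLDS dataset from a local directory (placeholder path/version).
builder = tfds.builder_from_directory("/path/to/rlds_dataset/1.0.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # "steps" is itself a nested tf.data.Dataset of per-timestep dicts.
    for step in episode["steps"]:
        image = step["observation"]["image"]  # key names differ across sub-datasets
        action = step["action"]
```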
LeRobot adopts a more modern data format based on Parquet and video files:
```text
dataset/
├── meta/
│   ├── info.json                 # Dataset metadata
│   ├── episodes.jsonl            # Episode-level metadata
│   └── stats.json                # Statistics (mean, standard deviation)
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet   # Structured data
│       ├── episode_000001.parquet
│       └── ...
└── videos/
    └── chunk-000/
        ├── observation.images.top/
        │   ├── episode_000000.mp4
        │   └── ...
        └── observation.images.wrist/
            └── ...
```
Advantages of the LeRobot format:
- Video compression dramatically reduces storage space (compared to raw image frames)
- Seamless integration with HuggingFace Hub
- Parquet format supports efficient columnar queries
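A minimal loading sketch with the lerobot library, assuming the `LeRobotDataset` class and the public `lerobot/pusht` dataset on the Hub; import paths and attribute names have shifted between lerobot releases:

```python
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Downloads (and caches) the dataset from the HuggingFace Hub.
dataset = LeRobotDataset("lerobot/pusht")
print(dataset.num_episodes)  # episode count from meta/episodes.jsonl

# Items are dicts of tensors: decoded video frames, state, action, timestamps.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch["action"].shape)  # e.g. (32, action_dim)
```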
In practice, conversion between formats is frequently needed:
```bash
# RLDS → LeRobot (LeRobot provides official tools)
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir /path/to/rlds_dataset \
    --raw-format rlds \
    --repo-id your-hf-username/dataset-name

# HDF5 → LeRobot
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir /path/to/hdf5_data \
    --raw-format robomimic \
    --repo-id your-hf-username/dataset-name
```
4. Benchmarks and Evaluation
4.1 Simulation Benchmarks
SIMPLER (Google DeepMind, 2024)
| Attribute | Value |
|---|---|
| Positioning | Simulation alternative for evaluating real robot policies |
| Core value | Simulation evaluation scores highly correlate with real robot performance |
| Tasks | Tabletop manipulation based on Google Robot and WidowX |
| Features | Can evaluate VLA performance without real robots |
LIBERO (UT Austin, 2023)
| Attribute | Value |
|---|---|
| Positioning | Lifelong learning benchmark |
| Platform | MuJoCo simulation |
| Task suites | 5 suites: 10 tasks each, except LIBERO-100 with 100 tasks |
| Evaluation dimensions | Spatial generalization, object generalization, goal generalization, long-horizon tasks |
LIBERO's 5 task suites:
| Suite | Evaluation Dimension | Difficulty |
|---|---|---|
| LIBERO-Spatial | Same objects, different spatial relations | Low |
| LIBERO-Object | Same task, different objects | Medium |
| LIBERO-Goal | Same scene, different goals | Medium |
| LIBERO-Long | Long-horizon composite tasks | High |
| LIBERO-100 | 100 diverse tasks | High |
RLBench (Imperial College, 2020)
| Attribute | Value |
|---|---|
| Platform | CoppeliaSim + PyRep |
| Tasks | 100+ carefully designed manipulation tasks |
| Observations | RGB, depth, joint state, end-effector pose |
| Features | Each task provides variants for generalization testing |
Meta-World (Stanford, 2019)
| Attribute | Value |
|---|---|
| Platform | MuJoCo |
| Tasks | 50 tabletop manipulation tasks |
| Positioning | Multi-task learning and meta-learning evaluation |
| Evaluation modes | ML1 (single task), ML10 (10 tasks), ML45 (45 tasks), MT10, MT50 |
ManiSkill (UCSD/Hillbot, 2023)
| Attribute | Value |
|---|---|
| Platform | SAPIEN |
| Versions | ManiSkill2, ManiSkill3 |
| Tasks | 20+ manipulation task categories |
| Features | GPU-parallel environments, extremely fast |
| Objects | Uses interactable objects from PartNet-Mobility |
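A minimal environment-creation sketch, assuming ManiSkill3's gymnasium registration; the environment ID and keyword arguments are illustrative and may differ between ManiSkill versions:

```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401 -- importing registers the ManiSkill environments

# GPU-parallel simulation: 16 scenes stepped in one process (assumed kwargs).
env = gym.make("PickCube-v1", num_envs=16, obs_mode="rgbd",
               control_mode="pd_ee_delta_pose")
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # batched random actions
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```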
4.2 Benchmark Comparison
```mermaid
graph LR
    subgraph LowFidelityHighSpeed["Low Fidelity - High Speed"]
        MW[MetaWorld<br/>50 tasks<br/>MuJoCo]
        MS[ManiSkill<br/>20+ tasks<br/>SAPIEN/GPU]
    end
    subgraph MediumFidelity["Medium Fidelity"]
        LB[LIBERO<br/>5 suites<br/>MuJoCo]
        RB[RLBench<br/>100+ tasks<br/>CoppeliaSim]
    end
    subgraph HighFidelityLowSpeed["High Fidelity - Low Speed"]
        SP[SIMPLER<br/>Real-matched<br/>SAPIEN]
        REAL[Real Robot Evaluation<br/>Gold Standard]
    end
    MW --> LB
    MS --> LB
    LB --> SP
    RB --> SP
    SP --> REAL
```
4.3 Evaluation Metrics
| Metric | Definition | Applicable Scenario |
|---|---|---|
| Success Rate | Proportion of successfully completed tasks | Most commonly used primary metric |
| Partial Success | Partial completion (e.g., grasped but not placed correctly) | Long-horizon tasks |
| Generalization Gap | Difference in success rate between in-distribution and out-of-distribution | Generalization capability evaluation |
| Sample Efficiency | Data needed to reach a threshold success rate | Data efficiency evaluation |
| Inference Latency | Time for a single inference | Real-time performance evaluation |
| Cross-Embodiment Transfer | Zero-shot/few-shot performance on new robots | Transfer capability evaluation |
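To make the first three metrics concrete, a small sketch over per-trial results; the Wilson interval is included because real-robot evaluations often use only 20–50 trials per task (function names are my own):

```python
import math

def success_rate(results: list[bool]) -> float:
    """Fraction of trials marked successful."""
    return sum(results) / len(results)

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval -- meaningful error bars at small trial counts."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

in_dist = [True] * 17 + [False] * 3    # 20 in-distribution trials
out_dist = [True] * 11 + [False] * 9   # 20 out-of-distribution trials
gap = success_rate(in_dist) - success_rate(out_dist)  # generalization gap
print(f"in-dist {success_rate(in_dist):.0%}, OOD {success_rate(out_dist):.0%}, gap {gap:.0%}")
print("95% CI (in-dist):", wilson_interval(17, 20))
```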
5. Data Quality and Annotation
5.1 Key Dimensions of Data Quality
| Dimension | Description | Impact |
|---|---|---|
| Demonstration quality | Operator's proficiency level | Directly affects imitation learning ceiling |
| Diversity | Scene, object, lighting variations | Determines generalization capability |
| Annotation accuracy | Correspondence between language instructions and actions | Affects language-conditioned policies |
| Temporal alignment | Timestamp synchronization between images and actions | Critical for causal modeling |
| Calibration accuracy | Camera intrinsic/extrinsic parameter accuracy | Affects 3D-related tasks |
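For the temporal-alignment dimension above, a common preprocessing step is resampling asynchronous streams onto a shared clock. A minimal nearest-neighbor sketch; the 5 Hz camera / 100 Hz controller rates are illustrative:

```python
import numpy as np

def align_to_images(img_ts: np.ndarray, act_ts: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """For each image timestamp, select the nearest-in-time action sample."""
    idx = np.searchsorted(act_ts, img_ts)
    idx = np.clip(idx, 1, len(act_ts) - 1)
    # Pick whichever neighbor (before vs. after) is closer to the image timestamp.
    left_closer = (img_ts - act_ts[idx - 1]) < (act_ts[idx] - img_ts)
    idx = np.where(left_closer, idx - 1, idx)
    return actions[idx]

img_ts = np.arange(0.0, 10.0, 1 / 5)             # camera frames at 5 Hz
act_ts = np.arange(0.0, 10.0, 1 / 100) + 0.003   # control at 100 Hz, small clock offset
actions = np.random.randn(len(act_ts), 7)
aligned = align_to_images(img_ts, act_ts, actions)
print(aligned.shape)  # (50, 7): one action per camera frame
```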
5.2 Automated Quality Filtering
Recent work has begun exploring automated data quality assessment:
- Success rate filtering: Remove failed demonstrations
- Consistency checks: Detect temporal consistency of actions and observations
- VLM scoring: Use VLMs to evaluate the semantic correctness of demonstrations
- Outlier detection: Remove anomalous values in the action distribution (a minimal sketch follows below)
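As an illustration of the last filter, a simple per-dimension z-score check over an episode's actions; the threshold is an arbitrary choice for the sketch:

```python
import numpy as np

def flag_action_outliers(actions: np.ndarray, z_thresh: float = 4.0) -> np.ndarray:
    """Flag timesteps whose action deviates > z_thresh std devs in any dimension.

    actions: (T, action_dim) array. Returns a boolean mask of shape (T,).
    """
    mu = actions.mean(axis=0)
    sigma = actions.std(axis=0) + 1e-8  # guard against constant dimensions
    z = np.abs(actions - mu) / sigma
    return (z > z_thresh).any(axis=1)

actions = np.random.randn(1000, 7).astype(np.float32)
actions[123] = 50.0                     # inject an obviously corrupted timestep
mask = flag_action_outliers(actions)
print(f"flagged {mask.sum()} of {len(mask)} timesteps")  # expect 1 (the injected step)
```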
6. Future Directions
6.1 Paths for Data Scale Expansion
| Path | Representative | Feasibility | Scale Ceiling |
|---|---|---|---|
| More teleoperation | DROID, AgiBot World | High | Tens of millions of episodes |
| Simulation generation | ManiSkill, Isaac Gym | High | Hundreds of millions of episodes |
| Video generation models | UniSim | Medium | Theoretically unlimited |
| Autonomous exploration | Reset-free RL | Low (currently) | Depends on algorithmic progress |
| Internet video | RT-2 pretraining | High | Billions of videos |
6.2 Standardization Trends
- RLDS and LeRobot formats are becoming de facto standards
- HuggingFace Hub as a unified data distribution platform
- Dataset cards (Datasheets) recording collection conditions, biases, and usage limitations
6.3 Key Challenges
- Long-tail problem: Rare but important manipulation scenarios have very little data
- Negative samples: Most datasets contain only successful demonstrations, lacking failure cases
- Cross-embodiment standardization: Observation and action spaces differ greatly across robots
- Privacy and security: Data containing real environments may involve privacy concerns
- Evaluation fairness: Different models use different training data, making cross-comparison difficult
References:
- Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", 2023
- Khazatsky et al., "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset", RSS 2024
- Walke et al., "BridgeData V2: A Dataset for Robot Learning at Scale", CoRL 2023
- Fang et al., "RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot", 2023
- Li et al., "Evaluating Real-World Robot Manipulation Policies in Simulation" (SIMPLER), CoRL 2024
- Liu et al., "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning", NeurIPS 2023
- James et al., "RLBench: The Robot Learning Benchmark", RA-L 2020
- Yu et al., "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning", CoRL 2019
- Gu et al., "ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills", ICLR 2023