Datasets and Benchmarks

Progress in robot learning depends on high-quality datasets and standardized evaluation benchmarks. Compared with NLP and computer vision, however, robot data is far more expensive to acquire and orders of magnitude smaller in scale. This article systematically reviews the major robot datasets, data format standards, and evaluation benchmarks.

Related notes: Teleoperation and Data Collection | VLA Models | Open-Source Model Summary


1. The Scarcity of Robot Data

1.1 Scale Comparison

Comparing robot data with data scales from other AI domains provides an intuitive sense of the gap:

| Domain | Representative Dataset | Data Scale | Acquisition Method |
|---|---|---|---|
| Language models | Common Crawl | ~15T tokens | Web crawling |
| Image recognition | LAION-5B | 5 billion image-text pairs | Web crawling |
| Video understanding | HD-VILA | 100 million video clips | Web crawling |
| Autonomous driving | nuScenes | 1.4 million frames | Vehicle-mounted sensors |
| Robot manipulation | Open X-Embodiment | ~1 million episodes | Teleoperation/RL |
| Single lab | Typical scale | 1K–100K episodes | Teleoperation |

The fundamental reason for robot data scarcity:

\[\text{Data Cost} = \frac{\text{Hardware Cost} + \text{Labor Cost} + \text{Time Cost}}{\text{Collection Speed}}\]
  • Hardware cost: A single robot arm costs approximately $5K–$50K
  • Labor cost: Requires operators for teleoperation (VR/teach pendant/force feedback)
  • Time cost: A single operation typically takes 30 seconds to 5 minutes
  • Collection speed: One person collects approximately 100–500 episodes per day
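
Plugging illustrative numbers into the cost formula above makes the gap concrete. A back-of-envelope sketch (all figures are assumptions, not measured costs):

# Back-of-envelope cost per episode, using assumed (illustrative) numbers
hardware_cost = 20_000                # USD: amortized robot arm + sensors
hardware_lifetime_episodes = 200_000  # episodes before replacement (assumed)
operator_wage = 25                    # USD per hour (assumed)
episodes_per_hour = 30                # ~2 min per demonstration (assumed)

cost_per_episode = (
    hardware_cost / hardware_lifetime_episodes  # amortized hardware
    + operator_wage / episodes_per_hour         # labor
)
print(f"~${cost_per_episode:.2f} per episode")  # ~$0.93 with these numbers

# At this rate, 1M episodes cost roughly $1M in labor alone --
# versus effectively free web-crawled tokens for LLM pretraining.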

1.2 Strategies for Addressing Data Scarcity

graph TB
    PROBLEM[Robot Data Scarcity] --> S1[More Efficient Collection]
    PROBLEM --> S2[Data Augmentation and Generation]
    PROBLEM --> S3[Cross-Source Data Aggregation]
    PROBLEM --> S4[Learning from Non-Robot Data]

    S1 --> S1a[Better Teleoperation Systems<br/>ALOHA, UMI, Bunny-VisionPro]
    S1 --> S1b[Autonomous Data Collection<br/>Reset-free RL]

    S2 --> S2a[Simulation Data Generation<br/>Randomization + Sim2Real]
    S2 --> S2b[Image Augmentation<br/>Color/Texture/Viewpoint Transforms]
    S2 --> S2c[Video Generation Models<br/>UniSim Data Augmentation]

    S3 --> S3a[Open X-Embodiment<br/>Multi-Institution Data Aggregation]
    S3 --> S3b[DROID<br/>Standardized Collection Protocol]

    S4 --> S4a[Learning from Human Videos<br/>R3M, VIP]
    S4 --> S4b[Web Data Pretraining<br/>RT-2]
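
As one concrete instance of strategy S2b, here is a minimal image-augmentation sketch using torchvision; the transform choices and magnitudes are illustrative assumptions, not taken from a specific paper:

# Minimal sketch of strategy S2b (image augmentation) with torchvision.
import torch
from torchvision import transforms

augment = transforms.Compose([
    # Randomly perturb color to simulate lighting/texture variation
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    # Random crop-and-resize to simulate small viewpoint shifts
    transforms.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
])

frame = torch.rand(3, 256, 256)  # stand-in for a camera frame
augmented = augment(frame)       # apply per-frame during training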

2. Major Datasets

2.1 Large-Scale Aggregated Datasets

Open X-Embodiment (Google DeepMind, 2023)

| Attribute | Value |
|---|---|
| Scale | 1 million+ episodes |
| Sources | 33 sub-datasets from 21 institutions |
| Robot types | 22 different morphologies |
| Task descriptions | 160,000+ types |
| Data format | RLDS (TensorFlow Datasets) |
| Storage size | ~1.3TB |

Included sub-datasets (partial):

| Sub-dataset | Episodes | Robot | Tasks |
|---|---|---|---|
| RT-1 Robot Action | 130K | Everyday Robots | Tabletop manipulation |
| Bridge V2 | 60K | WidowX | Kitchen manipulation |
| Language Table | 442K | xArm | Language-guided block pushing |
| TACO-RL | 6K | Franka | Manipulation + RL |
| BC-Z | 26K | Google Robot | Multi-task |
| Cable Routing | 1K | UR5 | Cable routing |

DROID (2024)

| Attribute | Value |
|---|---|
| Scale | 76,000 episodes |
| Sources | Standardized collection from 13 institutions |
| Robot | Franka Emika Panda |
| Collection method | VR controller (Meta Quest 2) teleoperation |
| Scenes | 564 distinct scenes |
| Annotations | Natural language instructions + manipulation type labels |

Core value of DROID:

  • Unified collection protocol ensures data quality consistency
  • Multi-scene data covers real-world diversity
  • Standardized data format facilitates model training

2.2 Domain-Specific Datasets

Bridge V2 (UC Berkeley, 2023)

| Attribute | Value |
|---|---|
| Scale | 60,096 episodes |
| Robot | WidowX 250 (6-DoF) |
| Scenes | 24 kitchen/tabletop environments |
| Objects | 100+ everyday items |
| Control | End-effector pose control |
| Frequency | 5 Hz |

RH20T (Shanghai Jiao Tong University, 2023)

| Attribute | Value |
|---|---|
| Scale | 110,000+ episodes |
| Robots | Multiple (Franka, UR5, xArm, etc.) |
| Tasks | 147 manipulation tasks |
| Features | Rich multimodal annotations |
| Sensors | RGB + Depth + Force/Torque + Tactile |

AgiBot World (AgiBot, 2025)

| Attribute | Value |
|---|---|
| Scale | 1 million+ episodes (target) |
| Robot | AgiBot proprietary platform |
| Scenes | Industrial + Home |
| Features | First large-scale open-source robot dataset from China |
| Data quality | Automated quality filtering pipeline |

RoboTurk (Stanford, 2018)

| Attribute | Value |
|---|---|
| Scale | 2,000+ demonstrations |
| Collection method | Cloud crowdsourcing (browser-based teleoperation) |
| Robot | Sawyer |
| Contribution | First exploration of crowdsourced robot data collection |

3. Data Format Standards

Different datasets use different storage formats; format conversion is a common pain point in practice.

3.1 Major Format Comparison

| Format | Used by | Basis | Features | Suitable for |
|---|---|---|---|---|
| RLDS | Open X-Embodiment, Octo | TensorFlow Datasets | Standardized episode structure | Large-scale pretraining |
| LeRobot format | LeRobot, HuggingFace | Parquet + Video | Small size, HuggingFace ecosystem | Rapid prototyping |
| HDF5 | robomimic, RH20T | HDF5 | Flexible nested structure | Research experiments |
| zarr | Diffusion Policy | Zarr | Chunked storage, parallelism-friendly | Large-scale training |
| rosbag | ROS ecosystem | ROS | Raw sensor recordings | Data collection |

3.2 RLDS Format Details

RLDS (Reinforcement Learning Datasets) is the standard format adopted by Open X-Embodiment:

# Typical structure of an RLDS episode
import numpy as np

episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((256, 256, 3), dtype=np.uint8),       # RGB image
                "wrist_image": np.zeros((128, 128, 3), dtype=np.uint8), # Wrist camera (optional)
                "state": np.zeros(7, dtype=np.float32),                 # Proprioceptive state
            },
            "action": np.zeros(7, dtype=np.float32),  # 7DoF: dx,dy,dz,drx,dry,drz,gripper
            "reward": 0.0,
            "is_first": True,       # first step of the episode
            "is_last": False,       # last step of the episode
            "is_terminal": False,   # true terminal state (vs. truncation)
            "language_instruction": "pick up the red cup",
        },
        # ... subsequent steps
    ]
}
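
A loading sketch following the pattern from the Open X-Embodiment example notebooks; the GCS path points at one published sub-dataset, but treat the exact path and version as something to verify:

# Sketch: iterating an RLDS dataset with tensorflow_datasets
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory(
    "gs://gresearch/robotics/bridge/0.1.0"  # illustrative OXE sub-dataset path
)
ds = builder.as_dataset(split="train[:10]")

for episode in ds:
    for step in episode["steps"]:
        image = step["observation"]["image"]
        action = step["action"]
        # feed (image, action) pairs into your training pipeline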

3.3 LeRobot Format Details

LeRobot adopts a more modern data format based on Parquet and video files:

dataset/
├── meta/
│   ├── info.json           # Dataset metadata
│   ├── episodes.jsonl      # Episode-level metadata
│   └── stats.json          # Statistics (mean, standard deviation)
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet  # Structured data
│   │   ├── episode_000001.parquet
│   │   └── ...
├── videos/
│   ├── chunk-000/
│   │   ├── observation.images.top/
│   │   │   ├── episode_000000.mp4
│   │   │   └── ...
│   │   └── observation.images.wrist/
│   │       └── ...

Advantages of the LeRobot format:

  • Video compression dramatically reduces storage space (compared to raw image frames)
  • Seamless integration with HuggingFace Hub
  • Parquet format supports efficient columnar queries
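
A minimal loading sketch, assuming a recent LeRobot release; the import path and attribute names have shifted between versions, so check your installed version:

# Sketch: loading a LeRobot-format dataset from the HuggingFace Hub
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")  # downloads Parquet + MP4 from the Hub
print(dataset.num_episodes, dataset.num_frames)

sample = dataset[0]   # one frame, with video frames decoded on the fly
print(sample.keys())  # e.g. observation.image, action, timestamp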

3.4 Format Conversion

In practice, conversion between formats is frequently needed:

# RLDS → LeRobot (LeRobot provides official tools)
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir /path/to/rlds_dataset \
    --raw-format rlds \
    --repo-id your-hf-username/dataset-name

# HDF5 → LeRobot
python lerobot/scripts/push_dataset_to_hub.py \
    --raw-dir /path/to/hdf5_data \
    --raw-format robomimic \
    --repo-id your-hf-username/dataset-name

4. Benchmarks and Evaluation

4.1 Simulation Benchmarks

SIMPLER (Google DeepMind, 2024)

| Attribute | Value |
|---|---|
| Positioning | Simulation alternative for evaluating real-robot policies |
| Core value | Simulation evaluation scores highly correlate with real-robot performance |
| Tasks | Tabletop manipulation based on Google Robot and WidowX |
| Features | Can evaluate VLA performance without real robots |

LIBERO (UT Austin, 2023)

| Attribute | Value |
|---|---|
| Positioning | Lifelong-learning benchmark |
| Platform | MuJoCo simulation |
| Task suites | 5 suites (four with 10 tasks each, plus LIBERO-100 with 100 tasks) |
| Evaluation dimensions | Spatial generalization, object generalization, goal generalization, long-horizon tasks |

LIBERO's 5 task suites:

| Suite | Evaluation Dimension | Difficulty |
|---|---|---|
| LIBERO-Spatial | Same objects, different spatial relations | Low |
| LIBERO-Object | Same task, different objects | Medium |
| LIBERO-Goal | Same scene, different goals | Medium |
| LIBERO-Long | Long-horizon composite tasks | High |
| LIBERO-100 | 100 diverse tasks | High |
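
A minimal sketch of how LIBERO exposes its suites, paraphrased from the LIBERO repository's examples; verify the API names against the current release:

# Sketch: enumerating tasks in a LIBERO suite
from libero.libero import benchmark

suite = benchmark.get_benchmark_dict()["libero_spatial"]()
print(suite.n_tasks)                         # 10 tasks in this suite
task = suite.get_task(0)
print(task.language)                         # natural-language goal
init_states = suite.get_task_init_states(0)  # fixed initial states for fair evaluation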

RLBench (Imperial College, 2020)

| Attribute | Value |
|---|---|
| Platform | CoppeliaSim + PyRep |
| Tasks | 100+ carefully designed manipulation tasks |
| Observations | RGB, depth, joint state, end-effector pose |
| Features | Each task provides variations for generalization testing |

MetaWorld (Stanford, 2020)

| Attribute | Value |
|---|---|
| Platform | MuJoCo |
| Tasks | 50 tabletop manipulation tasks |
| Positioning | Multi-task learning and meta-learning evaluation |
| Evaluation modes | ML1 (single-task), ML10 (10 tasks), ML45 (45 tasks), MT10, MT50 |
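
A usage sketch following the Meta-World README; task names and the exact reset/step signatures vary between metaworld releases:

# Sketch of the Meta-World ML1 interface
import random
import metaworld

ml1 = metaworld.ML1("pick-place-v2")          # one task family, varied goals
env = ml1.train_classes["pick-place-v2"]()
env.set_task(random.choice(ml1.train_tasks))  # sample a goal variation

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())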

ManiSkill (UCSD/Hillbot, 2023)

| Attribute | Value |
|---|---|
| Platform | SAPIEN |
| Versions | ManiSkill2, ManiSkill3 |
| Tasks | 20+ manipulation task categories |
| Features | GPU-parallel environments, extremely fast |
| Objects | Uses interactable objects from PartNet-Mobility |
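
A quickstart sketch in the style of the ManiSkill3 docs; the env ID and keyword arguments should be checked against your installed version:

# Sketch: ManiSkill3's GPU-parallel environments via the gymnasium API
import gymnasium as gym
import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", num_envs=256, obs_mode="state")  # 256 parallel envs on GPU
obs, _ = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())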

4.2 Benchmark Comparison

graph LR
    subgraph LowFidelityHighSpeed["Low Fidelity - High Speed"]
        MW[MetaWorld<br/>50 tasks<br/>MuJoCo]
        MS[ManiSkill<br/>20+ tasks<br/>SAPIEN/GPU]
    end

    subgraph MediumFidelity["Medium Fidelity"]
        LB[LIBERO<br/>5 suites<br/>MuJoCo]
        RB[RLBench<br/>100+ tasks<br/>CoppeliaSim]
    end

    subgraph HighFidelityLowSpeed["High Fidelity - Low Speed"]
        SP[SIMPLER<br/>Real-matched<br/>SAPIEN]
        REAL[Real Robot Evaluation<br/>Gold Standard]
    end

    MW --> LB
    MS --> LB
    LB --> SP
    RB --> SP
    SP --> REAL

4.3 Evaluation Metrics

| Metric | Definition | Applicable Scenario |
|---|---|---|
| Success Rate | Proportion of successfully completed tasks | Most commonly used primary metric |
| Partial Success | Partial completion (e.g., grasped but not placed correctly) | Long-horizon tasks |
| Generalization Gap | Difference in success rates between in- and out-of-distribution | Generalization capability evaluation |
| Sample Efficiency | Data needed to reach a threshold success rate | Data efficiency evaluation |
| Inference Latency | Time for a single inference | Real-time performance evaluation |
| Cross-Embodiment Transfer | Zero-shot/few-shot performance on new robots | Transfer capability evaluation |
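
A minimal sketch of computing the first and third metrics; `run_episode`, `policy`, `seen_tasks`, and `unseen_tasks` are hypothetical placeholders for your own rollout harness:

# Sketch: success rate and generalization gap
def success_rate(policy, tasks, run_episode, episodes_per_task=10):
    """run_episode(policy, task) -> bool is your own rollout harness."""
    results = [
        run_episode(policy, task)
        for task in tasks
        for _ in range(episodes_per_task)
    ]
    return sum(results) / len(results)

# in_dist  = success_rate(policy, seen_tasks,   run_episode)
# out_dist = success_rate(policy, unseen_tasks, run_episode)
# generalization_gap = in_dist - out_dist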

5. Data Quality and Annotation

5.1 Key Dimensions of Data Quality

| Dimension | Description | Impact |
|---|---|---|
| Demonstration quality | Operator's proficiency level | Directly affects the imitation learning ceiling |
| Diversity | Scene, object, lighting variations | Determines generalization capability |
| Annotation accuracy | Correspondence between language instructions and actions | Affects language-conditioned policies |
| Temporal alignment | Timestamp synchronization between images and actions | Critical for causal modeling |
| Calibration accuracy | Camera intrinsic/extrinsic parameter accuracy | Affects 3D-related tasks |

5.2 Automated Quality Filtering

Recent work has begun exploring automated data quality assessment:

  1. Success rate filtering: Remove failed demonstrations
  2. Consistency checks: Detect temporal consistency of actions and observations
  3. VLM scoring: Use VLMs to evaluate the semantic correctness of demonstrations
  4. Outlier detection: Remove anomalous values in the action distribution (a minimal sketch follows below)
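
A minimal sketch of check 4, assuming episodes are stored as arrays of actions; the z-score threshold is an arbitrary assumption:

# Sketch: flag episodes whose actions fall far outside the dataset-wide distribution
import numpy as np

def flag_outlier_episodes(episodes, z_threshold=6.0):
    """episodes: list of (T_i, action_dim) arrays."""
    all_actions = np.concatenate(episodes, axis=0)
    mean = all_actions.mean(axis=0)
    std = all_actions.std(axis=0) + 1e-8  # avoid division by zero

    flagged = []
    for i, ep in enumerate(episodes):
        z = np.abs((ep - mean) / std)
        if z.max() > z_threshold:  # any step, any dimension far out of range
            flagged.append(i)
    return flagged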

6. Future Directions

6.1 Paths for Data Scale Expansion

| Path | Representative | Feasibility | Scale Ceiling |
|---|---|---|---|
| More teleoperation | DROID, AgiBot World | High | Tens of millions of episodes |
| Simulation generation | ManiSkill, Isaac Gym | High | Hundreds of millions of episodes |
| Video generation models | UniSim | Medium | Theoretically unlimited |
| Autonomous exploration | Reset-free RL | Low (currently) | Depends on algorithm progress |
| Internet video | RT-2 pretraining | High | Billions of videos |

6.2 Standardization and Ecosystem

  • RLDS and LeRobot formats are becoming de facto standards
  • HuggingFace Hub as a unified data distribution platform
  • Dataset cards (Datasheets) recording collection conditions, biases, and usage limitations

6.3 Key Challenges

  1. Long-tail problem: Rare but important manipulation scenarios have very little data
  2. Negative samples: Most datasets contain only successful demonstrations, lacking failure cases
  3. Cross-embodiment standardization: Observation and action spaces differ greatly across robots
  4. Privacy and security: Data containing real environments may involve privacy concerns
  5. Evaluation fairness: Different models use different training data, making cross-comparison difficult

References:

  • Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", 2023
  • Khazatsky et al., "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset", RSS 2024
  • Walke et al., "BridgeData V2: A Dataset for Robot Learning at Scale", CoRL 2023
  • Fang et al., "RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot", 2023
  • Li et al., "Evaluating Real-World Robot Manipulation Policies in Simulation" (SIMPLER), CoRL 2024
  • Liu et al., "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning", NeurIPS 2023
  • James et al., "RLBench: The Robot Learning Benchmark & Learning Environment", RA-L 2020
  • Yu et al., "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning", CoRL 2019
  • Gu et al., "ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills", ICLR 2023
