
Teleoperation and Data Collection

Overview

Data is the fuel of robot learning. However, unlike natural language processing (trillions of tokens) and computer vision (billions of images), robot manipulation data is extremely scarce. Acquiring high-quality robot interaction data requires physical hardware and human operators, with costs far exceeding those of web-crawled data.

This article systematically introduces the core method for robot data collection — teleoperation — and frontier strategies for data scaling.


The Data Bottleneck: Scale Comparison

| Domain | Dataset | Scale | Acquisition Method |
| --- | --- | --- | --- |
| NLP | Common Crawl | ~15T tokens | Web crawling |
| CV | LAION-5B | 5.85B image-text pairs | Web crawling |
| Autonomous driving | nuScenes | 1,000 scenes | Vehicle-mounted sensors |
| Robotics | Open X-Embodiment | ~1M trajectories | 22 robot platforms |
| Robotics | DROID | 76K trajectories | Teleoperation |
| Typical lab | - | 50–500 trajectories | Manual teleoperation |

Key Insight: robot data is roughly seven orders of magnitude scarcer than NLP data (~1M trajectories vs. ~15T tokens). This is not merely an engineering problem but a fundamental bottleneck: every robot trajectory costs physical time and human effort.

Data Collection Efficiency Estimate

Assuming a lab with 1 robot and 1 operator:

  • Single trajectory duration: ~30 seconds (operation) + ~15 seconds (reset) = 45 seconds
  • Trajectories per hour: ~80
  • Per day (8 hours): ~640
  • Per month (22 days): ~14,000

Collecting 1 million trajectories would take ~72 months (6 years), assuming everything goes smoothly.
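
This back-of-the-envelope estimate is easy to reproduce; the sketch below simply restates the numbers above in plain Python.

```python
# Throughput estimate: one robot, one operator.
seconds_per_traj = 30 + 15                 # operation + reset
per_hour = 3600 // seconds_per_traj        # 80 trajectories/hour
per_day = per_hour * 8                     # 640/day
per_month = per_day * 22                   # 14,080/month
months = 1_000_000 / per_month             # ~71 months for 1M trajectories
print(f"{per_hour}/h, {per_day}/day, {per_month:,}/month; "
      f"1M trajectories in ~{months:.0f} months (~{months / 12:.0f} years)")
```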


Classification of Teleoperation Methods

Method Overview

graph TD
    A[Teleoperation Methods] --> B[Handheld Controllers]
    A --> C[Leader-Follower Arms]
    A --> D[Exoskeletons/Gloves]
    A --> E[Low-cost Solutions]

    B --> B1[VR Controllers<br/>Meta Quest / HTC Vive]
    B --> B2[Space Mouse<br/>6-DOF Input]
    B --> B3[Game Controllers<br/>Low precision but cheap]

    C --> C1[ALOHA<br/>Low-cost Bimanual]
    C --> C2[Industrial Leader-Follower<br/>Sigma.7 / Omega]
    C --> C3[Da Vinci Surgical Robot<br/>High precision]

    D --> D1[Data Gloves<br/>Cyberglove / MANUS]
    D --> D2[Arm Exoskeletons<br/>Motion Capture]
    D --> D3[Full-body MoCap<br/>OptiTrack / Xsens]

    E --> E1[UMI<br/>Handheld Gripper + iPhone]
    E --> E2[GELLO<br/>Joint-space Teleoperation]
    E --> E3[Phone App<br/>Simple Control]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#f3e5f5
    style E fill:#fce4ec

Key Design Dimensions

| Dimension | Options | Trade-off |
| --- | --- | --- |
| Control space | Task space \((x, y, z, R)\) vs. joint space \(q\) | Task space is intuitive but has singularities |
| Force feedback | Bilateral vs. unilateral | Force feedback improves precision but increases cost |
| Operation latency | <10 ms (local) vs. 50–200 ms (remote) | Latency causes operational instability |
| Control mode | Position control vs. impedance control | Impedance control is safer |

ALOHA System

System Architecture

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation, Zhao et al., 2023) is currently the most influential low-cost bimanual teleoperation platform.

Core Design: kinematically matched leader-follower arms. The follower arms (ViperX 300, 6DOF) and the smaller, cheaper leader arms (WidowX 250, 6DOF) share the same joint layout, so joint angles map one-to-one. The operator backdrives the leader arms, and the follower arms replicate the motion in real time.

Hardware Components:

| Component | Specification | Unit Cost (USD) |
| --- | --- | --- |
| Follower arms ×2 | ViperX 300, 6DOF | $5,750 × 2 |
| Leader arms ×2 | WidowX 250, 6DOF | ~$3,300 × 2 |
| Grippers ×2 | Custom 1-DOF gripper | ~$200 × 2 |
| Cameras ×4 | Logitech C922 | ~$70 × 4 |
| Controller | Laptop | ~$1,500 |
| Total | | ~$20,000 |

Compared to industrial teleoperation systems ($100K–$1M), ALOHA reduces cost by 1–2 orders of magnitude.

Data Format

ALOHA collects data containing:

Each timestep t:
  - Joint positions: q_leader ∈ ℝ^{14} ((6 joints + 1 gripper) × 2 arms)
  - Joint positions: q_follower ∈ ℝ^{14}
  - Images: {cam_top, cam_left, cam_right, cam_wrist} each 480x640x3
  - Timestamp: t
  - Gripper state: {open, closed} x 2

Control frequency is 50 Hz; each 30-second trajectory produces ~1500 timesteps.
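
Those numbers imply a substantial storage footprint per demonstration; a quick estimate for raw uint8 images, before any compression:

```python
hz, seconds, cams = 50, 30, 4         # control rate, episode length, cameras
h, w, c = 480, 640, 3                 # image resolution
steps = hz * seconds                  # 1,500 timesteps
raw_bytes = steps * cams * h * w * c  # uint8 RGB, no compression
print(f"{steps} steps, ~{raw_bytes / 1e9:.1f} GB of raw images per trajectory")
# ~5.5 GB; in practice episodes are stored compressed (e.g. HDF5 with JPEG/MP4 frames)
```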

Leader-Follower Mapping

ALOHA uses direct joint-space mapping:

\[ q_{\text{follower}}^{\text{target}}(t) = q_{\text{leader}}(t) \]

Since the leader and follower arms share the same joint layout, joint angles correspond directly. The follower arm uses PD control to track the target:

\[ \tau = K_p (q_{\text{target}} - q_{\text{actual}}) + K_d (\dot{q}_{\text{target}} - \dot{q}_{\text{actual}}) \]
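
A minimal sketch of the 50 Hz tracking loop follows. `read_leader_joints`, `read_follower_state`, and `apply_torques` are hypothetical hardware wrappers standing in for the real driver API, and the gains are illustrative:

```python
KP, KD = 100.0, 5.0   # illustrative gains; tuned per joint on real hardware

def pd_torque(q_tgt, dq_tgt, q, dq):
    """The PD tracking law above: drive the follower toward the leader."""
    return KP * (q_tgt - q) + KD * (dq_tgt - dq)

def teleop_step(q_leader_prev, dt=1.0 / 50):
    q_leader = read_leader_joints()              # hypothetical driver call, R^14
    dq_leader = (q_leader - q_leader_prev) / dt  # finite-difference velocity
    q, dq = read_follower_state()                # hypothetical driver call
    apply_torques(pd_torque(q_leader, dq_leader, q, dq))
    return q_leader
```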

Mobile ALOHA

Extending to Mobile Manipulation

Mobile ALOHA (Fu et al., 2024) adds a mobile base to ALOHA, enabling the robot to perform tasks in larger spaces.

Additional Hardware:

| Component | Specification |
| --- | --- |
| Mobile base | AgileX Tracer (differential drive) |
| Base speed | Up to 1.6 m/s |
| Additional camera | Wide-angle camera at the front of the base |
| Additional cost | ~$8,000 |

Operation Mode: The operator controls the base motion by pushing the entire teleoperation platform, while simultaneously manipulating the leader arms.

Co-training

The key innovation of Mobile ALOHA is co-training: leveraging large amounts of static ALOHA data (without locomotion) to augment training on a smaller amount of mobile manipulation data.

Let \(\mathcal{D}_{\text{mobile}}\) be the mobile manipulation data (small), and \(\mathcal{D}_{\text{static}}\) be the static manipulation data (large). The training loss is:

\[ \mathcal{L} = \mathbb{E}_{(o,a) \sim \mathcal{D}_{\text{mobile}}} [\ell(\pi_\theta(o), a)] + \alpha \mathbb{E}_{(o,a) \sim \mathcal{D}_{\text{static}}} [\ell(\pi_\theta(o), a)] \]

Experiments show that co-training improves success rates by 30–50% compared to training with mobile data alone.
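
One way to implement the weighted objective is through the sampling ratio: drawing mobile and static samples 1 : α with uniform per-sample weights gives an expected batch loss proportional to the formula above. A minimal sketch, assuming the datasets are in-memory lists of (observation, action) pairs:

```python
import random

def cotrain_batch(mobile_ds, static_ds, batch_size=64, alpha=1.0):
    """Sample a mixed batch with a 1:alpha mobile-to-static ratio."""
    n_static = round(batch_size * alpha / (1 + alpha))
    n_mobile = batch_size - n_static
    batch = random.sample(mobile_ds, n_mobile) + random.sample(static_ds, n_static)
    random.shuffle(batch)  # avoid ordered mobile/static halves
    return batch
```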


UMI: Universal Manipulation Interface

Design Philosophy

The core idea of UMI (Chi et al., 2024) is to remove the robot — use handheld devices to directly collect human manipulation data in real environments, then deploy to different robots.

Hardware Design

Handheld collection device:

  • 3D-printed parallel gripper shell
  • Embedded GoPro camera with fisheye lens (wrist viewpoint)
  • Pose tracking via the GoPro's built-in IMU plus visual-inertial SLAM (no external tracker)
  • Total cost < $500

Data collection workflow:

  1. Operator holds the UMI gripper and performs the task
  2. GoPro records wrist images (30 FPS)
  3. Visual-inertial SLAM on the fisheye footage recovers the end-effector SE(3) pose trajectory
  4. Gripper open/close state is recorded

Cross-Embodiment Transfer

UMI's collected data format is end-effector trajectories \((p_t, R_t, g_t) \in SE(3) \times \{0, 1\}\), independent of the specific robot. During deployment, the end-effector trajectory is mapped to different robots via inverse kinematics:

\[ q_t = \text{IK}(p_t, R_t; \text{robot\_model}) \]

This allows the same batch of data to be used on different robot arms such as Franka and UR5.
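
A sketch of this deployment step follows. `solve_ik` stands in for whatever IK routine your robot stack provides (an assumption, not part of UMI's released code); seeding each solve with the previous solution keeps the joint trajectory continuous:

```python
def retarget(trajectory, solve_ik, q_init):
    """Map a UMI end-effector trajectory [(p, R, g), ...] to joint commands.

    p: position in R^3, R: rotation matrix, g: gripper open/close in {0, 1}.
    """
    q_prev, commands = q_init, []
    for p, R, g in trajectory:
        q = solve_ik(p, R, seed=q_prev)  # robot-specific inverse kinematics
        commands.append((q, g))
        q_prev = q                       # warm-start the next solve
    return commands
```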

Relative Pose Representation

UMI uses a pose representation relative to the gripper rather than a global coordinate frame:

\[ \Delta T_t = T_t^{-1} \cdot T_{t+1} \]

This representation is invariant to different robot mounting positions and orientations, further enhancing generalization.
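
With poses stored as 4×4 homogeneous matrices, the conversion and its inverse (re-anchoring the increments at a new start pose during replay) are each a few lines of numpy:

```python
import numpy as np

def to_relative(T):
    """ΔT_t = T_t^{-1} · T_{t+1} for a sequence of 4x4 homogeneous poses."""
    return [np.linalg.inv(T[t]) @ T[t + 1] for t in range(len(T) - 1)]

def replay_from(T0, deltas):
    """Re-anchor the relative increments at a new starting pose T0."""
    poses = [T0]
    for dT in deltas:
        poses.append(poses[-1] @ dT)  # accumulate increments
    return poses
```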


GELLO: Joint-Space Teleoperation

Design Principles

GELLO (Wu et al., 2023) adopts a scaled kinematic-replica design: a low-cost teleoperation device with the same kinematic structure as the target robot, built at a smaller scale.

Key Advantages:

  • One-to-one joint space correspondence, no inverse kinematics needed
  • Good operational intuitiveness — the operator can directly feel the angle of each joint
  • Customizable for different robots

Joint Mapping:

\[ q_{\text{robot}} = \alpha \cdot q_{\text{GELLO}} + \beta \]

where \(\alpha\) is the scaling factor (compensating for size differences) and \(\beta\) is the offset.
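
In code the mapping is a per-joint affine transform plus a safety clip. The α/β values below are placeholders that would come from a one-time calibration, not GELLO's actual constants:

```python
import numpy as np

ALPHA = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])  # placeholder sign/scale per joint
BETA = np.zeros(6)                                   # placeholder zero-position offsets

def gello_to_robot(q_gello, q_min, q_max):
    """Affine joint mapping q_robot = alpha * q_gello + beta, clipped to limits."""
    return np.clip(ALPHA * q_gello + BETA, q_min, q_max)
```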

Cost Comparison

| System | Cost | Bimanual | Force Feedback | Skill Barrier |
| --- | --- | --- | --- | --- |
| Industrial leader-follower (Sigma.7) | >$100K | Yes | Yes | Low |
| ALOHA | ~$20K | Yes | No | Medium |
| UMI | ~$500 | No | Natural (human hand) | Low |
| GELLO | ~$3K | Optional | No | Medium |
| VR controllers | ~$400 | Yes | Vibration only | High |

Data Scaling Strategies

Open X-Embodiment

Open X-Embodiment (Open X-Embodiment Collaboration, 2023) is currently the largest multi-institution robot manipulation dataset effort.

Scale:

  • 22 robot platforms
  • 527 skills
  • ~1M trajectories
  • From 21 institutions

Standardized Format: All data is unified into the RLDS (Reinforcement Learning Datasets) format:

{
  "observation": {
    "image": Tensor[H, W, 3],      # RGB image
    "state": Tensor[D],             # Proprioceptive state
    "language_instruction": String,  # Language instruction
  },
  "action": Tensor[A],              # Action vector
  "reward": Float,                   # Reward (if available)
}
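
A sketch of reading one of these datasets with `tensorflow_datasets` (the RLDS tooling). The GCS path and dataset name follow the project's public release but should be treated as assumptions to verify against the Open X-Embodiment documentation:

```python
import tensorflow_datasets as tfds

# Dataset name/version are illustrative; see the OXE docs for the full list.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:                  # RLDS: episode -> steps
        obs, action = step["observation"], step["action"]
        image = obs["image"]                       # Tensor[H, W, 3]
```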

DROID Dataset

DROID (Khazatsky et al., 2024) is a large-scale, diverse teleoperation dataset:

  • Scale: 76,000 trajectories
  • Scenes: 564 different scenes
  • Tasks: 86 task types
  • Collection method: Franka arms teleoperated with VR controllers, distributed across multiple labs

Data Augmentation Strategies

Beyond direct collection, there are various data augmentation methods:

Viewpoint Augmentation:

  • Random cropping, color jittering, affine transforms
  • Novel view synthesis (NeRF-based)

Action Augmentation:

  • Trajectory perturbation: \(a_t' = a_t + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)\) (see the sketch after this list)
  • Time scaling: Changing execution speed
  • Mirror flipping: Doubling data for symmetric tasks
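
A numpy sketch of the two simplest action augmentations, assuming actions of shape (T, D) with the lateral translation at index 1 and images of shape (T, H, W, C); rotation components would need a proper reflection and are omitted:

```python
import numpy as np

def perturb(actions, sigma=0.01):
    """Trajectory perturbation: a'_t = a_t + eps, eps ~ N(0, sigma^2 I)."""
    return actions + np.random.normal(0.0, sigma, size=actions.shape)

def mirror(actions, images):
    """Mirror flip for a left-right symmetric task (positions only)."""
    flipped = actions.copy()
    flipped[:, 1] *= -1.0                  # negate the lateral axis
    return flipped, images[:, :, ::-1, :]  # flip the image width axis
```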

Simulation-assisted:

  • Rendering real-scanned objects in simulation
  • Synthesizing training images with real backgrounds
  • Generating diverse trajectories through physical randomization

Data Quality Assessment

Key Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Success rate | Fraction of demonstrations that complete the task | >95% |
| Action smoothness | \(\sum_t \|\ddot{a}_t\|^2\) | Lower is better |
| Diversity | Variance/coverage of trajectories | Moderate (neither too low nor too high) |
| Consistency | Agreement across demonstrations of the same task | Moderate |

Data Filtering

Automatic filtering of low-quality demonstrations (a code sketch follows the list):

  1. Length anomaly: Timestep count deviates more than 2 standard deviations from the mean
  2. Action anomaly: Joint velocity/acceleration exceeds safety limits
  3. Task failure: Final state does not satisfy success conditions
  4. Operator marking: Operator proactively flags failed demonstrations
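
A sketch of these four checks as a single predicate; the field names (`q`, `dt`, `flagged_by_operator`) and the `success_fn` callback are assumptions about the data schema:

```python
import numpy as np

def keep(traj, mean_len, std_len, v_max, a_max, success_fn):
    """Return True if a trajectory passes all four filters above."""
    T = len(traj["q"])                               # q: (T, D) joint positions
    if abs(T - mean_len) > 2 * std_len:              # 1. length anomaly
        return False
    dq = np.diff(traj["q"], axis=0) / traj["dt"]     # finite-difference velocity
    ddq = np.diff(dq, axis=0) / traj["dt"]
    if np.abs(dq).max() > v_max or np.abs(ddq).max() > a_max:
        return False                                 # 2. action anomaly
    if not success_fn(traj):                         # 3. task failure
        return False
    return not traj.get("flagged_by_operator", False)  # 4. operator marking
```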

Future Directions

Autonomous Data Collection

Reducing dependence on human operators:

  1. RL Exploration: Using RL policies for autonomous data collection
  2. Human-in-the-loop correction: after the policy executes autonomously, humans intervene only at critical moments (DAgger-style)
  3. Active learning: Policies automatically identify regions needing more data

Leveraging Internet Data

Extracting manipulation knowledge from YouTube and similar videos:

  1. Hand detection and tracking: Extracting human hand trajectories
  2. Object state estimation: Inferring object changes
  3. Action retargeting: Mapping human hand actions to robots
  4. Representative works: R3M, VIP, Voltron

Connections to Other Chapters

  • Imitation Learning: the primary consumer of teleoperation data
  • Diffusion Policy: advanced policy learning algorithms place higher demands on data quality and diversity
  • Hardware Platforms: the physical infrastructure for data collection, including robot arms and sensors

References

  1. Zhao, T., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS.
  2. Fu, Z., et al. (2024). Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. CoRL.
  3. Chi, C., et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. RSS.
  4. Wu, P., et al. (2023). GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators. arXiv.
  5. Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv.
  6. Khazatsky, A., et al. (2024). DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. RSS.
