
Teleoperation and Data Collection

Overview

Data is the fuel of robot learning. However, unlike natural language processing (trillions of tokens) and computer vision (billions of images), robot manipulation data is extremely scarce. Acquiring high-quality robot interaction data requires physical hardware and human operators, with costs far exceeding those of web-crawled data.

This article systematically introduces the core method for robot data collection — teleoperation — and frontier strategies for data scaling.


The Data Bottleneck: Scale Comparison

| Domain | Dataset | Scale | Acquisition Method |
| --- | --- | --- | --- |
| NLP | Common Crawl | ~15T tokens | Web crawling |
| CV | LAION-5B | 5.85B image-text pairs | Web crawling |
| Autonomous driving | nuScenes | 1,000 scenes | Vehicle-mounted sensors |
| Robotics | Open X-Embodiment | ~1M trajectories | 22 robot platforms |
| Robotics | DROID | 76K trajectories | Teleoperation |
| Typical lab | - | 50–500 trajectories | Manual teleoperation |

Key Insight: robot data is roughly seven orders of magnitude scarcer than NLP data (~1M trajectories vs. ~15T tokens). This is not merely an engineering problem but a fundamental bottleneck: every robot trajectory costs physical time and human effort.

Data Collection Efficiency Estimate

Assuming a lab with 1 robot and 1 operator:

  • Single trajectory duration: ~30 seconds (operation) + ~15 seconds (reset) = 45 seconds
  • Trajectories per hour: ~80
  • Per day (8 hours): ~640
  • Per month (22 days): ~14,000

Collecting 1 million trajectories would take ~72 months (6 years), assuming everything goes smoothly.
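
This back-of-the-envelope estimate is easy to reproduce; the sketch below simply restates the numbers above in plain Python.

```python
# Throughput estimate: one robot, one operator.
seconds_per_traj = 30 + 15                 # operation + reset
per_hour = 3600 // seconds_per_traj        # 80 trajectories/hour
per_day = per_hour * 8                     # 640/day
per_month = per_day * 22                   # 14,080/month
months = 1_000_000 / per_month             # ~71 months for 1M trajectories
print(f"{per_hour}/h, {per_day}/day, {per_month:,}/month; "
      f"1M trajectories in ~{months:.0f} months (~{months / 12:.0f} years)")
```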


Classification of Teleoperation Methods

Method Overview

graph TD
    A[Teleoperation Methods] --> B[Handheld Controllers]
    A --> C[Leader-Follower Arms]
    A --> D[Exoskeletons/Gloves]
    A --> E[Low-cost Solutions]

    B --> B1[VR Controllers<br/>Meta Quest / HTC Vive]
    B --> B2[Space Mouse<br/>6-DOF Input]
    B --> B3[Game Controllers<br/>Low precision but cheap]

    C --> C1[ALOHA<br/>Low-cost Bimanual]
    C --> C2[Industrial Leader-Follower<br/>Sigma.7 / Omega]
    C --> C3[Da Vinci Surgical Robot<br/>High precision]

    D --> D1[Data Gloves<br/>Cyberglove / MANUS]
    D --> D2[Arm Exoskeletons<br/>Motion Capture]
    D --> D3[Full-body MoCap<br/>OptiTrack / Xsens]

    E --> E1[UMI<br/>Handheld Gripper + iPhone]
    E --> E2[GELLO<br/>Joint-space Teleoperation]
    E --> E3[Phone App<br/>Simple Control]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#f3e5f5
    style E fill:#fce4ec

Key Design Dimensions

| Dimension | Options | Trade-off |
| --- | --- | --- |
| Control space | Task space \((x, y, z, R)\) vs. joint space \(q\) | Task space is intuitive but has singularities |
| Force feedback | Bilateral vs. unilateral | Force feedback improves precision but increases cost |
| Operation latency | <10 ms (local) vs. 50–200 ms (remote) | Latency causes operational instability |
| Control mode | Position control vs. impedance control | Impedance control is safer |

ALOHA System

System Architecture

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation, Zhao et al., 2023) is currently the most influential low-cost bimanual teleoperation platform.

Core Design: kinematically matched leader-follower arms. The follower arms (ViperX 300, 6DOF) and the smaller, cheaper leader arms (WidowX 250, 6DOF) share the same joint layout, so joint angles map one-to-one. The operator backdrives the leader arms, and the follower arms replicate the motion in real time.

Hardware Components:

| Component | Specification | Unit Cost (USD) |
| --- | --- | --- |
| Follower arms ×2 | ViperX 300, 6DOF | $5,750 × 2 |
| Leader arms ×2 | WidowX 250, 6DOF | ~$3,300 × 2 |
| Grippers ×2 | Custom 1-DOF gripper | ~$200 × 2 |
| Cameras ×4 | Logitech C922 | ~$70 × 4 |
| Controller | Laptop | ~$1,500 |
| Total | | ~$20,000 |

Compared to industrial teleoperation systems ($100K–$1M), ALOHA reduces cost by 1–2 orders of magnitude.

Data Format

ALOHA collects data containing:

Each timestep t:
  - Joint positions: q_leader ∈ ℝ^{14} ((6 joints + 1 gripper) × 2 arms)
  - Joint positions: q_follower ∈ ℝ^{14}
  - Images: {cam_top, cam_left, cam_right, cam_wrist} each 480x640x3
  - Timestamp: t
  - Gripper state: {open, closed} x 2

Control frequency is 50 Hz; each 30-second trajectory produces ~1500 timesteps.
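
Those numbers imply a substantial storage footprint per demonstration; a quick estimate for raw uint8 images, before any compression:

```python
hz, seconds, cams = 50, 30, 4         # control rate, episode length, cameras
h, w, c = 480, 640, 3                 # image resolution
steps = hz * seconds                  # 1,500 timesteps
raw_bytes = steps * cams * h * w * c  # uint8 RGB, no compression
print(f"{steps} steps, ~{raw_bytes / 1e9:.1f} GB of raw images per trajectory")
# ~5.5 GB; in practice episodes are stored compressed (e.g. HDF5 with JPEG/MP4 frames)
```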

Leader-Follower Mapping

ALOHA uses direct joint-space mapping:

\[ q_{\text{follower}}^{\text{target}}(t) = q_{\text{leader}}(t) \]

Since the leader and follower arms share the same joint layout, joint angles correspond directly. The follower arm uses PD control to track the target:

\[ \tau = K_p (q_{\text{target}} - q_{\text{actual}}) + K_d (\dot{q}_{\text{target}} - \dot{q}_{\text{actual}}) \]
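
A minimal sketch of the 50 Hz tracking loop follows. `read_leader_joints`, `read_follower_state`, and `apply_torques` are hypothetical hardware wrappers standing in for the real driver API, and the gains are illustrative:

```python
KP, KD = 100.0, 5.0   # illustrative gains; tuned per joint on real hardware

def pd_torque(q_tgt, dq_tgt, q, dq):
    """The PD tracking law above: drive the follower toward the leader."""
    return KP * (q_tgt - q) + KD * (dq_tgt - dq)

def teleop_step(q_leader_prev, dt=1.0 / 50):
    q_leader = read_leader_joints()              # hypothetical driver call, R^14
    dq_leader = (q_leader - q_leader_prev) / dt  # finite-difference velocity
    q, dq = read_follower_state()                # hypothetical driver call
    apply_torques(pd_torque(q_leader, dq_leader, q, dq))
    return q_leader
```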

Mobile ALOHA

Extending to Mobile Manipulation

Mobile ALOHA (Fu et al., 2024) adds a mobile base to ALOHA, enabling the robot to perform tasks in larger spaces.

Additional Hardware:

| Component | Specification |
| --- | --- |
| Mobile base | AgileX Tracer (differential drive) |
| Base speed | Up to 1.6 m/s |
| Additional camera | Wide-angle camera at the front of the base |
| Additional cost | ~$8,000 |

Operation Mode: The operator controls the base motion by pushing the entire teleoperation platform, while simultaneously manipulating the leader arms.

Co-training

The key innovation of Mobile ALOHA is co-training: leveraging large amounts of static ALOHA data (without locomotion) to augment training on a smaller amount of mobile manipulation data.

Let \(\mathcal{D}_{\text{mobile}}\) be the mobile manipulation data (small), and \(\mathcal{D}_{\text{static}}\) be the static manipulation data (large). The training loss is:

\[ \mathcal{L} = \mathbb{E}_{(o,a) \sim \mathcal{D}_{\text{mobile}}} [\ell(\pi_\theta(o), a)] + \alpha \mathbb{E}_{(o,a) \sim \mathcal{D}_{\text{static}}} [\ell(\pi_\theta(o), a)] \]

Experiments show that co-training improves success rates by 30–50% compared to training with mobile data alone.
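
One way to implement the weighted objective is through the sampling ratio: drawing mobile and static samples 1 : α with uniform per-sample weights gives an expected batch loss proportional to the formula above. A minimal sketch, assuming the datasets are in-memory lists of (observation, action) pairs:

```python
import random

def cotrain_batch(mobile_ds, static_ds, batch_size=64, alpha=1.0):
    """Sample a mixed batch with a 1:alpha mobile-to-static ratio."""
    n_static = round(batch_size * alpha / (1 + alpha))
    n_mobile = batch_size - n_static
    batch = random.sample(mobile_ds, n_mobile) + random.sample(static_ds, n_static)
    random.shuffle(batch)  # avoid ordered mobile/static halves
    return batch
```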


UMI: Universal Manipulation Interface

Design Philosophy

The core idea of UMI (Chi et al., 2024) is to remove the robot — use handheld devices to directly collect human manipulation data in real environments, then deploy to different robots.

Hardware Design

Handheld collection device:

  • 3D-printed parallel gripper shell
  • Embedded GoPro camera with fisheye lens (wrist viewpoint)
  • Pose tracking via the GoPro's built-in IMU plus visual-inertial SLAM (no external tracker)
  • Total cost < $500

Data collection workflow:

  1. Operator holds the UMI gripper and performs the task
  2. GoPro records wrist images (30 FPS)
  3. Visual-inertial SLAM on the fisheye footage recovers the end-effector SE(3) pose trajectory
  4. Gripper open/close state is recorded

Cross-Embodiment Transfer

UMI's collected data format is end-effector trajectories \((p_t, R_t, g_t) \in SE(3) \times \{0, 1\}\), independent of the specific robot. During deployment, the end-effector trajectory is mapped to different robots via inverse kinematics:

\[ q_t = \text{IK}(p_t, R_t; \text{robot\_model}) \]

This allows the same batch of data to be used on different robot arms such as Franka and UR5.
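
A sketch of this deployment step follows. `solve_ik` stands in for whatever IK routine your robot stack provides (an assumption, not part of UMI's released code); seeding each solve with the previous solution keeps the joint trajectory continuous:

```python
def retarget(trajectory, solve_ik, q_init):
    """Map a UMI end-effector trajectory [(p, R, g), ...] to joint commands.

    p: position in R^3, R: rotation matrix, g: gripper open/close in {0, 1}.
    """
    q_prev, commands = q_init, []
    for p, R, g in trajectory:
        q = solve_ik(p, R, seed=q_prev)  # robot-specific inverse kinematics
        commands.append((q, g))
        q_prev = q                       # warm-start the next solve
    return commands
```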

Relative Pose Representation

UMI uses a pose representation relative to the gripper rather than a global coordinate frame:

\[ \Delta T_t = T_t^{-1} \cdot T_{t+1} \]

This representation is invariant to different robot mounting positions and orientations, further enhancing generalization.
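
With poses stored as 4×4 homogeneous matrices, the conversion and its inverse (re-anchoring the increments at a new start pose during replay) are each a few lines of numpy:

```python
import numpy as np

def to_relative(T):
    """ΔT_t = T_t^{-1} · T_{t+1} for a sequence of 4x4 homogeneous poses."""
    return [np.linalg.inv(T[t]) @ T[t + 1] for t in range(len(T) - 1)]

def replay_from(T0, deltas):
    """Re-anchor the relative increments at a new starting pose T0."""
    poses = [T0]
    for dT in deltas:
        poses.append(poses[-1] @ dT)  # accumulate increments
    return poses
```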


GELLO: Joint-Space Teleoperation

Design Principles

GELLO (Wu et al., 2023) adopts a scaled kinematic-replica design: a low-cost teleoperation device with the same kinematic structure as the target robot, built at a smaller scale.

Key Advantages:

  • One-to-one joint space correspondence, no inverse kinematics needed
  • Good operational intuitiveness — the operator can directly feel the angle of each joint
  • Customizable for different robots

Joint Mapping:

\[ q_{\text{robot}} = \alpha \cdot q_{\text{GELLO}} + \beta \]

where \(\alpha\) is the scaling factor (compensating for size differences) and \(\beta\) is the offset.
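
In code the mapping is a per-joint affine transform plus a safety clip. The α/β values below are placeholders that would come from a one-time calibration, not GELLO's actual constants:

```python
import numpy as np

ALPHA = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])  # placeholder sign/scale per joint
BETA = np.zeros(6)                                   # placeholder zero-position offsets

def gello_to_robot(q_gello, q_min, q_max):
    """Affine joint mapping q_robot = alpha * q_gello + beta, clipped to limits."""
    return np.clip(ALPHA * q_gello + BETA, q_min, q_max)
```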

Cost Comparison

| System | Cost | Bimanual | Force Feedback | Skill Barrier |
| --- | --- | --- | --- | --- |
| Industrial leader-follower (Sigma.7) | >$100K | Yes | Yes | Low |
| ALOHA | ~$20K | Yes | No | Medium |
| UMI | ~$500 | No | Natural (human hand) | Low |
| GELLO | ~$3K | Optional | No | Medium |
| VR controllers | ~$400 | Yes | Vibration only | High |

Data Scaling Strategies

Open X-Embodiment

Open X-Embodiment (Open X-Embodiment Collaboration, 2023) is currently the largest multi-institution robot manipulation dataset effort.

Scale:

  • 22 robot platforms
  • 527 skills
  • ~1M trajectories
  • From 21 institutions

Standardized Format: All data is unified into the RLDS (Reinforcement Learning Datasets) format:

{
  "observation": {
    "image": Tensor[H, W, 3],      # RGB image
    "state": Tensor[D],             # Proprioceptive state
    "language_instruction": String,  # Language instruction
  },
  "action": Tensor[A],              # Action vector
  "reward": Float,                   # Reward (if available)
}
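
A sketch of reading one of these datasets with `tensorflow_datasets` (the RLDS tooling). The GCS path and dataset name follow the project's public release but should be treated as assumptions to verify against the Open X-Embodiment documentation:

```python
import tensorflow_datasets as tfds

# Dataset name/version are illustrative; see the OXE docs for the full list.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:                  # RLDS: episode -> steps
        obs, action = step["observation"], step["action"]
        image = obs["image"]                       # Tensor[H, W, 3]
```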

DROID Dataset

DROID (Khazatsky et al., 2024) is a large-scale, diverse teleoperation dataset:

  • Scale: 76,000 trajectories
  • Scenes: 564 different scenes
  • Tasks: 86 task types
  • Collection method: Franka arms teleoperated with VR controllers, distributed across multiple labs

Data Augmentation Strategies

Beyond direct collection, there are various data augmentation methods:

Viewpoint Augmentation:

  • Random cropping, color jittering, affine transforms
  • Novel view synthesis (NeRF-based)

Action Augmentation:

  • Trajectory perturbation: \(a_t' = a_t + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)\) (see the sketch after this list)
  • Time scaling: Changing execution speed
  • Mirror flipping: Doubling data for symmetric tasks
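
A numpy sketch of the two simplest action augmentations, assuming actions of shape (T, D) with the lateral translation at index 1 and images of shape (T, H, W, C); rotation components would need a proper reflection and are omitted:

```python
import numpy as np

def perturb(actions, sigma=0.01):
    """Trajectory perturbation: a'_t = a_t + eps, eps ~ N(0, sigma^2 I)."""
    return actions + np.random.normal(0.0, sigma, size=actions.shape)

def mirror(actions, images):
    """Mirror flip for a left-right symmetric task (positions only)."""
    flipped = actions.copy()
    flipped[:, 1] *= -1.0                  # negate the lateral axis
    return flipped, images[:, :, ::-1, :]  # flip the image width axis
```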

Simulation-assisted:

  • Rendering real-scanned objects in simulation
  • Synthesizing training images with real backgrounds
  • Generating diverse trajectories through physical randomization

Data Quality Assessment

Key Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Success rate | Fraction of demonstrations that complete the task | >95% |
| Action smoothness | \(\sum_t \|\ddot{a}_t\|^2\) | Lower is better |
| Diversity | Variance/coverage of trajectories | Moderate (neither too low nor too high) |
| Consistency | Agreement across demonstrations of the same task | Moderate |

Data Filtering

Automatic filtering of low-quality demonstrations (a code sketch follows the list):

  1. Length anomaly: Timestep count deviates more than 2 standard deviations from the mean
  2. Action anomaly: Joint velocity/acceleration exceeds safety limits
  3. Task failure: Final state does not satisfy success conditions
  4. Operator marking: Operator proactively flags failed demonstrations
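
A sketch of these four checks as a single predicate; the field names (`q`, `dt`, `flagged_by_operator`) and the `success_fn` callback are assumptions about the data schema:

```python
import numpy as np

def keep(traj, mean_len, std_len, v_max, a_max, success_fn):
    """Return True if a trajectory passes all four filters above."""
    T = len(traj["q"])                               # q: (T, D) joint positions
    if abs(T - mean_len) > 2 * std_len:              # 1. length anomaly
        return False
    dq = np.diff(traj["q"], axis=0) / traj["dt"]     # finite-difference velocity
    ddq = np.diff(dq, axis=0) / traj["dt"]
    if np.abs(dq).max() > v_max or np.abs(ddq).max() > a_max:
        return False                                 # 2. action anomaly
    if not success_fn(traj):                         # 3. task failure
        return False
    return not traj.get("flagged_by_operator", False)  # 4. operator marking
```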

Future Directions

Autonomous Data Collection

Reducing dependence on human operators:

  1. RL Exploration: Using RL policies for autonomous data collection
  2. Human-in-the-loop correction: after the policy executes autonomously, humans intervene only at critical moments (DAgger-style)
  3. Active learning: Policies automatically identify regions needing more data

Leveraging Internet Data

Extracting manipulation knowledge from YouTube and similar videos:

  1. Hand detection and tracking: Extracting human hand trajectories
  2. Object state estimation: Inferring object changes
  3. Action retargeting: Mapping human hand actions to robots
  4. Representative works: R3M, VIP, Voltron

Connections to Other Chapters

  • Imitation Learning: the primary consumer of teleoperation data
  • Diffusion Policy: advanced policy learning algorithms place higher demands on data quality and diversity
  • Hardware Platforms: the physical infrastructure for data collection, including robot arms and sensors

References

  1. Zhao, T., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS.
  2. Fu, Z., et al. (2024). Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. CoRL.
  3. Chi, C., et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. RSS.
  4. Wu, P., et al. (2023). GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators. arXiv.
  5. Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv.
  6. Khazatsky, A., et al. (2024). DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. RSS.
