Teleoperation and Data Collection
Overview
Data is the fuel of robot learning. However, unlike natural language processing (trillions of tokens) and computer vision (billions of images), robot manipulation data is extremely scarce. Acquiring high-quality robot interaction data requires physical hardware and human operators, with costs far exceeding those of web-crawled data.
This article systematically introduces the core method for robot data collection — teleoperation — and frontier strategies for data scaling.
The Data Bottleneck: Scale Comparison
| Domain | Dataset | Scale | Acquisition Method |
|---|---|---|---|
| NLP | Common Crawl | ~15T tokens | Web crawling |
| CV | LAION-5B | 5.85B image-text pairs | Web crawling |
| Autonomous Driving | nuScenes | 1000 scenes | Vehicle-mounted sensors |
| Robotics | Open X-Embodiment | ~1M trajectories | 22 robot platforms |
| Robotics | DROID | 76K trajectories | Teleoperation collection |
| Typical Lab | — | 50–500 trajectories | Manual teleoperation |
Key Insight: Robot data is roughly seven orders of magnitude scarcer than NLP data (~1M trajectories vs. ~15T tokens). This is not an engineering problem but a fundamental bottleneck — every robot trajectory requires physical time and human effort.
Data Collection Efficiency Estimate
Assuming a lab with 1 robot and 1 operator:
- Single trajectory duration: ~30 seconds (operation) + ~15 seconds (reset) = 45 seconds
- Trajectories per hour: ~80
- Per day (8 hours): ~640
- Per month (22 days): ~14,000
Collecting 1 million trajectories would take roughly 71 months (about six years), assuming everything goes smoothly.
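The throughput arithmetic above can be checked in a few lines (all rates are the assumptions stated in the estimate):

```python
# Back-of-the-envelope throughput for one robot + one operator.
SECONDS_PER_TRAJ = 30 + 15            # operation + reset
per_hour = 3600 // SECONDS_PER_TRAJ   # trajectories per hour
per_day = per_hour * 8                # 8-hour workday
per_month = per_day * 22              # 22 workdays
months_for_1m = 1_000_000 / per_month # ~71 months, i.e. about six years
print(per_hour, per_day, per_month, round(months_for_1m, 1))
```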
Classification of Teleoperation Methods
Method Overview
```mermaid
graph TD
A[Teleoperation Methods] --> B[Handheld Controllers]
A --> C[Leader-Follower Arms]
A --> D[Exoskeletons/Gloves]
A --> E[Low-cost Solutions]
B --> B1[VR Controllers<br/>Meta Quest / HTC Vive]
B --> B2[Space Mouse<br/>6-DOF Input]
B --> B3[Game Controllers<br/>Low precision but cheap]
C --> C1[ALOHA<br/>Low-cost Bimanual]
C --> C2[Industrial Leader-Follower<br/>Sigma.7 / Omega]
C --> C3[Da Vinci Surgical Robot<br/>High precision]
D --> D1[Data Gloves<br/>Cyberglove / MANUS]
D --> D2[Arm Exoskeletons<br/>Motion Capture]
D --> D3[Full-body MoCap<br/>OptiTrack / Xsens]
E --> E1[UMI<br/>Handheld Gripper + iPhone]
E --> E2[GELLO<br/>Joint-space Teleoperation]
E --> E3[Phone App<br/>Simple Control]
style A fill:#e1f5fe
style B fill:#fff3e0
style C fill:#e8f5e9
style D fill:#f3e5f5
style E fill:#fce4ec
```
Key Design Dimensions
| Dimension | Options | Trade-offs |
|---|---|---|
| Control Space | Task space \((x, y, z, R)\) vs. Joint space \(q\) | Task space is intuitive but has singularities |
| Force Feedback | Bilateral vs. Unilateral | Force feedback improves precision but increases cost |
| Operation Latency | <10ms (local) vs. 50–200ms (remote) | Latency causes operational instability |
| Control Mode | Position control vs. Impedance control | Impedance control is safer |
ALOHA System
System Architecture
ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation, Zhao et al., 2023) is currently the most influential low-cost bimanual teleoperation platform.
Core Design: Leader-follower homomorphism — the leader and follower arms use the same model of robotic arm (ViperX 300 6DOF). The operator controls the leader arms, and the follower arms replicate the motion in real time.
Hardware Components:
| Component | Specification | Unit Cost (USD) |
|---|---|---|
| Follower arms x2 | ViperX 300 6DOF | $5,750 x2 |
| Leader arms x2 | ViperX 300 6DOF (unloaded) | $5,750 x2 |
| Grippers x2 | Custom 1-DOF gripper | ~$200 x2 |
| Cameras x4 | Logitech C922 | ~$70 x4 |
| Controller | Laptop | ~$1,500 |
| Total | — | ~$25,180 |
Compared to industrial teleoperation systems ($100K–$1M), ALOHA reduces cost by 1–2 orders of magnitude.
Data Format
ALOHA collects data containing:
```
Each timestep t:
- Joint positions: q_leader ∈ ℝ^{14}   (6 arm joints + 1 gripper, x 2 arms)
- Joint positions: q_follower ∈ ℝ^{14}
- Images: {cam_top, cam_left, cam_right, cam_wrist}, each 480x640x3
- Timestamp: t
- Gripper state: {open, closed} x 2
```
Control frequency is 50 Hz; each 30-second trajectory produces ~1500 timesteps.
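A minimal sketch of one such record in code; the field names and nested-dict layout are illustrative, not the official ALOHA storage schema:

```python
import numpy as np

# One ALOHA-style timestep record (shapes follow the format above;
# field names are illustrative, not the official ALOHA HDF5 schema).
def make_timestep(t: float) -> dict:
    cams = ["cam_top", "cam_left", "cam_right", "cam_wrist"]
    return {
        "q_leader": np.zeros(14, dtype=np.float32),    # 7 values per arm x 2
        "q_follower": np.zeros(14, dtype=np.float32),
        "images": {c: np.zeros((480, 640, 3), dtype=np.uint8) for c in cams},
        "timestamp": t,
        "gripper": (1.0, 1.0),                         # open fraction per gripper
    }

# 50 Hz for 30 s -> 1500 timesteps per trajectory
episode = [make_timestep(i / 50.0) for i in range(1500)]
```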
Leader-Follower Mapping
ALOHA uses direct joint-space mapping. Since the leader and follower arms are homomorphic, joint angles correspond one-to-one:

\[ q^{\text{target}}_{\text{follower}}(t) = q_{\text{leader}}(t) \]

The follower arm uses PD control to track this target:

\[ \tau(t) = K_p \left( q^{\text{target}}_{\text{follower}}(t) - q_{\text{follower}}(t) \right) - K_d\, \dot{q}_{\text{follower}}(t) \]
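The leader-to-follower PD tracking loop can be sketched as follows; the gains, the 50 Hz rate, and the unit-inertia toy dynamics are all illustrative assumptions:

```python
import numpy as np

# Minimal sketch of follower-side PD tracking of the leader's joints.
# Gains, timestep, and the unit-inertia integrator are illustrative.
KP, KD = 100.0, 10.0
DT = 1.0 / 50.0  # 50 Hz control loop

def pd_torque(q_target, q, qd):
    # Proportional term pulls toward the leader pose; derivative term damps.
    return KP * (q_target - q) - KD * qd

# Toy simulation: unit-inertia joints converging to a fixed leader pose.
q = np.zeros(14)
qd = np.zeros(14)
q_leader = np.full(14, 0.5)
for _ in range(500):
    tau = pd_torque(q_leader, q, qd)
    qd += tau * DT   # unit inertia: acceleration == torque
    q += qd * DT
```

With these gains the toy system settles on the leader pose well within the 500 simulated steps (10 s).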
Mobile ALOHA
Extending to Mobile Manipulation
Mobile ALOHA (Fu et al., 2024) adds a mobile base to ALOHA, enabling the robot to perform tasks in larger spaces.
Additional Hardware:
| Component | Specification |
|---|---|
| Mobile base | AgileX Tracer (differential drive) |
| Base speed | Up to 1.6 m/s |
| Additional camera | Wide-angle camera in front of the base |
| Additional cost | ~$8,000 |
Operation Mode: The operator controls the base motion by pushing the entire teleoperation platform, while simultaneously manipulating the leader arms.
Co-training
The key innovation of Mobile ALOHA is co-training: leveraging large amounts of static ALOHA data (without locomotion) to augment training on a smaller amount of mobile manipulation data.
Let \(\mathcal{D}_{\text{mobile}}\) be the mobile manipulation data (small), and \(\mathcal{D}_{\text{static}}\) be the static manipulation data (large). The training loss samples both datasets in equal proportion:

\[ \mathcal{L}(\pi) = \mathbb{E}_{(o, a) \sim \mathcal{D}_{\text{mobile}}} \big[ \ell(\pi(o), a) \big] + \mathbb{E}_{(o, a) \sim \mathcal{D}_{\text{static}}} \big[ \ell(\pi(o), a) \big] \]

where \(\ell\) is the imitation loss on predicted actions.
Experiments show that co-training improves success rates by 30–50% compared to training with mobile data alone.
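The co-training recipe can be sketched as a batch sampler that mixes both datasets; the 50/50 ratio and the list-of-tuples dataset format are assumptions for illustration:

```python
import random

# Sketch of co-training batch construction: each batch mixes mobile and
# static samples. The 50/50 split is an illustrative assumption.
def cotrain_batch(d_mobile, d_static, batch_size=8, static_ratio=0.5, rng=None):
    rng = rng or random.Random(0)
    n_static = int(batch_size * static_ratio)
    batch = rng.choices(d_static, k=n_static)           # large dataset
    batch += rng.choices(d_mobile, k=batch_size - n_static)  # small dataset
    rng.shuffle(batch)
    return batch

d_mobile = [("mobile", i) for i in range(10)]      # few mobile demos
d_static = [("static", i) for i in range(1000)]    # many static demos
batch = cotrain_batch(d_mobile, d_static)
```

Sampling with replacement lets the small mobile dataset appear in every batch without being exhausted.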
UMI: Universal Manipulation Interface
Design Philosophy
The core idea of UMI (Chi et al., 2024) is to remove the robot — use handheld devices to directly collect human manipulation data in real environments, then deploy to different robots.
Hardware Design
Handheld collection device:
- 3D-printed parallel gripper shell
- Embedded GoPro camera (wrist viewpoint)
- iPhone Pro (LiDAR + SLAM localization)
- Total cost < $500
Data collection workflow:
- Operator holds the UMI gripper and performs the task
- GoPro records wrist images (30 FPS)
- iPhone SLAM provides the end-effector SE(3) pose trajectory
- Gripper open/close state is recorded
Cross-Embodiment Transfer
UMI's collected data format is end-effector trajectories \((p_t, R_t, g_t) \in SE(3) \times \{0, 1\}\), independent of the specific robot. During deployment, the end-effector trajectory is mapped to each robot's joint space via inverse kinematics:

\[ q_t = \mathrm{IK}(T_t), \quad T_t = (R_t, p_t) \in SE(3) \]
This allows the same batch of data to be used on different robot arms such as Franka and UR5.
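To make the retargeting idea concrete, here is a toy version with planar 2-link arms: the same end-effector waypoint is converted to joint angles for two "embodiments" with different link lengths via analytic inverse kinematics (a real deployment would use a full 6/7-DOF IK solver):

```python
import numpy as np

# Toy cross-embodiment retargeting: one (x, y) waypoint, two planar
# 2-link arms with different link lengths, analytic elbow-down IK.
def two_link_ik(x, y, l1, l2):
    r2 = x * x + y * y
    c2 = (r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    q2 = np.arccos(np.clip(c2, -1.0, 1.0))
    q1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(q2), l1 + l2 * np.cos(q2))
    return q1, q2

def fk(q1, q2, l1, l2):
    # Forward kinematics, used to verify the IK solution.
    return (l1 * np.cos(q1) + l2 * np.cos(q1 + q2),
            l1 * np.sin(q1) + l2 * np.sin(q1 + q2))

target = (0.7, 0.4)
for l1, l2 in [(0.5, 0.5), (0.6, 0.4)]:  # two "embodiments"
    q = two_link_ik(*target, l1, l2)
    assert np.allclose(fk(*q, l1, l2), target, atol=1e-9)
```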
Relative Pose Representation
UMI uses a pose representation relative to the gripper's previous pose rather than a global coordinate frame:

\[ T_t^{\text{rel}} = T_{t-1}^{-1}\, T_t \]
This representation is invariant to different robot mounting positions and orientations, further enhancing generalization.
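The invariance claim is easy to verify numerically: relative poses computed from consecutive 4×4 homogeneous transforms do not change when the whole trajectory is rigidly re-mounted:

```python
import numpy as np

# Relative-pose sketch: T_rel = inv(T_prev) @ T_curr on 4x4 homogeneous
# matrices. Left-multiplying every pose by a fixed rigid transform M
# (e.g. a different robot mounting pose) leaves T_rel unchanged.
def make_T(R, p):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, p
    return T

def relative(T_prev, T_curr):
    return np.linalg.inv(T_prev) @ T_curr

def Rz(a):
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

T0 = make_T(Rz(0.1), [0.30, 0.00, 0.20])
T1 = make_T(Rz(0.3), [0.35, 0.05, 0.20])
M = make_T(Rz(1.0), [1.0, -2.0, 0.5])  # arbitrary rigid remounting

# inv(M T0) (M T1) = inv(T0) inv(M) M T1 = inv(T0) T1
assert np.allclose(relative(T0, T1), relative(M @ T0, M @ T1))
```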
GELLO: Joint-Space Teleoperation
Design Principles
GELLO (Wu et al., 2023) adopts a scaled homomorphic design: building a teleoperation device with the same kinematic structure as the target robot but at a smaller scale.
Key Advantages:
- One-to-one joint space correspondence, no inverse kinematics needed
- Good operational intuitiveness — the operator can directly feel the angle of each joint
- Customizable for different robots
Joint Mapping:

\[ q_i^{\text{robot}} = \alpha_i\, q_i^{\text{GELLO}} + \beta_i \]

where \(\alpha_i\) is the per-joint scaling factor (compensating for size and axis differences) and \(\beta_i\) is the zero-position offset.
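A sketch of this per-joint affine mapping; the alpha/beta values below are made-up calibration constants, not GELLO's actual calibration:

```python
import numpy as np

# Per-joint affine map from GELLO encoder readings to robot joint
# targets. The alpha/beta values are illustrative calibration constants.
def gello_to_robot(q_gello, alpha, beta):
    return alpha * q_gello + beta

alpha = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])      # axis sign flips
beta = np.array([0.0, np.pi / 2, 0.0, 0.0, 0.0, 0.0])   # zero offsets
q = gello_to_robot(np.zeros(6), alpha, beta)
```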
Cost Comparison
| System | Cost | Bimanual | Force Feedback | Skill Barrier |
|---|---|---|---|---|
| Industrial leader-follower (Sigma.7) | >$100K | Yes | Yes | Low |
| ALOHA | ~$24K | Yes | No | Medium |
| UMI | ~$500 | No | Natural | Low |
| GELLO | ~$3K | Optional | No | Medium |
| VR controllers | ~$400 | Yes | Vibration | High |
Data Scaling Strategies
Open X-Embodiment
Open X-Embodiment (Open X-Embodiment Collaboration, 2023) is currently the largest consortium project for robot manipulation datasets.
Scale:
- 22 robot platforms
- 527 skills
- ~1M trajectories
- From 21 institutions
Standardized Format: All data is unified into the RLDS (Reinforcement Learning Datasets) format:
```
{
  "observation": {
    "image": Tensor[H, W, 3],         # RGB image
    "state": Tensor[D],               # Proprioceptive state
    "language_instruction": String,   # Language instruction
  },
  "action": Tensor[A],                # Action vector
  "reward": Float,                    # Reward (if available)
}
```
DROID Dataset
DROID (Khazatsky et al., 2024) is a large-scale, diverse teleoperation dataset:
- Scale: 76,000 trajectories
- Scenes: 564 different scenes
- Objects: 86 categories
- Collection method: Franka + SpaceMouse, distributed across multiple labs
Data Augmentation Strategies
Beyond direct collection, there are various data augmentation methods:
Viewpoint Augmentation:
- Random cropping, color jittering, affine transforms
- Novel view synthesis (NeRF-based)
Action Augmentation:
- Trajectory perturbation: \(a_t' = a_t + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)\)
- Time scaling: Changing execution speed
- Mirror flipping: Doubling data for symmetric tasks
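Two of these action augmentations, sketched on a \((T, A)\) action array; which dimension a mirror flip negates is task-specific, and index 1 here is an assumption:

```python
import numpy as np

# Gaussian trajectory perturbation and mirror flipping on a (T, A)
# action array. The flipped dimension (index 1) is an assumption.
rng = np.random.default_rng(0)

def perturb(actions, sigma=0.01):
    # a'_t = a_t + eps, eps ~ N(0, sigma^2 I)
    return actions + rng.normal(0.0, sigma, actions.shape)

def mirror_y(actions):
    flipped = actions.copy()
    flipped[:, 1] *= -1.0  # negate the assumed y component
    return flipped

traj = np.stack([np.linspace(0, 1, 50), np.linspace(0, 0.5, 50)], axis=1)
augmented = [perturb(traj), mirror_y(traj)]
```

Mirroring is an involution, so applying it twice recovers the original trajectory, which is a cheap sanity check for the sign convention.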
Simulation-assisted:
- Rendering real-scanned objects in simulation
- Synthesizing training images with real backgrounds
- Generating diverse trajectories through physical randomization
Data Quality Assessment
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| Success rate | Proportion of task completion in demonstrations | >95% |
| Action smoothness | \(\sum_t \|\ddot{a}_t\|^2\) | Lower is better |
| Diversity | Variance/coverage of trajectories | Moderate (neither too low nor too high) |
| Consistency | Consistency across different demonstrations of the same task | Moderate |
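The smoothness metric can be computed as the summed squared second difference of the action sequence, a discrete stand-in for \(\sum_t \|\ddot{a}_t\|^2\):

```python
import numpy as np

# Smoothness as summed squared second differences (discrete
# acceleration) of a (T, A) action sequence; lower is smoother.
def smoothness(actions, dt=1.0 / 50.0):
    acc = np.diff(actions, n=2, axis=0) / dt**2
    return float(np.sum(acc ** 2))

t = np.linspace(0, 1, 50)[:, None]
straight = np.hstack([t, 2 * t])                      # constant velocity
jerky = straight + 0.05 * np.sign(np.sin(60 * t))     # abrupt jumps
assert smoothness(straight) < smoothness(jerky)
```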
Data Filtering
Automatic filtering of low-quality demonstrations:
- Length anomaly: Timestep count deviates more than 2 standard deviations from the mean
- Action anomaly: Joint velocity/acceleration exceeds safety limits
- Task failure: Final state does not satisfy success conditions
- Operator marking: Operator proactively flags failed demonstrations
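The filters above can be sketched as a single predicate over stored episodes; the field names, the 50 Hz assumption, and the thresholds are all illustrative:

```python
import numpy as np

# Automatic episode filter combining the checks listed above.
# Field names ("q", "success", "flagged") and limits are illustrative.
def keep_episode(ep, mean_len, std_len, qd_limit=3.0):
    T = len(ep["q"])
    if abs(T - mean_len) > 2 * std_len:      # length anomaly
        return False
    qd = np.diff(ep["q"], axis=0) * 50.0     # finite-diff velocity at 50 Hz
    if np.abs(qd).max() > qd_limit:          # action anomaly (rad/s)
        return False
    # Task failure or operator-flagged demonstrations are dropped.
    return ep.get("success", True) and not ep.get("flagged", False)

good = {"q": np.zeros((1500, 14)), "success": True}
bad = {"q": np.zeros((100, 14)), "success": True}    # length outlier
assert keep_episode(good, mean_len=1500, std_len=100)
assert not keep_episode(bad, mean_len=1500, std_len=100)
```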
Future Directions
Autonomous Data Collection
Reducing dependence on human operators:
- RL Exploration: Using RL policies for autonomous data collection
- Predictive replay: After policy execution, humans only need to correct at critical moments
- Active learning: Policies automatically identify regions needing more data
Leveraging Internet Data
Extracting manipulation knowledge from YouTube and similar videos:
- Hand detection and tracking: Extracting human hand trajectories
- Object state estimation: Inferring object changes
- Action retargeting: Mapping human hand actions to robots
- Representative works: R3M, VIP, Voltron
Connections to Other Chapters
- Imitation learning: Imitation Learning is the primary consumer of teleoperation data
- Diffusion policy: Diffusion Policy and other advanced algorithms have higher requirements for data quality and diversity
- Hardware: Hardware Platforms introduces the physical infrastructure for data collection including robot arms and sensors
References
- Zhao, T., et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. RSS.
- Fu, Z., et al. (2024). Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. CoRL.
- Chi, C., et al. (2024). Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. RSS.
- Wu, Y., et al. (2023). GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework. arXiv.
- Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv.
- Khazatsky, A., et al. (2024). DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. RSS.