Reinforcement Learning in Robotics
Overview
Reinforcement Learning (RL) provides robots with the ability to learn optimal behaviors through trial and error. Unlike imitation learning, RL does not require expert demonstrations but instead explores and optimizes policies autonomously through environmental reward signals. However, applying RL to real robots presents significant challenges: low sample efficiency, difficult reward design, and strict safety constraints.
This article systematically reviews the core techniques and key achievements of RL in robotics.
Formal Framework
Robot RL problems are typically modeled as Partially Observable Markov Decision Processes (POMDPs):
- \(\mathcal{S}\): State space (e.g., complete physical state including object poses, contact forces, etc.)
- \(\mathcal{A}\): Action space (e.g., joint torques \(\tau \in \mathbb{R}^n\) or desired joint angles \(q_{\text{des}} \in \mathbb{R}^n\))
- \(\mathcal{O}\): Observation space (sensor inputs such as RGB images, proprioception)
- \(T(s'|s, a)\): State transition probability (governed by physical laws)
- \(O(o|s)\): Observation function
- \(r(s, a)\): Reward function
- \(\gamma \in [0, 1)\): Discount factor
Objective: Find the policy \(\pi^*(a|o)\) that maximizes the expected discounted cumulative reward:
\[\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]\]
Reward Engineering
Why Reward Design Is So Difficult
Reward Engineering is the most time-consuming aspect of robot RL. An ideal reward function must satisfy:
- Informativeness: Provide sufficient gradient signals to guide learning
- Intent alignment: The optimal policy under the reward should match the desired behavior
- Stability: Should not lead to reward hacking
Sparse vs. Dense Rewards
Sparse rewards: Only provide a reward upon task completion, e.g., \(r = \mathbb{1}[\text{task success}]\).
- Pros: Simple to define, no alignment needed
- Cons: Exploration is difficult; most trajectories receive zero reward
Dense rewards: Provide intermediate feedback at each step. For robotic grasping, a typical shape (the weights \(\alpha, \beta, \lambda\) are illustrative) is:
\[r = -\alpha \, \| p_{\text{ee}} - p_{\text{obj}} \| + \beta \, \mathbb{1}[\text{contact}] + \lambda \, h_{\text{obj}}\]
where \(p_{\text{ee}}\) is the end-effector position, \(p_{\text{obj}}\) is the target object position, and \(h_{\text{obj}}\) is the object's lifting height.
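A dense reward of this shape can be sketched as follows. This is a minimal illustration, not code from any specific system; the weights and the contact indicator are assumptions:

```python
import numpy as np

def dense_grasp_reward(p_ee, p_obj, h_obj, in_contact,
                       w_dist=1.0, w_contact=0.5, w_lift=2.0):
    """Dense grasping reward: reach + contact + lift. Weights are illustrative."""
    reach = -w_dist * np.linalg.norm(p_ee - p_obj)  # shrink EE-object distance
    contact = w_contact * float(in_contact)         # bonus for touching the object
    lift = w_lift * max(h_obj, 0.0)                 # reward lifting height
    return reach + contact + lift
```

Note how every term provides a gradient signal long before the task succeeds, which is exactly what sparse rewards lack.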
Differences in Reward Design: Manipulation vs. Locomotion
| Dimension | Manipulation Tasks | Locomotion Tasks |
|---|---|---|
| Success Metric | Object reaches target pose | Sustained movement at target velocity |
| Typical Reward | Position error + contact + completion | Velocity tracking + energy penalty + stability |
| Safety Constraints | Force/torque limits | Joint limits + self-collision |
| Difficulty | Nonlinear contact dynamics | Balance + multi-gait switching |
Typical quadruped locomotion reward (ANYmal style), schematically:
\[r = w_v \exp\!\left(-\frac{\| v_{xy} - v_{xy}^{\text{des}} \|^2}{\sigma}\right) - w_\tau \| \tau \|^2 - (\text{stability penalties})\]
combining velocity tracking, an energy (torque) penalty, and stability terms such as base-orientation and joint-limit penalties.
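A minimal sketch of such a locomotion reward follows; the weights, the exponential tracking kernel, and the base-height stability term are illustrative assumptions rather than the exact ANYmal formulation:

```python
import numpy as np

def locomotion_reward(v_xy, v_xy_des, tau, base_height, h_des=0.5,
                      w_vel=1.0, w_energy=2.5e-4, w_height=1.0, sigma=0.25):
    """Quadruped locomotion reward sketch; all weights are illustrative."""
    # Track the commanded planar velocity (reward peaks at exact tracking).
    vel_track = w_vel * np.exp(-np.sum((v_xy - v_xy_des) ** 2) / sigma)
    # Penalize actuation effort (energy proxy).
    energy = -w_energy * np.sum(tau ** 2)
    # Simple stability term: keep the base near its nominal height.
    stability = -w_height * (base_height - h_des) ** 2
    return vel_track + energy + stability
```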
Sample Efficiency Problem
Why Model-Free RL Requires Massive Samples
The sample efficiency of Model-Free RL (e.g., PPO, SAC) is limited by the following factors:
High variance of Monte Carlo estimates: The policy gradient \(g = \mathbb{E}_\tau \left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t\right]\) must be estimated from sampled trajectories, and these Monte Carlo estimates have high variance, so large numbers of trajectories are needed for accurate gradients.
Exploring sparse reward spaces: In a \(d\)-dimensional continuous action space, the probability of hitting the rewarding region through uniform random exploration decays exponentially with dimension:
\[P(\text{success per step}) \sim \left( \frac{\delta}{\Delta a} \right)^{d}\]
where \(\delta\) is the effective (rewarding) action range per dimension and \(\Delta a\) is the full action range per dimension.
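A quick numeric check of this scaling, assuming an illustrative useful fraction of \(\delta/\Delta a = 0.1\) per dimension:

```python
# Probability of randomly sampling an action inside the "useful" region,
# assuming the useful fraction per dimension is delta/Delta_a = 0.1.
ratio = 0.1
for d in (1, 7, 24):   # 1-DoF toy, 7-DoF arm, 24-DoF dexterous hand
    print(f"d={d}: P ~ {ratio ** d:.1e}")
```

At 24 degrees of freedom (a dexterous hand), the per-step hit probability is astronomically small, which is why sparse-reward exploration fails without shaping or curricula.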
Rough empirical orders of magnitude (wall-clock time assumes data collection on a single real robot):
| Task | Algorithm | Required Environment Steps | Wall-clock Time |
|---|---|---|---|
| Simple grasping | SAC | ~1M steps | ~100 hours |
| Dexterous manipulation | PPO | ~10B steps | ~1000 years |
| Quadruped walking | PPO | ~100M steps | ~10 years |
Clearly, running model-free RL from scratch directly on real robots is infeasible for all but the simplest tasks.
Solution: Massively Parallel Training in Simulation
Core idea: Run thousands of environments in parallel on GPU-accelerated simulators, compressing training time from years to hours.
Massively Parallel Training: Isaac Gym/Lab
Architecture Principles
NVIDIA Isaac Gym (Makoviychuk et al., 2021) implements end-to-end GPU simulation, eliminating the CPU-GPU data transfer bottleneck:
Traditional pipeline:
CPU Physics → CPU→GPU Transfer → GPU Neural Network → GPU→CPU Transfer → CPU Physics
Isaac Gym pipeline:
GPU Physics → GPU Neural Network → GPU Physics (all on GPU)
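The key implementation idea is that all \(K\) environments live in one batched array, so a single vectorized operation steps every environment at once. A toy NumPy sketch (the dynamics and reward here are placeholders, not the PhysX pipeline):

```python
import numpy as np

K = 4096                                 # parallel environments
state = np.zeros((K, 2))                 # per-env state, one batched array

def step_all(state, actions):
    """One vectorized step for all K envs at once -- no per-env Python loop,
    mirroring how GPU simulators batch physics across environments."""
    next_state = state + 0.01 * actions          # toy dynamics, batch-wise
    rewards = -np.sum(next_state ** 2, axis=1)   # toy per-env reward
    return next_state, rewards

actions = np.random.default_rng(0).uniform(-1.0, 1.0, size=(K, 2))
state, rewards = step_all(state, actions)
```

On a GPU the same pattern runs as one kernel launch per step, which is where the throughput numbers below come from.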
Mathematics of Parallel Environments
Let there be \(K\) parallel environments, each producing data \((o_t^k, a_t^k, r_t^k, o_{t+1}^k)\) at time step \(t\). The PPO policy gradient estimate becomes:
\[\hat{g} = \frac{1}{K} \sum_{k=1}^{K} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^k | o_t^k) \, \hat{A}_t^k\]
The variance of this estimate decreases as \(1/K\); with \(K = 4096\), variance is reduced by roughly a factor of 4000.
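The \(1/K\) scaling can be checked with a toy experiment, replacing the per-environment gradient by an i.i.d. noisy scalar stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_estimate(K):
    """Mean of K i.i.d. noisy per-environment 'gradient' samples."""
    return rng.normal(loc=1.0, scale=1.0, size=K).mean()

var_1 = np.var([grad_estimate(1) for _ in range(20000)])
var_4096 = np.var([grad_estimate(4096) for _ in range(2000)])
print(var_1, var_4096)   # ratio is roughly K = 4096
```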
Performance Comparison
| Simulator | Parallel Environments | Physics Engine | Approx. Throughput (steps/s) |
|---|---|---|---|
| MuJoCo (CPU) | 1–32 | CPU | ~500 FPS |
| Isaac Gym | 4096+ | GPU (PhysX) | ~100,000 FPS |
| Isaac Lab | 4096+ | GPU (PhysX/MJX) | ~200,000 FPS |
| Genesis | 4096+ | GPU | ~400,000 FPS |
Relative to CPU-based simulation, Isaac Gym delivers roughly 200x higher throughput, compressing training runs that would otherwise take days into minutes to hours.
Asymmetric Actor-Critic
Core Idea
During simulation training, we have access to the complete physical state (privileged information), but only sensor observations are available during real deployment. Asymmetric Actor-Critic exploits this information asymmetry:
- Critic (used only during training): Receives privileged state \(s\) (full physical information)
- Actor (used during both training and deployment): Receives only available observations \(o\)
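A minimal sketch of the asymmetric interface (pure NumPy, illustrative dimensions; a real implementation would use full MLPs trained with an algorithm such as PPO):

```python
import numpy as np

OBS_DIM, STATE_DIM, ACT_DIM = 48, 120, 12   # illustrative sizes
rng = np.random.default_rng(0)

class Actor:
    """Deployed policy: sees only sensor observations o."""
    def __init__(self):
        self.W = rng.normal(size=(ACT_DIM, OBS_DIM)) * 0.01
    def act(self, o):
        return np.tanh(self.W @ o)           # bounded actions

class Critic:
    """Training-only value function: sees the privileged state s."""
    def __init__(self):
        self.w = rng.normal(size=STATE_DIM) * 0.01
    def value(self, s):
        return float(self.w @ s)

actor, critic = Actor(), Critic()
a = actor.act(np.zeros(OBS_DIM))      # deployable: needs only o
v = critic.value(np.zeros(STATE_DIM)) # simulation-only: needs full s
```

The point is purely structural: only the actor must survive the sim-to-real boundary, so only its input is restricted to deployable sensors.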
Teacher-Student Distillation
A more advanced approach is two-stage training:
Stage 1 — Teacher Training:
Train a "teacher" policy \(\pi_{\text{teacher}}(a|s)\) using privileged information, where \(s\) includes:
- Terrain height maps
- Precise object poses
- Contact forces
- Friction coefficients
- Other simulator internal variables
Stage 2 — Student Distillation:
Train a "student" policy \(\pi_{\text{student}}(a|o)\) to imitate the teacher, where \(o\) contains only information available from real sensors:
```mermaid
flowchart LR
subgraph Stage1["Stage 1: Teacher Training"]
S1[Privileged State s] --> T[Teacher Policy<br/>pi_teacher]
T --> A1[Action a]
A1 --> E1[Simulation Environment]
E1 --> R[Reward r]
R --> RL[RL Algorithm<br/>PPO]
RL --> T
end
subgraph Stage2["Stage 2: Student Distillation"]
S2[Sensor Observation o] --> ST[Student Policy<br/>pi_student]
ST --> A2[Action a]
T2[Teacher Policy<br/>Frozen] --> KL[KL Divergence Loss]
ST --> KL
KL --> ST
end
Stage1 --> Stage2
style Stage1 fill:#e8f5e9
style Stage2 fill:#e3f2fd
```
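Stage 2 reduces to supervised regression of the student's actions onto the frozen teacher's actions. A toy linear sketch (the dimensions, the observable/privileged split, and the SGD hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM, ACT_DIM = 8, 4, 2               # illustrative sizes

W_teacher = rng.normal(size=(ACT_DIM, STATE_DIM))   # frozen stage-1 policy
W_student = np.zeros((ACT_DIM, OBS_DIM))            # stage-2 student

def observe(s):
    """Pretend only the first OBS_DIM entries of s are sensor-observable."""
    return s[:OBS_DIM]

def eval_mse(W, n=1000):
    """Mean squared action gap between student and teacher on random states."""
    S = rng.normal(size=(n, STATE_DIM))
    err = S[:, :OBS_DIM] @ W.T - S @ W_teacher.T
    return float((err ** 2).mean())

mse_before = eval_mse(W_student)
lr = 0.05
for _ in range(5000):                               # simple SGD distillation
    s = rng.normal(size=STATE_DIM)
    err = W_student @ observe(s) - W_teacher @ s    # action imitation error
    W_student -= lr * np.outer(err, observe(s))
mse_after = eval_mse(W_student)
```

Note the residual error never reaches zero: the teacher uses privileged inputs the student cannot see, which is why practical systems give the student observation *histories* to let it infer the missing state.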
Key Achievements
OpenAI Rubik's Cube (2019)
Task: Rotate a Rubik's cube in-hand using the Shadow Hand dexterous manipulator.
Technical Stack:
- Algorithm: PPO + Automatic Domain Randomization (ADR)
- Simulation: MuJoCo, with large-scale distributed rollouts (thousands of CPU cores plus GPU training)
- Training: Equivalent to tens of thousands of years of experience
- Randomized parameters: Object size, mass, friction, gravity, observation noise, and hundreds of other parameters
ADR Mechanism: Start with a narrow randomization range and gradually expand it; when the policy meets a performance threshold at the current boundary, the range \([\phi_{\text{low}}, \phi_{\text{high}}]\) is widened (and narrowed again if performance drops), schematically:
\[\phi_{\text{high}} \leftarrow \phi_{\text{high}} + \Delta \;\; \text{if performance} \geq t_H, \qquad \phi_{\text{high}} \leftarrow \phi_{\text{high}} - \Delta \;\; \text{if performance} \leq t_L\]
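The boundary update can be sketched in a few lines; the thresholds and step size here are illustrative, not the values from the OpenAI system:

```python
def adr_update(phi_low, phi_high, performance,
               t_high=0.8, t_low=0.4, step=0.05):
    """Schematic ADR boundary update (illustrative thresholds/step).

    Expand the randomization range when the policy performs well at its
    current boundary; shrink it when the policy struggles."""
    if performance >= t_high:
        phi_high += step
    elif performance <= t_low:
        phi_high = max(phi_low, phi_high - step)
    return phi_low, phi_high

lo, hi = 1.0, 1.0          # start with no randomization around the nominal value
for perf in (0.9, 0.9, 0.3, 0.9):
    lo, hi = adr_update(lo, hi, perf)
```

In the real system each randomized parameter carries its own boundaries, and performance is measured on episodes sampled at the boundary.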
ANYmal Extreme Locomotion (2022–2024)
Achievement: The ETH Zurich team used RL to train the quadruped robot ANYmal to perform parkour, climbing, and jumping.
Key Techniques:
- Curriculum learning: Gradually increasing terrain difficulty
- Reward design: Velocity tracking + stability + energy efficiency
- Teacher-Student: Teacher uses height scans (privileged information), Student uses proprioceptive history
Terrain Curriculum: terrains are generated at graded difficulty levels, and each robot is promoted to harder terrain (or demoted to easier terrain) depending on how well it completes its commanded motion.
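A common recipe in the spirit of the legged-locomotion curricula promotes an environment when the robot traverses most of its commanded distance and demotes it otherwise; a schematic sketch with illustrative thresholds:

```python
def update_terrain_level(level, traveled, commanded, max_level=9):
    """Promote/demote one environment's terrain difficulty (illustrative rule).

    Promote when the robot covered most of the commanded distance,
    demote when it covered less than half."""
    if traveled > 0.8 * commanded:
        return min(level + 1, max_level)
    if traveled < 0.5 * commanded:
        return max(level - 1, 0)
    return level

lvl = 0
for traveled in (4.5, 4.6, 1.0, 5.0):   # commanded distance = 5.0 m per episode
    lvl = update_terrain_level(lvl, traveled, 5.0)
```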
Dexterous Manipulation (2023–2025)
In recent years, RL has made rapid progress in dexterous manipulation:
| Work | Task | Key Innovation |
|---|---|---|
| DexPoint (2023) | Dexterous grasping | Point cloud input + RL |
| RotateIt (2023) | In-hand object rotation | Tactile + vision + RL |
| AnyRotate (2024) | Arbitrary object rotation | No object model required |
| DexCatch (2024) | Dynamic toss and catch | High-speed visual feedback |
Reward Learning: Automating Reward Design
Learning Rewards from Human Preferences (RLHF for Robots)
Inspired by RLHF in LLMs, the robotics community has begun exploring learning rewards from human preferences:
Given trajectory pairs \((\tau_A, \tau_B)\) with human preference labels \(y \in \{A, B\}\), train a reward model \(r_\psi\) by maximizing the likelihood of the labels under
\[P_\psi(\tau_A \succ \tau_B) = \frac{\exp\left(\sum_t r_\psi(s_t^A, a_t^A)\right)}{\exp\left(\sum_t r_\psi(s_t^A, a_t^A)\right) + \exp\left(\sum_t r_\psi(s_t^B, a_t^B)\right)}\]
This is the Bradley-Terry model applied to trajectory pairs.
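The per-pair loss can be written compactly once the reward model has summed each trajectory's rewards into a scalar return; a NumPy sketch:

```python
import numpy as np

def preference_loss(ret_a, ret_b, pref_a):
    """Bradley-Terry negative log-likelihood for one trajectory pair.

    ret_a, ret_b: predicted trajectory returns sum_t r_psi(s_t, a_t)
    pref_a: 1.0 if the human preferred trajectory A, else 0.0."""
    p_a = 1.0 / (1.0 + np.exp(ret_b - ret_a))     # sigmoid of the return gap
    return -(pref_a * np.log(p_a) + (1.0 - pref_a) * np.log(1.0 - p_a))
```

Minimizing this over labeled pairs pushes \(r_\psi\) to assign higher returns to preferred trajectories, after which \(r_\psi\) is used as the RL reward.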
LLM-Generated Reward Functions
The latest trend is using LLMs to directly generate reward code:
- Describe the task in natural language
- LLM generates a Python reward function
- Train an RL policy in simulation
- Iteratively optimize the reward based on behavioral feedback
Representative works: Eureka (Ma et al., 2023), Language2Reward (Yu et al., 2023).
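The iterative loop can be sketched as follows; `generate_reward_code` and `train_and_eval` are hypothetical stubs standing in for an LLM call and a full RL training run, not a real API:

```python
def generate_reward_code(task, feedback):
    # Stub: a real system would prompt an LLM with the task description
    # and behavioral feedback, and receive Python reward source back.
    return "def reward(state): return -abs(state['dist_to_goal'])"

def train_and_eval(reward_src):
    # Stub: a real system would train an RL policy in simulation under this
    # reward and return a task success rate.
    namespace = {}
    exec(reward_src, namespace)
    return 0.9 if namespace["reward"]({"dist_to_goal": 0.0}) == 0.0 else 0.1

task, feedback, best = "lift the cube", "", 0.0
for _ in range(3):                     # iterative reward refinement
    src = generate_reward_code(task, feedback)
    score = train_and_eval(src)
    best = max(best, score)
    feedback = f"success rate was {score:.2f}"
```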
Model-Based RL
Dynamics Model Learning
Model-Based RL learns an environment dynamics model \(\hat{T}(s'|s, a)\), then uses the model for planning or generating simulated data.
Neural network dynamics model:
\[\hat{s}_{t+1} = f_\theta(s_t, a_t)\]
Probabilistic model (for uncertainty estimation):
\[\hat{T}(s' | s, a) = \mathcal{N}\!\left( \mu_\theta(s, a), \, \Sigma_\theta(s, a) \right)\]
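Training the probabilistic model typically minimizes the Gaussian negative log-likelihood over observed transitions; a sketch of the per-transition loss, assuming a diagonal covariance parameterized by a log-variance vector:

```python
import numpy as np

def gaussian_nll(mu, log_var, s_next):
    """Negative log-likelihood of s_next under N(mu, diag(exp(log_var))).

    mu, log_var: dynamics model outputs for one (s, a) input
    s_next: observed next state from the real transition."""
    var = np.exp(log_var)
    return float(0.5 * np.sum(log_var + (s_next - mu) ** 2 / var
                              + np.log(2 * np.pi)))
```

The log-variance term prevents the model from claiming arbitrary certainty, and the predicted variance is later usable for uncertainty-aware planning.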
MPC with Learned Model
After learning a dynamics model, Model Predictive Control (MPC) performs online planning by optimizing an action sequence over a finite horizon:
\[a_{t:t+H-1}^{*} = \arg\min_{a_{t:t+H-1}} \sum_{h=0}^{H-1} c(\hat{s}_{t+h}, a_{t+h}), \qquad \hat{s}_{t+h+1} = \hat{T}(\hat{s}_{t+h}, a_{t+h})\]
where \(H\) is the planning horizon and \(c\) is the cost function; only the first action \(a_t^*\) is executed before replanning.
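A minimal random-shooting MPC sketch on a toy 1D system makes the replanning loop concrete; the "learned" model here is a hand-coded stand-in, and all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(s, a):
    """Stand-in for a learned dynamics model T_hat: next state = s + a."""
    return s + a

def cost(s, a):
    return s ** 2 + 0.01 * a ** 2        # drive the state to zero, cheaply

def mpc_action(s0, horizon=5, n_samples=256):
    """Random-shooting MPC: sample action sequences, roll out the model,
    return the first action of the cheapest sequence."""
    seqs = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    costs = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s = s0
        for a in seq:
            costs[i] += cost(s, a)
            s = model(s, a)
    return seqs[np.argmin(costs)][0]

s = 2.0
for _ in range(10):                      # closed loop: replan at every step
    s = model(s, mpc_action(s))
```

Practical systems replace random shooting with CEM or gradient-based optimizers, but the execute-first-action-then-replan structure is the same.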
- Advantages: high sample efficiency (simple tasks can often be learned with fewer than 1,000 real interaction steps)
- Disadvantages: model errors compound over long horizons; online planning adds computational overhead
Connections to Related Chapters
- Control theory foundations: Robotics Fundamentals covers PID, MPC as complementary methods to RL
- Sim2Real transfer: Sim2Real details how to deploy RL policies trained in simulation to real environments
- Simulation platforms: Simulation Platforms introduces Isaac Gym/Lab and other training infrastructure
- Imitation learning: Imitation Learning is an important complement to RL; the two are often combined (e.g., RL fine-tuning from IL initialization)
References
- Makoviychuk, V., et al. (2021). Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning. NeurIPS Datasets and Benchmarks.
- OpenAI (2019). Solving Rubik's Cube with a Robot Hand. arXiv:1910.07113.
- Miki, T., et al. (2022). Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild. Science Robotics.
- Ma, Y., et al. (2023). Eureka: Human-Level Reward Design via Coding Large Language Models. ICLR 2024.
- Yu, W., et al. (2023). Language to Rewards for Robotic Skill Synthesis. CoRL 2023.
- Haarnoja, T., et al. (2024). Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning. Science Robotics.