
Reinforcement Learning in Robotics

Overview

Reinforcement Learning (RL) provides robots with the ability to learn optimal behaviors through trial and error. Unlike imitation learning, RL does not require expert demonstrations but instead explores and optimizes policies autonomously through environmental reward signals. However, applying RL to real robots presents significant challenges: low sample efficiency, difficult reward design, and strict safety constraints.

This article systematically reviews the core techniques and key achievements of RL in robotics.


Formal Framework

Robot RL problems are typically modeled as Partially Observable Markov Decision Processes (POMDPs):

\[ (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, O, r, \gamma) \]
  • \(\mathcal{S}\): State space (e.g., complete physical state including object poses, contact forces, etc.)
  • \(\mathcal{A}\): Action space (e.g., joint torques \(\tau \in \mathbb{R}^n\) or desired joint angles \(q_{\text{des}} \in \mathbb{R}^n\))
  • \(\mathcal{O}\): Observation space (sensor inputs such as RGB images, proprioception)
  • \(T(s'|s, a)\): State transition probability (governed by physical laws)
  • \(O(o|s)\): Observation function
  • \(r(s, a)\): Reward function
  • \(\gamma \in [0, 1)\): Discount factor

Objective: Find the policy \(\pi^*(a|o)\) that maximizes the expected cumulative reward:

\[ \pi^* = \arg\max_\pi \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right] \]
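For concreteness, here is a minimal sketch (plain NumPy, episode values chosen purely for illustration) of how this discounted return is computed for a single trajectory:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t along one trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: a sparse-reward episode that only succeeds at the final (200th) step.
episode_rewards = [0.0] * 199 + [1.0]
print(discounted_return(episode_rewards))  # ~0.135 with gamma = 0.99
```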

Reward Engineering

Why Reward Design Is So Difficult

Reward Engineering is the most time-consuming aspect of robot RL. An ideal reward function must satisfy:

  1. Informativeness: Provide sufficient gradient signals to guide learning
  2. Intent alignment: The optimal policy under the reward should match the desired behavior
  3. Stability: Should not lead to reward hacking

Sparse vs. Dense Rewards

Sparse rewards: Only provide a reward upon task completion, e.g., \(r = \mathbb{1}[\text{task success}]\).

  • Pros: Simple to define, no alignment needed
  • Cons: Exploration is difficult; most trajectories receive zero reward

Dense rewards: Provide intermediate feedback at each step. For robotic grasping:

\[ r_{\text{grasp}}(s, a) = \underbrace{-\alpha \| p_{\text{ee}} - p_{\text{obj}} \|}_{\text{reaching reward}} + \underbrace{\beta \cdot \mathbb{1}[\text{contact}]}_{\text{contact reward}} + \underbrace{\gamma_r \cdot h_{\text{obj}}}_{\text{lifting reward}} + \underbrace{\delta \cdot \mathbb{1}[\text{success}]}_{\text{success reward}} \]

where \(p_{\text{ee}}\) is the end-effector position, \(p_{\text{obj}}\) is the target object position, and \(h_{\text{obj}}\) is the object's lifting height.
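A minimal Python sketch of this dense grasping reward follows; the weight values and the exact contact/success signals are illustrative assumptions, not taken from any specific system:

```python
import numpy as np

def grasp_reward(p_ee, p_obj, in_contact, h_obj, success,
                 alpha=1.0, beta=0.25, gamma_r=2.0, delta=10.0):
    """Dense grasping reward: reaching + contact + lifting + success terms.

    p_ee, p_obj : (3,) end-effector / object positions
    in_contact  : bool, gripper-object contact detected
    h_obj       : object height above its resting pose (m)
    success     : bool, task completion flag
    """
    p_ee, p_obj = np.asarray(p_ee), np.asarray(p_obj)
    reaching = -alpha * np.linalg.norm(p_ee - p_obj)   # pull the gripper toward the object
    contact = beta * float(in_contact)                 # bonus for making contact
    lifting = gamma_r * h_obj                          # reward raising the object
    bonus = delta * float(success)                     # terminal success bonus
    return reaching + contact + lifting + bonus
```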

Differences in Reward Design: Manipulation vs. Locomotion

| Dimension | Manipulation Tasks | Locomotion Tasks |
| --- | --- | --- |
| Success metric | Object reaches target pose | Sustained movement at target velocity |
| Typical reward | Position error + contact + completion | Velocity tracking + energy penalty + stability |
| Safety constraints | Force/torque limits | Joint limits + self-collision |
| Difficulty | Nonlinear contact dynamics | Balance + multi-gait switching |

Typical quadruped locomotion reward (ANYmal style):

\[ r_{\text{loco}} = w_1 \underbrace{v_x \cdot v_x^{\text{cmd}}}_{\text{velocity tracking}} - w_2 \underbrace{\| \boldsymbol{\omega} \|^2}_{\text{angular velocity penalty}} - w_3 \underbrace{\| \boldsymbol{\tau} \|^2}_{\text{torque penalty}} - w_4 \underbrace{\| \ddot{q} \|^2}_{\text{smoothness penalty}} + w_5 \underbrace{h_{\text{body}}}_{\text{height reward}} \]
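A corresponding sketch of this locomotion reward; the weights shown are placeholders and would be tuned per robot:

```python
import numpy as np

def locomotion_reward(v_x, v_x_cmd, omega, tau, qdd, h_body,
                      w=(1.0, 0.05, 2e-4, 2.5e-7, 0.5)):
    """Quadruped locomotion reward with the terms listed above.

    v_x, v_x_cmd : forward velocity and commanded forward velocity
    omega        : (3,) base angular velocity
    tau          : (n,) joint torques
    qdd          : (n,) joint accelerations
    h_body       : base height
    Weights w are illustrative; in practice they are tuned per robot.
    """
    w1, w2, w3, w4, w5 = w
    return (w1 * v_x * v_x_cmd                 # velocity tracking
            - w2 * np.sum(np.asarray(omega) ** 2)   # penalize body rotation
            - w3 * np.sum(np.asarray(tau) ** 2)     # penalize torque (energy)
            - w4 * np.sum(np.asarray(qdd) ** 2)     # penalize jerky motion
            + w5 * h_body)                           # keep the body up
```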

Sample Efficiency Problem

Why Model-Free RL Requires Massive Samples

The sample efficiency of Model-Free RL (e.g., PPO, SAC) is limited by the following factors:

High variance of Monte Carlo estimates: The policy gradient estimate \(\hat{g} = \mathbb{E}_\tau [\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t]\) has high variance, requiring large numbers of trajectories for accurate gradients.

Exploring sparse reward spaces: In a \(d\)-dimensional continuous action space, the probability of finding sparse rewards through random exploration decays exponentially with dimension:

\[ P(\text{success by random}) \sim \left(\frac{\delta}{\Delta a}\right)^d \]

where \(\delta\) is the effective action range and \(\Delta a\) is the action space range.
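A quick numerical illustration of this exponential decay; the window and range values are made up for illustration:

```python
# Probability of hitting an effective action window of width delta inside a
# range Delta_a by uniform random sampling, per dimension, raised to the power d.
delta, Delta_a = 0.05, 2.0          # illustrative: a 0.05 window in a range of 2.0
for d in (1, 3, 7, 24):             # e.g. 7-DoF arm, 24-DoF dexterous hand
    p = (delta / Delta_a) ** d
    print(f"d={d:2d}  P(success by random) ~ {p:.1e}")
# Output: d= 1 -> 2.5e-02, d= 3 -> 1.6e-05, d= 7 -> 6.1e-12, d=24 -> 3.6e-39
```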

Empirical data:

| Task | Algorithm | Required Environment Steps | Wall-clock Time |
| --- | --- | --- | --- |
| Simple grasping | SAC | ~1M steps | ~100 hours |
| Dexterous manipulation | PPO | ~10B steps | ~1000 years |
| Quadruped walking | PPO | ~100M steps | ~10 years |

Clearly, running model-free RL directly on real robots is infeasible.

Solution: Massively Parallel Training in Simulation

Core idea: Run thousands of environments in parallel on GPU-accelerated simulators, compressing training time from years to hours.


Massively Parallel Training: Isaac Gym/Lab

Architecture Principles

NVIDIA Isaac Gym (Makoviychuk et al., 2021) implements end-to-end GPU simulation, eliminating the CPU-GPU data transfer bottleneck:

Traditional pipeline:

CPU Physics → CPU→GPU Transfer → GPU Neural Network → GPU→CPU Transfer → CPU Physics

Isaac Gym pipeline:

GPU Physics → GPU Neural Network → GPU Physics (all on GPU)

Mathematics of Parallel Environments

Let there be \(K\) parallel environments, each producing data \((o_t^k, a_t^k, r_t^k, o_{t+1}^k)\) at time step \(t\). The PPO policy gradient estimate becomes:

\[ \hat{g} = \frac{1}{K \cdot T} \sum_{k=1}^K \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^k | o_t^k) \hat{A}_t^k \]

The variance of this estimate scales as \(1/K\); with \(K = 4096\), it is roughly 4000x lower than with a single environment.
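As an illustration, here is a hedged PyTorch sketch of how per-sample terms from \(K\) parallel environments are averaged into a single update, using PPO's clipped surrogate loss rather than the raw gradient estimator above:

```python
import torch

def ppo_surrogate_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate averaged over K parallel envs and T steps.

    log_probs, old_log_probs, advantages : tensors of shape (K, T),
    produced by rolling out K environments in parallel on the GPU.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Mean over the K*T samples: this is the 1/(K*T) sum in the formula above.
    return -torch.min(unclipped, clipped).mean()
```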

Performance Comparison

| Simulator | Parallel Environments | Physics Engine | Frame Rate (4096 envs) |
| --- | --- | --- | --- |
| MuJoCo (CPU) | 1–32 | CPU | ~500 FPS |
| Isaac Gym | 4096+ | GPU (PhysX) | ~100,000 FPS |
| Isaac Lab | 4096+ | GPU (PhysX/MJX) | ~200,000 FPS |
| Genesis | 4096+ | GPU | ~400,000 FPS |

Isaac Gym achieves approximately 200x speedup, compressing training that would take days into minutes.


Asymmetric Actor-Critic

Core Idea

During simulation training, we have access to the complete physical state (privileged information), but only sensor observations are available during real deployment. Asymmetric Actor-Critic exploits this information asymmetry:

  • Critic (used only during training): Receives privileged state \(s\) (full physical information)
  • Actor (used during both training and deployment): Receives only available observations \(o\)
\[ \text{Critic: } V_\psi(s), \quad \text{Actor: } \pi_\theta(a|o) \]
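A minimal PyTorch sketch of this asymmetric architecture; network sizes and activations are chosen arbitrarily:

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Actor conditioned on observations o; critic conditioned on privileged state s."""

    def __init__(self, obs_dim, state_dim, act_dim, hidden=256):
        super().__init__()
        self.actor = nn.Sequential(              # deployed on the robot: sees only o
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim))
        self.critic = nn.Sequential(             # used only in simulation: sees full s
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, state):
        action_mean = self.actor(obs)            # pi_theta(a | o)
        value = self.critic(state)               # V_psi(s)
        return action_mean, value
```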

Teacher-Student Distillation

A more advanced approach is two-stage training:

Stage 1 — Teacher Training:

Train a "teacher" policy \(\pi_{\text{teacher}}(a|s)\) using privileged information, where \(s\) includes:

  • Terrain height maps
  • Precise object poses
  • Contact forces
  • Friction coefficients
  • Other simulator internal variables

Stage 2 — Student Distillation:

Train a "student" policy \(\pi_{\text{student}}(a|o)\) to imitate the teacher, where \(o\) contains only information available from real sensors:

\[ \mathcal{L}_{\text{distill}} = \mathbb{E}_{o, s \sim \text{rollout}} \left[ D_{\text{KL}}(\pi_{\text{teacher}}(\cdot|s) \| \pi_{\text{student}}(\cdot|o)) \right] \]

```mermaid
flowchart LR
    subgraph Stage1["Stage 1: Teacher Training"]
        S1[Privileged State s] --> T[Teacher Policy<br/>pi_teacher]
        T --> A1[Action a]
        A1 --> E1[Simulation Environment]
        E1 --> R[Reward r]
        R --> RL[RL Algorithm<br/>PPO]
        RL --> T
    end

    subgraph Stage2["Stage 2: Student Distillation"]
        S2[Sensor Observation o] --> ST[Student Policy<br/>pi_student]
        ST --> A2[Action a]
        T2[Teacher Policy<br/>Frozen] --> KL[KL Divergence Loss]
        ST --> KL
        KL --> ST
    end

    Stage1 --> Stage2

    style Stage1 fill:#e8f5e9
    style Stage2 fill:#e3f2fd
```
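A sketch of the distillation loss for Gaussian policies, assuming both teacher and student output an action mean and standard deviation:

```python
import torch
from torch.distributions import Normal, kl_divergence

def distillation_loss(teacher_mean, teacher_std, student_mean, student_std):
    """KL(pi_teacher(.|s) || pi_student(.|o)) for Gaussian policies, averaged over a batch.

    teacher_* come from the frozen teacher evaluated on privileged state s;
    student_* come from the student evaluated on sensor observations o.
    """
    teacher = Normal(teacher_mean.detach(), teacher_std.detach())  # teacher is frozen
    student = Normal(student_mean, student_std)
    return kl_divergence(teacher, student).sum(dim=-1).mean()
```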

Key Achievements

OpenAI Rubik's Cube (2019)

Task: Rotate a Rubik's cube in-hand using the Shadow Hand dexterous manipulator.

Technical Stack:

  • Algorithm: PPO + Automatic Domain Randomization (ADR)
  • Simulation: MuJoCo, ~13,000 CPUs + 8 GPUs
  • Training: Equivalent to tens of thousands of years of experience
  • Randomized parameters: Object size, mass, friction, gravity, observation noise, and hundreds of other parameters

ADR Mechanism: Start with a narrow randomization range and widen it automatically whenever the policy's success rate within the current range exceeds a threshold. Update rule for the randomization range \([\phi_{\text{low}}, \phi_{\text{high}}]\):

\[ \phi_{\text{high}} \leftarrow \phi_{\text{high}} + \Delta \phi \quad \text{if} \quad \text{success\_rate} > \eta \]
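A simplified sketch of such a range update; the thresholds and step size are illustrative, and the full ADR algorithm also adjusts the lower bound and measures performance at the range boundaries:

```python
def adr_update(phi_low, phi_high, success_rate,
               eta_expand=0.8, eta_shrink=0.4, delta_phi=0.02):
    """Automatic Domain Randomization range update (simplified sketch).

    Expand the randomization range when the policy performs well at the current
    difficulty, shrink it when performance collapses.
    """
    if success_rate > eta_expand:
        phi_high += delta_phi                              # widen the range
    elif success_rate < eta_shrink:
        phi_high = max(phi_low, phi_high - delta_phi)      # too hard: back off
    return phi_low, phi_high
```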

ANYmal Extreme Locomotion (2022–2024)

Achievement: The ETH Zurich team used RL to train the quadruped robot ANYmal to perform parkour, climbing, and jumping.

Key Techniques:

  • Curriculum learning: Gradually increasing terrain difficulty
  • Reward design: Velocity tracking + stability + energy efficiency
  • Teacher-Student: Teacher uses height scans (privileged information), Student uses proprioceptive history

Terrain Curriculum:

\[ \text{difficulty}_i \leftarrow \text{difficulty}_i + \alpha \cdot (\text{success\_rate}_i - \text{threshold}) \]
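A sketch of this per-terrain curriculum update in NumPy; the learning rate and threshold are placeholders:

```python
import numpy as np

def update_terrain_difficulty(difficulty, success_rate,
                              alpha=0.1, threshold=0.6):
    """Per-terrain-type curriculum update following the rule above.

    difficulty, success_rate : arrays indexed by terrain type i.
    Difficulty rises where the policy succeeds more often than the threshold
    and falls where it struggles; values are clipped to [0, 1].
    """
    difficulty = np.asarray(difficulty) + alpha * (np.asarray(success_rate) - threshold)
    return np.clip(difficulty, 0.0, 1.0)
```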

Dexterous Manipulation (2023–2025)

In recent years, RL has made rapid progress in dexterous manipulation:

| Work | Task | Key Innovation |
| --- | --- | --- |
| DexPoint (2023) | Dexterous grasping | Point cloud input + RL |
| RotateIt (2023) | In-hand object rotation | Tactile + vision + RL |
| AnyRotate (2024) | Arbitrary object rotation | No object model required |
| DexCatch (2024) | Dynamic toss and catch | High-speed visual feedback |

Reward Learning: Automating Reward Design

Learning Rewards from Human Preferences (RLHF for Robots)

Inspired by RLHF in LLMs, the robotics community has begun exploring learning rewards from human preferences:

Given trajectory pairs \((\tau_A, \tau_B)\) with human preference labels \(y \in \{A, B\}\), train a reward model \(r_\psi\):

\[ P(A \succ B) = \frac{\exp(\sum_t r_\psi(s_t^A, a_t^A))}{\exp(\sum_t r_\psi(s_t^A, a_t^A)) + \exp(\sum_t r_\psi(s_t^B, a_t^B))} \]

This is the application of the Bradley-Terry model to trajectory pairs.
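A hedged PyTorch sketch of the corresponding training loss for the reward model \(r_\psi\); `reward_model` is assumed to map concatenated state-action features to per-step rewards:

```python
import torch

def preference_loss(reward_model, traj_a, traj_b, prefer_a):
    """Bradley-Terry loss for a learned reward model on one trajectory pair.

    traj_a, traj_b : tensors of shape (T, state_dim + act_dim) (concatenated s, a)
    prefer_a       : tensor with value 1.0 if the human preferred trajectory A, else 0.0
    """
    # Trajectory "scores" are the summed predicted per-step rewards.
    score_a = reward_model(traj_a).sum()
    score_b = reward_model(traj_b).sum()
    # P(A > B) is the softmax over the two scores -- the Bradley-Terry model.
    log_p = torch.log_softmax(torch.stack([score_a, score_b]), dim=0)
    return -(prefer_a * log_p[0] + (1 - prefer_a) * log_p[1]).mean()
```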

LLM-Generated Reward Functions

The latest trend is using LLMs to directly generate reward code:

  1. Describe the task in natural language
  2. LLM generates a Python reward function
  3. Train an RL policy in simulation
  4. Iteratively optimize the reward based on behavioral feedback

Representative works: Eureka (Ma et al., 2023), Language2Reward (Yu et al., 2023).
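A heavily simplified sketch of this outer loop; `generate_reward_code` and `train_and_evaluate` are hypothetical callables standing in for the LLM query and the simulation training run, not APIs from Eureka or any library:

```python
def llm_reward_design_loop(task_description, generate_reward_code, train_and_evaluate,
                           n_iterations=5, n_candidates=4):
    """Outer loop of LLM-driven reward design, heavily simplified.

    generate_reward_code(task, feedback, n) -> list of reward-function source strings
    train_and_evaluate(code) -> (task_score, textual training feedback)
    """
    best_code, best_score, feedback = None, float("-inf"), ""
    for _ in range(n_iterations):
        # Steps 1-2: ask the LLM for candidate reward functions.
        candidates = generate_reward_code(task_description, feedback, n_candidates)
        for code in candidates:
            # Step 3: train an RL policy in simulation under this candidate reward.
            score, feedback = train_and_evaluate(code)
            # Step 4: keep the best candidate; feedback conditions the next LLM query.
            if score > best_score:
                best_code, best_score = code, score
    return best_code
```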


Model-Based RL

Dynamics Model Learning

Model-Based RL learns an environment dynamics model \(\hat{T}(s'|s, a)\), then uses the model for planning or generating simulated data.

Neural network dynamics model:

\[ \hat{s}_{t+1} = f_\theta(s_t, a_t), \quad \theta^* = \arg\min_\theta \sum_{(s, a, s') \in \mathcal{D}} \|f_\theta(s, a) - s'\|^2 \]

Probabilistic model (for uncertainty estimation):

\[ \hat{s}_{t+1} \sim \mathcal{N}(\mu_\theta(s_t, a_t), \sigma_\theta^2(s_t, a_t)) \]
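A minimal PyTorch sketch of such a probabilistic dynamics model, trained by maximizing the Gaussian log-likelihood of observed transitions; layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """Predicts a Gaussian over the next state: s_{t+1} ~ N(mu(s,a), sigma^2(s,a))."""

    def __init__(self, state_dim, act_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * state_dim))     # outputs [mu, log_sigma]

    def forward(self, s, a):
        mu, log_sigma = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mu, log_sigma.clamp(-5.0, 2.0).exp()

def dynamics_nll(model, s, a, s_next):
    """Negative log-likelihood on a batch of transitions (s, a, s') from the dataset D."""
    mu, sigma = model(s, a)
    return -torch.distributions.Normal(mu, sigma).log_prob(s_next).sum(-1).mean()
```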

MPC with Learned Model

After learning a dynamics model, Model Predictive Control (MPC) is used for online planning:

\[ a_t^* = \arg\min_{a_{t:t+H}} \sum_{k=0}^{H} c(s_{t+k}, a_{t+k}), \quad \text{s.t. } s_{t+k+1} = f_\theta(s_{t+k}, a_{t+k}) \]

where \(H\) is the planning horizon and \(c\) is the cost function.
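A sketch of a simple random-shooting variant of this planner, assuming a deterministic next-state prediction from the learned model (e.g., its mean); real systems often use CEM or gradient-based optimizers instead:

```python
import torch

def random_shooting_mpc(model, cost_fn, s_t, act_dim, horizon=15, n_samples=512,
                        act_low=-1.0, act_high=1.0):
    """Plan with the learned model by sampling action sequences and picking the best.

    model   : learned dynamics, model(s, a) -> predicted next state
    cost_fn : per-step cost c(s, a), returning one value per sampled sequence
    Returns the first action of the lowest-cost sampled sequence.
    """
    # Sample N candidate action sequences of length H.
    actions = torch.empty(n_samples, horizon, act_dim).uniform_(act_low, act_high)
    states = s_t.expand(n_samples, -1).clone()
    total_cost = torch.zeros(n_samples)
    for k in range(horizon):
        a_k = actions[:, k]
        total_cost += cost_fn(states, a_k)
        states = model(states, a_k)          # roll the learned model forward
    best = torch.argmin(total_cost)
    return actions[best, 0]                  # execute only the first action (receding horizon)
```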

Advantages: High sample efficiency (typically <1000 real interaction steps to learn simple tasks)

Disadvantages: Model error accumulation, high computational overhead


Related Topics

  • Control theory foundations: Robotics Fundamentals covers PID and MPC as complementary methods to RL
  • Sim2Real transfer: Sim2Real details how to deploy RL policies trained in simulation to real environments
  • Simulation platforms: Simulation Platforms introduces Isaac Gym/Lab and other training infrastructure
  • Imitation learning: Imitation Learning is an important complement to RL; the two are often combined (e.g., RL fine-tuning from IL initialization)

References

  1. Makoviychuk, V., et al. (2021). Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning. NeurIPS Datasets and Benchmarks.
  2. OpenAI (2019). Solving Rubik's Cube with a Robot Hand. arXiv:1910.07113.
  3. Miki, T., et al. (2022). Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild. Science Robotics.
  4. Ma, Y., et al. (2023). Eureka: Human-Level Reward Design via Coding Large Language Models. ICLR 2024.
  5. Haarnoja, T., et al. (2024). Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning. Science Robotics.
