Sim2Real: Transferring from Simulation to Reality
Overview
Sim2Real (Simulation-to-Reality) is a key technology in robot learning that bridges simulation training and real-world deployment. The core problem is the Reality Gap: simulators cannot perfectly reproduce the physical properties of the real world, causing policies that perform well in simulation to fail in real environments.
The research goal of Sim2Real is to develop systematic methods that enable policies trained in simulation to transfer robustly to real robots.
Sources of the Reality Gap
The gap between simulation and reality arises from multiple levels:
Dynamics Gap
Simulators exhibit systematic biases in modeling physical processes:
| Physical Phenomenon | Simulation Approximation | Real-world Behavior |
|---|---|---|
| Contact | Penetration penalty forces / constraints | Complex deformation, friction |
| Friction | Coulomb model \(f = \mu N\) | Nonlinear, anisotropic |
| Soft bodies | Finite elements / mass-spring | Continuous deformation |
| Motors | Ideal torque sources | Delay, nonlinearity, thermal effects |
| Sensors | Ideal values + Gaussian noise | Complex noise, bias drift |
Visual Gap
| Dimension | Simulation Rendering | Real Images |
|---|---|---|
| Lighting | Simplified lighting models | Complex ambient light, reflections |
| Textures | Limited texture library | Infinite diversity |
| Camera | Ideal pinhole model | Distortion, chromatic aberration, noise |
| Occlusion | Perfect depth | Sensor noise, holes |
Domain Randomization
Core Idea
The core hypothesis of Domain Randomization (DR) is: if a policy can succeed across a large number of randomized simulation environments, then the real environment is simply "one sample" among these randomized environments.
Formalization: Define the simulation parameter vector \(\xi \in \Xi\), sampled from distribution \(p(\xi)\) at the beginning of each training episode. The policy is trained to maximize expected return averaged over this distribution:
\[
\max_\theta \; \mathbb{E}_{\xi \sim p(\xi)} \, \mathbb{E}_{\tau \sim p(\tau \mid \pi_\theta, \xi)} \left[ \sum_t \gamma^t r(s_t, a_t) \right]
\]
Physical Parameter Randomization
Typical physical parameter randomization ranges:
| Parameter | Symbol | Default Value | Randomization Range |
|---|---|---|---|
| Friction coefficient | \(\mu\) | 1.0 | [0.2, 2.0] |
| Object mass | \(m\) | Nominal value | [0.5x, 2.0x] |
| Damping coefficient | \(b\) | Nominal value | [0.5x, 3.0x] |
| Joint backlash | \(\delta\) | 0 | [0, 0.02] rad |
| Actuator delay | \(\Delta t\) | 0 | [0, 30] ms |
| Gravity | \(g\) | 9.81 | [9.4, 10.2] m/s\(^2\) |
| Terrain friction | \(\mu_g\) | 0.7 | [0.3, 1.2] |
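In practice, one parameter vector is drawn from these ranges at the start of each episode. A minimal sketch, assuming a uniform \(p(\xi)\); the dictionary keys are illustrative names, not tied to any particular simulator API:

```python
import random

# Randomization ranges from the table above (illustrative names).
RANGES = {
    "friction": (0.2, 2.0),       # coefficient mu
    "mass_scale": (0.5, 2.0),     # multiplier on nominal mass
    "damping_scale": (0.5, 3.0),  # multiplier on nominal damping
    "backlash_rad": (0.0, 0.02),  # joint backlash
    "delay_s": (0.0, 0.030),      # actuator delay
    "gravity": (9.4, 10.2),       # m/s^2
}

def sample_episode_params(rng=random):
    """Draw one parameter vector xi ~ p(xi) at episode start (uniform p)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

params = sample_episode_params()
```

Each episode then builds its simulation instance from `params`, so the policy never sees the same environment twice.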
Visual Randomization
Visual domain randomization introduces variations at the rendering level:
- Texture randomization: Randomly replacing object and background textures
- Lighting randomization: Direction, intensity, color, number of lights
- Camera randomization: Position offset, field of view, white balance
- Distractors: Randomly adding irrelevant objects to the scene
Automatic Domain Randomization (ADR)
Manually setting randomization ranges requires domain expertise. Automatic Domain Randomization (ADR, OpenAI 2019) adjusts them automatically:
Algorithm: For each parameter \(\xi_i\), maintain a range \([\xi_i^{\text{low}}, \xi_i^{\text{high}}]\). Within an evaluation window:
- Occasionally pin \(\xi_i\) at one of its current boundary values and record the policy's performance there
- If the average boundary performance exceeds an upper threshold \(t_H\), push that boundary outward (widen the range)
- If it falls below a lower threshold \(t_L\), pull the boundary inward (narrow the range)

This allows the randomization range to adaptively expand or contract during training, keeping the policy trained at the edge of what it can currently handle.
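A sketch of the per-parameter boundary update rule; thresholds and step size are illustrative, not the values used by OpenAI:

```python
def adr_update(low, high, boundary_perf, which, t_low=0.2, t_high=0.8, step=0.1):
    """Update one parameter's range after evaluating the policy with the
    parameter pinned at the `which` boundary ('low' or 'high')."""
    if boundary_perf >= t_high:      # policy copes: push boundary outward
        if which == "high":
            high += step
        else:
            low -= step
    elif boundary_perf <= t_low:     # policy fails: pull boundary inward
        if which == "high":
            high -= step
        else:
            low += step
    return low, high

# e.g. friction range [0.2, 2.0], success rate 0.9 at the upper boundary:
lo, hi = adr_update(0.2, 2.0, boundary_perf=0.9, which="high")
# the upper boundary expands by one step
```

Performance between the two thresholds leaves the range unchanged, which keeps the expansion conservative.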
System Identification
Core Idea
Unlike domain randomization's approach of "training to be robust across all environments," System Identification (SysID) aims to make the simulation as close to reality as possible.
Parameter Identification
Given the simulator model \(f_\xi(s, a)\) and real robot trajectories \(\{(s_t^{\text{real}}, a_t^{\text{real}}, s_{t+1}^{\text{real}})\}\), optimize the simulation parameters by minimizing the one-step prediction error:
\[
\xi^* = \arg\min_{\xi} \sum_t \left\| f_\xi(s_t^{\text{real}}, a_t^{\text{real}}) - s_{t+1}^{\text{real}} \right\|^2
\]
Bayesian Optimization
When the objective function is non-differentiable (e.g., the simulator is a black box), Bayesian optimization is used:
- Execute a standard action sequence on the real robot and record trajectory \(\tau^{\text{real}}\)
- Execute the same action sequence in simulation to obtain \(\tau^{\text{sim}}(\xi)\)
- Define a distance metric \(d(\tau^{\text{real}}, \tau^{\text{sim}}(\xi))\)
- Model \(d(\xi)\) with a Gaussian process and use Bayesian optimization to search for \(\xi^*\)
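The four steps above can be sketched end-to-end on a toy problem. To stay dependency-free, a random search stands in for the GP surrogate and acquisition function (the outer loop is the same), and the "simulator" is a one-parameter damped integrator; all names and constants are illustrative:

```python
import random

def rollout(xi, actions, x0=0.0):
    """Toy 'simulator': damped integrator whose damping is the unknown xi."""
    x, traj = x0, []
    for a in actions:
        x = x + a - xi * x   # one Euler step; xi plays the role of damping
        traj.append(x)
    return traj

def traj_distance(t1, t2):
    """Distance metric d between two trajectories (sum of squared errors)."""
    return sum((u - v) ** 2 for u, v in zip(t1, t2))

actions = [0.5] * 20                 # the standard action sequence
tau_real = rollout(0.37, actions)    # pretend 0.37 is the true (real) damping

# In place of a GP surrogate + acquisition function, plain random search
# over xi illustrates the same identification loop.
best_xi, best_d = None, float("inf")
rng = random.Random(0)
for _ in range(500):
    xi = rng.uniform(0.0, 1.0)
    d = traj_distance(tau_real, rollout(xi, actions))
    if d < best_d:
        best_xi, best_d = xi, d
```

With a real black-box simulator each evaluation is expensive, which is exactly why the GP surrogate is worth the extra machinery.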
Online Adaptation
Going further, environment parameters can be estimated online during deployment. Treat \(\xi\) as a latent variable and infer it from observation history:
\[
\hat{\xi}_t = g_\phi\left( o_{t-L:t}, a_{t-L:t} \right)
\]
where \(g_\phi\) is an encoder (typically an RNN or Transformer) that infers current environment parameters from the most recent \(L\) steps of observation-action history.
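For a system where \(\xi\) enters the dynamics linearly, the job of \(g_\phi\) can be mimicked in closed form, which makes the idea concrete. The toy dynamics \(s' = s + a - \xi s\) below are illustrative; a learned RNN/Transformer encoder does the same job implicitly for systems with no closed form:

```python
from collections import deque

class OnlineEstimator:
    """Stand-in for g_phi: recover xi from the last L transitions of the
    toy linear system  s' = s + a - xi * s  by least squares."""

    def __init__(self, window=10):
        self.buf = deque(maxlen=window)   # rolling (s, a, s') history

    def observe(self, s, a, s_next):
        self.buf.append((s, a, s_next))

    def estimate(self):
        # Residual r = s + a - s' equals xi * s, so solve min_xi sum (r - xi*s)^2.
        num = sum(s * (s + a - sn) for s, a, sn in self.buf)
        den = sum(s * s for s, a, sn in self.buf)
        return num / den if den > 0 else 0.0
```

Running the toy system with a fixed \(\xi\) and feeding the transitions into `observe` recovers that \(\xi\) from the window alone, with no access to the parameter itself.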
Domain Adaptation
Adversarial Feature Alignment
Domain adaptation bridges the gap by learning domain-invariant features. The core method is adversarial training:
Architecture:
- Feature extractor \(F_\theta: \mathcal{O} \rightarrow \mathcal{Z}\)
- Task head \(C_\psi: \mathcal{Z} \rightarrow \mathcal{A}\)
- Domain discriminator \(D_\omega: \mathcal{Z} \rightarrow \{0, 1\}\) (0=sim, 1=real)
Objective Function:
\[
\min_{\theta, \psi} \max_{\omega} \; \mathcal{L}_{\text{task}}(F_\theta, C_\psi) - \lambda \, \mathcal{L}_{\text{domain}}(F_\theta, D_\omega)
\]
where \(\mathcal{L}_{\text{domain}}\) is the discriminator's binary cross-entropy over sim/real labels.
Through the Gradient Reversal Layer, the feature extractor \(F_\theta\) simultaneously minimizes the task loss and maximizes the confusion of the domain discriminator, thereby learning domain-invariant features.
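The reversal itself is simple: identity in the forward pass, gradient scaled by \(-\lambda\) in the backward pass. A framework-free scalar sketch (in a real implementation this would be a custom autograd function):

```python
class GradReverse:
    """Gradient reversal layer: forward is the identity, backward flips the
    sign of the incoming gradient and scales it by lambda."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, z):
        return z   # features pass through unchanged

    def backward(self, grad_from_discriminator):
        # The discriminator's gradient is flipped before reaching the feature
        # extractor, so F_theta ascends the domain loss (confusing D) while
        # D itself still descends it.
        return -self.lam * grad_from_discriminator

grl = GradReverse(lam=0.5)
assert grl.forward(3.0) == 3.0
assert grl.backward(2.0) == -1.0
```

This is why a single optimizer step suffices: the saddle-point objective is realized by the sign flip rather than by alternating updates.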
Image-Level Transfer
Using image-to-image translation (e.g., CycleGAN) to convert simulation images to a style closer to reality: learn a generator \(G_{s \to r}: \mathcal{O}_{\text{sim}} \rightarrow \mathcal{O}_{\text{real}}\) and its inverse \(G_{r \to s}\), trained adversarially against per-domain discriminators.
With cycle consistency loss:
\[
\mathcal{L}_{\text{cyc}} = \mathbb{E}_{x}\left[ \left\| G_{r \to s}(G_{s \to r}(x)) - x \right\|_1 \right] + \mathbb{E}_{y}\left[ \left\| G_{s \to r}(G_{r \to s}(y)) - y \right\|_1 \right]
\]
which ensures the translation preserves task-relevant content rather than inventing a new scene.
Teacher-Student Distillation
Complete Training Pipeline
Teacher-Student is one of the most successful Sim2Real frameworks, especially in quadruped locomotion control.
```mermaid
flowchart TD
subgraph SIM["Simulation Environment (GPU Parallel)"]
ENV[4096 Parallel Environments<br/>Domain Randomization]
end
subgraph TEACHER["Stage 1: Teacher Training"]
OBS_T[Privileged Observations<br/>Terrain Scan + Contact Forces<br/>Friction Coefficients + Object Poses]
ACTOR_T[Teacher Actor<br/>MLP 256-256-256]
CRITIC_T[Teacher Critic<br/>MLP 512-256-128]
PPO[PPO Algorithm]
OBS_T --> ACTOR_T
OBS_T --> CRITIC_T
ACTOR_T --> PPO
CRITIC_T --> PPO
end
subgraph STUDENT["Stage 2: Student Distillation"]
OBS_S[Sensor Observations<br/>Joint Angles + IMU<br/>+ Action History]
ENCODER[History Encoder<br/>RNN/Transformer]
ACTOR_S[Student Actor<br/>MLP 256-256-256]
KL[KL Divergence Distillation Loss]
OBS_S --> ENCODER
ENCODER --> ACTOR_S
ACTOR_T -.->|Frozen Weights| KL
ACTOR_S --> KL
end
subgraph DEPLOY["Stage 3: Real Deployment"]
REAL_OBS[Real Sensors] --> ACTOR_DEPLOY[Student Actor<br/>Proprioception Only]
ACTOR_DEPLOY --> REAL_ACT[Motor Commands<br/>@ 50Hz]
end
SIM --> TEACHER
TEACHER --> STUDENT
STUDENT --> DEPLOY
style SIM fill:#fff3e0
style TEACHER fill:#e8f5e9
style STUDENT fill:#e3f2fd
style DEPLOY fill:#fce4ec
```
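The Stage-2 distillation loss in the pipeline is a KL divergence between the frozen teacher's and the student's action distributions; for diagonal Gaussian policies it has a closed form per action dimension. A scalar sketch (function name illustrative):

```python
import math

def kl_gauss(mu_t, sigma_t, mu_s, sigma_s):
    """KL( teacher N(mu_t, sigma_t^2) || student N(mu_s, sigma_s^2) ) for one
    action dimension; sum over dimensions for a full action vector."""
    return (math.log(sigma_s / sigma_t)
            + (sigma_t ** 2 + (mu_t - mu_s) ** 2) / (2 * sigma_s ** 2)
            - 0.5)

# Identical distributions give zero loss; minimizing this per time step
# pulls the student toward the frozen teacher's actions.
assert kl_gauss(0.1, 0.2, 0.1, 0.2) == 0.0
assert kl_gauss(0.0, 0.2, 0.5, 0.2) > 0.0
```

Because the teacher is frozen, only the student (and its history encoder) receives gradients from this loss.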
Privileged Information Design
Teacher's privileged information (available in simulation, unavailable in reality):
| Privileged Information | Dimension | Description |
|---|---|---|
| Terrain height scan | \(\mathbb{R}^{187}\) | Height sample points around feet |
| External forces | \(\mathbb{R}^{3}\) | Perturbation forces applied to the torso |
| Friction coefficient | \(\mathbb{R}^{1}\) | Current ground friction |
| Payload mass | \(\mathbb{R}^{1}\) | Additional load on the back |
| Motor strength | \(\mathbb{R}^{12}\) | Actual gain for each motor |
Student's available information:
| Observation | Dimension | Description |
|---|---|---|
| Joint angles | \(\mathbb{R}^{12}\) | Encoder readings |
| Joint angular velocities | \(\mathbb{R}^{12}\) | Encoder differentials |
| IMU | \(\mathbb{R}^{6}\) | Angular velocity + acceleration |
| Action history | \(\mathbb{R}^{12 \times L}\) | Past \(L\) steps of actions |
| Velocity command | \(\mathbb{R}^{3}\) | \((v_x, v_y, \omega_z)\) |
History Encoder
The Student implicitly infers environment parameters through historical information. The encoder maps the past \(L\) steps of observation-action pairs to a latent variable:
\[
z_t = E_\phi\left( o_{t-L:t-1}, a_{t-L:t-1} \right)
\]
Then the Student policy is conditioned on \(z_t\):
\[
a_t = \pi_\theta(o_t, z_t)
\]
Intuitively, \(z_t\) encodes an implicit estimate of the current terrain type, friction, payload, and similar information.
Sim2Real Best Practices
Success Factors
- Sufficient domain randomization: Cover the range of possible variations in the real environment
- Accurate default parameters: Determine nominal parameters through system identification
- Robust observations: Avoid relying on signals that are easy in simulation but inaccurate in reality
- Delay modeling: Inject communication/computation delays in simulation
- Noise injection: Sensor noise, actuator noise
- Action smoothing: Penalize abrupt action changes
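Three of these practices — delay modeling, noise injection, and action smoothing — can be combined in a toy actuator wrapper. A sketch; the class name and constants are illustrative:

```python
import random
from collections import deque

class ActuatorModel:
    """Applies a fixed command delay, additive Gaussian noise, and reports a
    smoothness penalty for abrupt action changes."""

    def __init__(self, delay_steps=2, noise_std=0.01, rng=None):
        self.buf = deque([0.0] * delay_steps)   # pending (delayed) actions
        self.noise_std = noise_std
        self.rng = rng or random.Random(0)
        self.prev = 0.0

    def step(self, action):
        self.buf.append(action)
        applied = self.buf.popleft()            # command issued delay_steps ago
        applied += self.rng.gauss(0.0, self.noise_std)
        smooth_penalty = (action - self.prev) ** 2   # add to the reward as a cost
        self.prev = action
        return applied, smooth_penalty
```

With `delay_steps=2` the first two applied commands come from the zero padding, mimicking a real control loop that has not yet seen the policy's output. Randomizing `delay_steps` and `noise_std` per episode folds this directly into domain randomization.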
Common Failure Modes
| Failure Mode | Cause | Solution |
|---|---|---|
| Action jitter | No motor dynamics in simulation | Add low-pass filtering and smoothness penalties |
| Cannot walk | Inaccurate friction model | Wider friction randomization range |
| Grasping failure | Contact model discrepancy | Force/torque feedback + domain randomization |
| Vision failure | Unrealistic rendering | Visual domain randomization + domain adaptation |
| Delay sensitivity | Unmodeled delays | Inject 10–30ms random delays |
Evaluation Metrics
Transfer Success Rate
The most direct metric is the ratio of task success rate in the real environment to that in simulation:
\[
R_{\text{transfer}} = \frac{\text{SuccessRate}_{\text{real}}}{\text{SuccessRate}_{\text{sim}}}
\]
Ideally close to 1.0. In practice:
- Simple tasks (e.g., quadruped walking): 0.8–0.95
- Manipulation tasks (e.g., dexterous manipulation): 0.3–0.7
- Vision tasks (e.g., image-based manipulation): 0.2–0.6
Zero-Shot vs. Few-Shot Transfer
- Zero-Shot: Directly deploy after simulation training without any real data
- Few-Shot: Simulation pretraining + fine-tuning with a small amount of real data
Connections to Other Chapters
- Simulation platforms: Simulation Platforms provides detailed introductions to simulators such as Isaac Gym/Lab and MuJoCo
- Reinforcement learning: Reinforcement Learning in Robotics details RL training methods in simulation
- Deployment practice: Real-world Deployment covers engineering details of Sim2Real deployment
- Control theory: Robotics Fundamentals provides safety guarantees from control theory for Sim2Real
References
- Tobin, J., et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS.
- OpenAI (2019). Solving Rubik's Cube with a Robot Hand. arXiv:1910.07113.
- Miki, T., et al. (2022). Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild. Science Robotics.
- Peng, X.B., et al. (2018). Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. ICRA.
- Rudin, N., et al. (2022). Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. CoRL.