
Sim2Real: Transferring from Simulation to Reality

Overview

Sim2Real (Simulation-to-Reality) is a key technology in robot learning that bridges simulation training and real-world deployment. The core problem is the Reality Gap: simulators cannot perfectly reproduce the physical properties of the real world, causing policies that perform well in simulation to fail in real environments.

The research goal of Sim2Real is to develop systematic methods that enable policies trained in simulation to transfer robustly to real robots.


Sources of the Reality Gap

The gap between simulation and reality arises from multiple levels:

Dynamics Gap

Simulators exhibit systematic biases in modeling physical processes:

| Physical Phenomenon | Simulation Approximation | Real-world Behavior |
| --- | --- | --- |
| Contact | Penetration penalty forces / constraints | Complex deformation, friction |
| Friction | Coulomb model \(f = \mu N\) | Nonlinear, anisotropic |
| Soft bodies | Finite elements / mass-spring | Continuous deformation |
| Motors | Ideal torque sources | Delay, nonlinearity, thermal effects |
| Sensors | Ideal values + Gaussian noise | Complex noise, bias drift |

Visual Gap

| Dimension | Simulation Rendering | Real Images |
| --- | --- | --- |
| Lighting | Simplified lighting models | Complex ambient light, reflections |
| Textures | Limited texture library | Infinite diversity |
| Camera | Ideal pinhole model | Distortion, chromatic aberration, noise |
| Occlusion | Perfect depth | Sensor noise, holes |

Domain Randomization

Core Idea

The core hypothesis of Domain Randomization (DR) is: if a policy can succeed across a large number of randomized simulation environments, then the real environment is simply "one sample" among these randomized environments.

Formalization: Define the simulation parameter vector \(\xi \in \Xi\), sampled from distribution \(p(\xi)\) at the beginning of each training episode:

\[ \xi \sim p(\xi), \quad \pi^* = \arg\max_\pi \mathbb{E}_{\xi \sim p(\xi)} \left[ \mathbb{E}_\pi \left[ \sum_t \gamma^t r_t \mid \xi \right] \right] \]

Physical Parameter Randomization

Typical physical parameter randomization ranges:

| Parameter | Symbol | Default Value | Randomization Range |
| --- | --- | --- | --- |
| Friction coefficient | \(\mu\) | 1.0 | [0.2, 2.0] |
| Object mass | \(m\) | Nominal value | [0.5x, 2.0x] |
| Damping coefficient | \(b\) | Nominal value | [0.5x, 3.0x] |
| Joint backlash | \(\delta\) | 0 | [0, 0.02] rad |
| Actuator delay | \(\Delta t\) | 0 | [0, 30] ms |
| Gravity | \(g\) | 9.81 m/s\(^2\) | [9.4, 10.2] m/s\(^2\) |
| Terrain friction | \(\mu_g\) | 0.7 | [0.3, 1.2] |
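As a concrete sketch, the per-episode sampling \(\xi \sim p(\xi)\) can be implemented as follows. The parameter names and the choice of a uniform distribution are illustrative; the ranges mirror the table above:

```python
import random

# Hypothetical randomization ranges (uniform sampling per episode).
RANGES = {
    "friction": (0.2, 2.0),            # friction coefficient mu
    "mass_scale": (0.5, 2.0),          # multiplier on the nominal mass
    "actuator_delay_ms": (0.0, 30.0),  # actuator delay Delta t
}

def sample_episode_params(rng):
    """Draw one simulation parameter vector xi ~ p(xi)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

rng = random.Random(0)
xi = sample_episode_params(rng)  # resampled at the start of every episode
```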

Visual Randomization

Visual domain randomization introduces variations at the rendering level:

  • Texture randomization: Randomly replacing object and background textures
  • Lighting randomization: Direction, intensity, color, number of lights
  • Camera randomization: Position offset, field of view, white balance
  • Distractors: Randomly adding irrelevant objects to the scene

Automatic Domain Randomization (ADR)

Manually setting randomization ranges requires domain expertise. Automatic Domain Randomization (ADR, OpenAI 2019) adjusts them automatically:

Algorithm: For each parameter \(\xi_i\), maintain a range \([\xi_i^{\text{low}}, \xi_i^{\text{high}}]\). Within an evaluation window:

\[ \text{If success\_rate} > \eta_{\text{up}}: \quad \xi_i^{\text{high}} \leftarrow \xi_i^{\text{high}} + \Delta\xi_i \]

\[ \text{If success\_rate} < \eta_{\text{down}}: \quad \xi_i^{\text{high}} \leftarrow \xi_i^{\text{high}} - \Delta\xi_i \]

This allows the randomization range to adaptively expand or contract during training.
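The update rule fits in a few lines. A minimal sketch, with illustrative thresholds and step size (not OpenAI's exact values):

```python
# Sketch of the ADR range update for one parameter xi_i.
def adr_update(high, success_rate, delta, eta_up=0.8, eta_down=0.4, low=0.0):
    """Expand or shrink the upper bound of one parameter's range."""
    if success_rate > eta_up:
        high = high + delta            # policy succeeds: widen the range
    elif success_rate < eta_down:
        high = max(low, high - delta)  # too hard: narrow the range
    return high

high = 1.0
high = adr_update(high, success_rate=0.9, delta=0.1)  # widened
high = adr_update(high, success_rate=0.2, delta=0.1)  # narrowed back
```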


System Identification

Core Idea

Unlike domain randomization's approach of "training to be robust across all environments," System Identification (SysID) aims to make the simulation as close to reality as possible.

Parameter Identification

Given the simulator model \(f_\xi(s, a)\) and real robot trajectories \(\{(s_t^{\text{real}}, a_t^{\text{real}}, s_{t+1}^{\text{real}})\}\), optimize the simulation parameters:

\[ \xi^* = \arg\min_\xi \sum_t \| f_\xi(s_t^{\text{real}}, a_t^{\text{real}}) - s_{t+1}^{\text{real}} \|^2 \]
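A toy version of this least-squares identification, using a 1-D damped system whose damping coefficient plays the role of \(\xi\). The dynamics model and all values here are illustrative:

```python
import numpy as np

dt = 0.01
b_true = 0.7  # unknown "real-world" damping to be identified

def step(b, s, a):
    """One-step simulator f_xi(s, a) for a 1-D damped system."""
    return s + dt * (a - b * s)

# Roll out the "real" system to collect (s_t, a_t, s_{t+1}) transitions.
rng = np.random.default_rng(0)
states, actions, next_states = [], [], []
s = 0.0
for _ in range(200):
    a = rng.normal()
    s_next = step(b_true, s, a)
    states.append(s); actions.append(a); next_states.append(s_next)
    s = s_next
states, actions, next_states = map(np.array, (states, actions, next_states))

# Minimize the one-step prediction error over a grid of candidate parameters
# (black-box friendly: no simulator gradients needed).
candidates = np.linspace(0.0, 2.0, 201)
errors = [np.sum((step(b, states, actions) - next_states) ** 2) for b in candidates]
b_hat = candidates[int(np.argmin(errors))]
```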

Bayesian Optimization

When the objective function is non-differentiable (e.g., the simulator is a black box), Bayesian optimization is used:

  1. Execute a standard action sequence on the real robot and record trajectory \(\tau^{\text{real}}\)
  2. Execute the same action sequence in simulation to obtain \(\tau^{\text{sim}}(\xi)\)
  3. Define a distance metric \(d(\tau^{\text{real}}, \tau^{\text{sim}}(\xi))\)
  4. Model \(d(\xi)\) with a Gaussian process and use Bayesian optimization to search for \(\xi^*\)
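The four steps above can be sketched with scikit-learn's Gaussian process and a simple lower-confidence-bound acquisition (a simplification of full Bayesian optimization). The toy trajectory distance and all parameter values are illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

mu_true = 0.9  # "real" friction, unknown to the optimizer

def trajectory_distance(mu):
    # Stand-in for d(tau_real, tau_sim(xi)): small when sim matches real.
    return (mu - mu_true) ** 2

# Step 1-2: initial real/sim comparisons at a few candidate parameters.
X = np.array([[0.2], [1.0], [1.8]])
y = np.array([trajectory_distance(x[0]) for x in X])

gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
grid = np.linspace(0.1, 2.0, 200).reshape(-1, 1)
for _ in range(10):
    # Step 4: fit the GP surrogate of d(xi) and pick the next query point
    # where the lower confidence bound (mean - std) is smallest.
    gp.fit(X, y)
    mean, std = gp.predict(grid, return_std=True)
    nxt = grid[int(np.argmin(mean - std))]
    X = np.vstack([X, [nxt]])
    y = np.append(y, trajectory_distance(nxt[0]))

mu_hat = float(X[int(np.argmin(y)), 0])  # best parameter found
```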

Online Adaptation

Going further, environment parameters can be estimated online during deployment. Treat \(\xi\) as a latent variable and infer it from observation history:

\[ \hat{\xi}_t = g_\phi(o_{t-L:t}, a_{t-L:t-1}) \]

where \(g_\phi\) is an encoder (typically an RNN or Transformer) that infers current environment parameters from the most recent \(L\) steps of observation-action history.
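A minimal PyTorch sketch of such an encoder \(g_\phi\), here a GRU; all dimensions (obs_dim, act_dim, xi_dim) are illustrative:

```python
import torch
import torch.nn as nn

class AdaptationEncoder(nn.Module):
    """g_phi: maps the last L steps of (o, a) history to an estimate of xi."""
    def __init__(self, obs_dim=8, act_dim=3, xi_dim=4, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, xi_dim)

    def forward(self, obs_hist, act_hist):
        # obs_hist: (B, L, obs_dim), act_hist: (B, L, act_dim)
        x = torch.cat([obs_hist, act_hist], dim=-1)
        _, h = self.gru(x)               # h: (1, B, hidden), final hidden state
        return self.head(h.squeeze(0))   # xi_hat: (B, xi_dim)

enc = AdaptationEncoder()
xi_hat = enc(torch.zeros(2, 20, 8), torch.zeros(2, 20, 3))  # L = 20 steps
```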


Domain Adaptation

Adversarial Feature Alignment

Domain adaptation bridges the gap by learning domain-invariant features. The core method is adversarial training:

Architecture:

  • Feature extractor \(F_\theta: \mathcal{O} \rightarrow \mathcal{Z}\)
  • Task head \(C_\psi: \mathcal{Z} \rightarrow \mathcal{A}\)
  • Domain discriminator \(D_\omega: \mathcal{Z} \rightarrow \{0, 1\}\) (0=sim, 1=real)

Objective Function:

\[ \min_{\theta, \psi} \max_\omega \underbrace{\mathcal{L}_{\text{task}}(C_\psi(F_\theta(o_{\text{sim}})), a^*)}_{\text{task loss in simulation}} - \lambda \underbrace{\mathcal{L}_{\text{domain}}(D_\omega(F_\theta(o)), d)}_{\text{domain classification loss}} \]

Through the Gradient Reversal Layer, the feature extractor \(F_\theta\) simultaneously minimizes the task loss and maximizes the confusion of the domain discriminator, thereby learning domain-invariant features.
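The gradient reversal layer is a standard construction: identity in the forward pass, gradient scaled by \(-\lambda\) in the backward pass. A minimal PyTorch version (the \(\lambda\) value is illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; multiplies the gradient by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient w.r.t. lam

x = torch.ones(3, requires_grad=True)
y = GradReverse.apply(x, 0.5).sum()
y.backward()
# x.grad is -0.5 everywhere: the domain-classification gradient is reversed
# before reaching the feature extractor F_theta.
```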

Image-Level Transfer

Using image-to-image translation (e.g., CycleGAN) to convert simulation images to a style closer to reality:

\[ G_{\text{sim} \to \text{real}}: I_{\text{sim}} \to \hat{I}_{\text{real}} \]

With cycle consistency loss:

\[ \mathcal{L}_{\text{cycle}} = \|G_{\text{real} \to \text{sim}}(G_{\text{sim} \to \text{real}}(I_{\text{sim}})) - I_{\text{sim}}\|_1 \]
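The cycle term can be computed as below. The generators here are stand-in linear layers over flattened 64-dimensional "images" purely for illustration; real CycleGAN generators are convolutional networks:

```python
import torch
import torch.nn as nn

G_s2r = nn.Linear(64, 64)  # stand-in for G_{sim -> real}
G_r2s = nn.Linear(64, 64)  # stand-in for G_{real -> sim}

I_sim = torch.randn(8, 64)                  # batch of flattened sim images
cycle = G_r2s(G_s2r(I_sim))                 # sim -> "real" -> sim
loss_cycle = (cycle - I_sim).abs().mean()   # L1 cycle-consistency loss
```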

Teacher-Student Distillation

Complete Training Pipeline

Teacher-Student is one of the most successful Sim2Real frameworks, especially in quadruped locomotion control.

```mermaid
flowchart TD
    subgraph SIM["Simulation Environment (GPU Parallel)"]
        ENV[4096 Parallel Environments<br/>Domain Randomization]
    end

    subgraph TEACHER["Stage 1: Teacher Training"]
        OBS_T[Privileged Observations<br/>Terrain Scan + Contact Forces<br/>Friction Coefficients + Object Poses]
        ACTOR_T[Teacher Actor<br/>MLP 256-256-256]
        CRITIC_T[Teacher Critic<br/>MLP 512-256-128]
        PPO[PPO Algorithm]

        OBS_T --> ACTOR_T
        OBS_T --> CRITIC_T
        ACTOR_T --> PPO
        CRITIC_T --> PPO
    end

    subgraph STUDENT["Stage 2: Student Distillation"]
        OBS_S[Sensor Observations<br/>Joint Angles + IMU<br/>+ Action History]
        ENCODER[History Encoder<br/>RNN/Transformer]
        ACTOR_S[Student Actor<br/>MLP 256-256-256]
        KL[KL Divergence Distillation Loss]

        OBS_S --> ENCODER
        ENCODER --> ACTOR_S
        ACTOR_T -.->|Frozen Weights| KL
        ACTOR_S --> KL
    end

    subgraph DEPLOY["Stage 3: Real Deployment"]
        REAL_OBS[Real Sensors] --> ACTOR_DEPLOY[Student Actor<br/>Proprioception Only]
        ACTOR_DEPLOY --> REAL_ACT[Motor Commands<br/>@ 50Hz]
    end

    SIM --> TEACHER
    TEACHER --> STUDENT
    STUDENT --> DEPLOY

    style SIM fill:#fff3e0
    style TEACHER fill:#e8f5e9
    style STUDENT fill:#e3f2fd
    style DEPLOY fill:#fce4ec
```

Privileged Information Design

Teacher's privileged information (available in simulation, unavailable in reality):

| Privileged Information | Dimension | Description |
| --- | --- | --- |
| Terrain height scan | \(\mathbb{R}^{187}\) | Height sample points around feet |
| External forces | \(\mathbb{R}^{3}\) | Perturbation forces applied to the torso |
| Friction coefficient | \(\mathbb{R}^{1}\) | Current ground friction |
| Payload mass | \(\mathbb{R}^{1}\) | Additional load on the back |
| Motor strength | \(\mathbb{R}^{12}\) | Actual gain for each motor |

Student's available information:

| Observation | Dimension | Description |
| --- | --- | --- |
| Joint angles | \(\mathbb{R}^{12}\) | Encoder readings |
| Joint angular velocities | \(\mathbb{R}^{12}\) | Finite differences of encoder readings |
| IMU | \(\mathbb{R}^{6}\) | Angular velocity + acceleration |
| Action history | \(\mathbb{R}^{12 \times L}\) | Past \(L\) steps of actions |
| Velocity command | \(\mathbb{R}^{3}\) | \((v_x, v_y, \omega_z)\) |

History Encoder

The Student implicitly infers environment parameters through historical information. The encoder maps the past \(L\) steps of observation-action pairs to a latent variable:

\[ z_t = \text{RNN}_\phi(o_{t-L:t}, a_{t-L:t-1}) \]

Then the Student policy is conditioned on \(z_t\):

\[ a_t = \pi_{\text{student}}(o_t, z_t) \]

Intuitively, \(z_t\) encodes an implicit estimate of the current terrain type, friction, payload, and similar information.
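Putting the pieces together, one distillation step might look like the sketch below. With Gaussian policies of fixed variance, the KL distillation loss reduces to an MSE between action means; all network shapes and dimensions here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Teacher sees privileged observations; student sees proprioception + z_t.
teacher = nn.Sequential(nn.Linear(50, 256), nn.Tanh(), nn.Linear(256, 12))
student = nn.Sequential(nn.Linear(30 + 8, 256), nn.Tanh(), nn.Linear(256, 12))
for p in teacher.parameters():
    p.requires_grad_(False)  # "frozen weights" in the pipeline above

priv_obs = torch.randn(64, 50)  # privileged observations (simulation only)
obs = torch.randn(64, 30)       # proprioceptive observations
z = torch.randn(64, 8)          # latent z_t from the history encoder

with torch.no_grad():
    target = teacher(priv_obs)               # teacher action mean
pred = student(torch.cat([obs, z], dim=-1))  # student action mean
loss = F.mse_loss(pred, target)  # KL between fixed-variance Gaussians ~ MSE
loss.backward()                  # gradients flow only into the student
```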


Sim2Real Best Practices

Success Factors

  1. Sufficient domain randomization: Cover the range of possible variations in the real environment
  2. Accurate default parameters: Determine nominal parameters through system identification
  3. Robust observations: Avoid relying on signals that are easy in simulation but inaccurate in reality
  4. Delay modeling: Inject communication/computation delays in simulation
  5. Noise injection: Sensor noise, actuator noise
  6. Action smoothing: Penalize abrupt action changes
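Action smoothing (item 6) is often paired at deployment with a first-order low-pass filter on the commanded actions; a minimal sketch, with an illustrative smoothing coefficient:

```python
# Deployment-time exponential low-pass filter on motor commands.
def make_action_filter(alpha=0.8):
    state = {"prev": None}
    def filt(a):
        # First command passes through; later ones blend with the previous.
        state["prev"] = a if state["prev"] is None else alpha * a + (1 - alpha) * state["prev"]
        return state["prev"]
    return filt

filt = make_action_filter(alpha=0.5)
out1 = filt(1.0)  # 1.0: passes through unchanged
out2 = filt(0.0)  # 0.5: smoothed toward the previous command
```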

Common Failure Modes

| Failure Mode | Cause | Solution |
| --- | --- | --- |
| Action jitter | No motor dynamics in simulation | Add low-pass filtering and smoothness penalties |
| Cannot walk | Inaccurate friction model | Wider friction randomization range |
| Grasping failure | Contact model discrepancy | Force/torque feedback + domain randomization |
| Vision failure | Unrealistic rendering | Visual domain randomization + domain adaptation |
| Delay sensitivity | Unmodeled delays | Inject 10–30 ms random delays |

Evaluation Metrics

Transfer Success Rate

The most direct metric is the ratio of task success rate in the real environment to that in simulation:

\[ \text{Transfer Ratio} = \frac{\text{Success Rate}_{\text{real}}}{\text{Success Rate}_{\text{sim}}} \]

Ideally close to 1.0. In practice:

  • Simple tasks (e.g., quadruped walking): 0.8–0.95
  • Manipulation tasks (e.g., dexterous manipulation): 0.3–0.7
  • Vision tasks (e.g., image-based manipulation): 0.2–0.6

Zero-Shot vs. Few-Shot Transfer

  • Zero-Shot: Directly deploy after simulation training without any real data
  • Few-Shot: Simulation pretraining + fine-tuning with a small amount of real data


References

  1. Tobin, J., et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS.
  2. OpenAI (2019). Solving Rubik's Cube with a Robot Hand. arXiv:1910.07113.
  3. Miki, T., et al. (2022). Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild. Science Robotics.
  4. Peng, X.B., et al. (2018). Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. ICRA.
  5. Rudin, N., et al. (2022). Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. CoRL.
