
Sim2Real: Transferring from Simulation to Reality

Overview

Sim2Real (Simulation-to-Reality) is a key technology in robot learning that bridges simulation training and real-world deployment. The core problem is the Reality Gap: simulators cannot perfectly reproduce the physical properties of the real world, causing policies that perform well in simulation to fail in real environments.

The research goal of Sim2Real is to develop systematic methods that enable policies trained in simulation to transfer robustly to real robots.


Sources of the Reality Gap

The gap between simulation and reality arises from multiple levels:

Dynamics Gap

Simulators exhibit systematic biases in modeling physical processes:

| Physical Phenomenon | Simulation Approximation | Real-world Behavior |
| --- | --- | --- |
| Contact | Penetration penalty forces / constraints | Complex deformation, friction |
| Friction | Coulomb model \(f = \mu N\) | Nonlinear, anisotropic |
| Soft bodies | Finite elements / mass-spring | Continuous deformation |
| Motors | Ideal torque sources | Delay, nonlinearity, thermal effects |
| Sensors | Ideal values + Gaussian noise | Complex noise, bias drift |

Visual Gap

| Dimension | Simulation Rendering | Real Images |
| --- | --- | --- |
| Lighting | Simplified lighting models | Complex ambient light, reflections |
| Textures | Limited texture library | Infinite diversity |
| Camera | Ideal pinhole model | Distortion, chromatic aberration, noise |
| Occlusion | Perfect depth | Sensor noise, holes |

Domain Randomization

Core Idea

The core hypothesis of Domain Randomization (DR) is: if a policy can succeed across a large number of randomized simulation environments, then the real environment is simply "one sample" among these randomized environments.

Formalization: Define the simulation parameter vector \(\xi \in \Xi\), sampled from distribution \(p(\xi)\) at the beginning of each training episode:

\[ \xi \sim p(\xi), \quad \pi^* = \arg\max_\pi \mathbb{E}_{\xi \sim p(\xi)} \left[ \mathbb{E}_\pi \left[ \sum_t \gamma^t r_t \mid \xi \right] \right] \]

Physical Parameter Randomization

Typical physical parameter randomization ranges:

| Parameter | Symbol | Default Value | Randomization Range |
| --- | --- | --- | --- |
| Friction coefficient | \(\mu\) | 1.0 | [0.2, 2.0] |
| Object mass | \(m\) | Nominal value | [0.5x, 2.0x] |
| Damping coefficient | \(b\) | Nominal value | [0.5x, 3.0x] |
| Joint backlash | \(\delta\) | 0 | [0, 0.02] rad |
| Actuator delay | \(\Delta t\) | 0 | [0, 30] ms |
| Gravity | \(g\) | 9.81 m/s\(^2\) | [9.4, 10.2] m/s\(^2\) |
| Terrain friction | \(\mu_g\) | 0.7 | [0.3, 1.2] |
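As a concrete sketch, the per-episode sampling \(\xi \sim p(\xi)\) can be implemented as follows. The parameter names and the choice of a uniform distribution are illustrative; the ranges mirror the table above:

```python
import random

# Hypothetical randomization ranges (uniform sampling per episode).
RANGES = {
    "friction": (0.2, 2.0),            # friction coefficient mu
    "mass_scale": (0.5, 2.0),          # multiplier on the nominal mass
    "actuator_delay_ms": (0.0, 30.0),  # actuator delay Delta t
}

def sample_episode_params(rng):
    """Draw one simulation parameter vector xi ~ p(xi)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

rng = random.Random(0)
xi = sample_episode_params(rng)  # resampled at the start of every episode
```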

Visual Randomization

Visual domain randomization introduces variations at the rendering level:

  • Texture randomization: Randomly replacing object and background textures
  • Lighting randomization: Direction, intensity, color, number of lights
  • Camera randomization: Position offset, field of view, white balance
  • Distractors: Randomly adding irrelevant objects to the scene

Automatic Domain Randomization (ADR)

Manually setting randomization ranges requires domain expertise. Automatic Domain Randomization (ADR, OpenAI 2019) adjusts them automatically:

Algorithm: For each parameter \(\xi_i\), maintain a range \([\xi_i^{\text{low}}, \xi_i^{\text{high}}]\). Within an evaluation window:

\[ \text{If success\_rate} > \eta_{\text{up}}: \quad \xi_i^{\text{high}} \leftarrow \xi_i^{\text{high}} + \Delta\xi_i \]

\[ \text{If success\_rate} < \eta_{\text{down}}: \quad \xi_i^{\text{high}} \leftarrow \xi_i^{\text{high}} - \Delta\xi_i \]

This allows the randomization range to adaptively expand or contract during training.
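The update rule fits in a few lines. A minimal sketch, with illustrative thresholds and step size (not OpenAI's exact values):

```python
# Sketch of the ADR range update for one parameter xi_i.
def adr_update(high, success_rate, delta, eta_up=0.8, eta_down=0.4, low=0.0):
    """Expand or shrink the upper bound of one parameter's range."""
    if success_rate > eta_up:
        high = high + delta            # policy succeeds: widen the range
    elif success_rate < eta_down:
        high = max(low, high - delta)  # too hard: narrow the range
    return high

high = 1.0
high = adr_update(high, success_rate=0.9, delta=0.1)  # widened
high = adr_update(high, success_rate=0.2, delta=0.1)  # narrowed back
```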


System Identification

Core Idea

Unlike domain randomization's approach of "training to be robust across all environments," System Identification (SysID) aims to make the simulation as close to reality as possible.

Parameter Identification

Given the simulator model \(f_\xi(s, a)\) and real robot trajectories \(\{(s_t^{\text{real}}, a_t^{\text{real}}, s_{t+1}^{\text{real}})\}\), optimize the simulation parameters:

\[ \xi^* = \arg\min_\xi \sum_t \| f_\xi(s_t^{\text{real}}, a_t^{\text{real}}) - s_{t+1}^{\text{real}} \|^2 \]
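A toy version of this least-squares identification, using a 1-D damped system whose damping coefficient plays the role of \(\xi\). The dynamics model and all values here are illustrative:

```python
import numpy as np

dt = 0.01
b_true = 0.7  # unknown "real-world" damping to be identified

def step(b, s, a):
    """One-step simulator f_xi(s, a) for a 1-D damped system."""
    return s + dt * (a - b * s)

# Roll out the "real" system to collect (s_t, a_t, s_{t+1}) transitions.
rng = np.random.default_rng(0)
states, actions, next_states = [], [], []
s = 0.0
for _ in range(200):
    a = rng.normal()
    s_next = step(b_true, s, a)
    states.append(s); actions.append(a); next_states.append(s_next)
    s = s_next
states, actions, next_states = map(np.array, (states, actions, next_states))

# Minimize the one-step prediction error over a grid of candidate parameters
# (black-box friendly: no simulator gradients needed).
candidates = np.linspace(0.0, 2.0, 201)
errors = [np.sum((step(b, states, actions) - next_states) ** 2) for b in candidates]
b_hat = candidates[int(np.argmin(errors))]
```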

Bayesian Optimization

When the objective function is non-differentiable (e.g., the simulator is a black box), Bayesian optimization is used:

  1. Execute a standard action sequence on the real robot and record trajectory \(\tau^{\text{real}}\)
  2. Execute the same action sequence in simulation to obtain \(\tau^{\text{sim}}(\xi)\)
  3. Define a distance metric \(d(\tau^{\text{real}}, \tau^{\text{sim}}(\xi))\)
  4. Model \(d(\xi)\) with a Gaussian process and use Bayesian optimization to search for \(\xi^*\)
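The four steps above can be sketched with scikit-learn's Gaussian process and a simple lower-confidence-bound acquisition (a simplification of full Bayesian optimization). The toy trajectory distance and all parameter values are illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

mu_true = 0.9  # "real" friction, unknown to the optimizer

def trajectory_distance(mu):
    # Stand-in for d(tau_real, tau_sim(xi)): small when sim matches real.
    return (mu - mu_true) ** 2

# Step 1-2: initial real/sim comparisons at a few candidate parameters.
X = np.array([[0.2], [1.0], [1.8]])
y = np.array([trajectory_distance(x[0]) for x in X])

gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
grid = np.linspace(0.1, 2.0, 200).reshape(-1, 1)
for _ in range(10):
    # Step 4: fit the GP surrogate of d(xi) and pick the next query point
    # where the lower confidence bound (mean - std) is smallest.
    gp.fit(X, y)
    mean, std = gp.predict(grid, return_std=True)
    nxt = grid[int(np.argmin(mean - std))]
    X = np.vstack([X, [nxt]])
    y = np.append(y, trajectory_distance(nxt[0]))

mu_hat = float(X[int(np.argmin(y)), 0])  # best parameter found
```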

Online Adaptation

Going further, environment parameters can be estimated online during deployment. Treat \(\xi\) as a latent variable and infer it from observation history:

\[ \hat{\xi}_t = g_\phi(o_{t-L:t}, a_{t-L:t-1}) \]

where \(g_\phi\) is an encoder (typically an RNN or Transformer) that infers current environment parameters from the most recent \(L\) steps of observation-action history.
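A minimal PyTorch sketch of such an encoder \(g_\phi\), here a GRU; all dimensions (obs_dim, act_dim, xi_dim) are illustrative:

```python
import torch
import torch.nn as nn

class AdaptationEncoder(nn.Module):
    """g_phi: maps the last L steps of (o, a) history to an estimate of xi."""
    def __init__(self, obs_dim=8, act_dim=3, xi_dim=4, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, xi_dim)

    def forward(self, obs_hist, act_hist):
        # obs_hist: (B, L, obs_dim), act_hist: (B, L, act_dim)
        x = torch.cat([obs_hist, act_hist], dim=-1)
        _, h = self.gru(x)               # h: (1, B, hidden), final hidden state
        return self.head(h.squeeze(0))   # xi_hat: (B, xi_dim)

enc = AdaptationEncoder()
xi_hat = enc(torch.zeros(2, 20, 8), torch.zeros(2, 20, 3))  # L = 20 steps
```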


Domain Adaptation

Adversarial Feature Alignment

Domain adaptation bridges the gap by learning domain-invariant features. The core method is adversarial training:

Architecture:

  • Feature extractor \(F_\theta: \mathcal{O} \rightarrow \mathcal{Z}\)
  • Task head \(C_\psi: \mathcal{Z} \rightarrow \mathcal{A}\)
  • Domain discriminator \(D_\omega: \mathcal{Z} \rightarrow \{0, 1\}\) (0=sim, 1=real)

Objective Function:

\[ \min_{\theta, \psi} \max_\omega \underbrace{\mathcal{L}_{\text{task}}(C_\psi(F_\theta(o_{\text{sim}})), a^*)}_{\text{task loss in simulation}} - \lambda \underbrace{\mathcal{L}_{\text{domain}}(D_\omega(F_\theta(o)), d)}_{\text{domain classification loss}} \]

Through the Gradient Reversal Layer, the feature extractor \(F_\theta\) simultaneously minimizes the task loss and maximizes the confusion of the domain discriminator, thereby learning domain-invariant features.
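The gradient reversal layer is a standard construction: identity in the forward pass, gradient scaled by \(-\lambda\) in the backward pass. A minimal PyTorch version (the \(\lambda\) value is illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; multiplies the gradient by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient w.r.t. lam

x = torch.ones(3, requires_grad=True)
y = GradReverse.apply(x, 0.5).sum()
y.backward()
# x.grad is -0.5 everywhere: the domain-classification gradient is reversed
# before reaching the feature extractor F_theta.
```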

Image-Level Transfer

Using image-to-image translation (e.g., CycleGAN) to convert simulation images to a style closer to reality:

\[ G_{\text{sim} \to \text{real}}: I_{\text{sim}} \to \hat{I}_{\text{real}} \]

With cycle consistency loss:

\[ \mathcal{L}_{\text{cycle}} = \|G_{\text{real} \to \text{sim}}(G_{\text{sim} \to \text{real}}(I_{\text{sim}})) - I_{\text{sim}}\|_1 \]
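The cycle term can be computed as below. The generators here are stand-in linear layers over flattened 64-dimensional "images" purely for illustration; real CycleGAN generators are convolutional networks:

```python
import torch
import torch.nn as nn

G_s2r = nn.Linear(64, 64)  # stand-in for G_{sim -> real}
G_r2s = nn.Linear(64, 64)  # stand-in for G_{real -> sim}

I_sim = torch.randn(8, 64)                  # batch of flattened sim images
cycle = G_r2s(G_s2r(I_sim))                 # sim -> "real" -> sim
loss_cycle = (cycle - I_sim).abs().mean()   # L1 cycle-consistency loss
```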

Teacher-Student Distillation

Complete Training Pipeline

Teacher-Student is one of the most successful Sim2Real frameworks, especially in quadruped locomotion control.

```mermaid
flowchart TD
    subgraph SIM["Simulation Environment (GPU Parallel)"]
        ENV[4096 Parallel Environments<br/>Domain Randomization]
    end

    subgraph TEACHER["Stage 1: Teacher Training"]
        OBS_T[Privileged Observations<br/>Terrain Scan + Contact Forces<br/>Friction Coefficients + Object Poses]
        ACTOR_T[Teacher Actor<br/>MLP 256-256-256]
        CRITIC_T[Teacher Critic<br/>MLP 512-256-128]
        PPO[PPO Algorithm]

        OBS_T --> ACTOR_T
        OBS_T --> CRITIC_T
        ACTOR_T --> PPO
        CRITIC_T --> PPO
    end

    subgraph STUDENT["Stage 2: Student Distillation"]
        OBS_S[Sensor Observations<br/>Joint Angles + IMU<br/>+ Action History]
        ENCODER[History Encoder<br/>RNN/Transformer]
        ACTOR_S[Student Actor<br/>MLP 256-256-256]
        KL[KL Divergence Distillation Loss]

        OBS_S --> ENCODER
        ENCODER --> ACTOR_S
        ACTOR_T -.->|Frozen Weights| KL
        ACTOR_S --> KL
    end

    subgraph DEPLOY["Stage 3: Real Deployment"]
        REAL_OBS[Real Sensors] --> ACTOR_DEPLOY[Student Actor<br/>Proprioception Only]
        ACTOR_DEPLOY --> REAL_ACT[Motor Commands<br/>@ 50Hz]
    end

    SIM --> TEACHER
    TEACHER --> STUDENT
    STUDENT --> DEPLOY

    style SIM fill:#fff3e0
    style TEACHER fill:#e8f5e9
    style STUDENT fill:#e3f2fd
    style DEPLOY fill:#fce4ec
```

Privileged Information Design

Teacher's privileged information (available in simulation, unavailable in reality):

| Privileged Information | Dimension | Description |
| --- | --- | --- |
| Terrain height scan | \(\mathbb{R}^{187}\) | Height sample points around feet |
| External forces | \(\mathbb{R}^{3}\) | Perturbation forces applied to the torso |
| Friction coefficient | \(\mathbb{R}^{1}\) | Current ground friction |
| Payload mass | \(\mathbb{R}^{1}\) | Additional load on the back |
| Motor strength | \(\mathbb{R}^{12}\) | Actual gain for each motor |

Student's available information:

| Observation | Dimension | Description |
| --- | --- | --- |
| Joint angles | \(\mathbb{R}^{12}\) | Encoder readings |
| Joint angular velocities | \(\mathbb{R}^{12}\) | Finite differences of encoder readings |
| IMU | \(\mathbb{R}^{6}\) | Angular velocity + acceleration |
| Action history | \(\mathbb{R}^{12 \times L}\) | Past \(L\) steps of actions |
| Velocity command | \(\mathbb{R}^{3}\) | \((v_x, v_y, \omega_z)\) |

History Encoder

The Student implicitly infers environment parameters through historical information. The encoder maps the past \(L\) steps of observation-action pairs to a latent variable:

\[ z_t = \text{RNN}_\phi(o_{t-L:t}, a_{t-L:t-1}) \]

Then the Student policy is conditioned on \(z_t\):

\[ a_t = \pi_{\text{student}}(o_t, z_t) \]

Intuitively, \(z_t\) encodes an implicit estimate of the current terrain type, friction, payload, and similar information.
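Putting the pieces together, one distillation step might look like the sketch below. With Gaussian policies of fixed variance, the KL distillation loss reduces to an MSE between action means; all network shapes and dimensions here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Teacher sees privileged observations; student sees proprioception + z_t.
teacher = nn.Sequential(nn.Linear(50, 256), nn.Tanh(), nn.Linear(256, 12))
student = nn.Sequential(nn.Linear(30 + 8, 256), nn.Tanh(), nn.Linear(256, 12))
for p in teacher.parameters():
    p.requires_grad_(False)  # "frozen weights" in the pipeline above

priv_obs = torch.randn(64, 50)  # privileged observations (simulation only)
obs = torch.randn(64, 30)       # proprioceptive observations
z = torch.randn(64, 8)          # latent z_t from the history encoder

with torch.no_grad():
    target = teacher(priv_obs)               # teacher action mean
pred = student(torch.cat([obs, z], dim=-1))  # student action mean
loss = F.mse_loss(pred, target)  # KL between fixed-variance Gaussians ~ MSE
loss.backward()                  # gradients flow only into the student
```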


Sim2Real Best Practices

Success Factors

  1. Sufficient domain randomization: Cover the range of possible variations in the real environment
  2. Accurate default parameters: Determine nominal parameters through system identification
  3. Robust observations: Avoid relying on signals that are easy in simulation but inaccurate in reality
  4. Delay modeling: Inject communication/computation delays in simulation
  5. Noise injection: Sensor noise, actuator noise
  6. Action smoothing: Penalize abrupt action changes
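Action smoothing (item 6) is often paired at deployment with a first-order low-pass filter on the commanded actions; a minimal sketch, with an illustrative smoothing coefficient:

```python
# Deployment-time exponential low-pass filter on motor commands.
def make_action_filter(alpha=0.8):
    state = {"prev": None}
    def filt(a):
        # First command passes through; later ones blend with the previous.
        state["prev"] = a if state["prev"] is None else alpha * a + (1 - alpha) * state["prev"]
        return state["prev"]
    return filt

filt = make_action_filter(alpha=0.5)
out1 = filt(1.0)  # 1.0: passes through unchanged
out2 = filt(0.0)  # 0.5: smoothed toward the previous command
```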

Common Failure Modes

| Failure Mode | Cause | Solution |
| --- | --- | --- |
| Action jitter | No motor dynamics in simulation | Add low-pass filtering and smoothness penalties |
| Cannot walk | Inaccurate friction model | Wider friction randomization range |
| Grasping failure | Contact model discrepancy | Force/torque feedback + domain randomization |
| Vision failure | Unrealistic rendering | Visual domain randomization + domain adaptation |
| Delay sensitivity | Unmodeled delays | Inject 10–30 ms random delays |

Evaluation Metrics

Transfer Success Rate

The most direct metric is the ratio of task success rate in the real environment to that in simulation:

\[ \text{Transfer Ratio} = \frac{\text{Success Rate}_{\text{real}}}{\text{Success Rate}_{\text{sim}}} \]

Ideally close to 1.0. In practice:

  • Simple tasks (e.g., quadruped walking): 0.8–0.95
  • Manipulation tasks (e.g., dexterous manipulation): 0.3–0.7
  • Vision tasks (e.g., image-based manipulation): 0.2–0.6

Zero-Shot vs. Few-Shot Transfer

  • Zero-Shot: Directly deploy after simulation training without any real data
  • Few-Shot: Simulation pretraining + fine-tuning with a small amount of real data


References

  1. Tobin, J., et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS.
  2. OpenAI (2019). Solving Rubik's Cube with a Robot Hand. arXiv:1910.07113.
  3. Miki, T., et al. (2022). Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild. Science Robotics.
  4. Peng, X.B., et al. (2018). Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. ICRA.
  5. Rudin, N., et al. (2022). Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. CoRL.
