Representation Learning and RL
Motivation
In RL from pixels or high-dimensional sensor inputs, the quality of state representations directly determines learning efficiency and final performance. Good representations should:
- Capture information relevant to decision-making
- Ignore irrelevant visual details (e.g., background textures)
- Generalize across visual variations of the same task
- Support efficient policy and value function learning
Data Augmentation Methods
DrQ (Data-regularized Q)
Kostrikov et al. (2021) showed that simple image augmentation alone, without auxiliary losses or pretraining, can significantly improve pixel-based RL performance.
Core method: Apply random shifts to observation images (pad the frame by a few pixels with edge replication, then crop back to the original size at a random offset).
Application in Q-learning:
Augmented observations regularize both sides of the Bellman update: the TD target is averaged over \(K\) augmented copies of the next observation, and the Q-loss is averaged over \(M\) augmented copies of the current observation.
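A minimal PyTorch sketch of both pieces, assuming hypothetical `q_target` and `policy` callables (an illustration of the idea, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Pad each frame with edge replication, then crop back to the
    original size at a random per-image offset."""
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        dx = torch.randint(0, 2 * pad + 1, (1,)).item()
        dy = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, dy:dy + h, dx:dx + w]
    return out

def drq_target(q_target, policy, reward, next_obs, gamma, K=2):
    """TD target averaged over K independent augmentations of next_obs.
    q_target and policy are hypothetical target-critic / actor callables;
    termination masking is omitted for brevity."""
    with torch.no_grad():
        ys = []
        for _ in range(K):
            aug = random_shift(next_obs)
            ys.append(reward + gamma * q_target(aug, policy(aug)))
        return torch.stack(ys).mean(dim=0)
```

The Q-loss side is symmetric: predicted Q-values are averaged over \(M\) augmented copies of the current observation before regressing onto this target.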
DrQ-v2: an improved version by Yarats et al. (2022) combining:
- Random shift augmentation
- \(n\)-step returns (see the sketch after this list)
- Starting from DDPG rather than SAC (simpler and more efficient)
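For the \(n\)-step returns ingredient, a minimal sketch of the target computation (all names illustrative):

```python
def n_step_return(rewards, bootstrap, gamma, n):
    """n-step TD target: the first n discounted rewards plus a discounted
    bootstrap value (e.g. from a target critic)."""
    g = bootstrap
    for r in reversed(rewards[:n]):
        g = r + gamma * g
    return g

# n_step_return([r0, r1, r2], v3, 0.99, 3)
#   == r0 + 0.99 * r1 + 0.99**2 * r2 + 0.99**3 * v3
```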
RAD (Reinforcement Learning with Augmented Data)
Laskin et al. (2020) systematically studied the effects of various data augmentation methods in RL:
| Augmentation | Description | Effectiveness |
|---|---|---|
| Random crop | Crop a random window from a larger rendered frame | Most effective |
| Random shift | Pad, then translate by a random offset | Very effective |
| Color jitter | Perturb brightness/contrast/saturation | Partially effective |
| Random convolution | Filter through a randomly initialized conv layer | Effective |
| Grayscale | Convert to grayscale | Task-dependent |
| Cutout | Mask out a random rectangular patch | Partially effective |
Key findings:
- Simple crop/shift augmentations can match or surpass far more complex representation learning methods
- Different environments benefit from different augmentation strategies
- Data augmentation is often the simplest, highest-leverage way to improve pixel-based RL performance (a random-crop sketch follows)
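A minimal sketch of random crop, under the assumption that frames are rendered slightly larger than the network input (RAD crops e.g. 100×100 frames to 84×84); the RL update itself is left untouched:

```python
import torch

def random_crop(imgs, out_size=84):
    """Sample an out_size x out_size window per image from a larger frame."""
    n, c, h, w = imgs.shape
    out = torch.empty(n, c, out_size, out_size,
                      device=imgs.device, dtype=imgs.dtype)
    for i in range(n):
        top = torch.randint(0, h - out_size + 1, (1,)).item()
        left = torch.randint(0, w - out_size + 1, (1,)).item()
        out[i] = imgs[i, :, top:top + out_size, left:left + out_size]
    return out

# RAD changes only the data stream; the underlying algorithm is unchanged:
#   obs, act, rew, next_obs, done = buffer.sample(batch_size)  # hypothetical buffer
#   update(random_crop(obs), act, rew, random_crop(next_obs), done)
```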
Contrastive Learning Methods
CURL (Contrastive Unsupervised Representations for RL)
Laskin et al. (2020) introduced contrastive learning for RL representation learning:
Positive pairs: Two different augmentations of the same observation
Contrastive loss (InfoNCE):
\[
\mathcal{L}_q = -\log \frac{\exp(q^\top W k_+)}{\exp(q^\top W k_+) + \sum_{i=0}^{K-1} \exp(q^\top W k_i)}
\]
where \(q = f_q(s_q)\) and \(k = f_k(s_k)\) are the query and key encodings of the two views, \(k_+\) is the positive key, the \(k_i\) are negatives drawn from the batch, and \(W\) is a learnable bilinear matrix.
Integration with RL (sketched after this list):
- Encoder \(f_q\) doubles as the feature extractor for policy/value networks
- Contrastive loss serves as an auxiliary task, trained jointly with the RL loss
- Momentum encoder \(f_k\) is updated via EMA
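A minimal sketch of the contrastive update with in-batch negatives and an EMA key encoder; module names and the `tau` value are assumptions:

```python
import torch
import torch.nn.functional as F

def curl_loss(f_q, f_k, W, obs, augment):
    """InfoNCE with bilinear similarity q^T W k. Positives are two
    augmentations of the same observation; negatives come from the batch."""
    q = f_q(augment(obs))                      # queries (B, d)
    with torch.no_grad():
        k = f_k(augment(obs))                  # keys (B, d), no gradient
    logits = q @ W @ k.t()                     # (B, B): row i scored against key j
    logits = logits - logits.max(dim=1, keepdim=True).values  # numerical stability
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)     # diagonal entries are the positives

def ema_update(f_k, f_q, tau=0.05):
    """Momentum update of the key encoder (tau is an assumed value)."""
    for pk, pq in zip(f_k.parameters(), f_q.parameters()):
        pk.data.mul_(1 - tau).add_(tau * pq.data)
```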
Limitations of Contrastive Learning
- Negative sample selection affects performance
- Contrastive objectives may not align with RL objectives
- Increased computational overhead
Bisimulation Metrics
Core Idea
Two states should have similar representations if they are behaviorally indistinguishable.
Bisimulation relation: States \(s_1\) and \(s_2\) are bisimilar if:
- They have identical immediate rewards: \(R(s_1, a) = R(s_2, a), \forall a\)
- Their transition distributions are identical over bisimulation equivalence classes
Policy Bisimulation Metric
Zhang et al. (2021) proposed a policy-dependent bisimulation metric, defined as the fixed point of
\[
d^\pi(s_i, s_j) = (1 - c)\,\big| r^\pi_{s_i} - r^\pi_{s_j} \big| + c\, W_1\!\big(\mathcal{P}^\pi(\cdot \mid s_i),\, \mathcal{P}^\pi(\cdot \mid s_j);\, d^\pi\big)
\]
where \(W_1\) is the 1-Wasserstein distance and \(c \in [0, 1)\) is a discount factor.
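The encoder is then trained so that distances in latent space match this metric. A sketch under the assumptions Zhang et al. make in practice (an \(\ell_1\) latent distance and learned diagonal-Gaussian dynamics, for which the 2-Wasserstein distance has a closed form):

```python
import torch
import torch.nn.functional as F

def bisim_loss(z, reward, mu_next, sigma_next, c=0.99):
    """Match l1 latent distances to the bisimulation target by comparing
    each sample against a random permutation of the batch. mu_next and
    sigma_next are predicted next-latent Gaussian parameters (assumed
    diagonal), produced by a hypothetical dynamics model."""
    perm = torch.randperm(z.size(0), device=z.device)
    z_dist = (z - z[perm].detach()).abs().sum(dim=1)   # ||phi(s_i) - phi(s_j)||_1
    r_dist = (reward - reward[perm]).abs()
    w2 = torch.sqrt(                                   # closed-form Gaussian W2
        (mu_next - mu_next[perm]).pow(2).sum(dim=1)
        + (sigma_next - sigma_next[perm]).pow(2).sum(dim=1)
    )
    return F.mse_loss(z_dist, (r_dist + c * w2).detach())
```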
DeepMDP
Gelada et al. (2019) learn a representation \(\phi\) jointly with a latent reward model \(\hat{R}\) and a latent transition model \(\hat{P}\), so that the latent space itself forms an MDP.
Training objectives (see the sketch below):
- Reward prediction: \(\hat{R}(\phi(s), a) \approx R(s, a)\)
- Transition prediction: \(\hat{P}(\phi(s), a) \approx \phi(s')\)
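A minimal sketch of the two objectives, with hypothetical latent reward and transition networks `r_hat` and `p_hat` reading the concatenated (latent, action):

```python
import torch
import torch.nn.functional as F

def deepmdp_losses(phi, r_hat, p_hat, s, a, r, s_next):
    """Reward prediction plus latent transition prediction. The transition
    target is the encoding of the next state, with gradients stopped."""
    za = torch.cat([phi(s), a], dim=-1)
    z_next = phi(s_next).detach()
    reward_loss = F.mse_loss(r_hat(za).squeeze(-1), r)
    transition_loss = F.mse_loss(p_hat(za), z_next)
    return reward_loss + transition_loss
```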
Advantages
- Theoretically grounded: small reward and transition losses bound the gap between latent and true value functions
- Automatically ignores decision-irrelevant information (e.g., background changes)
- Suitable for scenarios requiring generalization across visual appearances
World Model Representations
Dreamer's Latent Space
Dreamer (Hafner et al., 2020) learns compressed latent state representations:
RSSM (Recurrent State-Space Model), with a code sketch after this list:
- Deterministic path: \(h_t = f(h_{t-1}, z_{t-1}, a_{t-1})\)
- Stochastic path: \(z_t \sim q(z_t | h_t, o_t)\) (posterior)
- Prior: \(\hat{z}_t \sim p(z_t | h_t)\)
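A minimal sketch of one RSSM step (dimensions are illustrative defaults, not Dreamer's exact architecture):

```python
import torch
import torch.nn as nn

class RSSMStep(nn.Module):
    """One RSSM transition: a GRU cell carries the deterministic path; the
    stochastic state is a diagonal Gaussian with a prior head (h only) and
    a posterior head (h plus the observation embedding)."""
    def __init__(self, h_dim=200, z_dim=30, a_dim=6, e_dim=1024):
        super().__init__()
        self.gru = nn.GRUCell(z_dim + a_dim, h_dim)
        self.prior = nn.Linear(h_dim, 2 * z_dim)              # -> (mu, log_std)
        self.posterior = nn.Linear(h_dim + e_dim, 2 * z_dim)

    def forward(self, h, z, a, obs_embed=None):
        h = self.gru(torch.cat([z, a], dim=-1), h)            # h_t = f(h, z, a)
        head = self.prior(h) if obs_embed is None else \
               self.posterior(torch.cat([h, obs_embed], dim=-1))
        mu, log_std = head.chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)         # reparameterized sample
        return h, z, mu, log_std
```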
Latent space properties:
- Captures information relevant to predicting future rewards and observations
- Compresses high-dimensional observations into a low-dimensional latent space
- Supports imagination-based planning in the latent space
Connection to Representation Learning
World model representation learning uses reconstruction and prediction objectives to shape the latent space: the encoder must retain whatever information is needed to reconstruct observations and predict rewards.
See Model-Based RL for details.
Self-Predictive Representations (SPR)
Core Idea
Schwarzer et al. (2021) proposed SPR (Self-Predictive Representations): learn by predicting your own future representations. The objective is a negative cosine similarity between predicted and target projections over \(K\) future steps:
\[
\mathcal{L}_{\text{SPR}} = -\sum_{k=1}^{K} \frac{g(\hat{z}_{t+k})^\top \, \bar{f}(\phi(s_{t+k}))}{\big\lVert g(\hat{z}_{t+k}) \big\rVert_2 \, \big\lVert \bar{f}(\phi(s_{t+k})) \big\rVert_2}
\]
where:
- \(\phi(s_{t+k})\): the target encoder's encoding of the future state (updated via EMA)
- \(\hat{z}_{t+k}\): the future representation predicted from the current state by a learned transition model
- \(g\): projection head
- \(\bar{f}\): target projection head
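A minimal sketch of the loss; SPR also passes the online branch through an extra prediction head, written here as a hypothetical `q`:

```python
import torch
import torch.nn.functional as F

def spr_loss(z_hat_seq, s_future, phi_bar, g, f_bar, q):
    """Negative cosine similarity between online predictions and EMA
    targets over K future steps. phi_bar and f_bar are the target encoder
    and target projection head; gradients flow only through the online
    branch (z_hat_seq, g, q)."""
    loss = 0.0
    for z_hat, s in zip(z_hat_seq, s_future):
        pred = F.normalize(q(g(z_hat)), dim=-1)            # online branch
        with torch.no_grad():                              # stop-gradient target
            target = F.normalize(f_bar(phi_bar(s)), dim=-1)
        loss = loss - (pred * target).sum(dim=-1).mean()
    return loss
```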
Relationship to Other Methods
| Method | Prediction Target | Learning Signal |
|---|---|---|
| Reconstruction | Raw pixels | Pixel-level error |
| CURL | Augmentations of same observation | Contrastive loss |
| SPR | Future representations | Prediction loss |
| Bisimulation | Behavioral equivalence | Metric distance |
Advantages
- No pixel reconstruction needed (avoids pixel-level details)
- No negative samples needed (a challenge for contrastive methods)
- Naturally compatible with temporal difference learning
Summary and Selection Guide
Complexity-Performance Tradeoff
Method complexity (simple to complex):
Data augmentation < Contrastive learning < Self-prediction < Bisimulation < World models
Recommended path:
1. First try data augmentation (DrQ-v2)
2. If insufficient, add contrastive/self-predictive auxiliary tasks
3. If generalization is needed, consider bisimulation metrics
4. If planning is needed, use world models
Practical Recommendations
| Scenario | Recommended Method |
|---|---|
| Pixel input, rapid prototyping | DrQ-v2 |
| Need sample efficiency | CURL / SPR |
| Visual generalization (background changes) | Bisimulation metrics |
| Need model predictions | Dreamer |
| Discrete actions (Atari) | SPR + Rainbow |
| Continuous control | DrQ-v2 |
References
- Kostrikov et al., "Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels" (ICLR 2021)
- Yarats et al., "Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning" (ICLR 2022)
- Laskin et al., "Reinforcement Learning with Augmented Data" (NeurIPS 2020)
- Laskin et al., "CURL: Contrastive Unsupervised Representations for Reinforcement Learning" (ICML 2020)
- Zhang et al., "Learning Invariant Representations for Reinforcement Learning without Reconstruction" (ICLR 2021)
- Schwarzer et al., "Data-Efficient Reinforcement Learning with Self-Predictive Representations" (ICLR 2021)
- Hafner et al., "Dream to Control: Learning Behaviors by Latent Imagination" (ICLR 2020)
- Gelada et al., "DeepMDP: Learning Continuous Latent Space Models for Representation Learning" (ICML 2019)