Meta-Reinforcement Learning
Overview
Meta-Reinforcement Learning (Meta-RL) aims to learn how to learn — by training on a large number of related tasks, the agent acquires the ability to rapidly adapt to new tasks.
Problem Setting
Given a task distribution \(p(\mathcal{T})\), each task \(\mathcal{T}_i\) is an MDP \((\mathcal{S}, \mathcal{A}, P_i, R_i, \gamma)\) where transition dynamics or reward functions may differ.
Goal: Learn a meta-policy/meta-learner that can rapidly adapt to new tasks with minimal interaction.
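As a concrete (toy) illustration of a task distribution, the sketch below samples goal-reaching tasks that share \(\mathcal{S}, \mathcal{A}, P\) and differ only in the reward function. The 2-D goal-reaching family and the function names are illustrative assumptions, not part of the formal setting above.

```python
import numpy as np

def sample_task(rng: np.random.Generator) -> dict:
    """Draw one task from a toy p(T): a 2-D goal-reaching problem where only
    the goal position (hence the reward function R_i) differs across tasks."""
    return {"goal": rng.uniform(-1.0, 1.0, size=2)}

def reward(task: dict, state: np.ndarray) -> float:
    # Shared dynamics, task-specific reward: negative distance to this task's goal.
    return -float(np.linalg.norm(state - task["goal"]))

rng = np.random.default_rng(0)
train_tasks = [sample_task(rng) for _ in range(100)]   # tasks seen during meta-training
test_tasks = [sample_task(rng) for _ in range(10)]     # held-out tasks for adaptation
```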
Difference from Standard RL
| Property | Standard RL | Meta-RL |
|---|---|---|
| Training | Single task | Task distribution |
| Objective | Single-task optimality | Cross-task rapid adaptation |
| Generalization | Within state space | Within task space |
| Sample requirement | Large (per task) | Small (new tasks) |
RL²: Learning to Reinforcement Learn
Core Idea
Duan et al. (2016) and Wang et al. (2016) independently proposed encoding the entire RL algorithm in the weights of an RNN.
Architecture
Concatenate multiple episodes on the same task into one long sequence processed by an RNN. At each step the network receives the current state together with the previous action, reward, and termination flag:

\[ h_t = f_\theta\!\left(h_{t-1},\, s_t,\, a_{t-1},\, r_{t-1},\, d_{t-1}\right), \qquad a_t \sim \pi_\theta(\cdot \mid h_t), \]

where:
- \(h_t\): RNN hidden state, encoding task information
- \(d_{t-1}\): Previous step's termination flag
- \(\theta\): Meta-parameters, trained via RL over many tasks
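A minimal recurrent policy along these lines is sketched below (PyTorch, discrete actions); the class name and layer sizes are illustrative assumptions rather than the architecture from the original papers.

```python
import torch
from torch import nn

class RL2Policy(nn.Module):
    """RL^2-style recurrent policy: the GRU input at step t is
    (s_t, a_{t-1}, r_{t-1}, d_{t-1}) and the hidden state h_t carries
    task information across steps and across episodes."""
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        in_dim = obs_dim + num_actions + 2   # obs + one-hot prev action + prev reward + prev done
        self.gru = nn.GRUCell(in_dim, hidden)
        self.pi = nn.Linear(hidden, num_actions)   # policy head
        self.v = nn.Linear(hidden, 1)              # value head for the outer-loop RL algorithm

    def forward(self, obs, prev_action_onehot, prev_reward, prev_done, h):
        x = torch.cat([obs, prev_action_onehot, prev_reward, prev_done], dim=-1)
        h = self.gru(x, h)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h), h
```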
Key Insights
- The RNN hidden state implicitly performs task inference — inferring the current task by observing rewards and transitions
- The entire "learning algorithm" is encoded in the RNN's forward pass
- Outer-loop RL (e.g., PPO) is used for meta-training; inner-loop "learning" occurs through the RNN's hidden state
Training Procedure
Meta-training loop:

```
for each meta-iteration:
    sample task T ~ p(T)
    reset RNN hidden state h₀
    # run K episodes on T; the hidden state carries over between episodes
    for episode k = 1 to K:
        for step t = 1 to H:
            a_t ~ π_θ(· | h_t)
            s_{t+1}, r_t, d_t = env.step(a_t)
            h_{t+1} = RNN_θ(h_t, s_{t+1}, a_t, r_t, d_t)
    update meta-parameters θ with the outer-loop RL algorithm,
    using the returns from all K episodes
```
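The rollout below sketches one meta-episode (K episodes on the same sampled task with a persistent hidden state), assuming the RL2Policy above and a Gymnasium-style environment whose reset() keeps the task fixed; the helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def rollout_meta_episode(policy, env, K: int, hidden: int = 128):
    """Run K episodes on one task without resetting the hidden state between
    episodes, so fast 'learning' happens inside the GRU."""
    h = torch.zeros(1, hidden)                       # h_0, reset once per task
    prev_a = torch.zeros(1, env.action_space.n)
    prev_r = torch.zeros(1, 1)
    prev_d = torch.zeros(1, 1)
    transitions = []
    for _ in range(K):
        obs, _ = env.reset()                         # new episode, same task
        done = False
        while not done:
            s = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            dist, value, h = policy(s, prev_a, prev_r, prev_d, h)
            a = dist.sample()
            obs, r, terminated, truncated, _ = env.step(a.item())
            done = terminated or truncated
            transitions.append((s, a, r, done, value))
            prev_a = F.one_hot(a, env.action_space.n).float()
            prev_r = torch.tensor([[float(r)]])
            prev_d = torch.tensor([[float(done)]])
    return transitions                               # fed to the outer-loop RL update
```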
Limitations
- Constrained by RNN memory capacity
- Meta-training is computationally expensive
- Performance degrades on tasks far from the training distribution
MAML for RL
Core Idea
Finn et al. (2017) applied MAML (Model-Agnostic Meta-Learning) to RL, learning a set of initial parameters such that a few gradient steps suffice for adaptation to new tasks.
Bi-Level Optimization
Inner update (task adaptation): for each task \(\mathcal{T}_i\), perform one (or a few) policy-gradient steps from the meta-parameters \(\theta\):

\[ \theta'_i = \theta + \alpha \nabla_\theta J_{\mathcal{T}_i}(\theta) \]

Outer update (meta-optimization): evaluate performance at the adapted parameters \(\theta'_i\) and optimize the meta-parameters \(\theta\):

\[ \theta \leftarrow \theta + \beta \nabla_\theta \sum_i J_{\mathcal{T}_i}(\theta'_i) \]

where \(J_{\mathcal{T}_i}\) is the expected return on task \(\mathcal{T}_i\), and \(\alpha\), \(\beta\) are the inner and outer learning rates.
Algorithm
- Initialize meta-parameters \(\theta\)
- Sample a batch of tasks \(\{\mathcal{T}_i\}\)
- For each task \(\mathcal{T}_i\):
  a. Collect a small number of trajectories with the meta-policy \(\pi_\theta\)
  b. Compute the policy gradient on those trajectories
  c. Perform the inner update to obtain \(\theta'_i\)
  d. Collect new trajectories with the adapted policy \(\pi_{\theta'_i}\)
- Compute the outer gradient from the post-update trajectories and update \(\theta\) (a code sketch follows below)
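A compact sketch of the bi-level update for a single task is shown below (PyTorch). The pg_surrogate helper and the batch format are illustrative assumptions, and a REINFORCE-style surrogate stands in for whatever policy-gradient estimator is actually used.

```python
import torch
from torch import nn
from torch.func import functional_call

def pg_surrogate(policy: nn.Module, params: dict, states, actions, returns):
    """REINFORCE-style surrogate loss evaluated at an arbitrary parameter dict."""
    logits = functional_call(policy, params, (states,))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    return -(log_probs * returns).mean()

def maml_meta_loss(policy: nn.Module, support_batch, query_batch, inner_lr: float = 0.1):
    """Adapt on support trajectories, then evaluate the adapted parameters on
    fresh (query) trajectories; the result is differentiated w.r.t. the meta-parameters."""
    params = dict(policy.named_parameters())

    # Inner update: one policy-gradient step; create_graph=True keeps the graph
    # so the outer (second-order) gradient can flow through this step.
    inner_loss = pg_surrogate(policy, params, *support_batch)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}

    # Outer objective: loss of the adapted parameters; summed over tasks and
    # minimized by an ordinary optimizer in the meta-training loop.
    return pg_surrogate(policy, adapted, *query_batch)
```

In practice the query trajectories for each task are collected with the adapted policy \(\pi_{\theta'_i}\), and trust-region or proximal constraints are often layered on top of this basic scheme.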
Variants
- MAML + TRPO: Uses a vanilla policy-gradient step for the inner update and TRPO for the outer (meta) update
- ProMP (Proximal Meta-Policy Search): Improves MAML's meta-gradient estimation and replaces the outer objective with a PPO-style proximal surrogate
- E-MAML: Meta-learning that accounts for exploration
Pros and Cons
Pros:
- Model-agnostic — works with any differentiable policy
- Theoretically elegant — learns a "good starting point"
- Interpretable adaptation — it is simply gradient descent
Cons:
- Requires second-order gradients (Hessian-vector products)
- Limited inner update steps
- Sensitive to inner learning rate
Context-Based Methods: PEARL
Motivation
Rakelly et al. (2019) proposed PEARL, which adapts via explicit probabilistic task inference, avoiding the need to differentiate through inner-loop policy-gradient updates as in MAML.
Architecture
Context encoder: infers a task representation from a small number of experiences,

\[ z \sim q_\phi(z \mid c), \]

where the context \(c = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^N\) consists of a few interaction experiences.

Conditional policy: the policy is conditioned on the task representation, \(\pi_\theta(a \mid s, z)\).

Conditional value function: the critic (a SAC Q-function in PEARL) is likewise conditioned on the task representation, \(Q(s, a, z)\).
Probabilistic Framework
Uses a variational inference framework in which the posterior over the task variable is assumed to factorize across context transitions into a product of Gaussian factors:

\[ q_\phi(z \mid c_{1:N}) \;\propto\; \prod_{n=1}^{N} \Psi_\phi(z \mid c_n), \qquad \Psi_\phi(z \mid c_n) = \mathcal{N}\!\left(f_\phi^{\mu}(c_n),\, f_\phi^{\sigma}(c_n)\right). \]

This product form keeps the encoder permutation-invariant and enables incremental updates to the task representation as new experiences arrive.
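A sketch of such a product-of-Gaussians context encoder is given below (PyTorch); the network sizes and class name are assumptions, not the reference implementation.

```python
import torch
from torch import nn

class ContextEncoder(nn.Module):
    """Amortized task posterior q(z | c) built from per-transition Gaussian factors.
    Multiplying the factors keeps the encoder permutation-invariant in the context."""
    def __init__(self, transition_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # per-factor mean and log-variance
        )

    def forward(self, context: torch.Tensor) -> torch.distributions.Normal:
        # context: (N, transition_dim), one row per (s, a, r, s') transition
        mu, log_var = self.net(context).chunk(2, dim=-1)
        var = log_var.exp()
        # Product of N Gaussians: precisions add, means combine precision-weighted.
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())
```

Because the posterior factorizes over transitions, adding a new experience to the context only contributes one more factor to the product.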
Training Objective
Combines the RL objective (SAC actor and critic losses, conditioned on \(z\)) with a variational term that keeps the inferred posterior close to a prior \(p(z)\), typically a unit Gaussian. Schematically:

\[ \mathcal{L} \;=\; \mathcal{L}_{\text{actor}} + \mathcal{L}_{\text{critic}} + \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid c)\,\|\,p(z)\big). \]
Comparison with RL² and MAML
| Method | Adaptation Mechanism | Task Inference | Off-Policy Training |
|---|---|---|---|
| RL² | RNN hidden state | Implicit | No |
| MAML | Gradient updates | Via gradients | No |
| PEARL | Probabilistic inference | Explicit posterior | Yes |
PEARL's key advantage is support for off-policy training (using SAC), which greatly improves sample efficiency.
Task Inference
Explicit Task Inference
Learn a task inference model \(p(z | \tau_{1:t})\) to infer task identity or parameters from historical trajectories:
- Bayesian methods: Maintain a posterior distribution over task parameters (a toy example follows this list)
- Neural network methods: Train an encoder to directly output task representations
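As a toy illustration of the Bayesian route, the sketch below maintains a Beta posterior over the unknown task parameters of a Bernoulli-bandit task family; the class and its interface are illustrative assumptions.

```python
import numpy as np

class BetaTaskPosterior:
    """Explicit Bayesian task inference for a toy Bernoulli-bandit task family:
    the unknown task parameter is each arm's success probability, tracked with
    a Beta posterior that is updated after every interaction."""
    def __init__(self, num_arms: int):
        self.alpha = np.ones(num_arms)   # pseudo-counts of successes
        self.beta = np.ones(num_arms)    # pseudo-counts of failures

    def update(self, arm: int, reward: float) -> None:
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward

    def sample_task_params(self) -> np.ndarray:
        # Posterior (Thompson-style) sample of the task parameters
        return np.random.beta(self.alpha, self.beta)
```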
Implicit Task Inference
Perform task inference implicitly through model architecture:
- RNN hidden states in RL²
- Attention mechanisms in Transformers
- Memory-augmented networks
Few-Shot Adaptation
Adaptation Efficiency Metrics
The key metric for Meta-RL is adaptation efficiency, i.e. the expected performance on a new task after \(K\) episodes of adaptation:

\[ \mathrm{Perf}(K) \;=\; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\!\left[ J_{\mathcal{T}}(\pi_K) \right], \]

where \(\pi_K\) is the policy obtained after \(K\) episodes of experience on \(\mathcal{T}\). An ideal Meta-RL method achieves high performance when \(K\) is very small.
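The evaluation loop below sketches how such an adaptation curve (mean return after 1..K adaptation episodes on held-out tasks) might be measured; reset_context, collect_episode, adapt, and evaluate are hypothetical interface names, not from any specific library.

```python
import numpy as np

def adaptation_curve(meta_agent, sample_task, num_tasks: int = 20, K: int = 5) -> np.ndarray:
    """Mean post-adaptation return after k = 1..K episodes on held-out tasks."""
    curve = np.zeros(K)
    for _ in range(num_tasks):
        task = sample_task()
        meta_agent.reset_context()                       # reset hidden state / context / params
        for k in range(K):
            episode = meta_agent.collect_episode(task)   # gather adaptation experience
            meta_agent.adapt(episode)                    # RNN update, gradient step, or posterior update
            curve[k] += meta_agent.evaluate(task)        # average return of the adapted policy
    return curve / num_tasks
```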
Zero-Shot vs. Few-Shot
- Zero-Shot: No new task experience needed (via task descriptions or prior inference)
- One-Shot: Only one episode required
- Few-Shot: A handful of episodes required
Practical Recommendations
| Scenario | Recommended Method |
|---|---|
| Simple task structure | RL² |
| Need for rapid adaptation | MAML |
| Need for high sample efficiency | PEARL |
| Continuous control tasks | PEARL + SAC |
| Discrete action tasks | RL² + PPO |
References
- Duan et al., "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning" (2016)
- Wang et al., "Learning to Reinforcement Learn" (2016)
- Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (ICML 2017)
- Rakelly et al., "Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables" (ICML 2019)
- Rothfuss et al., "ProMP: Proximal Meta-Policy Search" (ICLR 2019)
- Beck et al., "A Survey of Meta-Reinforcement Learning" (2023)