Meta-Reinforcement Learning

Overview

Meta-Reinforcement Learning (Meta-RL) aims to learn how to learn — by training on a large number of related tasks, the agent acquires the ability to rapidly adapt to new tasks.

Problem Setting

Given a task distribution \(p(\mathcal{T})\), each task \(\mathcal{T}_i\) is an MDP \((\mathcal{S}, \mathcal{A}, P_i, R_i, \gamma)\) where transition dynamics or reward functions may differ.

Goal: Learn a meta-policy/meta-learner that can rapidly adapt to new tasks with minimal interaction.
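
As a concrete toy example of such a task distribution, consider 2D goal-reaching tasks that share \(\mathcal{S}\), \(\mathcal{A}\), and \(\gamma\) and differ only in the reward through a task-specific goal. The sketch below is purely illustrative (class and function names are hypothetical, not from any benchmark):

```python
import numpy as np

class GoalReachingTask:
    """Hypothetical task T_i: a 2D point mass whose reward depends on a task-specific goal.
    All tasks share the state space, action space, and discount; only R_i differs."""

    def __init__(self, goal):
        self.goal = goal              # task parameter defining R_i
        self.state = np.zeros(2)

    def reset(self):
        self.state = np.zeros(2)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + np.clip(action, -0.1, 0.1)
        reward = -np.linalg.norm(self.state - self.goal)   # task-specific reward
        return self.state.copy(), reward

def sample_task(rng):
    """Draw T_i ~ p(T): here, a uniformly random goal on the unit circle."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return GoalReachingTask(goal=np.array([np.cos(angle), np.sin(angle)]))
```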

Difference from Standard RL

| Property | Standard RL | Meta-RL |
|---|---|---|
| Training | Single task | Task distribution |
| Objective | Single-task optimality | Cross-task rapid adaptation |
| Generalization | Within state space | Within task space |
| Sample requirement | Large (per task) | Small (new tasks) |

RL²: Learning to Reinforcement Learn

Core Idea

Duan et al. (2016) and Wang et al. (2016) independently proposed encoding the entire RL algorithm in the weights of an RNN.

Architecture

Concatenate multiple episodes into one long sequence processed by an RNN:

\[h_t = f_\theta(h_{t-1}, s_t, a_{t-1}, r_{t-1}, d_{t-1})\]
\[a_t \sim \pi_\theta(\cdot | h_t)\]

where:

  • \(h_t\): RNN hidden state, encoding task information
  • \(d_{t-1}\): Previous step's termination flag
  • \(\theta\): Meta-parameters, trained via RL over many tasks
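
A minimal PyTorch-style sketch of this recurrent policy (assuming discrete actions; module names and dimensions are illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Sketch of an RL^2 policy: the GRU hidden state h_t carries task information
    across steps, and across episodes within the same task."""

    def __init__(self, obs_dim, num_actions, hidden_dim=128):
        super().__init__()
        # Input at step t: [s_t, one-hot a_{t-1}, r_{t-1}, d_{t-1}]
        self.rnn = nn.GRUCell(obs_dim + num_actions + 2, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, s_t, a_prev_onehot, r_prev, d_prev, h_prev):
        # All inputs are batched; r_prev and d_prev have shape (batch, 1).
        x = torch.cat([s_t, a_prev_onehot, r_prev, d_prev], dim=-1)
        h_t = self.rnn(x, h_prev)            # h_t = f_theta(h_{t-1}, s_t, a_{t-1}, r_{t-1}, d_{t-1})
        logits = self.policy_head(h_t)       # a_t ~ pi_theta(. | h_t)
        value = self.value_head(h_t)
        return torch.distributions.Categorical(logits=logits), value, h_t
```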

Key Insights

  • The RNN hidden state implicitly performs task inference — inferring the current task by observing rewards and transitions
  • The entire "learning algorithm" is encoded in the RNN's forward pass
  • Outer-loop RL (e.g., PPO) is used for meta-training; inner-loop "learning" occurs through the RNN's hidden state

Training Procedure

Meta-training loop:
    Sample task T ~ p(T)
    Reset RNN hidden state h₀ (reset only at the task boundary; it carries over across the K episodes below)
    for episode k = 1 to K:
        for step t = 1 to H:
            a_t ~ π_θ(· | h_t)
            s_{t+1}, r_t, d_t = env.step(a_t)
            h_{t+1} = f_θ(h_t, s_{t+1}, a_t, r_t, d_t)
    Update meta-parameters θ with the outer-loop RL algorithm (e.g., PPO), using returns from all K episodes

Limitations

  • Constrained by RNN memory capacity
  • Meta-training is computationally expensive
  • Performance degrades on tasks far from the training distribution

MAML for RL

Core Idea

Finn et al. (2017) applied MAML (Model-Agnostic Meta-Learning) to RL, learning a set of initial parameters such that a few gradient steps suffice for adaptation to new tasks.

Bi-Level Optimization

Inner update (task adaptation):

\[\theta'_i = \theta + \alpha \nabla_\theta J_{\mathcal{T}_i}(\pi_\theta)\]

For each task \(\mathcal{T}_i\), perform one (or a few) policy gradient steps from the meta-parameters \(\theta\).

Outer update (meta-optimization):

\[\theta \leftarrow \theta + \beta \sum_{\mathcal{T}_i} \nabla_\theta J_{\mathcal{T}_i}(\pi_{\theta'_i})\]

Evaluate performance at the adapted parameters \(\theta'_i\) and optimize the meta-parameters \(\theta\).

Algorithm

  1. Initialize meta-parameters \(\theta\)
  2. Sample a batch of tasks \(\{\mathcal{T}_i\}\)
  3. For each task: (a) collect a small number of trajectories with \(\pi_\theta\); (b) compute the policy gradient; (c) perform the inner update to obtain \(\theta'_i\)
  4. Collect new trajectories using \(\theta'_i\)
  5. Compute outer gradient and update \(\theta\)
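
The bi-level update above can be sketched in PyTorch as follows. The helpers `collect` and `pg_loss`, and the functional-style parameter passing, are assumptions for illustration only (real implementations typically rely on torch.func or the higher library):

```python
import torch

def maml_rl_meta_step(policy, tasks, inner_lr, meta_optimizer, collect, pg_loss):
    """One MAML-RL meta-update (sketch). Hypothetical helpers:
      collect(policy, task, params)  -> trajectories gathered with the given parameters
      pg_loss(policy, trajs, params) -> differentiable surrogate, e.g. -(log_prob * advantage).sum()
    Parameters are passed explicitly so the adapted weights stay inside the autograd graph."""
    theta = list(policy.parameters())
    meta_loss = 0.0
    for task in tasks:
        # Inner update: one policy-gradient step from the meta-parameters theta.
        pre_trajs = collect(policy, task, params=theta)
        inner_loss = pg_loss(policy, pre_trajs, params=theta)
        grads = torch.autograd.grad(inner_loss, theta, create_graph=True)  # keep graph for 2nd-order term
        theta_i = [p - inner_lr * g for p, g in zip(theta, grads)]
        # Outer objective: evaluate the adapted parameters theta'_i on fresh trajectories.
        post_trajs = collect(policy, task, params=theta_i)
        meta_loss = meta_loss + pg_loss(policy, post_trajs, params=theta_i)
    meta_optimizer.zero_grad()
    meta_loss.backward()   # differentiates through the inner update (Hessian-vector products)
    meta_optimizer.step()
```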

Variants

  • MAML + TRPO: Uses vanilla policy gradient for the inner updates and TRPO as the outer (meta) optimizer
  • ProMP: Improves MAML's meta-gradient estimation and pre-update credit assignment with a proximal, PPO-style meta-objective
  • E-MAML: Adds an explicit term for the pre-update sampling distribution so that exploration before adaptation is credited

Pros and Cons

Pros:

  • Model-agnostic — works with any differentiable policy
  • Theoretically elegant — learns a "good starting point"
  • Interpretable adaptation — it is simply gradient descent

Cons:

  • Requires second-order gradients (Hessian-vector products)
  • Only a small number of inner update steps is practical, since each step adds to the meta-gradient cost
  • Sensitive to inner learning rate

Context-Based Methods: PEARL

Motivation

Rakelly et al. (2019) proposed PEARL, which performs task inference via probabilistic (variational) inference over a latent task variable, avoiding the need to backpropagate through inner-loop policy-gradient updates as MAML does.

Architecture

Context encoder: Infers a task representation from a small number of experiences

\[z \sim q_\phi(z | c)\]

where the context \(c = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^N\) consists of a few interaction experiences.

Conditional policy: Conditioned on the task representation

\[a \sim \pi_\theta(\cdot | s, z)\]

Conditional value function:

\[Q_\psi(s, a, z)\]

Probabilistic Framework

Uses a variational inference framework:

\[q_\phi(z | c) \propto \prod_{j=1}^N \Psi_\phi(z | s_j, a_j, r_j, s'_j)\]

The posterior is modeled as a (renormalized) product of per-transition Gaussian factors \(\Psi_\phi\). This factorization makes the encoder permutation-invariant over the context and enables incremental updates to the task representation as new experiences arrive.
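
A sketch of this encoder when each factor is a diagonal Gaussian, so the product posterior has a closed form (names and dimensions are illustrative and simplified relative to the actual PEARL implementation):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """PEARL-style amortized inference (sketch): each transition (s, a, r, s') produces an
    independent Gaussian factor over z; the posterior is their renormalized product, which
    is permutation-invariant over the context and can be updated incrementally."""

    def __init__(self, transition_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),   # per-factor mean and log-variance
        )

    def forward(self, context):
        # context: (N, transition_dim), one row per (s, a, r, s') tuple
        mu, logvar = self.net(context).chunk(2, dim=-1)
        var = logvar.exp().clamp(min=1e-7)
        # Product of diagonal Gaussians: precisions add, means are precision-weighted.
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())

# Usage sketch: z = encoder(context).rsample(), then condition pi_theta and Q_psi on [s, z].
```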

Training Objective

Combines the RL objective with variational inference:

\[\max_{\theta, \psi, \phi} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[\mathbb{E}_{z \sim q_\phi(z|c)} [J(\pi_\theta(\cdot, z))] - \beta D_{\text{KL}}(q_\phi(z|c) \| p(z))\right]\]
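
In code, the \(\beta\)-weighted KL term is a one-liner; the toy snippet below uses dummy tensors and a placeholder RL loss just to show how the two terms combine:

```python
import torch

latent_dim, kl_beta = 5, 0.1
q_z = torch.distributions.Normal(torch.randn(latent_dim), torch.rand(latent_dim) + 0.1)  # posterior from the encoder
prior = torch.distributions.Normal(torch.zeros(latent_dim), torch.ones(latent_dim))      # p(z)
kl_term = torch.distributions.kl_divergence(q_z, prior).sum()
rl_loss = torch.tensor(0.0)                # placeholder for the (negated) RL objective -J
total_loss = rl_loss + kl_beta * kl_term   # minimize: -J + beta * KL(q || p)
```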

Comparison with RL² and MAML

| Method | Adaptation Mechanism | Task Inference | Off-Policy Training |
|---|---|---|---|
| RL² | RNN hidden state | Implicit | No |
| MAML | Gradient updates | Via gradients | No |
| PEARL | Probabilistic inference | Explicit posterior | Yes |

PEARL's key advantage is support for off-policy training (using SAC), which greatly improves sample efficiency.

Task Inference

Explicit Task Inference

Learn a task inference model \(p(z | \tau_{1:t})\) to infer task identity or parameters from historical trajectories:

  • Bayesian methods: Maintain a posterior distribution over task parameters (see the toy sketch after this list)
  • Neural network methods: Train an encoder to directly output task representations
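
As a toy illustration of the Bayesian approach, the sketch below maintains a Beta posterior over the unknown success probability of a hypothetical Bernoulli bandit task:

```python
import numpy as np

def update_beta_posterior(alpha, beta, reward):
    """Conjugate Bayesian update for a Bernoulli task parameter p ~ Beta(alpha, beta)."""
    return alpha + reward, beta + (1 - reward)

rng = np.random.default_rng(0)
p_true = 0.7                    # the sampled task's hidden parameter
alpha, beta = 1.0, 1.0          # uniform prior over p
for _ in range(20):
    reward = rng.binomial(1, p_true)
    alpha, beta = update_beta_posterior(alpha, beta, reward)
print("posterior mean of p:", alpha / (alpha + beta))
```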

Implicit Task Inference

Perform task inference implicitly through model architecture:

  • RNN hidden states in RL²
  • Attention mechanisms in Transformers
  • Memory-augmented networks

Few-Shot Adaptation

Adaptation Efficiency Metrics

The key metric for Meta-RL is adaptation efficiency — performance on a new task after K episodes:

\[\text{Performance}(K) = \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} [J(\pi_{\text{adapted}}^K)]\]

An ideal meta-RL method should achieve high performance when K is very small.
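
A minimal evaluation harness for this metric might look as follows; `sample_task`, `adapt`, and `evaluate` are hypothetical callables supplied by whichever meta-RL method is being benchmarked:

```python
import numpy as np

def adaptation_curve(sample_task, adapt, evaluate, num_tasks=20, max_k=5):
    """Monte Carlo estimate of Performance(K) for K = 0..max_k over sampled tasks."""
    returns = np.zeros((num_tasks, max_k + 1))
    for i in range(num_tasks):
        task = sample_task()                        # T ~ p(T)
        for k in range(max_k + 1):
            policy = adapt(task, num_episodes=k)    # adapt with K episodes (K=0: zero-shot)
            returns[i, k] = evaluate(policy, task)  # J(pi_adapted^K)
    return returns.mean(axis=0)                     # average over tasks for each K
```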

Zero-Shot vs. Few-Shot

  • Zero-Shot: No new task experience needed (via task descriptions or prior inference)
  • One-Shot: Only one episode required
  • Few-Shot: A handful of episodes required

Practical Recommendations

| Scenario | Recommended Method |
|---|---|
| Simple task structure | RL² |
| Need for rapid adaptation | MAML |
| Need for high sample efficiency | PEARL |
| Continuous control tasks | PEARL + SAC |
| Discrete action tasks | RL² + PPO |

References

  • Duan et al., "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning" (2016)
  • Wang et al., "Learning to Reinforcement Learn" (2016)
  • Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (ICML 2017)
  • Rakelly et al., "Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables" (ICML 2019)
  • Rothfuss et al., "ProMP: Proximal Meta-Policy Search" (ICLR 2019)
  • Beck et al., "A Survey of Meta-Reinforcement Learning" (2023)
