Meta-Reinforcement Learning

Overview

Meta-Reinforcement Learning (Meta-RL) aims to learn how to learn — by training on a large number of related tasks, the agent acquires the ability to rapidly adapt to new tasks.

Problem Setting

Given a task distribution \(p(\mathcal{T})\), each task \(\mathcal{T}_i\) is an MDP \((\mathcal{S}, \mathcal{A}, P_i, R_i, \gamma)\) where transition dynamics or reward functions may differ.

Goal: Learn a meta-policy/meta-learner that can rapidly adapt to new tasks with minimal interaction.
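
As a concrete toy example of such a task distribution, consider 2D goal-reaching tasks that share \(\mathcal{S}\), \(\mathcal{A}\), and \(\gamma\) and differ only in the reward through a task-specific goal. The sketch below is purely illustrative (class and function names are hypothetical, not from any benchmark):

```python
import numpy as np

class GoalReachingTask:
    """Hypothetical task T_i: a 2D point mass whose reward depends on a task-specific goal.
    All tasks share the state space, action space, and discount; only R_i differs."""

    def __init__(self, goal):
        self.goal = goal              # task parameter defining R_i
        self.state = np.zeros(2)

    def reset(self):
        self.state = np.zeros(2)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + np.clip(action, -0.1, 0.1)
        reward = -np.linalg.norm(self.state - self.goal)   # task-specific reward
        return self.state.copy(), reward

def sample_task(rng):
    """Draw T_i ~ p(T): here, a uniformly random goal on the unit circle."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return GoalReachingTask(goal=np.array([np.cos(angle), np.sin(angle)]))
```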

Difference from Standard RL

| Property | Standard RL | Meta-RL |
|---|---|---|
| Training | Single task | Task distribution |
| Objective | Single-task optimality | Cross-task rapid adaptation |
| Generalization | Within state space | Within task space |
| Sample requirement | Large (per task) | Small (new tasks) |

RL²: Learning to Reinforcement Learn

Core Idea

Duan et al. (2016) and Wang et al. (2016) independently proposed encoding the entire RL algorithm in the weights of an RNN.

Architecture

Concatenate multiple episodes into one long sequence processed by an RNN:

\[h_t = f_\theta(h_{t-1}, s_t, a_{t-1}, r_{t-1}, d_{t-1})\]
\[a_t \sim \pi_\theta(\cdot | h_t)\]

where:

  • \(h_t\): RNN hidden state, encoding task information
  • \(d_{t-1}\): Previous step's termination flag
  • \(\theta\): Meta-parameters, trained via RL over many tasks
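
A minimal PyTorch-style sketch of this recurrent policy (assuming discrete actions; module names and dimensions are illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Sketch of an RL^2 policy: the GRU hidden state h_t carries task information
    across steps, and across episodes within the same task."""

    def __init__(self, obs_dim, num_actions, hidden_dim=128):
        super().__init__()
        # Input at step t: [s_t, one-hot a_{t-1}, r_{t-1}, d_{t-1}]
        self.rnn = nn.GRUCell(obs_dim + num_actions + 2, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, s_t, a_prev_onehot, r_prev, d_prev, h_prev):
        # All inputs are batched; r_prev and d_prev have shape (batch, 1).
        x = torch.cat([s_t, a_prev_onehot, r_prev, d_prev], dim=-1)
        h_t = self.rnn(x, h_prev)            # h_t = f_theta(h_{t-1}, s_t, a_{t-1}, r_{t-1}, d_{t-1})
        logits = self.policy_head(h_t)       # a_t ~ pi_theta(. | h_t)
        value = self.value_head(h_t)
        return torch.distributions.Categorical(logits=logits), value, h_t
```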

Key Insights

  • The RNN hidden state implicitly performs task inference — inferring the current task by observing rewards and transitions
  • The entire "learning algorithm" is encoded in the RNN's forward pass
  • Outer-loop RL (e.g., PPO) is used for meta-training; inner-loop "learning" occurs through the RNN's hidden state

Training Procedure

Meta-training loop:
    Sample task T ~ p(T)
    Reset RNN hidden state h₀ (reset only at the task boundary; it carries over across the K episodes below)
    for episode k = 1 to K:
        for step t = 1 to H:
            a_t ~ π_θ(· | h_t)
            s_{t+1}, r_t, d_t = env.step(a_t)
            h_{t+1} = f_θ(h_t, s_{t+1}, a_t, r_t, d_t)
    Update meta-parameters θ with the outer-loop RL algorithm (e.g., PPO), using returns from all K episodes

Limitations

  • Constrained by RNN memory capacity
  • Meta-training is computationally expensive
  • Performance degrades on tasks far from the training distribution

MAML for RL

Core Idea

Finn et al. (2017) applied MAML (Model-Agnostic Meta-Learning) to RL, learning a set of initial parameters such that a few gradient steps suffice for adaptation to new tasks.

Bi-Level Optimization

Inner update (task adaptation):

\[\theta'_i = \theta + \alpha \nabla_\theta J_{\mathcal{T}_i}(\pi_\theta)\]

For each task \(\mathcal{T}_i\), perform one (or a few) policy gradient steps from the meta-parameters \(\theta\).

Outer update (meta-optimization):

\[\theta \leftarrow \theta + \beta \sum_{\mathcal{T}_i} \nabla_\theta J_{\mathcal{T}_i}(\pi_{\theta'_i})\]

Evaluate performance at the adapted parameters \(\theta'_i\) and optimize the meta-parameters \(\theta\).

Algorithm

  1. Initialize meta-parameters \(\theta\)
  2. Sample a batch of tasks \(\{\mathcal{T}_i\}\)
  3. For each task: (a) collect a small number of trajectories with \(\pi_\theta\); (b) compute the policy gradient; (c) perform the inner update to obtain \(\theta'_i\)
  4. Collect new trajectories using \(\theta'_i\)
  5. Compute outer gradient and update \(\theta\)
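
The bi-level update above can be sketched in PyTorch as follows. The helpers `collect` and `pg_loss`, and the functional-style parameter passing, are assumptions for illustration only (real implementations typically rely on torch.func or the higher library):

```python
import torch

def maml_rl_meta_step(policy, tasks, inner_lr, meta_optimizer, collect, pg_loss):
    """One MAML-RL meta-update (sketch). Hypothetical helpers:
      collect(policy, task, params)  -> trajectories gathered with the given parameters
      pg_loss(policy, trajs, params) -> differentiable surrogate, e.g. -(log_prob * advantage).sum()
    Parameters are passed explicitly so the adapted weights stay inside the autograd graph."""
    theta = list(policy.parameters())
    meta_loss = 0.0
    for task in tasks:
        # Inner update: one policy-gradient step from the meta-parameters theta.
        pre_trajs = collect(policy, task, params=theta)
        inner_loss = pg_loss(policy, pre_trajs, params=theta)
        grads = torch.autograd.grad(inner_loss, theta, create_graph=True)  # keep graph for 2nd-order term
        theta_i = [p - inner_lr * g for p, g in zip(theta, grads)]
        # Outer objective: evaluate the adapted parameters theta'_i on fresh trajectories.
        post_trajs = collect(policy, task, params=theta_i)
        meta_loss = meta_loss + pg_loss(policy, post_trajs, params=theta_i)
    meta_optimizer.zero_grad()
    meta_loss.backward()   # differentiates through the inner update (Hessian-vector products)
    meta_optimizer.step()
```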

Variants

  • MAML + TRPO: Uses vanilla policy gradient for the inner updates and TRPO as the outer (meta) optimizer
  • ProMP: Improves MAML's meta-gradient estimation and pre-update credit assignment with a proximal, PPO-style meta-objective
  • E-MAML: Adds an explicit term for the pre-update sampling distribution so that exploration before adaptation is credited

Pros and Cons

Pros:

  • Model-agnostic — works with any differentiable policy
  • Theoretically elegant — learns a "good starting point"
  • Interpretable adaptation — it is simply gradient descent

Cons:

  • Requires second-order gradients (Hessian-vector products)
  • Only a small number of inner update steps is practical, since each step adds to the meta-gradient cost
  • Sensitive to inner learning rate

Context-Based Methods: PEARL

Motivation

Rakelly et al. (2019) proposed PEARL, which performs task inference via probabilistic (variational) inference over a latent task variable, avoiding the need to backpropagate through inner-loop policy-gradient updates as MAML does.

Architecture

Context encoder: Infers a task representation from a small number of experiences

\[z \sim q_\phi(z | c)\]

where the context \(c = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^N\) consists of a few interaction experiences.

Conditional policy: Conditioned on the task representation

\[a \sim \pi_\theta(\cdot | s, z)\]

Conditional value function:

\[Q_\psi(s, a, z)\]

Probabilistic Framework

Uses a variational inference framework:

\[q_\phi(z | c) \propto \prod_{j=1}^N \Psi_\phi(z | s_j, a_j, r_j, s'_j)\]

The posterior is modeled as a (renormalized) product of per-transition Gaussian factors \(\Psi_\phi\). This factorization makes the encoder permutation-invariant over the context and enables incremental updates to the task representation as new experiences arrive.
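
A sketch of this encoder when each factor is a diagonal Gaussian, so the product posterior has a closed form (names and dimensions are illustrative and simplified relative to the actual PEARL implementation):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """PEARL-style amortized inference (sketch): each transition (s, a, r, s') produces an
    independent Gaussian factor over z; the posterior is their renormalized product, which
    is permutation-invariant over the context and can be updated incrementally."""

    def __init__(self, transition_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),   # per-factor mean and log-variance
        )

    def forward(self, context):
        # context: (N, transition_dim), one row per (s, a, r, s') tuple
        mu, logvar = self.net(context).chunk(2, dim=-1)
        var = logvar.exp().clamp(min=1e-7)
        # Product of diagonal Gaussians: precisions add, means are precision-weighted.
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())

# Usage sketch: z = encoder(context).rsample(), then condition pi_theta and Q_psi on [s, z].
```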

Training Objective

Combines the RL objective with variational inference:

\[\max_{\theta, \psi, \phi} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[\mathbb{E}_{z \sim q_\phi(z|c)} [J(\pi_\theta(\cdot, z))] - \beta D_{\text{KL}}(q_\phi(z|c) \| p(z))\right]\]
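
In code, the \(\beta\)-weighted KL term is a one-liner; the toy snippet below uses dummy tensors and a placeholder RL loss just to show how the two terms combine:

```python
import torch

latent_dim, kl_beta = 5, 0.1
q_z = torch.distributions.Normal(torch.randn(latent_dim), torch.rand(latent_dim) + 0.1)  # posterior from the encoder
prior = torch.distributions.Normal(torch.zeros(latent_dim), torch.ones(latent_dim))      # p(z)
kl_term = torch.distributions.kl_divergence(q_z, prior).sum()
rl_loss = torch.tensor(0.0)                # placeholder for the (negated) RL objective -J
total_loss = rl_loss + kl_beta * kl_term   # minimize: -J + beta * KL(q || p)
```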

Comparison with RL² and MAML

| Method | Adaptation Mechanism | Task Inference | Off-Policy Training |
|---|---|---|---|
| RL² | RNN hidden state | Implicit | No |
| MAML | Gradient updates | Via gradients | No |
| PEARL | Probabilistic inference | Explicit posterior | Yes |

PEARL's key advantage is support for off-policy training (using SAC), which greatly improves sample efficiency.

Task Inference

Explicit Task Inference

Learn a task inference model \(p(z | \tau_{1:t})\) to infer task identity or parameters from historical trajectories:

  • Bayesian methods: Maintain a posterior distribution over task parameters (see the toy sketch after this list)
  • Neural network methods: Train an encoder to directly output task representations
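
As a toy illustration of the Bayesian approach, the sketch below maintains a Beta posterior over the unknown success probability of a hypothetical Bernoulli bandit task:

```python
import numpy as np

def update_beta_posterior(alpha, beta, reward):
    """Conjugate Bayesian update for a Bernoulli task parameter p ~ Beta(alpha, beta)."""
    return alpha + reward, beta + (1 - reward)

rng = np.random.default_rng(0)
p_true = 0.7                    # the sampled task's hidden parameter
alpha, beta = 1.0, 1.0          # uniform prior over p
for _ in range(20):
    reward = rng.binomial(1, p_true)
    alpha, beta = update_beta_posterior(alpha, beta, reward)
print("posterior mean of p:", alpha / (alpha + beta))
```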

Implicit Task Inference

Perform task inference implicitly through model architecture:

  • RNN hidden states in RL²
  • Attention mechanisms in Transformers
  • Memory-augmented networks

Few-Shot Adaptation

Adaptation Efficiency Metrics

The key metric for Meta-RL is adaptation efficiency — performance on a new task after K episodes:

\[\text{Performance}(K) = \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} [J(\pi_{\text{adapted}}^K)]\]

An ideal meta-RL method should achieve high performance when K is very small.
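
A minimal evaluation harness for this metric might look as follows; `sample_task`, `adapt`, and `evaluate` are hypothetical callables supplied by whichever meta-RL method is being benchmarked:

```python
import numpy as np

def adaptation_curve(sample_task, adapt, evaluate, num_tasks=20, max_k=5):
    """Monte Carlo estimate of Performance(K) for K = 0..max_k over sampled tasks."""
    returns = np.zeros((num_tasks, max_k + 1))
    for i in range(num_tasks):
        task = sample_task()                        # T ~ p(T)
        for k in range(max_k + 1):
            policy = adapt(task, num_episodes=k)    # adapt with K episodes (K=0: zero-shot)
            returns[i, k] = evaluate(policy, task)  # J(pi_adapted^K)
    return returns.mean(axis=0)                     # average over tasks for each K
```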

Zero-Shot vs. Few-Shot

  • Zero-Shot: No new task experience needed (via task descriptions or prior inference)
  • One-Shot: Only one episode required
  • Few-Shot: A handful of episodes required

Practical Recommendations

| Scenario | Recommended Method |
|---|---|
| Simple task structure | RL² |
| Need for rapid adaptation | MAML |
| Need for high sample efficiency | PEARL |
| Continuous control tasks | PEARL + SAC |
| Discrete action tasks | RL² + PPO |

References

  • Duan et al., "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning" (2016)
  • Wang et al., "Learning to Reinforcement Learn" (2016)
  • Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks" (ICML 2017)
  • Rakelly et al., "Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables" (ICML 2019)
  • Rothfuss et al., "ProMP: Proximal Meta-Policy Search" (ICLR 2019)
  • Beck et al., "A Survey of Meta-Reinforcement Learning" (2023)
