
Inverse Reinforcement Learning

Problem Definition

The goal of Inverse Reinforcement Learning (IRL) is to recover the reward function from expert demonstrations.

Standard RL: Given reward \(R\), find optimal policy \(\pi^*\)

IRL: Given expert demonstrations \(\mathcal{D} = \{(s_0, a_0, s_1, a_1, \ldots)\}\), recover reward function \(R(s, a)\)

Formalization

Given:

  • State space \(\mathcal{S}\), action space \(\mathcal{A}\)
  • Transition dynamics \(P(s'|s, a)\) (possibly unknown)
  • Discount factor \(\gamma\)
  • Expert demonstration trajectories \(\tau_E = \{(s_t, a_t)\}_{t=0}^T\)

Find: Reward function \(R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) such that the expert policy is optimal under this reward.
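
For concreteness, these ingredients can be collected into a small container. This is just an illustrative sketch; the names `IRLProblem` and `Trajectory` are not from any particular library:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

# A demonstration trajectory is a sequence of (state, action) index pairs, as in tau_E.
Trajectory = List[Tuple[int, int]]

@dataclass
class IRLProblem:
    """Inputs to a tabular IRL problem (illustrative names)."""
    n_states: int                     # |S|
    n_actions: int                    # |A|
    gamma: float                      # discount factor
    demos: List[Trajectory]           # expert demonstrations tau_E
    transitions: Optional[np.ndarray] = None  # P[s, a, s']; None if dynamics are unknown
```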

Ill-Posedness of IRL

IRL is an ill-posed problem:

  • The reward function \(R \equiv 0\) makes all policies optimal
  • For a given expert policy, infinitely many consistent reward functions exist
  • Additional inductive biases are needed to select a "good" reward function

Classic IRL Methods

Linear IRL

Assume the reward function is a linear combination of features:

\[R(s) = \mathbf{w}^T \phi(s)\]

where \(\phi(s)\) is the state feature vector and \(\mathbf{w}\) is the weight to be learned.

Feature matching constraint:

\[\mathbb{E}_{\pi_E}[\phi(s)] = \mathbb{E}_{\pi^*_R}[\phi(s)]\]

That is, the feature expectations of the expert policy match those of the optimal policy under the learned reward.

Max-Margin IRL

Abbeel & Ng (2004) proposed finding a weight vector under which the expert outperforms every other policy by the largest possible margin:

\[\max_{\|\mathbf{w}\|_2 \le 1} \; \min_\pi \left[\mathbf{w}^T \mu_E - \mathbf{w}^T \mu_\pi\right]\]

where \(\mu_\pi = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(s_t)\right]\) is the feature expectation of the policy.
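
In practice \(\mu_E\) and \(\mu_\pi\) are estimated from sampled trajectories by averaging discounted feature sums. A minimal sketch, assuming each trajectory is a list of (state, action) pairs and `phi` maps a state to a feature vector:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate of mu = E[sum_t gamma^t phi(s_t)]."""
    total = None
    for traj in trajectories:
        discount = 1.0
        for s, _a in traj:
            contrib = discount * phi(s)
            total = contrib if total is None else total + contrib
            discount *= gamma
    return total / len(trajectories)
```

With `mu_E = feature_expectations(expert_trajs, phi, gamma)` and `mu_pi` estimated from rollouts of a candidate policy, the max-margin objective above compares \(\mathbf{w}^T \mu_E\) against \(\mathbf{w}^T \mu_\pi\).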

Maximum Entropy IRL (MaxEntIRL)

Motivation

Classic IRL methods assume the expert is strictly optimal, so they struggle with stochastic or suboptimal demonstrations, and they do not by themselves resolve the ambiguity among the many reward functions consistent with the expert. Maximum Entropy IRL (Ziebart et al., 2008) addresses this by modeling expert behavior with the maximum entropy principle: among all trajectory distributions that match the expert's feature expectations, pick the one with the highest entropy.

Probabilistic Model

The probability of a trajectory is proportional to the exponential of its cumulative reward:

\[P(\tau | R) = \frac{1}{Z} \exp\left(\sum_{t=0}^T R(s_t, a_t)\right)\]

where \(Z = \sum_\tau \exp\left(\sum_t R(s_t, a_t)\right)\) is the partition function. This form is exact for deterministic dynamics; with stochastic dynamics, the transition probabilities enter as an additional factor in \(P(\tau \mid R)\).
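
For intuition, in a toy MDP with deterministic dynamics and a short horizon, \(Z\) can be computed by brute-force enumeration of trajectories. A sketch (all names here are made up for illustration):

```python
import itertools
import numpy as np

def trajectory_distribution(R, next_state, n_actions, s0, horizon):
    """P(tau | R) ∝ exp(sum_t R(s_t, a_t)) over all action sequences of fixed length.

    R:          reward table of shape (n_states, n_actions)
    next_state: deterministic dynamics, next_state(s, a) -> s'
    """
    scores = {}
    for actions in itertools.product(range(n_actions), repeat=horizon):
        s, total_reward = s0, 0.0
        for a in actions:
            total_reward += R[s, a]
            s = next_state(s, a)
        scores[actions] = np.exp(total_reward)
    Z = sum(scores.values())                  # partition function
    return {tau: score / Z for tau, score in scores.items()}
```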

Optimization Objective

Maximize the log-likelihood of expert trajectories:

\[\max_R \sum_{\tau \in \mathcal{D}} \log P(\tau | R) = \max_R \sum_{\tau \in \mathcal{D}} \left[\sum_t R(s_t, a_t) - \log Z\right]\]

Gradient

\[\nabla_{\mathbf{w}} \mathcal{L} = \tilde{\mu}_E - \mathbb{E}_{\pi_{R_{\mathbf{w}}}}[\mu]\]

For a linear reward \(R_{\mathbf{w}}(s) = \mathbf{w}^T \phi(s)\), this is the difference between the expert's empirical feature expectations (equivalently, the feature-weighted state-action visitation frequencies) and those of the policy induced by the current reward.

Algorithm

  1. Initialize reward parameters
  2. Solve for the optimal policy under the current reward (forward RL)
  3. Compute the current policy's state visitation frequency
  4. Update reward parameters to make expert trajectories more likely
  5. Repeat steps 2-4 until convergence

Challenge: Each reward update requires re-solving the RL problem, which is computationally expensive.
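
A schematic of this loop for a linear reward \(R_{\mathbf{w}}(s) = \mathbf{w}^T \phi(s)\), reusing the `feature_expectations` helper sketched earlier; `solve_soft_rl` and `rollout` stand in for the forward-RL solver and trajectory sampler, which are assumed to exist:

```python
import numpy as np

def maxent_irl(expert_trajs, phi, gamma, solve_soft_rl, rollout, n_iters=100, lr=0.1):
    """Gradient ascent on the MaxEnt log-likelihood for R_w(s) = w^T phi(s)."""
    d = len(phi(expert_trajs[0][0][0]))                          # feature dimension
    w = np.zeros(d)                                              # step 1: init reward parameters
    mu_E = feature_expectations(expert_trajs, phi, gamma)        # expert feature counts
    for _ in range(n_iters):
        policy = solve_soft_rl(w)                                # step 2: forward RL (expensive)
        mu_pi = feature_expectations(rollout(policy), phi, gamma)  # step 3: policy feature counts
        w += lr * (mu_E - mu_pi)                                 # step 4: gradient ascent
    return w
```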

GAIL (Generative Adversarial Imitation Learning)

Motivation

Ho & Ermon (2016) unified imitation learning and IRL within a generative adversarial framework, avoiding explicit reward function recovery.

Core Idea

Treat the policy as a "generator" and train a discriminator to distinguish expert behavior from policy behavior:

\[\min_\pi \max_D \; \mathbb{E}_{\pi_E}[\log D(s, a)] + \mathbb{E}_\pi[\log(1 - D(s, a))]\]

where:

  • \(D(s, a)\): Discriminator, judging whether \((s, a)\) comes from the expert
  • \(\pi\): Policy (generator), attempting to produce behavior similar to the expert

Training Procedure

  1. Discriminator update: Fix the policy, optimize the discriminator to distinguish expert and policy state-action pairs
  2. Policy update: Use \(-\log(1 - D(s, a))\) as the reward signal and update the policy with policy gradient methods (e.g., TRPO, PPO)
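
A minimal PyTorch-style sketch of these two updates; the discriminator architecture, the dimensions, and the surrounding policy-gradient machinery are assumptions for illustration, not part of any specific GAIL implementation:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2   # example (hypothetical) dimensions
# D(s, a): probability that the state-action pair comes from the expert.
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_update(expert_sa, policy_sa):
    """Step 1: expert pairs labeled 1, policy pairs labeled 0."""
    logits_e, logits_p = disc(expert_sa), disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad()
    loss.backward()
    opt.step()

def gail_reward(policy_sa):
    """Step 2: reward -log(1 - D(s, a)) handed to TRPO/PPO."""
    with torch.no_grad():
        d = torch.sigmoid(disc(policy_sa))
    return -torch.log(1.0 - d + 1e-8)
```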

Analogy with GANs

| GAN | GAIL |
| --- | --- |
| Generator produces images | Policy generates trajectories |
| Discriminator distinguishes real/fake images | Discriminator distinguishes expert/policy behavior |
| Minimizes JS divergence | Minimizes occupancy measure divergence |

Theoretical Connection

GAIL minimizes the Jensen-Shannon divergence between the policy occupancy measure \(\rho_\pi(s, a)\) and the expert occupancy measure \(\rho_E(s, a)\):

\[\min_\pi D_{\text{JS}}(\rho_\pi \| \rho_E)\]

Strengths and Limitations

Strengths:

  • No need to explicitly recover the reward function
  • Needs fewer expert demonstrations than behavioral cloning and suffers less from compounding errors (at the cost of online interaction)
  • Compatible with any policy optimization method

Limitations:

  • Training instability (inherits GAN issues)
  • Does not produce an interpretable reward function
  • Requires online interaction with the environment

AIRL (Adversarial Inverse Reinforcement Learning)

Motivation

Fu et al. (2018) improved upon GAIL by structuring the discriminator to recover a transferable reward function.

Discriminator Structure

\[D_\theta(s, a, s') = \frac{\exp(f_\theta(s, a, s'))}{\exp(f_\theta(s, a, s')) + \pi(a|s)}\]

where:

\[f_\theta(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)\]

  • \(g_\theta(s, a)\): Learned reward function
  • \(h_\phi\): Shaping term similar to a potential function
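
A small sketch of how this discriminator is assembled from the learned functions and the current policy's log-probability; `g`, `h`, and `log_pi` are placeholder callables, not a specific implementation:

```python
import torch

def airl_discriminator(g, h, log_pi, s, a, s_next, gamma):
    """D(s, a, s') = exp(f) / (exp(f) + pi(a|s)),
    with f(s, a, s') = g(s, a) + gamma * h(s') - h(s)."""
    f = g(s, a) + gamma * h(s_next) - h(s)     # shaped reward estimate
    # Equivalent, numerically stable form: D = sigmoid(f - log pi(a|s)).
    return torch.sigmoid(f - log_pi(s, a))
```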

Key Properties

  • At the optimal discriminator, \(g_\theta\) recovers the true reward function up to an equivalence class (under the paper's conditions, e.g. a state-only reward and a decomposability condition on the dynamics)
  • The learned reward can transfer to different dynamics
  • The reward function is interpretable

Relationship to MaxEntIRL

AIRL can be viewed as an adversarial training version of MaxEntIRL, avoiding the high computational cost of the inner RL loop.

Connection to Imitation Learning

Method Taxonomy

Imitation Learning Methods
├── Behavioral Cloning (BC)
│   └── Direct supervised learning: π(a|s) = π_E(a|s)
├── Inverse Reinforcement Learning (IRL)
│   ├── MaxEntIRL: Recover reward → Train policy
│   └── AIRL: Adversarially learn reward
└── Adversarial Imitation Learning
    └── GAIL: Directly match occupancy measures

Comparison

| Method | Requires Environment Interaction | Recovers Reward | Generalization |
| --- | --- | --- | --- |
| Behavioral Cloning | No | No | Weak (distribution shift) |
| DAgger | Yes | No | Medium |
| MaxEntIRL | Yes | Yes | Strong |
| GAIL | Yes | No (implicit) | Medium |
| AIRL | Yes | Yes | Strong (transferable) |

When to Use IRL over Behavioral Cloning

  • An interpretable reward function is needed
  • The reward needs to transfer to different environments
  • Expert demonstrations are limited but online interaction is possible
  • Environment dynamics may change

Modern Developments

Foundation-Model-Based IRL

Leveraging pretrained foundation models (e.g., LLMs, VLMs) to extract implicit reward signals:

  • Language models score how plausible or goal-directed a behavior is
  • Vision models assess the goal-relevance of states
  • Iterative refinement combined with human feedback

Offline IRL

Recovering reward functions from offline datasets without online interaction:

  • Handling distribution shift in the dataset
  • Incorporating conservative estimation techniques

References

  • Ng & Russell, "Algorithms for Inverse Reinforcement Learning" (ICML 2000)
  • Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning" (ICML 2004)
  • Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning" (AAAI 2008)
  • Ho & Ermon, "Generative Adversarial Imitation Learning" (NeurIPS 2016)
  • Fu et al., "Learning Robust Rewards with Adversarial Inverse Reinforcement Learning" (ICLR 2018)
