
Inverse Reinforcement Learning

Problem Definition

The goal of Inverse Reinforcement Learning (IRL) is to recover the reward function from expert demonstrations.

Standard RL: Given reward \(R\), find optimal policy \(\pi^*\)

IRL: Given expert demonstrations \(\mathcal{D} = \{(s_0, a_0, s_1, a_1, \ldots)\}\), recover reward function \(R(s, a)\)

Formalization

Given:

  • State space \(\mathcal{S}\), action space \(\mathcal{A}\)
  • Transition dynamics \(P(s'|s, a)\) (possibly unknown)
  • Discount factor \(\gamma\)
  • Expert demonstration trajectories \(\tau_E = \{(s_t, a_t)\}_{t=0}^T\)

Find: Reward function \(R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) such that the expert policy is optimal under this reward.
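
For concreteness, these ingredients can be collected into a small container. This is just an illustrative sketch; the names `IRLProblem` and `Trajectory` are not from any particular library:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

# A demonstration trajectory is a sequence of (state, action) index pairs, as in tau_E.
Trajectory = List[Tuple[int, int]]

@dataclass
class IRLProblem:
    """Inputs to a tabular IRL problem (illustrative names)."""
    n_states: int                     # |S|
    n_actions: int                    # |A|
    gamma: float                      # discount factor
    demos: List[Trajectory]           # expert demonstrations tau_E
    transitions: Optional[np.ndarray] = None  # P[s, a, s']; None if dynamics are unknown
```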

Ill-Posedness of IRL

IRL is an ill-posed problem:

  • The reward function \(R \equiv 0\) makes all policies optimal
  • For a given expert policy, infinitely many consistent reward functions exist
  • Additional inductive biases are needed to select a "good" reward function

Classic IRL Methods

Linear IRL

Assume the reward function is a linear combination of features:

\[R(s) = \mathbf{w}^T \phi(s)\]

where \(\phi(s)\) is the state feature vector and \(\mathbf{w}\) is the weight to be learned.

Feature matching constraint:

\[\mathbb{E}_{\pi_E}[\phi(s)] = \mathbb{E}_{\pi^*_R}[\phi(s)]\]

That is, the feature expectations of the expert policy match those of the optimal policy under the learned reward.

Max-Margin IRL

Abbeel & Ng (2004) proposed finding a weight vector under which the expert outperforms every other policy by the largest possible margin:

\[\max_{\|\mathbf{w}\|_2 \le 1} \; \min_\pi \left[\mathbf{w}^T \mu_E - \mathbf{w}^T \mu_\pi\right]\]

where \(\mu_\pi = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(s_t)\right]\) is the feature expectation of the policy.
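
In practice \(\mu_E\) and \(\mu_\pi\) are estimated from sampled trajectories by averaging discounted feature sums. A minimal sketch, assuming each trajectory is a list of (state, action) pairs and `phi` maps a state to a feature vector:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma):
    """Monte Carlo estimate of mu = E[sum_t gamma^t phi(s_t)]."""
    total = None
    for traj in trajectories:
        discount = 1.0
        for s, _a in traj:
            contrib = discount * phi(s)
            total = contrib if total is None else total + contrib
            discount *= gamma
    return total / len(trajectories)
```

With `mu_E = feature_expectations(expert_trajs, phi, gamma)` and `mu_pi` estimated from rollouts of a candidate policy, the max-margin objective above compares \(\mathbf{w}^T \mu_E\) against \(\mathbf{w}^T \mu_\pi\).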

Maximum Entropy IRL (MaxEntIRL)

Motivation

Classic IRL methods assume the expert is strictly optimal, so they struggle with stochastic or suboptimal demonstrations, and they do not by themselves resolve the ambiguity among the many reward functions consistent with the expert. Maximum Entropy IRL (Ziebart et al., 2008) addresses this by modeling expert behavior with the maximum entropy principle: among all trajectory distributions that match the expert's feature expectations, pick the one with the highest entropy.

Probabilistic Model

The probability of a trajectory is proportional to the exponential of its cumulative reward:

\[P(\tau | R) = \frac{1}{Z} \exp\left(\sum_{t=0}^T R(s_t, a_t)\right)\]

where \(Z = \sum_\tau \exp\left(\sum_t R(s_t, a_t)\right)\) is the partition function. This form is exact for deterministic dynamics; with stochastic dynamics, the transition probabilities enter as an additional factor in \(P(\tau \mid R)\).
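
For intuition, in a toy MDP with deterministic dynamics and a short horizon, \(Z\) can be computed by brute-force enumeration of trajectories. A sketch (all names here are made up for illustration):

```python
import itertools
import numpy as np

def trajectory_distribution(R, next_state, n_actions, s0, horizon):
    """P(tau | R) ∝ exp(sum_t R(s_t, a_t)) over all action sequences of fixed length.

    R:          reward table of shape (n_states, n_actions)
    next_state: deterministic dynamics, next_state(s, a) -> s'
    """
    scores = {}
    for actions in itertools.product(range(n_actions), repeat=horizon):
        s, total_reward = s0, 0.0
        for a in actions:
            total_reward += R[s, a]
            s = next_state(s, a)
        scores[actions] = np.exp(total_reward)
    Z = sum(scores.values())                  # partition function
    return {tau: score / Z for tau, score in scores.items()}
```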

Optimization Objective

Maximize the log-likelihood of expert trajectories:

\[\max_R \sum_{\tau \in \mathcal{D}} \log P(\tau | R) = \max_R \sum_{\tau \in \mathcal{D}} \left[\sum_t R(s_t, a_t) - \log Z\right]\]

Gradient

\[\nabla_{\mathbf{w}} \mathcal{L} = \tilde{\mu}_E - \mathbb{E}_{\pi_{R_{\mathbf{w}}}}[\mu]\]

For a linear reward \(R_{\mathbf{w}}(s) = \mathbf{w}^T \phi(s)\), this is the difference between the expert's empirical feature expectations (equivalently, the feature-weighted state-action visitation frequencies) and those of the policy induced by the current reward.

Algorithm

  1. Initialize reward parameters
  2. Solve for the optimal policy under the current reward (forward RL)
  3. Compute the current policy's state visitation frequency
  4. Update reward parameters to make expert trajectories more likely
  5. Repeat steps 2-4 until convergence

Challenge: Each reward update requires re-solving the RL problem, which is computationally expensive.
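
A schematic of this loop for a linear reward \(R_{\mathbf{w}}(s) = \mathbf{w}^T \phi(s)\), reusing the `feature_expectations` helper sketched earlier; `solve_soft_rl` and `rollout` stand in for the forward-RL solver and trajectory sampler, which are assumed to exist:

```python
import numpy as np

def maxent_irl(expert_trajs, phi, gamma, solve_soft_rl, rollout, n_iters=100, lr=0.1):
    """Gradient ascent on the MaxEnt log-likelihood for R_w(s) = w^T phi(s)."""
    d = len(phi(expert_trajs[0][0][0]))                          # feature dimension
    w = np.zeros(d)                                              # step 1: init reward parameters
    mu_E = feature_expectations(expert_trajs, phi, gamma)        # expert feature counts
    for _ in range(n_iters):
        policy = solve_soft_rl(w)                                # step 2: forward RL (expensive)
        mu_pi = feature_expectations(rollout(policy), phi, gamma)  # step 3: policy feature counts
        w += lr * (mu_E - mu_pi)                                 # step 4: gradient ascent
    return w
```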

GAIL (Generative Adversarial Imitation Learning)

Motivation

Ho & Ermon (2016) unified imitation learning and IRL within a generative adversarial framework, avoiding explicit reward function recovery.

Core Idea

Treat the policy as a "generator" and train a discriminator to distinguish expert behavior from policy behavior:

\[\min_\pi \max_D \; \mathbb{E}_{\pi_E}[\log D(s, a)] + \mathbb{E}_\pi[\log(1 - D(s, a))]\]

where:

  • \(D(s, a)\): Discriminator, judging whether \((s, a)\) comes from the expert
  • \(\pi\): Policy (generator), attempting to produce behavior similar to the expert

Training Procedure

  1. Discriminator update: Fix the policy, optimize the discriminator to distinguish expert and policy state-action pairs
  2. Policy update: Use \(-\log(1 - D(s, a))\) as the reward signal and update the policy with policy gradient methods (e.g., TRPO, PPO)
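
A minimal PyTorch-style sketch of these two updates; the discriminator architecture, the dimensions, and the surrounding policy-gradient machinery are assumptions for illustration, not part of any specific GAIL implementation:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2   # example (hypothetical) dimensions
# D(s, a): probability that the state-action pair comes from the expert.
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def discriminator_update(expert_sa, policy_sa):
    """Step 1: expert pairs labeled 1, policy pairs labeled 0."""
    logits_e, logits_p = disc(expert_sa), disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad()
    loss.backward()
    opt.step()

def gail_reward(policy_sa):
    """Step 2: reward -log(1 - D(s, a)) handed to TRPO/PPO."""
    with torch.no_grad():
        d = torch.sigmoid(disc(policy_sa))
    return -torch.log(1.0 - d + 1e-8)
```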

Analogy with GANs

| GAN | GAIL |
| --- | --- |
| Generator produces images | Policy generates trajectories |
| Discriminator distinguishes real/fake images | Discriminator distinguishes expert/policy behavior |
| Minimizes JS divergence | Minimizes occupancy measure divergence |

Theoretical Connection

GAIL minimizes the Jensen-Shannon divergence between the policy occupancy measure \(\rho_\pi(s, a)\) and the expert occupancy measure \(\rho_E(s, a)\):

\[\min_\pi D_{\text{JS}}(\rho_\pi \| \rho_E)\]

Strengths and Limitations

Strengths:

  • No need to explicitly recover the reward function
  • Needs fewer expert demonstrations than behavioral cloning and suffers less from compounding errors (at the cost of online interaction)
  • Compatible with any policy optimization method

Limitations:

  • Training instability (inherits GAN issues)
  • Does not produce an interpretable reward function
  • Requires online interaction with the environment

AIRL (Adversarial Inverse Reinforcement Learning)

Motivation

Fu et al. (2018) improved upon GAIL by structuring the discriminator to recover a transferable reward function.

Discriminator Structure

\[D_\theta(s, a, s') = \frac{\exp(f_\theta(s, a, s'))}{\exp(f_\theta(s, a, s')) + \pi(a|s)}\]

where:

\[f_\theta(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)\]

  • \(g_\theta(s, a)\): Learned reward function
  • \(h_\phi\): Shaping term similar to a potential function
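
A small sketch of how this discriminator is assembled from the learned functions and the current policy's log-probability; `g`, `h`, and `log_pi` are placeholder callables, not a specific implementation:

```python
import torch

def airl_discriminator(g, h, log_pi, s, a, s_next, gamma):
    """D(s, a, s') = exp(f) / (exp(f) + pi(a|s)),
    with f(s, a, s') = g(s, a) + gamma * h(s') - h(s)."""
    f = g(s, a) + gamma * h(s_next) - h(s)     # shaped reward estimate
    # Equivalent, numerically stable form: D = sigmoid(f - log pi(a|s)).
    return torch.sigmoid(f - log_pi(s, a))
```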

Key Properties

  • At the optimal discriminator, \(g_\theta\) recovers the true reward function up to an equivalence class (under the paper's conditions, e.g. a state-only reward and a decomposability condition on the dynamics)
  • The learned reward can transfer to different dynamics
  • The reward function is interpretable

Relationship to MaxEntIRL

AIRL can be viewed as an adversarial training version of MaxEntIRL, avoiding the high computational cost of the inner RL loop.

Connection to Imitation Learning

Method Taxonomy

Imitation Learning Methods
├── Behavioral Cloning (BC)
│   └── Direct supervised learning: π(a|s) = π_E(a|s)
├── Inverse Reinforcement Learning (IRL)
│   ├── MaxEntIRL: Recover reward → Train policy
│   └── AIRL: Adversarially learn reward
└── Adversarial Imitation Learning
    └── GAIL: Directly match occupancy measures

Comparison

| Method | Requires Environment Interaction | Recovers Reward | Generalization |
| --- | --- | --- | --- |
| Behavioral Cloning | No | No | Weak (distribution shift) |
| DAgger | Yes | No | Medium |
| MaxEntIRL | Yes | Yes | Strong |
| GAIL | Yes | No (implicit) | Medium |
| AIRL | Yes | Yes | Strong (transferable) |

When to Use IRL over Behavioral Cloning

  • An interpretable reward function is needed
  • The reward needs to transfer to different environments
  • Expert demonstrations are limited but online interaction is possible
  • Environment dynamics may change

Modern Developments

Foundation-Model-Based IRL

Leveraging pretrained foundation models (e.g., LLMs, VLMs) to extract implicit reward signals:

  • Language models score how plausible or goal-directed a behavior is
  • Vision models assess the goal-relevance of states
  • Iterative refinement combined with human feedback

Offline IRL

Recovering reward functions from offline datasets without online interaction:

  • Handling distribution shift in the dataset
  • Incorporating conservative estimation techniques

References

  • Ng & Russell, "Algorithms for Inverse Reinforcement Learning" (ICML 2000)
  • Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning" (ICML 2004)
  • Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning" (AAAI 2008)
  • Ho & Ermon, "Generative Adversarial Imitation Learning" (NeurIPS 2016)
  • Fu et al., "Learning Robust Rewards with Adversarial Inverse Reinforcement Learning" (ICLR 2018)
