Inverse Reinforcement Learning
Problem Definition
The goal of Inverse Reinforcement Learning (IRL) is to recover the reward function from expert demonstrations.
- Standard RL: given a reward \(R\), find the optimal policy \(\pi^*\)
- IRL: given expert demonstrations \(\mathcal{D} = \{(s_0, a_0, s_1, a_1, \ldots)\}\), recover the reward function \(R(s, a)\)
Formalization
Given:
- State space \(\mathcal{S}\), action space \(\mathcal{A}\)
- Transition dynamics \(P(s'|s, a)\) (possibly unknown)
- Discount factor \(\gamma\)
- Expert demonstration trajectories \(\tau_E = \{(s_t, a_t)\}_{t=0}^T\)
Find: Reward function \(R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) such that the expert policy is optimal under this reward.
Ill-Posedness of IRL
IRL is an ill-posed problem:
- The reward function \(R \equiv 0\) makes all policies optimal
- For a given expert policy, infinitely many consistent reward functions exist
- Additional inductive biases are needed to select a "good" reward function
Classic IRL Methods
Linear IRL
Assume the reward function is a linear combination of features:
\[R(s) = \mathbf{w}^\top \phi(s)\]
where \(\phi(s)\) is the state feature vector and \(\mathbf{w}\) is the weight vector to be learned.
Feature matching constraint:
\[\mathbb{E}_{\pi_E}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t)\right] = \mathbb{E}_{\pi^*}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t)\right]\]
That is, the feature expectations of the expert policy match those of the optimal policy under the learned reward.
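Empirical feature expectations are easy to estimate from demonstrations. A minimal sketch (the one-hot feature map, discount factor, and toy trajectories below are illustrative assumptions, not from the text):

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Empirical discounted feature expectation mu = E[sum_t gamma^t phi(s_t)].

    trajectories: list of state sequences; phi: state -> feature vector.
    (Illustrative helper for the feature-matching constraint above.)
    """
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

# Toy example: 3 grid states with one-hot features, two demonstrations.
phi = lambda s: np.eye(3)[s]
demos = [[0, 1, 2], [0, 2, 2]]
mu_E = feature_expectations(demos, phi, gamma=0.5)  # array([1.  , 0.25, 0.5 ])
```

Matching this vector against the learned policy's feature expectations is exactly the constraint above.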
Max-Margin IRL
Abbeel & Ng (2004) proposed maximizing the value difference between the expert policy and other policies:
\[\max_{m, \, \mathbf{w}: \|\mathbf{w}\|_2 \le 1} m \quad \text{s.t.} \quad \mathbf{w}^\top \mu_{\pi_E} \ge \mathbf{w}^\top \mu_\pi + m \quad \forall \pi\]
where \(\mu_\pi = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(s_t)\right]\) is the feature expectation of the policy.
Maximum Entropy IRL (MaxEntIRL)
Motivation
Classic IRL methods produce deterministic policies and cannot account for stochasticity in expert behavior. Maximum Entropy IRL (Ziebart et al., 2008) assumes expert behavior follows the maximum entropy principle.
Probabilistic Model
The probability of a trajectory is proportional to the exponential of its cumulative reward:
\[P(\tau) = \frac{1}{Z} \exp\big(R(\tau)\big), \qquad R(\tau) = \sum_t R(s_t, a_t), \qquad Z = \sum_{\tau'} \exp\big(R(\tau')\big)\]
where \(Z\) is the partition function.
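On an enumerable trajectory set, this distribution is just a softmax over cumulative rewards. A tiny numerical sketch (the three return values are assumed for illustration):

```python
import numpy as np

# MaxEnt IRL trajectory model: P(tau) = exp(R(tau)) / Z over a small,
# enumerable set of candidate trajectories.
returns = np.array([2.0, 1.0, 0.0])   # R(tau_i), illustrative values
Z = np.exp(returns).sum()             # partition function
probs = np.exp(returns) / Z

# Higher-return trajectories are exponentially more likely, but every
# trajectory keeps nonzero probability -- this is how the model accounts
# for suboptimal (stochastic) expert behavior.
```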
Optimization Objective
Maximize the log-likelihood of expert trajectories:
\[\max_\theta \; \mathcal{L}(\theta) = \sum_{\tau \in \mathcal{D}} \log P_\theta(\tau) = \sum_{\tau \in \mathcal{D}} R_\theta(\tau) - |\mathcal{D}| \log Z\]
Gradient
\[\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}\big[\nabla_\theta R_\theta(\tau)\big] - \mathbb{E}_{\tau \sim P_\theta}\big[\nabla_\theta R_\theta(\tau)\big]\]
This is the difference between the expert's state-action visitation frequencies and those of the current policy.
Algorithm
1. Initialize the reward parameters
2. Solve for the optimal policy under the current reward (forward RL)
3. Compute the current policy's state visitation frequencies
4. Update the reward parameters to make the expert trajectories more likely
5. Repeat steps 2-4 until convergence
Challenge: Each reward update requires re-solving the RL problem, which is computationally expensive.
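The loop can be sketched on a tiny tabular MDP. Everything below is an illustrative assumption (the dynamics, the expert visitation frequencies, the learning rate); the inner forward-RL step is soft value iteration, the standard choice for MaxEnt IRL:

```python
import numpy as np

# MaxEnt IRL sketch on a 3-state, 2-action deterministic MDP.
# Reward is linear in one-hot state features, so theta[s] is R(s).
n_states, n_actions, horizon = 3, 2, 5
T = np.array([[1, 0], [2, 1], [2, 2]])   # T[s, a] = next state (assumed dynamics)
mu_expert = np.array([0.2, 0.2, 0.6])    # expert state-visitation freq. (assumed data)
theta = np.zeros(n_states)               # reward parameters

for it in range(200):
    # Forward RL: soft (MaxEnt) value iteration under the current reward.
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = theta[:, None] + V[T]            # Q[s, a] = R(s) + V(T[s, a])
        V = np.log(np.exp(Q).sum(axis=1))    # soft backup
    policy = np.exp(Q - V[:, None])          # pi(a|s) = exp(Q(s, a) - V(s))

    # Expected state-visitation frequency of the current policy (start in s=0).
    d = np.zeros(n_states)
    d[0] = 1.0
    mu_policy = d.copy()
    for _ in range(horizon - 1):
        d_next = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                d_next[T[s, a]] += d[s] * policy[s, a]
        d = d_next
        mu_policy += d
    mu_policy /= horizon

    # Gradient ascent on the MaxEnt log-likelihood:
    # grad = expert visitation - policy visitation.
    theta += 0.1 * (mu_expert - mu_policy)
```

The computational bottleneck mentioned above is visible here: every reward update reruns the full value-iteration and visitation-frequency passes.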
GAIL (Generative Adversarial Imitation Learning)
Motivation
Ho & Ermon (2016) unified imitation learning and IRL within a generative adversarial framework, avoiding explicit reward function recovery.
Core Idea
Treat the policy as a "generator" and train a discriminator to distinguish expert behavior from policy behavior:
\[\min_\pi \max_D \; \mathbb{E}_{\pi_E}\big[\log D(s, a)\big] + \mathbb{E}_{\pi}\big[\log(1 - D(s, a))\big] - \lambda H(\pi)\]
where:
- \(D(s, a)\): Discriminator, estimating the probability that \((s, a)\) comes from the expert
- \(\pi\): Policy (generator), attempting to produce behavior similar to the expert
- \(H(\pi)\): Causal entropy regularizer with coefficient \(\lambda\)
Training Procedure
- Discriminator update: Fix the policy, optimize the discriminator to distinguish expert and policy state-action pairs
- Policy update: Use \(-\log(1 - D(s, a))\) as the reward signal and update the policy with policy gradient methods (e.g., TRPO, PPO)
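One such round can be sketched numerically with a logistic discriminator over hand-made state-action features. The feature clusters, batch sizes, and learning rate are all illustrative assumptions; a real implementation would use neural networks and an RL policy update:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(w, x):
    """D(s, a) = sigmoid(w . phi(s, a)): probability that (s, a) is expert."""
    return 1.0 / (1.0 + np.exp(-x @ w))

# Toy feature batches: expert pairs cluster around +1, policy pairs around -1.
x_expert = rng.normal(loc=1.0, size=(64, 2))
x_policy = rng.normal(loc=-1.0, size=(64, 2))
w = np.zeros(2)

# Discriminator update: gradient ascent on
#   E_expert[log D] + E_policy[log(1 - D)].
for _ in range(100):
    grad = (x_expert.T @ (1 - discriminator(w, x_expert)) / len(x_expert)
            - x_policy.T @ discriminator(w, x_policy) / len(x_policy))
    w += 0.5 * grad

# Policy update: the policy would now be trained with TRPO/PPO using
#   r(s, a) = -log(1 - D(s, a))
# as its reward; here we only compute that reward signal.
rewards = -np.log(1.0 - discriminator(w, x_policy) + 1e-8)
```

State-action pairs the discriminator mistakes for expert behavior receive high reward, which is what pushes the policy toward the expert's occupancy measure.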
Analogy with GANs
| GAN | GAIL |
|---|---|
| Generator produces images | Policy generates trajectories |
| Discriminator distinguishes real/fake images | Discriminator distinguishes expert/policy behavior |
| Minimizes JS divergence | Minimizes occupancy measure divergence |
Theoretical Connection
GAIL minimizes the Jensen-Shannon divergence between the policy occupancy measure \(\rho_\pi(s, a)\) and the expert occupancy measure \(\rho_E(s, a)\):
\[\min_\pi \; D_{\mathrm{JS}}\big(\rho_\pi \,\|\, \rho_E\big) - \lambda H(\pi)\]
Strengths and Limitations
Strengths:
- No need to explicitly recover the reward function
- Needs far fewer expert demonstrations than behavioral cloning (it exploits online interaction)
- Compatible with any policy optimization method
Limitations:
- Training instability (inherits GAN issues)
- Does not produce an interpretable reward function
- Requires online interaction with the environment
AIRL (Adversarial Inverse Reinforcement Learning)
Motivation
Fu et al. (2018) improved upon GAIL by structuring the discriminator to recover a transferable reward function.
Discriminator Structure
\[D_{\theta, \phi}(s, a, s') = \frac{\exp\big(f_{\theta, \phi}(s, a, s')\big)}{\exp\big(f_{\theta, \phi}(s, a, s')\big) + \pi(a|s)}, \qquad f_{\theta, \phi}(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)\]
where:
- \(g_\theta(s, a)\): Learned reward function
- \(h_\phi\): Shaping term similar to a potential function
Key Properties
- At the optimal discriminator, \(g_\theta\) recovers the true reward function (up to an equivalence class)
- The learned reward can transfer to different dynamics
- The reward function is interpretable
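Plugging concrete numbers into the discriminator structure shows why the reward is recoverable: the discriminator's logit equals \(f\) minus the log policy probability, so \(g_\theta\) can be read off once training converges. All values below are assumed for illustration:

```python
import numpy as np

# AIRL discriminator in numeric form:
#   D(s, a, s') = exp(f) / (exp(f) + pi(a|s)),
#   f(s, a, s') = g(s, a) + gamma * h(s') - h(s).
gamma = 0.99
g = 1.2                 # learned reward term g_theta(s, a)   (assumed value)
h_s, h_s2 = 0.5, 0.7    # potential h_phi at s and s'         (assumed values)
pi_a_s = 0.25           # current policy probability pi(a|s)  (assumed value)

f = g + gamma * h_s2 - h_s
D = np.exp(f) / (np.exp(f) + pi_a_s)

# The logit recovers f up to the policy term:
#   log D - log(1 - D) = f - log pi(a|s)
logit = np.log(D) - np.log(1.0 - D)
```

The \(\gamma h_\phi(s') - h_\phi(s)\) term is potential-based shaping, which is what disentangles the transferable reward \(g_\theta\) from dynamics-dependent shaping.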
Relationship to MaxEntIRL
AIRL can be viewed as an adversarial training version of MaxEntIRL, avoiding the high computational cost of the inner RL loop.
Connection to Imitation Learning
Method Taxonomy
Imitation Learning Methods
├── Behavioral Cloning (BC)
│ └── Direct supervised learning: π(a|s) = π_E(a|s)
├── Inverse Reinforcement Learning (IRL)
│ ├── MaxEntIRL: Recover reward → Train policy
│ └── AIRL: Adversarially learn reward
└── Adversarial Imitation Learning
└── GAIL: Directly match occupancy measures
Comparison
| Method | Requires Environment Interaction | Recovers Reward | Generalization |
|---|---|---|---|
| Behavioral Cloning | No | No | Weak (distribution shift) |
| DAgger | Yes | No | Medium |
| MaxEntIRL | Yes | Yes | Strong |
| GAIL | Yes | No (implicit) | Medium |
| AIRL | Yes | Yes | Strong (transferable) |
When to Use IRL over Behavioral Cloning
- An interpretable reward function is needed
- The reward needs to transfer to different environments
- Expert demonstrations are limited but online interaction is possible
- Environment dynamics may change
Modern Developments
Foundation-Model-Based IRL
Leveraging pretrained foundation models (e.g., LLMs, VLMs) to extract implicit reward signals:
- Language models evaluate the reasonableness of behavior
- Vision models assess the goal-relevance of states
- Iterative optimization combined with human feedback
Offline IRL
Recovering reward functions from offline datasets without online interaction:
- Handling distribution shift in datasets
- Combining conservative estimation methods
Related Topics
- Imitation Learning: Detailed discussion of imitation learning methods
- Reward Engineering: General reward function design approaches
- LLM Post-Training: Application of reward models in RLHF
References
- Ng & Russell, "Algorithms for Inverse Reinforcement Learning" (ICML 2000)
- Abbeel & Ng, "Apprenticeship Learning via Inverse Reinforcement Learning" (ICML 2004)
- Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning" (AAAI 2008)
- Ho & Ermon, "Generative Adversarial Imitation Learning" (NeurIPS 2016)
- Fu et al., "Learning Robust Rewards with Adversarial Inverse Reinforcement Learning" (ICLR 2018)