Hierarchical Reinforcement Learning
Motivation
Standard flat RL methods struggle with the following challenges:
- Long time horizons: Tasks requiring hundreds of steps to complete
- Sparse rewards: Extended periods without feedback signals
- Combinatorial complexity: Exponential blowup of the action space
- Transfer and reuse: Sharing skills across different tasks
Hierarchical Reinforcement Learning (HRL) addresses these issues by making decisions at multiple temporal scales.
The Options Framework
Definition
The Options framework by Sutton, Precup & Singh (1999) provides the theoretical foundation for HRL. An Option \(\omega\) is defined by a triple:
- \(I_\omega \subseteq \mathcal{S}\): Initiation Set — the set of states where the Option can be initiated
- \(\pi_\omega: \mathcal{S} \times \mathcal{A} \to [0, 1]\): Intra-option Policy — the behavioral policy during Option execution
- \(\beta_\omega: \mathcal{S} \to [0, 1]\): Termination Condition — the probability of terminating the Option at each state
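The triple can be sketched as a small data structure; all names below are illustrative, not from any specific library:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    initiation_set: Set[State]            # I_w: states where the Option may start
    policy: Callable[[State], Action]     # pi_w: intra-option policy (deterministic here)
    termination: Callable[[State], float] # beta_w: P(terminate | state)

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set

# Toy option: "move right until reaching state 3"
move_right = Option(
    initiation_set={0, 1, 2},
    policy=lambda s: "right",
    termination=lambda s: 1.0 if s == 3 else 0.0,
)

assert move_right.can_initiate(1)
assert move_right.termination(3) == 1.0
```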
Semi-Markov Decision Process (SMDP)
Options extend the decision process from MDP to SMDP:
- Each Option may persist for multiple time steps
- The high-level policy selects the next Option when the current one terminates
- This naturally achieves temporal abstraction
Option-Value Function
Under the Options framework, the option-value function satisfies the intra-option Bellman equations:

\[
Q_\Omega(s, \omega) = \sum_{a} \pi_\omega(a \mid s)\, Q_U(s, \omega, a)
\]

\[
Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s', \omega)
\]

\[
U(s', \omega) = \big(1 - \beta_\omega(s')\big)\, Q_\Omega(s', \omega) + \beta_\omega(s')\, \max_{\omega'} Q_\Omega(s', \omega')
\]

where updates within an Option use the intra-option policy \(\pi_\omega\), and \(U\) weighs continuing the current Option against terminating and reselecting.
Option-Critic Architecture
Bacon et al. (2017) proposed end-to-end learning of Options:
- Simultaneously learns intra-option policies \(\pi_\omega\) and termination conditions \(\beta_\omega\)
- Optimized using policy gradient methods
- Termination gradient:

\[
\nabla_\theta J = -\,\mathbb{E}\!\left[\frac{\partial \beta_\omega(s')}{\partial \theta}\, A_\Omega(s', \omega)\right],
\qquad A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')
\]

When an Option's value exceeds the average value \(V_\Omega\) (positive advantage), the gradient decreases its termination probability, so advantageous Options persist longer.
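The sign of this effect can be checked numerically; the values below are made up for illustration:

```python
# Illustration of the option-critic termination-gradient term:
# grad ∝ -dbeta/dtheta * A(s, w), with A(s, w) = Q(s, w) - V(s).
# All numbers are hypothetical.

Q = {("s1", "w"): 2.0}   # option-value Q_Omega(s, w)
V = {"s1": 1.5}          # V_Omega(s): average value over options

def termination_grad_term(s, w, dbeta_dtheta=1.0):
    advantage = Q[(s, w)] - V[s]
    return -dbeta_dtheta * advantage

g = termination_grad_term("s1", "w")
# Q > V, so this return-gradient term is negative: a gradient-ascent
# step on the return lowers beta, keeping a good Option running longer.
assert g < 0
```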
MAXQ Value Decomposition
Core Idea
Dietterich (2000) proposed decomposing the action-value function into a hierarchy of subtask values:

\[
Q^\pi(s, a) = V^\pi_a(s) + C^\pi(s, a)
\]

- \(V^\pi_a(s)\): expected return from executing subtask \(a\) in state \(s\)
- \(C^\pi(s, a)\): expected return from continuing with policy \(\pi\) after completing subtask \(a\) (the completion function)
Hierarchical Decomposition
```
Root Task
├── Sub-task 1
│   ├── Primitive action a₁
│   └── Primitive action a₂
├── Sub-task 2
│   ├── Sub-sub-task 2.1
│   │   ├── Primitive action a₃
│   │   └── Primitive action a₄
│   └── Primitive action a₅
└── Sub-task 3
    └── Primitive action a₆
```
Advantage: Allows subtasks to be learned and reused independently.
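The decomposition \(Q(s,a) = V_a(s) + C(s,a)\) can be evaluated recursively over the task tree; the hierarchy and completion values below are hypothetical:

```python
# MAXQ sketch: Q(s, a) = V_a(s) + C(s, a), evaluated recursively
# over a tiny hand-built task hierarchy. Numbers are illustrative.

primitive_reward = {"a1": 1.0, "a2": 0.5}   # primitive V_a(s) = E[r]

children = {"root": ["sub1"], "sub1": ["a1", "a2"]}
C = {("s0", "root", "sub1"): 0.0,           # completion values C(s, task, a)
     ("s0", "sub1", "a1"): 2.0,
     ("s0", "sub1", "a2"): 3.0}

def V(task, s):
    if task in primitive_reward:            # leaf: one-step expected reward
        return primitive_reward[task]
    return max(Q(task, s, a) for a in children[task])

def Q(task, s, a):
    return V(a, s) + C[(s, task, a)]        # MAXQ: value + completion

assert V("sub1", "s0") == 3.5               # a2: 0.5 + 3.0 beats a1: 1.0 + 2.0
assert V("root", "s0") == 3.5
```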
Feudal Networks
Feudal RL
Building on Dayan & Hinton's (1993) original idea, Vezhnevets et al. (2017) proposed FeUdal Networks (FuN):
Manager:
- Operates at a low frequency
- Sets subgoal directions \(g_t\)
- Observes state changes over longer time scales
Worker:
- Operates at a high frequency
- Receives subgoals set by the Manager
- Executes primitive actions to achieve subgoals
Manager's goal setting: the Manager produces goals as direction vectors in a learned goal space and is trained with the transition policy gradient

\[
\nabla_\theta g_t = A^{M}_t \,\nabla_\theta\, d_{\cos}\!\big(s_{t+c} - s_t,\; g_t(\theta)\big)
\]

where \(d_{\cos}\) denotes cosine similarity, \(c\) is the Manager's horizon, and \(A^{M}_t\) is the Manager's advantage.

Worker's intrinsic reward:

\[
r^{I}_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\big(s_t - s_{t-i},\; g_{t-i}\big)
\]

The Worker is rewarded for moving in the direction specified by the Manager.
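The intrinsic reward amounts to averaging cosine similarities between recent state displacements and the goals set when those displacements began; a minimal sketch on plain 2-D vectors (all data made up):

```python
import math

# Sketch of the FuN Worker's intrinsic reward:
# r_t = (1/c) * sum_i cos(s_t - s_{t-i}, g_{t-i}), horizon c.

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def worker_intrinsic_reward(states, goals, t, c):
    total = 0.0
    for i in range(1, c + 1):
        delta = [a - b for a, b in zip(states[t], states[t - i])]
        total += cos_sim(delta, goals[t - i])   # displacement vs. old goal
    return total / c

states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
goals = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]    # Manager says "move +x"
r = worker_intrinsic_reward(states, goals, t=2, c=2)
assert abs(r - 1.0) < 1e-9                      # moved exactly along the goal
```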
Goal-Conditioned Reinforcement Learning
Universal Value Function
Schaul et al. (2015) proposed universal value function approximators (UVFA) that take goals as additional input:

\[
V(s, g; \theta) \approx V^{*}_{g}(s), \qquad Q(s, a, g; \theta) \approx Q^{*}_{g}(s, a)
\]

The policy is also conditioned on the goal:

\[
\pi(a \mid s, g)
\]

This enables a single network to handle, and generalize across, many goals.
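In tabular form the idea reduces to indexing one value table by the goal as well; a minimal sketch with made-up states and rewards:

```python
from collections import defaultdict

# Tabular sketch of a universal (goal-conditioned) value function:
# one table indexed by (state, action, goal) instead of one table per goal.

Q = defaultdict(float)
alpha, gamma = 0.5, 0.9

def reward(s_next, g):
    return 1.0 if s_next == g else 0.0   # sparse goal-reaching reward

def update(s, a, s_next, g, actions=("left", "right")):
    target = reward(s_next, g) + gamma * max(Q[(s_next, b, g)] for b in actions)
    Q[(s, a, g)] += alpha * (target - Q[(s, a, g)])

# The same table serves two different goals.
update(1, "right", 2, g=2)   # transition that reaches goal 2
update(1, "left", 0, g=0)    # transition that reaches goal 0
assert Q[(1, "right", 2)] == 0.5
assert Q[(1, "left", 0)] == 0.5
```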
Hindsight Experience Replay (HER)
Andrychowicz et al. (2017) proposed a landmark method for solving the sparse reward problem in goal-conditioned RL:
Core idea: Failed trajectories still contain useful information — although the agent did not reach goal \(g\), it did reach other states, which can serve as "imagined goals."
Algorithm:
- Sample trajectory \(\tau = (s_0, a_0, s_1, \ldots, s_T)\) with original goal \(g\)
- For each transition \((s_t, a_t, s_{t+1})\) in the trajectory:
- Store original experience \((s_t, a_t, r_t, s_{t+1}, g)\)
- Select substitute goal \(g' = s_T\) (the "final" strategy; alternatives sample future states from the same trajectory)
- Recompute reward \(r' = R(s_t, a_t, g')\)
- Store "hindsight" experience \((s_t, a_t, r', s_{t+1}, g')\)
Effect: Transforms sparse rewards into dense feedback, dramatically accelerating goal-conditioned RL.
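The relabeling loop above can be sketched as follows; the reward function and buffer layout are illustrative assumptions:

```python
# Sketch of HER's "final" relabeling strategy over one trajectory.

def reward_fn(s_next, g):
    return 0.0 if s_next == g else -1.0      # sparse {-1, 0} reward

def her_relabel(trajectory, goal):
    """trajectory: list of (s, a, s_next); returns replay tuples."""
    buffer = []
    final_state = trajectory[-1][2]          # g' = s_T ("final" strategy)
    for s, a, s_next in trajectory:
        # original experience, with the intended goal
        buffer.append((s, a, reward_fn(s_next, goal), s_next, goal))
        # hindsight experience: pretend the reached state was the goal
        buffer.append((s, a, reward_fn(s_next, final_state), s_next, final_state))
    return buffer

traj = [(0, "r", 1), (1, "r", 2)]            # never reaches the real goal 5
replay = her_relabel(traj, goal=5)
rewards = [r for (_, _, r, _, _) in replay]
# Originals all get -1; the hindsight copy of the last step gets 0.
assert rewards == [-1.0, -1.0, -1.0, 0.0]
```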
HIRO (Hierarchical RL with Off-policy Correction)
Nachum et al. (2018) proposed a two-level hierarchical method:
High-level policy: Selects a subgoal \(g_t \in \mathcal{S}\) every \(c\) steps
Low-level policy: Takes the current state and subgoal as input, outputs actions
Low-level reward (intrinsic): the subgoal is treated as a desired state offset, and the low level is rewarded for realizing it:

\[
r^{lo}_t = -\big\| s_t + g_t - s_{t+1} \big\|_2
\]

Off-policy correction: relabels the high level's stored subgoal with the \(\tilde{g}_t\) that maximizes the likelihood of the low-level actions actually taken, making off-policy data usable:

\[
\tilde{g}_t = \arg\max_{g}\; \sum_{i=t}^{t+c-1} \log \pi_{lo}(a_i \mid s_i, g)
\]
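The intrinsic reward is a plain Euclidean distance penalty; a minimal sketch with hand-picked vectors:

```python
import math

# Sketch of HIRO's low-level intrinsic reward:
# r = -|| s_t + g_t - s_{t+1} ||_2, where the subgoal g_t is a
# desired state offset. Vectors are plain tuples here.

def hiro_intrinsic_reward(s, g, s_next):
    return -math.sqrt(sum((si + gi - ni) ** 2
                          for si, gi, ni in zip(s, g, s_next)))

s, g = (0.0, 0.0), (1.0, 1.0)       # high level asks for offset (+1, +1)
perfect = hiro_intrinsic_reward(s, g, (1.0, 1.0))
off = hiro_intrinsic_reward(s, g, (0.0, 0.0))
assert perfect == 0.0               # exactly reached the subgoal
assert off < perfect                # penalized for missing it
```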
HAC (Hierarchical Actor-Critic)
Levy et al. (2019) extended HER to multi-level hierarchical structures:
- Supports arbitrary numbers of hierarchy levels
- Each level is trained using HER
- The action space of a higher level is the goal space of the level below
Overview of Hierarchical Architectures
```mermaid
graph TD
A[Hierarchical RL Methods] --> B[Temporal Abstraction]
A --> C[Goal-Conditioned]
A --> D[Skill Discovery]
B --> B1[Options Framework]
B --> B2[Option-Critic]
B --> B3[MAXQ]
C --> C1[Universal Value Function]
C --> C2[HER]
C --> C3[HIRO]
C --> C4[HAC]
D --> D1[Feudal Networks]
D --> D2[DIAYN]
D --> D3[Information-Theoretic<br/>Methods]
B1 --> E[Semi-MDP<br/>SMDP]
C2 --> F[Sparse Reward<br/>Solution]
style A fill:#f9f,stroke:#333
style C fill:#bbf,stroke:#333
style F fill:#bfb,stroke:#333
```
Skill Discovery
Unsupervised Skill Learning
Automatically discovering useful behavioral primitives without manually defining subtasks:
DIAYN (Diversity is All You Need):

\[
\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)
\]

Maximizes mutual information between states and skill codes \(Z\) while minimizing the dependence of actions on skills given the state (encouraging skills to be distinguished by the states they visit rather than by their actions). In practice, each skill \(z\) is trained with the pseudo-reward \(\log q_\phi(z \mid s) - \log p(z)\) from a learned discriminator \(q_\phi\).
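The pseudo-reward \(\log q(z \mid s) - \log p(z)\) can be evaluated directly given a discriminator; the posterior values below are hand-built for illustration:

```python
import math

# Sketch of DIAYN's skill pseudo-reward r_z(s) = log q(z|s) - log p(z).
# The discriminator posterior q(z|s) here is hypothetical.

p_z = 0.5                                   # uniform prior over 2 skills

# Skill 0 tends to visit "left", skill 1 tends to visit "right".
q = {("left", 0): 0.9, ("left", 1): 0.1,
     ("right", 0): 0.1, ("right", 1): 0.9}

def diayn_reward(state, z):
    return math.log(q[(state, z)]) - math.log(p_z)

# A skill visiting states that identify it earns positive reward...
assert diayn_reward("left", 0) > 0
# ...and negative reward where the discriminator bets on another skill.
assert diayn_reward("right", 0) < 0
```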
Skill Composition
Combining discovered skills into complex behaviors:
- Sequential composition: Executing multiple skills in sequence
- Parallel composition: Activating multiple skills simultaneously
- Conditional composition: Selecting skills based on the state
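Sequential composition, the simplest of the three, can be sketched as running each skill's policy until its termination predicate fires, then handing off to the next (toy 1-D skills, all names illustrative):

```python
# Sequential skill composition: execute skills one after another,
# each running until its own termination condition holds.

def run_skills(skills, s):
    """skills: list of (policy, terminated) pairs; returns visited states."""
    trace = [s]
    for policy, terminated in skills:
        while not terminated(s):
            s = policy(s)            # primitive step chosen by the skill
            trace.append(s)
    return trace

# Toy 1-D skills: walk up to 3, then walk back down to 0.
go_to_3 = (lambda s: s + 1, lambda s: s >= 3)
go_to_0 = (lambda s: s - 1, lambda s: s <= 0)

trace = run_skills([go_to_3, go_to_0], s=0)
assert trace == [0, 1, 2, 3, 2, 1, 0]
```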
Practical Recommendations
| Scenario | Recommended Method |
|---|---|
| Known subtask structure | Options / MAXQ |
| Sparse goal achievement | HER + DDPG/SAC |
| Multi-level subgoals | HAC |
| Continuous control + subgoals | HIRO |
| Automatic skill discovery | DIAYN / Option-Critic |
References
- Sutton, Precup & Singh, "Between MDPs and Semi-MDPs" (Artificial Intelligence 1999)
- Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" (JAIR 2000)
- Andrychowicz et al., "Hindsight Experience Replay" (NeurIPS 2017)
- Vezhnevets et al., "FeUdal Networks for Hierarchical Reinforcement Learning" (ICML 2017)
- Nachum et al., "Data-Efficient Hierarchical Reinforcement Learning" (NeurIPS 2018)
- Levy et al., "Learning Multi-Level Hierarchies with Hindsight" (ICLR 2019)
- Bacon et al., "The Option-Critic Architecture" (AAAI 2017)
- Eysenbach et al., "Diversity is All You Need" (ICLR 2019)