Hierarchical Reinforcement Learning

Motivation

Standard flat RL methods struggle with the following challenges:

  • Long time horizons: Tasks requiring hundreds of steps to complete
  • Sparse rewards: Extended periods without feedback signals
  • Combinatorial complexity: Exponential blowup of the action space
  • Transfer and reuse: Sharing skills across different tasks

Hierarchical Reinforcement Learning (HRL) addresses these issues by making decisions at multiple temporal scales.

The Options Framework

Definition

The Options framework by Sutton, Precup & Singh (1999) provides the theoretical foundation for HRL. An Option \(\omega\) is defined by a triple:

\[\omega = (I_\omega, \pi_\omega, \beta_\omega)\]
  • \(I_\omega \subseteq \mathcal{S}\): Initiation Set — the set of states where the Option can be initiated
  • \(\pi_\omega: \mathcal{S} \times \mathcal{A} \to [0, 1]\): Intra-option Policy — the behavioral policy during Option execution
  • \(\beta_\omega: \mathcal{S} \to [0, 1]\): Termination Condition — the probability of terminating the Option at each state
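As a concrete reading of the triple, here is a minimal Python sketch (the `Option` class, the toy integer state space, and the `go_right` example are illustrative, not from the paper):

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

State = int                                         # toy discrete state space
Action = int

@dataclass
class Option:
    initiation_set: Set[State]                      # I_w: states where the Option may start
    policy: Callable[[State], Action]               # pi_w (deterministic for simplicity)
    termination: Callable[[State], float]           # beta_w: P(terminate | state)

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set

    def terminates(self, s: State) -> bool:
        return random.random() < self.termination(s)

# Illustrative Option: keep moving right until reaching state 5.
go_right = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: +1,                            # "move right" action
    termination=lambda s: 1.0 if s >= 5 else 0.0,   # terminate deterministically at s >= 5
)
```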

Semi-Markov Decision Process (SMDP)

Options extend the decision process from MDP to SMDP:

  • Each Option may persist for multiple time steps
  • The high-level policy selects the next Option when the current one terminates
  • This naturally achieves temporal abstraction
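The temporal abstraction becomes concrete in the control loop. This is a sketch assuming the `Option` structure above and a toy `env` with a `step(a) -> (state, reward, done)` method and a `done` flag:

```python
def run_option(env, s, option, gamma=0.99):
    """Execute one Option to termination.

    Returns the discounted return accumulated inside the Option, the state
    where it ended, and its duration k: exactly the quantities the
    SMDP-level learner sees.
    """
    ret, discount, k = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r, done = env.step(a)          # assumed toy env API
        ret += discount * r
        discount *= gamma
        k += 1
        if done or option.terminates(s):
            return ret, s, k

def smdp_episode(env, s, high_level_policy, options, gamma=0.99):
    """The high level picks a new Option only when the current one terminates."""
    total, discount = 0.0, 1.0
    while not env.done:                   # assumed episode-termination flag
        available = [o for o in options if o.can_initiate(s)]
        option = high_level_policy(s, available)
        ret, s, k = run_option(env, s, option, gamma)
        total += discount * ret
        discount *= gamma ** k            # an Option advances time by k steps at once
    return total
```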

Option-Value Function

Under the Options framework, the value function is defined as:

\[Q_\Omega(s, \omega) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \omega_0 = \omega\right]\]

where the agent follows the intra-option policy \(\pi_\omega\) while \(\omega\) executes, and selects Options according to the high-level policy thereafter.
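For concreteness, the standard SMDP Q-learning update (Sutton et al., 1999), applied after an Option ran for \(k\) steps from \(s\), accumulated discounted reward \(r\), and terminated in \(s'\):

\[Q_\Omega(s, \omega) \leftarrow Q_\Omega(s, \omega) + \alpha \left[ r + \gamma^{k} \max_{\omega'} Q_\Omega(s', \omega') - Q_\Omega(s, \omega) \right]\]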

Option-Critic Architecture

Bacon et al. (2017) proposed end-to-end learning of Options:

  • Simultaneously learns intra-option policies \(\pi_\omega\) and termination conditions \(\beta_\omega\)
  • Optimized using policy gradient methods
  • Termination gradient:
\[\frac{\partial Q_\Omega}{\partial \beta_\omega(s)} \propto -\left(Q_\Omega(s, \omega) - V_\Omega(s)\right) = -A_\Omega(s, \omega)\]

When an Option's value exceeds the average value over Options (i.e., its advantage \(A_\Omega(s, \omega)\) is positive), gradient ascent lowers its termination probability, so good Options are extended.
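A tabular sketch of the resulting gradient-ascent step on sigmoid-parameterized termination probabilities (variable names and the greedy choice of \(V\) are illustrative):

```python
import numpy as np

def termination_update(theta_beta, s, w, Q, lr=1e-2):
    """One termination-gradient step for Option w at state s (tabular sketch).

    beta_w(s) = sigmoid(theta_beta[s, w]). The advantage A = Q(s, w) - V(s)
    enters with a minus sign, so an Option that beats the average value
    (A > 0) gets a lower termination probability and therefore runs longer.
    """
    beta = 1.0 / (1.0 + np.exp(-theta_beta[s, w]))
    V = Q[s].max()                         # value over Options (greedy variant)
    advantage = Q[s, w] - V
    # gradient ascent on the return: d(return)/d(theta) = -A * beta * (1 - beta)
    theta_beta[s, w] -= lr * advantage * beta * (1.0 - beta)
```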

MAXQ Value Decomposition

Core Idea

Dietterich (2000) proposed decomposing the value function into a hierarchy of subtask values:

\[Q^\pi(s, a) = V^\pi_a(s) + C^\pi(s, a)\]
  • \(V^\pi_a(s)\): Expected return from executing subtask \(a\)
  • \(C^\pi(s, a)\): Expected return from continuing with policy \(\pi\) after completing subtask \(a\) (completion function)

Hierarchical Decomposition

Root Task
├── Sub-task 1
│   ├── Primitive action a₁
│   └── Primitive action a₂
├── Sub-task 2
│   ├── Sub-sub-task 2.1
│   │   ├── Primitive action a₃
│   │   └── Primitive action a₄
│   └── Primitive action a₅
└── Sub-task 3
    └── Primitive action a₆

Advantage: Allows subtasks to be learned and reused independently.
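A recursive evaluation sketch of this decomposition over such a task tree (the `is_primitive` predicate, the `children` map, and the tables `R`, `C` are assumed data structures, not Dietterich's notation):

```python
def V(task, s, C, R, children, is_primitive):
    """MAXQ value of `task` in state s.

    Primitive actions bottom out at their expected one-step reward R[task][s];
    composite tasks recursively take the best child under the decomposition
    Q(i, s, a) = V(a, s) + C(i, s, a).
    """
    if is_primitive(task):
        return R[task][s]
    return max(V(a, s, C, R, children, is_primitive) + C[(task, s, a)]
               for a in children[task])
```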

Feudal Networks

Feudal RL

Building on Dayan & Hinton's (1993) original idea, Vezhnevets et al. (2017) proposed FeUdal Networks (FuN):

Manager:

  • Operates at a low frequency
  • Sets subgoal directions \(g_t\)
  • Observes state changes over longer time scales

Worker:

  • Operates at a high frequency
  • Receives subgoals set by the Manager
  • Executes primitive actions to achieve subgoals

Manager's goal setting:

\[g_t = f_{\text{Mspace}}(s_t)\]

The Manager produces goals as direction vectors in a learned goal space.

Worker's intrinsic reward:

\[r^{I}_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\!\left(s_t - s_{t-i},\, g_{t-i}\right)\]

where \(d_{\cos}\) denotes the cosine similarity between the two vectors.

The Worker is rewarded for moving in the direction specified by the Manager.
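A sketch of computing this reward, with plain numpy arrays standing in for the learned latent embeddings:

```python
import numpy as np

def worker_intrinsic_reward(states, goals, t, c=10):
    """FuN-style intrinsic reward: mean cosine similarity between the state
    displacement over the last i steps and the Manager's goal from i steps ago.

    `states` and `goals` are sequences of numpy vectors standing in for the
    learned latent embeddings; requires t >= c.
    """
    def d_cos(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom > 0 else 0.0

    return np.mean([d_cos(states[t] - states[t - i], goals[t - i])
                    for i in range(1, c + 1)])
```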

Goal-Conditioned Reinforcement Learning

Universal Value Function

Schaul et al. (2015) proposed universal value functions that take goals as additional input:

\[V(s, g; \theta)\]

The policy is also conditioned on the goal:

\[\pi(a | s, g)\]

This enables a single network to handle multiple goals.
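A minimal goal-conditioned policy network sketch (PyTorch; concatenating state and goal is the simplest conditioning scheme, though Schaul et al. also study factored two-stream architectures):

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): one network handles many goals by taking g as input."""

    def __init__(self, state_dim, goal_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal):
        # The simplest conditioning: concatenate state and goal features.
        return self.net(torch.cat([state, goal], dim=-1))
```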

Hindsight Experience Replay (HER)

Andrychowicz et al. (2017) proposed a landmark method for solving the sparse reward problem in goal-conditioned RL:

Core idea: Failed trajectories still contain useful information — although the agent did not reach goal \(g\), it did reach other states, which can serve as "imagined goals."

Algorithm:

  1. Sample trajectory \(\tau = (s_0, a_0, s_1, \ldots, s_T)\) with original goal \(g\)
  2. For each transition \((s_t, a_t, s_{t+1})\) in the trajectory:
    • Store original experience \((s_t, a_t, r_t, s_{t+1}, g)\)
    • Select substitute goal \(g' = s_T\) (the "final" strategy), or sample \(g'\) from states reached later in the same episode (the "future" strategy)
    • Recompute reward \(r' = R(s_t, a_t, g')\)
    • Store "hindsight" experience \((s_t, a_t, r', s_{t+1}, g')\)

Effect: Transforms sparse rewards into dense feedback, dramatically accelerating goal-conditioned RL.
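A sketch of the relabeling step (the "final"/"future" strategy names follow the paper; the function itself and the `reward_fn` signature are illustrative):

```python
import random

def her_relabel(trajectory, reward_fn, strategy="final", k=4):
    """Generate hindsight transitions from one episode.

    trajectory: list of (s, a, s_next) tuples; reward_fn(s, a, g) is the sparse
    goal-conditioned reward. 'final' substitutes the last achieved state as the
    goal; 'future' samples k later states from the same episode (the strategy
    the paper finds most effective).
    """
    relabeled = []
    T = len(trajectory)
    for t, (s, a, s_next) in enumerate(trajectory):
        if strategy == "final":
            goals = [trajectory[-1][2]]                       # g' = s_T
        else:                                                 # 'future'
            goals = [trajectory[random.randint(t, T - 1)][2] for _ in range(k)]
        for g_new in goals:
            relabeled.append((s, a, reward_fn(s, a, g_new), s_next, g_new))
    return relabeled
```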

HIRO (Hierarchical RL with Off-policy Correction)

Nachum et al. (2018) proposed a two-level hierarchical method:

High-level policy: Selects a subgoal \(g_t\) every \(c\) steps, interpreted as a desired offset in (a subspace of) the state space

Low-level policy: Takes the current state and subgoal as input, outputs actions

\[a_t = \pi_{\text{low}}(s_t, g_t)\]

Low-level reward (intrinsic):

\[r^{\text{low}}_t = -\|s_{t+1} - (s_t + g_t)\|\]

Off-policy correction: Relabels the high-level action (the subgoal) so that stale off-policy experience stays consistent with the current low-level policy, choosing the subgoal that best explains the low-level actions actually taken:

\[\tilde{g}_t = \arg\max_{g} \sum_{i=0}^{c-1} \log \pi_{\text{low}}\!\left(a_{t+i} \mid s_{t+i},\, s_t + g - s_{t+i}\right)\]

where \(s_t + g - s_{t+i}\) is the subgoal implied at step \(t+i\) when the original subgoal \(g\) is held fixed as a desired state offset.
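A sketch of the three HIRO ingredients above: the goal transition, the intrinsic reward, and the relabeling search (`log_pi_low` and the candidate set are assumed inputs; in the paper, candidates are \(s_{t+c} - s_t\) plus Gaussian perturbations and the original goal):

```python
import numpy as np

def goal_transition(s_t, g_t, s_next):
    """The subgoal denotes a desired state offset, kept fixed in absolute
    terms: after one step the remaining offset is s_t + g_t - s_next."""
    return s_t + g_t - s_next

def low_level_reward(s, g, s_next):
    """Intrinsic reward: negative distance to the subgoal-implied target."""
    return -np.linalg.norm(s + g - s_next)

def relabel_subgoal(states, actions, log_pi_low, candidates):
    """Pick the candidate subgoal that makes the observed low-level actions
    most likely under the current low-level policy.

    states: s_t .. s_{t+c}; actions: a_t .. a_{t+c-1};
    log_pi_low(a, s, g) is an assumed callable.
    """
    def score(g):
        total, g_i = 0.0, g
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            total += log_pi_low(a, s, g_i)
            g_i = goal_transition(s, g_i, s_next)
        return total
    return max(candidates, key=score)
```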

HAC (Hierarchical Actor-Critic)

Levy et al. (2019) extended HER to multi-level hierarchical structures:

  • Supports arbitrary numbers of hierarchy levels
  • Each level is trained using HER
  • The action space of a higher level is the goal space of the level below
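A sketch of the nested rollout this structure implies (HER relabeling and HAC's subgoal-testing penalty are omitted for brevity; `env.step` and `reached` are toy stand-ins):

```python
import numpy as np

def reached(s, g, eps=0.5):
    """Assumed goal-achievement test: within eps of the goal state."""
    return np.linalg.norm(s - g) < eps

def run_level(level, s, goal, policies, env, H=10):
    """Nested HAC rollout: level i proposes at most H subgoals for level i-1;
    level 0 emits primitive actions."""
    for _ in range(H):
        action = policies[level](s, goal)     # a subgoal, or a primitive action at level 0
        if level == 0:
            s = env.step(action)              # assumed toy env API returning the next state
        else:
            s = run_level(level - 1, s, action, policies, env, H)
        if reached(s, goal):
            break
    return s
```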

Overview of Hierarchical Architectures

graph TD
    A[Hierarchical RL Methods] --> B[Temporal Abstraction]
    A --> C[Goal-Conditioned]
    A --> D[Skill Discovery]

    B --> B1[Options Framework]
    B --> B2[Option-Critic]
    B --> B3[MAXQ]

    C --> C1[Universal Value Function]
    C --> C2[HER]
    C --> C3[HIRO]
    C --> C4[HAC]

    D --> D1[Feudal Networks]
    D --> D2[DIAYN]
    D --> D3[Information-Theoretic<br/>Methods]

    B1 --> E[Semi-MDP<br/>SMDP]
    C2 --> F[Sparse Reward<br/>Solution]

    style A fill:#f9f,stroke:#333
    style C fill:#bbf,stroke:#333
    style F fill:#bfb,stroke:#333

Skill Discovery

Unsupervised Skill Learning

Automatically discovering useful behavioral primitives without manually defining subtasks:

DIAYN (Diversity is All You Need):

\[\max_\pi \; I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)\]

Maximizes mutual information between states and skill codes while keeping the policy entropic and minimizing the dependence of actions on skills given the state (encouraging skills to be distinguished by the states they visit rather than by the actions they take).
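In practice, DIAYN optimizes a variational lower bound: a learned discriminator \(q_\phi(z \mid s)\) yields the pseudo-reward \(\log q_\phi(z \mid s) - \log p(z)\), while a max-entropy RL algorithm such as SAC supplies the \(\mathcal{H}[A \mid S]\) term. A sketch:

```python
import numpy as np

def diayn_reward(discriminator_logits, z, num_skills):
    """Pseudo-reward log q(z|s) - log p(z) with a uniform skill prior p(z).

    `discriminator_logits` are the skill-classifier logits for the current
    state; high reward means the state makes the active skill easy to identify.
    """
    logits = discriminator_logits - discriminator_logits.max()   # numerically stable
    log_q = logits - np.log(np.exp(logits).sum())                # log-softmax
    return log_q[z] + np.log(num_skills)                         # -log p(z) = log K
```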

Skill Composition

Combining discovered skills into complex behaviors:

  • Sequential composition: Executing multiple skills in sequence
  • Parallel composition: Activating multiple skills simultaneously
  • Conditional composition: Selecting skills based on the state

Practical Recommendations

| Scenario | Recommended Method |
|---|---|
| Known subtask structure | Options / MAXQ |
| Sparse goal-reaching rewards | HER + DDPG/SAC |
| Multi-level subgoals | HAC |
| Continuous control with subgoals | HIRO |
| Automatic skill discovery | DIAYN / Option-Critic |

References

  • Sutton, Precup & Singh, "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning" (Artificial Intelligence, 1999)
  • Bacon et al., "The Option-Critic Architecture" (AAAI 2017)
  • Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition" (JAIR 2000)
  • Dayan & Hinton, "Feudal Reinforcement Learning" (NIPS 1993)
  • Vezhnevets et al., "FeUdal Networks for Hierarchical Reinforcement Learning" (ICML 2017)
  • Schaul et al., "Universal Value Function Approximators" (ICML 2015)
  • Andrychowicz et al., "Hindsight Experience Replay" (NeurIPS 2017)
  • Nachum et al., "Data-Efficient Hierarchical Reinforcement Learning" (NeurIPS 2018)
  • Levy et al., "Learning Multi-Level Hierarchies with Hindsight" (ICLR 2019)
  • Eysenbach et al., "Diversity is All You Need: Learning Skills without a Reward Function" (ICLR 2019)
