Exploration Strategies

The Exploration-Exploitation Tradeoff

One of the central challenges in reinforcement learning is the Exploration-Exploitation Dilemma:

  • Exploitation: Selecting the best action according to current knowledge, maximizing short-term return
  • Exploration: Trying new actions to gather more information, potentially improving long-term return

Converging prematurely to a suboptimal policy (too much exploitation) or wasting time on useless actions (too much exploration) both lead to degraded performance.

Classic Exploration Methods

ε-Greedy

The simplest exploration strategy:

\[a_t = \begin{cases} \arg\max_a Q(s_t, a) & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}\]

Variants:

  • ε-decay: \(\varepsilon_t = \max(\varepsilon_{\min}, \varepsilon_0 \cdot \alpha^t)\), gradually reducing exploration as training progresses
  • Adaptive ε: Dynamically adjusting based on learning progress

Pros: Simple to implement, with theoretical guarantees (in the tabular case)

Cons: Exploration is undirected and does not leverage uncertainty information
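
A minimal sketch of ε-greedy with exponential decay; the Q-table shape, decay constants, and toy loop below are illustrative, not tied to any particular algorithm:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick argmax_a Q[state, a] w.p. 1 - epsilon, else a uniform random action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over actions
    return int(np.argmax(Q[state]))            # exploit: greedy action

# epsilon decay: eps_t = max(eps_min, eps_0 * alpha^t)
eps0, eps_min, alpha = 1.0, 0.05, 0.995
rng = np.random.default_rng(0)
Q = np.zeros((10, 4))                          # toy sizes: 10 states, 4 actions
for t in range(1000):
    eps = max(eps_min, eps0 * alpha**t)
    a = epsilon_greedy(Q, state=0, epsilon=eps, rng=rng)
```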

Boltzmann / Softmax Exploration

Probabilistic action selection based on relative Q-values:

\[\pi(a|s) = \frac{\exp(Q(s,a) / \tau)}{\sum_{a'} \exp(Q(s,a') / \tau)}\]

where \(\tau\) is the temperature parameter:

  • \(\tau \to 0\): Approaches greedy policy (pure exploitation)
  • \(\tau \to \infty\): Approaches uniform random policy (pure exploration)

Advantage: Unlike ε-greedy, which spreads exploration uniformly over all non-greedy actions, softmax allocates exploration probability according to the relative values of actions.
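
A short sketch of softmax action selection; subtracting the maximum Q-value before exponentiating is a standard numerical-stability trick and does not change the distribution:

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Softmax over Q-values at temperature tau (stabilized by max-subtraction)."""
    z = (q_values - q_values.max()) / tau
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
q = np.array([1.0, 1.2, 0.5])
probs = boltzmann_policy(q, tau=0.5)           # higher-valued actions get more mass
action = rng.choice(len(q), p=probs)
```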

UCB (Upper Confidence Bound)

Based on the principle of "optimism in the face of uncertainty":

\[a_t = \arg\max_a \left[ Q(s_t, a) + c \sqrt{\frac{\ln t}{N(s_t, a)}} \right]\]

where \(N(s_t, a)\) is the visit count for the state-action pair, and \(c\) is the exploration coefficient.

Core idea: Actions with fewer visits have higher uncertainty upper bounds and should be explored first.

Theoretical guarantee: UCB1 achieves logarithmic regret in multi-armed bandit problems.
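
A sketch of UCB selection for a single state (a k-armed bandit view); forcing one initial pull of each arm and the default c = 2.0 are conventional choices rather than part of the formula:

```python
import numpy as np

def ucb_action(q, counts, t, c=2.0):
    """argmax_a [ q[a] + c * sqrt(ln t / N[a]) ], visiting untried arms first."""
    if (counts == 0).any():
        return int(np.argmin(counts))          # force one visit of each action
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))
```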

Count-Based Exploration

Classic Count Methods

In tabular settings, one can directly maintain visit counts \(N(s)\) for states (or state-actions) and define an intrinsic reward:

\[r^{\text{intrinsic}}(s) = \frac{\beta}{\sqrt{N(s)}}\]

This encourages the agent to visit infrequently seen states.
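
In the tabular case the bonus is a dictionary lookup; the sketch below assumes states are hashable, and the β value is arbitrary:

```python
from collections import defaultdict

class CountBonus:
    """Intrinsic reward r(s) = beta / sqrt(N(s)) for hashable (tabular) states."""
    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1                # update the visit count N(s)
        return self.beta / self.counts[state] ** 0.5
```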

Pseudo-Counts

In high-dimensional state spaces, direct counting is infeasible. Bellemare et al. (2016) proposed using density models to estimate pseudo-counts:

\[\hat{N}(s) = \frac{\rho(s)(1 - \rho'(s))}{\rho'(s) - \rho(s)}\]

where \(\rho(s)\) and \(\rho'(s)\) are the density model estimates before and after updating with \(s\).
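
Given a density model that can report its probability of \(s\) before and after one update on \(s\), the pseudo-count is a one-line computation; the small floor guarding against a non-positive prediction gain is a defensive assumption of this sketch, not part of the paper:

```python
def pseudo_count(rho, rho_prime):
    """Bellemare et al. (2016): N_hat = rho * (1 - rho') / (rho' - rho).

    rho       -- density model's probability of s before updating on s
    rho_prime -- its probability of s after one update on s (recoding prob.)
    """
    gain = rho_prime - rho                     # prediction gain; > 0 for a learning model
    return rho * (1.0 - rho_prime) / max(gain, 1e-12)
```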

Hash-Based Counting

Methods like SimHash map high-dimensional states to discrete codes for counting:

\[\phi(s) = \text{sgn}(A \cdot f(s))\]

where \(A\) is a random projection matrix and \(f(s)\) is the state feature.
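
A sketch of SimHash-based counting, assuming `features` is the already-computed vector \(f(s)\); the code length k and the coupling to the count bonus above are illustrative choices:

```python
import numpy as np
from collections import defaultdict

class SimHashCounter:
    """Count states via the sign of a random projection: phi(s) = sgn(A f(s))."""
    def __init__(self, feature_dim, k=32, beta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, feature_dim))   # random projection matrix
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, features):
        code = tuple(np.sign(self.A @ features).astype(int))  # discrete hash code
        self.counts[code] += 1
        return self.beta / self.counts[code] ** 0.5
```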

Curiosity-Driven Exploration

ICM (Intrinsic Curiosity Module)

Pathak et al. (2017) proposed curiosity-driven exploration with two models:

Inverse Model: Predicts the action taken between two consecutive states

\[\hat{a}_t = g(\phi(s_t), \phi(s_{t+1}))\]

This learns action-relevant state features, filtering out uncontrollable environmental changes.

Forward Model: Predicts the next state's features

\[\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t)\]

Intrinsic reward: The forward model's prediction error

\[r^i_t = \frac{\eta}{2} \|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2\]

Transitions that are harder to predict yield higher rewards, encouraging the agent to explore "surprising" states.
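
A compact PyTorch sketch of the two ICM heads. The MLP encoder, layer widths, and feature dimension are placeholders (the original paper uses a convolutional encoder over pixels); this is a sketch of the structure, not the reference implementation:

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Inverse + forward models of Pathak et al. (2017); sizes are illustrative."""
    def __init__(self, obs_dim, n_actions, feat_dim=64, eta=1.0):
        super().__init__()
        self.eta = eta
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.inverse = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_actions))   # predicts a_t
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 128),
                                           nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, s, a_onehot, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        # intrinsic reward = scaled forward prediction error (detach before use)
        r_int = 0.5 * self.eta * (phi_pred - phi_next).pow(2).sum(-1)
        return r_int, a_logits  # train with CE(a_logits, a_t) + MSE(phi_pred, phi_next)
```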

RND (Random Network Distillation)

Burda et al. (2019) proposed a simpler curiosity-based method:

  • Fixed random network \(f: \mathcal{S} \to \mathbb{R}^k\) (target network, randomly initialized and never updated)
  • Trainable network \(\hat{f}: \mathcal{S} \to \mathbb{R}^k\) (predictor network, trained to mimic the target)

Intrinsic reward:

\[r^i_t = \|\hat{f}(s_{t+1}) - f(s_{t+1})\|^2\]

Key insights:

  • For frequently visited states, the predictor fits the target well, yielding small errors
  • For novel states, prediction error is large, producing high intrinsic reward
  • Mitigates ICM's "noisy TV" problem: the target is a deterministic function of the observation, so irreducible transition noise does not keep the prediction error high
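
A minimal PyTorch sketch of the RND pair; the MLP architectures, embedding dimension k, and updating the predictor on every batch of observations are simplifying assumptions:

```python
import torch
import torch.nn as nn

def make_net(obs_dim, k):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))

class RND:
    """Predictor chases a fixed random target; its error is the intrinsic reward."""
    def __init__(self, obs_dim, k=64, lr=1e-4):
        self.target = make_net(obs_dim, k)
        for p in self.target.parameters():
            p.requires_grad_(False)            # target is never updated
        self.predictor = make_net(obs_dim, k)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, s_next):
        err = (self.predictor(s_next) - self.target(s_next)).pow(2).sum(-1)
        self.opt.zero_grad(); err.mean().backward(); self.opt.step()
        return err.detach()                    # large on novel states
```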

NovelD

Zhang et al. (2021) improved upon RND:

\[r^{\text{NovelD}}_t = \max\left(e(s_{t+1}) - \alpha \, e(s_t), 0\right)\]

where \(e(s)\) is the RND prediction error (novelty) at state \(s\) and \(\alpha\) is a scaling factor; the full criterion additionally restricts the bonus to the first visit of a state within each episode.

Key improvement: Rewards the increase in novelty between consecutive states rather than absolute novelty, concentrating the bonus on the frontier between explored and unexplored regions.
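
Given RND errors for two consecutive states, the bonus is one line (the paper's per-episode first-visit gate is omitted in this sketch):

```python
def noveld_reward(rnd_err_next, rnd_err_curr, alpha=0.5):
    """max(e(s_{t+1}) - alpha * e(s_t), 0): reward crossing into novel territory."""
    return max(rnd_err_next - alpha * rnd_err_curr, 0.0)
```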

Go-Explore

Ecoffet et al. (2021) proposed a two-phase approach for hard-exploration environments (e.g., Montezuma's Revenge):

Phase 1: Explore

  1. Maintain a state archive recording discovered interesting states
  2. Select a state from the archive and deterministically "go back" to it
  3. Explore randomly from that state
  4. Add newly discovered states to the archive

Phase 2: Robustify

Use imitation learning or RL to convert Phase 1 trajectories into a robust policy.

Key innovations:

  • "Go back, then explore" solves the hard exploration problem
  • Decouples exploration from policy learning
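
A pseudocode-level sketch of Phase 1. `restore`, `cell_of`, and `env.snapshot()` are hypothetical hooks standing in for simulator save/restore and the paper's downscaled-observation cell representation; the environment otherwise follows a gym-style `step` interface:

```python
import random

def explore_phase(restore, cell_of, start_snapshot, start_cell,
                  n_iters=1000, k_steps=50):
    """Phase-1 sketch of Go-Explore.

    restore(snap) -- hypothetical hook: rebuilds an env at a saved simulator state
    cell_of(obs)  -- hypothetical hook: discretizes an observation into a cell
    """
    archive = {start_cell: start_snapshot}     # cell -> snapshot that reaches it
    for _ in range(n_iters):
        env = restore(random.choice(list(archive.values())))      # "go back"
        for _ in range(k_steps):                                  # then explore
            obs, reward, done, _ = env.step(env.action_space.sample())
            archive.setdefault(cell_of(obs), env.snapshot())      # keep first visit
            if done:
                break
    return archive   # Phase 2 robustifies trajectories reaching these cells
```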

Taxonomy of Exploration Methods

graph TD
    A[Exploration Methods] --> B[Undirected Exploration]
    A --> C[Directed Exploration]

    B --> B1[ε-Greedy]
    B --> B2[Boltzmann]
    B --> B3[Noisy Networks<br/>NoisyNet]

    C --> D[Uncertainty-Based]
    C --> E[Count-Based]
    C --> F[Prediction-Based]

    D --> D1[UCB]
    D --> D2[Thompson Sampling]
    D --> D3[Bootstrapped DQN]

    E --> E1[Classic Counts]
    E --> E2[Pseudo-Counts]
    E --> E3[Hash-Based Counts]

    F --> F1[ICM]
    F --> F2[RND]
    F --> F3[NovelD]

    A --> G[Systematic Exploration]
    G --> G1[Go-Explore]
    G --> G2[First-Return<br/>Then-Explore]

    style A fill:#f9f,stroke:#333
    style C fill:#bbf,stroke:#333
    style F fill:#bfb,stroke:#333

Challenges in Exploration

Sparse Reward Problem

When extrinsic rewards are very sparse, purely random exploration is unlikely to discover reward signals. Intrinsic reward methods mitigate this by providing dense exploration signals.

The Noisy TV Problem

Prediction-error-based methods can be "attracted" to stochasticity in the environment — if there are unpredictable but uncontrollable changes (like random noise on a TV screen), the agent will remain perpetually "curious" about them.

Solutions:

  • ICM's inverse model filters out uncontrollable changes
  • RND predicts a deterministic function of the observation, so irreducible transition noise does not inflate the error
  • Ensemble models distinguish epistemic from aleatoric uncertainty

Hard Exploration

Some environments require executing a long sequence of specific actions to reach meaningful states (e.g., Montezuma's Revenge requires finding a key, opening a door, then entering a new room).

Solutions:

  • Go-Explore's two-phase approach
  • Subgoal decomposition via hierarchical RL
  • Curriculum learning with gradually increasing difficulty

Practical Recommendations

Scenario                                 Recommended Method
Simple environments, rapid prototyping   ε-greedy (with decay)
Continuous action spaces                 Parameter-space noise (NoisyNet)
Sparse rewards                           RND / ICM
Extremely hard exploration               Go-Explore
Theoretical guarantees needed            UCB / Thompson Sampling
High-dimensional observations            RND (mitigates the noisy TV problem)

References

  • Pathak et al., "Curiosity-driven Exploration by Self-Supervised Prediction" (ICML 2017)
  • Burda et al., "Exploration by Random Network Distillation" (ICLR 2019)
  • Ecoffet et al., "First Return, Then Explore" (Nature 2021)
  • Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation" (NeurIPS 2016)
  • Zhang et al., "NovelD: A Simple yet Effective Exploration Criterion" (NeurIPS 2021)
