Exploration Strategies
The Exploration-Exploitation Tradeoff
One of the central challenges in reinforcement learning is the Exploration-Exploitation Dilemma:
- Exploitation: Selecting the best action according to current knowledge, maximizing short-term return
- Exploration: Trying new actions to gather more information, potentially improving long-term return
Both converging prematurely to a suboptimal policy (too much exploitation) and wasting time on unpromising actions (too much exploration) degrade performance.
Classic Exploration Methods
ε-Greedy
The simplest exploration strategy: with probability \(\varepsilon\) pick a uniformly random action, otherwise act greedily:

\[
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\[4pt]
\dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
\]
Variants:
- ε-decay: \(\varepsilon_t = \max(\varepsilon_{\min}, \varepsilon_0 \cdot \alpha^t)\), gradually reducing exploration as training progresses
- Adaptive ε: Dynamically adjusting based on learning progress
Pros: Simple to implement, with theoretical guarantees (in the tabular case)
Cons: Exploration is undirected and does not leverage uncertainty information
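A minimal pure-Python sketch of ε-greedy selection together with the decay schedule above (the function names and default constants are my own, not from any library):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(t, eps0=1.0, eps_min=0.05, alpha=0.99):
    """Exponential decay: eps_t = max(eps_min, eps0 * alpha**t)."""
    return max(eps_min, eps0 * alpha ** t)
```

In practice `decayed_epsilon(t)` is recomputed every step or episode and passed to `epsilon_greedy`, so exploration fades as training progresses.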
Boltzmann / Softmax Exploration
Probabilistic action selection based on relative Q-values:

\[
\pi(a \mid s) = \frac{\exp\!\big(Q(s, a)/\tau\big)}{\sum_{a'} \exp\!\big(Q(s, a')/\tau\big)}
\]

where \(\tau\) is the temperature parameter:
- \(\tau \to 0\): Approaches greedy policy (pure exploitation)
- \(\tau \to \infty\): Approaches uniform random policy (pure exploration)
Advantage: Compared to ε-greedy, exploration probability is allocated according to action value differences.
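A small sketch of Boltzmann sampling (pure Python; the max-subtraction trick is a standard numerical-stability measure, not part of the formula itself):

```python
import math
import random

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values at temperature tau."""
    m = max(q_values)  # subtract max before exp for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def boltzmann_sample(q_values, tau):
    """Sample an action index according to the softmax distribution."""
    probs = boltzmann_probs(q_values, tau)
    return random.choices(range(len(q_values)), weights=probs)[0]
```

Note how a small `tau` concentrates nearly all probability on the best action, while a large `tau` flattens the distribution toward uniform.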
UCB (Upper Confidence Bound)
Based on the principle of "optimism in the face of uncertainty":

\[
a_t = \arg\max_a \left[ Q(s_t, a) + c \sqrt{\frac{\ln t}{N(s_t, a)}} \right]
\]

where \(N(s_t, a)\) is the visit count for the state-action pair, and \(c\) is the exploration coefficient.
Core idea: Actions with fewer visits have higher uncertainty upper bounds and should be explored first.
Theoretical guarantee: UCB1 achieves logarithmic regret in multi-armed bandit problems.
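A UCB1-style selection rule in the bandit setting, as a sketch (unvisited arms are given priority, which is the usual convention for the undefined \(N = 0\) case):

```python
import math

def ucb1_action(q, n, t, c=2.0):
    """UCB1 arm selection.
    q: empirical mean reward per arm, n: pull counts per arm,
    t: total pulls so far, c: exploration coefficient."""
    # any arm never pulled has an effectively infinite bonus: try it first
    for a, count in enumerate(n):
        if count == 0:
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])
```

With equal means, the less-visited arm wins because its uncertainty bonus is larger.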
Count-Based Exploration
Classic Count Methods
In tabular settings, one can directly maintain visit counts \(N(s)\) for states (or state-action pairs) and define an intrinsic reward such as

\[
r^{\text{int}}(s) = \frac{\beta}{\sqrt{N(s)}}
\]

where \(\beta\) is a scaling coefficient.
This encourages the agent to visit infrequently seen states.
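A tabular count bonus is a few lines of Python; this sketch bundles the counter and the bonus into one class (class and method names are my own):

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular exploration bonus r_int(s) = beta / sqrt(N(s))."""

    def __init__(self, beta=1.0):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state):
        """Record a visit to `state` and return the intrinsic reward."""
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

The bonus for a state shrinks as \(1/\sqrt{N}\), so repeatedly visited states quickly stop being rewarding.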
Pseudo-Counts
In high-dimensional state spaces, direct counting is infeasible. Bellemare et al. (2016) proposed using density models to estimate pseudo-counts:

\[
\hat{N}(s) = \frac{\rho(s)\,\big(1 - \rho'(s)\big)}{\rho'(s) - \rho(s)}
\]

where \(\rho(s)\) and \(\rho'(s)\) are the density model's probability estimates for \(s\) before and after updating on \(s\).
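The pseudo-count formula itself is a one-liner once the two density estimates are available; this sketch assumes \(\rho'(s) > \rho(s)\), i.e. observing \(s\) made the model more confident in it:

```python
def pseudo_count(rho, rho_prime):
    """Pseudo-count N_hat(s) derived from density estimates before
    (rho) and after (rho_prime) updating the model on s.
    Requires rho_prime > rho."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)
```

Intuitively, a large jump from `rho` to `rho_prime` means the state was surprising, yielding a small pseudo-count (and hence a large exploration bonus).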
Hash-Based Counting
Methods like SimHash map high-dimensional states to discrete codes for counting:

\[
\phi(s) = \operatorname{sgn}\!\big(A\, f(s)\big)
\]

where \(A\) is a random projection matrix and \(f(s)\) is the state feature.
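A pure-Python sketch of the SimHash code \(\phi(s) = \operatorname{sgn}(A\,f(s))\); the resulting tuple can then be used as a key in an ordinary count table (matrix construction and names are my own):

```python
import random

def make_projection(k, d, seed=0):
    """A random k x d Gaussian projection matrix (rows of A)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]

def simhash(features, A):
    """Map a d-dimensional feature vector to a k-bit binary code."""
    code = []
    for row in A:
        proj = sum(a * f for a, f in zip(row, features))
        code.append(1 if proj >= 0 else 0)
    return tuple(code)
```

Because the code depends only on the sign of each projection, nearby states usually share a code, which is what makes counting over codes meaningful.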
Curiosity-Driven Exploration
ICM (Intrinsic Curiosity Module)
Pathak et al. (2017) proposed curiosity-driven exploration with two models:
Inverse Model: Predicts the action taken between adjacent states:

\[
\hat{a}_t = g\big(\phi(s_t), \phi(s_{t+1})\big)
\]

This learns action-relevant state features \(\phi\), filtering out uncontrollable environmental changes.

Forward Model: Predicts the next state's features from the current features and the action:

\[
\hat{\phi}(s_{t+1}) = h\big(\phi(s_t), a_t\big)
\]

Intrinsic reward: The forward model's prediction error:

\[
r^{\text{int}}_t = \frac{\eta}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|^2
\]

where \(\eta\) is a scaling factor.
Transitions that are harder to predict yield higher rewards, encouraging the agent to explore "surprising" states.
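A stripped-down sketch of the forward-model half of ICM: a per-action linear model on raw state features (no learned encoder, unlike the full module), trained online, with the squared prediction error serving as the intrinsic reward. All names and the linear parameterization are my own simplifications:

```python
class ForwardModelCuriosity:
    """ICM-style forward-model curiosity on raw features.
    A linear model per action stands in for the network; the squared
    prediction error on (s, a) -> s_next is the intrinsic reward."""

    def __init__(self, dim, n_actions, lr=0.1):
        self.dim, self.lr = dim, lr
        self.W = [[[0.0] * dim for _ in range(dim)] for _ in range(n_actions)]

    def predict(self, s, a):
        return [sum(self.W[a][i][j] * s[j] for j in range(self.dim))
                for i in range(self.dim)]

    def intrinsic_reward(self, s, a, s_next):
        """Return the prediction error, then take one SGD step."""
        pred = self.predict(s, a)
        err = sum((p - t) ** 2 for p, t in zip(pred, s_next))
        for i in range(self.dim):
            g = 2.0 * (pred[i] - s_next[i])
            for j in range(self.dim):
                self.W[a][i][j] -= self.lr * g * s[j]
        return err
```

On a repeatedly seen, predictable transition the reward decays toward zero, which is exactly the "boredom" effect curiosity methods rely on.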
RND (Random Network Distillation)
Burda et al. (2019) proposed a simpler curiosity-based method:
- Fixed random network \(f: \mathcal{S} \to \mathbb{R}^k\) (target network, randomly initialized and never updated)
- Trainable network \(\hat{f}: \mathcal{S} \to \mathbb{R}^k\) (predictor network, trained to mimic the target)
Intrinsic reward: the predictor's error against the fixed target:

\[
r^{\text{int}}(s) = \big\| \hat{f}(s) - f(s) \big\|^2
\]
Key insights:
- For frequently visited states, the predictor fits the target well, yielding small errors
- For novel states, prediction error is large, producing high intrinsic reward
- Avoids the "noisy TV" problem of ICM (no longer sensitive to random but uncontrollable state changes)
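The mechanism fits in a short sketch: a fixed random linear "network" as the target and a trainable linear predictor, again in pure Python for illustration (the real method uses deep networks and normalizes the reward; names here are my own):

```python
import random

class RND:
    """Random Network Distillation sketch with linear maps.
    `target` is random and frozen; `pred` is trained to match it.
    The squared error is the intrinsic reward."""

    def __init__(self, dim, k=4, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
        self.pred = [[0.0] * dim for _ in range(k)]
        self.lr = lr

    @staticmethod
    def _apply(W, s):
        return [sum(w * x for w, x in zip(row, s)) for row in W]

    def intrinsic_reward(self, s):
        t = self._apply(self.target, s)   # fixed random features
        p = self._apply(self.pred, s)     # predictor output
        err = sum((a - b) ** 2 for a, b in zip(p, t))
        # one SGD step: familiar states earn less next time
        for i, row in enumerate(self.pred):
            g = 2.0 * (p[i] - t[i])
            for j in range(len(row)):
                row[j] -= self.lr * g * s[j]
        return err
```

Because the target is deterministic, the predictor can drive the error to zero on visited states, while unseen states keep a large error.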
NovelD
Zhang et al. (2021) improved upon RND:
\[
r^{\text{int}}_t = \max\!\big(e(s_t) - \alpha\, e(s_{t-1}),\ 0\big)
\]

where \(e(s)\) is the RND error for state \(s\) and \(\alpha\) scales the novelty of the preceding state; in the paper the reward is additionally gated so each state is rewarded only on its first visit within an episode.
Key improvement: Uses relative rather than absolute novelty, better distinguishing truly novel states.
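The NovelD criterion itself is a one-line function once the RND errors of the current and previous states are available (the episodic first-visit gate is omitted in this sketch):

```python
def noveld_reward(e_curr, e_prev, alpha=0.5):
    """Novelty-difference reward: positive only when the current state
    is markedly more novel than the state we just came from."""
    return max(e_curr - alpha * e_prev, 0.0)
```

The clipping at zero means the agent is rewarded for crossing from familiar territory into novel territory, not for lingering inside already-novel regions.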
Go-Explore
Ecoffet et al. (2021) proposed a two-phase approach for hard-exploration environments (e.g., Montezuma's Revenge):
Phase 1: Explore
- Maintain a state archive recording discovered interesting states
- Select a state from the archive and deterministically "go back" to it
- Explore randomly from that state
- Add newly discovered states to the archive
Phase 2: Robustify
Use imitation learning or RL to convert Phase 1 trajectories into a robust policy.
Key innovations:
- "Go back, then explore" solves the hard exploration problem
- Decouples exploration from policy learning
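The Phase-1 loop can be sketched on a toy deterministic environment. This is a heavy simplification of Go-Explore: the "go back" step simply restores the archived state (which the original method does in resettable simulators), the cell representation is the raw state, and the least-visited-first selection heuristic stands in for the paper's cell-scoring scheme:

```python
import random

def go_explore_phase1(step, start, n_iters=200, seed=0):
    """Phase-1 sketch: keep an archive of discovered cells, return to a
    rarely-visited cell, then explore randomly from it.
    `step(state, action)` is the environment's deterministic transition."""
    rng = random.Random(seed)
    archive = {start: 0}  # cell -> number of times selected as a start point
    for _ in range(n_iters):
        # prefer rarely-selected cells (random tie-breaking)
        cell = min(archive, key=lambda c: (archive[c], rng.random()))
        archive[cell] += 1
        state = cell                    # "go back": restore the state directly
        for _ in range(5):              # short random exploration burst
            state = step(state, rng.choice([-1, +1]))
            archive.setdefault(state, 0)  # archive newly discovered cells
    return archive
```

Even this toy version shows the key property: exploration always resumes from the frontier of what has been discovered instead of starting over from the initial state.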
Taxonomy of Exploration Methods
```mermaid
graph TD
A[Exploration Methods] --> B[Undirected Exploration]
A --> C[Directed Exploration]
B --> B1[ε-Greedy]
B --> B2[Boltzmann]
B --> B3[Noisy Networks<br/>NoisyNet]
C --> D[Uncertainty-Based]
C --> E[Count-Based]
C --> F[Prediction-Based]
D --> D1[UCB]
D --> D2[Thompson Sampling]
D --> D3[Bootstrapped DQN]
E --> E1[Classic Counts]
E --> E2[Pseudo-Counts]
E --> E3[Hash-Based Counts]
F --> F1[ICM]
F --> F2[RND]
F --> F3[NovelD]
A --> G[Systematic Exploration]
G --> G1[Go-Explore]
G --> G2[First-Return<br/>Then-Explore]
style A fill:#f9f,stroke:#333
style C fill:#bbf,stroke:#333
style F fill:#bfb,stroke:#333
```
Challenges in Exploration
Sparse Reward Problem
When extrinsic rewards are very sparse, purely random exploration is unlikely to discover reward signals. Intrinsic reward methods mitigate this by providing dense exploration signals.
The Noisy TV Problem
Prediction-error-based methods can be "attracted" to stochasticity in the environment — if there are unpredictable but uncontrollable changes (like random noise on a TV screen), the agent will remain perpetually "curious" about them.
Solutions:
- ICM's inverse model filters out uncontrollable changes
- RND's prediction of a deterministic target avoids this issue
- Ensemble models distinguish epistemic from aleatoric uncertainty
Hard Exploration
Some environments require executing a long sequence of specific actions to reach meaningful states (e.g., Montezuma's Revenge requires finding a key, opening a door, then entering a new room).
Solutions:
- Go-Explore's two-phase approach
- Subgoal decomposition via hierarchical RL
- Curriculum learning with gradually increasing difficulty
Practical Recommendations
| Scenario | Recommended Method |
|---|---|
| Simple environments, rapid prototyping | ε-greedy (with decay) |
| Continuous action spaces | Parameter-space noise (NoisyNet) |
| Sparse rewards | RND / ICM |
| Extremely hard exploration | Go-Explore |
| Theoretical guarantees needed | UCB / Thompson Sampling |
| High-dimensional observations | RND (avoids noisy TV problem) |
References
- Pathak et al., "Curiosity-driven Exploration by Self-Supervised Prediction" (ICML 2017)
- Burda et al., "Exploration by Random Network Distillation" (ICLR 2019)
- Ecoffet et al., "First Return, Then Explore" (Nature 2021)
- Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation" (NeurIPS 2016)
- Zhang et al., "NovelD: A Simple yet Effective Exploration Criterion" (NeurIPS 2021)