Introduction to Stable Baselines3 (Using Atari as an Example)
We start with Atari to explore how to design a reinforcement learning agent and how to train it with Stable Baselines3 (SB3), a library that encapsulates the underlying computations. Since our main application of reinforcement learning is robotics, the RL training materials live in the robotics notebook section.
For theoretical background on the underlying computations in reinforcement learning, please refer to the reinforcement learning notes in the AI notebook.
Atari Games
Atari 2600
The Atari 2600 is a home video game console released in 1977, featuring hundreds of classic games (Breakout, Pong, Space Invaders, Enduro, etc.). These games share several characteristics:
- Input: Pixel-based screens (typically 210x160 RGB)
- Output: Discrete actions (up, down, left, right, fire, etc., typically 4-18 actions)
- Reward: Game score
- Wide difficulty range: From simple Pong to extremely difficult Montezuma's Revenge
ALE (Arcade Learning Environment)
ALE is the standardized interface for Atari game research, proposed by Bellemare et al. (2013). It provides:
- Unified game ROM loading and simulation
- Standardized observation and action interfaces
- A benchmark suite of 57 Atari games
In Gymnasium (formerly OpenAI Gym), Atari environments are accessed through the ale-py package:
import gymnasium as gym
# Note: with newer Gymnasium/ale-py versions, Atari envs may need explicit
# registration first: import ale_py; gym.register_envs(ale_py)
env = gym.make("BreakoutNoFrameskip-v4")
obs, info = env.reset()
print(f"Observation space: {env.observation_space.shape}")  # (210, 160, 3)
print(f"Action space: {env.action_space.n}")  # 4
Why Is Atari the Classic RL Benchmark?
- Birth of DQN: Mnih et al. (2013, 2015) first demonstrated the power of deep RL on Atari
- Pixels to decisions: Learning policies directly from raw pixels without hand-crafted features
- Standardized evaluation: A unified evaluation protocol widely used by the research community
SB3 Quick Start
Installation
# Base installation (quote the brackets so the shell does not expand them)
pip install "stable-baselines3[extra]"
# Atari environments
pip install "gymnasium[atari]"
pip install ale-py
# Download Atari ROMs (license agreement required)
ale-import-roms
Three Steps: Create Environment -> Train Model -> Evaluate
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack
# Step 1: Create environment (automatically applies Atari preprocessing)
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=42)
env = VecFrameStack(env, n_stack=4)
# Step 2: Train model
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=10_000_000)
# Step 3: Save and evaluate
model.save("ppo_breakout")
make_atari_env automatically wraps the environment with all preprocessing steps described below.
Environment Preprocessing
The raw Atari observation is a 210x160 RGB image, which is inefficient to use directly. The standard preprocessing pipeline includes:
Frame Skipping
NoFrameskip-v4 environment + MaxAndSkipEnv(skip=4)
- Each action is repeated for 4 frames; returns the pixel-wise maximum of the last 2 frames
- Purpose: Reduce computation and mitigate Atari's sprite flickering issue
- Effect: Decision frequency drops from 60Hz to 15Hz
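The frame-skipping idea can be sketched in a few lines. This is a simplified stand-in for SB3's MaxAndSkipEnv, assuming a Gymnasium-style `env.step()`; it is not the library's actual implementation:

```python
import numpy as np

def max_and_skip_step(env, action, skip=4):
    """Repeat `action` for `skip` frames; return the pixel-wise max of
    the last two observed frames to suppress sprite flickering."""
    total_reward = 0.0
    obs_buffer = []
    terminated = truncated = False
    info = {}
    for _ in range(skip):
        obs, reward, terminated, truncated, info = env.step(action)
        obs_buffer.append(obs)
        total_reward += reward
        if terminated or truncated:
            break
    # Pixel-wise max over the last two frames (or the only frame, if the
    # episode ended immediately)
    max_frame = np.max(np.stack(obs_buffer[-2:]), axis=0)
    return max_frame, total_reward, terminated, truncated, info
```

The accumulated reward over the skipped frames is returned as a single step reward, which is why the agent's decision frequency drops by a factor of 4.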
Grayscale Conversion and Resizing
# 210x160 RGB -> 84x84 Grayscale
WarpFrame(width=84, height=84)
- Converts the color image to grayscale and resizes to 84x84
- Purpose: Reduce input dimensionality; color information is generally unimportant in Atari games
Frame Stacking
VecFrameStack(env, n_stack=4)
- Stacks 4 consecutive frames into one observation, resulting in shape (4, 84, 84)
- Purpose: Enables the network to perceive motion information (velocity, direction), addressing the partial observability of single-frame observations
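A minimal sketch of the stacking logic (the idea behind VecFrameStack, not its actual implementation), assuming 84x84 grayscale frames:

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the last n_stack frames and return them as one observation."""

    def __init__(self, n_stack=4, shape=(84, 84)):
        # Start from an all-zero stack, as is done on environment reset
        self.frames = deque(
            [np.zeros(shape, dtype=np.uint8) for _ in range(n_stack)],
            maxlen=n_stack,
        )

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame falls off the left
        return np.stack(self.frames, axis=0)  # shape (n_stack, 84, 84)
```

Because the stack always contains the most recent frames in order, the network can infer velocity and direction by comparing adjacent channels.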
Reward Clipping
ClipRewardEnv # Clips rewards to {-1, 0, +1}
- Purpose: Score scales vary enormously across games (Pong: -1/+1, Breakout: 1-7 points per brick); clipping lets one set of hyperparameters work across games
- Note: Use original rewards for evaluation, clipped rewards for training
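The clipping itself is just the sign of the reward, as a one-line sketch:

```python
import numpy as np

def clip_reward(reward):
    """Keep only the sign of the reward, mapping it to {-1.0, 0.0, +1.0}."""
    return float(np.sign(reward))
```

For example, a 7-point Breakout brick and a 1-point brick both become +1.0 during training, while evaluation still reports the original score.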
Episodic Life Mechanism
EpisodicLifeEnv # Treats each life loss as an episode termination
- Purpose: Accelerates early-stage learning by letting the agent experience "failure" signals more quickly
Complete Preprocessing Pipeline
make_atari_env automatically applies wrappers in the following order:
Raw Atari environment (210x160x3)
-> NoopResetEnv (random no-op start)
-> MaxAndSkipEnv (frame skipping)
-> EpisodicLifeEnv (life as episode)
-> FireResetEnv (auto-press FIRE to start)
-> WarpFrame (grayscale + 84x84)
-> ClipRewardEnv (reward clipping)
-> VecFrameStack (stack 4 frames)
-> Final observation: (4, 84, 84)
Training Configuration
Hyperparameter Settings
Recommended PPO hyperparameters for Atari (from rl-baselines3-zoo):
model = PPO(
"CnnPolicy",
env,
learning_rate=2.5e-4,
n_steps=128, # Steps collected per update (per env)
batch_size=256, # Mini-batch size
n_epochs=4, # Epochs per update
gamma=0.99, # Discount factor
gae_lambda=0.95, # GAE lambda
clip_range=0.1, # PPO clip range
ent_coef=0.01, # Entropy coefficient (encourages exploration)
vf_coef=0.5, # Value function loss coefficient
max_grad_norm=0.5, # Gradient clipping
tensorboard_log="./tb_logs/",
verbose=1,
)
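A quick sanity check of what these numbers imply per update, assuming n_envs=8 as in the earlier environment snippet:

```python
n_envs, n_steps = 8, 128
batch_size, n_epochs = 256, 4

rollout_size = n_envs * n_steps            # 1024 transitions collected per update
minibatches = rollout_size // batch_size   # 4 mini-batches per epoch
gradient_steps = minibatches * n_epochs    # 16 gradient steps per update

print(rollout_size, minibatches, gradient_steps)  # 1024 4 16
```

If you change n_envs or n_steps, keep batch_size a divisor of the rollout size so every transition is used in each epoch.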
Callbacks
from stable_baselines3.common.callbacks import (
EvalCallback,
CheckpointCallback,
)
# Evaluation callback: periodically evaluate and save the best model
eval_callback = EvalCallback(
eval_env,
best_model_save_path="./logs/best_model/",
log_path="./logs/eval/",
eval_freq=50_000, # Evaluate every 50k steps (counted per environment)
n_eval_episodes=10, # 10 episodes per evaluation
deterministic=True,
)
# Checkpoint callback: periodically save the model
checkpoint_callback = CheckpointCallback(
save_freq=100_000,
save_path="./logs/checkpoints/",
name_prefix="ppo_breakout",
)
# Pass callbacks during training
model.learn(
total_timesteps=10_000_000,
callback=[eval_callback, checkpoint_callback],
)
Logging
SB3 automatically logs the following to TensorBoard:
- rollout/ep_rew_mean: Mean episode reward
- rollout/ep_len_mean: Mean episode length
- train/loss: Total loss
- train/policy_gradient_loss: Policy gradient loss
- train/value_loss: Value function loss
- train/entropy_loss: Policy entropy
TensorBoard Monitoring
Launching TensorBoard
tensorboard --logdir=./tb_logs/
# Then open http://localhost:6006
Rollout Metrics
| Metric | Meaning | Expected Trend |
|---|---|---|
| ep_rew_mean | Mean episode reward | Should increase steadily |
| ep_len_mean | Mean episode length | Depends on game (should increase for Breakout) |
Train Metrics
| Metric | Meaning | Expected Trend |
|---|---|---|
| policy_gradient_loss | Policy gradient loss | Fluctuates but roughly converges |
| value_loss | Value function loss | Rises then falls |
| entropy_loss | Policy entropy | Gradually decreases (less exploration) |
| approx_kl | PPO approximate KL divergence | Should stay small (< 0.05) |
| clip_fraction | Fraction of clipped updates | Starts large, shrinks over time |
| explained_variance | Value function explained variance | Should approach 1.0 |
Learning Curve Analysis
- Normal curve: Slow rise -> rapid improvement -> plateau
- Overfitting: Training reward rises but evaluation reward drops
- Non-convergence: Reward fluctuates at a low level for a long time -> check hyperparameters or preprocessing
Model Evaluation
Deterministic vs. Stochastic Evaluation
from stable_baselines3.common.evaluation import evaluate_policy
# Deterministic evaluation (selects the most probable action)
mean_reward, std_reward = evaluate_policy(
model, eval_env, n_eval_episodes=20, deterministic=True
)
print(f"Deterministic evaluation: {mean_reward:.1f} +/- {std_reward:.1f}")
# Stochastic evaluation (samples actions from the policy distribution)
mean_reward, std_reward = evaluate_policy(
model, eval_env, n_eval_episodes=20, deterministic=False
)
print(f"Stochastic evaluation: {mean_reward:.1f} +/- {std_reward:.1f}")
- Deterministic evaluation: More stable, suitable for measuring final performance
- Stochastic evaluation: Closer to training behavior, rewards are typically slightly lower
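The difference can be illustrated with a toy action distribution (the probabilities below are made up for illustration, not real policy output):

```python
import numpy as np

rng = np.random.default_rng(0)
action_probs = np.array([0.1, 0.6, 0.2, 0.1])  # hypothetical policy output

# Deterministic: always pick the most probable action
deterministic_action = int(np.argmax(action_probs))  # always 1

# Stochastic: sample from the distribution, so the action varies
stochastic_action = int(rng.choice(4, p=action_probs))
```

Stochastic evaluation occasionally picks low-probability actions, which is why its mean reward is typically slightly lower but closer to what the agent experienced during training.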
Recording Video
from stable_baselines3.common.vec_env import VecVideoRecorder
# Wrap the environment for video recording
eval_env = VecVideoRecorder(
eval_env,
video_folder="./videos/",
record_video_trigger=lambda x: x == 0, # Start recording at step 0
video_length=2000,
)
obs = eval_env.reset()
for _ in range(2000):
action, _ = model.predict(obs, deterministic=True)
obs, reward, done, info = eval_env.step(action)
eval_env.close()
Evaluation Commands Summary
# 1. View TensorBoard
tensorboard --logdir=./tb_logs/
# 2. Test the best model (deterministic)
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --episodes 20 --deterministic
# 3. Test the best model (stochastic)
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --episodes 20
# 4. Compare the final model
python visualize_agent.py --model-path ./ppo_breakout.zip --episodes 20 --deterministic
# 5. Record the best performance
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --record --deterministic
Common Issues and Debugging
Issue: Training Reward Not Increasing
Possible causes and solutions:
| Cause | Solution |
|---|---|
| Learning rate too high | Lower learning_rate (try 1e-4) |
| Incorrect preprocessing | Confirm you are using make_atari_env |
| Missing frame stacking | Check VecFrameStack(n_stack=4) |
| Insufficient training steps | Atari typically requires 5M-50M steps |
| Too few parallel environments | Increase n_envs (recommended 8-16) |
Issue: Out of Memory (OOM)
# Reduce parallel environments
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=4) # Reduce from 8 to 4
# Reduce batch_size
model = PPO("CnnPolicy", env, batch_size=128) # Reduce from 256 to 128
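For intuition about where the memory goes, here is a rough estimate of the rollout buffer's observation storage, assuming float32 storage (SB3's RolloutBuffer default) and the default configuration above:

```python
n_envs, n_steps = 8, 128
obs_shape = (4, 84, 84)
bytes_per_value = 4  # float32

n_values = n_envs * n_steps
for d in obs_shape:
    n_values *= d
buffer_mb = n_values * bytes_per_value / 1024**2
print(f"{buffer_mb:.2f} MB")  # ~110 MB for observations alone
```

Halving n_envs or n_steps halves this figure, which is why reducing parallel environments is the first lever to pull.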
Issue: High Variance in Evaluation Performance
- Increase the number of evaluation episodes (n_eval_episodes=30)
- Use deterministic evaluation (deterministic=True)
- Some games inherently have high variance (e.g., Montezuma's Revenge)
Issue: Large Gap Between Training and Evaluation Rewards
- Training uses clipped rewards ({-1, 0, +1}), while evaluation uses original rewards -- the scales are inherently different
- Ensure the evaluation environment does not use ClipRewardEnv
Recommended Debugging Workflow
- Start with Pong: Pong is the simplest Atari game; you should see a learning signal within 100k steps
- Check TensorBoard: Confirm that ep_rew_mean shows an upward trend
- Check explained_variance: If it stays near 0 or goes negative, the value network is not learning
- Check approx_kl: If too large (> 0.1), the update step size is too large
- Visualize: Load the model and watch its actual behavior -- numbers do not always tell the full story
References
- Stable Baselines3 documentation: stable-baselines3.readthedocs.io
- rl-baselines3-zoo (pretrained models and hyperparameters): github.com/DLR-RM/rl-baselines3-zoo
- Mnih et al., Playing Atari with Deep Reinforcement Learning, 2013
- Mnih et al., Human-level control through deep reinforcement learning, Nature, 2015