
Introduction to Stable Baselines3 (Using Atari as an Example)

We start with Atari to explore how to design a reinforcement learning agent and how to train it with Stable Baselines3 -- a library that implements the underlying RL algorithms so we can focus on the problem setup. Since the main applications of reinforcement learning lie in robotics, the RL training materials are placed in the robotics notebook section.

For theoretical background on the underlying computations in reinforcement learning, please refer to the reinforcement learning notes in the AI notebook.


Atari Games

Atari 2600

The Atari 2600 is a home video game console released in 1977, featuring hundreds of classic games (Breakout, Pong, Space Invaders, Enduro, etc.). These games share several characteristics:

  • Input: Pixel-based screens (typically 210x160 RGB)
  • Output: Discrete actions (up, down, left, right, fire, etc., typically 4-18 actions)
  • Reward: Game score
  • Wide difficulty range: From simple Pong to extremely difficult Montezuma's Revenge

ALE (Arcade Learning Environment)

ALE is the standardized interface for Atari game research, proposed by Bellemare et al. (2013). It provides:

  • Unified game ROM loading and simulation
  • Standardized observation and action interfaces
  • A benchmark suite of 57 Atari games

In Gymnasium (the maintained successor to OpenAI Gym), Atari environments are provided by the ale-py package:

import gymnasium as gym
# Depending on your gymnasium / ale-py versions, the ALE environments may need to
# be registered explicitly first: import ale_py; gym.register_envs(ale_py)

env = gym.make("BreakoutNoFrameskip-v4")
obs, info = env.reset()
print(f"Observation space: {env.observation_space.shape}")  # (210, 160, 3)
print(f"Action space: {env.action_space.n}")                # 4

Why Is Atari the Classic RL Benchmark?

  • Birth of DQN: Mnih et al. (2013, 2015) first demonstrated the power of deep RL on Atari
  • Pixels to decisions: Learning policies directly from raw pixels without hand-crafted features
  • Standardized evaluation: A unified evaluation protocol widely used by the research community

SB3 Quick Start

Installation

# Base installation
pip install stable-baselines3[extra]

# Atari environments
pip install gymnasium[atari]
pip install ale-py

# Atari ROMs (license agreement required); depending on your ale-py version,
# either download them via AutoROM ...
pip install autorom
AutoROM --accept-license
# ... or import ROMs you already have from a local directory
ale-import-roms <path-to-roms>

Three Steps: Create Environment -> Train Model -> Evaluate

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Step 1: Create environment (automatically applies Atari preprocessing)
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=42)
env = VecFrameStack(env, n_stack=4)

# Step 2: Train model
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=10_000_000)

# Step 3: Save and evaluate
model.save("ppo_breakout")

make_atari_env automatically wraps each environment with the preprocessing steps described below; the one exception is frame stacking, which is applied separately at the vectorized-environment level via VecFrameStack, as shown above.
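
After training, the saved model can be reloaded and evaluated on a freshly created environment. A minimal sketch -- the evaluation environment must use the same preprocessing as training; disabling reward clipping here (an AtariWrapper option exposed through wrapper_kwargs) makes the reported scores raw game scores:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack

# Recreate the environment with the same preprocessing used for training,
# but report unclipped rewards for evaluation
eval_env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0,
                          wrapper_kwargs={"clip_reward": False})
eval_env = VecFrameStack(eval_env, n_stack=4)

# Load the saved weights and evaluate
model = PPO.load("ppo_breakout", env=eval_env)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")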


Environment Preprocessing

The raw Atari observation is a 210x160 RGB image, which is inefficient to use directly. The standard preprocessing pipeline includes:

Frame Skipping

NoFrameskip-v4 environment + MaxAndSkipEnv(skip=4)
  • Each action is repeated for 4 frames; returns the pixel-wise maximum of the last 2 frames
  • Purpose: Reduce computation and mitigate Atari's sprite flickering issue
  • Effect: Decision frequency drops from 60Hz to 15Hz

Grayscale Conversion and Resizing

# 210x160 RGB -> 84x84 Grayscale
WarpFrame(width=84, height=84)
  • Converts the color image to grayscale and resizes to 84x84
  • Purpose: Reduce input dimensionality; color information is generally unimportant in Atari games

Frame Stacking

VecFrameStack(env, n_stack=4)
  • Stacks 4 consecutive frames into one observation; SB3's VecFrameStack stacks along the channel axis, giving shape (84, 84, 4), which SB3 transposes to channels-first (4, 84, 84) internally before feeding the CNN
  • Purpose: Enables the network to perceive motion information (velocity, direction), addressing the partial observability of single-frame observations

Reward Clipping

ClipRewardEnv  # Clips rewards to {-1, 0, +1}
  • Purpose: Score scales vary enormously across games (Pong: -1/+1 per point, Breakout: 1-7 points per brick); clipping lets one set of hyperparameters work across games
  • Note: Use original rewards for evaluation, clipped rewards for training

Episodic Life Mechanism

EpisodicLifeEnv  # Treats each life loss as an episode termination
  • Purpose: Accelerates early-stage learning by letting the agent experience "failure" signals more quickly

Complete Preprocessing Pipeline

make_atari_env applies the wrappers below automatically; frame stacking is added afterwards with VecFrameStack:

Raw Atari environment (210x160x3)
-> NoopResetEnv (random no-op start)
-> MaxAndSkipEnv (frame skipping)
-> EpisodicLifeEnv (life as episode)
-> FireResetEnv (auto-press FIRE to start)
-> WarpFrame (grayscale + 84x84)
-> ClipRewardEnv (reward clipping)
-> VecFrameStack (stack 4 frames)
-> Final observation: 4 stacked 84x84 grayscale frames
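
Under the hood, make_atari_env wraps each environment in SB3's AtariWrapper (which bundles the individual wrappers above) and vectorizes the result. An equivalent manual construction, shown only as a sketch of what the helper does:

import gymnasium as gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def make_env():
    env = gym.make("BreakoutNoFrameskip-v4")
    # AtariWrapper bundles NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv,
    # FireResetEnv, WarpFrame and ClipRewardEnv
    return AtariWrapper(env)

# 8 parallel copies, then frame stacking at the vectorized level
vec_env = DummyVecEnv([make_env for _ in range(8)])
vec_env = VecFrameStack(vec_env, n_stack=4)
print(vec_env.observation_space.shape)  # (84, 84, 4)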

Training Configuration

Hyperparameter Settings

Recommended PPO hyperparameters for Atari (from rl-baselines3-zoo):

model = PPO(
    "CnnPolicy",
    env,
    learning_rate=2.5e-4,
    n_steps=128,           # Steps collected per update (per env)
    batch_size=256,        # Mini-batch size
    n_epochs=4,            # Epochs per update
    gamma=0.99,            # Discount factor
    gae_lambda=0.95,       # GAE lambda
    clip_range=0.1,        # PPO clip range
    ent_coef=0.01,         # Entropy coefficient (encourages exploration)
    vf_coef=0.5,           # Value function loss coefficient
    max_grad_norm=0.5,     # Gradient clipping
    tensorboard_log="./tb_logs/",
    verbose=1,
)
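
Note that rl-baselines3-zoo actually decays learning_rate and clip_range linearly over training. In SB3 a schedule is simply a callable of the remaining progress (1.0 at the start of training, 0.0 at the end), so the decaying variant can be sketched as:

def linear_schedule(initial_value: float):
    """Return a schedule that decays linearly from initial_value to 0."""
    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end)
        return progress_remaining * initial_value
    return schedule

model = PPO(
    "CnnPolicy",
    env,
    learning_rate=linear_schedule(2.5e-4),
    clip_range=linear_schedule(0.1),
    # ... remaining hyperparameters as above
)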

Callbacks

from stable_baselines3.common.callbacks import (
    EvalCallback,
    CheckpointCallback,
)
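
# The eval_env passed to EvalCallback below is created separately from the
# training env -- a minimal sketch assuming the same preprocessing as training,
# but without reward clipping so that evaluation reports raw game scores:
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

eval_env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=123,
                          wrapper_kwargs={"clip_reward": False})
eval_env = VecFrameStack(eval_env, n_stack=4)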

# Evaluation callback: periodically evaluate and save the best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model/",
    log_path="./logs/eval/",
    eval_freq=50_000,      # Evaluate every 50k env.step() calls per env (total timesteps = 50k x n_envs)
    n_eval_episodes=10,    # 10 episodes per evaluation
    deterministic=True,
)

# Checkpoint callback: periodically save the model
checkpoint_callback = CheckpointCallback(
    save_freq=100_000,     # Save every 100k env.step() calls per env
    save_path="./logs/checkpoints/",
    name_prefix="ppo_breakout",
)

# Pass callbacks during training
model.learn(
    total_timesteps=10_000_000,
    callback=[eval_callback, checkpoint_callback],
)

Logging

SB3 automatically logs the following to TensorBoard:

  • rollout/ep_rew_mean: Mean episode reward
  • rollout/ep_len_mean: Mean episode length
  • train/loss: Total loss
  • train/policy_gradient_loss: Policy gradient loss
  • train/value_loss: Value function loss
  • train/entropy_loss: Negative policy entropy (approaches 0 as the policy becomes more deterministic)
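
If CSV or plain-text logs are wanted alongside TensorBoard, the logger can be configured explicitly; a small sketch using SB3's logger utilities (the output folder name here is arbitrary):

from stable_baselines3.common.logger import configure

# Write logs to stdout, a CSV file and TensorBoard under the given folder
new_logger = configure("./tb_logs/ppo_breakout_run/", ["stdout", "csv", "tensorboard"])
model.set_logger(new_logger)
model.learn(total_timesteps=1_000_000)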

TensorBoard Monitoring

Launching TensorBoard

tensorboard --logdir=./tb_logs/
# Then open http://localhost:6006

Rollout Metrics

  • ep_rew_mean: Mean episode reward -- should increase steadily
  • ep_len_mean: Mean episode length -- depends on the game (should increase for Breakout)

Train Metrics

  • policy_gradient_loss: Policy gradient loss -- fluctuates but roughly converges
  • value_loss: Value function loss -- typically rises, then falls
  • entropy_loss: Negative policy entropy -- rises toward 0 as exploration decreases
  • approx_kl: PPO approximate KL divergence -- should stay small (< 0.05)
  • clip_fraction: Fraction of samples whose probability ratio was clipped -- starts large, shrinks over time
  • explained_variance: Value function explained variance -- should approach 1.0

Learning Curve Analysis

  • Normal curve: Slow rise -> rapid improvement -> plateau
  • Overfitting: Training reward rises but evaluation reward drops
  • Non-convergence: Reward fluctuates at a low level for a long time -> check hyperparameters or preprocessing

Model Evaluation

Deterministic vs. Stochastic Evaluation

from stable_baselines3.common.evaluation import evaluate_policy

# Deterministic evaluation (selects the most probable action)
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=20, deterministic=True
)
print(f"Deterministic evaluation: {mean_reward:.1f} +/- {std_reward:.1f}")

# Stochastic evaluation (samples actions from the policy distribution)
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=20, deterministic=False
)
print(f"Stochastic evaluation: {mean_reward:.1f} +/- {std_reward:.1f}")
  • Deterministic evaluation: More stable, suitable for measuring final performance
  • Stochastic evaluation: Closer to training behavior, rewards are typically slightly lower

Recording Video

from stable_baselines3.common.vec_env import VecVideoRecorder

# Wrap the environment for video recording (the underlying env must be created
# with render_mode="rgb_array", e.g. via env_kwargs in make_atari_env)
eval_env = VecVideoRecorder(
    eval_env,
    video_folder="./videos/",
    record_video_trigger=lambda x: x == 0,  # Start recording at step 0
    video_length=2000,                      # Record 2000 steps
)

obs = eval_env.reset()
for _ in range(2000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(action)
eval_env.close()

Evaluation Commands Summary

# 1. View TensorBoard
tensorboard --logdir=./tb_logs/

# 2. Test the best model (deterministic)
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --episodes 20 --deterministic

# 3. Test the best model (stochastic)
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --episodes 20

# 4. Compare the final model
python visualize_agent.py --model-path ./ppo_breakout.zip --episodes 20 --deterministic

# 5. Record the best performance
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --record --deterministic
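
visualize_agent.py is the accompanying project's script and its exact contents are not reproduced here; a hypothetical minimal version that would support the flags used above could look like this (everything beyond the flag names is an assumption):

import argparse

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack, VecVideoRecorder

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", required=True)
    parser.add_argument("--episodes", type=int, default=10)
    parser.add_argument("--deterministic", action="store_true")
    parser.add_argument("--record", action="store_true")
    args = parser.parse_args()

    # Rebuild the evaluation environment with the training-time preprocessing,
    # reporting unclipped rewards and exposing rgb_array frames for recording
    env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0,
                         env_kwargs={"render_mode": "rgb_array"},
                         wrapper_kwargs={"clip_reward": False})
    env = VecFrameStack(env, n_stack=4)
    if args.record:
        env = VecVideoRecorder(env, video_folder="./videos/",
                               record_video_trigger=lambda step: step == 0,
                               video_length=2000)

    model = PPO.load(args.model_path, env=env)
    mean_reward, std_reward = evaluate_policy(model, env,
                                              n_eval_episodes=args.episodes,
                                              deterministic=args.deterministic)
    print(f"Reward over {args.episodes} episodes: {mean_reward:.1f} +/- {std_reward:.1f}")
    env.close()

if __name__ == "__main__":
    main()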

Common Issues and Debugging

Issue: Training Reward Not Increasing

Possible causes and solutions:

  • Learning rate too high: lower learning_rate (try 1e-4)
  • Incorrect preprocessing: confirm you are using make_atari_env
  • Missing frame stacking: check VecFrameStack(n_stack=4)
  • Insufficient training steps: Atari typically requires 5M-50M steps
  • Too few parallel environments: increase n_envs (8-16 recommended)

Issue: Out of Memory (OOM)

# Reduce parallel environments
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=4)  # Reduce from 8 to 4

# Reduce batch_size
model = PPO("CnnPolicy", env, batch_size=128)  # Reduce from 256 to 128

Issue: High Variance in Evaluation Performance

  • Increase the number of evaluation episodes (n_eval_episodes=30)
  • Use deterministic evaluation (deterministic=True)
  • Some games inherently have high variance (e.g., Montezuma's Revenge)

Issue: Large Gap Between Training and Evaluation Rewards

  • Training uses clipped rewards ({-1, 0, +1}), while evaluation uses original rewards -- the scales are inherently different
  • Ensure the evaluation environment does not use ClipRewardEnv

General Debugging Suggestions

  1. Start with Pong: Pong is the simplest Atari game; you should see a learning signal within 100k steps
  2. Check TensorBoard: Confirm that ep_rew_mean shows an upward trend
  3. Check explained_variance: If it stays near 0 or goes negative, the value network is not learning
  4. Check approx_kl: If too large (> 0.1), the update step size is too large
  5. Visualize: Load the model and watch its actual behavior -- numbers do not always tell the full story
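
For that last point, a minimal way to watch the trained agent, assuming a best model was saved by the EvalCallback above (the human render mode requires a display):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Recreate the preprocessed environment with an on-screen render window
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0,
                     env_kwargs={"render_mode": "human"})
env = VecFrameStack(env, n_stack=4)

model = PPO.load("./logs/best_model/best_model", env=env)
obs = env.reset()
for _ in range(5_000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
env.close()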

References

  • Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47, 253-279.
  • Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602.
  • Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  • Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 1-8.