
Introduction to Stable Baselines3 (Using Atari as an Example)

We start with Atari to explore how to design a reinforcement learning agent and how to train it with Stable Baselines3 -- a library that implements the underlying RL algorithms so we can focus on the problem setup. Since the main applications of reinforcement learning lie in robotics, the RL training materials are placed in the robotics notebook section.

For theoretical background on the underlying computations in reinforcement learning, please refer to the reinforcement learning notes in the AI notebook.


Atari Games

Atari 2600

The Atari 2600 is a home video game console released in 1977, featuring hundreds of classic games (Breakout, Pong, Space Invaders, Enduro, etc.). These games share several characteristics:

  • Input: Pixel-based screens (typically 210x160 RGB)
  • Output: Discrete actions (up, down, left, right, fire, etc., typically 4-18 actions)
  • Reward: Game score
  • Wide difficulty range: From simple Pong to extremely difficult Montezuma's Revenge

ALE (Arcade Learning Environment)

ALE is the standardized interface for Atari game research, proposed by Bellemare et al. (2013). It provides:

  • Unified game ROM loading and simulation
  • Standardized observation and action interfaces
  • A benchmark suite of 57 Atari games

In Gymnasium (the maintained successor to OpenAI Gym), Atari environments are provided by the ale-py package:

import gymnasium as gym
# Depending on your gymnasium / ale-py versions, the ALE environments may need to
# be registered explicitly first: import ale_py; gym.register_envs(ale_py)

env = gym.make("BreakoutNoFrameskip-v4")
obs, info = env.reset()
print(f"Observation space: {env.observation_space.shape}")  # (210, 160, 3)
print(f"Action space: {env.action_space.n}")                # 4

Why Is Atari the Classic RL Benchmark?

  • Birth of DQN: Mnih et al. (2013, 2015) first demonstrated the power of deep RL on Atari
  • Pixels to decisions: Learning policies directly from raw pixels without hand-crafted features
  • Standardized evaluation: A unified evaluation protocol widely used by the research community

SB3 Quick Start

Installation

# Base installation
pip install stable-baselines3[extra]

# Atari environments
pip install gymnasium[atari]
pip install ale-py

# Atari ROMs (license agreement required); depending on your ale-py version,
# either download them via AutoROM ...
pip install autorom
AutoROM --accept-license
# ... or import ROMs you already have from a local directory
ale-import-roms <path-to-roms>

Three Steps: Create Environment -> Train Model -> Evaluate

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Step 1: Create environment (automatically applies Atari preprocessing)
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=42)
env = VecFrameStack(env, n_stack=4)

# Step 2: Train model
model = PPO("CnnPolicy", env, verbose=1, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=10_000_000)

# Step 3: Save and evaluate
model.save("ppo_breakout")

make_atari_env automatically wraps each environment with the preprocessing steps described below; the one exception is frame stacking, which is applied separately at the vectorized-environment level via VecFrameStack, as shown above.
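
After training, the saved model can be reloaded and evaluated on a freshly created environment. A minimal sketch -- the evaluation environment must use the same preprocessing as training; disabling reward clipping here (an AtariWrapper option exposed through wrapper_kwargs) makes the reported scores raw game scores:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack

# Recreate the environment with the same preprocessing used for training,
# but report unclipped rewards for evaluation
eval_env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0,
                          wrapper_kwargs={"clip_reward": False})
eval_env = VecFrameStack(eval_env, n_stack=4)

# Load the saved weights and evaluate
model = PPO.load("ppo_breakout", env=eval_env)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")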


Environment Preprocessing

The raw Atari observation is a 210x160 RGB image, which is inefficient to use directly. The standard preprocessing pipeline includes:

Frame Skipping

NoFrameskip-v4 environment + MaxAndSkipEnv(skip=4)
  • Each action is repeated for 4 frames; returns the pixel-wise maximum of the last 2 frames
  • Purpose: Reduce computation and mitigate Atari's sprite flickering issue
  • Effect: Decision frequency drops from 60Hz to 15Hz

Grayscale Conversion and Resizing

# 210x160 RGB -> 84x84 Grayscale
WarpFrame(width=84, height=84)
  • Converts the color image to grayscale and resizes to 84x84
  • Purpose: Reduce input dimensionality; color information is generally unimportant in Atari games

Frame Stacking

VecFrameStack(env, n_stack=4)
  • Stacks 4 consecutive frames into one observation; SB3's VecFrameStack stacks along the channel axis, giving shape (84, 84, 4), which SB3 transposes to channels-first (4, 84, 84) internally before feeding the CNN
  • Purpose: Enables the network to perceive motion information (velocity, direction), addressing the partial observability of single-frame observations

Reward Clipping

ClipRewardEnv  # Clips rewards to {-1, 0, +1}
  • Purpose: Score scales vary enormously across games (Pong: -1/+1 per point, Breakout: 1-7 points per brick); clipping lets one set of hyperparameters work across games
  • Note: Use original rewards for evaluation, clipped rewards for training

Episodic Life Mechanism

EpisodicLifeEnv  # Treats each life loss as an episode termination
  • Purpose: Accelerates early-stage learning by letting the agent experience "failure" signals more quickly

Complete Preprocessing Pipeline

make_atari_env applies the wrappers below automatically; frame stacking is added afterwards with VecFrameStack:

Raw Atari environment (210x160x3)
-> NoopResetEnv (random no-op start)
-> MaxAndSkipEnv (frame skipping)
-> EpisodicLifeEnv (life as episode)
-> FireResetEnv (auto-press FIRE to start)
-> WarpFrame (grayscale + 84x84)
-> ClipRewardEnv (reward clipping)
-> VecFrameStack (stack 4 frames)
-> Final observation: 4 stacked 84x84 grayscale frames
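
Under the hood, make_atari_env wraps each environment in SB3's AtariWrapper (which bundles the individual wrappers above) and vectorizes the result. An equivalent manual construction, shown only as a sketch of what the helper does:

import gymnasium as gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def make_env():
    env = gym.make("BreakoutNoFrameskip-v4")
    # AtariWrapper bundles NoopResetEnv, MaxAndSkipEnv, EpisodicLifeEnv,
    # FireResetEnv, WarpFrame and ClipRewardEnv
    return AtariWrapper(env)

# 8 parallel copies, then frame stacking at the vectorized level
vec_env = DummyVecEnv([make_env for _ in range(8)])
vec_env = VecFrameStack(vec_env, n_stack=4)
print(vec_env.observation_space.shape)  # (84, 84, 4)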

Training Configuration

Hyperparameter Settings

Recommended PPO hyperparameters for Atari (from rl-baselines3-zoo):

model = PPO(
    "CnnPolicy",
    env,
    learning_rate=2.5e-4,
    n_steps=128,           # Steps collected per update (per env)
    batch_size=256,        # Mini-batch size
    n_epochs=4,            # Epochs per update
    gamma=0.99,            # Discount factor
    gae_lambda=0.95,       # GAE lambda
    clip_range=0.1,        # PPO clip range
    ent_coef=0.01,         # Entropy coefficient (encourages exploration)
    vf_coef=0.5,           # Value function loss coefficient
    max_grad_norm=0.5,     # Gradient clipping
    tensorboard_log="./tb_logs/",
    verbose=1,
)
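
Note that rl-baselines3-zoo actually decays learning_rate and clip_range linearly over training. In SB3 a schedule is simply a callable of the remaining progress (1.0 at the start of training, 0.0 at the end), so the decaying variant can be sketched as:

def linear_schedule(initial_value: float):
    """Return a schedule that decays linearly from initial_value to 0."""
    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end)
        return progress_remaining * initial_value
    return schedule

model = PPO(
    "CnnPolicy",
    env,
    learning_rate=linear_schedule(2.5e-4),
    clip_range=linear_schedule(0.1),
    # ... remaining hyperparameters as above
)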

Callbacks

from stable_baselines3.common.callbacks import (
    EvalCallback,
    CheckpointCallback,
)
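
# The eval_env passed to EvalCallback below is created separately from the
# training env -- a minimal sketch assuming the same preprocessing as training,
# but without reward clipping so that evaluation reports raw game scores:
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

eval_env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=123,
                          wrapper_kwargs={"clip_reward": False})
eval_env = VecFrameStack(eval_env, n_stack=4)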

# Evaluation callback: periodically evaluate and save the best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./logs/best_model/",
    log_path="./logs/eval/",
    eval_freq=50_000,      # Evaluate every 50k env.step() calls per env (total timesteps = 50k x n_envs)
    n_eval_episodes=10,    # 10 episodes per evaluation
    deterministic=True,
)

# Checkpoint callback: periodically save the model
checkpoint_callback = CheckpointCallback(
    save_freq=100_000,     # Save every 100k env.step() calls per env
    save_path="./logs/checkpoints/",
    name_prefix="ppo_breakout",
)

# Pass callbacks during training
model.learn(
    total_timesteps=10_000_000,
    callback=[eval_callback, checkpoint_callback],
)

Logging

SB3 automatically logs the following to TensorBoard:

  • rollout/ep_rew_mean: Mean episode reward
  • rollout/ep_len_mean: Mean episode length
  • train/loss: Total loss
  • train/policy_gradient_loss: Policy gradient loss
  • train/value_loss: Value function loss
  • train/entropy_loss: Negative policy entropy (approaches 0 as the policy becomes more deterministic)
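
If CSV or plain-text logs are wanted alongside TensorBoard, the logger can be configured explicitly; a small sketch using SB3's logger utilities (the output folder name here is arbitrary):

from stable_baselines3.common.logger import configure

# Write logs to stdout, a CSV file and TensorBoard under the given folder
new_logger = configure("./tb_logs/ppo_breakout_run/", ["stdout", "csv", "tensorboard"])
model.set_logger(new_logger)
model.learn(total_timesteps=1_000_000)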

TensorBoard Monitoring

Launching TensorBoard

tensorboard --logdir=./tb_logs/
# Then open http://localhost:6006

Rollout Metrics

  • ep_rew_mean: Mean episode reward -- should increase steadily
  • ep_len_mean: Mean episode length -- depends on the game (should increase for Breakout)

Train Metrics

  • policy_gradient_loss: Policy gradient loss -- fluctuates but roughly converges
  • value_loss: Value function loss -- typically rises, then falls
  • entropy_loss: Negative policy entropy -- rises toward 0 as exploration decreases
  • approx_kl: PPO approximate KL divergence -- should stay small (< 0.05)
  • clip_fraction: Fraction of samples whose probability ratio was clipped -- starts large, shrinks over time
  • explained_variance: Value function explained variance -- should approach 1.0

Learning Curve Analysis

  • Normal curve: Slow rise -> rapid improvement -> plateau
  • Overfitting: Training reward rises but evaluation reward drops
  • Non-convergence: Reward fluctuates at a low level for a long time -> check hyperparameters or preprocessing

Model Evaluation

Deterministic vs. Stochastic Evaluation

from stable_baselines3.common.evaluation import evaluate_policy

# Deterministic evaluation (selects the most probable action)
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=20, deterministic=True
)
print(f"Deterministic evaluation: {mean_reward:.1f} +/- {std_reward:.1f}")

# Stochastic evaluation (samples actions from the policy distribution)
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=20, deterministic=False
)
print(f"Stochastic evaluation: {mean_reward:.1f} +/- {std_reward:.1f}")
  • Deterministic evaluation: More stable, suitable for measuring final performance
  • Stochastic evaluation: Closer to training behavior, rewards are typically slightly lower

Recording Video

from stable_baselines3.common.vec_env import VecVideoRecorder

# Wrap the environment for video recording (the underlying env must be created
# with render_mode="rgb_array", e.g. via env_kwargs in make_atari_env)
eval_env = VecVideoRecorder(
    eval_env,
    video_folder="./videos/",
    record_video_trigger=lambda x: x == 0,  # Start recording at step 0
    video_length=2000,                      # Record 2000 steps
)

obs = eval_env.reset()
for _ in range(2000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(action)
eval_env.close()

Evaluation Commands Summary

# 1. View TensorBoard
tensorboard --logdir=./tb_logs/

# 2. Test the best model (deterministic)
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --episodes 20 --deterministic

# 3. Test the best model (stochastic)
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --episodes 20

# 4. Compare the final model
python visualize_agent.py --model-path ./ppo_breakout.zip --episodes 20 --deterministic

# 5. Record the best performance
python visualize_agent.py --model-path ./logs/best_model/best_model.zip --record --deterministic
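
visualize_agent.py is the accompanying project's script and its exact contents are not reproduced here; a hypothetical minimal version that would support the flags used above could look like this (everything beyond the flag names is an assumption):

import argparse

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack, VecVideoRecorder

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", required=True)
    parser.add_argument("--episodes", type=int, default=10)
    parser.add_argument("--deterministic", action="store_true")
    parser.add_argument("--record", action="store_true")
    args = parser.parse_args()

    # Rebuild the evaluation environment with the training-time preprocessing,
    # reporting unclipped rewards and exposing rgb_array frames for recording
    env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0,
                         env_kwargs={"render_mode": "rgb_array"},
                         wrapper_kwargs={"clip_reward": False})
    env = VecFrameStack(env, n_stack=4)
    if args.record:
        env = VecVideoRecorder(env, video_folder="./videos/",
                               record_video_trigger=lambda step: step == 0,
                               video_length=2000)

    model = PPO.load(args.model_path, env=env)
    mean_reward, std_reward = evaluate_policy(model, env,
                                              n_eval_episodes=args.episodes,
                                              deterministic=args.deterministic)
    print(f"Reward over {args.episodes} episodes: {mean_reward:.1f} +/- {std_reward:.1f}")
    env.close()

if __name__ == "__main__":
    main()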

Common Issues and Debugging

Issue: Training Reward Not Increasing

Possible causes and solutions:

  • Learning rate too high: lower learning_rate (try 1e-4)
  • Incorrect preprocessing: confirm you are using make_atari_env
  • Missing frame stacking: check VecFrameStack(n_stack=4)
  • Insufficient training steps: Atari typically requires 5M-50M steps
  • Too few parallel environments: increase n_envs (8-16 recommended)

Issue: Out of Memory (OOM)

# Reduce parallel environments
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=4)  # Reduce from 8 to 4

# Reduce batch_size
model = PPO("CnnPolicy", env, batch_size=128)  # Reduce from 256 to 128

Issue: High Variance in Evaluation Performance

  • Increase the number of evaluation episodes (n_eval_episodes=30)
  • Use deterministic evaluation (deterministic=True)
  • Some games inherently have high variance (e.g., Montezuma's Revenge)

Issue: Large Gap Between Training and Evaluation Rewards

  • Training uses clipped rewards ({-1, 0, +1}), while evaluation uses original rewards -- the scales are inherently different
  • Ensure the evaluation environment does not use ClipRewardEnv

General Debugging Suggestions

  1. Start with Pong: Pong is the simplest Atari game; you should see a learning signal within 100k steps
  2. Check TensorBoard: Confirm that ep_rew_mean shows an upward trend
  3. Check explained_variance: If it stays near 0 or goes negative, the value network is not learning
  4. Check approx_kl: If too large (> 0.1), the update step size is too large
  5. Visualize: Load the model and watch its actual behavior -- numbers do not always tell the full story
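
For that last point, a minimal way to watch the trained agent, assuming a best model was saved by the EvalCallback above (the human render mode requires a display):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Recreate the preprocessed environment with an on-screen render window
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=0,
                     env_kwargs={"render_mode": "human"})
env = VecFrameStack(env, n_stack=4)

model = PPO.load("./logs/best_model/best_model", env=env)
obs = env.reset()
for _ in range(5_000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
env.close()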

References

  • Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47, 253-279.
  • Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602.
  • Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  • Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 1-8.