
RL Engineering in Practice

The theoretical formulations of reinforcement learning algorithms are often elegant and concise, but the gap between equations and runnable, reproducible code is filled with engineering details. These details are typically glossed over in papers, yet they can have an enormous impact on final performance. This note systematically covers the most essential practical knowledge in RL engineering: from Buffer design for data collection, to Normalization tricks for training stability, to distributed sampling architectures, to logging, evaluation, and hyperparameter tuning guidelines.


The Rollout Mechanism

What Is a Rollout

In on-policy methods (A2C, PPO, etc.), a Rollout refers to the process of interacting with the environment using the current policy \(\pi_\theta\) to collect a batch of experience data. After each rollout, the data is used to update the policy and then discarded (or, in PPO, reused for K epochs before being discarded).

The amount of data in a single rollout is determined by two parameters:

\[ \text{Rollout Size} = N_{\text{envs}} \times T_{\text{horizon}} \]

where \(N_{\text{envs}}\) is the number of parallel environments and \(T_{\text{horizon}}\) is the number of steps collected per environment.

Rollout Buffer vs Replay Buffer

These two are the data storage mechanisms for on-policy and off-policy methods respectively, and they are fundamentally different:

| Dimension | Rollout Buffer | Replay Buffer |
|---|---|---|
| Algorithms | A2C, PPO (on-policy) | DQN, SAC (off-policy) |
| Data source | Current policy \(\pi_\theta\) | Mixture of historical policies |
| Storage | Fixed size, cleared after each rollout | Circular queue, continuously stored |
| Data reuse | Discarded after use (or K epochs) | Repeatedly sampled, highly reused |
| Typical size | \(N \times T\) (thousands to tens of thousands) | \(10^5 \sim 10^6\) transitions |
| Sampling | Sequential traversal (shuffled into mini-batches) | Uniform random sampling |

Data Structure of a Rollout Buffer

A typical Rollout Buffer stores the following fields:

import numpy as np
from dataclasses import dataclass

@dataclass
class RolloutBuffer:
    states:     np.ndarray  # (N*T, obs_dim)   observed states
    actions:    np.ndarray  # (N*T, act_dim)   actions taken
    rewards:    np.ndarray  # (N*T,)           immediate rewards
    dones:      np.ndarray  # (N*T,)           whether the episode terminated at this step
    log_probs:  np.ndarray  # (N*T,)           log π_old(a|s)
    values:     np.ndarray  # (N*T,)           V_old(s)
    advantages: np.ndarray  # (N*T,)           GAE advantage estimates (computed afterward)
    returns:    np.ndarray  # (N*T,)           target returns (computed afterward)

Here, log_probs and values are computed with the old parameters during data collection and serve as "reference values" in subsequent policy updates. advantages and returns are computed via backward recursion using GAE after the rollout is complete.
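A minimal sketch of that backward recursion (function and variable names are illustrative rather than taken from any particular library; inputs are laid out as (T, N) arrays for T steps across N parallel environments, and last_values holds the critic's estimate for each environment's final observation):

import numpy as np

def compute_gae(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    T, N = rewards.shape
    advantages = np.zeros((T, N), dtype=np.float32)
    gae = np.zeros(N, dtype=np.float32)
    next_values = last_values                          # V(s_T), used for bootstrapping
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]                      # cut propagation at episode boundaries
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_values = values[t]
    returns = advantages + values                      # value-function targets
    return advantages, returns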

Episode Truncation and Padding

In parallel environments, episodes in different environments have different lengths. When an episode ends in one environment, there are two ways to handle it:

Approach 1: Auto-reset

Most Gym/Gymnasium vectorized environments use this approach. When an episode ends in one environment, it is automatically reset and the initial state of the new episode is returned. Key considerations:

  • The next state \(s'\) after done=True belongs to a new episode and should not be used for bootstrapping
  • GAE computation must use (1 - done) to "cut off" propagation across episode boundaries
  • The true "terminal observation" must be recorded, since auto-reset causes the returned \(s'\) to already be the initial state of the new episode

Approach 2: Truncation

When an environment terminates due to reaching a maximum step limit (as opposed to a natural termination), this is called Truncation. Since the episode has not truly ended at truncation, bootstrapping with \(V(s_{\text{last}})\) is required:

\[ \delta_{T-1} = r_{T-1} + \gamma V(s_T) - V(s_{T-1}) \quad \text{(truncated, do not multiply by (1-done))} \]

In Gymnasium, truncated and terminated are separate signals and must be handled independently.
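One common way to wire this up in a collection loop is sketched below. Here policy, value_fn, and buffer are placeholders for your own components, and folding the critic's bootstrap value into the reward at truncation is one of several valid options:

def collect_step(env, obs, policy, value_fn, buffer, gamma=0.99):
    # Gymnasium's step() returns terminated and truncated as separate flags
    action = policy(obs)
    next_obs, reward, terminated, truncated, info = env.step(action)

    # Only a true termination should cut bootstrapping via (1 - done)
    done_for_gae = terminated

    # On truncation the episode has not really ended, so fold the critic's
    # estimate of the cut-off state back into the reward
    if truncated and not terminated:
        reward = reward + gamma * value_fn(next_obs)

    buffer.add(obs, action, reward, done_for_gae)
    if terminated or truncated:
        next_obs, _ = env.reset()   # start a new episode
    return next_obs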


Experience Replay (Replay Buffer)

Experience replay is a core component of off-policy methods. A major reason DQN can train stably is that experience replay breaks the temporal correlation in the data.

Basic Replay Buffer

Data structure: A fixed-capacity circular buffer that stores transitions \((s, a, r, s', \text{done})\).

Circular Buffer (capacity = C)

Write pointer →
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ t5│ t6│ t7│ t8│ t9│ t0│ t1│ t2│ t3│ t4│
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                          ↑
                     Oldest data (overwritten next)

Core operations:

  • Store: After each interaction step, write \((s, a, r, s', \text{done})\) to the current position and advance the pointer. When full, the oldest data is overwritten.
  • Sample: Uniformly at random sample a mini-batch (typically 32--256 transitions) from the buffer for training.
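A minimal circular-buffer implementation of these two operations might look like this (field layout and dtypes are illustrative):

import numpy as np

class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.capacity = capacity
        self.ptr, self.size = 0, 0
        self.obs      = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions  = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rewards  = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.dones    = np.zeros(capacity, dtype=np.float32)

    def store(self, s, a, r, s_next, done):
        i = self.ptr
        self.obs[i], self.actions[i] = s, a
        self.rewards[i], self.next_obs[i], self.dones[i] = r, s_next, done
        self.ptr = (self.ptr + 1) % self.capacity        # overwrite the oldest data when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.size, size=batch_size)   # uniform random sampling
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])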

The problem with uniform sampling: All transitions are sampled with equal probability, but not all data has equal "learning value." Some transitions contain a wealth of new information (e.g., encountering a rare state or making a large error), while others are "boring" common transitions. This motivates PER.

Prioritized Experience Replay (PER)

Core idea: Give transitions with higher "learning value" a higher probability of being sampled. "Learning value" is measured by the absolute TD error -- a large TD error means the current value estimate has a large prediction error for that transition, indicating more room to learn from it.

Priority definition:

\[ p_i = |\delta_i| + \epsilon \]

where \(\delta_i = r_i + \gamma \max_{a'} Q(s_i', a') - Q(s_i, a_i)\) is the TD error, and \(\epsilon > 0\) is a small constant that prevents the priority from being zero (which would mean the transition is never sampled).

Sampling probability:

\[ P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha} \]

\(\alpha \in [0, 1]\) controls the strength of prioritization: \(\alpha = 0\) degenerates to uniform sampling, while \(\alpha = 1\) samples fully according to priority.

Importance sampling correction: Priority sampling changes the data distribution, and without correction this introduces bias. PER uses importance sampling weights to correct the gradients:

\[ w_i = \left( \frac{1}{C} \cdot \frac{1}{P(i)} \right)^\beta \]

where \(C\) is the buffer capacity, and \(\beta \in [0, 1]\) controls the degree of correction. During training, \(\beta\) is linearly annealed from an initial value (e.g., 0.4) to 1.0, ensuring that bias is fully eliminated in the later stages of training.

In practice, the weights are normalized: \(w_i \leftarrow w_i / \max_j w_j\), to prevent excessively large weights from causing instability.

Implementation efficiency: A naive implementation requires \(O(C)\) time to compute sampling probabilities. In practice, a Sum Tree (a type of complete binary tree) data structure is used to reduce both sampling and update complexity to \(O(\log C)\).

Sum Tree structure (stores sums of priorities)

              [Total = 42]
             /            \
          [29]             [13]
         /    \           /    \
       [13]   [16]      [3]   [10]
       / \    / \       / \    / \
      [3][10][12][4]   [1][2] [8][2]    ← Leaf nodes = priority of each transition
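A compact sum-tree sketch along the lines of the diagram above (an array-backed complete binary tree; index 0 is the root and the last capacity entries are the leaves, one per buffer slot; names are illustrative):

import numpy as np

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)    # internal nodes followed by the leaves
        self.next_leaf = 0

    def update(self, leaf_idx, priority):
        tree_idx = leaf_idx + self.capacity - 1
        change = priority - self.tree[tree_idx]
        self.tree[tree_idx] = priority
        while tree_idx > 0:                       # propagate the change up to the root: O(log C)
            tree_idx = (tree_idx - 1) // 2
            self.tree[tree_idx] += change

    def add(self, priority):
        self.update(self.next_leaf, priority)     # leaf index matches the buffer's write pointer
        self.next_leaf = (self.next_leaf + 1) % self.capacity

    def sample(self, value):
        """Find the leaf whose prefix sum of priorities covers `value`."""
        idx = 0
        while idx < self.capacity - 1:            # descend until a leaf is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx - (self.capacity - 1), self.tree[idx]   # leaf index and its priority

    @property
    def total(self):
        return self.tree[0]

To sample a mini-batch, draw batch_size values uniformly from \([0, \text{total}]\) and call sample() on each; after the training step, update() is called with the new \(|\delta_i| + \epsilon\) priorities of the sampled transitions.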

Hindsight Experience Replay (HER)

Applicable scenario: Goal-conditioned reinforcement learning, where the agent must reach a specified goal \(g\), and the policy is \(\pi(a|s, g)\).

Core problem: In sparse reward environments (where positive reward is only given upon reaching the goal), the agent almost never receives positive feedback, so learning can barely get off the ground.

Core idea: Even if the agent fails to reach the intended goal \(g\), the state \(s_T\) it actually reached can itself serve as a "hindsight goal." By relabeling the failed experience as "successful experience with \(s_T\) as the goal," additional positive samples are created.

The HER process:

Original experience (goal g = [5, 5], actually reached [3, 2], reward = 0):
  (s₀, a₀, r=0, s₁, g=[5,5])
  (s₁, a₁, r=0, s₂, g=[5,5])
  ...
  (s_{T-1}, a_{T-1}, r=0, s_T=[3,2], g=[5,5])    ← All failures

HER relabeling (replace goal with g' = s_T = [3,2]):
  (s₀, a₀, r=0, s₁, g'=[3,2])
  (s₁, a₁, r=0, s₂, g'=[3,2])
  ...
  (s_{T-1}, a_{T-1}, r=1, s_T=[3,2], g'=[3,2])   ← The last step "succeeded"!

Goal Sampling Strategy:

  • Final: Use the state reached at the end of the episode as the new goal
  • Future: Randomly select a state from some point after the current timestep as the new goal
  • Episode: Randomly select from anywhere within the episode
  • Random: Randomly select from the entire buffer

In practice, the Future strategy generally works best: it produces more diverse relabeled goals than Final while keeping them reachable within the same trajectory.
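A sketch of the Future strategy. The transition layout and reward_fn are illustrative assumptions: each stored transition carries the goal actually achieved at its next state, and reward_fn(achieved, goal) returns 1 when the achieved state matches the goal and 0 otherwise:

import random

def her_relabel(episode, reward_fn, k=4):
    """episode: list of (s, a, r, s_next, goal, achieved) tuples in time order,
    where `achieved` is the goal actually reached at s_next."""
    relabeled = []
    T = len(episode)
    for t, (s, a, _, s_next, _, achieved) in enumerate(episode):
        for _ in range(k):
            # "Future": pick the achieved state of a later timestep as the new goal
            future_t = random.randint(t, T - 1)
            new_goal = episode[future_t][5]
            new_reward = reward_fn(achieved, new_goal)
            relabeled.append((s, a, new_reward, s_next, new_goal))
    return relabeled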


Normalization Tricks

Normalization is a critical safeguard for RL training stability. Unlike supervised learning where data distributions are relatively fixed, in RL the state distribution, reward distribution, and advantage distribution are all constantly shifting (because the policy is changing).

Observation Normalization

Problem: Different state dimensions can have vastly different scales. For example, a robot's joint angles might lie in \([-\pi, \pi]\), while positional coordinates might be in \([-100, 100]\).

Solution: Maintain a running mean and running standard deviation of the observations, and standardize each observation:

\[ s_{\text{norm}} = \frac{s - \mu_{\text{running}}}{\sigma_{\text{running}} + \epsilon} \]

Running statistics update (Welford's online algorithm):

\[ \mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n} \]
\[ M_n = M_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n) \]
\[ \sigma_n^2 = \frac{M_n}{n} \]

Important notes:

  • During evaluation, use the statistics accumulated during training; do not update them
  • Clipping the normalized values (e.g., to \([-10, 10]\)) can prevent extreme outliers
  • In Stable Baselines3, this is implemented via the VecNormalize wrapper
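A minimal normalizer following Welford's update above (per-observation updates; the clip range and the frozen-statistics evaluation mode mirror the notes in this list, and all names are illustrative):

import numpy as np

class ObsNormalizer:
    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.M = np.zeros(shape, dtype=np.float64)    # running sum of squared deviations
        self.count = 0
        self.clip, self.eps = clip, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count               # Welford's online mean
        self.M += delta * (x - self.mean)             # Welford's online variance accumulator

    def normalize(self, x, training=True):
        if training:
            self.update(x)                            # freeze statistics during evaluation
        var = self.M / max(self.count, 1)
        x_norm = (x - self.mean) / (np.sqrt(var) + self.eps)
        return np.clip(x_norm, -self.clip, self.clip)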

Reward Normalization

Problem: Reward scales vary enormously across environments. Atari game rewards might be \(\{-1, 0, +1\}\), while MuJoCo task rewards could range over \([-100, 100]\).

Method 1: Reward Scaling

Scale rewards by their running standard deviation (note: do not subtract the mean, because doing so changes the nature of the task -- it turns "obtain positive reward" into "deviate from the mean"):

\[ r_{\text{scaled}} = \frac{r}{\sigma_{\text{running}}(r) + \epsilon} \]

In practice, a more common approach is to normalize returns \(G_t\): scale rewards by the running standard deviation of \(G_t\):

\[ r_{\text{scaled}} = \frac{r}{\sigma_{\text{running}}(G) + \epsilon} \]
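A sketch of this return-based scaling (roughly what SB3's VecNormalize does for rewards; names here are illustrative): maintain a running discounted return per environment and divide rewards by the standard deviation of those returns, without centering.

import numpy as np

class RewardScaler:
    def __init__(self, num_envs, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.returns = np.zeros(num_envs)            # running discounted return per env
        self.ret_sum, self.ret_sq_sum, self.n = 0.0, 0.0, 0

    def scale(self, rewards, dones):
        # update per-env discounted returns, resetting at episode boundaries
        self.returns = self.returns * self.gamma * (1.0 - dones) + rewards
        # accumulate statistics of the returns G_t
        self.ret_sum += self.returns.sum()
        self.ret_sq_sum += (self.returns ** 2).sum()
        self.n += self.returns.size
        var = self.ret_sq_sum / self.n - (self.ret_sum / self.n) ** 2
        # divide by std(G); do NOT subtract the mean of the rewards
        return rewards / (np.sqrt(max(var, 0.0)) + self.eps)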

Method 2: Reward Clipping

Simply clip rewards to a fixed range, such as \([-1, 1]\) or \([-5, 5]\). The original DQN Atari paper used \(\text{clip}(r, -1, 1)\).

  • Pros: Simple and effective; suitable when only the direction of the reward matters
  • Cons: Discards magnitude information

Advantage Normalization

Standardize advantage values within each mini-batch (per-batch standardization):

\[ \hat{A}_t \leftarrow \frac{\hat{A}_t - \text{mean}(\hat{A})}{\text{std}(\hat{A}) + \epsilon} \]

Why this works:

  1. Roughly half the advantages become positive and half negative, yielding more balanced gradient signals
  2. The gradient scale is decoupled from the advantage scale, reducing sensitivity to the learning rate
  3. Signal strength is automatically adjusted across different training stages

Note: This introduces a slight bias (since it shifts the mean of the advantages), but in practice the effect is negligible -- the stability gains far outweigh the cost of the bias.
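In code this is a single line inside the PPO update, applied to whichever array or tensor holds the mini-batch advantages (adv_batch is an illustrative name):

adv_batch = (adv_batch - adv_batch.mean()) / (adv_batch.std() + 1e-8)   # per-mini-batch standardization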


Parallel Sampling

Data collection is the bottleneck of on-policy RL. Parallelizing the sampling process can significantly increase throughput.

Vectorized Environments

SubprocVecEnv (multi-process environments):

Each environment runs in an independent subprocess and communicates with the main process via pipes.

Main process (policy inference + gradient updates)
    │
    ├──[pipe]──> Subprocess 1: Env1.step(a1) → (s1', r1, done1)
    ├──[pipe]──> Subprocess 2: Env2.step(a2) → (s2', r2, done2)
    ├──[pipe]──> Subprocess 3: Env3.step(a3) → (s3', r3, done3)
    └──[pipe]──> Subprocess N: EnvN.step(aN) → (sN', rN, doneN)
  • Pros: True parallelism, utilizing multiple CPU cores
  • Cons: Inter-process communication overhead; data serialization/deserialization
  • Best for: Computationally heavy environments (e.g., physics simulation, rendering)

DummyVecEnv (single-process environments):

All environments execute sequentially in the same process, simply wrapped with a vectorized interface.

  • Pros: No communication overhead; easy to debug
  • Cons: No true parallelism
  • Best for: Lightweight environments (e.g., CartPole); debugging
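With Stable Baselines3 the two variants share the same interface and can be swapped with a single argument (assuming a recent SB3 version; the environment ids are just examples):

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv

# Lightweight environment: sequential execution in one process
envs = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=DummyVecEnv)

# Heavy environment: one subprocess per environment
envs = make_vec_env("HalfCheetah-v4", n_envs=8, vec_env_cls=SubprocVecEnv)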

Async vs Sync Collection

Synchronous collection:

All environments step simultaneously; the policy is only queried after all environments have finished. This is the standard mode for A2C/PPO.

Time ──>

Env1: ████████░░░░████████░░░░    ← Fast env waits for slow env
Env2: ████████████████████████    ← Slow env
Env3: ██████░░░░░░██████░░░░░░
           ↑              ↑
        Batch inference   Batch inference
  • Pros: Data is aligned; simple to implement; policy inference can be batched (GPU-efficient)
  • Cons: Bottlenecked by the slowest environment

Asynchronous collection:

Environments send data as soon as they finish, without waiting for others. A3C uses this mode.

  • Pros: No idle time; throughput is maximized
  • Cons: Data alignment is difficult; policy versions may be inconsistent (stale gradient problem)

Distributed Training

When extremely high data throughput is needed, sampling and training can be distributed across multiple machines.

Typical architecture:

┌─────────────────────────────────────────────────────┐
│                  Learner (GPU)                       │
│        Receive data → Compute gradients → Update     │
└──────────┬─────────────────────────┬────────────────┘
           │ Send params             │ Send params
     ┌─────▼──────┐           ┌─────▼──────┐
     │  Worker 1  │           │  Worker 2  │
     │  (CPU)     │           │  (CPU)     │
     │ N envs     │           │ N envs     │
     └────────────┘           └────────────┘
  • Workers interact with environments to collect data, using the latest parameters sent by the Learner
  • Learner handles gradient computation and parameter updates
  • Communication is typically implemented via shared memory, gRPC, or ZMQ

Gradient aggregation strategies:

  1. Centralized: Workers send experiences to the Learner, which computes gradients centrally
  2. Distributed: Each Worker computes local gradients, then a global AllReduce aggregation is performed
  3. Asynchronous SGD: Workers push gradients asynchronously (may suffer from the stale gradient problem)
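A toy schematic of the centralized variant (strategy 1) using Python multiprocessing. To keep the sketch self-contained, the "policy" is reduced to a bare parameter vector and the rollout is faked with random data; all names and numbers are illustrative:

import multiprocessing as mp
import numpy as np

def worker(param_queue, data_queue, n_steps=100):
    params = param_queue.get()                       # initial parameters from the Learner
    while True:
        # placeholder rollout: a real Worker would run env.step() with the current policy
        batch = np.random.randn(n_steps) + params.mean()
        data_queue.put(batch)
        if not param_queue.empty():                  # pick up fresher parameters if available
            params = param_queue.get()

def learner(num_workers=2, num_updates=10):
    param_queues = [mp.Queue() for _ in range(num_workers)]
    data_queue = mp.Queue()
    params = np.zeros(8)
    for q in param_queues:
        q.put(params)                                # broadcast initial parameters
    procs = [mp.Process(target=worker, args=(q, data_queue), daemon=True) for q in param_queues]
    for p in procs:
        p.start()
    for _ in range(num_updates):
        batch = data_queue.get()                     # receive experience from any Worker
        params = params - 1e-3 * batch.mean()        # placeholder gradient step
        for q in param_queues:
            q.put(params)                            # send updated parameters back

if __name__ == "__main__":
    learner()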

Logging and Evaluation

Training Metrics

Key metrics to monitor during RL training:

Basic metrics:

| Metric | Meaning | Expected range/trend |
|---|---|---|
| ep_return | Total return per episode | Should increase over time |
| ep_length | Episode length | Task-dependent |
| fps | Frames processed per second | Should be stable |

Policy metrics (PPO-specific):

| Metric | Meaning | Expected range/trend |
|---|---|---|
| policy_loss | \(-L^{\text{CLIP}}\) | Absolute value should not be too large |
| value_loss | \(L^{VF}\) | Should decrease over time |
| entropy | \(H(\pi_\theta)\) | Should decrease gradually (should not drop sharply) |
| approx_kl | Approximate KL divergence between old and new policies | \(< 0.02\) (typically) |
| clip_fraction | Fraction of samples that were clipped | \(0.1 \sim 0.3\) |
| explained_variance | \(1 - \frac{\text{Var}(R - V)}{\text{Var}(R)}\) | Close to 1 indicates a good Critic |

Key warning signs:

  • entropy drops sharply to near 0 -- the policy is converging prematurely; increase the entropy coefficient
  • approx_kl consistently exceeds 0.05 -- the update step is too large; reduce the learning rate or the number of epochs \(K\)
  • clip_fraction is close to 0 -- clipping is not taking effect; \(\epsilon\) may be too large
  • clip_fraction is close to 1 -- nearly all samples are being clipped; the learning rate is likely too high
  • explained_variance is persistently negative -- the Critic is worse than random guessing; check the network architecture or learning rate
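For reference, the explained_variance in the table above can be computed directly from the rollout's returns and value predictions (a NumPy sketch):

import numpy as np

def explained_variance(returns, values):
    """1 - Var(R - V) / Var(R): 1.0 is a perfect Critic, <= 0 is no better than a constant."""
    var_returns = np.var(returns)
    return np.nan if var_returns == 0 else 1.0 - np.var(returns - values) / var_returns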

Evaluation Protocol

Training returns and evaluation returns are different. During training, the policy includes exploration noise (stochastic sampling), whereas evaluation typically uses a deterministic policy.

Deterministic vs Stochastic evaluation:

# During training (Stochastic)
action = policy.sample(obs)        # Sample from the distribution

# During evaluation (Deterministic)
action = policy.mean(obs)          # Take the distribution's mean/mode
# Discrete action space: action = argmax π(a|s)
# Continuous action space: action = μ(s)  (mean of the Gaussian policy)

Independent evaluation environments:

Evaluation should use separate environment instances, isolated from the training environments:

  • Evaluation environments should use different seeds than training
  • Evaluation environments should not have a Normalization wrapper (or should use the training statistics)
  • Each evaluation should run multiple episodes (typically 10--20) and average the results
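Putting these points together, a minimal evaluation loop might look like this (policy.mean mirrors the deterministic-action snippet above and is an illustrative name; the Gymnasium 5-tuple step API is assumed):

import numpy as np

def evaluate(policy, eval_env, n_episodes=10):
    episode_returns = []
    for _ in range(n_episodes):
        obs, _ = eval_env.reset()
        done, ep_ret = False, 0.0
        while not done:
            action = policy.mean(obs)                           # deterministic action
            obs, reward, terminated, truncated, _ = eval_env.step(action)
            ep_ret += reward
            done = terminated or truncated
        episode_returns.append(ep_ret)
    return float(np.mean(episode_returns)), float(np.std(episode_returns))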

Best Model Checkpointing:

if mean_eval_return > best_return:
    best_return = mean_eval_return
    # save both networks' weights in a single checkpoint (PyTorch)
    torch.save({"policy": policy_net.state_dict(),
                "value": value_net.state_dict()}, "best_model.pt")
  • Evaluate at fixed intervals (e.g., every 10 rollouts)
  • Save the model with the highest evaluation return
  • Also save the latest model (for resuming training)
  • Save the normalization statistics (running mean/std); otherwise observations cannot be correctly preprocessed after loading the model

Common Tools

TensorBoard: Natively supported by PyTorch via SummaryWriter for logging scalars, histograms, images, etc. Used by default in Stable Baselines3.

Weights & Biases (W&B): A cloud-based experiment management platform that automatically logs hyperparameters, metrics, and system resources. Supports experiment comparison and hyperparameter sweeps.

MLflow: An open-source experiment tracking tool that supports local deployment, making it well-suited for enterprise environments.

The choice among these three is largely a matter of personal preference and team needs. For individual research, TensorBoard is sufficient. For team collaboration and large-scale experiments, W&B is more convenient. For organizations requiring private deployment, MLflow is the go-to option.


Hyperparameter Tuning Guide

Hyperparameter Sensitivity

| Hyperparameter | Typical value | Sensitivity | Tuning advice |
|---|---|---|---|
| Learning Rate | \(3 \times 10^{-4}\) | Very high | Start with \(3 \times 10^{-4}\); if unsuccessful, search \([10^{-4}, 10^{-3}]\) |
| \(\gamma\) (discount factor) | 0.99 | Medium | Use 0.99 for long-horizon tasks, 0.95 for short-horizon tasks |
| \(\lambda\) (GAE) | 0.95 | Low | 0.95 is a safe default |
| \(\epsilon\) (PPO clip) | 0.2 | Low | Rarely needs tuning |
| Entropy Coeff \(c_2\) | 0.01 | Medium | Increase if exploration is insufficient; decrease if the policy fails to converge |
| Value Coeff \(c_1\) | 0.5 | Low | Either 0.5 or 1.0 works fine |
| Mini-batch Size | 64-256 | Medium | Too small leads to high variance; too large reduces the number of updates |
| N Epochs (K) | 4-10 | Medium | Too large causes overfitting to the rollout data |
| N Envs | 8-128 | Low | More is better (limited by CPU) |
| T Horizon | 128-2048 | Medium | Too short yields poor GAE estimates; too long increases on-policy bias |
| Gradient Clip | 0.5 | Low | 0.5 is a safe default |

The single most important hyperparameter is the learning rate. If you can only tune one hyperparameter, tune the learning rate. Using default values for the rest will typically yield reasonable results.

Common Failure Modes and Debugging

1. Returns do not improve or are extremely volatile

  • Verify that the reward function is correctly implemented
  • Check whether the environment's done signal is correct
  • Lower the learning rate
  • Increase the number of parallel environments (more data = lower variance)

2. Returns rise then fall (Catastrophic Forgetting)

  • PPO: reduce the learning rate, reduce the number of epochs \(K\), reduce \(\epsilon\)
  • This may be reward hacking -- the policy found an exploit rather than genuinely solving the problem

3. Policy prematurely converges to a suboptimal solution

  • Increase the entropy coefficient \(c_2\)
  • Check whether there is sufficient exploration (increase stochasticity)
  • Consider using a larger network

4. Critic's Explained Variance is low

  • Increase the Critic's network capacity
  • Lower the learning rate (the Critic updates may be too aggressive)
  • Check whether observation normalization is enabled

5. Training crashes early (NaN or Inf)

  • Check the scale of observations and rewards; enable normalization
  • Lower the learning rate
  • Check network initialization
  • Ensure that computing \(\log \pi\) does not result in \(\log 0\) (add \(\epsilon\))

6. Poor performance on MuJoCo

  • Ensure you are using a continuous action space (Gaussian policy)
  • Check action scaling (does the action range match the environment?)
  • MuJoCo typically requires larger networks ([256, 256] rather than [64, 64])
  • Make sure observation normalization is enabled

Recommended debugging workflow:

1. First validate code correctness on simple environments
   CartPole (discrete) → Pendulum (continuous) → HalfCheetah (complex continuous)

2. Compare against a known good implementation
   Run SB3's PPO on the same environment with the same hyperparameters and compare return curves
   If SB3 also fails, it's a hyperparameter issue; if SB3 succeeds but yours doesn't, it's a code bug

3. Incrementally add complexity
   Start with all tricks disabled (no normalization, no clipping) to verify the basic training loop
   Then add tricks one at a time and observe the effect of each

4. Visualize everything
   Record videos of the policy's behavior
   Plot heatmaps of the value function
   Inspect whether the action distribution is reasonable
