RL Engineering in Practice
The theoretical formulations of reinforcement learning algorithms are often elegant and concise, but the gap between equations and runnable, reproducible code is filled with engineering details. These details are typically glossed over in papers, yet they can have an enormous impact on final performance. This note systematically covers the most essential practical knowledge in RL engineering: from Buffer design for data collection, to Normalization tricks for training stability, to distributed sampling architectures, to logging, evaluation, and hyperparameter tuning guidelines.
The Rollout Mechanism
What Is a Rollout
In on-policy methods (A2C, PPO, etc.), a Rollout refers to the process of interacting with the environment using the current policy \(\pi_\theta\) to collect a batch of experience data. After each rollout, the data is used to update the policy and then discarded (or, in PPO, reused for K epochs before being discarded).
The amount of data in a single rollout is determined by two parameters:
\[
N_{\text{batch}} = N_{\text{envs}} \times T_{\text{horizon}}
\]
where \(N_{\text{envs}}\) is the number of parallel environments and \(T_{\text{horizon}}\) is the number of steps collected per environment.
Rollout Buffer vs Replay Buffer
These two are the data storage mechanisms for on-policy and off-policy methods respectively, and they are fundamentally different:
| Dimension | Rollout Buffer | Replay Buffer |
|---|---|---|
| Algorithms | A2C, PPO (on-policy) | DQN, SAC (off-policy) |
| Data source | Current policy \(\pi_\theta\) | Mixture of historical policies |
| Storage | Fixed size, cleared after each rollout | Circular queue, continuously stored |
| Data reuse | Discarded after use (or K epochs) | Repeatedly sampled, highly reused |
| Typical size | \(N \times T\) (thousands to tens of thousands) | \(10^5 \sim 10^6\) transitions |
| Sampling | Sequential traversal (shuffled into mini-batches) | Uniform random sampling |
Data Structure of a Rollout Buffer
A typical Rollout Buffer stores the following fields:
```python
class RolloutBuffer:
    states: ndarray      # (N*T, obs_dim)  observed states
    actions: ndarray     # (N*T, act_dim)  actions taken
    rewards: ndarray     # (N*T,)  immediate rewards
    dones: ndarray       # (N*T,)  whether the episode ended at this step
    log_probs: ndarray   # (N*T,)  log π_old(a|s)
    values: ndarray      # (N*T,)  V_old(s)
    advantages: ndarray  # (N*T,)  GAE advantage estimates (computed afterward)
    returns: ndarray     # (N*T,)  target returns (computed afterward)
```
Here, `log_probs` and `values` are computed with the old parameters during data collection and serve as "reference values" in subsequent policy updates; `advantages` and `returns` are computed via backward recursion using GAE after the rollout is complete.
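The backward GAE recursion over a finished rollout can be sketched as follows -- a minimal NumPy version, where `last_value` is assumed to be the critic's estimate for the state following the final step:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward GAE recursion over a rollout of length T."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        mask = 1.0 - dones[t]  # cut bootstrapping at episode boundaries
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    returns = advantages + values  # regression targets for the value function
    return advantages, returns
```

Note how `mask` implements the `(1 - done)` cut-off discussed below: when `dones[t]` is 1, neither the bootstrap value nor the accumulated GAE term propagates across the episode boundary.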
Episode Truncation and Padding
In parallel environments, episodes in different environments have different lengths. When an episode ends in one environment, there are two ways to handle it:
Approach 1: Auto-reset
Most Gym/Gymnasium vectorized environments use this approach. When an episode ends in one environment, it is automatically reset and the initial state of the new episode is returned. Key considerations:
- The next state \(s'\) after `done=True` belongs to a new episode and should not be used for bootstrapping
- GAE computation must use `(1 - done)` to "cut off" propagation across episode boundaries
- The true "terminal observation" must be recorded, since auto-reset causes the returned \(s'\) to already be the initial state of the new episode
Approach 2: Truncation
When an environment terminates due to reaching a maximum step limit (as opposed to a natural termination), this is called Truncation. Since the episode has not truly ended at truncation, bootstrapping with \(V(s_{\text{last}})\) is required: the final reward is augmented as \(r_T \leftarrow r_T + \gamma V(s_{\text{last}})\) instead of being treated as terminal.
In Gymnasium, truncated and terminated are separate signals and must be handled independently.
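A hedged sketch of how the two signals might be combined when forming the bootstrap target (function and variable names are illustrative, not a library API):

```python
def bootstrap_reward(reward, terminated, truncated, gamma, value_of_final_obs):
    """On truncation (time limit hit, not a true terminal state), add the
    discounted value of the final observation so the target does not treat
    the episode as having genuinely ended."""
    if truncated and not terminated:
        return reward + gamma * value_of_final_obs
    return reward
```

When both flags are set, termination takes precedence: the state is genuinely terminal and no bootstrap is added.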
Experience Replay (Replay Buffer)
Experience replay is a core component of off-policy methods. A major reason DQN can train stably is that experience replay breaks the temporal correlation in the data.
Basic Replay Buffer
Data structure: A fixed-capacity circular buffer that stores transitions \((s, a, r, s', \text{done})\).
Circular Buffer (capacity = C)
Write pointer →
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ t5│ t6│ t7│ t8│ t9│ t0│ t1│ t2│ t3│ t4│
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
↑
Oldest data (overwritten next)
Core operations:
- Store: After each interaction step, write \((s, a, r, s', \text{done})\) to the current position and advance the pointer. When full, the oldest data is overwritten.
- Sample: Uniformly at random sample a mini-batch (typically 32--256 transitions) from the buffer for training.
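The store/sample operations above can be sketched as a minimal circular buffer (NumPy storage, uniform sampling with replacement; the discrete-action layout is an illustrative assumption):

```python
import numpy as np

class ReplayBuffer:
    """Fixed-capacity circular buffer storing (s, a, r, s', done) transitions."""
    def __init__(self, capacity, obs_dim):
        self.capacity = capacity
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)
        self.pos = 0   # write pointer
        self.size = 0  # number of valid entries

    def store(self, s, a, r, s2, done):
        i = self.pos
        self.obs[i] = s
        self.actions[i] = a
        self.rewards[i] = r
        self.next_obs[i] = s2
        self.dones[i] = done
        self.pos = (self.pos + 1) % self.capacity  # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=np.random):
        idx = rng.randint(0, self.size, size=batch_size)  # uniform sampling
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])
```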
The problem with uniform sampling: All transitions are sampled with equal probability, but not all data has equal "learning value." Some transitions contain a wealth of new information (e.g., encountering a rare state or making a large error), while others are "boring" common transitions. This motivates PER.
Prioritized Experience Replay (PER)
Core idea: Give transitions with higher "learning value" a higher probability of being sampled. "Learning value" is measured by the absolute TD error -- a large TD error means the current value estimate has a large prediction error for that transition, indicating more room to learn from it.
Priority definition:
\[
p_i = |\delta_i| + \epsilon
\]
where \(\delta_i = r_i + \gamma \max_{a'} Q(s_i', a') - Q(s_i, a_i)\) is the TD error, and \(\epsilon > 0\) is a small constant that prevents the priority from being zero (which would mean the transition is never sampled).
Sampling probability:
\[
P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}
\]
\(\alpha \in [0, 1]\) controls the strength of prioritization: \(\alpha = 0\) degenerates to uniform sampling, while \(\alpha = 1\) samples fully according to priority.
Importance sampling correction: Priority sampling changes the data distribution, and without correction this introduces bias. PER uses importance sampling weights to correct the gradients:
\[
w_i = \left( \frac{1}{C} \cdot \frac{1}{P(i)} \right)^{\beta}
\]
where \(C\) is the buffer capacity, and \(\beta \in [0, 1]\) controls the degree of correction. During training, \(\beta\) is linearly annealed from an initial value (e.g., 0.4) to 1.0, ensuring that bias is fully eliminated in the later stages of training.
In practice, the weights are normalized: \(w_i \leftarrow w_i / \max_j w_j\), to prevent excessively large weights from causing instability.
Implementation efficiency: A naive implementation requires \(O(C)\) time to compute sampling probabilities. In practice, a Sum Tree (a type of complete binary tree) data structure is used to reduce both sampling and update complexity to \(O(\log C)\).
Sum Tree structure (stores sums of priorities)
[Total = 42]
/ \
[29] [13]
/ \ / \
[13] [16] [3] [10]
/ \ / \ / \ / \
[3][10][12][4] [1][2] [8][2] ← Leaf nodes = priority of each transition
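A minimal Sum Tree along these lines, using a 1-indexed array layout where leaf `j` lives at array position `capacity + j`; both `update` and `sample` walk one root-to-leaf path, giving the \(O(\log C)\) cost stated above:

```python
import numpy as np

class SumTree:
    """Complete binary tree: leaves hold priorities, internal nodes hold the
    sum of their children. Root (index 1) therefore holds the total priority."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity, dtype=np.float64)

    def update(self, leaf_idx, priority):
        i = leaf_idx + self.capacity  # array position of the leaf
        delta = priority - self.tree[i]
        while i >= 1:                 # propagate the change up to the root
            self.tree[i] += delta
            i //= 2

    def total(self):
        return self.tree[1]

    def sample(self, value):
        """Return the leaf whose prefix-sum interval contains `value`,
        for value drawn uniformly from (0, total()]."""
        i = 1
        while i < self.capacity:      # descend until a leaf is reached
            left = 2 * i
            if value <= self.tree[left]:
                i = left
            else:
                value -= self.tree[left]
                i = left + 1
        return i - self.capacity      # leaf index in [0, capacity)
```

Drawing `value = random.uniform(0, tree.total())` and calling `sample(value)` then selects transition \(i\) with probability proportional to its priority.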
Hindsight Experience Replay (HER)
Applicable scenario: Goal-conditioned reinforcement learning, where the agent must reach a specified goal \(g\), and the policy is \(\pi(a|s, g)\).
Core problem: In sparse reward environments (where positive reward is only given upon reaching the goal), the agent almost never receives positive feedback, making learning impossible to bootstrap.
Core idea: Even if the agent fails to reach the intended goal \(g\), the state \(s_T\) it actually reached can itself serve as a "hindsight goal." By relabeling the failed experience as "successful experience with \(s_T\) as the goal," additional positive samples are created.
The HER process:
Original experience (goal g = [5, 5], actually reached [3, 2], reward = 0):
(s₀, a₀, r=0, s₁, g=[5,5])
(s₁, a₁, r=0, s₂, g=[5,5])
...
(s_{T-1}, a_{T-1}, r=0, s_T=[3,2], g=[5,5]) ← All failures
HER relabeling (replace goal with g' = s_T = [3,2]):
(s₀, a₀, r=0, s₁, g'=[3,2])
(s₁, a₁, r=0, s₂, g'=[3,2])
...
(s_{T-1}, a_{T-1}, r=1, s_T=[3,2], g'=[3,2]) ← The last step "succeeded"!
Goal Sampling Strategy:
- Final: Use the state reached at the end of the episode as the new goal
- Future: Randomly select a state from some point after the current timestep as the new goal
- Episode: Randomly select from anywhere within the episode
- Random: Randomly select from the entire buffer
In practice, the Future strategy generally works best: the hindsight goals stay within the same trajectory and temporally close to each transition, so the recomputed rewards provide dense, relevant learning signal.
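The Future strategy can be sketched as follows -- a hedged, minimal version where each transition is a dict and `achieved` denotes the goal actually reached after the transition (field names and the binary reward are illustrative assumptions):

```python
import random

def her_relabel(episode, k=4, rng=random):
    """Future-strategy HER: for each transition, sample up to k states from
    later in the same episode as hindsight goals, then recompute the sparse
    reward against the new goal."""
    relabeled = []
    T = len(episode)
    for t, tr in enumerate(episode):
        for _ in range(k):
            future = rng.randint(t, T - 1)          # pick a future step
            new_goal = episode[future]["achieved"]  # state actually reached there
            reward = 1.0 if tr["achieved"] == new_goal else 0.0
            relabeled.append({**tr, "goal": new_goal, "r": reward})
    return relabeled
```

The relabeled transitions are simply added to the replay buffer alongside the originals; the final transition of every episode is guaranteed at least one "successful" relabeling.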
Normalization Tricks
Normalization is a critical safeguard for RL training stability. Unlike supervised learning where data distributions are relatively fixed, in RL the state distribution, reward distribution, and advantage distribution are all constantly shifting (because the policy is changing).
Observation Normalization
Problem: Different state dimensions can have vastly different scales. For example, a robot's joint angles might lie in \([-\pi, \pi]\), while positional coordinates might be in \([-100, 100]\).
Solution: Maintain a running mean \(\mu\) and running standard deviation \(\sigma\) of the observations, and standardize each observation:
\[
\hat{s} = \frac{s - \mu}{\sigma + \epsilon}
\]
The running statistics are updated online via Welford's algorithm:
\[
\mu_k = \mu_{k-1} + \frac{x_k - \mu_{k-1}}{k}, \qquad M_k = M_{k-1} + (x_k - \mu_{k-1})(x_k - \mu_k), \qquad \sigma_k^2 = \frac{M_k}{k}
\]
Important notes:
- During evaluation, use the statistics accumulated during training; do not update them
- Clipping the normalized values (e.g., to \([-10, 10]\)) can prevent extreme outliers
- In Stable Baselines3, this is implemented via the `VecNormalize` wrapper
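The Welford update and the notes above can be combined into one small class -- a sketch, not the `VecNormalize` implementation itself:

```python
import numpy as np

class RunningMeanStd:
    """Welford-style running statistics for observation normalization."""
    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # sum of squared deviations
        self.count = 0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # second factor uses the updated mean

    @property
    def std(self):
        var = self.m2 / max(self.count, 1)
        return np.sqrt(var + 1e-8)

    def normalize(self, x, clip=10.0):
        # Clip normalized values to guard against extreme outliers;
        # at evaluation time, call normalize() without calling update().
        return np.clip((x - self.mean) / self.std, -clip, clip)
```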
Reward Normalization
Problem: Reward scales vary enormously across environments. Atari game rewards might be \(\{-1, 0, +1\}\), while MuJoCo task rewards could range over \([-100, 100]\).
Method 1: Reward Scaling
Scale rewards by their running standard deviation (note: do not subtract the mean, because doing so changes the nature of the task -- it turns "obtain positive reward" into "deviate from the mean"):
\[
\hat{r}_t = \frac{r_t}{\sigma_r + \epsilon}
\]
In practice, a more common approach is to normalize by the scale of returns: maintain a running discounted return \(G_t = \gamma G_{t-1} + r_t\) and scale rewards by the running standard deviation of \(G_t\):
\[
\hat{r}_t = \frac{r_t}{\sigma_G + \epsilon}
\]
Method 2: Reward Clipping
Simply clip rewards to a fixed range, such as \([-1, 1]\) or \([-5, 5]\). The original DQN Atari paper used \(\text{clip}(r, -1, 1)\).
- Pros: Simple and effective; suitable when only the direction of the reward matters
- Cons: Discards magnitude information
Advantage Normalization
Standardize advantage values within each mini-batch (per-batch standardization):
\[
\hat{A}_i = \frac{A_i - \mu_A}{\sigma_A + \epsilon}
\]
where \(\mu_A\) and \(\sigma_A\) are the mean and standard deviation of the advantages in the current batch.
Why this works:
- Roughly half the advantages become positive and half negative, yielding more balanced gradient signals
- The gradient scale is decoupled from the advantage scale, reducing sensitivity to the learning rate
- Signal strength is automatically adjusted across different training stages
Note: This introduces a slight bias (since it shifts the mean of the advantages), but in practice the effect is negligible -- the stability gains far outweigh the cost of the bias.
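In code this is a one-liner applied to each mini-batch before computing the policy loss:

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Per-batch standardization of GAE advantages."""
    return (adv - adv.mean()) / (adv.std() + eps)
```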
Parallel Sampling
Data collection is the bottleneck of on-policy RL. Parallelizing the sampling process can significantly increase throughput.
Vectorized Environments
SubprocVecEnv (multi-process environments):
Each environment runs in an independent subprocess and communicates with the main process via pipes.
Main process (policy inference + gradient updates)
│
├──[pipe]──> Subprocess 1: Env1.step(a1) → (s1', r1, done1)
├──[pipe]──> Subprocess 2: Env2.step(a2) → (s2', r2, done2)
├──[pipe]──> Subprocess 3: Env3.step(a3) → (s3', r3, done3)
└──[pipe]──> Subprocess N: EnvN.step(aN) → (sN', rN, doneN)
- Pros: True parallelism, utilizing multiple CPU cores
- Cons: Inter-process communication overhead; data serialization/deserialization
- Best for: Computationally heavy environments (e.g., physics simulation, rendering)
DummyVecEnv (single-process environments):
All environments execute sequentially in the same process, simply wrapped with a vectorized interface.
- Pros: No communication overhead; easy to debug
- Cons: No true parallelism
- Best for: Lightweight environments (e.g., CartPole); debugging
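The DummyVecEnv idea is small enough to sketch in full. The version below is illustrative (not the Stable Baselines3 API): it steps each wrapped environment sequentially, auto-resets on episode end, and includes a toy `CountEnv` so the behavior can be seen end to end:

```python
import numpy as np

class CountEnv:
    """Toy env: the observation counts steps; the episode ends after max_steps."""
    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.t = 0
    def reset(self):
        self.t = 0
        return np.array([0.0])
    def step(self, action):
        self.t += 1
        return np.array([float(self.t)]), 1.0, self.t >= self.max_steps

class DummyVecEnvLite:
    """Single-process vectorized wrapper: sequential stepping plus auto-reset."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        obs, rews, dones = [], [], []
        for env, a in zip(self.envs, actions):
            o, r, done = env.step(a)
            if done:
                o = env.reset()  # auto-reset: return the new episode's start
            obs.append(o)
            rews.append(r)
            dones.append(done)
        return np.stack(obs), np.array(rews), np.array(dones)
```

Note the auto-reset caveat from earlier: after `done=True` the returned observation already belongs to the new episode, so a real implementation must also surface the true terminal observation.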
Async vs Sync Collection
Synchronous collection:
All environments step simultaneously; the policy is only queried after all environments have finished. This is the standard mode for A2C/PPO.
Time ──>
Env1: ████████░░░░████████░░░░ ← Fast env waits for slow env
Env2: ████████████████████████ ← Slow env
Env3: ██████░░░░░░██████░░░░░░
↑ ↑
Batch inference Batch inference
- Pros: Data is aligned; simple to implement; policy inference can be batched (GPU-efficient)
- Cons: Bottlenecked by the slowest environment
Asynchronous collection:
Environments send data as soon as they finish, without waiting for others. A3C uses this mode.
- Pros: No idle time; throughput is maximized
- Cons: Data alignment is difficult; policy versions may be inconsistent (stale gradient problem)
Distributed Training
When extremely high data throughput is needed, sampling and training can be distributed across multiple machines.
Typical architecture:
┌─────────────────────────────────────────────────────┐
│ Learner (GPU) │
│ Receive data → Compute gradients → Update │
└──────────┬─────────────────────────┬────────────────┘
│ Send params │ Send params
┌─────▼──────┐ ┌─────▼──────┐
│ Worker 1 │ │ Worker 2 │
│ (CPU) │ │ (CPU) │
│ N envs │ │ N envs │
└────────────┘ └────────────┘
- Workers interact with environments to collect data, using the latest parameters sent by the Learner
- Learner handles gradient computation and parameter updates
- Communication is typically implemented via shared memory, gRPC, or ZMQ
Gradient aggregation strategies:
- Centralized: Workers send experiences to the Learner, which computes gradients centrally
- Distributed: Each Worker computes local gradients, then a global AllReduce aggregation is performed
- Asynchronous SGD: Workers push gradients asynchronously (may suffer from the stale gradient problem)
Logging and Evaluation
Training Metrics
Key metrics to monitor during RL training:
Basic metrics:
| Metric | Meaning | Expected range/trend |
|---|---|---|
| `ep_return` | Total return per episode | Should increase over time |
| `ep_length` | Episode length | Task-dependent |
| `fps` | Frames processed per second | Should be stable |
Policy metrics (PPO-specific):
| Metric | Meaning | Expected range/trend |
|---|---|---|
| `policy_loss` | \(-L^{\text{CLIP}}\) | Absolute value should not be too large |
| `value_loss` | \(L^{VF}\) | Should decrease over time |
| `entropy` | \(H(\pi_\theta)\) | Should decrease gradually (should not drop sharply) |
| `approx_kl` | Approximate KL divergence between old and new policies | \(< 0.02\) (typically) |
| `clip_fraction` | Fraction of samples that were clipped | \(0.1 \sim 0.3\) |
| `explained_variance` | \(1 - \frac{\text{Var}(R - V)}{\text{Var}(R)}\) | Close to 1 indicates a good Critic |
Key warning signs:
- `entropy` drops sharply to near 0 -- the policy is converging prematurely; increase the entropy coefficient
- `approx_kl` consistently exceeds 0.05 -- the update step is too large; reduce the learning rate or the number of epochs
- `clip_fraction` is close to 0 -- clipping is not taking effect; \(\epsilon\) may be too large
- `clip_fraction` is close to 1 -- nearly all samples are being clipped; the learning rate is likely too high
- `explained_variance` is persistently negative -- the Critic is worse than simply predicting the mean return; check the network architecture or learning rate
Evaluation Protocol
Training returns and evaluation returns are different. During training, the policy includes exploration noise (stochastic sampling), whereas evaluation typically uses a deterministic policy.
Deterministic vs Stochastic evaluation:
```python
# During training (stochastic)
action = policy.sample(obs)  # sample from the action distribution

# During evaluation (deterministic)
action = policy.mean(obs)    # take the distribution's mean/mode
# Discrete action space:   action = argmax_a π(a|s)
# Continuous action space: action = μ(s)  (mean of the Gaussian policy)
```
Independent evaluation environments:
Evaluation should use separate environment instances, isolated from the training environments:
- Evaluation environments should use different seeds than training
- Evaluation environments should not have a Normalization wrapper (or should use the training statistics)
- Each evaluation should run multiple episodes (typically 10--20) and average the results
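Putting the points above together, an evaluation routine might look like this -- a sketch with assumed interfaces: `policy(obs)` returns a deterministic action, and `make_env(seed)` builds a fresh, independent evaluation environment:

```python
import numpy as np

def evaluate(policy, make_env, n_episodes=10, seed=1000):
    """Run the deterministic policy in separate env instances and average
    the episode returns over n_episodes."""
    returns = []
    for ep in range(n_episodes):
        env = make_env(seed + ep)  # seeds disjoint from the training seeds
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, r, done = env.step(policy(obs))
            total += r
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))
```

If observation normalization is used during training, the evaluation environments should apply the frozen training statistics rather than accumulating their own.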
Best Model Checkpointing:
```python
if mean_eval_return > best_return:
    best_return = mean_eval_return
    save_model(policy_net, value_net, "best_model.pt")
```
- Evaluate at fixed intervals (e.g., every 10 rollouts)
- Save the model with the highest evaluation return
- Also save the latest model (for resuming training)
- Save the normalization statistics (running mean/std); otherwise observations cannot be correctly preprocessed after loading the model
Common Tools
TensorBoard: Natively supported by PyTorch via SummaryWriter for logging scalars, histograms, images, etc. Used by default in Stable Baselines3.
Weights & Biases (W&B): A cloud-based experiment management platform that automatically logs hyperparameters, metrics, and system resources. Supports experiment comparison and hyperparameter sweeps.
MLflow: An open-source experiment tracking tool that supports local deployment, making it well-suited for enterprise environments.
The choice among these three is largely a matter of personal preference and team needs. For individual research, TensorBoard is sufficient. For team collaboration and large-scale experiments, W&B is more convenient. For organizations requiring private deployment, MLflow is the go-to option.
Hyperparameter Tuning Guide
Hyperparameter Sensitivity
| Hyperparameter | Typical value | Sensitivity | Tuning advice |
|---|---|---|---|
| Learning Rate | \(3 \times 10^{-4}\) | Very high | Start with \(3 \times 10^{-4}\); if unsuccessful, search \([10^{-4}, 10^{-3}]\) |
| \(\gamma\) (discount factor) | 0.99 | Medium | Use 0.99 for long-horizon tasks, 0.95 for short-horizon tasks |
| \(\lambda\) (GAE) | 0.95 | Low | 0.95 is a safe default |
| \(\epsilon\) (PPO clip) | 0.2 | Low | Rarely needs tuning |
| Entropy Coeff \(c_2\) | 0.01 | Medium | Increase if exploration is insufficient; decrease if the policy fails to converge |
| Value Coeff \(c_1\) | 0.5 | Low | Either 0.5 or 1.0 works fine |
| Mini-batch Size | 64-256 | Medium | Too small leads to high variance; too large reduces the number of updates |
| N Epochs (K) | 4-10 | Medium | Too large causes overfitting to the rollout data |
| N Envs | 8-128 | Low | More is better (limited by CPU) |
| T Horizon | 128-2048 | Medium | Too short yields poor GAE estimates; too long increases on-policy bias |
| Gradient Clip | 0.5 | Low | 0.5 is a safe default |
The single most important hyperparameter is the learning rate. If you can only tune one hyperparameter, tune the learning rate. Using default values for the rest will typically yield reasonable results.
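For reference, the typical values from the table can be collected into a single config dict -- a starting point, not a tuned setting, and the key names here are illustrative rather than any particular library's:

```python
# Hedged starting-point hyperparameters, mirroring the table above
ppo_defaults = dict(
    learning_rate=3e-4,   # the single most sensitive knob
    gamma=0.99,           # use 0.95 for short-horizon tasks
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    batch_size=64,
    n_epochs=10,
    n_envs=8,
    n_steps=2048,         # T horizon per environment
    max_grad_norm=0.5,
)
```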
Common Failure Modes and Debugging
1. Returns do not improve or are extremely volatile
- Verify that the reward function is correctly implemented
- Check whether the environment's `done` signal is correct
- Lower the learning rate
- Increase the number of parallel environments (more data = lower variance)
2. Returns rise then fall (Catastrophic Forgetting)
- PPO: reduce the learning rate, reduce the number of epochs \(K\), reduce \(\epsilon\)
- This may be reward hacking -- the policy found an exploit rather than genuinely solving the problem
3. Policy prematurely converges to a suboptimal solution
- Increase the entropy coefficient \(c_2\)
- Check whether there is sufficient exploration (increase stochasticity)
- Consider using a larger network
4. Critic's Explained Variance is low
- Increase the Critic's network capacity
- Lower the learning rate (the Critic updates may be too aggressive)
- Check whether observation normalization is enabled
5. Training crashes early (NaN or Inf)
- Check the scale of observations and rewards; enable normalization
- Lower the learning rate
- Check network initialization
- Ensure that computing \(\log \pi\) does not result in \(\log 0\) (add \(\epsilon\))
6. Poor performance on MuJoCo
- Ensure you are using a continuous action space (Gaussian policy)
- Check action scaling (does the action range match the environment?)
- MuJoCo typically requires larger networks ([256, 256] rather than [64, 64])
- Make sure observation normalization is enabled
Recommended debugging workflow:
1. First validate code correctness on simple environments
CartPole (discrete) → Pendulum (continuous) → HalfCheetah (complex continuous)
2. Compare against a known good implementation
Run SB3's PPO on the same environment with the same hyperparameters and compare return curves
If SB3 also fails, it's a hyperparameter issue; if SB3 succeeds but yours doesn't, it's a code bug
3. Incrementally add complexity
Start with all tricks disabled (no normalization, no clipping) to verify the basic training loop
Then add tricks one at a time and observe the effect of each
4. Visualize everything
Record videos of the policy's behavior
Plot heatmaps of the value function
Inspect whether the action distribution is reasonable