Safe Reinforcement Learning
Motivation
In real-world applications, maximizing return is not the only objective — policies must also satisfy safety constraints. For example:
- Autonomous driving: Must not collide with pedestrians
- Robotic surgery: Must not exceed torque thresholds
- Financial trading: Must not exceed risk budgets
- Power systems: Must not violate operational limits
Safe Reinforcement Learning (Safe RL) studies how to optimize policies under constraints.
Constrained Markov Decision Process (CMDP)
Definition
A CMDP augments the standard MDP with a set of cost functions and constraint thresholds. The objective is to maximize return while keeping each expected cumulative cost below its bound:

\[
\max_{\pi} \; J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^t r(s_t, a_t)\right]
\quad \text{s.t.} \quad
J_{c_i}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^t c_i(s_t, a_t)\right] \le d_i, \quad i = 1, \dots, k
\]

where:
- \(c_i(s, a)\): The \(i\)-th cost function
- \(d_i\): The upper bound for the \(i\)-th constraint
- \(k\): Number of constraints
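In practice the cost signal is typically exposed alongside the reward, for example as a per-step entry in the `info` dictionary, following the Safety-Gym-style convention. Below is a minimal sketch assuming a Gymnasium environment; the specific cost function is purely illustrative.

```python
import gymnasium as gym


class CostWrapper(gym.Wrapper):
    """Expose a CMDP cost signal c(s, a) alongside the reward (sketch only)."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        info["cost"] = float(self._cost(obs, action))   # per-step cost c(s, a)
        return obs, reward, terminated, truncated, info

    def _cost(self, obs, action):
        # Illustrative: unit cost whenever the first observation dimension
        # leaves a hypothetical safe interval [-1, 1].
        return 1.0 if abs(obs[0]) > 1.0 else 0.0
```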
Difference from Multi-Objective Rewards
| Property | Multi-Objective Rewards | CMDP |
|---|---|---|
| Safety guarantees | No hard guarantees | Constraints must be satisfied |
| Weight tuning | Manual weight adjustment | Constraint thresholds have clear meaning |
| Policy space | Full policy space | Feasible policy subset |
Lagrangian Methods
Basic Idea
Convert the constrained optimization into an unconstrained saddle-point problem over the policy and a vector of Lagrange multipliers \(\lambda \ge 0\):

\[
\min_{\lambda \ge 0} \max_{\pi} \; L(\pi, \lambda) = J(\pi) - \sum_{i=1}^{k} \lambda_i \left(J_{c_i}(\pi) - d_i\right)
\]
Primal-Dual Updates
Alternately update the policy parameters \(\theta\) and the Lagrange multipliers \(\lambda\):

Policy update (gradient ascent on \(L\)):

\[
\theta \leftarrow \theta + \eta_\theta \nabla_\theta L(\pi_\theta, \lambda)
\]

Multiplier update (gradient descent on \(L\), projected onto the non-negative domain):

\[
\lambda_i \leftarrow \left[\lambda_i + \eta_\lambda \left(J_{c_i}(\pi_\theta) - d_i\right)\right]_{+}
\]
Implementation Details
```python
import torch


class LagrangianSafeRL:
    def __init__(self, cost_limit, lr_lambda=0.01):
        self.cost_limit = cost_limit
        # Parameterize λ = exp(log_lambda) so the multiplier stays non-negative
        self.log_lambda = torch.zeros(1, requires_grad=True)
        self.lambda_optimizer = torch.optim.Adam([self.log_lambda], lr=lr_lambda)

    def compute_loss(self, reward_loss, cost_value):
        # Policy loss = original loss + λ * (cost value - threshold);
        # minimizing it maximizes J(π) - λ (J_c(π) - d) with respect to the policy.
        # λ is detached so the policy gradient treats it as a constant.
        lambda_val = self.log_lambda.exp().detach()
        total_loss = reward_loss + lambda_val * (cost_value - self.cost_limit)
        return total_loss

    def update_lambda(self, cost_value):
        # Gradient descent on -λ (J_c - d): λ grows when the constraint is
        # violated (cost_value > cost_limit) and shrinks otherwise.
        lambda_loss = -self.log_lambda.exp() * (cost_value - self.cost_limit)
        self.lambda_optimizer.zero_grad()
        lambda_loss.backward()
        self.lambda_optimizer.step()
```
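Two details of this sketch are worth noting: the multiplier is parameterized as \(\lambda = \exp(\log \lambda)\) so it stays non-negative without explicit projection, and \(\lambda\) is detached in `compute_loss` so that only `update_lambda` moves it. The quantity passed to `compute_loss` should be differentiable with respect to the policy (e.g. a cost surrogate), whereas `update_lambda` can simply receive the measured average episode cost.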
Pros and Cons
Pros:
- Simple to implement, compatible with any RL algorithm
- Theoretical guarantees (convergence under convex constraints)
Cons:
- Constraints may be temporarily violated during training
- Multiplier updates may oscillate
- Sensitive to hyperparameters (multiplier learning rate)
CPO (Constrained Policy Optimization)
Motivation
Achiam et al. (2017) proposed CPO to guarantee that every policy update satisfies the constraints.
Core Idea
Incorporate the cost constraints within TRPO's trust-region framework. In its surrogate form, each update solves:

\[
\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\left[A^{\pi_k}(s, a)\right]
\quad \text{s.t.} \quad
J_{c_i}(\pi_k) + \frac{1}{1-\gamma}\, \mathbb{E}\left[A^{\pi_k}_{c_i}(s, a)\right] \le d_i, \qquad
\bar{D}_{\mathrm{KL}}(\pi \,\|\, \pi_k) \le \delta
\]
Solution Method
Solved via the dual problem, finding the optimal update direction within the KL-constrained ellipsoid that also satisfies cost constraints:
- Compute return advantages and cost advantages
- Search for the optimal direction within the feasible region satisfying both KL and cost constraints
- Use line search to ensure constraint satisfaction (sketched below)
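The backtracking line search in the last step can be sketched as follows. This is a simplified illustration rather than the full CPO dual solution; the update direction, the surrogate/KL/cost estimators, and all names here are assumptions.

```python
def backtracking_line_search(theta, step, surrogate, kl, cost,
                             delta, cost_limit, max_backtracks=10):
    """Hedged sketch of CPO-style backtracking (all names are illustrative).

    theta      : current policy parameters (e.g. a flat numpy array)
    step       : proposed update direction from the trust-region subproblem
    surrogate  : callable theta -> estimated reward surrogate
    kl         : callable theta -> estimated mean KL to the old policy
    cost       : callable theta -> estimated expected cost J_c
    delta      : KL trust-region radius
    cost_limit : constraint threshold d
    """
    base = surrogate(theta)
    for i in range(max_backtracks):
        candidate = theta + (0.5 ** i) * step    # shrink the step on each retry
        improves = surrogate(candidate) > base   # reward surrogate improves
        kl_ok = kl(candidate) <= delta           # stays inside the trust region
        cost_ok = cost(candidate) <= cost_limit  # satisfies the cost constraint
        if improves and kl_ok and cost_ok:
            return candidate
    return theta  # reject the update if no acceptable step is found
```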
Relationship to TRPO/PPO
| Method | Constraint Type | Safety Guarantee |
|---|---|---|
| TRPO | KL divergence only | None |
| PPO | Clipped ratio | None |
| CPO | KL divergence + cost constraints | Per-step guarantee |
Safety Layers
Concept
Dalal et al. (2018) added a safety correction layer after the policy network output. Given the action \(\mu_\theta(s)\) proposed by the policy, the layer solves:

\[
a^{*} = \arg\min_{a} \; \tfrac{1}{2}\left\|a - \mu_\theta(s)\right\|^{2}
\quad \text{s.t.} \quad c_i(s, a) \le d_i, \;\; i = 1, \dots, k
\]

This projects the RL-output action onto the safe constraint set.
Linearization Approximation
When the constraint function is complex or unknown, a first-order Taylor expansion in the action is used:

\[
c_i(s, a) \approx c_i(s) + g_i(s)^{\top} a
\]

where \(g_i(s)\) is the (learned or analytic) sensitivity of the \(i\)-th cost with respect to the action. This converts the problem to a QP under linear constraints, which can be solved efficiently.
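With a single active linear constraint, the QP even admits a closed-form solution, as in Dalal et al. (2018). A minimal sketch, assuming the linearization coefficients are already available:

```python
import numpy as np


def safety_layer_project(action, g, c, d):
    """Project a proposed action onto a single linearized cost constraint.

    Solves  min_a 0.5 * ||a - action||^2  s.t.  c + g^T a <= d,
    which has a closed-form solution when only one constraint is active.

    action : action proposed by the RL policy, shape (action_dim,)
    g      : sensitivity of the cost w.r.t. the action, shape (action_dim,)
    c      : predicted cost at the current state (before applying the action)
    d      : constraint threshold
    """
    violation = c + g @ action - d
    if violation <= 0:
        return action                      # already safe, no correction needed
    lam = violation / (g @ g + 1e-8)       # optimal multiplier (clipped at zero)
    return action - lam * g                # minimal correction onto the boundary
```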
Advantages
- Compatible with any RL algorithm
- Provides safety guarantees at inference time as well
- Does not require modifying the RL training algorithm
Shielding Methods
Formal Verification Shielding
Define safety specifications using formal methods (e.g., temporal logic) and verify and block unsafe actions at runtime:
Safety specification: expressed using Linear Temporal Logic (LTL), for example

\[
\varphi = \mathbf{G}\,(\neg \text{collision}) \;\wedge\; \mathbf{F}\,(\text{goal})
\]

meaning "never collide, and eventually reach the goal."
Shielding mechanism:
- Precompute or compute online the set of safe states
- At each step, check the action proposed by the RL policy
- If the action would lead outside the safe set, replace it with a safe action
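At its core the shield is a runtime filter around the policy. A minimal sketch, assuming a membership test for the precomputed safe set and a certified backup action are available (both names are illustrative):

```python
def shielded_action(state, proposed_action, is_safe, safe_fallback):
    """Runtime shield sketch: veto unsafe actions (all names are illustrative).

    is_safe(state, action) -> bool  : membership test against the precomputed
                                      (or online-computed) safe set
    safe_fallback(state)   -> action: a certified safe backup action
    """
    if is_safe(state, proposed_action):
        return proposed_action          # the RL action passes the shield
    return safe_fallback(state)         # otherwise override with a safe action
```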
Reactive Shield
- Runs online with low latency
- Only considers short-term safety
- May be overly conservative
Predictive Shield
- Considers safety over multiple future steps
- Uses model predictions
- Balances safety and performance
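A predictive shield can be sketched by rolling a dynamics model forward over a short horizon and checking each predicted state; the model, the state-level safety check, and the backup policy below are all assumptions.

```python
def predictive_shield(state, proposed_action, model, state_is_safe,
                      backup_policy, horizon=5):
    """Sketch of a predictive shield (illustrative names and interfaces).

    model(state, action) -> next_state : one-step dynamics prediction
    state_is_safe(state) -> bool       : state-level safety check
    backup_policy(state) -> action     : recovery behavior assumed after the
                                         proposed action
    """
    s, a = state, proposed_action
    for _ in range(horizon):
        s = model(s, a)                     # predict the next state
        if not state_is_safe(s):
            return backup_policy(state)     # predicted violation: override now
        a = backup_policy(s)                # assume recovery behavior afterwards
    return proposed_action                  # no predicted violation within horizon
```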
Sim-to-Real Safety Transfer
Challenges
Transferring safe policies from simulation to the real world faces:
- Dynamics gap: Imperfect simulators
- Sensor noise: Real sensors are noisy
- Unmodeled disturbances: Unknown factors in the real environment
Robust Safety Methods
Domain Randomization:
Randomize physical parameters in simulation so that learned safety constraints are more robust.
Robust CMDP:
Optimize and satisfy constraints under worst-case transition dynamics.
Safety Margins:
Add conservative safety margins to constraints to compensate for the sim-to-real gap.
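A rough sketch of the last two ideas combined; the parameter ranges, the margin value, and the helper names are illustrative assumptions, not values from any benchmark.

```python
import numpy as np


def effective_cost_limit(d, margin):
    """Tighten the training-time constraint so real-world cost stays below d."""
    return d - margin                          # e.g. d = 25.0, margin = 5.0


def sample_randomized_dynamics(rng):
    """Sample simulator parameters for one episode (domain randomization)."""
    return {
        "friction":   rng.uniform(0.6, 1.4),   # scale around the nominal value
        "mass_scale": rng.uniform(0.8, 1.2),
        "obs_noise":  rng.uniform(0.0, 0.02),  # sensor noise standard deviation
    }


rng = np.random.default_rng(0)
episode_params = sample_randomized_dynamics(rng)        # apply to the simulator
train_limit = effective_cost_limit(d=25.0, margin=5.0)  # use during training
```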
Related Topics
- Safety and Robustness: Safety considerations in deployment
- Reward Engineering: Reward design in constrained optimization
Practical Recommendations
| Requirement | Recommended Method |
|---|---|
| Simple constraints, quick implementation | Lagrangian methods |
| Per-step safety guarantee | CPO |
| Compatibility with existing RL | Safety layers |
| Provable safety | Formal shielding |
| Sim-to-Real | Robust CMDP + safety margins |
References
- Altman, "Constrained Markov Decision Processes" (1999)
- Achiam et al., "Constrained Policy Optimization" (ICML 2017)
- Dalal et al., "Safe Exploration in Continuous Action Spaces" (2018)
- Alshiekh et al., "Safe Reinforcement Learning via Shielding" (AAAI 2018)
- Ray et al., "Benchmarking Safe Exploration in Deep Reinforcement Learning" (2019)
- Tessler et al., "Reward Constrained Policy Optimization" (ICLR 2019)