Safe Reinforcement Learning

Motivation

In real-world applications, maximizing return is not the only objective — policies must also satisfy safety constraints. For example:

  • Autonomous driving: Must not collide with pedestrians
  • Robotic surgery: Must not exceed torque thresholds
  • Financial trading: Must not exceed risk budgets
  • Power systems: Must not violate operational limits

Safe Reinforcement Learning (Safe RL) studies how to optimize policies under constraints.

Constrained Markov Decision Process (CMDP)

Definition

A CMDP augments the standard MDP with constraints:

\[\max_\pi J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\]
\[\text{s.t.} \quad J_{c_i}(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t c_i(s_t, a_t)\right] \leq d_i, \quad i = 1, \ldots, k\]

where:

  • \(c_i(s, a)\): The \(i\)-th cost function
  • \(d_i\): The upper bound for the \(i\)-th constraint
  • \(k\): Number of constraints
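
Both the return objective and the cost objectives can be estimated from rollouts. Below is a minimal Monte Carlo sketch; the env.step signature and the info["cost"] convention (a Safety Gym-style convention) are assumptions for illustration, not part of the definition.

import numpy as np

def estimate_objectives(env, policy, gamma=0.99, episodes=100, horizon=1000):
    # Monte Carlo estimates of J(pi) and J_c(pi) for a single cost signal.
    # Assumes env.step returns (obs, reward, done, info) and exposes the
    # safety signal as info["cost"].
    returns, cost_returns = [], []
    for _ in range(episodes):
        obs = env.reset()
        ret, cost_ret, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            obs, reward, done, info = env.step(policy(obs))
            ret += discount * reward
            cost_ret += discount * info.get("cost", 0.0)
            discount *= gamma
            if done:
                break
        returns.append(ret)
        cost_returns.append(cost_ret)
    return np.mean(returns), np.mean(cost_returns)  # J(pi), J_c(pi)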

Difference from Multi-Objective Rewards

| Property | Multi-Objective Rewards | CMDP |
|---|---|---|
| Safety guarantees | No hard guarantees | Constraints must be satisfied |
| Weight tuning | Manual weight adjustment | Constraint thresholds have a clear meaning |
| Policy space | Full policy space | Feasible policy subset |

Lagrangian Methods

Basic Idea

Convert the constrained optimization into an unconstrained saddle-point problem:

\[\mathcal{L}(\pi, \lambda) = J(\pi) - \sum_{i=1}^k \lambda_i (J_{c_i}(\pi) - d_i)\]
\[\max_\pi \min_{\lambda \geq 0} \mathcal{L}(\pi, \lambda)\]

Primal-Dual Updates

Alternately update policy parameters and Lagrange multipliers:

Policy update (gradient ascent):

\[\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \mathcal{L}(\pi_\theta, \lambda)\]

Multiplier update (gradient descent, projected onto non-negative domain):

\[\lambda_i \leftarrow \max\left(0, \lambda_i + \alpha_\lambda (J_{c_i}(\pi) - d_i)\right)\]

Implementation Details

import torch

class LagrangianSafeRL:
    def __init__(self, cost_limit, lr_lambda=0.01):
        self.cost_limit = cost_limit
        # Parameterize lambda = exp(log_lambda) so the multiplier stays positive.
        self.log_lambda = torch.zeros(1, requires_grad=True)
        self.lambda_optimizer = torch.optim.Adam([self.log_lambda], lr=lr_lambda)

    def compute_loss(self, reward_loss, cost_value):
        # Policy loss = reward loss + λ * (cost value - threshold).
        # λ is detached: the policy step treats the multiplier as a constant.
        lambda_val = self.log_lambda.exp().detach()
        total_loss = reward_loss + lambda_val * (cost_value - self.cost_limit)
        return total_loss

    def update_lambda(self, cost_value):
        # Dual step: raise λ when the measured cost exceeds the limit, lower it otherwise.
        # cost_value is typically passed as a detached tensor or a float.
        lambda_loss = -self.log_lambda.exp() * (cost_value - self.cost_limit)
        self.lambda_optimizer.zero_grad()
        lambda_loss.backward()
        self.lambda_optimizer.step()
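
A toy illustration of the update order (the numbers are arbitrary placeholders; in practice reward_loss is the policy surrogate loss and the cost value is the estimated discounted episode cost):

safe = LagrangianSafeRL(cost_limit=25.0)

# Placeholder tensors standing in for quantities computed from a rollout batch.
reward_loss = torch.tensor(1.3, requires_grad=True)
episode_cost = torch.tensor(31.0)

policy_loss = safe.compute_loss(reward_loss, episode_cost)
policy_loss.backward()            # primal step: gradients flow to the policy parameters
safe.update_lambda(episode_cost)  # dual step: cost above the limit, so lambda increases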

Pros and Cons

Pros:

  • Simple to implement, compatible with any RL algorithm
  • Theoretical guarantees (convergence under convex constraints)

Cons:

  • Constraints may be temporarily violated during training
  • Multiplier updates may oscillate
  • Sensitive to hyperparameters (multiplier learning rate)

CPO (Constrained Policy Optimization)

Motivation

Achiam et al. (2017) proposed CPO to enforce the constraints at every policy update, with a theoretical bound on the worst-case constraint violation.

Core Idea

Incorporate constraints within TRPO's trust region framework:

\[\max_\pi \; \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}(s, a)\right]\]
\[\text{s.t.} \quad \bar{D}_{\text{KL}}(\pi \| \pi_{\text{old}}) \leq \delta\]
\[\quad \quad J_{c_i}(\pi_{\text{old}}) + \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}_{c_i}(s, a)\right] \leq d_i\]

Solution Method

The subproblem is solved via its dual: linearize the objective and cost constraints, approximate the KL divergence by a quadratic (Fisher) form, and find the best update direction inside the KL trust region that also satisfies the cost constraints (a simplified sketch follows the steps below):

  1. Compute return advantages and cost advantages
  2. Search for the optimal direction within the feasible region satisfying both KL and cost constraints
  3. Use line search to ensure constraint satisfaction
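
The sketch below illustrates the flavor of the update in a simplified form: the objective and cost constraint are linearized, the KL term is replaced by a quadratic form in the Fisher matrix H, and the reward step is projected onto the cost boundary when needed. This is an illustrative simplification, not the full analytic dual solution or conjugate-gradient machinery from the CPO paper.

import numpy as np

def cpo_step(g, b, H, delta, c):
    # g: objective gradient, b: cost-constraint gradient, H: Fisher matrix,
    # delta: KL trust-region radius, c: constraint slack J_c(pi_old) - d.
    Hinv_g = np.linalg.solve(H, g)
    Hinv_b = np.linalg.solve(H, b)
    # Reward-only natural-gradient step, scaled to the KL boundary.
    x = np.sqrt(2 * delta / (g @ Hinv_g + 1e-8)) * Hinv_g
    if c + b @ x <= 0:
        return x  # linearized cost constraint already satisfied
    if c > 0:
        # Infeasible start: recovery step that decreases cost fastest in the trust region.
        return -np.sqrt(2 * delta / (b @ Hinv_b + 1e-8)) * Hinv_b
    # Project the reward step onto the cost boundary b^T x + c = 0 (in the H metric),
    # then rescale so the quadratic KL estimate stays within delta.
    x = x - ((b @ x + c) / (b @ Hinv_b + 1e-8)) * Hinv_b
    kl = 0.5 * x @ (H @ x)
    if kl > delta:
        x *= np.sqrt(delta / kl)
    return x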

Relationship to TRPO/PPO

| Method | Constraint Type | Safety Guarantee |
|---|---|---|
| TRPO | KL divergence only | None |
| PPO | Clipped ratio | None |
| CPO | KL divergence + cost constraints | Approximate guarantee at every update |

Safety Layers

Concept

Dalal et al. (2018) added a safety correction layer after the policy network output:

\[a_{\text{safe}} = \arg\min_{a'} \|a' - a_{\text{RL}}\|^2 \quad \text{s.t.} \quad g(s, a') \leq 0\]

This projects the RL-output action onto the safe constraint set.

Linearization Approximation

When the constraint function \(g\) is complex, a first-order Taylor expansion is used:

\[g(s, a') \approx g(s, a_{\text{RL}}) + \nabla_a g(s, a_{\text{RL}})^T (a' - a_{\text{RL}})\]

This converts the problem to a QP under linear constraints, which can be solved efficiently.
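
For a single linearized constraint the QP has a closed-form solution: move the action along the constraint gradient just far enough to reach the boundary. A minimal sketch (function and argument names are illustrative; multiple constraints require a generic QP solver):

import numpy as np

def project_action(a_rl, g_val, g_grad):
    # a_rl:   action proposed by the RL policy
    # g_val:  g(s, a_rl), constraint value at the proposed action
    # g_grad: gradient of g with respect to the action at a_rl
    if g_val <= 0:
        return a_rl  # already safe, no correction needed
    lam = g_val / (g_grad @ g_grad + 1e-8)  # active-constraint multiplier
    return a_rl - lam * g_grad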

Advantages

  • Compatible with any RL algorithm
  • Provides safety guarantees at inference time as well
  • Does not require modifying the RL training algorithm

Shielding Methods

Formal Verification Shielding

Define safety specifications with formal methods (e.g., temporal logic), then check the RL policy's actions at runtime and block any that are unsafe:

Safety specification: Expressed using Linear Temporal Logic (LTL)

\[\varphi = \Box(\neg \text{collision}) \land \Diamond(\text{goal})\]

Meaning "never collide, and eventually reach the goal."

Shielding mechanism:

  1. Precompute or compute online the set of safe states
  2. At each step, check the action proposed by the RL policy
  3. If the action would lead outside the safe set, replace it with a safe action
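
In code, the mechanism reduces to a thin wrapper around the environment step; the shield interface below (is_safe, safe_action) is an assumed illustration rather than a specific library API:

def shielded_step(env, policy, shield, state):
    # Query the RL policy, let the shield veto unsafe actions, then step the environment.
    action = policy(state)
    if not shield.is_safe(state, action):
        action = shield.safe_action(state)  # substitute a verified safe action
    return env.step(action)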

Reactive Shield

  • Runs online with low latency
  • Only considers short-term safety
  • May be overly conservative

Predictive Shield

  • Considers safety over multiple future steps
  • Uses model predictions
  • Balances safety and performance
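
A predictive shield can be sketched as a short model rollout that vetoes the proposed action if any predicted state leaves the safe set; model, is_safe, and backup are assumed interfaces:

def predictive_shield(state, action, model, policy, is_safe, backup, horizon=5):
    # Simulate `horizon` steps under a learned dynamics model; if any predicted
    # state is unsafe, fall back to a backup (e.g., conservative) controller.
    s, a = state, action
    for _ in range(horizon):
        s = model(s, a)            # one-step prediction
        if not is_safe(s):
            return backup(state)   # predicted violation: override the action
        a = policy(s)              # continue the rollout with the nominal policy
    return action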

Sim-to-Real Safety Transfer

Challenges

Transferring safe policies from simulation to the real world faces:

  • Dynamics gap: Imperfect simulators
  • Sensor noise: Real sensors are noisy
  • Unmodeled disturbances: Unknown factors in the real environment

Robust Safety Methods

Domain Randomization:

Randomize physical parameters in simulation so that learned safety constraints are more robust.

Robust CMDP:

\[\max_\pi \min_{P \in \mathcal{P}} J(\pi, P) \quad \text{s.t.} \quad \max_{P \in \mathcal{P}} J_c(\pi, P) \leq d\]

Optimize and satisfy constraints under worst-case transition dynamics.

Safety Margins:

Add conservative safety margins to constraints to compensate for the sim-to-real gap.
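
A simple way to combine these ideas is to evaluate the cost constraint under the worst of several randomized dynamics and tighten the threshold by a margin; the helpers below (make_env, rollout_cost) are assumed, not a specific API:

import numpy as np

def worst_case_cost(policy, make_env, rollout_cost, param_sets, episodes=10):
    # Estimate the worst-case expected episode cost over candidate dynamics,
    # each represented by one set of randomized environment parameters.
    worst = -np.inf
    for params in param_sets:
        env = make_env(**params)
        mean_cost = np.mean([rollout_cost(env, policy) for _ in range(episodes)])
        worst = max(worst, mean_cost)
    return worst

# During training, enforce the tightened constraint
#     worst_case_cost(...) <= d - margin
# where `margin` absorbs residual sim-to-real error.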

Practical Recommendations

| Requirement | Recommended Method |
|---|---|
| Simple constraints, quick implementation | Lagrangian methods |
| Constraint satisfaction at every policy update | CPO |
| Compatibility with existing RL algorithms | Safety layers |
| Provable safety | Formal shielding |
| Sim-to-real transfer | Robust CMDP + safety margins |

References

  • Altman, "Constrained Markov Decision Processes" (1999)
  • Achiam et al., "Constrained Policy Optimization" (ICML 2017)
  • Dalal et al., "Safe Exploration in Continuous Action Spaces" (2018)
  • Alshiekh et al., "Safe Reinforcement Learning via Shielding" (AAAI 2018)
  • Ray et al., "Benchmarking Safe Exploration in Deep Reinforcement Learning" (2019)
  • Tessler et al., "Reward Constrained Policy Optimization" (ICLR 2019)
