Safe Reinforcement Learning

Motivation

In real-world applications, maximizing return is not the only objective — policies must also satisfy safety constraints. For example:

  • Autonomous driving: Must not collide with pedestrians
  • Robotic surgery: Must not exceed torque thresholds
  • Financial trading: Must not exceed risk budgets
  • Power systems: Must not violate operational limits

Safe Reinforcement Learning (Safe RL) studies how to optimize policies under constraints.

Constrained Markov Decision Process (CMDP)

Definition

A CMDP augments the standard MDP with constraints:

\[\max_\pi J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\]
\[\text{s.t.} \quad J_{c_i}(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t c_i(s_t, a_t)\right] \leq d_i, \quad i = 1, \ldots, k\]

where:

  • \(c_i(s, a)\): The \(i\)-th cost function
  • \(d_i\): The upper bound for the \(i\)-th constraint
  • \(k\): Number of constraints
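
Both the return objective and the cost objectives can be estimated from rollouts. Below is a minimal Monte Carlo sketch; the env.step signature and the info["cost"] convention (a Safety Gym-style convention) are assumptions for illustration, not part of the definition.

import numpy as np

def estimate_objectives(env, policy, gamma=0.99, episodes=100, horizon=1000):
    # Monte Carlo estimates of J(pi) and J_c(pi) for a single cost signal.
    # Assumes env.step returns (obs, reward, done, info) and exposes the
    # safety signal as info["cost"].
    returns, cost_returns = [], []
    for _ in range(episodes):
        obs = env.reset()
        ret, cost_ret, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            obs, reward, done, info = env.step(policy(obs))
            ret += discount * reward
            cost_ret += discount * info.get("cost", 0.0)
            discount *= gamma
            if done:
                break
        returns.append(ret)
        cost_returns.append(cost_ret)
    return np.mean(returns), np.mean(cost_returns)  # J(pi), J_c(pi)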

Difference from Multi-Objective Rewards

| Property | Multi-Objective Rewards | CMDP |
|---|---|---|
| Safety guarantees | No hard guarantees | Constraints must be satisfied |
| Weight tuning | Manual weight adjustment | Constraint thresholds have a clear meaning |
| Policy space | Full policy space | Feasible policy subset |

Lagrangian Methods

Basic Idea

Convert the constrained optimization into an unconstrained saddle-point problem:

\[\mathcal{L}(\pi, \lambda) = J(\pi) - \sum_{i=1}^k \lambda_i (J_{c_i}(\pi) - d_i)\]
\[\max_\pi \min_{\lambda \geq 0} \mathcal{L}(\pi, \lambda)\]

Primal-Dual Updates

Alternately update policy parameters and Lagrange multipliers:

Policy update (gradient ascent):

\[\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \mathcal{L}(\pi_\theta, \lambda)\]

Multiplier update (gradient descent, projected onto non-negative domain):

\[\lambda_i \leftarrow \max\left(0, \lambda_i + \alpha_\lambda (J_{c_i}(\pi) - d_i)\right)\]

Implementation Details

import torch

class LagrangianSafeRL:
    def __init__(self, cost_limit, lr_lambda=0.01):
        self.cost_limit = cost_limit
        # Parameterize lambda = exp(log_lambda) so the multiplier stays positive.
        self.log_lambda = torch.zeros(1, requires_grad=True)
        self.lambda_optimizer = torch.optim.Adam([self.log_lambda], lr=lr_lambda)

    def compute_loss(self, reward_loss, cost_value):
        # Policy loss = reward loss + λ * (cost value - threshold).
        # λ is detached: the policy step treats the multiplier as a constant.
        lambda_val = self.log_lambda.exp().detach()
        total_loss = reward_loss + lambda_val * (cost_value - self.cost_limit)
        return total_loss

    def update_lambda(self, cost_value):
        # Dual step: raise λ when the measured cost exceeds the limit, lower it otherwise.
        # cost_value is typically passed as a detached tensor or a float.
        lambda_loss = -self.log_lambda.exp() * (cost_value - self.cost_limit)
        self.lambda_optimizer.zero_grad()
        lambda_loss.backward()
        self.lambda_optimizer.step()
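
A toy illustration of the update order (the numbers are arbitrary placeholders; in practice reward_loss is the policy surrogate loss and the cost value is the estimated discounted episode cost):

safe = LagrangianSafeRL(cost_limit=25.0)

# Placeholder tensors standing in for quantities computed from a rollout batch.
reward_loss = torch.tensor(1.3, requires_grad=True)
episode_cost = torch.tensor(31.0)

policy_loss = safe.compute_loss(reward_loss, episode_cost)
policy_loss.backward()            # primal step: gradients flow to the policy parameters
safe.update_lambda(episode_cost)  # dual step: cost above the limit, so lambda increases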

Pros and Cons

Pros:

  • Simple to implement, compatible with any RL algorithm
  • Theoretical guarantees (convergence under convex constraints)

Cons:

  • Constraints may be temporarily violated during training
  • Multiplier updates may oscillate
  • Sensitive to hyperparameters (multiplier learning rate)

CPO (Constrained Policy Optimization)

Motivation

Achiam et al. (2017) proposed CPO to enforce the constraints at every policy update, with a theoretical bound on the worst-case constraint violation.

Core Idea

Incorporate constraints within TRPO's trust region framework:

\[\max_\pi \; \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}(s, a)\right]\]
\[\text{s.t.} \quad \bar{D}_{\text{KL}}(\pi \| \pi_{\text{old}}) \leq \delta\]
\[\quad \quad J_{c_i}(\pi_{\text{old}}) + \frac{1}{1-\gamma} \mathbb{E}_{s \sim d_{\pi_{\text{old}}}} \left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}_{c_i}(s, a)\right] \leq d_i\]

Solution Method

The subproblem is solved via its dual: linearize the objective and cost constraints, approximate the KL divergence by a quadratic (Fisher) form, and find the best update direction inside the KL trust region that also satisfies the cost constraints (a simplified sketch follows the steps below):

  1. Compute return advantages and cost advantages
  2. Search for the optimal direction within the feasible region satisfying both KL and cost constraints
  3. Use line search to ensure constraint satisfaction
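
The sketch below illustrates the flavor of the update in a simplified form: the objective and cost constraint are linearized, the KL term is replaced by a quadratic form in the Fisher matrix H, and the reward step is projected onto the cost boundary when needed. This is an illustrative simplification, not the full analytic dual solution or conjugate-gradient machinery from the CPO paper.

import numpy as np

def cpo_step(g, b, H, delta, c):
    # g: objective gradient, b: cost-constraint gradient, H: Fisher matrix,
    # delta: KL trust-region radius, c: constraint slack J_c(pi_old) - d.
    Hinv_g = np.linalg.solve(H, g)
    Hinv_b = np.linalg.solve(H, b)
    # Reward-only natural-gradient step, scaled to the KL boundary.
    x = np.sqrt(2 * delta / (g @ Hinv_g + 1e-8)) * Hinv_g
    if c + b @ x <= 0:
        return x  # linearized cost constraint already satisfied
    if c > 0:
        # Infeasible start: recovery step that decreases cost fastest in the trust region.
        return -np.sqrt(2 * delta / (b @ Hinv_b + 1e-8)) * Hinv_b
    # Project the reward step onto the cost boundary b^T x + c = 0 (in the H metric),
    # then rescale so the quadratic KL estimate stays within delta.
    x = x - ((b @ x + c) / (b @ Hinv_b + 1e-8)) * Hinv_b
    kl = 0.5 * x @ (H @ x)
    if kl > delta:
        x *= np.sqrt(delta / kl)
    return x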

Relationship to TRPO/PPO

| Method | Constraint Type | Safety Guarantee |
|---|---|---|
| TRPO | KL divergence only | None |
| PPO | Clipped ratio | None |
| CPO | KL divergence + cost constraints | Approximate guarantee at every update |

Safety Layers

Concept

Dalal et al. (2018) added a safety correction layer after the policy network output:

\[a_{\text{safe}} = \arg\min_{a'} \|a' - a_{\text{RL}}\|^2 \quad \text{s.t.} \quad g(s, a') \leq 0\]

This projects the RL-output action onto the safe constraint set.

Linearization Approximation

When the constraint function \(g\) is complex, a first-order Taylor expansion is used:

\[g(s, a') \approx g(s, a_{\text{RL}}) + \nabla_a g(s, a_{\text{RL}})^T (a' - a_{\text{RL}})\]

This converts the problem to a QP under linear constraints, which can be solved efficiently.
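
For a single linearized constraint the QP has a closed-form solution: move the action along the constraint gradient just far enough to reach the boundary. A minimal sketch (function and argument names are illustrative; multiple constraints require a generic QP solver):

import numpy as np

def project_action(a_rl, g_val, g_grad):
    # a_rl:   action proposed by the RL policy
    # g_val:  g(s, a_rl), constraint value at the proposed action
    # g_grad: gradient of g with respect to the action at a_rl
    if g_val <= 0:
        return a_rl  # already safe, no correction needed
    lam = g_val / (g_grad @ g_grad + 1e-8)  # active-constraint multiplier
    return a_rl - lam * g_grad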

Advantages

  • Compatible with any RL algorithm
  • Provides safety guarantees at inference time as well
  • Does not require modifying the RL training algorithm

Shielding Methods

Formal Verification Shielding

Define safety specifications with formal methods (e.g., temporal logic), then check the RL policy's actions at runtime and block any that are unsafe:

Safety specification: Expressed using Linear Temporal Logic (LTL)

\[\varphi = \Box(\neg \text{collision}) \land \Diamond(\text{goal})\]

Meaning "never collide, and eventually reach the goal."

Shielding mechanism:

  1. Precompute or compute online the set of safe states
  2. At each step, check the action proposed by the RL policy
  3. If the action would lead outside the safe set, replace it with a safe action
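
In code, the mechanism reduces to a thin wrapper around the environment step; the shield interface below (is_safe, safe_action) is an assumed illustration rather than a specific library API:

def shielded_step(env, policy, shield, state):
    # Query the RL policy, let the shield veto unsafe actions, then step the environment.
    action = policy(state)
    if not shield.is_safe(state, action):
        action = shield.safe_action(state)  # substitute a verified safe action
    return env.step(action)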

Reactive Shield

  • Runs online with low latency
  • Only considers short-term safety
  • May be overly conservative

Predictive Shield

  • Considers safety over multiple future steps
  • Uses model predictions
  • Balances safety and performance
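
A predictive shield can be sketched as a short model rollout that vetoes the proposed action if any predicted state leaves the safe set; model, is_safe, and backup are assumed interfaces:

def predictive_shield(state, action, model, policy, is_safe, backup, horizon=5):
    # Simulate `horizon` steps under a learned dynamics model; if any predicted
    # state is unsafe, fall back to a backup (e.g., conservative) controller.
    s, a = state, action
    for _ in range(horizon):
        s = model(s, a)            # one-step prediction
        if not is_safe(s):
            return backup(state)   # predicted violation: override the action
        a = policy(s)              # continue the rollout with the nominal policy
    return action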

Sim-to-Real Safety Transfer

Challenges

Transferring safe policies from simulation to the real world faces:

  • Dynamics gap: Imperfect simulators
  • Sensor noise: Real sensors are noisy
  • Unmodeled disturbances: Unknown factors in the real environment

Robust Safety Methods

Domain Randomization:

Randomize physical parameters in simulation so that learned safety constraints are more robust.

Robust CMDP:

\[\max_\pi \min_{P \in \mathcal{P}} J(\pi, P) \quad \text{s.t.} \quad \max_{P \in \mathcal{P}} J_c(\pi, P) \leq d\]

Optimize and satisfy constraints under worst-case transition dynamics.

Safety Margins:

Add conservative safety margins to constraints to compensate for the sim-to-real gap.
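
A simple way to combine these ideas is to evaluate the cost constraint under the worst of several randomized dynamics and tighten the threshold by a margin; the helpers below (make_env, rollout_cost) are assumed, not a specific API:

import numpy as np

def worst_case_cost(policy, make_env, rollout_cost, param_sets, episodes=10):
    # Estimate the worst-case expected episode cost over candidate dynamics,
    # each represented by one set of randomized environment parameters.
    worst = -np.inf
    for params in param_sets:
        env = make_env(**params)
        mean_cost = np.mean([rollout_cost(env, policy) for _ in range(episodes)])
        worst = max(worst, mean_cost)
    return worst

# During training, enforce the tightened constraint
#     worst_case_cost(...) <= d - margin
# where `margin` absorbs residual sim-to-real error.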

Practical Recommendations

| Requirement | Recommended Method |
|---|---|
| Simple constraints, quick implementation | Lagrangian methods |
| Constraint satisfaction at every policy update | CPO |
| Compatibility with existing RL algorithms | Safety layers |
| Provable safety | Formal shielding |
| Sim-to-real transfer | Robust CMDP + safety margins |

References

  • Altman, "Constrained Markov Decision Processes" (1999)
  • Achiam et al., "Constrained Policy Optimization" (ICML 2017)
  • Dalal et al., "Safe Exploration in Continuous Action Spaces" (2018)
  • Alshiekh et al., "Safe Reinforcement Learning via Shielding" (AAAI 2018)
  • Ray et al., "Benchmarking Safe Exploration in Deep Reinforcement Learning" (2019)
  • Tessler et al., "Reward Constrained Policy Optimization" (ICLR 2019)
