
Reward Engineering

Overview

The reward function is at the heart of reinforcement learning — it defines the agent's optimization objective. However, designing a correct and efficient reward function is often the most challenging part of RL applications. This section discusses key topics including reward shaping, reward curricula, multi-objective rewards, and reward hacking.

Reward Shaping

Basic Concept

Reward shaping guides learning by adding supplementary reward signals on top of the environment reward, ideally without changing the optimal policy.

The original reward \(R(s, a, s')\) is modified to:

\[R'(s, a, s') = R(s, a, s') + F(s, s')\]

where \(F(s, s')\) is the shaping function.

Potential-Based Reward Shaping

Ng et al. (1999) proved that as long as the shaping function takes the following form, the optimal policy is preserved:

\[F(s, s') = \gamma \Phi(s') - \Phi(s)\]

where \(\Phi: \mathcal{S} \to \mathbb{R}\) is a potential function.

Theorem: Under potential-based reward shaping, the original MDP and the shaped MDP share the same set of optimal policies.

Intuition: The potential function is analogous to potential energy in physics — the potential difference depends only on the start and end points, not the path, and therefore does not introduce "shortcuts."

Practical example:

import numpy as np

# Potential function for maze navigation: negative Euclidean distance to the goal
def potential(state, goal):
    return -np.linalg.norm(state - goal)

# Shaped reward: F(s, s') = gamma * Phi(s') - Phi(s)
def shaped_reward(s, s_next, goal, gamma, original_reward):
    F = gamma * potential(s_next, goal) - potential(s, goal)
    return original_reward + F

Risks of Non-Potential Shaping

If \(F\) does not satisfy the potential function form, it may:

  • Alter the optimal policy
  • Introduce cyclic behavior (the agent loops through high-reward regions)
  • Converge to suboptimal solutions

Sparse vs. Dense Rewards

Sparse Rewards

Rewards are given only at key events (e.g., reaching the goal, completing the task):

\[R(s, a) = \begin{cases} 1 & \text{if } s \in \mathcal{G} \\ 0 & \text{otherwise} \end{cases}\]

where \(\mathcal{G}\) is the set of goal states.

Pros:

  • Clear objective, less prone to introducing bias
  • Closer to the true task definition

Cons:

  • Extremely sparse learning signal, making exploration difficult
  • Requires many interactions before the first reward is obtained

Dense Rewards

Informative feedback is provided at every step:

\[R(s, a) = -\|s - g\| + \alpha \cdot v_{\text{forward}} - \beta \cdot \|a\|^2\]

where \(g\) is the goal, \(v_{\text{forward}}\) is the forward velocity, and \(\alpha, \beta\) are weighting coefficients.

Pros:

  • High learning efficiency
  • Rich gradient signal

Cons:

  • Prone to designer bias
  • May lead to reward hacking
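
To make the contrast concrete, here is a minimal sketch of the two styles for a goal-reaching task; the success radius is an assumption, and the first action component stands in for the forward velocity \(v_{\text{forward}}\):

import numpy as np

GOAL_RADIUS = 0.1  # assumed success threshold

def sparse_reward(state, goal):
    # Reward 1 only inside the goal region, 0 everywhere else
    return 1.0 if np.linalg.norm(state - goal) < GOAL_RADIUS else 0.0

def dense_reward(state, action, goal, alpha=1.0, beta=0.01):
    # Negative distance to goal + forward-velocity bonus - action cost,
    # mirroring the formula above (action[0] is an illustrative stand-in
    # for the forward velocity)
    distance_term = -np.linalg.norm(state - goal)
    velocity_bonus = alpha * action[0]
    action_cost = beta * np.sum(action ** 2)
    return distance_term + velocity_bonus - action_cost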

Hybrid Strategies

In practice, hybrid approaches are common:

  1. Sparse goal reward + potential-based shaping
  2. Curriculum rewards: Gradual transition from dense to sparse
  3. Hierarchical rewards: Dense rewards for subgoals, sparse reward for the final goal

Reward Curriculum

Basic Idea

The reward function is adjusted dynamically as training progresses, guiding the agent from simple to complex objectives:

\[R_t(s, a) = (1 - \lambda_t) R_{\text{dense}}(s, a) + \lambda_t R_{\text{sparse}}(s, a)\]

where \(\lambda_t\) increases from 0 to 1 over the course of training.
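
A minimal sketch of this blending, assuming a simple linear schedule for \(\lambda_t\) (cosine or piecewise schedules work equally well):

def curriculum_lambda(step, total_steps):
    # Linear schedule: 0 at the start of training, 1 at the end
    return min(1.0, step / total_steps)

def curriculum_reward(r_dense, r_sparse, step, total_steps):
    lam = curriculum_lambda(step, total_steps)
    # Early training is dominated by the dense reward, late training by the sparse one
    return (1.0 - lam) * r_dense + lam * r_sparse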

Reward Annealing

Analogous to simulated annealing, the weight of auxiliary reward signals is gradually reduced over the course of training:

  1. Early stage: Provide rich intermediate rewards to guide learning
  2. Middle stage: Gradually reduce auxiliary rewards
  3. Late stage: Retain only the true task reward

Automatic Curricula

Meta-learning or evolutionary strategies can be used to automatically design reward curricula, avoiding manual tuning.

Multi-Objective Rewards

Linear Weighting

The simplest multi-objective approach:

\[R(s, a) = \sum_{i=1}^{k} w_i R_i(s, a)\]

Issue: Choosing the weights is difficult, and the scales of the individual reward components can differ substantially.
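
One common remedy is to normalize each component with running statistics before applying the weights. A minimal sketch, assuming Welford's online algorithm for the running mean and variance:

import numpy as np

class RunningNormalizer:
    """Tracks the running mean/std of one reward component (Welford's algorithm)."""
    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return (x - self.mean) / std

def weighted_reward(components, weights, normalizers):
    # Normalize each component to a comparable scale, then apply its weight
    total = 0.0
    for name, value in components.items():
        normalizers[name].update(value)
        total += weights[name] * normalizers[name].normalize(value)
    return total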

Constrained Optimization

Some objectives can be converted into constraints (see Safe Reinforcement Learning):

\[\max_\pi J_{\text{main}}(\pi) \quad \text{s.t.} \quad J_{c_i}(\pi) \leq d_i\]
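
In practice such constraints are often handled with Lagrangian relaxation: each constraint cost is subtracted from the main reward with a multiplier that is adapted by dual ascent. A minimal sketch; the update rule and learning rate are illustrative assumptions:

def lagrangian_reward(r_main, costs, multipliers):
    # Penalize each constraint cost by its current Lagrange multiplier
    return r_main - sum(lam * c for lam, c in zip(multipliers, costs))

def update_multipliers(multipliers, avg_costs, limits, lr=0.01):
    # Dual ascent: increase a multiplier when its constraint is violated on average,
    # decrease it (but never below zero) when the constraint is satisfied
    return [max(0.0, lam + lr * (c - d))
            for lam, c, d in zip(multipliers, avg_costs, limits)]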

Pareto Optimality

Seek a set of Pareto-optimal policies across multiple objectives rather than a single optimum.

Practical Reward Composition

import numpy as np

# is_goal, distance, and is_unsafe are task-specific helpers assumed to be defined elsewhere.
def compute_reward(state, action, next_state, goal):
    # Task reward (sparse)
    task_reward = 10.0 if is_goal(next_state) else 0.0

    # Progress reward (dense): positive when the agent moves closer to the goal
    progress = distance(state, goal) - distance(next_state, goal)

    # Safety penalty
    safety_penalty = -100.0 if is_unsafe(next_state) else 0.0

    # Energy cost
    energy_cost = -0.01 * np.sum(action ** 2)

    return task_reward + 1.0 * progress + safety_penalty + energy_cost

Reward from Human Feedback

Connection to RLHF

Human feedback can serve as a source of reward signals (see LLM Post-Training for details):

  1. Collect human preference comparisons over behaviors
  2. Train a reward model \(r_\phi(s, a)\)
  3. Use the learned reward model to train the policy
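
Step 2 is typically done by fitting a Bradley-Terry model to the collected comparisons. A minimal sketch of the per-pair loss, assuming a generic reward_model callable that scores a trajectory segment:

import numpy as np

def preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry model: P(preferred > rejected) = sigmoid(r_pref - r_rej).
    # The loss is the negative log-likelihood of the human label.
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -np.log(1.0 / (1.0 + np.exp(-(r_pref - r_rej))))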

Reward from VLMs

Vision-Language Models (VLMs) can serve as reward functions:

  • Describe the goal in natural language
  • The VLM evaluates how well the current state matches the goal description
  • Provides dense, semantic-level reward signals

Example:

\[r(s) = \text{sim}(\text{VLM}(s), \text{goal\_description})\]
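
A minimal CLIP-style sketch of this idea, where embed_image and embed_text are hypothetical stand-ins for the VLM's encoders:

import numpy as np

def vlm_reward(observation, goal_description, embed_image, embed_text):
    # Embed the current observation and the goal text, then score their
    # cosine similarity as a dense, semantic-level reward
    img = embed_image(observation)
    txt = embed_text(goal_description)
    return float(np.dot(img, txt) /
                 (np.linalg.norm(img) * np.linalg.norm(txt) + 1e-8))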

Advantages:

  • No need for hand-crafted rewards
  • Handles complex semantic objectives
  • Naturally supports open-world tasks

Reward Hacking

Definition

Reward hacking occurs when the agent finds a way to maximize the reward function without accomplishing the designer's intended task.

Classic Examples

| Task | Designed Reward | Hacking Behavior |
| --- | --- | --- |
| Boat racing | Collect coins | Circles back to collect the same coins repeatedly |
| Cleaning robot | -(amount of dust) | Pushes dust out of sight |
| Soccer | Ball possession time | Dribbles in place, never shoots |
| Code generation | Pass tests | Hard-codes the test answers |

Causes of Reward Hacking

  1. Gap between proxy metrics and true objectives: The reward function is an approximation of the real goal
  2. Out-of-distribution behavior: The policy may discover exploits outside the training distribution
  3. Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"

Mitigation Strategies

1. Diversified Reward Signals

Use multiple complementary reward components to reduce the risk of any single metric being hacked.

2. Adversarial Reward Design

Proactively anticipate possible hacking behaviors during reward design and add safeguards.

3. Reward Model Ensembles

Use an ensemble of reward models to reduce the bias of any single model:

\[r(s, a) = \frac{1}{K} \sum_{k=1}^{K} r_k(s, a)\]
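
A minimal sketch of the ensemble average; a pessimistic variant could additionally subtract a multiple of the models' disagreement (standard deviation):

import numpy as np

def ensemble_reward(reward_models, state, action):
    # Average the predictions of K independently trained reward models
    scores = np.array([r(state, action) for r in reward_models])
    return float(scores.mean())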

4. Human-in-the-Loop

Periodically have humans review the agent's behavior to detect and correct reward hacking.

5. Constrained Optimization

Impose safety conditions and behavioral norms as hard constraints rather than relying solely on rewards.

Practical Guide

Reward Function Design Workflow

  1. Clarify the task objective: Precisely describe the desired behavior in natural language
  2. Design initial reward: Start simple, preferring sparse rewards
  3. Add shaping: If learning efficiency is insufficient, add potential-based shaping
  4. Test robustness: Check for possible reward hacking
  5. Iterate: Adjust rewards based on the agent's actual behavior

Common Pitfalls

  • Scale mismatch between reward components
  • Forgetting rewards in terminal conditions
  • Dense rewards introducing unnecessary bias
  • Lacking penalties for undesired behaviors

References

  • Ng et al., "Policy Invariance Under Reward Transformations" (ICML 1999)
  • Amodei et al., "Concrete Problems in AI Safety" (2016)
  • Christiano et al., "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017)
  • Clark & Amodei, "Faulty Reward Functions in the Wild" (2016)
  • Ma et al., "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)
