
Reward Engineering

Overview

The reward function is at the heart of reinforcement learning — it defines the agent's optimization objective. However, designing a correct and efficient reward function is often the most challenging part of RL applications. This section discusses key topics including reward shaping, reward curricula, multi-objective rewards, and reward hacking.

Reward Shaping

Basic Concept

Reward shaping guides learning by adding supplementary reward signals on top of the environment reward, ideally without changing the optimal policy.

The original reward \(R(s, a, s')\) is modified to:

\[R'(s, a, s') = R(s, a, s') + F(s, s')\]

where \(F(s, s')\) is the shaping function.

Potential-Based Reward Shaping

Ng et al. (1999) proved that as long as the shaping function takes the following form, the optimal policy is preserved:

\[F(s, s') = \gamma \Phi(s') - \Phi(s)\]

where \(\Phi: \mathcal{S} \to \mathbb{R}\) is a potential function.

Theorem: Under potential-based reward shaping, the original MDP and the shaped MDP share the same set of optimal policies.

Intuition: The potential function is analogous to potential energy in physics — the potential difference depends only on the start and end points, not the path, and therefore does not introduce "shortcuts."

Practical example:

import numpy as np

# Potential function for maze navigation: negative Euclidean distance to the goal
def potential(state, goal):
    return -np.linalg.norm(state - goal)

# Shaped reward: F(s, s') = gamma * Phi(s') - Phi(s)
def shaped_reward(s, s_next, goal, gamma, original_reward):
    F = gamma * potential(s_next, goal) - potential(s, goal)
    return original_reward + F

Risks of Non-Potential Shaping

If \(F\) does not satisfy the potential function form, it may:

  • Alter the optimal policy
  • Introduce cyclic behavior (the agent loops through high-reward regions)
  • Converge to suboptimal solutions

Sparse vs. Dense Rewards

Sparse Rewards

Rewards are given only at key events (e.g., reaching the goal, completing the task):

\[R(s, a) = \begin{cases} 1 & \text{if } s \in \mathcal{G} \\ 0 & \text{otherwise} \end{cases}\]

where \(\mathcal{G}\) is the set of goal states.

Pros:

  • Clear objective, less prone to introducing bias
  • Closer to the true task definition

Cons:

  • Extremely sparse learning signal, making exploration difficult
  • Requires many interactions before the first reward is obtained

Dense Rewards

Informative feedback is provided at every step:

\[R(s, a) = -\|s - g\| + \alpha \cdot v_{\text{forward}} - \beta \cdot \|a\|^2\]

where \(g\) is the goal, \(v_{\text{forward}}\) is the forward velocity, and \(\alpha, \beta\) are weighting coefficients.

Pros:

  • High learning efficiency
  • Rich gradient signal

Cons:

  • Prone to designer bias
  • May lead to reward hacking
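
To make the contrast concrete, here is a minimal sketch of the two styles for a goal-reaching task; the success radius is an assumption, and the first action component stands in for the forward velocity \(v_{\text{forward}}\):

import numpy as np

GOAL_RADIUS = 0.1  # assumed success threshold

def sparse_reward(state, goal):
    # Reward 1 only inside the goal region, 0 everywhere else
    return 1.0 if np.linalg.norm(state - goal) < GOAL_RADIUS else 0.0

def dense_reward(state, action, goal, alpha=1.0, beta=0.01):
    # Negative distance to goal + forward-velocity bonus - action cost,
    # mirroring the formula above (action[0] is an illustrative stand-in
    # for the forward velocity)
    distance_term = -np.linalg.norm(state - goal)
    velocity_bonus = alpha * action[0]
    action_cost = beta * np.sum(action ** 2)
    return distance_term + velocity_bonus - action_cost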

Hybrid Strategies

In practice, hybrid approaches are common:

  1. Sparse goal reward + potential-based shaping
  2. Curriculum rewards: Gradual transition from dense to sparse
  3. Hierarchical rewards: Dense rewards for subgoals, sparse reward for the final goal

Reward Curriculum

Basic Idea

The reward function is adjusted dynamically as training progresses, guiding the agent from simple to complex objectives:

\[R_t(s, a) = (1 - \lambda_t) R_{\text{dense}}(s, a) + \lambda_t R_{\text{sparse}}(s, a)\]

where \(\lambda_t\) increases from 0 to 1 over the course of training.
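
A minimal sketch of this blending, assuming a simple linear schedule for \(\lambda_t\) (cosine or piecewise schedules work equally well):

def curriculum_lambda(step, total_steps):
    # Linear schedule: 0 at the start of training, 1 at the end
    return min(1.0, step / total_steps)

def curriculum_reward(r_dense, r_sparse, step, total_steps):
    lam = curriculum_lambda(step, total_steps)
    # Early training is dominated by the dense reward, late training by the sparse one
    return (1.0 - lam) * r_dense + lam * r_sparse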

Reward Annealing

Analogous to simulated annealing, the weight of auxiliary reward signals is gradually reduced over the course of training:

  1. Early stage: Provide rich intermediate rewards to guide learning
  2. Middle stage: Gradually reduce auxiliary rewards
  3. Late stage: Retain only the true task reward

Automatic Curricula

Meta-learning or evolutionary strategies can be used to automatically design reward curricula, avoiding manual tuning.

Multi-Objective Rewards

Linear Weighting

The simplest multi-objective approach:

\[R(s, a) = \sum_{i=1}^{k} w_i R_i(s, a)\]

Issue: Choosing the weights is difficult, and the scales of the individual reward components can differ substantially.
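
One common remedy is to normalize each component with running statistics before applying the weights. A minimal sketch, assuming Welford's online algorithm for the running mean and variance:

import numpy as np

class RunningNormalizer:
    """Tracks the running mean/std of one reward component (Welford's algorithm)."""
    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return (x - self.mean) / std

def weighted_reward(components, weights, normalizers):
    # Normalize each component to a comparable scale, then apply its weight
    total = 0.0
    for name, value in components.items():
        normalizers[name].update(value)
        total += weights[name] * normalizers[name].normalize(value)
    return total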

Constrained Optimization

Some objectives can be converted into constraints (see Safe Reinforcement Learning):

\[\max_\pi J_{\text{main}}(\pi) \quad \text{s.t.} \quad J_{c_i}(\pi) \leq d_i\]
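
In practice such constraints are often handled with Lagrangian relaxation: each constraint cost is subtracted from the main reward with a multiplier that is adapted by dual ascent. A minimal sketch; the update rule and learning rate are illustrative assumptions:

def lagrangian_reward(r_main, costs, multipliers):
    # Penalize each constraint cost by its current Lagrange multiplier
    return r_main - sum(lam * c for lam, c in zip(multipliers, costs))

def update_multipliers(multipliers, avg_costs, limits, lr=0.01):
    # Dual ascent: increase a multiplier when its constraint is violated on average,
    # decrease it (but never below zero) when the constraint is satisfied
    return [max(0.0, lam + lr * (c - d))
            for lam, c, d in zip(multipliers, avg_costs, limits)]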

Pareto Optimality

Seek a set of Pareto-optimal policies across multiple objectives rather than a single optimum.

Practical Reward Composition

import numpy as np

# is_goal, distance, and is_unsafe are task-specific helpers assumed to be defined elsewhere.
def compute_reward(state, action, next_state, goal):
    # Task reward (sparse)
    task_reward = 10.0 if is_goal(next_state) else 0.0

    # Progress reward (dense): positive when the agent moves closer to the goal
    progress = distance(state, goal) - distance(next_state, goal)

    # Safety penalty
    safety_penalty = -100.0 if is_unsafe(next_state) else 0.0

    # Energy cost
    energy_cost = -0.01 * np.sum(action ** 2)

    return task_reward + 1.0 * progress + safety_penalty + energy_cost

Reward from Human Feedback

Connection to RLHF

Human feedback can serve as a source of reward signals (see LLM Post-Training for details):

  1. Collect human preference comparisons over behaviors
  2. Train a reward model \(r_\phi(s, a)\)
  3. Use the learned reward model to train the policy
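
Step 2 is typically done by fitting a Bradley-Terry model to the collected comparisons. A minimal sketch of the per-pair loss, assuming a generic reward_model callable that scores a trajectory segment:

import numpy as np

def preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry model: P(preferred > rejected) = sigmoid(r_pref - r_rej).
    # The loss is the negative log-likelihood of the human label.
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -np.log(1.0 / (1.0 + np.exp(-(r_pref - r_rej))))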

Reward from VLMs

Vision-Language Models (VLMs) can serve as reward functions:

  • Describe the goal in natural language
  • The VLM evaluates how well the current state matches the goal description
  • Provides dense, semantic-level reward signals

Example:

\[r(s) = \text{sim}(\text{VLM}(s), \text{goal\_description})\]
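
A minimal CLIP-style sketch of this idea, where embed_image and embed_text are hypothetical stand-ins for the VLM's encoders:

import numpy as np

def vlm_reward(observation, goal_description, embed_image, embed_text):
    # Embed the current observation and the goal text, then score their
    # cosine similarity as a dense, semantic-level reward
    img = embed_image(observation)
    txt = embed_text(goal_description)
    return float(np.dot(img, txt) /
                 (np.linalg.norm(img) * np.linalg.norm(txt) + 1e-8))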

Advantages:

  • No need for hand-crafted rewards
  • Handles complex semantic objectives
  • Naturally supports open-world tasks

Reward Hacking

Definition

Reward hacking occurs when the agent finds a way to maximize the reward function without accomplishing the designer's intended task.

Classic Examples

| Task | Designed Reward | Hacking Behavior |
| --- | --- | --- |
| Boat racing | Collect coins | Circles back to collect the same coins repeatedly |
| Cleaning robot | -(amount of dust) | Pushes dust out of sight |
| Soccer | Ball possession time | Dribbles in place, never shoots |
| Code generation | Pass tests | Hard-codes the test answers |

Causes of Reward Hacking

  1. Gap between proxy metrics and true objectives: The reward function is an approximation of the real goal
  2. Out-of-distribution behavior: The policy may discover exploits outside the training distribution
  3. Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"

Mitigation Strategies

1. Diversified Reward Signals

Use multiple complementary reward components to reduce the risk of any single metric being hacked.

2. Adversarial Reward Design

Proactively anticipate possible hacking behaviors during reward design and add safeguards.

3. Reward Model Ensembles

Use an ensemble of reward models to reduce the bias of any single model:

\[r(s, a) = \frac{1}{K} \sum_{k=1}^{K} r_k(s, a)\]
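
A minimal sketch of the ensemble average; a pessimistic variant could additionally subtract a multiple of the models' disagreement (standard deviation):

import numpy as np

def ensemble_reward(reward_models, state, action):
    # Average the predictions of K independently trained reward models
    scores = np.array([r(state, action) for r in reward_models])
    return float(scores.mean())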

4. Human-in-the-Loop

Periodically have humans review the agent's behavior to detect and correct reward hacking.

5. Constrained Optimization

Impose safety conditions and behavioral norms as hard constraints rather than relying solely on rewards.

Practical Guide

Reward Function Design Workflow

  1. Clarify the task objective: Precisely describe the desired behavior in natural language
  2. Design initial reward: Start simple, preferring sparse rewards
  3. Add shaping: If learning efficiency is insufficient, add potential-based shaping
  4. Test robustness: Check for possible reward hacking
  5. Iterate: Adjust rewards based on the agent's actual behavior

Common Pitfalls

  • Scale mismatch between reward components
  • Forgetting rewards in terminal conditions
  • Dense rewards introducing unnecessary bias
  • Lacking penalties for undesired behaviors

References

  • Ng et al., "Policy Invariance Under Reward Transformations" (ICML 1999)
  • Amodei et al., "Concrete Problems in AI Safety" (2016)
  • Christiano et al., "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017)
  • Clark & Amodei, "Faulty Reward Functions in the Wild" (2016)
  • Ma et al., "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)
