Reward Engineering
Overview
The reward function is at the heart of reinforcement learning — it defines the agent's optimization objective. However, designing a correct and efficient reward function is often the most challenging part of RL applications. This section discusses key topics including reward shaping, reward curricula, multi-objective rewards, and reward hacking.
Reward Shaping
Basic Concept
Reward shaping guides learning by adding supplementary reward signals without changing the optimal policy.
The original reward \(R(s, a, s')\) is modified to:
where \(F(s, s')\) is the shaping function.
Potential-Based Reward Shaping
Ng et al. (1999) proved that as long as the shaping function takes the following form, the optimal policy is preserved:
where \(\Phi: \mathcal{S} \to \mathbb{R}\) is a potential function.
Theorem: Under potential-based reward shaping, the original MDP and the shaped MDP share the same set of optimal policies.
Intuition: The potential function is analogous to potential energy in physics — the potential difference depends only on the start and end points, not the path, and therefore does not introduce "shortcuts."
Practical example:
# Potential function for maze navigation: negative distance to goal
def potential(state, goal):
return -np.linalg.norm(state - goal)
# Shaped reward
def shaped_reward(s, s_next, gamma, original_reward):
F = gamma * potential(s_next) - potential(s)
return original_reward + F
Risks of Non-Potential Shaping
If \(F\) does not satisfy the potential function form, it may:
- Alter the optimal policy
- Introduce cyclic behavior (the agent loops through high-reward regions)
- Converge to suboptimal solutions
Sparse vs. Dense Rewards
Sparse Rewards
Rewards are given only at key events (e.g., reaching the goal, completing the task):
Pros:
- Clear objective, less prone to introducing bias
- Closer to the true task definition
Cons:
- Extremely sparse learning signal, making exploration difficult
- Requires many interactions before the first reward is obtained
Dense Rewards
Informative feedback is provided at every step:
Pros:
- High learning efficiency
- Rich gradient signal
Cons:
- Prone to designer bias
- May lead to reward hacking
Hybrid Strategies
In practice, hybrid approaches are common:
- Sparse goal reward + potential-based shaping
- Curriculum rewards: Gradual transition from dense to sparse
- Hierarchical rewards: Dense rewards for subgoals, sparse reward for the final goal
Reward Curriculum
Basic Idea
Dynamically adjusting the reward function as training progresses, guiding the agent from simple to complex:
where \(\lambda_t\) increases from 0 to 1 over the course of training.
Reward Annealing
Similar to simulated annealing, gradually reducing the weight of auxiliary reward signals:
- Early stage: Provide rich intermediate rewards to guide learning
- Middle stage: Gradually reduce auxiliary rewards
- Late stage: Retain only the true task reward
Automatic Curricula
Meta-learning or evolutionary strategies can be used to automatically design reward curricula, avoiding manual tuning.
Multi-Objective Rewards
Linear Weighting
The simplest multi-objective approach:
Issue: Weight selection is difficult, and different reward scales may vary substantially.
Constrained Optimization
Some objectives can be converted into constraints (see Safe Reinforcement Learning):
Pareto Optimality
Seek a set of Pareto-optimal policies across multiple objectives rather than a single optimum.
Practical Reward Composition
def compute_reward(state, action, next_state):
# Task reward (sparse)
task_reward = 10.0 if is_goal(next_state) else 0.0
# Progress reward (dense)
progress = distance(state, goal) - distance(next_state, goal)
# Safety penalty
safety_penalty = -100.0 if is_unsafe(next_state) else 0.0
# Energy cost
energy_cost = -0.01 * np.sum(action ** 2)
return task_reward + 1.0 * progress + safety_penalty + energy_cost
Reward from Human Feedback
Connection to RLHF
Human feedback can serve as a source of reward signals (see LLM Post-Training for details):
- Collect human preference comparisons over behaviors
- Train a reward model \(r_\phi(s, a)\)
- Use the learned reward model to train the policy
Reward from VLMs
Vision-Language Models (VLMs) can serve as reward functions:
- Describe the goal in natural language
- The VLM evaluates how well the current state matches the goal description
- Provides dense, semantic-level reward signals
Example:
Advantages:
- No need for hand-crafted rewards
- Handles complex semantic objectives
- Naturally supports open-world tasks
Reward Hacking
Definition
Reward hacking occurs when the agent finds a way to maximize the reward function without accomplishing the designer's intended task.
Classic Examples
| Task | Designed Reward | Hacking Behavior |
|---|---|---|
| Boat racing | Collect coins | Circle back to collect the same coins |
| Cleaning robot | -(amount of dust) | Push dust out of sight |
| Soccer | Ball possession time | Dribble in place, never shoot |
| Code generation | Pass tests | Hard-code test answers |
Causes of Reward Hacking
- Gap between proxy metrics and true objectives: The reward function is an approximation of the real goal
- Out-of-distribution behavior: The policy may discover exploits outside the training distribution
- Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"
Mitigation Strategies
1. Diversified Reward Signals
Use multiple complementary reward components to reduce the risk of any single metric being hacked.
2. Adversarial Reward Design
Proactively anticipate possible hacking behaviors during reward design and add safeguards.
3. Reward Model Ensembles
Use an ensemble of reward models to reduce the bias of any single model:
4. Human-in-the-Loop
Periodically have humans review the agent's behavior to detect and correct reward hacking.
5. Constrained Optimization
Impose safety conditions and behavioral norms as hard constraints rather than relying solely on rewards.
Practical Guide
Reward Function Design Workflow
- Clarify the task objective: Precisely describe the desired behavior in natural language
- Design initial reward: Start simple, preferring sparse rewards
- Add shaping: If learning efficiency is insufficient, add potential-based shaping
- Test robustness: Check for possible reward hacking
- Iterate: Adjust rewards based on the agent's actual behavior
Common Pitfalls
- Scale mismatch between reward components
- Forgetting rewards in terminal conditions
- Dense rewards introducing unnecessary bias
- Lacking penalties for undesired behaviors
References
- Ng et al., "Policy Invariance Under Reward Transformations" (ICML 1999)
- Amodei et al., "Concrete Problems in AI Safety" (2016)
- Christiano et al., "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017)
- Clark & Amodei, "Faulty Reward Functions in the Wild" (2016)
- Ma et al., "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)