TD3 and DDPG
Overview
DDPG (Deep Deterministic Policy Gradient) and TD3 (Twin Delayed DDPG) are two of the most widely used Actor-Critic algorithms for continuous action spaces. DDPG pioneered deep deterministic policy gradient methods, while TD3 addresses DDPG's Q-value overestimation and training instability through three key improvements.
This article covers the principles, implementation, and comparison of both algorithms in detail.
1. DDPG: Deep Deterministic Policy Gradient
1.1 Background and Motivation
DQN achieved tremendous success in discrete action spaces but cannot directly handle continuous action spaces, since computing \(\max_a Q(s,a)\) would require solving a separate optimization problem over the continuous action space at every step. DDPG combines:
- Deterministic Policy Gradient (DPG) theory (Silver et al., 2014)
- DQN's training stabilization tricks: Experience replay + target networks
- Actor-Critic architecture
1.2 Deterministic Policy Gradient Theorem
For a deterministic policy \(\mu_\theta(s)\), the policy gradient is:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\right]
\]

Intuition: Adjust the policy output in the direction that increases the Q-value.
Comparison with stochastic policy gradient:
| Dimension | Stochastic Policy Gradient | Deterministic Policy Gradient |
|---|---|---|
| Policy | \(\pi_\theta(a\|s)\) probability distribution | \(\mu_\theta(s)\) deterministic mapping |
| Gradient | Requires integration over actions | Only needs gradient at the action |
| Exploration | Built-in randomness | Requires external exploration noise |
| Sample efficiency | Lower | Higher |
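For reference, the stochastic policy gradient in the first column has the standard form

\[
\nabla_\theta J(\theta) = \mathbb{E}_{s,\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\right]
\]

which requires an expectation over the action distribution, whereas the deterministic gradient above only needs \(\nabla_a Q\) evaluated at the single action \(a = \mu_\theta(s)\).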
1.3 DDPG Architecture
graph TB
subgraph Actor["Actor (Policy Network)"]
S1[State s] --> MU["μ_θ(s)"]
MU --> A[Action a]
end
subgraph Critic["Critic (Q-Network)"]
S2[State s] --> Q["Q_φ(s, a)"]
A2[Action a] --> Q
Q --> V[Q-value]
end
subgraph Target["Target Networks"]
MU_T["μ_θ'(s')"]
Q_T["Q_φ'(s', a')"]
end
A --> A2
style Actor fill:#e3f2fd
style Critic fill:#fff3e0
style Target fill:#f3e5f5
1.4 DDPG Algorithm Details
Four Networks
| Network | Symbol | Role |
|---|---|---|
| Actor | \(\mu_\theta(s)\) | Output deterministic actions |
| Critic | \(Q_\phi(s,a)\) | Evaluate state-action pair values |
| Target Actor | \(\mu_{\theta'}(s)\) | Compute target actions |
| Target Critic | \(Q_{\phi'}(s,a)\) | Compute target Q-values |
Critic Update
Minimize the TD error:

\[
L(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\big(Q_\phi(s,a) - y\big)^2\right]
\]

where the target value is:

\[
y = r + \gamma\, Q_{\phi'}\big(s', \mu_{\theta'}(s')\big)
\]
Actor Update
Maximize the Q-value assessed by the Critic:

\[
J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[Q_\phi\big(s, \mu_\theta(s)\big)\right], \qquad
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s,a)\big|_{a=\mu_\theta(s)}\right]
\]

In practice this is implemented as gradient descent on the loss \(-\mathbb{E}_s\left[Q_\phi(s, \mu_\theta(s))\right]\).
Target Network Soft Update

\[
\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \phi' \leftarrow \tau\phi + (1-\tau)\phi'
\]

where \(\tau \ll 1\) (typically 0.005).
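As a concrete reference, a minimal soft_update helper might look like the sketch below (PyTorch assumed; the function name mirrors the soft_update calls in the pseudocode that follows):

import torch

def soft_update(target_net: torch.nn.Module, net: torch.nn.Module, tau: float = 0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for target_param, param in zip(target_net.parameters(), net.parameters()):
            target_param.mul_(1.0 - tau)
            target_param.add_(param, alpha=tau)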
Exploration Strategy
Add noise during training:

\[
a_t = \operatorname{clip}\big(\mu_\theta(s_t) + \mathcal{N}_t,\; a_{\text{low}},\; a_{\text{high}}\big)
\]

where \(\mathcal{N}_t\) is typically an Ornstein-Uhlenbeck (OU) process or simple Gaussian noise.
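Both options can be sketched in a few lines of NumPy (class names and default parameters here are illustrative, not from the original text):

import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise dx = theta*(mu - x)*dt + sigma*dW."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x.copy()

class GaussianNoise:
    """Uncorrelated Gaussian noise; simpler and often works just as well."""
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim, self.sigma = action_dim, sigma

    def sample(self):
        return np.random.normal(0.0, self.sigma, size=self.action_dim)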
1.5 DDPG Pseudocode
# Initialization
actor = Actor(state_dim, action_dim)
critic = Critic(state_dim, action_dim)
target_actor = copy(actor)
target_critic = copy(critic)
replay_buffer = ReplayBuffer(capacity=1e6)
for episode in range(num_episodes):
state = env.reset()
for step in range(max_steps):
# Select action + exploration noise
action = actor(state) + noise.sample()
action = clip(action, action_low, action_high)
# Environment interaction
next_state, reward, done = env.step(action)
replay_buffer.add(state, action, reward, next_state, done)
# Sample from replay buffer
batch = replay_buffer.sample(batch_size=256)
# Critic update
target_action = target_actor(batch.next_state)
target_q = batch.reward + gamma * target_critic(
batch.next_state, target_action) * (1 - batch.done)
critic_loss = MSE(critic(batch.state, batch.action), target_q)
update(critic, critic_loss)
# Actor update
actor_loss = -critic(batch.state, actor(batch.state)).mean()
update(actor, actor_loss)
# Target network soft update
        soft_update(target_actor, actor, tau=0.005)
        soft_update(target_critic, critic, tau=0.005)

        state = next_state
        if done:
            break
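The update(net, loss) call above is deliberately abstract; with PyTorch it amounts to roughly the following sketch (the explicit optimizer argument and optional gradient clipping are assumptions, not part of the original pseudocode):

import torch

def update(net, loss, optimizer, max_grad_norm=None):
    """Take one gradient step on `loss`; optionally clip gradients (see Practical Tips below)."""
    optimizer.zero_grad()
    loss.backward()
    if max_grad_norm is not None:
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_grad_norm)
    optimizer.step()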
1.6 DDPG's Problems
- Q-value overestimation: Critic tends to overestimate Q-values, leading to suboptimal policies
- Training instability: Sensitive to hyperparameters, prone to divergence
- Fragile exploration: OU noise performance is inconsistent
- Actor-Critic coupling: Critic errors propagate to the Actor
2. TD3: Twin Delayed DDPG
2.1 Core Ideas
TD3 (Fujimoto et al., 2018) proposes three key improvements to address DDPG's problems:
- Clipped Double Q-Learning
- Delayed Policy Updates
- Target Policy Smoothing
2.2 Improvement 1: Clipped Double Q-Learning
Problem: A single Q-network tends to overestimate.
Solution: Use two independent Critic networks and take the minimum of their target Q-values:

\[
y = r + \gamma \min_{i=1,2} Q_{\phi_i'}(s', a')
\]
Rationale: Similar to Double DQN but more conservative. Taking the minimum effectively suppresses overestimation, even though it may introduce slight underestimation (underestimation is typically safer than overestimation).
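In code, the clipped target is just an element-wise minimum over the two target Critics. A sketch, assuming PyTorch tensors and the network names used in the pseudocode of Section 2.6 (target policy smoothing omitted here for clarity):

import torch

with torch.no_grad():
    next_action = target_actor(next_state)
    q1 = target_critic_1(next_state, next_action)
    q2 = target_critic_2(next_state, next_action)
    target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)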
2.3 Improvement 2: Delayed Policy Updates
Problem: Updating the Actor based on an inaccurate Critic causes policy oscillation.
Solution: Update the Actor only every \(d\) Critic updates (typically \(d=2\)).
Rationale: Let the Critic stabilize first, then use more accurate Q-values to guide Actor updates.
for step in range(total_steps):
# Update Critic every step
update_critic()
# Update Actor and targets every d steps
if step % d == 0:
update_actor()
soft_update_targets()
2.4 Improvement 3: Target Policy Smoothing
Problem: Target Q-values overfit to specific actions.
Solution: Add clipped noise to the target action:

\[
a' = \operatorname{clip}\big(\mu_{\theta'}(s') + \epsilon,\; a_{\text{low}},\; a_{\text{high}}\big), \qquad
\epsilon \sim \operatorname{clip}\big(\mathcal{N}(0, \tilde{\sigma}),\; -c,\; c\big)
\]
Rationale: Similar to regularization, making the Q-function smoother in action space and reducing overfitting to individual actions.
2.5 TD3 Complete Algorithm
Target value:

\[
y = r + \gamma \min_{i=1,2} Q_{\phi_i'}(s', a')
\]

where:

\[
a' = \operatorname{clip}\big(\mu_{\theta'}(s') + \epsilon,\; a_{\text{low}},\; a_{\text{high}}\big), \qquad
\epsilon \sim \operatorname{clip}\big(\mathcal{N}(0, \tilde{\sigma}),\; -c,\; c\big)
\]

Critic loss (the two Critics are updated separately against the same target):

\[
L(\phi_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\big(Q_{\phi_i}(s,a) - y\big)^2\right], \quad i = 1, 2
\]

Actor loss (updated every \(d\) steps, using \(Q_{\phi_1}\) only):

\[
J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[Q_{\phi_1}\big(s, \mu_\theta(s)\big)\right]
\]
2.6 TD3 Pseudocode
# Initialization
actor = Actor(state_dim, action_dim)
critic_1 = Critic(state_dim, action_dim)
critic_2 = Critic(state_dim, action_dim)
target_actor = copy(actor)
target_critic_1 = copy(critic_1)
target_critic_2 = copy(critic_2)
replay_buffer = ReplayBuffer(capacity=1e6)

state = env.reset()
for step in range(total_steps):
# Select action + exploration noise
action = actor(state) + N(0, sigma)
# Interact and store
next_state, reward, done = env.step(action)
replay_buffer.add(state, action, reward, next_state, done)
batch = replay_buffer.sample(batch_size)
# Compute target (Target Policy Smoothing)
noise = clip(N(0, sigma_tilde), -c, c)
target_action = clip(target_actor(batch.next_state) + noise,
action_low, action_high)
# Clipped Double Q-Learning
target_q1 = target_critic_1(batch.next_state, target_action)
target_q2 = target_critic_2(batch.next_state, target_action)
target_q = batch.reward + gamma * min(target_q1, target_q2) * (1 - batch.done)
# Update both Critics
loss_1 = MSE(critic_1(batch.state, batch.action), target_q)
loss_2 = MSE(critic_2(batch.state, batch.action), target_q)
update(critic_1, loss_1)
update(critic_2, loss_2)
# Delayed Policy Update
if step % policy_delay == 0:
actor_loss = -critic_1(batch.state, actor(batch.state)).mean()
update(actor, actor_loss)
soft_update(target_actor, actor, tau)
soft_update(target_critic_1, critic_1, tau)
        soft_update(target_critic_2, critic_2, tau)

    state = next_state if not done else env.reset()
2.7 Hyperparameters
| Hyperparameter | Typical Value | Description |
|---|---|---|
| \(\gamma\) | 0.99 | Discount factor |
| \(\tau\) | 0.005 | Target network soft update rate |
| \(\sigma\) | 0.1 | Exploration noise std |
| \(\tilde{\sigma}\) | 0.2 | Target policy smoothing noise |
| \(c\) | 0.5 | Noise clipping range |
| \(d\) | 2 | Policy update delay |
| batch size | 256 | Batch size |
| buffer size | \(10^6\) | Replay buffer size |
| lr (actor) | \(3 \times 10^{-4}\) | Actor learning rate |
| lr (critic) | \(3 \times 10^{-4}\) | Critic learning rate |
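For convenience, these defaults can be bundled into a small configuration object (a sketch; field names are illustrative):

from dataclasses import dataclass

@dataclass
class TD3Config:
    gamma: float = 0.99            # discount factor
    tau: float = 0.005             # target network soft update rate
    explore_sigma: float = 0.1     # exploration noise std
    target_sigma: float = 0.2      # target policy smoothing noise std
    noise_clip: float = 0.5        # clipping range c for smoothing noise
    policy_delay: int = 2          # Actor update delay d
    batch_size: int = 256
    buffer_size: int = 1_000_000
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4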
3. DDPG vs TD3 vs SAC Comparison
3.1 Core Differences
| Feature | DDPG | TD3 | SAC |
|---|---|---|---|
| Number of Critics | 1 | 2 | 2 |
| Policy type | Deterministic | Deterministic | Stochastic (max entropy) |
| Q-value estimation | Tends to overestimate | Clipped minimum | Clipped minimum |
| Policy update frequency | Every step | Delayed (\(d\) steps) | Every step |
| Target smoothing | None | Yes (added noise) | None (entropy regularization) |
| Exploration mechanism | External noise (OU/Gaussian) | External noise (Gaussian) | Entropy maximization (intrinsic) |
| Temperature parameter | None | None | \(\alpha\) (auto-tunable) |
| Training stability | Poor | Good | Best |
3.2 Performance Comparison
Typical performance on MuJoCo continuous control benchmarks:
| Environment | DDPG | TD3 | SAC |
|---|---|---|---|
| HalfCheetah | ~8,000 | ~10,000 | ~11,000 |
| Ant | ~1,000 | ~4,500 | ~5,500 |
| Walker2d | ~2,000 | ~4,500 | ~5,000 |
| Humanoid | ~500 | ~5,000 | ~6,000 |
Note
The values above are rough reference points only; actual performance varies significantly with hyperparameters and random seeds.
3.3 Selection Guidelines
Algorithm selection for continuous control:
├── Need most stable training → SAC
├── Need simple implementation → TD3
├── Need deterministic policy → TD3 / DDPG
├── Need automatic exploration tuning → SAC
└── As baseline → TD3 (simple, effective)
4. Practical Tips
4.1 Common Issues and Solutions
| Issue | Possible Cause | Solution |
|---|---|---|
| Q-value explosion | Overestimation | Use TD3/double Q-networks |
| Training doesn't converge | Learning rate too high | Lower learning rate, increase batch size |
| Policy oscillation | Actor-Critic out of sync | Delayed policy updates |
| Insufficient exploration | Noise too small | Increase noise, or use SAC |
| Actions out of range | Missing clipping | Use tanh + scaling in output layer |
4.2 Network Architecture Recommendations
import torch
import torch.nn as nn

# Actor Network
class Actor(nn.Module):
def __init__(self, state_dim, action_dim, max_action):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, action_dim),
nn.Tanh() # Output [-1, 1]
)
self.max_action = max_action
def forward(self, state):
return self.max_action * self.net(state)
# Critic Network
class Critic(nn.Module):
def __init__(self, state_dim, action_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim + action_dim, 256),
nn.ReLU(),
nn.Linear(256, 256),
nn.ReLU(),
nn.Linear(256, 1)
)
def forward(self, state, action):
return self.net(torch.cat([state, action], dim=-1))
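For TD3, the two Critics can be wrapped in a single module that returns both Q-values (a sketch building on the Critic class above; the class name is illustrative):

class TwinCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = Critic(state_dim, action_dim)
        self.q2 = Critic(state_dim, action_dim)

    def forward(self, state, action):
        # The target uses min(q1, q2); the Actor loss typically uses q1 only
        return self.q1(state, action), self.q2(state, action)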
4.3 Training Tips
- Normalize observations: Apply running mean/variance normalization to states (a minimal sketch follows this list)
- Reward scaling: Scale rewards to a reasonable range
- Warm-up period: Use random policy for the first N steps to fill the replay buffer
- Gradient clipping: Prevent gradient explosion
- Multi-seed evaluation: Run multiple random seeds and average results
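A minimal running mean/variance normalizer for observations, as referenced in the first tip above (a sketch using an online update; names are illustrative):

import numpy as np

class RunningNormalizer:
    """Tracks a running mean/variance of observations and normalizes them."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.count += 1.0
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x, clip=10.0):
        return np.clip((x - self.mean) / np.sqrt(self.var + 1e-8), -clip, clip)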
5. From TD3 to More Advanced Methods
5.1 Development Lineage
DPG (2014)
→ DDPG (2015): + Deep Networks + Target Net + Replay
→ TD3 (2018): + Double Q + Delay + Smoothing
→ SAC (2018): + Maximum Entropy + Stochastic Policy
→ DrQ (2021): + Data Augmentation
→ RLPD (2023): + Pre-training Data
5.2 Key Differences with SAC
SAC uses a stochastic policy and a maximum entropy objective:

\[
J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]
\]
Compared to TD3, SAC automatically balances exploration and exploitation through entropy regularization, typically offering greater stability. See SAC Algorithm Details for more.
References
- Lillicrap, T. et al. (2016). Continuous control with deep reinforcement learning. ICLR 2016.
- Silver, D. et al. (2014). Deterministic Policy Gradient Algorithms. ICML 2014.
- Fujimoto, S. et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
- Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.
Further Reading
- SAC Algorithm Details — Maximum entropy Actor-Critic
- PPO Algorithm — Another mainstream policy optimization method
- Deep RL Introduction — DQN and foundational concepts
- TRPO and Natural Gradient — Trust region methods