
TD3 and DDPG

Overview

DDPG (Deep Deterministic Policy Gradient) and TD3 (Twin Delayed DDPG) are two classic Actor-Critic algorithms for continuous action spaces. DDPG pioneered deep deterministic policy gradient methods, while TD3 addresses DDPG's overestimation bias and training instability through three key improvements.

This article covers the principles, implementation, and comparison of both algorithms in detail.


1. DDPG: Deep Deterministic Policy Gradient

1.1 Background and Motivation

DQN achieved tremendous success in discrete action spaces but cannot directly handle continuous action spaces, since computing \(\max_a Q(s,a)\) requires solving an optimization problem over a continuous action space. DDPG combines:

  • Deterministic Policy Gradient (DPG) theory (Silver et al., 2014)
  • DQN's training stabilization tricks: Experience replay + target networks
  • Actor-Critic architecture

1.2 Deterministic Policy Gradient Theorem

For a deterministic policy \(\mu_\theta(s)\), the policy gradient is:

\[\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_a Q^\mu(s,a) \big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s) \right]\]

Intuition: Adjust the policy output in the direction that increases the Q-value.

Comparison with stochastic policy gradient:

| Dimension | Stochastic Policy Gradient | Deterministic Policy Gradient |
| --- | --- | --- |
| Policy | \(\pi_\theta(a \mid s)\), a probability distribution | \(\mu_\theta(s)\), a deterministic mapping |
| Gradient | Requires integration over actions | Only needs the gradient at the chosen action |
| Exploration | Built-in randomness | Requires external exploration noise |
| Sample efficiency | Lower | Higher |
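In an autodiff framework this gradient does not need to be assembled by hand: writing the loss as \(-Q_\phi(s, \mu_\theta(s))\) and backpropagating through the Critic into the Actor yields exactly the chain-rule product above. A minimal PyTorch sketch, assuming `actor`, `critic`, `state`, and `actor_optimizer` are defined as in the pseudocode later in this article:

# Deterministic policy gradient via backprop through the Critic:
# gradients flow from Q through the action into the Actor's parameters.
actor_optimizer.zero_grad()
actor_loss = -critic(state, actor(state)).mean()   # maximize Q  <=>  minimize -Q
actor_loss.backward()                               # yields ∇_a Q · ∇_θ μ_θ(s)
actor_optimizer.step()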

1.3 DDPG Architecture

graph TB
    subgraph Actor["Actor (Policy Network)"]
        S1[State s] --> MU["μ_θ(s)"]
        MU --> A[Action a]
    end

    subgraph Critic["Critic (Q-Network)"]
        S2[State s] --> Q["Q_φ(s, a)"]
        A2[Action a] --> Q
        Q --> V[Q-value]
    end

    subgraph Target["Target Networks"]
        MU_T["μ_θ'(s')"]
        Q_T["Q_φ'(s', a')"]
    end

    A --> A2

    style Actor fill:#e3f2fd
    style Critic fill:#fff3e0
    style Target fill:#f3e5f5

1.4 DDPG Algorithm Details

Four Networks

| Network | Symbol | Role |
| --- | --- | --- |
| Actor | \(\mu_\theta(s)\) | Outputs deterministic actions |
| Critic | \(Q_\phi(s,a)\) | Evaluates state-action pair values |
| Target Actor | \(\mu_{\theta'}(s)\) | Computes target actions |
| Target Critic | \(Q_{\phi'}(s,a)\) | Computes target Q-values |

Critic Update

Minimize TD error:

\[\mathcal{L}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_\phi(s,a) - y \right)^2 \right]\]

where the target value is:

\[y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s'))\]

Actor Update

Maximize the Q-value assessed by the Critic:

\[\nabla_\theta J = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q_\phi(s,a) \big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s) \right]\]

Target Network Soft Update

\[\theta' \leftarrow \tau \theta + (1-\tau) \theta'\]
\[\phi' \leftarrow \tau \phi + (1-\tau) \phi'\]

where \(\tau \ll 1\) (typically 0.005).
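A minimal sketch of the soft update as a PyTorch helper (the name `soft_update` matches its use in the pseudocode below):

import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: θ' ← τ·θ + (1 − τ)·θ'."""
    with torch.no_grad():
        for target_param, param in zip(target_net.parameters(),
                                       online_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * param)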

Exploration Strategy

Add noise during training:

\[a = \mu_\theta(s) + \mathcal{N}(0, \sigma)\]

The noise is typically drawn from an Ornstein-Uhlenbeck (OU) process or a simple Gaussian distribution.
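A minimal sketch of both noise schemes (class names are illustrative; in practice simple Gaussian noise is often reported to work as well as the OU process):

import numpy as np

class GaussianNoise:
    """Uncorrelated exploration noise: ε ~ N(0, σ²) per action dimension."""
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim, self.sigma = action_dim, sigma

    def sample(self):
        return np.random.normal(0.0, self.sigma, size=self.action_dim)

class OUNoise:
    """Temporally correlated noise: x ← x + θ(μ − x) + σ·N(0, 1)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(action_dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        self.x = self.x + self.theta * (self.mu - self.x) \
                 + self.sigma * np.random.randn(*self.x.shape)
        return self.x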

1.5 DDPG Pseudocode

# Initialization
actor = Actor(state_dim, action_dim)
critic = Critic(state_dim, action_dim)
target_actor = copy(actor)
target_critic = copy(critic)
replay_buffer = ReplayBuffer(capacity=1e6)

for episode in range(num_episodes):
    state = env.reset()
    for step in range(max_steps):
        # Select action + exploration noise
        action = actor(state) + noise.sample()
        action = clip(action, action_low, action_high)

        # Environment interaction
        next_state, reward, done = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)

        # Sample from replay buffer
        batch = replay_buffer.sample(batch_size=256)

        # Critic update
        target_action = target_actor(batch.next_state)
        target_q = batch.reward + gamma * target_critic(
            batch.next_state, target_action) * (1 - batch.done)
        critic_loss = MSE(critic(batch.state, batch.action), target_q)
        update(critic, critic_loss)

        # Actor update
        actor_loss = -critic(batch.state, actor(batch.state)).mean()
        update(actor, actor_loss)

        # Target network soft update
        soft_update(target_actor, actor, tau=0.005)
        soft_update(target_critic, critic, tau=0.005)

1.6 DDPG's Problems

  1. Q-value overestimation: Critic tends to overestimate Q-values, leading to suboptimal policies
  2. Training instability: Sensitive to hyperparameters, prone to divergence
  3. Fragile exploration: OU noise performance is inconsistent
  4. Actor-Critic coupling: Critic errors propagate to the Actor

2. TD3: Twin Delayed DDPG

2.1 Core Ideas

TD3 (Fujimoto et al., 2018) proposes three key improvements to address DDPG's problems:

  1. Clipped Double Q-Learning
  2. Delayed Policy Updates
  3. Target Policy Smoothing

2.2 Improvement 1: Clipped Double Q-Learning

Problem: A single Q-network tends to overestimate.

Solution: Use two independent Critic networks and take the minimum Q-value:

\[y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')\]

Rationale: Similar to Double DQN but more conservative. Taking the minimum effectively suppresses overestimation, even though it may introduce slight underestimation (underestimation is typically safer than overestimation).
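In code, the clipped target is an element-wise minimum over the two target Critics (a short sketch using the variable names from the pseudocode in Section 2.6):

import torch

with torch.no_grad():
    q1 = target_critic_1(batch.next_state, target_action)
    q2 = target_critic_2(batch.next_state, target_action)
    target_q = batch.reward + gamma * (1.0 - batch.done) * torch.min(q1, q2)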

2.3 Improvement 2: Delayed Policy Updates

Problem: Updating the Actor based on an inaccurate Critic causes policy oscillation.

Solution: Update the Actor only every \(d\) Critic updates (typically \(d=2\)).

Rationale: Let the Critic stabilize first, then use more accurate Q-values to guide Actor updates.

for step in range(total_steps):
    # Update Critic every step
    update_critic()

    # Update Actor and targets every d steps
    if step % d == 0:
        update_actor()
        soft_update_targets()

2.4 Improvement 3: Target Policy Smoothing

Problem: Target Q-values overfit to specific actions.

Solution: Add clipped noise to target actions:

\[\tilde{a}' = \mu_{\theta'}(s') + \text{clip}(\epsilon, -c, c), \quad \epsilon \sim \mathcal{N}(0, \tilde{\sigma})\]

Rationale: Similar to regularization, making the Q-function smoother in action space and reducing overfitting to individual actions.
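A short sketch of the smoothed target action (variable names follow the pseudocode in Section 2.6; `max_action` is an assumed scalar action bound):

with torch.no_grad():
    # Clipped Gaussian noise added to the target policy's action
    noise = (torch.randn_like(batch.action) * sigma_tilde).clamp(-c, c)
    target_action = (target_actor(batch.next_state) + noise).clamp(-max_action, max_action)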

2.5 TD3 Complete Algorithm

\[\text{Target: } y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')\]

where:

\[\tilde{a}' = \mu_{\theta'}(s') + \text{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)\]

Critic loss (two Critics updated separately):

\[\mathcal{L}(\phi_i) = \mathbb{E} \left[ \left( Q_{\phi_i}(s,a) - y \right)^2 \right], \quad i=1,2\]

Actor loss (updated every \(d\) steps):

\[J(\theta) = -\mathbb{E}_{s \sim \mathcal{D}} \left[ Q_{\phi_1}(s, \mu_\theta(s)) \right]\]

2.6 TD3 Pseudocode

# Initialization
actor = Actor(state_dim, action_dim)
critic_1 = Critic(state_dim, action_dim)
critic_2 = Critic(state_dim, action_dim)
target_actor = copy(actor)
target_critic_1 = copy(critic_1)
target_critic_2 = copy(critic_2)

for step in range(total_steps):
    # Select action + exploration noise
    action = actor(state) + N(0, sigma)

    # Interact and store
    next_state, reward, done = env.step(action)
    replay_buffer.add(state, action, reward, next_state, done)
    batch = replay_buffer.sample(batch_size)

    # Compute target (Target Policy Smoothing)
    noise = clip(N(0, sigma_tilde), -c, c)
    target_action = clip(target_actor(batch.next_state) + noise,
                         action_low, action_high)

    # Clipped Double Q-Learning
    target_q1 = target_critic_1(batch.next_state, target_action)
    target_q2 = target_critic_2(batch.next_state, target_action)
    target_q = batch.reward + gamma * min(target_q1, target_q2) * (1 - batch.done)

    # Update both Critics
    loss_1 = MSE(critic_1(batch.state, batch.action), target_q)
    loss_2 = MSE(critic_2(batch.state, batch.action), target_q)
    update(critic_1, loss_1)
    update(critic_2, loss_2)

    # Delayed Policy Update
    if step % policy_delay == 0:
        actor_loss = -critic_1(batch.state, actor(batch.state)).mean()
        update(actor, actor_loss)

        soft_update(target_actor, actor, tau)
        soft_update(target_critic_1, critic_1, tau)
        soft_update(target_critic_2, critic_2, tau)

2.7 Hyperparameters

| Hyperparameter | Typical Value | Description |
| --- | --- | --- |
| \(\gamma\) | 0.99 | Discount factor |
| \(\tau\) | 0.005 | Target network soft update rate |
| \(\sigma\) | 0.1 | Exploration noise std |
| \(\tilde{\sigma}\) | 0.2 | Target policy smoothing noise |
| \(c\) | 0.5 | Noise clipping range |
| \(d\) | 2 | Policy update delay |
| batch size | 256 | Batch size |
| buffer size | \(10^6\) | Replay buffer size |
| lr (actor) | \(3 \times 10^{-4}\) | Actor learning rate |
| lr (critic) | \(3 \times 10^{-4}\) | Critic learning rate |

3. DDPG vs TD3 vs SAC Comparison

3.1 Core Differences

| Feature | DDPG | TD3 | SAC |
| --- | --- | --- | --- |
| Number of Critics | 1 | 2 | 2 |
| Policy type | Deterministic | Deterministic | Stochastic (max entropy) |
| Q-value estimation | Tends to overestimate | Clipped minimum | Clipped minimum |
| Policy update frequency | Every step | Delayed (every \(d\) steps) | Every step |
| Target smoothing | None | Yes (added noise) | None (entropy regularization) |
| Exploration mechanism | External noise (OU/Gaussian) | External noise (Gaussian) | Entropy maximization (intrinsic) |
| Temperature parameter | None | None | \(\alpha\) (auto-tunable) |
| Training stability | Poor | Good | Best |

3.2 Performance Comparison

Typical performance on MuJoCo continuous control benchmarks:

| Environment | DDPG | TD3 | SAC |
| --- | --- | --- | --- |
| HalfCheetah | ~8,000 | ~10,000 | ~11,000 |
| Ant | ~1,000 | ~4,500 | ~5,500 |
| Walker2d | ~2,000 | ~4,500 | ~5,000 |
| Humanoid | ~500 | ~5,000 | ~6,000 |

Note

The above values are typical references; actual performance varies significantly with hyperparameters and random seeds.

3.3 Selection Guidelines

Algorithm selection for continuous control:
  ├── Need most stable training → SAC
  ├── Need simple implementation → TD3
  ├── Need deterministic policy → TD3 / DDPG
  ├── Need automatic exploration tuning → SAC
  └── As baseline → TD3 (simple, effective)

4. Practical Tips

4.1 Common Issues and Solutions

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Q-value explosion | Overestimation | Use TD3 / double Q-networks |
| Training doesn't converge | Learning rate too high | Lower learning rate, increase batch size |
| Policy oscillation | Actor-Critic out of sync | Delayed policy updates |
| Insufficient exploration | Noise too small | Increase noise, or use SAC |
| Actions out of range | Missing clipping | Use tanh + scaling in the output layer |

4.2 Network Architecture Recommendations

import torch
import torch.nn as nn

# Actor Network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Tanh()  # Output [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

# Critic Network
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
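For TD3 it is convenient to wrap the two Q-networks in a single module so they can be evaluated in one call and optimized together; a sketch under that assumption (the class name `TwinCritic` is illustrative):

class TwinCritic(nn.Module):
    """Two independent Critics sharing one interface (TD3-style)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = Critic(state_dim, action_dim)
        self.q2 = Critic(state_dim, action_dim)

    def forward(self, state, action):
        # Returns both Q-estimates; take the element-wise min for targets
        return self.q1(state, action), self.q2(state, action)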

4.3 Training Tips

  1. Normalize observations: Apply running mean/variance normalization to states (see the sketch after this list)
  2. Reward scaling: Scale rewards to a reasonable range
  3. Warm-up period: Use random policy for the first N steps to fill the replay buffer
  4. Gradient clipping: Prevent gradient explosion
  5. Multi-seed evaluation: Run multiple random seeds and average results
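A minimal sketch of the running observation normalization mentioned in tip 1 (a Welford/Chan-style running estimate; the class name `RunningNorm` is illustrative):

import numpy as np

class RunningNorm:
    """Tracks running mean/variance of observations and normalizes them."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        # x: batch of observations with shape (batch, *shape)
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)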

5. From TD3 to More Advanced Methods

5.1 Development Lineage

DPG (2014)
  → DDPG (2015): + Deep Networks + Target Net + Replay
    → TD3 (2018): + Double Q + Delay + Smoothing
      → SAC (2018): + Maximum Entropy + Stochastic Policy
        → DrQ (2021): + Data Augmentation
          → RLPD (2023): + Pre-training Data

5.2 Key Differences with SAC

SAC uses a stochastic policy and maximum entropy framework:

\[J(\theta) = \mathbb{E}\left[\sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right]\]

Compared to TD3, SAC automatically balances exploration and exploitation through entropy regularization, typically offering greater stability. See SAC Algorithm Details for more.


References

  • Lillicrap, T. et al. (2016). Continuous control with deep reinforcement learning. ICLR 2016.
  • Silver, D. et al. (2014). Deterministic Policy Gradient Algorithms. ICML 2014.
  • Fujimoto, S. et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
  • Haarnoja, T. et al. (2018). Soft Actor-Critic. ICML 2018.
