
TD3 and DDPG

Overview

DDPG (Deep Deterministic Policy Gradient) and TD3 (Twin Delayed DDPG) are two classic Actor-Critic algorithms for continuous action spaces. DDPG pioneered deep deterministic policy gradient methods, while TD3 addresses DDPG's overestimation bias and training instability through three key improvements.

This article covers the principles, implementation, and comparison of both algorithms in detail.


1. DDPG: Deep Deterministic Policy Gradient

1.1 Background and Motivation

DQN achieved tremendous success in discrete action spaces but cannot directly handle continuous action spaces, since computing \(\max_a Q(s,a)\) requires solving an optimization problem over a continuous action space. DDPG combines:

  • Deterministic Policy Gradient (DPG) theory (Silver et al., 2014)
  • DQN's training stabilization tricks: Experience replay + target networks
  • Actor-Critic architecture

1.2 Deterministic Policy Gradient Theorem

For a deterministic policy \(\mu_\theta(s)\), the policy gradient is:

\[\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_a Q^\mu(s,a) \big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s) \right]\]

Intuition: Adjust the policy output in the direction that increases the Q-value.

Comparison with stochastic policy gradient:

| Dimension | Stochastic Policy Gradient | Deterministic Policy Gradient |
| --- | --- | --- |
| Policy | \(\pi_\theta(a \mid s)\), a probability distribution | \(\mu_\theta(s)\), a deterministic mapping |
| Gradient | Requires integration over actions | Only needs the gradient at the chosen action |
| Exploration | Built-in randomness | Requires external exploration noise |
| Sample efficiency | Lower | Higher |
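In an autodiff framework this gradient does not need to be assembled by hand: writing the loss as \(-Q_\phi(s, \mu_\theta(s))\) and backpropagating through the Critic into the Actor yields exactly the chain-rule product above. A minimal PyTorch sketch, assuming `actor`, `critic`, `state`, and `actor_optimizer` are defined as in the pseudocode later in this article:

# Deterministic policy gradient via backprop through the Critic:
# gradients flow from Q through the action into the Actor's parameters.
actor_optimizer.zero_grad()
actor_loss = -critic(state, actor(state)).mean()   # maximize Q  <=>  minimize -Q
actor_loss.backward()                               # yields ∇_a Q · ∇_θ μ_θ(s)
actor_optimizer.step()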

1.3 DDPG Architecture

graph TB
    subgraph Actor["Actor (Policy Network)"]
        S1[State s] --> MU["μ_θ(s)"]
        MU --> A[Action a]
    end

    subgraph Critic["Critic (Q-Network)"]
        S2[State s] --> Q["Q_φ(s, a)"]
        A2[Action a] --> Q
        Q --> V[Q-value]
    end

    subgraph Target["Target Networks"]
        MU_T["μ_θ'(s')"]
        Q_T["Q_φ'(s', a')"]
    end

    A --> A2

    style Actor fill:#e3f2fd
    style Critic fill:#fff3e0
    style Target fill:#f3e5f5

1.4 DDPG Algorithm Details

Four Networks

| Network | Symbol | Role |
| --- | --- | --- |
| Actor | \(\mu_\theta(s)\) | Outputs deterministic actions |
| Critic | \(Q_\phi(s,a)\) | Evaluates state-action pair values |
| Target Actor | \(\mu_{\theta'}(s)\) | Computes target actions |
| Target Critic | \(Q_{\phi'}(s,a)\) | Computes target Q-values |

Critic Update

Minimize TD error:

\[\mathcal{L}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_\phi(s,a) - y \right)^2 \right]\]

where the target value is:

\[y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s'))\]

Actor Update

Maximize the Q-value assessed by the Critic:

\[\nabla_\theta J = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q_\phi(s,a) \big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s) \right]\]

Target Network Soft Update

\[\theta' \leftarrow \tau \theta + (1-\tau) \theta'\]
\[\phi' \leftarrow \tau \phi + (1-\tau) \phi'\]

where \(\tau \ll 1\) (typically 0.005).
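A minimal sketch of the soft update as a PyTorch helper (the name `soft_update` matches its use in the pseudocode below):

import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: θ' ← τ·θ + (1 − τ)·θ'."""
    with torch.no_grad():
        for target_param, param in zip(target_net.parameters(),
                                       online_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * param)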

Exploration Strategy

Add noise during training:

\[a = \mu_\theta(s) + \mathcal{N}(0, \sigma)\]

The noise is typically drawn from an Ornstein-Uhlenbeck (OU) process or a simple Gaussian distribution.
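A minimal sketch of both noise schemes (class names are illustrative; in practice simple Gaussian noise is often reported to work as well as the OU process):

import numpy as np

class GaussianNoise:
    """Uncorrelated exploration noise: ε ~ N(0, σ²) per action dimension."""
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim, self.sigma = action_dim, sigma

    def sample(self):
        return np.random.normal(0.0, self.sigma, size=self.action_dim)

class OUNoise:
    """Temporally correlated noise: x ← x + θ(μ − x) + σ·N(0, 1)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(action_dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        self.x = self.x + self.theta * (self.mu - self.x) \
                 + self.sigma * np.random.randn(*self.x.shape)
        return self.x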

1.5 DDPG Pseudocode

# Initialization
actor = Actor(state_dim, action_dim)
critic = Critic(state_dim, action_dim)
target_actor = copy(actor)
target_critic = copy(critic)
replay_buffer = ReplayBuffer(capacity=1e6)

for episode in range(num_episodes):
    state = env.reset()
    for step in range(max_steps):
        # Select action + exploration noise
        action = actor(state) + noise.sample()
        action = clip(action, action_low, action_high)

        # Environment interaction
        next_state, reward, done = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)

        # Sample from replay buffer
        batch = replay_buffer.sample(batch_size=256)

        # Critic update
        target_action = target_actor(batch.next_state)
        target_q = batch.reward + gamma * target_critic(
            batch.next_state, target_action) * (1 - batch.done)
        critic_loss = MSE(critic(batch.state, batch.action), target_q)
        update(critic, critic_loss)

        # Actor update
        actor_loss = -critic(batch.state, actor(batch.state)).mean()
        update(actor, actor_loss)

        # Target network soft update
        soft_update(target_actor, actor, tau=0.005)
        soft_update(target_critic, critic, tau=0.005)

1.6 DDPG's Problems

  1. Q-value overestimation: Critic tends to overestimate Q-values, leading to suboptimal policies
  2. Training instability: Sensitive to hyperparameters, prone to divergence
  3. Fragile exploration: OU noise performance is inconsistent
  4. Actor-Critic coupling: Critic errors propagate to the Actor

2. TD3: Twin Delayed DDPG

2.1 Core Ideas

TD3 (Fujimoto et al., 2018) proposes three key improvements to address DDPG's problems:

  1. Clipped Double Q-Learning
  2. Delayed Policy Updates
  3. Target Policy Smoothing

2.2 Improvement 1: Clipped Double Q-Learning

Problem: A single Q-network tends to overestimate.

Solution: Use two independent Critic networks and take the minimum Q-value:

\[y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')\]

Rationale: Similar to Double DQN but more conservative. Taking the minimum effectively suppresses overestimation, even though it may introduce slight underestimation (underestimation is typically safer than overestimation).
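In code, the clipped target is an element-wise minimum over the two target Critics (a short sketch using the variable names from the pseudocode in Section 2.6):

import torch

with torch.no_grad():
    q1 = target_critic_1(batch.next_state, target_action)
    q2 = target_critic_2(batch.next_state, target_action)
    target_q = batch.reward + gamma * (1.0 - batch.done) * torch.min(q1, q2)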

2.3 Improvement 2: Delayed Policy Updates

Problem: Updating the Actor based on an inaccurate Critic causes policy oscillation.

Solution: Update the Actor only every \(d\) Critic updates (typically \(d=2\)).

Rationale: Let the Critic stabilize first, then use more accurate Q-values to guide Actor updates.

for step in range(total_steps):
    # Update Critic every step
    update_critic()

    # Update Actor and targets every d steps
    if step % d == 0:
        update_actor()
        soft_update_targets()

2.4 Improvement 3: Target Policy Smoothing

Problem: Target Q-values overfit to specific actions.

Solution: Add clipped noise to target actions:

\[\tilde{a}' = \mu_{\theta'}(s') + \text{clip}(\epsilon, -c, c), \quad \epsilon \sim \mathcal{N}(0, \tilde{\sigma})\]

Rationale: Similar to regularization, making the Q-function smoother in action space and reducing overfitting to individual actions.
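A short sketch of the smoothed target action (variable names follow the pseudocode in Section 2.6; `max_action` is an assumed scalar action bound):

with torch.no_grad():
    # Clipped Gaussian noise added to the target policy's action
    noise = (torch.randn_like(batch.action) * sigma_tilde).clamp(-c, c)
    target_action = (target_actor(batch.next_state) + noise).clamp(-max_action, max_action)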

2.5 TD3 Complete Algorithm

\[\text{Target: } y = r + \gamma \min_{i=1,2} Q_{\phi'_i}(s', \tilde{a}')\]

where:

\[\tilde{a}' = \mu_{\theta'}(s') + \text{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)\]

Critic loss (two Critics updated separately):

\[\mathcal{L}(\phi_i) = \mathbb{E} \left[ \left( Q_{\phi_i}(s,a) - y \right)^2 \right], \quad i=1,2\]

Actor loss (updated every \(d\) steps):

\[J(\theta) = -\mathbb{E}_{s \sim \mathcal{D}} \left[ Q_{\phi_1}(s, \mu_\theta(s)) \right]\]

2.6 TD3 Pseudocode

# Initialization
actor = Actor(state_dim, action_dim)
critic_1 = Critic(state_dim, action_dim)
critic_2 = Critic(state_dim, action_dim)
target_actor = copy(actor)
target_critic_1 = copy(critic_1)
target_critic_2 = copy(critic_2)

for step in range(total_steps):
    # Select action + exploration noise
    action = actor(state) + N(0, sigma)

    # Interact and store
    next_state, reward, done = env.step(action)
    replay_buffer.add(state, action, reward, next_state, done)
    batch = replay_buffer.sample(batch_size)

    # Compute target (Target Policy Smoothing)
    noise = clip(N(0, sigma_tilde), -c, c)
    target_action = clip(target_actor(batch.next_state) + noise,
                         action_low, action_high)

    # Clipped Double Q-Learning
    target_q1 = target_critic_1(batch.next_state, target_action)
    target_q2 = target_critic_2(batch.next_state, target_action)
    target_q = batch.reward + gamma * min(target_q1, target_q2) * (1 - batch.done)

    # Update both Critics
    loss_1 = MSE(critic_1(batch.state, batch.action), target_q)
    loss_2 = MSE(critic_2(batch.state, batch.action), target_q)
    update(critic_1, loss_1)
    update(critic_2, loss_2)

    # Delayed Policy Update
    if step % policy_delay == 0:
        actor_loss = -critic_1(batch.state, actor(batch.state)).mean()
        update(actor, actor_loss)

        soft_update(target_actor, actor, tau)
        soft_update(target_critic_1, critic_1, tau)
        soft_update(target_critic_2, critic_2, tau)

2.7 Hyperparameters

| Hyperparameter | Typical Value | Description |
| --- | --- | --- |
| \(\gamma\) | 0.99 | Discount factor |
| \(\tau\) | 0.005 | Target network soft update rate |
| \(\sigma\) | 0.1 | Exploration noise std |
| \(\tilde{\sigma}\) | 0.2 | Target policy smoothing noise |
| \(c\) | 0.5 | Noise clipping range |
| \(d\) | 2 | Policy update delay |
| batch size | 256 | Batch size |
| buffer size | \(10^6\) | Replay buffer size |
| lr (actor) | \(3 \times 10^{-4}\) | Actor learning rate |
| lr (critic) | \(3 \times 10^{-4}\) | Critic learning rate |

3. DDPG vs TD3 vs SAC Comparison

3.1 Core Differences

| Feature | DDPG | TD3 | SAC |
| --- | --- | --- | --- |
| Number of Critics | 1 | 2 | 2 |
| Policy type | Deterministic | Deterministic | Stochastic (max entropy) |
| Q-value estimation | Tends to overestimate | Clipped minimum | Clipped minimum |
| Policy update frequency | Every step | Delayed (every \(d\) steps) | Every step |
| Target smoothing | None | Yes (added noise) | None (entropy regularization) |
| Exploration mechanism | External noise (OU/Gaussian) | External noise (Gaussian) | Entropy maximization (intrinsic) |
| Temperature parameter | None | None | \(\alpha\) (auto-tunable) |
| Training stability | Poor | Good | Best |

3.2 Performance Comparison

Typical performance on MuJoCo continuous control benchmarks:

| Environment | DDPG | TD3 | SAC |
| --- | --- | --- | --- |
| HalfCheetah | ~8,000 | ~10,000 | ~11,000 |
| Ant | ~1,000 | ~4,500 | ~5,500 |
| Walker2d | ~2,000 | ~4,500 | ~5,000 |
| Humanoid | ~500 | ~5,000 | ~6,000 |

Note

The above values are typical references; actual performance varies significantly with hyperparameters and random seeds.

3.3 Selection Guidelines

Algorithm selection for continuous control:
  ├── Need most stable training → SAC
  ├── Need simple implementation → TD3
  ├── Need deterministic policy → TD3 / DDPG
  ├── Need automatic exploration tuning → SAC
  └── As baseline → TD3 (simple, effective)

4. Practical Tips

4.1 Common Issues and Solutions

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Q-value explosion | Overestimation | Use TD3 / double Q-networks |
| Training doesn't converge | Learning rate too high | Lower learning rate, increase batch size |
| Policy oscillation | Actor-Critic out of sync | Delayed policy updates |
| Insufficient exploration | Noise too small | Increase noise, or use SAC |
| Actions out of range | Missing clipping | Use tanh + scaling in the output layer |

4.2 Network Architecture Recommendations

import torch
import torch.nn as nn

# Actor Network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
            nn.Tanh()  # Output [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

# Critic Network
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
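For TD3 it is convenient to wrap the two Q-networks in a single module so they can be evaluated in one call and optimized together; a sketch under that assumption (the class name `TwinCritic` is illustrative):

class TwinCritic(nn.Module):
    """Two independent Critics sharing one interface (TD3-style)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = Critic(state_dim, action_dim)
        self.q2 = Critic(state_dim, action_dim)

    def forward(self, state, action):
        # Returns both Q-estimates; take the element-wise min for targets
        return self.q1(state, action), self.q2(state, action)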

4.3 Training Tips

  1. Normalize observations: Apply running mean/variance normalization to states (see the sketch after this list)
  2. Reward scaling: Scale rewards to a reasonable range
  3. Warm-up period: Use random policy for the first N steps to fill the replay buffer
  4. Gradient clipping: Prevent gradient explosion
  5. Multi-seed evaluation: Run multiple random seeds and average results
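A minimal sketch of the running observation normalization mentioned in tip 1 (a Welford/Chan-style running estimate; the class name `RunningNorm` is illustrative):

import numpy as np

class RunningNorm:
    """Tracks running mean/variance of observations and normalizes them."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        # x: batch of observations with shape (batch, *shape)
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)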

5. From TD3 to More Advanced Methods

5.1 Development Lineage

DPG (2014)
  → DDPG (2015): + Deep Networks + Target Net + Replay
    → TD3 (2018): + Double Q + Delay + Smoothing
      → SAC (2018): + Maximum Entropy + Stochastic Policy
        → DrQ (2021): + Data Augmentation
          → RLPD (2023): + Pre-training Data

5.2 Key Differences with SAC

SAC uses a stochastic policy and maximum entropy framework:

\[J(\theta) = \mathbb{E}\left[\sum_t r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right]\]

Compared to TD3, SAC automatically balances exploration and exploitation through entropy regularization, typically offering greater stability. See SAC Algorithm Details for more.


References

  • Lillicrap, T. et al. (2016). Continuous control with deep reinforcement learning. ICLR 2016.
  • Silver, D. et al. (2014). Deterministic Policy Gradient Algorithms. ICML 2014.
  • Fujimoto, S. et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
  • Haarnoja, T. et al. (2018). Soft Actor-Critic. ICML 2018.
