SAC
DQN solved the problem of discrete action spaces, prompting researchers to tackle the more realistic challenge of continuous control. Today, continuous control is largely dominated by two algorithms: PPO (on-policy) and SAC (off-policy). However, the early development of this field went through many iterations. This chapter focuses on SAC as the culminating algorithm and briefly covers the most essential concepts along the off-policy trajectory.
SAC (Soft Actor-Critic), proposed by Haarnoja et al. in 2018, is one of the most powerful off-policy algorithms for continuous control. Its core innovation is incorporating the Maximum Entropy principle into the Actor-Critic framework, so that the policy not only pursues high returns but also maintains high stochasticity (high entropy). This leads to better exploration, stronger robustness, and smoother training.
Development History
The evolutionary lineage of SAC's off-policy branch is:

DQN (2013) → DPG (2014) → DDPG (2015) → TD3 (2018) → SAC (2018)
It is important to note that up through TD3, the approach follows the Deterministic Policy Gradient route. SAC fundamentally belongs to a different theoretical lineage than TD3:
- TD3 is based on deterministic policy gradient
- SAC is based on maximum entropy RL
However, in modern engineering practice, SAC incorporates many of TD3's engineering techniques (such as Twin Q-networks) to form a more mature implementation.
DPG - Deterministic Policy Gradient
In standard policy gradient methods, the policy \(\pi_\theta(a|s)\) is a Stochastic Policy — given a state \(s\), it outputs a probability distribution over actions and samples an action from it.
In 2014, Silver et al. proposed the Deterministic Policy Gradient (DPG) theorem. A deterministic policy \(\mu_\theta(s)\) directly outputs a specific action value without any sampling:

\[
a = \mu_\theta(s)
\]
The DPG theorem proves that the gradient of a deterministic policy's objective is:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)} \, \nabla_\theta \mu_\theta(s)\right]
\]
Intuition: The meaning of this gradient is as follows — first, the Q-network tells us "in which direction should we adjust the action to increase the Q-value" (\(\nabla_a Q\)), then the chain rule tells the policy network "to change the action output in that direction, how should the network parameters be adjusted" (\(\nabla_\theta \mu_\theta\)).
The advantage of DPG over stochastic policy gradients is that it does not require integration over the action space. Stochastic policy gradients need to compute expectations over all possible actions, which is extremely difficult in continuous, high-dimensional action spaces. Deterministic policy gradients only need to compute the gradient at the current action point, making them far more computationally efficient.
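To make this concrete, here is a minimal PyTorch sketch of the resulting actor update, assuming hypothetical `actor` and `critic` modules (playing the roles of \(\mu_\theta\) and \(Q_\phi\)); maximizing \(Q(s, \mu_\theta(s))\) is implemented as minimizing its negation:

```python
import torch

# Minimal sketch of the DPG actor update (hypothetical `actor`/`critic` modules).
# actor(s) -> action tensor, critic(s, a) -> Q-value tensor.
def dpg_actor_update(actor, critic, actor_optimizer, states):
    actions = actor(states)                      # a = mu_theta(s), differentiable
    # Ascend on Q(s, mu_theta(s)): autograd applies the chain rule
    # grad_a Q * grad_theta mu_theta automatically.
    actor_loss = -critic(states, actions).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```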
DDPG - Deep Deterministic Policy Gradient
DDPG was proposed by Lillicrap et al. in 2015 ("Continuous control with deep reinforcement learning") and is the result of combining DPG with deep learning. It can be understood as "DQN for continuous action spaces."
Core components of DDPG:
- Actor network \(\mu_\theta(s)\): Takes a state as input and directly outputs a deterministic action vector
- Critic network \(Q_\phi(s, a)\): Takes a state-action pair as input and outputs a Q-value
- Target networks: Both the Actor and Critic have target network copies for training stability (identical in concept to the Target Network in DQN)
- Replay Buffer: Stores \((s, a, r, s', \text{done})\) tuples for random sampling during training
- Exploration noise: Since a deterministic policy has no inherent randomness, noise must be manually added to actions for exploration
DDPG training logic:
- Critic update: Minimize the TD error, with target values computed by the target networks
- Actor update: Adjust the policy in the direction of increasing Q-values
- Target network soft update: Instead of periodic hard copies as in DQN, an exponential moving average is used for gradual updates:

\[
\phi^- \leftarrow \tau\phi + (1-\tau)\phi^-, \qquad \theta^- \leftarrow \tau\theta + (1-\tau)\theta^-
\]

where \(\tau\) is typically a small value such as \(0.005\).
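The soft update is a one-liner in practice. A minimal sketch, assuming `net` and `target_net` are `torch.nn.Module` copies with identical architectures:

```python
import torch

# Minimal sketch of the soft (Polyak) target update.
@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        # phi_target <- tau * phi + (1 - tau) * phi_target
        target_param.mul_(1.0 - tau).add_(tau * param)
```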
Problems with DDPG:
- Q-value overestimation: Like DQN, maximizing over an approximate Q-function causes systematic overestimation of Q-values; in DDPG the Actor performs this maximization implicitly, and the problem is even more severe than in DQN
- Extreme sensitivity to hyperparameters: Slight misconfigurations of learning rate, noise magnitude, batch size, etc. can cause training failure
- Highly unstable training: Policy collapse and Q-value divergence occur frequently
- Fragile exploration: Gaussian noise as an exploration mechanism is primitive and performs poorly in complex environments
TD3 - Twin Delayed DDPG
TD3 was proposed by Fujimoto et al. in 2018 and addresses DDPG's issues through three simple yet effective techniques:
Technique 1: Twin Q-Networks
Two independent Critic networks \(Q_{\phi_1}\) and \(Q_{\phi_2}\) are trained simultaneously, and the smaller of the two values is used when computing the target:

\[
y = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_i^-}(s', a')
\]
The idea of taking the minimum is directly inspired by Double Q-learning: since a single Q-network tends to overestimate, training two independent Q-networks and taking the smaller value effectively suppresses overestimation. This does not completely eliminate the bias but shifts it from "overestimation" to "slight underestimation," which is much safer.
Technique 2: Delayed Policy Updates
The Actor network is updated less frequently than the Critic network. Typically, the Actor is updated once for every two Critic updates.
The rationale is that the Actor relies on the Critic's Q-values to guide its updates. If the Critic has not yet stabilized, the Actor is effectively learning from an unreliable "teacher." Therefore, letting the Critic train for a few more steps first — until Q-value estimates become more accurate — before updating the Actor leads to better results.
Technique 3: Target Policy Smoothing
When computing target Q-values, clipped noise is added to the target policy's actions:

\[
a' = \mu_{\theta^-}(s') + \epsilon, \qquad \epsilon \sim \operatorname{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)
\]
This effectively performs a local averaging of target Q-values, preventing the Critic from producing inflated Q-value estimates at sharp peaks.
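The pieces fit together in the target computation. A sketch combining Techniques 1 and 3, assuming hypothetical `target_actor`, `target_q1`, `target_q2` modules and actions bounded in \([-1, 1]\):

```python
import torch

# Sketch of the TD3 target: clipped-noise smoothing plus twin-Q minimum.
@torch.no_grad()
def td3_target(target_actor, target_q1, target_q2, r, s_next, done,
               gamma=0.99, sigma=0.2, noise_clip=0.5):
    a_next = target_actor(s_next)
    # Technique 3: clipped Gaussian noise smooths the target policy
    noise = (torch.randn_like(a_next) * sigma).clamp(-noise_clip, noise_clip)
    a_next = (a_next + noise).clamp(-1.0, 1.0)
    # Technique 1: take the smaller of the twin target Q-values
    q_next = torch.min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```

Technique 2 (delayed policy updates) lives in the training loop rather than the target: the actor and target networks are updated only every second critic update.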
TD3 substantially improves upon DDPG's stability and performance through these three techniques. SAC directly borrows TD3's Twin Q-Networks technique in its practical implementation.
Theoretical SAC
SAC's theoretical foundation lies not in deterministic policy gradients but in Maximum Entropy Reinforcement Learning — an entirely different theoretical framework.
Maximum Entropy Reinforcement Learning
Standard RL objective:

\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t r_t\right]
\]
Standard RL cares about one thing only: maximizing cumulative reward.
Maximum entropy RL objective: Maximize cumulative reward while simultaneously maximizing the entropy of the policy:

\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]
\]
where \(\alpha > 0\) is the temperature parameter, controlling the importance of entropy; \(\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s_t)]\) is the entropy of the policy at state \(s_t\).
Why add entropy? This is not an arbitrary decision but is motivated by deep theoretical and practical considerations:
1. Exploration
Deterministic policies in standard RL (such as DDPG) require manually added exploration noise, and the design of this noise is quite delicate. Maximum entropy RL uses the entropy term in the objective function to make the policy inherently stochastic, enabling automatic exploration. More importantly, this exploration is "meaningful" — the policy maintains a uniform distribution among actions with similar Q-values, rather than simply adding task-irrelevant Gaussian noise.
2. Robustness
In many real-world tasks, the environment contains uncertainty (e.g., variations in friction coefficients, sensor noise). A deterministic policy may exploit a specific feature of the training environment and fail when the environment changes even slightly. A maximum entropy policy tends to learn all feasible solutions rather than relying on just one, making it more robust to environmental perturbations.
3. Multi-modality
Many tasks have multiple equivalent optimal policies. For instance, going around an obstacle can be done from the left or from the right. A deterministic policy can only learn one of them, whereas a maximum entropy policy can maintain multiple solutions simultaneously, assigning each an appropriate probability. This is especially valuable when the policy is used as initialization for subsequent tasks (transfer learning).
4. Smoother Optimization Landscape
The addition of the entropy term effectively "softens" the objective function. Originally sharp local optima are smoothed out, making the optimization process more stable.
Soft Bellman Equation
Under the maximum entropy framework, the traditional Bellman equation requires corresponding modifications.
Standard Bellman equation:

\[
Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'}\left[\max_{a'} Q(s', a')\right]
\]
Soft Bellman equation: The \(\max\) is replaced by a \(\text{softmax}\) (i.e., log-sum-exp), and an entropy term is incorporated:

\[
Q_{\text{soft}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'}\left[V_{\text{soft}}(s')\right]
\]

where the soft value function is defined as:

\[
V_{\text{soft}}(s) = \alpha \log \sum_a \exp\left(Q_{\text{soft}}(s, a) / \alpha\right)
\]
Intuition: In the standard Bellman equation, we greedily select the action with the highest Q-value (\(\max\)). In the Soft Bellman equation, we no longer pick the single best action but instead use a "soft" approach: all actions have a chance of being selected — actions with higher Q-values are more likely to be chosen, but actions with lower Q-values are not entirely excluded. This "soft" selection is realized through the policy's entropy.
Soft Q-Learning
With the Soft Bellman equation in hand, we can derive the optimal policy under the maximum entropy framework.
Under the Soft Bellman equation, the optimal policy satisfies:

\[
\pi^*(a|s) = \frac{\exp\left(Q_{\text{soft}}^*(s, a) / \alpha\right)}{Z(s)}
\]
where \(Z(s) = \sum_a \exp(Q_{\text{soft}}^*(s, a) / \alpha)\) is the partition function, ensuring probability normalization.
This form is a Boltzmann distribution (also known as a Gibbs distribution). The temperature parameter \(\alpha\) controls the "sharpness" of the distribution:
- As \(\alpha \to 0\): the policy degenerates into a greedy policy (selecting only the highest Q-value action), recovering standard RL
- As \(\alpha \to \infty\): the policy approaches a uniform distribution (all actions equally probable), becoming completely random
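The temperature's effect is easy to see numerically. Below is a toy sketch for a hypothetical discrete case (three actions with made-up Q-values); the continuous case replaces the sum with an integral:

```python
import numpy as np

# Toy numeric illustration: the soft value is a log-sum-exp of Q-values,
# and the optimal soft policy is a Boltzmann distribution over them.
q = np.array([1.0, 0.9, -2.0])   # hypothetical Q-values for three actions

for alpha in (0.01, 0.5, 10.0):
    v_soft = alpha * np.log(np.sum(np.exp(q / alpha)))  # soft value
    pi = np.exp(q / alpha) / np.sum(np.exp(q / alpha))  # Boltzmann policy
    print(f"alpha={alpha}: V_soft={v_soft:.3f}, pi={np.round(pi, 3)}")
# alpha -> 0: pi concentrates on argmax(q), V_soft -> max(q) (standard RL);
# alpha -> inf: pi approaches the uniform distribution.
```

Note how at small \(\alpha\) the two nearly-equal Q-values (1.0 and 0.9) still share probability mass, while the clearly bad action gets almost none: this is the "meaningful exploration" discussed earlier.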
Soft Q-Learning (2017, Haarnoja et al.) is the algorithm proposed on this theoretical foundation. However, Soft Q-Learning faces computational difficulties in high-dimensional continuous action spaces — the partition function \(Z(s)\) requires integration over the continuous action space, which cannot be computed exactly.
This is the core problem that SAC aims to solve.
Soft Actor-Critic
The key insight of SAC (2018, Haarnoja et al.) is: there is no need to explicitly compute the partition function; instead, a parameterized policy network can be trained to approximate the optimal soft policy.
SAC simultaneously trains three groups of networks:
- Two Q-networks (Twin Critics): \(Q_{\phi_1}(s, a)\) and \(Q_{\phi_2}(s, a)\), borrowing TD3's Twin technique
- Policy network (Actor): \(\pi_\theta(a|s)\), which outputs a Gaussian distribution (mean and standard deviation)
- (In SAC v1 only) Value network: \(V_\psi(s)\), removed in SAC v2
Soft Policy Evaluation (Q-network update):
The Q-networks aim to converge to the fixed point of the Soft Bellman equation. The loss function is:

\[
L(\phi_i) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\left[\left(Q_{\phi_i}(s, a) - y\right)^2\right]
\]

where the target value \(y\) is:

\[
y = r + \gamma (1 - d) \left(\min_{j=1,2} Q_{\phi_j^-}(s', a') - \alpha \log \pi_\theta(a'|s')\right), \qquad a' \sim \pi_\theta(\cdot|s')
\]
Note the difference from the standard TD target: there is an additional \(-\alpha \log \pi_\theta(a'|s')\) term, which is the entropy reward. The action \(a'\) is sampled from the current policy rather than computed by a deterministic policy.
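A sketch of this target in PyTorch, assuming a hypothetical `policy.sample(s)` that returns an action and its log-probability, plus target critics `target_q1`/`target_q2`:

```python
import torch

# Sketch of the SAC critic target (hypothetical module names).
@torch.no_grad()
def sac_q_target(policy, target_q1, target_q2, r, s_next, done, alpha, gamma=0.99):
    a_next, log_pi_next = policy.sample(s_next)     # a' ~ pi_theta(.|s')
    q_next = torch.min(target_q1(s_next, a_next),
                       target_q2(s_next, a_next))   # twin-Q minimum (from TD3)
    # The entropy bonus -alpha * log pi(a'|s') is added to the bootstrapped value
    return r + gamma * (1.0 - done) * (q_next - alpha * log_pi_next)
```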
Soft Policy Improvement (Policy network update):
The policy network's objective is to minimize the KL divergence between the policy and the Boltzmann distribution induced by the Soft Q-function, which is equivalent to minimizing:

\[
J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{a \sim \pi_\theta}\left[\alpha \log \pi_\theta(a|s) - Q(s, a)\right]\right]
\]
where \(Q(s, a) = \min_{i=1,2} Q_{\phi_i}(s, a)\) takes the smaller value from the two Q-networks.
Intuition: This loss function accomplishes two things — (1) it encourages the policy to select actions with high Q-values (the \(-Q(s,a)\) term); (2) it simultaneously maintains the policy's stochasticity (the \(\alpha \log \pi_\theta\) term, i.e., negative entropy). The balance between these two objectives is controlled by \(\alpha\).
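In code this loss is compact. A sketch reusing the hypothetical `policy.sample` and the (non-target) twin critics `q1`, `q2` from above:

```python
import torch

# Sketch of the SAC policy loss.
def sac_policy_loss(policy, q1, q2, s, alpha):
    a, log_pi = policy.sample(s)           # reparameterized sample (see below)
    q = torch.min(q1(s, a), q2(s, a))      # smaller of the twin Q-values
    # alpha * log_pi pulls entropy up; -q pulls expected return up
    return (alpha * log_pi - q).mean()
```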
Reparameterization Trick:
Since the action \(a\) is sampled from the policy \(\pi_\theta\), it is impossible to directly differentiate through the sampling operation (sampling is non-differentiable). SAC uses the reparameterization trick to address this:

\[
a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
\]

The policy network outputs a mean \(\mu_\theta(s)\) and standard deviation \(\sigma_\theta(s)\), then transforms standard normal noise \(\epsilon\) through an affine transformation to produce the action \(a\).
The elegance of reparameterization lies in: transferring the randomness from the policy network to the external noise \(\epsilon\). This way, given a fixed \(\epsilon\), \(a\) is a deterministic, differentiable function of \(\theta\), allowing standard backpropagation. This is identical to the reparameterization trick used in VAEs.
Since actions in continuous action spaces typically have bounded ranges (e.g., \([-1, 1]\)), SAC applies a \(\tanh\) squashing function to the Gaussian distribution's output:

\[
a = \tanh(u), \qquad u = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon
\]
This requires a corresponding Jacobian correction to the log-probability:

\[
\log \pi_\theta(a|s) = \log \mathcal{N}\left(u; \mu_\theta(s), \sigma_\theta(s)^2\right) - \sum_{i=1}^{D} \log\left(1 - \tanh^2(u_i)\right)
\]

where \(u = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon\) is the pre-\(\tanh\) value and \(D\) is the action dimensionality.
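Putting the reparameterization, squashing, and Jacobian correction together, a sketch of the sampling routine, assuming the policy network has produced `mu` and `log_std` tensors for a batch of states:

```python
import torch

# Sketch of reparameterized sampling with tanh squashing.
def sample_squashed_gaussian(mu, log_std):
    std = log_std.exp()
    eps = torch.randn_like(mu)                 # epsilon ~ N(0, I)
    u = mu + std * eps                         # reparameterized pre-tanh value
    a = torch.tanh(u)                          # squash into (-1, 1)
    # Gaussian log-density of u minus the tanh Jacobian correction,
    # summed over the D action dimensions (1e-6 avoids log(0))
    log_pi = torch.distributions.Normal(mu, std).log_prob(u) \
             - torch.log(1.0 - a.pow(2) + 1e-6)
    return a, log_pi.sum(dim=-1)
```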
SAC v2 - Automatic Temperature Tuning
In SAC v1, the temperature parameter \(\alpha\) is a hyperparameter that must be manually tuned. This is inconvenient because:
- The optimal \(\alpha\) varies greatly across tasks
- Even within the same task, different training phases may require different values of \(\alpha\) (more exploration early on, more precise policies later)
SAC v2 (2019, Haarnoja et al., "Soft Actor-Critic Algorithms and Applications") introduced two key modifications:
Modification 1: Automatic Entropy Tuning
The temperature adjustment is formulated as a constrained optimization problem: maximize expected return subject to a minimum entropy constraint:

\[
\max_\pi \, \mathbb{E}\left[\sum_t r_t\right] \quad \text{s.t.} \quad \mathbb{E}_{a \sim \pi}\left[-\log \pi(a|s_t)\right] \geq \mathcal{H}_0 \;\; \forall t
\]
where \(\mathcal{H}_0\) is the target entropy, a lower bound on the policy's entropy. For continuous action spaces, a common setting is:

\[
\mathcal{H}_0 = -\dim(\mathcal{A})
\]

That is, the target entropy equals the negative of the action space dimensionality. For example, if the action space is 6-dimensional, the target entropy is \(-6\).
Through dual transformation, the loss function for \(\alpha\) is:

\[
J(\alpha) = \mathbb{E}_{a \sim \pi_\theta}\left[-\alpha \left(\log \pi_\theta(a|s) + \mathcal{H}_0\right)\right]
\]
Intuition:
- If the current policy's entropy \(-\log \pi > \mathcal{H}_0\) (the policy is sufficiently random), \(\alpha\) decreases, relaxing the randomness requirement so the policy can focus more on returns
- If the current policy's entropy \(-\log \pi < \mathcal{H}_0\) (the policy is too deterministic), \(\alpha\) increases, forcing the policy to become more random
This achieves automatic temperature regulation: early in training, when the policy naturally has high randomness, \(\alpha\) tends to be small; later in training, as the policy becomes more deterministic, \(\alpha\) automatically increases to maintain necessary exploration.
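A sketch of this update, where `action_dim` is illustrative and `log_pi` is the log-probability returned by the policy sample above; learning \(\log \alpha\) keeps \(\alpha\) positive without explicit constraints:

```python
import torch

# Sketch of the automatic temperature update.
action_dim = 6                                    # e.g., a 6-dimensional action space
target_entropy = -float(action_dim)               # H_0 = -dim(A)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi):
    # Only log_alpha receives gradients; log_pi is detached (treated as data)
    alpha_loss = -(log_alpha.exp() * (log_pi.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                 # current alpha
```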
Modification 2: Removal of the Value Network
SAC v1 included a separate Value Network \(V_\psi(s)\). SAC v2 found this to be redundant, since the Soft Value can be derived directly from the Q-networks and the policy network:

\[
V(s) = \mathbb{E}_{a \sim \pi_\theta}\left[\min_{i=1,2} Q_{\phi_i}(s, a) - \alpha \log \pi_\theta(a|s)\right]
\]
Removing the Value Network simplifies the architecture, reduces the number of parameters to maintain, and leads to more stable training. Target networks only need to perform soft updates on the Q-networks.
Modern SAC
Integrating all components, the complete training procedure of modern SAC (i.e., SAC v2) is as follows:
Network Architecture:
| Network | Input | Output | Count |
|---|---|---|---|
| Q-network \(Q_{\phi_i}\) | \((s, a)\) concatenated | Scalar Q-value | 2 (Twin) |
| Target Q-network \(Q_{\phi_i^-}\) | \((s, a)\) concatenated | Scalar Q-value | 2 (soft update) |
| Policy network \(\pi_\theta\) | \(s\) | \((\mu, \log\sigma)\) | 1 |
| Temperature parameter \(\log \alpha\) | - | - | 1 learnable scalar |
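A minimal PyTorch sketch of these networks, using two hidden layers of width 256 to match the recommended hyperparameters below (module and head names are illustrative):

```python
import torch
import torch.nn as nn

# Sketch of the SAC networks from the table above.
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                        # scalar Q-value

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)       # mean
        self.log_std_head = nn.Linear(hidden, action_dim)  # log standard deviation

    def forward(self, s):
        h = self.trunk(s)
        # Clamping log_std is a common stabilization trick (range is a convention)
        return self.mu_head(h), self.log_std_head(h).clamp(-20, 2)
```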
Training Pseudocode:
SAC Algorithm (Modern Version):
1. Initialize Q_ϕ1, Q_ϕ2, π_θ, log α
2. Initialize targets: Q_ϕ1^- = Q_ϕ1, Q_ϕ2^- = Q_ϕ2
3. Initialize replay buffer D = {}
4. for each environment step do
5. a ~ π_θ(·|s) # Sample action from policy
6. s', r, done = env.step(a) # Interact with environment
7. D ← D ∪ {(s, a, r, s', done)} # Store in replay buffer
8. # Sample a batch from the replay buffer
9. (s, a, r, s', done) ~ D
10. # ---- Update Q-networks ----
11. a' ~ π_θ(·|s') # Sample next action with current policy
12. y = r + γ(1-done) * (min Q_ϕi^-(s',a') - α log π_θ(a'|s'))
13. Update ϕ1, ϕ2 to minimize (Q_ϕi(s,a) - y)^2
14. # ---- Update policy network ----
15. a_new ~ π_θ(·|s) # Reparameterized sampling
16. Update θ to minimize α log π_θ(a_new|s) - min Q_ϕi(s, a_new)
17. # ---- Update temperature ----
18. Update α to minimize -α(log π_θ(a_new|s) + H_0)
19. # ---- Soft update target networks ----
20. ϕi^- ← τϕi + (1-τ)ϕi^-
21. end for
Key Formulas Summary
| Formula Name | Mathematical Expression |
|---|---|
| Maximum entropy objective | \(J(\pi) = \mathbb{E}\left[\sum_t \gamma^t (r_t + \alpha \mathcal{H}(\pi(\cdot\|s_t)))\right]\) |
| Q-network target | \(y = r + \gamma(\min_{j} Q_{\phi_j^-}(s', a') - \alpha \log \pi(a'\|s'))\) |
| Q-network loss | \(L(\phi_i) = \mathbb{E}[(Q_{\phi_i}(s,a) - y)^2]\) |
| Policy loss | \(J_\pi(\theta) = \mathbb{E}_s[\mathbb{E}_{a \sim \pi}[\alpha \log \pi(a\|s) - \min_i Q_{\phi_i}(s,a)]]\) |
| Temperature loss | \(J(\alpha) = \mathbb{E}[-\alpha(\log \pi(a\|s) + \mathcal{H}_0)]\) |
| Soft Value | \(V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a) - \alpha \log \pi(a\|s)]\) |
| Reparameterization | \(a = \tanh(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon)\) |
| Target soft update | \(\phi^- \leftarrow \tau\phi + (1-\tau)\phi^-\) |
Recommended Hyperparameters
| Hyperparameter | Symbol | Typical Value | Description |
|---|---|---|---|
| Learning rate | \(\alpha_{lr}\) | \(3 \times 10^{-4}\) | Shared across Actor/Critic/Alpha |
| Discount factor | \(\gamma\) | \(0.99\) | Discount for future rewards |
| Soft update coefficient | \(\tau\) | \(0.005\) | Target network update rate |
| Target entropy | \(\mathcal{H}_0\) | \(-\dim(\mathcal{A})\) | Target for automatic temperature tuning |
| Batch size | \(B\) | \(256\) | Number of samples per replay buffer draw |
| Replay buffer size | \(\|\mathcal{D}\|\) | \(10^6\) | Maximum number of stored transitions |
| Network width | - | \(256\) | Number of neurons per hidden layer |
| Network depth | - | \(2\) | Number of hidden layers |
| Learning start steps | - | \(10^4\) | Random data collected before training begins |
| Update frequency | - | 1 per step | One update per environment interaction step |
A major advantage of SAC is that its hyperparameters rarely need tuning. The default values listed above achieve excellent performance on the vast majority of continuous control tasks (MuJoCo, etc.) right out of the box. This stands in stark contrast to DDPG/TD3, which require careful hyperparameter tuning.
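As a quick check of this claim, a usage sketch with the Stable-Baselines3 implementation (assuming `stable-baselines3` and `gymnasium` are installed); its SAC defaults closely match the table above:

```python
from stable_baselines3 import SAC

# Train SAC on a standard continuous-control task with default hyperparameters.
model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=50_000)
```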
SAC vs PPO vs DQN
| Comparison Dimension | DQN | PPO | SAC |
|---|---|---|---|
| Year | 2013/2015 | 2017 | 2018 |
| Action space | Discrete | Discrete/Continuous | Continuous (native) |
| Policy type | Implicit (greedy over Q-values) | Stochastic policy | Stochastic policy (max entropy) |
| On/Off-Policy | Off-policy | On-policy | Off-policy |
| Sample efficiency | Medium | Low | High |
| Training stability | Medium (requires Target Net) | High (clipping protection) | High (entropy regularization + Twin Q) |
| Core networks | Q-network + Target Q | Actor + Critic | Twin Q + Actor + \(\alpha\) |
| Experience replay | Required | Not required | Required |
| Exploration mechanism | \(\epsilon\)-greedy | Inherent policy stochasticity | Maximum entropy + policy stochasticity |
| Tuning difficulty | Medium | Easy | Very easy |
| Parallelization | Not well-suited | Very well-suited (multi-env parallel) | Not well-suited |
| Typical applications | Atari games and other discrete tasks | RLHF, game AI, robotics | Robot control, continuous control |
| Mathematical foundation | Bellman equation + function approximation | Policy gradient + trust region | Maximum entropy RL + policy gradient |
Selection Guidelines:
- Discrete action spaces (e.g., board games, Atari): DQN family or PPO
- Continuous control + abundant samples (e.g., simulation environments): PPO (stable, easily parallelizable)
- Continuous control + limited samples (e.g., real robots): SAC (high sample efficiency)
- Maximum robustness with minimal tuning: SAC (automatic temperature tuning, good default hyperparameters)
- LLM RLHF: PPO (discrete token space, easily parallelizable, mature tooling ecosystem)