
TD(lambda)

In previous chapters, we learned about TD(0), which updates based on looking just one step ahead, and MC, which waits until the end of an entire episode before updating. These two methods represent opposite extremes. The central question of this chapter is: Is there a method that can smoothly interpolate between these two extremes? The answer is TD(lambda), which unifies TD(0) and MC within a single framework through the mechanism of eligibility traces.

From n-step to lambda-return

Recap of n-step Returns

In TD(0), we construct the target value using one-step bootstrapping:

\[ G_t^{(1)} = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \]

A natural generalization is to extend this to \(n\) steps:

\[ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}) \]

That is, we use actual rewards for the first \(n\) steps, and the estimated value \(\hat{v}(S_{t+n}, \mathbf{w})\) stands in for all rewards beyond step \(n\). When \(n \to \infty\) (all the way to the end of the episode), \(G_t^{(\infty)} = G_t\), which reduces to the MC return.
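As a minimal sketch (all names hypothetical), the n-step return for a recorded episode might be computed as follows, with the value estimates held fixed:

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n actual rewards, then bootstrap with the estimated value of S_{t+n}.

    rewards[k] holds R_{k+1}; values[k] holds v_hat(S_k); values[T] = 0.0 for the terminal state.
    """
    T = len(rewards)
    n = min(n, T - t)                        # cannot look past the end of the episode
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * values[t + n]      # bootstrap term (zero when S_{t+n} is terminal)
```

With `n = T - t` the bootstrap term vanishes and this is exactly the full MC return, matching the limit described above.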

Different values of \(n\) perform differently across tasks. Sometimes \(n=1\) (TD(0)) works best; other times \(n=4\) or \(n=8\) is better. So the question arises: can we leverage all n-step returns simultaneously?

Definition of the lambda-return

The key idea behind the lambda-return is to take a weighted average of all n-step returns, with weights that decay geometrically:

\[ G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} \]

where \(\lambda \in [0, 1]\). The coefficient \((1-\lambda)\) is a normalization constant ensuring the weights sum to 1:

\[ (1-\lambda)(1 + \lambda + \lambda^2 + \dots) = (1-\lambda) \cdot \frac{1}{1-\lambda} = 1 \]

Intuitively:

  • \(G_t^{(1)}\) (1-step return) receives weight \((1-\lambda)\)
  • \(G_t^{(2)}\) (2-step return) receives weight \((1-\lambda)\lambda\)
  • \(G_t^{(3)}\) (3-step return) receives weight \((1-\lambda)\lambda^2\)
  • ...and so on, with weights decaying exponentially
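Continuing the hedged sketch above (and reusing the hypothetical `n_step_return` helper), the lambda-return for a finite episode can be formed by mixing the bootstrapping n-step returns and letting the leftover weight \(\lambda^{T-t-1}\) fall on the full return, which is the standard episodic convention:

```python
def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda: geometric mixture of n-step returns for a finite episode."""
    T = len(rewards)
    g = 0.0
    for n in range(1, T - t):                # n-step returns that still bootstrap
        g += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    g += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # full MC return
    return g
```

Setting `lam` to 0 or 1 recovers the two extremes discussed next.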

The Meaning of lambda

\(\lambda\) is a knob that controls "how far ahead to look":

  • \(\lambda = 0\): All weight is concentrated on \(G_t^{(1)}\), so \(G_t^0 = G_t^{(1)} = R_{t+1} + \gamma \hat{v}(S_{t+1})\). This is TD(0).
  • \(\lambda = 1\): In an episodic task, all of the weight ends up on the full return, so \(G_t^1 = G_t\) (the full MC return). This is Monte Carlo.
  • \(0 < \lambda < 1\): A compromise between TD(0) and MC. Short-horizon n-step returns receive larger weights, while long-horizon ones receive smaller weights.

In practice, commonly used values of \(\lambda\) fall between 0.8 and 0.95. A larger \(\lambda\) means greater reliance on actual returns (low bias, high variance), while a smaller \(\lambda\) means greater reliance on bootstrapping (high bias, low variance).


Eligibility Traces

The lambda-return is conceptually elegant, but computing \(G_t^\lambda\) directly requires knowledge of all future rewards — just like MC, it requires waiting until the end of the episode. Eligibility traces provide an online, incremental way to achieve an equivalent effect.

The core idea behind eligibility traces is to maintain an additional variable \(\mathbf{e}_t\) for each state (or each parameter component), which records "how recently and how frequently a state has been visited." If a state was visited recently, its eligibility trace is high, indicating that it bears significant "responsibility" for the current TD error; over time, the trace gradually decays.

Accumulating Traces

This is the most basic form of eligibility trace. The update rule at each step is:

\[ \mathbf{e}_t = \gamma \lambda \mathbf{e}_{t-1} + \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t) \]

with initial value \(\mathbf{e}_0 = \mathbf{0}\).

The meaning of this formula:

  • \(\gamma \lambda \mathbf{e}_{t-1}\): The old eligibility trace decays at a rate of \(\gamma\lambda\). \(\gamma\) comes from the discount factor, and \(\lambda\) from the weight decay of the lambda-return. Their product means that the further back a visit occurred, the less influence it has.
  • \(\nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t)\): The gradient of the current state is "accumulated" into the eligibility trace. If the same state is visited repeatedly, its eligibility trace grows larger and larger.

In the tabular case (one independent parameter per state), the accumulating trace simplifies to:

\[ e_t(s) = \begin{cases} \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = S_t \\ \gamma \lambda \, e_{t-1}(s) & \text{if } s \neq S_t \end{cases} \]
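As a minimal sketch of this tabular rule (hypothetical names; `e` is a NumPy vector with one entry per state), one step of the accumulating-trace update could look like:

```python
import numpy as np

def accumulating_trace_step(e, s, gamma, lam):
    """Tabular accumulating trace: every entry decays, and the visited state adds 1."""
    e = gamma * lam * e          # all states decay at rate gamma * lambda
    e[s] += 1.0                  # the visited state accumulates credit
    return e

# e = accumulating_trace_step(np.zeros(n_states), s=2, gamma=0.99, lam=0.9)
```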

Replacing Traces

The idea behind replacing traces is that when a state is revisited, instead of accumulating, the eligibility trace is reset directly to 1:

\[ e_t(s) = \begin{cases} 1 & \text{if } s = S_t \\ \gamma \lambda \, e_{t-1}(s) & \text{if } s \neq S_t \end{cases} \]

Replacing traces sometimes outperform accumulating traces because they prevent the eligibility trace from becoming excessively large due to repeated visits. Intuitively, when you revisit the same state repeatedly, it is more sensible to "restart the clock" rather than "stack up credit."
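The only change relative to the accumulating sketch above is a single line: the visited state's entry is overwritten instead of incremented (again a hedged sketch):

```python
def replacing_trace_step(e, s, gamma, lam):
    """Tabular replacing trace: every entry decays, but the visited state is reset to 1."""
    e = gamma * lam * e
    e[s] = 1.0                   # restart the clock rather than stacking up credit
    return e
```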

Dutch Traces

Dutch traces are a more computationally efficient form of eligibility trace, associated with the true online TD(lambda) algorithm of van Seijen and Sutton. The update rule is:

\[ \mathbf{e}_t = \gamma \lambda \mathbf{e}_{t-1} + (1 - \alpha \gamma \lambda \, \mathbf{e}_{t-1}^\top \mathbf{x}_t) \mathbf{x}_t \]

where \(\mathbf{x}_t = \mathbf{x}(S_t)\) is the feature vector of the current state and \(\alpha\) is the learning rate.
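A minimal sketch of just this trace update under linear function approximation (the full true-online algorithm also modifies the weight update, which is not shown here; `e` and `x_t` are NumPy vectors):

```python
def dutch_trace_step(e, x_t, alpha, gamma, lam):
    """Dutch trace: like the accumulating trace, but the increment is scaled down
    by how much of x_t the decayed trace already 'contains'."""
    return gamma * lam * e + (1.0 - alpha * gamma * lam * (e @ x_t)) * x_t
```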

The main advantage of Dutch traces is computational efficiency combined with exactness: under linear function approximation, TD(lambda) with Dutch traces (known as true online TD(lambda)) is exactly equivalent, not merely approximately equivalent, to the online lambda-return algorithm, while keeping the per-step cost of ordinary online updates.


The TD(lambda) Algorithm

Forward View — lambda-return

The forward view directly uses the lambda-return as the update target:

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [G_t^\lambda - \hat{v}(S_t, \mathbf{w}_t)] \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t) \]

"Forward" means that, standing at time \(t\), we look forward (into the future), aggregate all future reward information into \(G_t^\lambda\), and use it to update the value of the current state.

The problem with the forward view is that computing \(G_t^\lambda\) requires all rewards from time \(t\) to the end of the episode, making it inherently an offline algorithm — it must wait until the episode terminates.

Backward View — Eligibility Traces

The backward view uses eligibility traces to achieve an equivalent update:

\[ \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \]
\[ \mathbf{e}_t = \gamma \lambda \mathbf{e}_{t-1} + \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t) \]
\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{e}_t \]

"Backward" means that when a TD error \(\delta_t\) is produced, we look backward (into the past) and distribute this error to all recently visited states according to their eligibility traces. States with high eligibility traces receive a larger share of the error; those with low traces receive less.

The major advantage of the backward view is that it is an online algorithm — it can update immediately after each step, without waiting for the episode to end.
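A minimal sketch of one online backward-view update under linear function approximation, where \(\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)\) so the gradient is just the feature vector (all names hypothetical):

```python
import numpy as np

def td_lambda_step(w, e, x_s, x_s_next, reward, done, alpha, gamma, lam):
    """One step of semi-gradient TD(lambda) with accumulating traces and linear features."""
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_s_next
    delta = reward + gamma * v_next - v_s        # TD error delta_t
    e = gamma * lam * e + x_s                    # accumulating trace; gradient of v is x_s
    w = w + alpha * delta * e                    # distribute delta over recently visited features
    return w, e
```

This would be called once per environment step, with `e` reset to zeros at the start of each episode.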

Equivalence of Forward and Backward Views

This is one of the most fundamental results in TD(lambda) theory:

Under linear function approximation, if the per-step increments are accumulated and applied only at the end of the episode (offline updating), the forward view (lambda-return algorithm) and the backward view (TD(lambda) algorithm) produce exactly the same total parameter update.

In other words, summing the backward view's increments over every step of an episode yields exactly the forward view's batch update at the end of that episode. With fully online updating the two views agree only approximately (the gap shrinks as the step size shrinks), and exact online equivalence is restored by Dutch traces (true online TD(lambda)).

The significance of this equivalence is that we use the forward view to understand what the algorithm is optimizing (the lambda-return), and the backward view to implement the computation efficiently (online, incremental updates).
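The offline equivalence can be checked numerically. The hedged sketch below fixes the weights during one toy episode (one-hot features, made-up rewards), accumulates the backward-view increments, and compares them with the forward-view batch update; the two totals coincide to machine precision:

```python
import numpy as np

gamma, lam, alpha = 0.95, 0.8, 0.1
n_states = 5
rng = np.random.default_rng(0)
w = rng.standard_normal(n_states)             # tabular = one-hot linear features
x = lambda s: np.eye(n_states)[s]
v = lambda s: float(w @ x(s))

# A made-up episode: states S_0..S_{T-1}, rewards R_1..R_T, then termination.
states  = [0, 1, 2, 1, 3, 4]
rewards = [0.0, 1.0, -0.5, 0.0, 2.0, 1.0]
T = len(rewards)
vals = [v(s) for s in states] + [0.0]         # v(terminal) = 0

# Forward view: total update from the lambda-returns (weights held fixed).
dw_fwd = np.zeros(n_states)
for t in range(T):
    G_full = sum(gamma**k * rewards[t + k] for k in range(T - t))
    G_lam = lam**(T - t - 1) * G_full         # leftover weight on the full return
    for n in range(1, T - t):                 # n-step returns that bootstrap
        G_n = sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * vals[t + n]
        G_lam += (1 - lam) * lam**(n - 1) * G_n
    dw_fwd += alpha * (G_lam - vals[t]) * x(states[t])

# Backward view: accumulate the per-step increments alpha * delta_t * e_t.
dw_bwd = np.zeros(n_states)
e = np.zeros(n_states)
for t in range(T):
    delta = rewards[t] + gamma * vals[t + 1] - vals[t]
    e = gamma * lam * e + x(states[t])
    dw_bwd += alpha * delta * e

print(np.allclose(dw_fwd, dw_bwd))            # True: identical total update
```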


Sarsa(lambda)

Extending TD(lambda) from the prediction problem to the control problem yields Sarsa(lambda).

Algorithm Pseudocode

Initialize parameter vector w, eligibility trace vector e = 0
For each episode:
    Initialize state S, choose action A (epsilon-greedy)
    For each step:
        Take action A, observe R, S'
        Choose action A' (epsilon-greedy based on q_hat(S', ., w))

        Compute the TD error:
            delta = R + gamma * q_hat(S', A', w) - q_hat(S, A, w)
            (if S' is terminal, then delta = R - q_hat(S, A, w))

        Update the eligibility trace:
            e = gamma * lambda * e + grad_w q_hat(S, A, w)

        Update the parameters:
            w = w + alpha * delta * e

        S <- S', A <- A'

        If S' is terminal, exit the loop

    Reset the eligibility trace e = 0

The difference from standard SARSA is that, at each step, the TD error does not merely update the parameters corresponding to the current \((S, A)\) pair. Instead, the error is "broadcast" through the eligibility trace \(\mathbf{e}\) to all recently visited state-action pairs.
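For concreteness, here is a hedged tabular Python sketch of the loop above (the tabular special case, not the function-approximation form of the pseudocode). The environment object and its gym-like reset()/step() interface are assumptions, not part of the original pseudocode:

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes=200,
                 alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces.
    `env` is assumed to expose reset() -> s and step(a) -> (s_next, r, done)."""
    q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(q[s]))

    for _ in range(episodes):
        e = np.zeros_like(q)                 # traces are reset at the start of each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else eps_greedy(s_next)
            target = r if done else r + gamma * q[s_next, a_next]
            delta = target - q[s, a]         # TD error for the current (S, A)
            e[s, a] += 1.0                   # accumulating trace for the visited pair
            q += alpha * delta * e           # broadcast delta to all eligible pairs
            e *= gamma * lam                 # decay every trace
            if not done:
                s, a = s_next, a_next
    return q
```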

Similarly, Q-learning can be combined with eligibility traces to produce Watkins's Q(lambda). However, since Q-learning is an off-policy method, its combination with eligibility traces is theoretically more complex (involving issues related to the Deadly Triad).


Practical Advice

Choosing lambda

The optimal value of \(\lambda\) depends on the specific task, but there are some rules of thumb:

  • \(\lambda = 0\) (TD(0)): Highest bias, lowest variance. Suitable for tasks with dense rewards and accurate state representations.
  • \(\lambda = 1\) (MC): Lowest bias, highest variance. Suitable for tasks with sparse rewards and short episodes.
  • \(\lambda \in [0.8, 0.95]\): Often the best choice in practice. For example, in Sutton's Mountain Car experiments, \(\lambda \approx 0.9\) typically outperforms both extremes.

A practical tuning strategy is to start with \(\lambda = 0.9\), lower it if variance is too high, and raise it if learning is too slow.

Comparison with n-step TD

| Property | n-step TD | TD(lambda) |
| --- | --- | --- |
| Hyperparameter | \(n\) (discrete integer) | \(\lambda\) (continuous real number) |
| Multi-step information | Uses only the single \(n\)-step return | Weighted average of all \(n\)-step returns |
| Implementation | Requires storing the last \(n\) transitions | Eligibility traces: only one extra vector |
| Computational efficiency | Updates one state per step | Updates all "eligible" states per step |
| Theoretical properties | A special case of the lambda-return | The more general framework |

TD(lambda) can be viewed as a "soft" version of n-step TD: n-step TD concentrates all weight on a single value of \(n\), whereas TD(lambda) distributes the weight smoothly across all possible values of \(n\). This smoothing often leads to more stable learning.

Intuition Behind Eligibility Traces

Eligibility traces can be understood through an everyday analogy:

Imagine you are training a cat. The cat performs a sequence of actions (jumps onto the table, knocks over a cup, runs to the kitchen), and then you discover the water on the floor and punish it. Which action should the cat attribute the punishment to?

  • TD(0) punishes only the most recent action (running to the kitchen) — clearly unreasonable
  • MC distributes punishment equally across all actions — but jumping onto the table may have been innocent
  • TD(lambda) distributes punishment according to temporal recency: knocking over the cup (the most recently relevant action) receives the largest punishment, with other actions receiving progressively less based on how far back in time they occurred

This is the essence of eligibility traces: they are a credit assignment mechanism that distributes "blame or credit" based on temporal proximity.


Mathematical Intuition: Why gamma*lambda Is the Decay Factor

In the eligibility trace update formula, the decay factor is \(\gamma \lambda\) rather than \(\gamma\) or \(\lambda\) alone. These two factors serve distinct roles:

  • \(\gamma\) comes from the value function's discounting: future rewards should inherently be discounted, with greater discounts for more distant rewards
  • \(\lambda\) comes from the geometric weights of the lambda-return: we choose to assign greater weight to shorter-horizon n-step returns

Their product means that a state's "eligibility" decays exponentially at rate \(\gamma\lambda\). If \(\gamma = 0.99\) and \(\lambda = 0.9\), then the per-step decay rate is \(0.891\). A state visited 10 steps ago would have its eligibility trace decayed to approximately \(0.891^{10} \approx 0.32\), retaining only about one-third of its original "credit."
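As a quick check of that arithmetic:

```python
gamma, lam = 0.99, 0.9
decay = gamma * lam
print(decay)          # 0.891: per-step decay factor
print(decay ** 10)    # ~0.315: credit remaining for a state visited 10 steps ago
```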

This also explains why TD(lambda) reduces to the two extremes:

  • When \(\lambda = 0\), \(\gamma \lambda = 0\), so the eligibility trace is completely zeroed out at every step — only the current state has a nonzero trace. The effect is equivalent to TD(0), which updates only the most recently visited state.
  • When \(\lambda = 1\), \(\gamma \lambda = \gamma\), so the eligibility trace decays at the same rate as the discount factor. Looking back at the end of the episode, the credit assigned to each state is exactly proportional to that state's contribution to \(G_t\) under the MC method.

Summary

TD(lambda) is the bridge connecting TD(0) and MC, providing the ability to move smoothly along the bias-variance spectrum:

\[ \text{TD(0)} \xleftarrow{\lambda = 0} \text{TD}(\lambda) \xrightarrow{\lambda = 1} \text{MC} \]

Key takeaways:

  1. The lambda-return is a geometrically weighted average of all n-step returns, providing a more robust target than any single n-step return
  2. Eligibility traces are the online implementation mechanism for the lambda-return, tracking each state's credit through a decaying "memory vector"
  3. The forward view and the backward view are mathematically equivalent — the former is used to understand the optimization objective, the latter for efficient implementation
  4. The choice of \(\lambda\) is fundamentally a bias-variance tradeoff: small \(\lambda\) favors bootstrapping (low variance, high bias), while large \(\lambda\) favors MC sampling (high variance, low bias)
