N-Step TD Learning
N-step TD learning has proven highly effective in modern deep learning. It is worth noting that n-step methods are closely related to \(TD(\lambda)\), introduced in Chapter 12 of the Sutton & Barto textbook: the \(\lambda\)-return used there is a weighted average of n-step returns. In modern deep learning practice, n-step TD is used far more frequently than TD(\(\lambda\)), so we follow the textbook's order and cover the two topics separately.
Principles
Bias vs Variance
Before diving into multi-step TD, we must first understand the distinction between Bias and Variance.
Variance describes how much the results fluctuate if you repeat the same experiment multiple times. Therefore:
- MC has high variance: we must wait until the end of an episode to collect all rewards. Even with the same starting point, different episodes can produce vastly different outcomes due to the many sources of randomness along the way.
- TD(0) has low variance: we only look at the current step and the next step, so the update is barely affected by what happens further down the line. This makes the updates very stable, hence low variance.
Bias describes whether there is a systematic error between your estimated target and the true value:
- MC is unbiased: it uses the sum of rewards that actually occurred — the true return \(G_t\) — with no "guesswork" involved. Although the process is noisy and has high variance, every reward \(R\) is a real sample. Over a large number of episodes, the average will converge to the true value.
- TD is biased: its update uses an estimate of \(V(s_{t+1})\). If the initial estimate is wrong, this introduces a systematic inaccuracy.
Note that the estimated value in TD(0) comes from a table-lookup operation in our implementation: because we need the Q-value of the next state, we query the Q-table (or a neural network). This lookup is the source of the estimate — whatever stale data is currently stored in the Q-table. Since that data is inaccurate before learning has converged, it introduces bias.
The Q-table records our current "belief" about the value of each state. At the start of training, we have no idea what each state is worth, so we assign every state an initial value (zero or random). The reason TD can still converge is that we continuously incorporate real immediate rewards:
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \]
After tens of thousands of iterations, the corrective signal from these real rewards overwhelms the initial estimation errors, allowing both sides of the equation to reach the equilibrium described by the Bellman equation.
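To make the contrast concrete, here is a minimal tabular sketch in Python of the two kinds of targets discussed above. The state/action counts, step size, and the function names `mc_return` and `td0_update` are illustrative assumptions, not from any particular library.

```python
import numpy as np

n_states, n_actions = 16, 4        # hypothetical sizes for illustration
alpha, gamma = 0.1, 0.99           # step size and discount factor

Q = np.zeros((n_states, n_actions))  # initial "belief": zeros are a prior, not a blank slate

def mc_return(rewards):
    """Monte Carlo target: discounted sum of every reward actually observed until the episode ends."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td0_update(s, a, r, s_next, a_next):
    """TD(0)/SARSA update: one real reward plus a (possibly stale) table lookup."""
    target = r + gamma * Q[s_next, a_next]   # Q[s_next, a_next] is the estimate that introduces bias
    Q[s, a] += alpha * (target - Q[s, a])    # the real reward r gradually corrects the initial guess
```

The Monte Carlo target uses only sampled rewards (unbiased, high variance), while the TD update mixes a single real reward with the current table entry (biased by stale estimates, low variance).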
Inductive Bias
From our study of TD learning, we know:
| Dimension | Monte Carlo (MC) | One-Step TD |
|---|---|---|
| Approach | Uses all rewards up to the end of the episode. | Uses only the one-step reward and the estimated value of the next state. |
| Bias | Unbiased. It uses actually observed rewards with no estimates. | Biased. It relies on a not-yet-accurate value estimate \(V(s_{t+1})\). |
| Variance | High variance. Randomness accumulates at every step, leading to large fluctuations. | Low variance. Only one transition is considered, making updates very stable. |
The significance of multi-step TD lies in combining the best of both worlds: use \(n\) steps of real rewards (to reduce bias), then append a state-value estimate (to reduce variance).
Because we initialize the Q-table ourselves, this initialization shapes the agent's initial view of the world. Different initializations lead to different biases:
- Zero Initialization bias: you assume the world is mundane until rewards prove otherwise. This can cause the agent to lack motivation to explore early on.
- Optimistic Initial Values bias: you assign every state a very high score (e.g., 100). The agent discovers that the actual rewards are always lower than expected, so it keeps trying unexplored paths in search of the "imagined" high reward. This effectively serves as an exploration-encouraging bias.
- Experience-based initialization bias: if you preset values based on human knowledge, you let the agent start out "standing on the shoulders of giants."
The TD algorithm is biased because it "updates guesses with guesses."
- Since the update target for \(Q(s_t, a_t)\) contains \(Q(s_{t+1}, a_{t+1})\), if the initial values (bias) are wildly off, the agent will need a very long time for the immediate rewards \(R\) to correct these errors.
- This "initial bias" dominates early in training; its influence diminishes only as more sampled data (ground truth) accumulates.
In short, we must recognize that the initial Q-table is not a blank slate — it is a set of prior assumptions. The act of assigning values to it is essentially us, as designers, using inductive bias to steer the agent's learning trajectory.
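As a concrete illustration, here is a minimal sketch of the three initialization choices; the array sizes, the value 100, and the single expert-set entry are all illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 16, 4

Q_zero = np.zeros((n_states, n_actions))              # "the world is mundane until rewards prove otherwise"
Q_optimistic = np.full((n_states, n_actions), 100.0)  # every state looks great, so the agent keeps exploring
Q_expert = np.zeros((n_states, n_actions))
Q_expert[0, 1] = 5.0  # hypothetical prior knowledge: action 1 in state 0 is believed to be good
```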
Multi-Step TD
The reason temporal-difference algorithms are characterized as "biased" is that their update targets include the estimated value of the next state (i.e., \(V(s_{t+1})\) or a \(Q\) value looked up from a table) rather than the state's true eventual return. If the Q-table's initial belief about a certain state is wrong (biased), this error propagates backward along the state trajectory through the update formula — like an infectious disease — affecting the evaluation of every related state. Since the target value you use as a "reference standard" contains Q-table data that is itself biased, the updated result will naturally inherit and perpetuate that bias. If we want to reduce bias, we need to incorporate more real rewards — and this is precisely what motivates multi-step TD.
Multi-Step SARSA
1-step SARSA target:
\[ r_t + \gamma Q(s_{t+1}, a_{t+1}) \]
n-step SARSA target:
\[ G_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n Q(s_{t+n}, a_{t+n}) \]
Multi-step SARSA update rule:
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_t^{(n)} - Q(s_t, a_t) \right] \]
In the multi-step algorithm, the 1-step target inside the brackets (\(r_t + \gamma Q(s_{t+1}, a_{t+1})\)) is replaced by the longer n-step return \(G_t^{(n)}\) defined above.
In the \(n\)-step SARSA target formula, we can clearly see two parts with distinct roles:
The first part is responsible for reducing bias. This sequence of \(R\) values represents feedback the agent actually sampled from the environment. The larger \(n\) is, the more real feedback the formula contains and the less it depends on potentially incorrect estimates. Because real rewards are "ground truth," including more of them brings the estimated target closer to the true value, thereby reducing bias.
The second part is responsible for controlling variance. At step \(n\), instead of looking further ahead, we simply look up a \(Q\) value from the table to cap the return. If we kept looking all the way to the end (i.e., MC), every subsequent action and every stochastic transition in the environment would add noise. By "truncating" at step \(n\) and substituting a relatively stable estimate, we successfully block the endless accumulation of downstream randomness, thereby controlling variance.
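Putting the two parts together, here is a minimal sketch of computing the n-step target. The function name and argument layout are our own choices; `Q` is assumed to be indexable as `Q[state][action]`.

```python
def n_step_target(rewards, Q, s_n, a_n, gamma=0.99):
    """First part: real sampled rewards reduce bias. Second part: a table lookup truncates the return and caps variance."""
    n = len(rewards)
    g = sum(gamma**k * r for k, r in enumerate(rewards))  # discounted sum of r_t ... r_{t+n-1}
    g += gamma**n * Q[s_n][a_n]                           # bootstrap with the stored estimate at step n
    return g

# The update then mirrors the rule above:
# Q[s_t][a_t] += alpha * (n_step_target(rewards, Q, s_n, a_n) - Q[s_t][a_t])
```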