TD Learning (Temporal Difference Learning)
We have already learned the concept of sampling through MC methods, and how to accomplish control tasks using an on-policy approach. We no longer need complete knowledge of the environment (model-free); instead, we interact directly with the environment and learn from the data collected during these interactions.
In on-policy MC ES control, when computing G, we rely entirely on actual sampled experience. For example, if we bump into a wall twice in a maze and lose 2 points, then our total return G reflects exactly those two penalties. In other words, whatever actually happens is what we use to update the Q table. In off-policy MC, we use the Importance Sampling Ratio to re-weight the sampled G.
In this chapter, we introduce the concept of TD learning along with two classic TD methods:
- On-policy SARSA
- Off-policy Q-learning
From MC to TD: Derivation
Bootstrapping
The practice of "using a current Q estimate to update a previous Q estimate" is called Bootstrapping in reinforcement learning.
Temporal difference learning combines ideas from both Monte Carlo methods and dynamic programming:
- It does not require prior knowledge of the environment.
- Following the logic of the Bellman equation, it uses the value estimate of a successor state to update the value estimate of the current state.
Let us revisit the MC method: it uses the actual cumulative return \(G_t\). It trusts no one's predictions -- only the final outcome.
Whether on-policy MC or off-policy MC, the update logic follows the same incremental update formula. Note that the target is the return \(G_t\): the actual cumulative reward starting from time step \(t\) all the way to the end of the episode (i.e., \(R_{t+1} + \gamma R_{t+2} + \dots\)).
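Written out, that incremental update is

\[ V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr], \]

where \(\alpha\) is the step size (learning rate).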
In MC methods, we directly use the actually sampled values to update the V or Q table. In TD, however, we use the immediate reward plus an estimated value of the future. It believes that "predictions can correct predictions." In short:
- MC uses actually observed data.
- TD uses estimated data.
For example, in a grid world:
start = (0, 0)
goal = (2, 2)
reward = [
    [-1, -1, -1],
    [-1, -1, -1],
    [-1, -1, 10],
]
actions = {
    'up':    (-1, 0),  # (dr, dc)
    'down':  (1, 0),
    'right': (0, 1),
    'left':  (0, -1),
}
Suppose the agent has just completed a move: from \((0,0)\) to \((0,1)\).
- \(S_t\) (cur_state): The cell the agent was just in, i.e., \((0,0)\).
- \(A_t\) (cur_action): The action the agent just took, i.e., 'right'.
- \(R_{t+1}\) (reward): The reward received from that step, i.e., \(-1\).
- \(S_{t+1}\) (next_state): The cell where the agent now stands, i.e., \((0,1)\).
- \(A_{t+1}\) (next_action): The action the agent plans to take next at \((0,1)\) (note: in Q-learning, we typically pick the action with the highest \(Q\) value at \((0,1)\)).
In TD, once the agent takes one step and arrives at next_state, it immediately looks back and evaluates "how good was that last move?"
- Observed fact (\(R_{t+1}\)): \(-1\) (the actual reward received).
- Available intelligence (\(Q(S_{t+1}, A_{t+1})\)): \(-5\) (the current estimate of the future value from next_state).
- New conclusion (TD Target): \(\text{Target} = -1 + \gamma \cdot (-5) = -6\) (taking \(\gamma = 1\) here to keep the arithmetic simple).
The agent reasons: since moving right from \((0,0)\) yields \(-1\), and the destination \((0,1)\) is estimated to be worth \(-5\) going forward, then "moving right at \((0,0)\)" as a whole is currently worth \(-6\) points.
What we update is \(Q(S_t, A_t)\), i.e., the action value at the previous cell:
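\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr] \]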
Plugging in the example above:
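\[ Q\bigl((0,0), \text{right}\bigr) \leftarrow Q\bigl((0,0), \text{right}\bigr) + \alpha \bigl[ -6 - Q\bigl((0,0), \text{right}\bigr) \bigr], \]

where \(-6\) is the TD Target just computed.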
Key observations:
- Timing of evaluation: The evaluation must happen after taking the step, standing at next_state \((0,1)\) and looking back.
- What is being evaluated: We are evaluating cur_state \((0,0)\) and cur_action 'right'. Only after completing the step do you know what reward it gave and where it brought you.
Let us organize the logic. Suppose we move from a to b:
- We use the actual reward at b.
- Combined with b's existing Q value.
- To update a's Q value.
That is:
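\[ Q(a, \text{cur\_action}) \leftarrow Q(a, \text{cur\_action}) + \alpha \bigl[ R + \gamma\, Q(b, \text{next\_action}) - Q(a, \text{cur\_action}) \bigr] \]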
This is "bootstrapping": rather than waiting until the end of the entire episode, we use the "current assessment" of cell \(b\) to inform the update at cell \(a\).
The term Q(b, next_action) can be a bit confusing. This confusion arises from the difference between V tables and Q tables:
- \(V\) table: Each cell (state) has only a single number. For example, \(V(b) = -5\).
- \(Q\) table: Each cell (state) contains scores for all 4 actions (up, down, left, right).
When you want to "use \(b\)'s existing \(Q\) value" to update \(a\), you realize there is no single number to grab because \(b\) has four scores! At this point you must make a decision: which action's score at \(b\) should represent \(b\)'s worth?
The next_action here is essentially the "representative" you pick from \(b\)'s set of action values. As we will see, different methods of choosing this representative lead to two different algorithms:
- Pick the action with the highest Q value at b -- i.e., assume we will always take the smartest move going forward. Regardless of what step is actually taken next, we always select the one that looks best. This is off-policy Q-learning. (Due to epsilon-greedy exploration, the next actual step may not be the one with the highest Q value.)
- Look at what we actually plan to do next at b. If the next step at b is to go right, then we use "go right" as the next_action here. Since this uses the actually executed action, this is on-policy SARSA.
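A minimal sketch of how this choice of "representative" shows up in code (the function names and the dict-of-dicts Q-table layout here are illustrative assumptions, not something fixed by the text); the only difference between the two targets is how the next state's value is looked up:

```python
# Assumed layout: Q maps state -> {action: estimated value}, e.g.
# Q[(0, 1)] = {'up': -4.0, 'down': -6.0, 'left': -5.5, 'right': -5.0}

def q_learning_target(Q, reward, next_state, gamma=0.9):
    # Off-policy: bootstrap from the best-looking action at next_state,
    # regardless of which action will actually be executed there.
    return reward + gamma * max(Q[next_state].values())

def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    # On-policy: bootstrap from the action the agent actually plans to take.
    return reward + gamma * Q[next_state][next_action]
```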
Discount Factor
In many reinforcement learning tasks, there is no clear terminal state (e.g., a robot patrolling a factory forever).
- Without discounting (\(\gamma = 1\)): If the task runs indefinitely, the cumulative return \(G_t\) would grow to infinity.
- With discounting (\(0 \le \gamma < 1\)): Mathematically, the infinite sum is guaranteed to converge to a finite constant. This makes computation tractable and keeps the algorithm stable.
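Concretely, if every per-step reward has magnitude at most \(R_{\max}\), the discounted return is bounded by a geometric series:

\[ \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma}, \]

which is finite whenever \(\gamma < 1\).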
Suppose you are the robot in the Grid World, and you have two choices:
- Option A: Receive 10 points right now.
- Option B: Receive 10 points after 1000 steps.
You would obviously choose A. Why? Because before reaching that 1000th step, many things could go wrong: the program might crash, there could be a power outage, or the environment might change. The more distant a reward, the less likely it is to materialize. \(\gamma\) acts as a "trust coefficient" -- with each additional step into the future, we discount our trust in that reward a little more.
In your Grid World, every step costs 1 point.
- If you set \(\gamma = 0\): The robot becomes extremely myopic; it only cares about the immediate reward from the next cell and completely ignores the long term.
- If you set \(\gamma = 1\): The robot treats "losing 1 point now" and "losing 1 point 100 steps later" as equally bad. It may wander aimlessly through the maze because it does not perceive wasting time as a loss.
The significance of introducing \(\gamma < 1\) is that it forces the agent to recognize that *receiving a reward sooner is better than receiving it later*. This naturally guides the agent toward finding the shortest-path optimal solution.
Bellman Equation
The Bellman equation is as follows:
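\[ V(s) = \mathbb{E}\bigl[ R + \gamma\, V(s') \bigr] \]

Here \(s'\) is the successor state reached after one step, and \(V' = V(s')\) below denotes its value.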
This is the theoretically perfect form and the soul of reinforcement learning. Using a recursive, "nesting-doll" logic, it simplifies the complex future into a simple "today + tomorrow": today's total value is today's reward plus a discounted estimate of tomorrow's value.
This equation describes an "ideal equilibrium":
- If your knowledge is perfect, then the "total value" (\(V\)) of your current position should equal exactly the "hard cash" you receive by taking one step (\(R\)), plus the "discounted estimate" of your destination's future value (\(\gamma V'\)).
- Early in the learning process, \(V\) is essentially a random guess, so the two sides of the equation will certainly not be equal. The algorithm's goal is to keep adjusting \(V\) until equality holds.
- It reveals that the value function is self-consistent across time -- today's value is determined by tomorrow's, tomorrow's by the day after, and so on.
Each element carries deep meaning:
- \(V\) (Current Value): Your assessment of the current state \(s\). For example, standing one step before the goal, this \(V\) should be high.
- \(\mathbb{E}\) (Expectation): In practice, a single step might lead to different outcomes (due to environmental stochasticity), so we compute an average.
- \(R\) (Immediate Reward): The reward or penalty from this step. It is the only "real substance" in the equation -- everything else (\(V\)) is estimated.
- \(\gamma\) (Discount Factor): If \(\gamma = 0.9\), it means that 100 dollars tomorrow is worth only 90 dollars today. It forces the agent to balance long-term goals with short-term considerations.
- \(V'\) (Next Value): This is the most elegant part. We do not need to account for the entire future; we simply look up the value of the "landing state" in the value table.
The essential insight of the Bellman equation can be summarized as:
Current estimate = Current reward + Future estimate
In reinforcement learning, our goal is to learn a state value function \(V_\pi(s)\) that represents the expected cumulative future reward obtainable from state \(s\) when following policy \(\pi\).
In the equation below, \(G_t\) is the actual total reward obtained in MC methods. In practice, we track it with a variable and compute \(G\) after the episode ends:
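\[ V_\pi(s) = \mathbb{E}_\pi\bigl[ G_t \mid S_t = s \bigr] \]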
The meaning of each symbol in the equation:
- \(V_\pi(s)\): the value of state \(s\) under policy \(\pi\).
- \(\mathbb{E}_\pi\): the expectation (average) over episodes generated by following policy \(\pi\).
- \(G_t\): the return, i.e., the cumulative (discounted) reward collected from time step \(t\) onward.
- \(S_t = s\): the condition that the agent is in state \(s\) at time step \(t\).
Expanding the cumulative return \(G_t\) -- here we introduce the discount factor \(\gamma^k\) so that rewards closer to the present receive greater weight. In other words, "a dollar tomorrow is worth less than a dollar today":
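\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \]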
Extracting the immediate reward (splitting \(G_t\) into the current reward and future rewards):
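\[ G_t = R_{t+1} + \gamma \bigl( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots \bigr) \]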
Observe the right-hand portion of the equation above:
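\[ R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots \]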
This series is actually the total cumulative reward starting from time step \(t+1\) (the next moment). It is precisely \(G_{t+1}\) (the return from time step \(t+1\)). Therefore, after extracting the immediate reward, we can simplify the expression to:
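\[ G_t = R_{t+1} + \gamma\, G_{t+1} \]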
We already know from the beginning that:
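\[ V_\pi(s) = \mathbb{E}_\pi\bigl[ G_t \mid S_t = s \bigr] \]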
So we can write the expression directly as:
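\[ V_\pi(s) = \mathbb{E}_\pi\bigl[ R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s \bigr] \]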
This is essentially a Bellman equation.
TD(0) Method
From the derivation above, we know:
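\[ V_\pi(s) = \mathbb{E}_\pi\bigl[ G_t \mid S_t = s \bigr] \]
\[ V_\pi(s) = \mathbb{E}_\pi\bigl[ R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s \bigr] \]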
The Monte Carlo method uses the first line as the update target, while the temporal difference method uses the second line. From the mathematical principles, we can also see the different philosophies of the two algorithms: MC pursues the most direct and complete expected outcome, whereas TD is based on the recursive form of the Bellman equation.
At this point, we should understand what "temporal difference" means:
- Temporal refers to the passage of time steps -- the algorithm focuses on two different points in time: the current moment \(t\) and the next moment \(t+1\). TD only needs to take a single step before it can perform an update.
- Difference refers to computing the error -- the core update signal is the TD Error, i.e., the difference between the new estimate (Target) and the old estimate:
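\[ \delta_t = \bigl( R_{t+1} + \gamma\, V(S_{t+1}) \bigr) - V(S_t) \]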
The distinction between algorithms ultimately comes down to how the difference is handled:
- SARSA's difference comes from the actual choice at the next step.
- Q-learning's difference comes from the theoretically optimal choice at the next step.
We can see that different approaches to handling the difference stem mathematically from different methods of computing the Target value. In other words, no matter how the algorithm varies, the core underlying logic of TD(0) is always this formula:
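\[ V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \bigr] \]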
We need to distinguish between TD Target and TD Error:
- TD Target: \(R_{t+1} + \gamma V(S_{t+1})\). This is the "newly observed truth": the reward from taking one step, plus the estimate of the remaining journey.
- TD Error: \(\delta_t = \text{Target} - V(S_t)\). This represents the gap between "reality" and "expectation."
Therefore, the mathematical difference between algorithms is really just a difference in the Target:
- For MC, the Target is the cumulative return \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\)
- For Q-learning, \(Target = R_{t+1} + \gamma \max_a Q(S_{t+1}, a)\)
- For SARSA, \(Target = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})\)
SARSA
On-policy SARSA
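\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr] \]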
Key point: Here, \(A_{t+1}\) is the next action actually selected according to the current policy.
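As a rough sketch of one SARSA episode in the grid world defined earlier (the `step` function, the `epsilon_greedy` helper, and the hyperparameter values are illustrative assumptions, not part of the original text):

```python
import random

def epsilon_greedy(Q, state, epsilon=0.1):
    # Behavior policy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(Q[state], key=Q[state].get)

def sarsa_episode(Q, step, alpha=0.1, gamma=0.9):
    state = start
    action = epsilon_greedy(Q, state)                 # choose A_t
    while state != goal:
        next_state, r = step(state, action)           # observe R_{t+1}, S_{t+1}
        next_action = epsilon_greedy(Q, next_state)   # choose A_{t+1} with the same policy
        if next_state == goal:
            target = r                                # no future value at a terminal state
        else:
            target = r + gamma * Q[next_state][next_action]
        Q[state][action] += alpha * (target - Q[state][action])
        state, action = next_state, next_action       # A_{t+1} really is executed next
```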
Q-learning
Off-policy Q-learning
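\[ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \bigr] \]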
Key point: Regardless of which action is actually taken, the update always assumes the next step will choose the best possible action.
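For comparison, a sketch of the corresponding Q-learning episode, reusing the illustrative helpers from the SARSA sketch above; the behavior policy still explores, but the update always bootstraps from the best action at the next state:

```python
def q_learning_episode(Q, step, alpha=0.1, gamma=0.9):
    state = start
    while state != goal:
        action = epsilon_greedy(Q, state)             # behavior policy may explore
        next_state, r = step(state, action)
        if next_state == goal:
            target = r
        else:
            # Target policy is greedy: use the max over actions at next_state,
            # no matter which action will actually be executed there.
            target = r + gamma * max(Q[next_state].values())
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
```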