
Function Approximation

In the preceding chapters, we have been using tabular methods to store value functions: a Q-table or a V-table, with one entry for each state (or state-action pair). This approach works perfectly well in small tasks like Grid World, but it breaks down rapidly when we move toward real-world problems. This chapter covers the core material from Sutton & Barto Chapters 9--11: function approximation.

Why Function Approximation Is Needed

Curse of Dimensionality

Consider a Go board: on a 19x19 board, each position can be in one of three states (black, white, or empty), giving a total state count of roughly \(3^{361}\)—an astronomical number. Even for simpler Atari games, with a screen resolution of 210x160 and 256 possible colors per pixel, the state space is \(256^{210 \times 160}\). Clearly, it is impossible to maintain a separate Q-value for every single state.

The core assumption behind tabular methods is that every state is independent—experience at state A provides no information whatsoever about state B. In reality, however, similar states tend to have similar values. In the Atari game Pong, for example, the ball being in the upper-left corner of the screen versus one pixel to the right of that position are two "different states" whose optimal actions are virtually identical.

Generalization

The key advantage of function approximation is generalization: after visiting one state and updating its value, the value estimates of similar states also improve. This is because we no longer store a separate number for each state; instead, we fit the entire value function with a parameterized function (with parameters \(\mathbf{w}\)):

\[ \hat{v}(s, \mathbf{w}) \approx v_\pi(s) \]
\[ \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a) \]

Here \(\mathbf{w}\) is the parameter vector. By adjusting \(\mathbf{w}\), we aim for \(\hat{v}\) to approximate the true value function as closely as possible across all states.


Value Function Approximation

Problem Formulation

Our goal is to find a parameter vector \(\mathbf{w}\) such that the approximate function \(\hat{v}(s, \mathbf{w})\) is as close as possible to the true value \(v_\pi(s)\). A natural objective function (mean squared error) can be defined as:

\[ \overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s) [v_\pi(s) - \hat{v}(s, \mathbf{w})]^2 \]

where \(\mu(s)\) is the state distribution, reflecting how much we care about the error at each state. This objective is known as the Mean Squared Value Error in reinforcement learning.

We minimize this objective using SGD (stochastic gradient descent):

\[ \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{1}{2} \alpha \nabla_\mathbf{w} [v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)]^2 \]
\[ = \mathbf{w}_t + \alpha [v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)] \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t) \]

The problem is that we do not know the true \(v_\pi(S_t)\). This is precisely where different algorithms diverge: what do we use as a substitute for \(v_\pi(S_t)\)?

Linear Function Approximation

The simplest form of function approximation is linear. We first construct a feature vector for each state \(s\):

\[ \mathbf{x}(s) = (x_1(s), x_2(s), \dots, x_d(s))^\top \]

Then we take the inner product of the parameter vector \(\mathbf{w}\) with the feature vector:

\[ \hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i=1}^{d} w_i \cdot x_i(s) \]

In this form, the gradient is particularly simple:

\[ \nabla_\mathbf{w} \hat{v}(s, \mathbf{w}) = \mathbf{x}(s) \]

And the SGD update rule becomes:

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)] \cdot \mathbf{x}(S_t) \]

The advantage of linear function approximation is that it comes with strong theoretical convergence guarantees: for on-policy prediction, linear TD(0) converges to the TD fixed point, whose value error is at most \(\frac{1}{1-\gamma}\) times the minimum achievable \(\overline{VE}\). Its disadvantage is equally clear: limited representational capacity and the need for manually designed features.
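
To make these formulas concrete, here is a minimal NumPy sketch of the linear update. The feature vector and the training target are illustrative placeholders; later sections substitute the MC return or the TD target for the unknown \(v_\pi(S_t)\).

import numpy as np

def v_hat(w, x):
    # Linear value estimate: v_hat(s, w) = w^T x(s)
    return np.dot(w, x)

def sgd_update(w, x, target, alpha):
    # w <- w + alpha * [target - v_hat(s, w)] * x(s)
    # For a linear v_hat, the gradient with respect to w is simply x(s).
    return w + alpha * (target - v_hat(w, x)) * x

w = np.zeros(4)                                  # 4 hand-crafted features (illustrative)
x_s = np.array([1.0, 0.5, -0.2, 0.0])            # hypothetical feature vector x(S_t)
w = sgd_update(w, x_s, target=1.0, alpha=0.1)    # 'target' stands in for v_pi(S_t)
print(w)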

Nonlinear Function Approximation (Neural Networks)

When we replace \(\hat{v}(s, \mathbf{w})\) with a neural network, we enter the realm of deep reinforcement learning. Neural networks can automatically learn features from raw inputs (such as pixels), eliminating the need for manual feature engineering.

Take DQN (Deep Q-Network) as an example. Its core idea is to approximate the Q-function with a convolutional neural network:

\[ \hat{q}(s, a, \mathbf{w}) \approx q_*(s, a) \]

The network takes the game screen (state \(s\)) as input and outputs Q-values for each action. DQN's success relies on two key techniques:

  1. Experience Replay: Transitions \((s, a, r, s')\) are stored in a replay buffer, and training samples are drawn randomly. This breaks the temporal correlations between consecutive samples, making the i.i.d. assumption of SGD more reasonable.
  2. Target Network: A slowly updated target network is used to compute the TD target, avoiding the instability of "chasing a moving target."

The theoretical guarantees for nonlinear function approximation are far weaker than in the linear case, but in practice the approach is remarkably powerful.
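
As a rough illustration of the first technique, a minimal replay buffer might look like the following sketch (the class name, capacity, and interface are assumptions for illustration, not taken from the DQN paper):

import random
from collections import deque

class ReplayBuffer:
    # Fixed-capacity store of (s, a, r, s', done) transitions; sampling uniformly
    # at random breaks the temporal correlation between consecutive samples.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))   # tuples of states, actions, rewards, next states, dones

    def __len__(self):
        return len(self.buffer)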


Policy Approximation

Parameterized Policy

In addition to approximating the value function, we can also parameterize the policy directly. Using parameters \(\boldsymbol{\theta}\), we define a parameterized policy:

\[ \pi(a|s, \boldsymbol{\theta}) = \Pr\{A_t = a | S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta}\} \]

For discrete action spaces, the most common parameterization is the softmax policy:

\[ \pi(a|s, \boldsymbol{\theta}) = \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_{b} e^{h(s, b, \boldsymbol{\theta})}} \]

where \(h(s, a, \boldsymbol{\theta})\) is a preference function (which can be a linear function \(\boldsymbol{\theta}^\top \mathbf{x}(s,a)\) or a neural network).
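
For instance, a softmax policy over linear preferences \(h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s, a)\) could be sketched as follows (the per-action feature matrix is a made-up example):

import numpy as np

def softmax_policy(theta, features):
    # features[a] is the feature vector x(s, a); one row per action.
    prefs = features @ theta               # preferences h(s, a, theta)
    prefs = prefs - prefs.max()            # subtract the max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()     # pi(a | s, theta), sums to 1

theta = np.array([0.5, -0.3])
x_sa = np.array([[1.0, 0.0],               # hypothetical x(s, a) for 3 actions
                 [0.0, 1.0],
                 [1.0, 1.0]])
print(softmax_policy(theta, x_sa))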

For continuous action spaces (e.g., robotic control), a Gaussian policy is commonly used:

\[ \pi(a|s, \boldsymbol{\theta}) = \frac{1}{\sigma(s, \boldsymbol{\theta})\sqrt{2\pi}} \exp\left(-\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2\sigma(s, \boldsymbol{\theta})^2}\right) \]

where the mean \(\mu\) and standard deviation \(\sigma\) are both functions of the state, controlled by the parameters \(\boldsymbol{\theta}\).
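
A corresponding sketch for a Gaussian policy, where the mean is linear in the state features and the standard deviation is kept positive through an exponential (both parameterizations are illustrative choices, not the only ones possible):

import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_s, rng):
    mu = np.dot(theta_mu, x_s)                 # mean mu(s, theta)
    sigma = np.exp(np.dot(theta_sigma, x_s))   # std sigma(s, theta); exp keeps it positive
    return rng.normal(mu, sigma)               # sample a continuous action a ~ N(mu, sigma^2)

rng = np.random.default_rng(0)
x_s = np.array([0.2, -0.4, 1.0])               # hypothetical state features
action = gaussian_policy_sample(np.ones(3), np.zeros(3), x_s, rng)
print(action)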

Policy approximation and value function approximation are complementary: in the Actor-Critic framework, both are used simultaneously. This is discussed in greater detail in the chapter on policy gradient methods.


Prediction

The prediction problem aims to learn the value function \(v_\pi\) for a given policy \(\pi\). Under function approximation, this means adjusting the parameters \(\mathbf{w}\) so that \(\hat{v}(s, \mathbf{w}) \approx v_\pi(s)\).

Gradient Monte Carlo

The most straightforward approach is to substitute the MC sampled return \(G_t\) for the true \(v_\pi(S_t)\):

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [G_t - \hat{v}(S_t, \mathbf{w}_t)] \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t) \]

Since \(G_t\) is an unbiased estimate of \(v_\pi(S_t)\) (\(\mathbb{E}[G_t | S_t = s] = v_\pi(s)\)), this update direction is, in expectation, the direction of the true gradient. Gradient MC is therefore a true gradient method and, under mild conditions, is guaranteed to converge to a local optimum (a global optimum in the linear case).

The drawbacks of Gradient MC mirror those of MC methods in general: high variance and the requirement to wait until the end of an episode before updating.
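
A minimal sketch of Gradient MC prediction with linear features follows; the environment interface (env.reset(), env.step()) and the feature function are assumptions for illustration, not part of the text above.

import numpy as np

def gradient_mc(env, feature_fn, num_episodes, d, alpha=0.01, gamma=1.0):
    w = np.zeros(d)
    for _ in range(num_episodes):
        # Generate one full episode under the policy being evaluated.
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(s)   # assumed: the policy is applied inside env.step
            states.append(s)
            rewards.append(r)
            s = s_next
        # Walk backwards to compute each return G_t, then update w.
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            x = feature_fn(states[t])
            w += alpha * (G - np.dot(w, x)) * x   # gradient of linear v_hat is x(S_t)
    return w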

Semi-gradient TD

Semi-gradient TD uses the TD target \(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)\) as a substitute for \(v_\pi(S_t)\):

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [\underbrace{R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)}_{\text{TD Target}} - \hat{v}(S_t, \mathbf{w}_t)] \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}_t) \]

This method is called semi-gradient because the TD target itself depends on the parameters \(\mathbf{w}\) (through \(\hat{v}(S_{t+1}, \mathbf{w}_t)\)), yet when computing the gradient we do not differentiate through the TD target—we only differentiate \(\hat{v}(S_t, \mathbf{w}_t)\). In other words, during gradient computation the bootstrapped target is treated as a constant, ignoring its dependence on \(\mathbf{w}\).

Why do this? If we were to differentiate through the TD target as well, we would obtain a more complex update rule (known as the residual gradient method), which in practice performs worse. Although semi-gradient TD is not true gradient descent, on-policy semi-gradient TD(0) with linear function approximation has been proven to converge.
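
A matching sketch of semi-gradient TD(0) under the same assumed interfaces; note that the bootstrapped term \(\hat{v}(S_{t+1}, \mathbf{w})\) enters the target but is never differentiated:

import numpy as np

def semi_gradient_td0(env, feature_fn, num_episodes, d, alpha=0.01, gamma=1.0):
    w = np.zeros(d)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(s)                 # assumed environment interface
            x = feature_fn(s)
            v_next = 0.0 if done else np.dot(w, feature_fn(s_next))
            td_target = r + gamma * v_next                # treated as a constant w.r.t. w
            w += alpha * (td_target - np.dot(w, x)) * x   # only v_hat(S_t, w) is differentiated
            s = s_next
    return w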

A quick comparison:

Method | Substitute for \(v_\pi(S_t)\) | True gradient? | Requires full episode?
Gradient MC | \(G_t\) (actual return) | Yes | Yes
Semi-gradient TD(0) | \(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\) | No | No

Convergence in Function Approximation

Function approximation introduces convergence issues that do not exist in tabular methods. Of particular importance is the Deadly Triad:

When the following three factors are present simultaneously, learning algorithms can diverge (parameters \(\mathbf{w}\) grow toward infinity):

  1. Function Approximation: using a parameterized function instead of a table
  2. Bootstrapping: updating estimates from other estimates (as in TD methods)
  3. Off-policy Learning: learning about a target policy \(\pi\) from data generated by a behavior policy \(b\)

Removing any one of the three factors avoids this instability:

  • Remove function approximation (use a table): Tabular Q-learning converges
  • Remove bootstrapping (use MC): Gradient MC converges
  • Remove off-policy learning (use on-policy): Semi-gradient TD(0) on-policy converges

This explains why DQN requires so many tricks (Experience Replay, Target Network, gradient clipping, etc.) to train stably—DQN happens to exhibit all three deadly factors simultaneously.


Control

The control problem aims to learn the optimal policy directly. Under function approximation, we use \(\hat{q}(s, a, \mathbf{w})\) to approximate the optimal action-value function \(q_*(s, a)\).

Episodic Semi-gradient Sarsa

Semi-gradient TD(0) can be extended from prediction to control by replacing \(\hat{v}\) with \(\hat{q}\) and using the SARSA update:

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha [R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla_\mathbf{w} \hat{q}(S_t, A_t, \mathbf{w}_t) \]

Here \(A_{t+1}\) is the action actually selected by the current policy (typically \(\epsilon\)-greedy with respect to \(\hat{q}\)). This is an on-policy method.

Algorithm outline:

Initialize the parameter vector w
For each episode:
    Initialize state S; choose action A epsilon-greedily with respect to q_hat
    For each step of the episode:
        Take action A, observe R, S'
        If S' is terminal:
            w <- w + alpha * [R - q_hat(S, A, w)] * grad q_hat(S, A, w)
            Go to the next episode
        Choose A' epsilon-greedily with respect to q_hat
        w <- w + alpha * [R + gamma * q_hat(S', A', w) - q_hat(S, A, w)] * grad q_hat(S, A, w)
        S <- S', A <- A'
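
The two quantities the outline leaves abstract, \(\hat{q}\) and the \(\epsilon\)-greedy selection, might look like this in the linear case (the per-state-action feature function is an assumption):

import numpy as np

def q_hat(w, feature_fn, s, a):
    # Linear action-value estimate: q_hat(s, a, w) = w^T x(s, a)
    return np.dot(w, feature_fn(s, a))

def epsilon_greedy(w, feature_fn, s, actions, epsilon, rng):
    # With probability epsilon pick a random action; otherwise act greedily w.r.t. q_hat.
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    values = [q_hat(w, feature_fn, s, a) for a in actions]
    return actions[int(np.argmax(values))]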

The Precursor to Deep Q-Networks

The evolution from Semi-gradient Sarsa to DQN can be summarized as follows:

  1. Replace the linear \(\hat{q}\) with a neural network
  2. Switch from on-policy (SARSA) to off-policy (Q-learning), i.e., replace \(\hat{q}(S', A', \mathbf{w})\) with \(\max_{a'} \hat{q}(S', a', \mathbf{w})\)
  3. Add Experience Replay and a Target Network to combat the Deadly Triad

The DQN TD target is:

\[ Y_t = R_{t+1} + \gamma \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}^{-}) \]

where \(\mathbf{w}^{-}\) denotes the parameters of the target network, which are copied from the main network \(\mathbf{w}\) only once every fixed number of steps rather than after every update. This effectively turns a rapidly moving target into a relatively stationary one, thereby stabilizing training.
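
A sketch of how this target might be computed for a batch of transitions, using PyTorch purely for illustration; the two nn.Linear networks stand in for the real convolutional Q-network and its target copy:

import torch
import torch.nn as nn

state_dim, num_actions = 4, 2
q_net = nn.Linear(state_dim, num_actions)        # stand-in for the main network (parameters w)
target_net = nn.Linear(state_dim, num_actions)   # target network (parameters w^-)
target_net.load_state_dict(q_net.state_dict())   # w^- <- w, done only every C steps

def dqn_target(rewards, next_states, dones, gamma=0.99):
    # Y_t = R_{t+1} + gamma * max_a' q_hat(S_{t+1}, a', w^-);
    # no gradient flows through the target network.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * max_next_q

rewards = torch.tensor([1.0, 0.0, -1.0])           # example batch of 3 transitions
next_states = torch.randn(3, state_dim)
dones = torch.tensor([0.0, 0.0, 1.0])              # terminal transitions get no bootstrap term
print(dqn_target(rewards, next_states, dones))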


Feature Construction

In linear function approximation, the design of the feature vector \(\mathbf{x}(s)\) is critical. Good features can enable linear methods to match the performance of nonlinear ones; poor features can make learning difficult or even impossible.

Polynomial Features

For a continuous state \(s = (s_1, s_2)\), polynomial features are constructed by taking various power combinations of the state variables. For example, a second-order polynomial:

\[ \mathbf{x}(s) = (1, s_1, s_2, s_1 s_2, s_1^2, s_2^2)^\top \]

The advantage is simplicity and intuitiveness; the disadvantage is that the number of features grows exponentially with the polynomial degree and state dimensionality, making it impractical in high-dimensional spaces.
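
A direct sketch of this second-order feature vector for a two-dimensional state:

import numpy as np

def poly2_features(s):
    # x(s) = (1, s1, s2, s1*s2, s1^2, s2^2) for s = (s1, s2)
    s1, s2 = s
    return np.array([1.0, s1, s2, s1 * s2, s1 ** 2, s2 ** 2])

print(poly2_features((0.5, -1.0)))   # [1.0, 0.5, -1.0, -0.5, 0.25, 1.0]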

Fourier Features

Fourier basis functions can be used as features:

\[ x_i(s) = \cos(\pi \mathbf{c}_i^\top s) \]

where \(\mathbf{c}_i\) is a frequency vector. Fourier features can capture periodic patterns and global structure in the value function. On many benchmark tasks, Fourier features outperform polynomial features and require less manual tuning.
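
A sketch of an order-n Fourier basis for a state scaled to \([0, 1]^k\), where the frequency vectors \(\mathbf{c}_i\) enumerate all integer combinations from 0 up to the chosen order:

import itertools
import numpy as np

def fourier_features(s, order=3):
    # s must already be scaled to [0, 1]^k.
    s = np.asarray(s, dtype=float)
    k = len(s)
    # All frequency vectors c_i with integer entries in {0, ..., order}.
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=k)))
    return np.cos(np.pi * coeffs @ s)   # x_i(s) = cos(pi * c_i^T s)

print(fourier_features([0.2, 0.7], order=2).shape)   # (order+1)^k = 9 features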

Tile Coding

Tile coding is one of the most widely used linear feature construction methods in practice. The idea is intuitive:

  1. Cover the continuous state space with multiple overlapping grids (tilings)
  2. Each tiling partitions the state space into several tiles
  3. For a given state, it falls into exactly one tile in each tiling
  4. The feature vector is a binary vector: the components corresponding to activated tiles are 1, and all others are 0

For example, consider a two-dimensional state space \([0,1] \times [0,1]\) covered by 8 tilings, each with \(4 \times 4 = 16\) tiles. The feature vector has dimension \(8 \times 16 = 128\), with exactly 8 components equal to 1.

Advantages of tile coding:

  • Computationally efficient: the feature vector is a sparse binary vector, so the inner product reduces to a simple table lookup and summation
  • Controllable resolution: the degree of generalization is controlled by adjusting tile size
  • Multi-resolution: different tilings can use different tile sizes, enabling a combination of coarse and fine resolution

Illustration of tile coding:

Tiling 1:          Tiling 2 (offset):
+---+---+---+      +---+---+---+
|   |   |   |      |   |   |   |
+---+---+---+      +---+---+---+
|   | * |   |      | * |   |   |
+---+---+---+      +---+---+---+

State * activates tile #5 in Tiling 1
State * activates tile #4 in Tiling 2
Only components #5 and #4 of the feature vector are set to 1

Under linear function approximation:

\[ \hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i \in \text{active tiles}} w_i \]

That is, the value estimate equals the sum of weights corresponding to all activated tiles. During updates, only the weights of the activated tiles need to be adjusted, keeping the computation minimal.
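
A simplified sketch of tile coding for a two-dimensional state in \([0,1]^2\); real implementations (such as Sutton's tiles3 software) use hashing and asymmetric offsets, whereas this version simply shifts each tiling by a fraction of a tile width:

import numpy as np

def tile_features(s, num_tilings=8, tiles_per_dim=4):
    # Binary feature vector for a 2-D state in [0, 1]^2:
    # num_tilings * tiles_per_dim^2 components, exactly num_tilings of them equal to 1.
    s = np.asarray(s, dtype=float)
    x = np.zeros(num_tilings * tiles_per_dim ** 2)
    for t in range(num_tilings):
        offset = t / (num_tilings * tiles_per_dim)            # shift each tiling slightly
        idx = np.clip(((s + offset) * tiles_per_dim).astype(int), 0, tiles_per_dim - 1)
        tile = idx[0] * tiles_per_dim + idx[1]                # tile hit within this tiling
        x[t * tiles_per_dim ** 2 + tile] = 1.0
    return x

x = tile_features([0.3, 0.6])
print(x.shape, x.sum())   # (128,) 8.0 -- matches the 8-tilings, 4x4 example above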


Summary

Function approximation is the bridge from tabular RL to modern deep reinforcement learning. The core ideas can be summarized as follows:

Concept | Tabular Methods | Function Approximation
Storage | One entry per state | Parameterized function \(\hat{v}(s, \mathbf{w})\)
Generalization | None (states are independent) | Yes (similar states share parameters)
Scalability | Limited to small state spaces | Applicable to large-scale / continuous spaces
Convergence guarantees | Relatively strong | Require additional conditions
Update mechanism | Directly modify the table | Adjust parameters \(\mathbf{w}\) via SGD

From the perspective of this chapter, all previously studied algorithms (MC, TD(0), SARSA, Q-learning) can be viewed as special cases of function approximation—they simply use a giant lookup table as their "function." Modern methods such as DQN, PPO, and SAC operate within the function approximation framework, employing neural networks as the approximator along with various techniques to overcome convergence challenges.

