
Automatic Differentiation

Automatic differentiation is what puts the "learning" in deep learning. Its core concepts include the chain rule, partial derivatives, gradients, and backpropagation; in engineering practice, the gradients it supplies to gradient descent directly determine whether a model can converge.

What Is Automatic Differentiation

Automatic Differentiation (AD) is neither symbolic differentiation nor numerical differentiation. Rather, it is a technique that uses computational graphs to compute derivatives both precisely and efficiently.

The core idea is simple: any function, no matter how complicated, is ultimately composed of a finite number of elementary operations (addition, subtraction, multiplication, division, exponentiation, logarithms, trigonometric functions, etc.). Since we know the derivative of each elementary operation, we can chain them together via the chain rule to obtain the exact derivative of the entire function automatically.

Why is automatic differentiation so critical to deep learning? Because backpropagation is essentially reverse-mode automatic differentiation. When you call loss.backward() in PyTorch, automatic differentiation is what runs under the hood. Without it, we would be unable to efficiently compute gradients for the millions of parameters in a deep neural network.

Comparison of Differentiation Methods

| Method | Principle | Precision | Computational Complexity | Pros and Cons |
| --- | --- | --- | --- | --- |
| Manual differentiation | Derive formulas by hand | Exact | High human cost | Error-prone; does not scale to complex networks |
| Numerical differentiation | Finite differences \(\frac{f(x+h)-f(x)}{h}\) | Approximate, with truncation error | \(O(n)\) forward passes (\(n\) = number of parameters) | Easy to implement but slow and imprecise; commonly used for gradient checking |
| Symbolic differentiation | Apply algebraic rules to differentiate expressions | Exact | Expressions may grow exponentially | Suitable for simple formulas, but complex functions lead to "expression swell" |
| Automatic differentiation | Apply the chain rule step by step on a computational graph | Exact (to machine precision) | Same order as forward propagation | Combines precision with efficiency; the standard choice for deep learning |

Example problem with numerical differentiation: Computing \(f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}\) (central differences) — if \(h\) is too large, the truncation error is large; if \(h\) is too small, floating-point round-off error dominates. For a network with \(10^8\) parameters, \(10^8\) forward passes are needed to compute the full gradient, which is completely infeasible in practice.
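
For illustration, here is a minimal central-difference gradient check in PyTorch (the helper numerical_grad is a hypothetical name for this sketch, not a library function). Each coordinate costs two extra forward passes, which is exactly what makes the method infeasible at scale:

import torch

def numerical_grad(f, x, h=1e-5):
    # Central differences: two extra forward passes per coordinate of x
    grad = torch.zeros_like(x)
    for i in range(x.numel()):
        e = torch.zeros_like(x)
        e.view(-1)[i] = h
        grad.view(-1)[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

def square_sum(x):
    return (x ** 2).sum()

x = torch.tensor([1.0, 2.0, 3.0])
print(numerical_grad(square_sum, x))  # approx. tensor([2., 4., 6.]); cost grows with x.numel()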

Example problem with symbolic differentiation: Symbolically differentiating \(f(x) = \sin(e^{x^2 + \ln x})\) expands each step into new sub-expressions, and the result can grow far longer than the original function. This is the expression swell problem.

Automatic differentiation cleverly avoids both issues: it computes step by step at the numerical level (no symbolic expansion) while using exact derivative rules (no finite-difference approximation).

Computational Graph

The computational graph is the core data structure of automatic differentiation. It is a directed acyclic graph (DAG) in which:

  • Nodes represent variables or operations
  • Edges represent the flow of data

Taking the function \(f(x, y) = (x + y) \cdot \sin(x)\) as an example, the forward computation can be decomposed into:

\[ v_1 = x, \quad v_2 = y \]
\[ v_3 = v_1 + v_2 \quad \text{(addition)} \]
\[ v_4 = \sin(v_1) \quad \text{(sine)} \]
\[ v_5 = v_3 \cdot v_4 \quad \text{(multiplication)} \]

Setting \(x = \frac{\pi}{2}, \; y = 1\), the numerical forward pass yields:

| Step | Expression | Value |
| --- | --- | --- |
| \(v_1 = x\) | \(\frac{\pi}{2}\) | \(1.571\) |
| \(v_2 = y\) | \(1\) | \(1.000\) |
| \(v_3 = v_1 + v_2\) | \(\frac{\pi}{2} + 1\) | \(2.571\) |
| \(v_4 = \sin(v_1)\) | \(\sin(\frac{\pi}{2})\) | \(1.000\) |
| \(v_5 = v_3 \cdot v_4\) | \(2.571 \times 1.000\) | \(2.571\) |
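
For reference, the same decomposition can be evaluated step by step in plain Python (an illustrative sketch; the variable names mirror the table above):

import math

x, y = math.pi / 2, 1.0
v1, v2 = x, y
v3 = v1 + v2          # addition
v4 = math.sin(v1)     # sine
v5 = v3 * v4          # multiplication
print(v3, v4, v5)     # 2.5708..., 1.0, 2.5708...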

The forward pass computes the function value while simultaneously building the computational graph. The next question is: how do we use this graph to compute derivatives?

Forward Mode AD

The core idea of forward-mode automatic differentiation is to carry a "derivative component" alongside the value during forward propagation.

Dual Numbers

The mathematical foundation of forward mode is dual numbers. A dual number is defined as \(a + b\epsilon\), where \(\epsilon^2 = 0\) (analogous to the imaginary unit \(i^2 = -1\), but here the square is zero).

Taylor-expanding a function \(f\) over dual numbers gives:

\[ f(a + b\epsilon) = f(a) + f'(a) \cdot b\epsilon \]

In other words, if we substitute \(x = a + 1 \cdot \epsilon\) into \(f\), the real part of the result is the function value \(f(a)\), and the coefficient of \(\epsilon\) is the derivative \(f'(a)\). A single forward pass yields both the function value and the derivative simultaneously.
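
A handful of Python operator overloads is enough to implement this idea. The Dual class below is purely illustrative (not any library's API): each value carries a (val, dot) pair, and every elementary operation updates both at once.

import math

class Dual:
    """A dual number a + b*eps: 'val' is the value, 'dot' the derivative part."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(d):
    # chain rule for sin: d/dx sin(u) = cos(u) * u'
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

# Evaluate f(x) = x*x + sin(x) at x = 2 with seed dot = 1
x = Dual(2.0, 1.0)
f = x * x + sin(x)
print(f.val)  # 4 + sin(2)   = 4.909...
print(f.dot)  # 2*2 + cos(2) = 3.583...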

Forward Mode Computation

Using the same example \(f(x, y) = (x + y) \cdot \sin(x)\), we compute \(\frac{\partial f}{\partial x}\).

Set the seeds: \(\dot{v}_1 = \frac{\partial x}{\partial x} = 1\), \(\dot{v}_2 = \frac{\partial y}{\partial x} = 0\) (here \(\dot{v}\) denotes the derivative with respect to \(x\)).

| Step | Value | Derivative \(\dot{v}_i = \frac{\partial v_i}{\partial x}\) |
| --- | --- | --- |
| \(v_1 = x\) | \(1.571\) | \(\dot{v}_1 = 1\) |
| \(v_2 = y\) | \(1.000\) | \(\dot{v}_2 = 0\) |
| \(v_3 = v_1 + v_2\) | \(2.571\) | \(\dot{v}_3 = \dot{v}_1 + \dot{v}_2 = 1\) |
| \(v_4 = \sin(v_1)\) | \(1.000\) | \(\dot{v}_4 = \cos(v_1) \cdot \dot{v}_1 = \cos(\frac{\pi}{2}) \cdot 1 \approx 0\) |
| \(v_5 = v_3 \cdot v_4\) | \(2.571\) | \(\dot{v}_5 = \dot{v}_3 \cdot v_4 + v_3 \cdot \dot{v}_4 = 1 \times 1 + 2.571 \times 0 = 1\) |

Therefore \(\frac{\partial f}{\partial x}\bigg|_{x=\frac{\pi}{2}, y=1} = 1\).
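2
Recent PyTorch versions also expose forward-mode AD through torch.autograd.forward_ad. As a sketch, assuming that module is available in your installation, the same computation and seeding look like this:

import math
import torch
import torch.autograd.forward_ad as fwAD

x = torch.tensor(math.pi / 2)
y = torch.tensor(1.0)

with fwAD.dual_level():
    # Seed the tangents: dx/dx = 1, dy/dx = 0
    dual_x = fwAD.make_dual(x, torch.tensor(1.0))
    dual_y = fwAD.make_dual(y, torch.tensor(0.0))
    f = (dual_x + dual_y) * torch.sin(dual_x)
    value, df_dx = fwAD.unpack_dual(f)
    print(value)   # tensor(2.5708)
    print(df_dx)   # tensor(1.0000), i.e. df/dx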

Characteristics of Forward Mode

  • Each forward pass computes one column of the Jacobian matrix: fixing one input variable yields the partial derivatives of all outputs with respect to it
  • With \(n\) inputs, \(n\) forward passes are needed to obtain the full Jacobian
  • Best suited for: input dimension < output dimension (i.e., \(n \ll m\))

Reverse Mode AD

Reverse mode is at the heart of deep learning. In deep learning, we typically have a large number of parameters (high input dimension \(n\)) but only a single scalar loss function (output dimension \(m = 1\)). Reverse mode requires only a single backward pass to compute all parameter gradients — this is exactly why it is called backpropagation.

Reverse Mode Computation

Reverse mode proceeds in two phases:

  1. Forward Pass: Compute the values of all intermediate variables and record the computational graph
  2. Backward Pass: Starting from the output, propagate backward along the computational graph, computing the adjoint of each node: \(\bar{v}_i = \frac{\partial f}{\partial v_i}\)

Continuing with the example \(f(x, y) = (x + y) \cdot \sin(x)\), with \(x = \frac{\pi}{2}, \; y = 1\).

Forward pass (same as above, obtaining all \(v_i\) values).

Backward pass starts from the output: \(\bar{v}_5 = \frac{\partial f}{\partial v_5} = 1\) (the derivative of the output with respect to itself is 1).

| Step | Backpropagation Rule | Value |
| --- | --- | --- |
| \(\bar{v}_5\) | \(\frac{\partial f}{\partial f} = 1\) | \(1\) |
| \(\bar{v}_3\) | \(\bar{v}_5 \cdot \frac{\partial v_5}{\partial v_3} = \bar{v}_5 \cdot v_4\) | \(1 \times 1.000 = 1\) |
| \(\bar{v}_4\) | \(\bar{v}_5 \cdot \frac{\partial v_5}{\partial v_4} = \bar{v}_5 \cdot v_3\) | \(1 \times 2.571 = 2.571\) |
| \(\bar{v}_1\) | \(\bar{v}_3 \cdot \frac{\partial v_3}{\partial v_1} + \bar{v}_4 \cdot \frac{\partial v_4}{\partial v_1}\) | \(1 \times 1 + 2.571 \times \cos(\frac{\pi}{2}) = 1\) |
| \(\bar{v}_2\) | \(\bar{v}_3 \cdot \frac{\partial v_3}{\partial v_2}\) | \(1 \times 1 = 1\) |

A single backward pass yields both gradients simultaneously:

\[ \frac{\partial f}{\partial x} = \bar{v}_1 = 1, \quad \frac{\partial f}{\partial y} = \bar{v}_2 = 1 \]
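
This worked example is easy to verify with PyTorch's reverse-mode autograd (a minimal check):

import math
import torch

x = torch.tensor(math.pi / 2, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

f = (x + y) * torch.sin(x)   # forward pass builds the graph
f.backward()                 # one reverse pass computes both adjoints

print(x.grad)  # tensor(1.0000)  = sin(x) + (x + y) * cos(x)
print(y.grad)  # tensor(1.0000)  = sin(x)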

Characteristics of Reverse Mode

  • Each backward pass computes one row of the Jacobian matrix: fixing one output yields its partial derivatives with respect to all inputs
  • With \(m\) outputs, \(m\) backward passes are needed
  • Best suited for: output dimension < input dimension (i.e., \(m \ll n\))
  • In deep learning, \(m = 1\) (scalar loss), so a single backward pass suffices to obtain all parameter gradients

Forward Mode vs. Reverse Mode

| Aspect | Forward Mode | Reverse Mode |
| --- | --- | --- |
| Propagation direction | Same as the computation direction | Opposite to the computation direction |
| Each pass computes | One column of the Jacobian | One row of the Jacobian |
| Cost for full Jacobian | \(n\) passes | \(m\) passes |
| Best suited for | \(n \ll m\) | \(m \ll n\) (deep learning) |
| Extra memory | Almost none | Must store intermediate values from the forward pass |

Chain Rule and Backpropagation

Mathematical Derivation

The mathematical essence of reverse-mode automatic differentiation is the backward application of the chain rule.

For a composite function \(f = f_L \circ f_{L-1} \circ \cdots \circ f_1\), the chain rule tells us:

\[ \frac{\partial f}{\partial \mathbf{x}} = \frac{\partial f_L}{\partial \mathbf{h}_{L-1}} \cdot \frac{\partial f_{L-1}}{\partial \mathbf{h}_{L-2}} \cdots \frac{\partial f_1}{\partial \mathbf{x}} \]

Forward mode multiplies from right to left (starting from the input side), while reverse mode multiplies from left to right (starting from the output side).

Example: A Simple Two-Layer Network

Consider a two-layer network:

\[ \mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \quad \hat{y} = W_2 \mathbf{h} + \mathbf{b}_2, \quad L = \frac{1}{2}(\hat{y} - y)^2 \]

where \(\sigma\) is the activation function. The backpropagation computation proceeds as follows:

Step 1. Compute the derivative of the loss with respect to the output:

\[ \frac{\partial L}{\partial \hat{y}} = \hat{y} - y \]

Step 2. Propagate to the second-layer parameters:

\[ \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \mathbf{h}^T, \quad \frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \hat{y}} \]

Step 3. Continue propagating to the hidden layer:

\[ \frac{\partial L}{\partial \mathbf{h}} = W_2^T \cdot \frac{\partial L}{\partial \hat{y}} \]

Step 4. Propagate to the first-layer parameters (note the derivative of the activation function):

\[ \frac{\partial L}{\partial W_1} = \left( \frac{\partial L}{\partial \mathbf{h}} \odot \sigma'(W_1 \mathbf{x} + \mathbf{b}_1) \right) \cdot \mathbf{x}^T \]

where \(\odot\) denotes element-wise multiplication (Hadamard product).

This is the complete process of backpropagation: starting from the loss, we compute the gradient of each parameter layer by layer in reverse. Each step involves only local derivatives and the upstream gradient passed down — this is the essence of automatic differentiation.
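
To make the four steps concrete, the following sketch (a toy setup with arbitrary shapes, assuming \(\sigma\) is the sigmoid) carries out the manual backward pass and checks it against PyTorch's autograd:

import torch

torch.manual_seed(0)
x = torch.randn(3, 1)                       # input
y = torch.randn(1, 1)                       # target
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, 1, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, 1, requires_grad=True)

# Forward pass
z1 = W1 @ x + b1
h = torch.sigmoid(z1)
y_hat = W2 @ h + b2
L = 0.5 * (y_hat - y).pow(2).sum()

# Manual backward pass following Steps 1-4
dL_dyhat = y_hat - y                                        # Step 1
dL_dW2 = dL_dyhat @ h.T                                     # Step 2
dL_db2 = dL_dyhat
dL_dh = W2.T @ dL_dyhat                                     # Step 3
dL_dz1 = dL_dh * torch.sigmoid(z1) * (1 - torch.sigmoid(z1))
dL_dW1 = dL_dz1 @ x.T                                       # Step 4
dL_db1 = dL_dz1

# Autograd for comparison
L.backward()
print(torch.allclose(dL_dW1, W1.grad))  # True
print(torch.allclose(dL_dW2, W2.grad))  # True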

Automatic Differentiation in PyTorch

PyTorch's torch.autograd module implements reverse-mode automatic differentiation. Below are the core usage patterns.

Basic Usage

import torch

# Create tensors; requires_grad=True means gradients will be computed for them
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Forward pass: PyTorch builds the computational graph automatically
z = x**2 + 3*x*y + y**2  # z = x^2 + 3xy + y^2

# Backward pass: compute dz/dx and dz/dy
z.backward()

# Read the gradients
print(x.grad)  # dz/dx = 2x + 3y = 2*2 + 3*3 = 13
print(y.grad)  # dz/dy = 3x + 2y = 3*2 + 2*3 = 12

Gradient Accumulation and Zeroing

Gradients in PyTorch are accumulated and are not automatically zeroed. This is a common pitfall in training loops:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for data, target in dataloader:
    optimizer.zero_grad()      # must zero first, otherwise gradients accumulate
    output = model(data)       # forward pass
    loss = criterion(output, target)
    loss.backward()            # backward pass: compute gradients
    optimizer.step()           # update parameters

If you forget to call optimizer.zero_grad() (or model.zero_grad()), the gradients computed by each backward() call will accumulate in the .grad attribute, causing gradients to grow erroneously over time.
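
A minimal illustration of this accumulation behavior:

import torch

x = torch.tensor(2.0, requires_grad=True)

(x * x).backward()
print(x.grad)       # tensor(4.)  -- d(x^2)/dx = 2x

(x * x).backward()
print(x.grad)       # tensor(8.)  -- accumulated with the previous gradient, not replaced

x.grad.zero_()      # reset before the next backward pass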

The torch.no_grad() Context

During inference, gradient computation is unnecessary. Using torch.no_grad() saves memory and speeds up computation:

# Disable gradient computation during inference
with torch.no_grad():
    output = model(test_input)
    # operations inside this context are not recorded in the computational graph

# Equivalent decorator form
@torch.no_grad()
def inference(model, x):
    return model(x)

The detach() Method

detach() separates a tensor from the computational graph, returning a new tensor that no longer tracks gradients:

h = model.encoder(x)
h_detached = h.detach()  # cuts the gradient flow: the encoder receives no gradient from downstream computation

# Common use: feed it into another model without letting gradients propagate back
output = another_model(h_detached)

Common Pitfalls

1. In-place Operations Breaking the Computational Graph

PyTorch detects whether in-place operations affect gradient computation. Certain in-place operations will cause runtime errors:

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
y += 1          # in-place operation; may raise:
                # RuntimeError: one of the variables needed for gradient
                # computation has been modified by an inplace operation

# Correct version
y = y + 1       # creates a new tensor; the computational graph is unaffected

2. detach() vs. no_grad()

# detach(): detach a single tensor from the computational graph (the graph itself still exists)
z = x.detach()  # z shares data with x, but z is not part of the graph

# no_grad(): no graph is built anywhere inside the context (more efficient)
with torch.no_grad():
    z = x * 2   # nothing is recorded in the computational graph

  • detach() is suitable for selectively cutting off certain gradient paths during training (e.g., stop-gradient in GAN training)
  • no_grad() is suitable for the inference phase or computation blocks where gradients are not needed

3. Proper Use of Gradient Accumulation

Sometimes we intentionally leverage gradient accumulation to simulate a larger batch size:

accumulation_steps = 4
optimizer.zero_grad()

for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target)
    loss = loss / accumulation_steps  # scale the loss
    loss.backward()                   # gradients accumulate across iterations

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()              # update parameters only after 4 accumulated steps
        optimizer.zero_grad()         # zero the gradients after each update

4. Calling backward() on Non-scalar Outputs

By default, backward() can only be called on a scalar output. If the output is not a scalar, you must pass the gradient argument:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2  # y is a vector, not a scalar

# Calling y.backward() directly would fail; pass a weight vector with the same shape as y
y.backward(torch.tensor([1.0, 1.0, 1.0]))
print(x.grad)  # tensor([2., 2., 2.])

Mathematically, this is equivalent to computing \(\mathbf{v}^T \cdot J\), where \(J\) is the Jacobian matrix and \(\mathbf{v}\) is the gradient vector passed in. This is the so-called VJP (Vector-Jacobian Product), which is the fundamental operation of reverse-mode automatic differentiation.
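
As an illustrative check (forming the full Jacobian with torch.autograd.functional.jacobian only to make the comparison explicit, which is never done in practice for large models), backward(v) produces the same result as \(\mathbf{v}^T \cdot J\):

import torch

def f(x):
    return x * 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 1.0, 1.0])

y = f(x)
y.backward(v)                 # reverse mode computes v^T J directly, without building J
print(x.grad)                 # tensor([2., 2., 2.])

J = torch.autograd.functional.jacobian(f, x)   # full 3x3 Jacobian (2 * identity)
print(v @ J)                  # tensor([2., 2., 2.]), matches x.grad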

