Initialization
Mathematical Derivation of Variance Propagation
This section is based on the theoretical derivations from Glorot & Bengio (2010) "Understanding the difficulty of training deep feedforward neural networks" and He et al. (2015) "Delving Deep into Rectifiers."
The central question of initialization is: how should we set the initial distribution of weights so that signals during forward propagation and gradients during backpropagation neither explode nor vanish? Answering this requires a rigorous mathematical analysis of variance propagation.
DNN Model Setup
Consider a fully connected network with \(L\) layers, where layer \(l\) has \(m_l\) neurons. The forward pass at layer \(l\) is:
\[ \mathbf{y}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l, \qquad \mathbf{x}_l = f(\mathbf{y}_l) \]
where \(\mathbf{W}_l \in \mathbb{R}^{m_l \times m_{l-1}}\) is the weight matrix, \(\mathbf{b}_l\) is the bias vector, \(f(\cdot)\) is the activation function, \(\mathbf{y}_l\) is the pre-activation value, and \(\mathbf{x}_l\) is the post-activation value.
Key Assumptions:
- Weights \(w_l^{(ij)}\) are i.i.d. with zero mean: \(E[w] = 0\)
- Inputs \(x_{l-1}^{(j)}\) are i.i.d. and independent of the weights
- Biases are initialized to zero: \(b_l = 0\)
Forward Propagation Variance Analysis
For the \(i\)-th neuron in layer \(l\):
\[ y_l^{(i)} = \sum_{j=1}^{m_{l-1}} w_l^{(ij)}\, x_{l-1}^{(j)} + b_l^{(i)} \]
Since \(w\) and \(x\) are mutually independent and \(E[w] = 0\), using \(\text{Var}(wx) = \text{Var}(w) \cdot E[x^2]\) (when \(E[w]=0\)), summing over all inputs yields:
\[ \text{Var}[y_l] = m_{l-1} \cdot \text{Var}[w_l] \cdot E\big[x_{l-1}^2\big] \]
This formula is the foundation of all initialization methods.
Symmetric Activation Functions (Sigmoid/Tanh):
When the activation function is symmetric about the origin (e.g., Tanh), the output has zero mean, i.e., \(E[x] = 0\), so \(E[x^2] = \text{Var}(x)\). In this case:
\[ \text{Var}[y_l] = m_{l-1} \cdot \text{Var}[w_l] \cdot \text{Var}[x_{l-1}] \approx m_{l-1} \cdot \text{Var}[w_l] \cdot \text{Var}[y_{l-1}] \]
(The last step holds under the linear regime approximation \(x \approx y\).)
To keep the variance constant across layers (\(\text{Var}[y_l] = \text{Var}[y_{l-1}]\)), we need:
\[ m_{l-1} \cdot \text{Var}[w_l] = 1 \quad\Longleftrightarrow\quad \text{Var}[w_l] = \frac{1}{m_{l-1}} \]
This is the Xavier condition.
ReLU Activation Function:
ReLU truncates the negative half-axis to zero, so \(E[x] \neq 0\). Assuming \(y\) is symmetrically distributed around zero, ReLU retains only the positive half, giving:
\[ E\big[x_{l-1}^2\big] = \frac{1}{2}\,\text{Var}[y_{l-1}] \]
Substituting into the variance propagation formula:
\[ \text{Var}[y_l] = \frac{1}{2}\, m_{l-1} \cdot \text{Var}[w_l] \cdot \text{Var}[y_{l-1}] \]
To keep the variance constant, we need:
\[ \frac{1}{2}\, m_{l-1} \cdot \text{Var}[w_l] = 1 \quad\Longleftrightarrow\quad \text{Var}[w_l] = \frac{2}{m_{l-1}} \]
This is the Kaiming/He condition — the extra factor of 2 compensates for the halving of variance caused by ReLU discarding the negative half-axis.
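A quick empirical check of this result (a minimal sketch in PyTorch; the width, depth, and batch size are arbitrary choices):
import torch
torch.manual_seed(0)
n, depth = 1024, 30
x = torch.randn(4096, n)                        # E[x^2] ≈ 1 at the input
for _ in range(depth):
    w = torch.randn(n, n) * (2.0 / n) ** 0.5    # Kaiming/He scaling: Var[w] = 2 / m
    x = torch.relu(x @ w.t())
print(x.pow(2).mean().item())                   # stays ≈ 1 instead of decaying toward 0
With the factor of 2 removed from the scaling, the printed value collapses geometrically with depth.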
Backpropagation Gradient Variance Analysis
Similarly, we can analyze the variance of gradients during backpropagation. Let \(\delta_l = \frac{\partial L}{\partial y_l}\) denote the gradient of the loss with respect to the pre-activation at layer \(l\). Then:
\[ \delta_{l-1} = f'(\mathbf{y}_{l-1}) \odot \big(\mathbf{W}_l^{\top} \delta_l\big) \]
Performing a similar variance analysis on the gradients (again under the linear regime approximation \(f' \approx 1\)):
\[ \text{Var}[\delta_{l-1}] = m_l \cdot \text{Var}[w_l] \cdot \text{Var}[\delta_l] \]
To keep the gradient variance constant, we need \(m_l \cdot \text{Var}[w_l] = 1\), i.e., \(\text{Var}[w_l] = \frac{1}{m_l}\).
Note that the forward condition requires \(\text{Var}[w] = \frac{1}{m_{l-1}}\) (fan-in), while the backward condition requires \(\text{Var}[w] = \frac{1}{m_l}\) (fan-out). These two conditions generally cannot be satisfied simultaneously (unless \(m_{l-1} = m_l\)), so Xavier takes a compromise:
\[ \text{Var}[w_l] = \frac{2}{m_{l-1} + m_l} \]
Deriving the Uniform Distribution Parameters
For a uniform distribution \(U(-a, a)\), the variance is \(\text{Var} = \frac{a^2}{3}\).
Xavier uniform distribution: Setting \(\frac{a^2}{3} = \frac{2}{n_{\text{in}} + n_{\text{out}}}\), we get:
\[ a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \]
Kaiming uniform distribution: Setting \(\frac{a^2}{3} = \frac{2}{n_{\text{in}}}\), we get:
\[ a = \sqrt{\frac{6}{n_{\text{in}}}} \]
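As a quick numerical check (a sketch; the width 512 is arbitrary), sampling \(U(-a, a)\) with \(a = \sqrt{6/n_{\text{in}}}\) does give a variance close to \(2/n_{\text{in}}\):
import torch
n_in = 512
a = (6.0 / n_in) ** 0.5
w = torch.empty(n_in, n_in).uniform_(-a, a)
print(w.var().item(), 2.0 / n_in)               # both ≈ 0.0039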
Initialization Strategies
The initialization strategy determines the range and standard deviation of the random numbers used for initialization. Using the analogy of a box filled with random numbers, the initialization strategy determines the size and boundaries of the box.
Xavier/Glorot Initialization
Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010.
The core idea of Xavier Initialization is to maintain equal variance of input and output signals, ensuring that signals neither diverge nor decay throughout the network.
It was originally designed for symmetric activation functions such as Sigmoid and Tanh, since these functions are approximately linear in the region near the origin and have output means close to zero. The Xavier condition follows from the forward requirement \(m \cdot \text{Var}[w] = 1\), combined with the forward-backward compromise derived above.
Normal distribution form:
\[ W \sim \mathcal{N}\!\left(0,\ \sigma^2\right), \qquad \sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}} \]
Uniform distribution form:
Weights \(W\) are sampled from a uniform distribution \(U(-a, a)\), where:
\[ a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \]
- \(n_{in}\): Number of input neurons to the current layer (fan-in).
- \(n_{out}\): Number of output neurons from the current layer (fan-out).
For example, suppose we define a Linear(in_features=100, out_features=100) layer in code:
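A minimal sketch (assuming PyTorch):
import torch.nn as nn
layer = nn.Linear(in_features=100, out_features=100)
nn.init.xavier_uniform_(layer.weight)   # fills W from U(-a, a) with a = sqrt(6 / (n_in + n_out))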
Computing \(a\): \(a = \sqrt{\frac{6}{100 + 100}} = \sqrt{0.03} \approx 0.173\)
Applying the distribution: All weights \(W\) in this layer are randomly sampled from \(\text{Uniform}(-0.173, 0.173)\).
- For instance, \(W_1 = 0.05\)
- \(W_2 = -0.12\)
- \(W_3 = 0.17\)
- ...but a value of \(0.2\) would never appear, since it exceeds \(a\).
Kaiming/He Initialization
He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015.
Proposed by Kaiming He et al., this method is specifically designed for ReLU (Rectified Linear Unit) activation functions. Since ReLU sets all negative outputs to zero (effectively "killing" half of the neurons), the signal variance is halved after each layer. Kaiming initialization compensates for this by setting \(\text{Var}[w] = \frac{2}{n_{\text{in}}}\) (compared to Xavier's \(\frac{1}{n_{\text{in}}}\), the extra factor of 2 exactly offsets ReLU's half-variance effect).
It is primarily suited for ReLU and its variants (e.g., Leaky ReLU).
Normal distribution form:
\[ W \sim \mathcal{N}\!\left(0,\ \sigma^2\right), \qquad \sigma^2 = \frac{2}{n_{\text{in}}} \]
Uniform distribution form:
Weights \(W\) are sampled from a uniform distribution \(U(-a, a)\), where:
\[ a = \sqrt{\frac{6}{n_{\text{in}}}} \]
- Note: The formula uses only the number of input neurons \(n_{in}\) (fan-in mode). Kaiming initialization can also use fan-out mode (\(\text{Var}[w] = \frac{2}{n_{\text{out}}}\)), which may be more appropriate for certain convolutional layers; both modes are shown in the sketch below.
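A short sketch of the two modes (assuming PyTorch; the Conv2d shape is illustrative):
import torch.nn as nn
conv = nn.Conv2d(64, 128, kernel_size=3)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')    # Var[w] = 2 / fan_in
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')   # Var[w] = 2 / fan_out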
Initialization Distributions
The initialization distribution determines how the generated random numbers are spread out. Using the box-of-random-numbers analogy again, the distribution determines the shape of the spread of the numbers placed inside the box.
Uniform Distribution
In a uniform distribution, weights \(W\) are sampled with equal probability from a specific interval \([-a, +a]\). The weight values are spread uniformly within the specified range. The Xavier and Kaiming formulas compute the half-width \(a\) of this interval.
Normal Distribution
In a normal distribution, weights \(W\) are randomly drawn from a Gaussian (normal) distribution with mean \(\mu\) and standard deviation \(\sigma\). Weights closer to the mean (\(\mu=0\)) have the highest probability of being selected, and the probability decreases as the distance from the mean increases.
During initialization, the mean is typically set to \(\mu=0\). The Xavier and Kaiming formulas are used to compute the standard deviation \(\sigma\) (e.g., \(\sigma^2 = \frac{2}{n_{in} + n_{out}}\) or \(\sigma^2 = \frac{2}{n_{in}}\)).
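For instance (a minimal sketch assuming PyTorch; the 100×100 layer is illustrative), setting \(\sigma\) by hand gives the same scale as nn.init.xavier_normal_:
import torch.nn as nn
layer = nn.Linear(100, 100)
std = (2.0 / (100 + 100)) ** 0.5                  # sigma^2 = 2 / (n_in + n_out)
nn.init.normal_(layer.weight, mean=0.0, std=std)  # same scale as nn.init.xavier_normal_(layer.weight)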
Why Arbitrary Weight Initialization Fails
Symmetry Breaking
A fundamental requirement of initialization is to break symmetry. If all neurons in a layer have identical weights, they will compute exactly the same outputs, receive exactly the same gradients, and undergo exactly the same updates — meaning that no matter how long training continues, these neurons remain "clones" of one another, and the entire layer degenerates into effectively a single neuron.
The essential purpose of random initialization is to give each neuron a different starting point so they can learn distinct features. Even a small random perturbation is sufficient to break symmetry, but the scale of the weights must be carefully chosen (i.e., the Xavier/Kaiming conditions), otherwise gradient explosion or vanishing will occur.
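A tiny demonstration of the symmetry problem (a sketch assuming PyTorch; shapes and data are arbitrary): when two neurons start with identical weights and feed a symmetric loss, their gradients are identical too, so they can never diverge.
import torch
import torch.nn as nn
layer = nn.Linear(4, 2, bias=False)
nn.init.constant_(layer.weight, 0.5)            # both neurons start with identical weights
x = torch.randn(8, 4)
layer(x).sum().backward()
print(layer.weight.grad[0].equal(layer.weight.grad[1]))   # True: the gradients are identical too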
Consequences of Poor Initialization
There are several initialization pitfalls that lead to significant problems:
- All-zero initialization: Cannot break symmetry; the network degenerates into an extremely simple linear model
- Constant initialization: Suffers the same problem as all-zero initialization — all neurons remain identical forever
- All-positive initialization: Causes mean shift, saturation, and zigzag behavior
- Too-large initialization: Leads to gradient explosion or neuron death (Sigmoid/Tanh entering the saturation region)
- Too-small initialization: Leads to gradient vanishing (signals decay to zero layer by layer; see the sketch below)
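To make the last failure mode concrete (a sketch in PyTorch; width, depth, and the 0.01 scale are arbitrary choices), a too-small initialization makes the signal shrink geometrically with depth:
import torch
torch.manual_seed(0)
n, depth = 512, 20
x = torch.randn(256, n)
for _ in range(depth):
    w = torch.randn(n, n) * 0.01                # far below the Xavier scale sqrt(1/n) ≈ 0.044
    x = torch.tanh(x @ w.t())
print(x.std().item())                           # ≈ 0: the signal has vanished layer by layer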
Other Initialization Methods
Orthogonal Initialization
Orthogonal initialization sets the weight matrix to an orthogonal matrix (or its approximation), ensuring that all singular values of the matrix equal 1. This guarantees that signals are neither amplified nor attenuated during both forward and backward propagation.
Method: First sample a matrix from the standard normal distribution, then perform QR decomposition or SVD, and use the orthogonal component as the weight matrix.
Use cases: Orthogonal initialization is especially well-suited for recurrent weight matrices in RNNs/LSTMs. In RNNs, the hidden state is repeatedly multiplied by the same weight matrix; if the spectral norm of this matrix deviates from 1, gradients will either explode or vanish exponentially over time steps. Since the spectral norm of an orthogonal matrix is exactly 1, this approach effectively mitigates the problem.
# Orthogonal initialization in PyTorch
import torch.nn as nn
layer = nn.Linear(256, 256)   # example layer (any 2-D weight works)
nn.init.orthogonal_(layer.weight)
Scaling Initialization / LSUV (Layer-Sequential Unit-Variance)
Mishkin & Matas, "All you need is a good init", ICLR 2016.
Scaling Initialization (also known as LSUV) is a data-driven initialization method. The core idea is that theoretical derivations (such as Xavier/Kaiming) rely on assumptions about the activation functions and data distributions, which may not fully hold in practical deep networks. LSUV instead calibrates the output variance of each layer directly on actual data (a minimal code sketch follows the steps below):
- Initialize all layers with orthogonal initialization (using orthogonal matrices as starting points)
- Feed a mini-batch of data through the network
- Starting from the first layer, sequentially inspect the output variance of each layer
- If the variance is not 1, scale that layer's weights to make the output variance equal to 1: \(W_l \leftarrow W_l / \sqrt{\text{Var}[\mathbf{x}_l]}\)
- Repeat until the output variance of all layers is approximately 1
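A minimal sketch of this loop (assuming PyTorch and a plain nn.Sequential of Linear/ReLU layers; the tolerance, iteration cap, and measuring the variance on each Linear layer's output are simplifying choices, not the paper's exact procedure):
import torch
import torch.nn as nn
def lsuv_init(model, batch, tol=0.01, max_iters=10):
    with torch.no_grad():
        # Step 1: orthogonal starting point for every linear layer
        for m in model:
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight)
                nn.init.zeros_(m.bias)
        # Steps 2-5: walk the layers in order, rescaling weights until each output has unit variance
        x = batch
        for m in model:
            if isinstance(m, nn.Linear):
                for _ in range(max_iters):
                    var = m(x).var().item()
                    if abs(var - 1.0) < tol:
                        break
                    m.weight /= var ** 0.5       # W_l <- W_l / sqrt(Var[output])
            x = m(x)                             # feed the calibrated output to the next layer
    return model
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
lsuv_init(model, torch.randn(128, 64))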
Advantages: It does not rely on assumptions about activation functions and works for arbitrary network architectures and activation functions (including non-standard ones like Mish). Experiments have shown that LSUV achieves performance comparable to Batch Normalization on networks such as GoogLeNet and VGG, without adding extra BN layers to the network.
Pre-trained Initialization
In transfer learning and fine-tuning scenarios, using pre-trained model weights as initialization is the most common and effective approach:
- ImageNet pre-training: For vision tasks, ImageNet pre-trained ResNet/ViT weights are typically used as initialization
- Large language models: Fine-tuning LLMs is essentially initializing with pre-trained weights and continuing training on task-specific data
- Initialization of newly added layers: When adding new classification heads or adapter layers to a pre-trained model, the new layers are usually initialized with Xavier or Kaiming, while the pre-trained layers retain their original weights
Pre-trained initialization typically yields far better results than random initialization, because the pre-trained weights already encode rich feature representations.
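For example (a sketch assuming torchvision; the model, weight enum, and 10-class head are illustrative):
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)     # backbone keeps its pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 10)         # newly added classification head
nn.init.xavier_uniform_(model.fc.weight)               # only the new layer is (re-)initialized
nn.init.zeros_(model.fc.bias)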
Initialization in Practice with PyTorch
Common Initialization Functions
import torch.nn as nn
layer = nn.Linear(128, 64)   # example layer so the calls below are runnable
# Xavier initialization
nn.init.xavier_uniform_(layer.weight) # Uniform distribution
nn.init.xavier_normal_(layer.weight) # Normal distribution
# Kaiming initialization
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
# Orthogonal initialization
nn.init.orthogonal_(layer.weight)
# Constant initialization (typically used for bias)
nn.init.zeros_(layer.bias)
nn.init.constant_(layer.bias, 0.01)
Initializing an Entire Model
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(module, (nn.BatchNorm2d, nn.LayerNorm)):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
model.apply(init_weights)   # apply() calls init_weights on every submodule recursively
Initialization Method Selection Guide
| Activation Function | Recommended Initialization | Reason |
|---|---|---|
| Sigmoid / Tanh | Xavier | Preserves input-output variance consistency |
| ReLU / Leaky ReLU | Kaiming | Compensates for ReLU's variance-halving effect |
| GELU / SiLU | Kaiming | Asymmetric characteristics similar to ReLU |
| RNN recurrent weights | Orthogonal initialization | Maintains gradient stability across time steps |
| Transformer | Xavier (original) / Scaled initialization | Deep Transformers often scale residual branches by \(1/\sqrt{2N}\) |
| Fine-tuning scenarios | Pre-trained weights | Already contains rich feature representations |
Scaled initialization for deep networks: In very deep networks (such as GPT and other deep Transformers), residual connections can cause signals to grow as the number of layers increases. A common practice is to multiply the weights of the last layer in each residual branch by \(1/\sqrt{2N}\) (where \(N\) is the number of layers), keeping signal growth along the residual path under control. This approach was adopted in the GPT-2 paper.
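A minimal sketch of this residual-branch scaling (assuming PyTorch; the layer count, width, and 0.02 base standard deviation are illustrative, loosely following GPT-2's reported recipe):
import math
import torch
import torch.nn as nn
n_layers = 12                                            # number of Transformer blocks (illustrative)
out_proj = nn.Linear(768, 768)                           # last linear layer of a residual branch
nn.init.normal_(out_proj.weight, mean=0.0, std=0.02)     # base initialization
with torch.no_grad():
    out_proj.weight.mul_(1.0 / math.sqrt(2 * n_layers))  # extra scaling by 1 / sqrt(2N)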