Normalization
In deep learning, normalization refers to techniques that stabilize training, accelerate convergence, and improve model performance by reshaping the distribution of layer activations during training (e.g., adjusting their mean and variance).
Batch Normalization
Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.
Batch Normalization (BN), proposed by Ioffe and Szegedy in 2015, is one of the most important breakthroughs in deep learning.
Algorithm: Given a mini-batch \(\mathcal{B} = \{x_1, x_2, \ldots, x_m\}\), BN performs the following operations along each feature dimension:
1. Compute the mini-batch mean: \(\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i\)
2. Compute the mini-batch variance: \(\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_\mathcal{B})^2\)
3. Normalize: \(\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}\)
4. Affine transformation (Scale and Shift): \(y_i = \gamma \hat{x}_i + \beta\)
Here \(\gamma\) (scale) and \(\beta\) (shift) are learnable parameters. Without step 4, BN would force all activations to have zero mean and unit variance, potentially destroying the network's representational capacity. The learnable \(\gamma\) and \(\beta\) allow the network to "recover" the original distribution when needed.
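As a quick illustration, here is a minimal NumPy sketch of the four steps above (training-time forward pass only; names and shapes are illustrative):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable parameters."""
    mu = x.mean(axis=0)                    # step 1: mini-batch mean, per feature
    var = x.var(axis=0)                    # step 2: mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift
```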
Training vs. Inference:
- During training: The current mini-batch mean and variance are used for normalization, while exponential moving averages (EMA) continuously track the global mean and variance.
- During inference: Batch statistics are no longer used (since there may be only a single sample at inference time). Instead, the global mean and variance accumulated during training (i.e., the running mean and running var) are used.
This is why in PyTorch you need to call model.train() and model.eval() to switch between modes.
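A small PyTorch sketch of this mode switch (the printed values depend on the random batch):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)          # tracks running_mean / running_var via EMA
x = torch.randn(8, 4) * 3 + 1   # batch of 8 samples, 4 features

bn.train()                      # training mode: normalize with batch statistics
_ = bn(x)                       # running stats are updated as a side effect
print(bn.running_mean)          # EMA estimate of the global mean

bn.eval()                       # inference mode: use the accumulated running stats
_ = bn(x[:1])                   # works even for a single sample
```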
Why it works:
- Mitigates Internal Covariate Shift: During training, the input distribution to each layer shifts constantly as parameters in preceding layers are updated. BN stabilizes each layer's input distribution within a fixed range through normalization, allowing each layer to learn more independently.
- Smooths the optimization landscape: Santurkar et al. (NeurIPS 2018), "How Does Batch Normalization Help Optimization?", argued that BN's primary benefit does not come from reducing Internal Covariate Shift, but rather from making the optimization landscape smoother (improving loss Lipschitzness and gradient Lipschitzness). Specifically, BN makes the gradient of the loss function change more gradually (smaller \(\beta\)-smoothness), thereby allowing larger learning rates without divergence.
- Enables larger learning rates: Since the distribution of activations is controlled, gradients are less prone to exploding or vanishing, permitting larger learning rates for faster training.
- Built-in regularization effect: The mean and variance of each mini-batch are inherently stochastic (depending on which samples happen to be in the batch), effectively injecting noise into the network — a regularization effect similar to Dropout. Experiments show that increasing the batch size reduces this noise-based regularization.
Scale Invariance and Gradient Bounds of BN:
BN exhibits scale invariance with respect to weights: for any constant \(\alpha > 0\), \(\text{BN}(\alpha \mathbf{W}\mathbf{x}) = \text{BN}(\mathbf{W}\mathbf{x})\), because the normalization operation cancels out the scaling factor. This property leads to an important corollary for the gradients:
\[\frac{\partial\, \text{BN}\big((\alpha\mathbf{W})\mathbf{x}\big)}{\partial (\alpha\mathbf{W})} = \frac{1}{\alpha}\cdot\frac{\partial\, \text{BN}(\mathbf{W}\mathbf{x})}{\partial \mathbf{W}}\]
That is, larger weights receive smaller gradients, and smaller weights receive larger gradients. This provides an automatic gradient-regulation mechanism that helps prevent gradient explosion and stabilizes training.
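The invariance itself is easy to verify numerically. A minimal check with a hand-rolled BN (no affine step; the \(\epsilon\) term makes the equality approximate rather than exact):

```python
import torch

x = torch.randn(32, 10)
W = torch.randn(10, 5)
alpha = 7.0

def bn(z):
    # Batch norm without the affine step: per-feature stats over the batch.
    return (z - z.mean(0)) / torch.sqrt(z.var(0, unbiased=False) + 1e-5)

# Scaling the weights by alpha leaves the normalized output unchanged.
print(torch.allclose(bn(x @ W), bn(x @ (alpha * W)), atol=1e-4))  # True
```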
Placement of BN:
BN can be placed before or after the activation function:
- Before activation (original paper): \(\text{BN}(\mathbf{Wx} + \mathbf{b}) \to f(\cdot)\) — BN is applied after the linear transformation and before activation. In this case, the bias \(\mathbf{b}\) can be omitted (since BN's \(\beta\) parameter already serves the shifting role).
- After activation: \(f(\mathbf{Wx} + \mathbf{b}) \to \text{BN}(\cdot)\) — some practitioners have found this works well in practice too.
In modern practice, both placements are used and there is no universally optimal choice.
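For reference, a typical "before activation" block in PyTorch looks like the following sketch (channel sizes are arbitrary); note the omitted bias, as discussed above:

```python
import torch.nn as nn

# Conv -> BN -> ReLU; bias=False because BN's beta already provides the shift.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```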
Limitations: BN depends on batch statistics, so its performance degrades significantly when the batch size is very small (e.g., 1 or 2) due to inaccurate statistics. Additionally, BN is not well-suited for RNNs/Transformers with variable sequence lengths, since the distribution can vary substantially across different positions.
Layer Normalization
Ba et al., "Layer Normalization", 2016.
Layer Normalization (LN), proposed by Jimmy Lei Ba et al. in 2016, is an alternative to BN.
Formula: Normalization is performed across all feature dimensions of a single sample:
\[\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2, \qquad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\]
where \(H\) is the total number of features in a single sample (i.e., the mean and variance are computed over all hidden dimensions of one sample). As with BN, learnable parameters \(\gamma\) and \(\beta\) are used for the affine transformation.
Key difference from BN: BN normalizes the same feature dimension across samples (along the batch dimension); LN normalizes within a single sample (along the feature dimension). This means LN's computation is entirely independent of other samples in the batch.
Why it is preferred in Transformers/RNNs:
- No batch dependency: LN computes independently for each sample without requiring batch statistics. This remains effective even with a batch size of 1 or variable sequence lengths.
- Handles variable-length sequences: RNNs and Transformers process sequences of different lengths. BN statistics across different time steps are inconsistent and unstable; LN normalizes each time step independently, avoiding this issue.
- Consistent behavior at inference: The computation is exactly the same during training and inference — no running statistics need to be maintained.
In Transformers, LN is typically applied at the input or output of each sub-layer (Self-Attention and FFN).
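A minimal PyTorch sketch of LN applied to a batch of token representations (shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model = 512
ln = nn.LayerNorm(d_model)        # normalizes over the last (feature) dimension

x = torch.randn(2, 7, d_model)    # (batch, seq_len, d_model); lengths may vary
y = ln(x)                         # each token is normalized independently
print(y.mean(-1).abs().max())     # per-token mean is ~0
```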
Instance Normalization
Ulyanov et al., "Instance Normalization: The Missing Ingredient for Fast Stylization", 2016.
Instance Normalization (IN), proposed by Ulyanov et al. in 2016, was originally designed to address problems in style transfer.
Formula: For a feature map of shape \((N, C, H, W)\), IN normalizes each channel of each sample independently:
\[\hat{x}_{nchw} = \frac{x_{nchw} - \mu_{nc}}{\sqrt{\sigma_{nc}^2 + \epsilon}}\]
where \(\mu_{nc}\) and \(\sigma_{nc}^2\) are the mean and variance computed over the spatial dimensions \((H, W)\).
Why it suits style transfer: In style transfer tasks, an image's "style" is primarily captured by the mean and variance of each channel in the feature maps. By normalizing each channel independently, IN effectively removes the style information of the original image, making it easier for the network to "inject" a new style into the content image.
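A minimal usage sketch with PyTorch's built-in module (shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 64, 56, 56)            # (N, C, H, W) feature map
inorm = nn.InstanceNorm2d(64, affine=True)
y = inorm(x)                              # each (n, c) slice normalized over (H, W),
                                          # stripping per-channel style statistics
```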
Group Normalization
Wu & He, "Group Normalization", ECCV 2018.
Group Normalization (GN), proposed by Yuxin Wu and Kaiming He in 2018, is a compromise between BN and LN.
Core idea: Channels are divided into several groups, and normalization is performed over the spatial dimensions and within-group channel dimensions for each group.
Formula: The \(C\) channels are divided into \(G\) groups, each containing \(C/G\) channels. The mean and variance are computed over all elements within each group for normalization.
Relationship to other methods:
- When \(G = 1\) (all channels in one group), GN is equivalent to LN.
- When \(G = C\) (each channel is its own group), GN is equivalent to IN.
- GN is a generalized form of both LN and IN.
Advantages: GN does not depend on batch size and remains stable even with very small batches (even batch=1). This is particularly useful in tasks like object detection and semantic segmentation, where input images are large and GPU memory constraints often limit the batch size to just 1 or 2. A typical setting is \(G = 32\).
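A minimal usage sketch (channel and group counts are illustrative):

```python
import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=32, num_channels=256)  # the typical G = 32 setting
x = torch.randn(1, 256, 32, 32)                     # batch size 1 is fine
y = gn(x)                                           # stats over (C/G, H, W) per group
```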
Comparison of Normalization Methods
| Method | Normalization Dimensions | Batch-Dependent? | Typical Use Cases |
|---|---|---|---|
| Batch Norm (BN) | Along batch dimension, per feature/channel | Yes | CNN classification (ResNet, VGG, etc.) |
| Layer Norm (LN) | Along feature dimension, all features of a single sample | No | Transformer, RNN |
| Instance Norm (IN) | Along spatial dimensions, per channel per sample | No | Style transfer, GAN |
| Group Norm (GN) | Along spatial + within-group channel dimensions | No | Object detection, small-batch scenarios |
For a 4D feature map of shape \((N, C, H, W)\), the four methods compute their statistics over the following axes (see the code sketch after this list):
- BN computes statistics over \((N, H, W)\), independently for each \(C\)
- LN computes statistics over \((C, H, W)\), independently for each \(N\)
- IN computes statistics over \((H, W)\), independently for each \((N, C)\)
- GN computes statistics over \((C/G, H, W)\), independently for each \((N, G)\)
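A quick way to see these axes in code; the tensor shape and group count below are arbitrary:

```python
import torch

x = torch.randn(8, 32, 16, 16)   # (N, C, H, W)
G = 4                            # number of groups for GN

mu_bn = x.mean(dim=(0, 2, 3))    # BN: one statistic per channel C
mu_ln = x.mean(dim=(1, 2, 3))    # LN: one statistic per sample N
mu_in = x.mean(dim=(2, 3))       # IN: one statistic per (N, C) pair
mu_gn = x.view(8, G, 32 // G, 16, 16).mean(dim=(2, 3, 4))  # GN: per (N, G)
```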
Selection guidelines:
- CNN + large batch: prefer BN
- Transformer / RNN: use LN
- Style transfer / image generation: use IN
- CNN + small batch (detection/segmentation): use GN
RMSNorm (Root Mean Square Normalization)
Zhang & Sennrich, "Root Mean Square Layer Normalization", NeurIPS 2019.
RMSNorm is a simplified version of Layer Normalization: it removes LN's mean-centering step and rescales by the root mean square alone:
\[\text{RMSNorm}(\mathbf{x})_i = \frac{x_i}{\text{RMS}(\mathbf{x})}\,\gamma_i, \qquad \text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{H}\sum_{i=1}^{H} x_i^2 + \epsilon}\]
Note that RMSNorm has no \(\beta\) (shift parameter), only \(\gamma\) (scale parameter).
Comparison with LN: LN computes two statistics — the mean \(\mu\) and the variance \(\sigma^2\) — while RMSNorm computes only one: the root mean square (RMS). RMSNorm can be understood as "Layer Norm that assumes zero mean."
Why it works: Research has shown that the effectiveness of Layer Normalization comes primarily from scale invariance rather than re-centering. RMSNorm eliminates the unnecessary mean computation, reducing computational overhead by approximately 10--15% while maintaining comparable performance.
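A minimal RMSNorm sketch in the style of LLaMA-family implementations (the eps placement follows common practice rather than a fixed standard):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale-only normalization: no mean subtraction, no beta."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # scale parameter only

    def forward(self, x):
        # Normalize by the root mean square over the feature dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)
```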
Applications: RMSNorm has been adopted by major open-source large language models including LLaMA, LLaMA-2, LLaMA-3, Qwen, and Mistral, and has become the de facto standard for large models.
Weight Normalization
Salimans & Kingma, "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks", NeurIPS 2016.
Weight Normalization takes a fundamentally different approach from BN/LN: it operates on the weights themselves rather than on the activations. Specifically, the weight vector \(\mathbf{w}\) is decomposed into a direction and a magnitude:
\[\mathbf{w} = \frac{g}{\|\mathbf{v}\|}\,\mathbf{v}\]
where \(g\) is a learnable scalar (controlling the magnitude/norm of the weights), \(\mathbf{v}\) is a learnable vector (controlling the direction of the weights), and \(\|\mathbf{v}\|\) is the Euclidean norm of \(\mathbf{v}\).
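A minimal sketch of the reparameterization (tensor shapes are arbitrary):

```python
import torch

# Manual form of w = (g / ||v||) * v.
v = torch.randn(10, requires_grad=True)    # direction parameter
g = torch.tensor(2.0, requires_grad=True)  # magnitude parameter
w = g * v / v.norm()                       # effective weight; ||w|| equals g

# PyTorch also ships this as a wrapper, e.g.:
#   torch.nn.utils.weight_norm(nn.Linear(10, 5))
```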
Differences from BN:
- BN normalizes activations and depends on mini-batch statistics.
- Weight Norm reparameterizes the weights, is batch-independent, and has lower computational overhead.
- Weight Norm lacks BN's regularization effect (since there is no batch noise).
Advantages: Low computational overhead, no batch size dependency, and well-suited for RNNs and generative models (e.g., WaveNet). However, in most modern architectures, BN or LN typically achieves better results.
Extended Batch Normalization
Luo et al., "Towards Understanding Regularization in Batch Normalization", ICLR 2019.
Extended BN investigates the theoretical foundations of BN's regularization effect. Standard BN uses the mini-batch mean and variance to estimate population statistics, and this estimation inherently introduces noise. Extended BN formalizes this noise effect, showing that BN's regularization strength is inversely proportional to batch size — smaller batches produce more noise and thus stronger regularization.
This analysis explains why generalization performance often degrades with large-batch training, and provides a theoretical basis for tuning the regularization strength of BN.
Pre-Norm vs Post-Norm
In Transformers, there are two choices for where the normalization layer is placed relative to the residual connection:
Post-Norm (original Transformer):
\[\mathbf{x}_{l+1} = \text{LN}\big(\mathbf{x}_l + \text{Sublayer}(\mathbf{x}_l)\big)\]
The residual addition is performed first, followed by normalization. This is the approach used in the original Transformer (Vaswani et al., 2017).
Pre-Norm:
\[\mathbf{x}_{l+1} = \mathbf{x}_l + \text{Sublayer}\big(\text{LN}(\mathbf{x}_l)\big)\]
Normalization is applied first, then the sub-layer, and finally the residual addition.
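The difference is just the wiring of the residual branch; a minimal sketch, where sublayer stands for Self-Attention or the FFN and ln is a LayerNorm instance:

```python
def post_norm_step(x, sublayer, ln):
    return ln(x + sublayer(x))   # residual add first, then normalize

def pre_norm_step(x, sublayer, ln):
    return x + sublayer(ln(x))   # normalize first; residual path stays identity
```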
Comparison:
| Aspect | Post-Norm | Pre-Norm |
|---|---|---|
| Training stability | Poor; requires warmup for deep models | Good; trainable without warmup |
| Final performance | Typically slightly better (if training is stable) | Typically slightly below Post-Norm |
| Learning rate sensitivity | High | Low |
| Primary applications | BERT, original Transformer | GPT series, LLaMA, most LLMs |
Why Pre-Norm is more stable: In Pre-Norm, the residual connection runs directly from input to output without passing through the normalization layer's "compression." This allows gradients to flow unimpeded through the residual path, similar to the identity shortcut in ResNet. In Post-Norm, gradients must pass through the normalization layer, which can lead to vanishing gradients in deep networks.
DeepNet: Making Post-Norm Trainable at Scale
Wang et al., "DeepNet: Scaling Transformers to 1,000 Layers", 2022.
Although Post-Norm tends to achieve slightly better final performance, it suffers from training instability. DeepNet introduces a scaling factor \(\alpha\) on the residual branch along with a specialized initialization strategy, enabling stable training of Post-Norm Transformers with over 1,000 layers:
\[\mathbf{x}_{l+1} = \text{LN}\big(\alpha\,\mathbf{x}_l + \text{Sublayer}(\mathbf{x}_l)\big)\]
Here \(\alpha > 1\) is a depth-dependent constant that amplifies the residual connection's contribution, while sub-layer weights are initialized with smaller values.
Interaction between Residual Connections and Normalization:
In deep networks, the interplay between residual connections (\(x + f(x)\)) and normalization layers is critical. In the Pre-Norm architecture, the residual path is free from the normalization layer's "compression," allowing gradients to flow directly through the identity shortcut — echoing the design philosophy of ResNet. In Post-Norm, each layer's output undergoes normalization's "rescaling," which helps control the distribution of activations but may impede gradient propagation in extremely deep networks.
Xiong et al., "On Layer Normalization in the Transformer Architecture", ICML 2020 provides a theoretical analysis of Pre-LN and Post-LN, showing that Pre-LN gradients are well-behaved at initialization, whereas Post-LN gradients may diverge without warmup.
Current trends: Most modern large language models (GPT-3, LLaMA, Mistral, etc.) adopt a Pre-RMSNorm configuration (Pre-Norm + RMSNorm), as training stability is paramount in large-scale pretraining.
VGG Case Study
Before the Batch Normalization paper was published in 2015, vanilla VGG trained poorly, reaching only around 70% accuracy; with Batch Normalization added, the same network easily surpasses 85% on the CIFAR-10 dataset. (See the PyTorch model building notes for reference.) An important note for this task: after adding Batch Normalization, Dropout should be removed. BatchNorm itself provides a degree of regularization (preventing overfitting), and combining strong BatchNorm with a high Dropout rate of 0.5 at the end can cause a "variance shift" problem that slows convergence or even reduces final accuracy. Moreover, CIFAR-10 images are only 32x32, so the useful features extracted are far fewer than from 224x224 images; dropping 50% of the features at the last layer may leave the model with too little information for classification.
Transformer on French-English
This experiment explores how different normalization and regularization techniques affect the results:
- Normalization determines how the model's activations are scaled and shifted during training, which directly affects the shape of the layer input/output distributions (mean and variance) and can greatly influence whether the model converges, how fast it converges, and how stable learning is.
- Regularization methods help prevent overfitting and improve generalization by controlling model complexity and constraining parameter updates. They can prevent units from co-adapting or penalize large weights, leading to smoother functions that are less sensitive to input variations and therefore generalize better to unseen test data.
In simpler terms:
- Normalization helps the model converge faster and more stably by adjusting the distribution of internal activations (i.e., their mean and variance).
- Regularization prevents overfitting and improves generalization by controlling model complexity (e.g., penalizing excessively large weights), ensuring the model does not over-rely on training data details and thus performs better on unseen test data.
Even more simply:
- Normalization helps the model learn effectively and learn fast.
- Regularization helps the model learn well and generalize well.
IWSLT2017 Dataset
IWSLT2017 is a classic, moderately-sized machine translation dataset:
- Source: Transcripts and translations of TED talks
- Training Set: 210,000 sentence pairs
- Validation Set: 890 pairs
- Testing Set: 8,000 pairs
Each sample consists of a source sentence in French and its corresponding target translation in English.
Transformer Model
For detailed network architecture, refer to the DNN notes. A quick review:
- The Transformer does not rely on the traditional sequential processing mechanism of RNNs; it can efficiently capture long-range dependencies and supports parallel training.
- The Multi-Head Attention Mechanism allows the model to attend to all relevant words in the sentence when processing a given word (multi-head means multi-perspective).
- Position-wise Feed-Forward Networks (FFNs) apply the same transformation to every position independently; each word in the sentence is processed separately.
- Before feeding word meaning vectors (embeddings) to the model, the Transformer first adds a positional vector (Positional Encoding).
The two core components of the Transformer:
- Encoder: Processes the source sentence (French in our project). The encoder produces a series of contextual representations, achieving "understanding."
- Decoder: Receives the target language sentence (English in our project). The decoder's goal is to learn to generate — simply put, it generates the corresponding English sentence word by word by referencing the encoder's understanding.
Let us look at how the Transformer works on a translation task. During training, the encoder first sees the entire French sentence and models the relationships within it. The decoder then attends to these French representations while processing the English words one at a time. At each step, the decoder is given the correct previous words (teacher forcing) and learns the predictive relationship between the French context, the English prefix it has seen so far, and the next word it should output.
In other words, during training:
- INPUT: Context, i.e., [French understanding] + [correct English prefix, e.g., "The cat"]
- TARGET: The model is made to "predict" [the next word "is"]. This is enforced learning (teacher forcing).
After training, when the model sees a French passage, it can continuously predict the next corresponding English word, then keep predicting the next word based on the French passage and the English words already generated, until the entire sentence is complete. This snowball-like process is called autoregressive generation.
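A sketch of this greedy autoregressive loop. The model interface here (encode/decode) and the token IDs are hypothetical stand-ins for illustration, not the project's actual API:

```python
import torch

def greedy_translate(model, src_ids, bos_id, eos_id, max_len=128):
    """Greedy autoregressive decoding. `model` is assumed (hypothetically) to
    expose encode(src) and decode(tgt_prefix, memory) -> logits of shape
    (batch, tgt_len, vocab)."""
    memory = model.encode(src_ids)              # "understand" the French input once
    tgt = torch.tensor([[bos_id]])              # start from the <bos> token
    for _ in range(max_len):
        logits = model.decode(tgt, memory)      # next-token distribution
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_id], dim=1)  # snowball: feed prediction back in
        if next_id.item() == eos_id:
            break
    return tgt
```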
Baseline Model
In the initial Transformer configuration, dropout (0.1) is applied before the "Add & Norm" step, the Adam optimizer is used, and no additional regularization methods are employed. The model uses post-layer normalization (applied after the residual connection) and performs learning rate warm-up during the first 4,000 update steps. Weights are initialized using the Xavier Uniform distribution. In this project, we use this baseline model as the foundation to explore the effects of different regularization and normalization methods.
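The warm-up described here matches the schedule from Vaswani et al. (2017); a minimal sketch, assuming that schedule with the baseline's 4,000 warm-up steps:

```python
def noam_lr(step, d_model=512, warmup=4000):
    """Linear warm-up for the first `warmup` steps, then decay ~ step^-0.5.
    Typically used as a multiplicative factor on a base learning rate of 1.0
    (e.g., via torch.optim.lr_scheduler.LambdaLR)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```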
Dropout is a powerful technique that randomly "drops" or "deactivates" neurons with a probability of 10% (0.1) during training. This forces the model not to over-rely on any single neuron but instead to learn more robust and diverse features, thereby preventing overfitting.
In this experiment, we first reproduce the baseline model. The key components include:
- Normalization: Post-layer Normalization
- Regularization: Dropout 0.1
- Optimizer: Adam
- LR Scheduling: Learning Rate Warm-Up
- Initialization: Xavier Uniform Distribution
After successfully reproducing the baseline experiment, we explore different methods.
BLEU
BLEU (Bilingual Evaluation Understudy) Score is a standard metric for automatically evaluating machine translation quality. It works by comparing the model-generated translation against one or more human reference translations and scoring based on the degree of n-gram overlap between them. In this project, the BLEU score serves as the ultimate "report card" for model quality: a higher BLEU score means the translation is closer to human translation in terms of word choice and sentence structure, indicating better translation quality.
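As a toy illustration of how such a score is computed, a minimal example with the sacrebleu package (assuming it as the scoring tool; the strings are made up):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat is on the mat"]      # model outputs, one string per sentence
references = [["the cat sits on the mat"]]  # one or more reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)                           # 0-100; higher means closer to reference
```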
Experimental Setup
Hardware configuration:
- Memory: 64GB
- GPU: H200 - 140GB
Completed experiments:
- Preliminary test experiments (in notebook)
- Formal experiments: 10 long-running experiments (epoch=50), with a total runtime of approximately 40 hours