Regularization
Regularization techniques can be categorized as follows:
- Explicit Regularization: Directly adding a penalty term to the loss function.
  - L1 / L2 (Weight Decay)
  - Entropy Regularization
- Structural/Computational Regularization: Modifying the network architecture or computation.
  - Dropout (randomly disabling neurons)
  - Label Smoothing (softening targets)
- Process Regularization: Altering the training procedure.
  - Early Stopping (terminating training early)
  - Adversarial Training
- Data Regularization: Expanding or perturbing the training data.
  - Data Augmentation
  - Mixup / Cutout
The Fundamental Conflict
Optimization:
- Goal: Reduce training loss.
- Enemy: Underfitting, local minima, saddle points, vanishing/exploding gradients.
- Approach: Make the model learn faster and more accurately.
- Key players: SGD, Momentum, Adam, Learning Rate Scheduling.
- Mantra: "Memorize everything first — worry about understanding later."
Regularization:
- Goal: Reduce test loss (generalization error).
- Enemy: Overfitting.
- Approach: Hinder the model's learning — make training harder so it cannot fit too comfortably.
- Key players: Dropout, L1/L2 Weight Decay, Data Augmentation, Early Stopping.
- Mantra: "Don't memorize — learn the underlying patterns."
Dropout
Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR 2014.
During training, Dropout randomly sets a fraction of neuron outputs to 0 with a given probability (e.g., \(p=0.5\)), effectively "disabling" them. For each neuron \(i\), a Bernoulli random variable \(r_i \sim \text{Bernoulli}(1-p)\) is introduced:
\[
\tilde{y}_i = r_i \, y_i
\]
At inference time, Dropout is not applied, but outputs are scaled by \((1-p)\) to compensate for the expected value mismatch during training (alternatively, inverted dropout can be used during training, which divides by \((1-p)\)).
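A minimal sketch of the inverted-dropout variant mentioned above (the function name and signature are illustrative, not a library API):

```python
import torch

def dropout_forward(x, p=0.5, training=True):
    """Inverted dropout: scale by 1/(1-p) at training time so that
    no rescaling is needed at inference time."""
    if not training or p == 0.0:
        return x
    # Bernoulli mask: each element is kept with probability 1-p
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)
```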
- Breaking Co-adaptation: In a standard network, neurons develop dependencies — some neurons specialize in correcting the errors of others. Dropout randomly removes neurons, forcing the remaining ones to learn useful features independently rather than relying on specific partners.
- Implicit Model Ensemble: A network with \(n\) neurons under Dropout effectively samples one of \(2^n\) possible "thinned networks" (sparse sub-networks) for each forward pass during training. At inference time, using the full network (with scaled weights) is equivalent to taking the geometric mean of predictions from all sub-networks. This exponentially large implicit ensemble is the core source of Dropout's strong regularization effect.
Variational Dropout
Kingma et al., "Variational Dropout and the Local Reparameterization Trick", NeurIPS 2015.
Standard Dropout uses a fixed drop probability \(p\) for every neuron. Variational Dropout treats the dropout rate itself as a learnable parameter, using a variational inference framework to automatically learn the optimal drop probability for each weight or neuron.
From a Bayesian perspective, Dropout can be interpreted as approximate variational inference over the weights, where the dropout mask corresponds to a multiplicative noise posterior. Variational Dropout formalizes this connection further: each weight \(w_{ij}\) is associated with a noise variable \(\xi_{ij} \sim \mathcal{N}(1, \alpha_{ij})\), where \(\alpha_{ij}\) is learnable. When \(\alpha \to \infty\), the weight is effectively dropped entirely; when \(\alpha \to 0\), the weight is fully retained.
Gaussian Dropout
Gaussian Dropout is a continuous relaxation of standard Dropout. Instead of setting neuron outputs to either 0 or their original value (discrete Bernoulli noise), it multiplies outputs by Gaussian noise:
\[
\tilde{y}_i = y_i \cdot \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(1, \alpha)
\]
where \(\alpha = \frac{p}{1-p}\) (\(p\) is the equivalent dropout rate). When \(p = 0.5\), \(\alpha = 1\).
Gaussian Dropout matches standard Dropout in the first and second moments, but because the noise is continuous, gradients are smoother, leading to more stable training in certain scenarios.
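A minimal sketch of Gaussian Dropout under the definition above (illustrative helper, not a library routine):

```python
import torch

def gaussian_dropout(x, p=0.5, training=True):
    """Multiply activations by Gaussian noise with mean 1 and variance
    alpha = p / (1 - p), the continuous analogue of Bernoulli dropout."""
    if not training or p == 0.0:
        return x
    alpha = p / (1.0 - p)
    noise = torch.randn_like(x) * alpha ** 0.5 + 1.0  # samples from N(1, alpha)
    return x * noise
```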
DropConnect
Wan et al., "Regularization of Neural Networks using DropConnect", ICML 2013.
DropConnect is a generalization of Dropout. While Dropout randomly zeroes out neuron activations, DropConnect randomly zeroes out individual connections (weights) in the weight matrix:
\[
\mathbf{y} = (\mathbf{M} \odot \mathbf{W})\,\mathbf{x}
\]
where \(\mathbf{M}\) is a Bernoulli mask matrix of the same shape as the weight matrix, and \(\odot\) denotes element-wise multiplication.
Dropout vs. DropConnect: Dropout applies the same mask to all outgoing connections of a neuron (either all kept or all dropped), whereas DropConnect samples an independent mask for each individual connection, providing finer-grained regularization. DropConnect theoretically produces more sub-network combinations (\(2^{|\mathbf{W}|}\) vs. \(2^n\)), but at a higher computational cost.
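A minimal sketch of a DropConnect linear layer (an illustrative helper; the original paper uses a Gaussian moment-matching approximation at inference, whereas this sketch simply uses inverted scaling during training):

```python
import torch
import torch.nn.functional as F

def dropconnect_linear(x, weight, bias=None, p=0.5, training=True):
    """Linear layer with DropConnect: an independent Bernoulli mask is
    sampled for every individual weight, not per neuron."""
    if training and p > 0.0:
        mask = (torch.rand_like(weight) > p).float()
        weight = weight * mask / (1.0 - p)  # inverted scaling, as with dropout
    return F.linear(x, weight, bias)
```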
Weight Decay
Hanson & Pratt, "Comparing Biases for Minimal Network Construction with Back-Propagation", NeurIPS 1988.
Weight Decay was originally defined as multiplying the weights by a decay factor less than 1 at each parameter update:
\[
w_{t+1} = (1 - \lambda)\, w_t - \eta \nabla L(w_t)
\]
where \(\lambda\) is the weight decay coefficient and \(\eta\) is the learning rate. This means that at every update step, the weights are "decayed" by a small fraction.
L2 Regularization (Krogh & Hertz, 1991) adds the sum of squared weights to the loss function:
\[
L_{total}(w) = L(w) + \frac{\lambda}{2}\|w\|^2
\]
Its gradient is \(\nabla L_{total} = \nabla L + \lambda w\). Substituting into the SGD update rule:
\[
w_{t+1} = w_t - \eta\left(\nabla L(w_t) + \lambda w_t\right) = (1 - \eta\lambda)\, w_t - \eta \nabla L(w_t)
\]
Equivalence of WD and L2 — Only Holds for SGD:
Under standard SGD, Weight Decay and L2 regularization are mathematically equivalent (with appropriate scaling of \(\lambda\)). However, for adaptive optimizers like Adam, the two are not equivalent.
In Adam, the L2 regularization gradient \(\lambda w\) gets scaled by Adam's second-moment estimate, resulting in inconsistent regularization strength across parameters — parameters with larger gradient histories actually receive weaker regularization. This contradicts the original intent of Weight Decay.
AdamW: Decoupled Weight Decay
Loshchilov & Hutter, "Decoupled Weight Decay Regularization", ICLR 2019.
AdamW decouples weight decay from the gradient update, applying the decay directly to the parameters after the update:
\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
w_{t+1} &= (1-\lambda)\, w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]
The key difference is in the last line: the weight decay term \((1-\lambda)w_t\) acts directly on the weights themselves, bypassing Adam's adaptive scaling. This ensures uniform decay strength across all parameters.
Experiments show that AdamW significantly outperforms Adam + L2 regularization on models such as Transformers, with more stable hyperparameter search. AdamW has become the standard optimizer for training Transformers.
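In PyTorch, decoupled weight decay is available directly via `torch.optim.AdamW`; a typical configuration (the model, learning rate, and decay value below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in model for illustration
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,  # decoupled decay, applied directly to the weights
)
```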
Interaction Between Weight Decay and Batch Normalization
van Laarhoven, "L2 Regularization versus Batch and Weight Normalization", 2017.
When Batch Normalization (BN) is used in a network, WD/L2 has a subtle effect on the weights of the layer preceding BN. Due to BN's scale invariance (\(\text{BN}(\alpha W x) = \text{BN}(Wx)\)), the absolute magnitude of the weights does not affect BN's output, so WD does not directly influence the network's function.
However, WD causes the weight norm to gradually decrease, and because of BN's scale invariance, the gradient is inversely proportional to the weight norm (\(\|\nabla_W L\| \propto 1/\|W\|\)). As a result, WD effectively acts as an increase in the effective learning rate:
\[
\eta_{\text{eff}} \propto \frac{\eta}{\|W\|^2}
\]
The more WD shrinks the weights, the larger the effective learning rate becomes. This explains why Weight Decay behaves differently in networks with BN compared to those without.
Properties of Weight Decay
Summary of Weight Decay's core effects:
- Penalizes large weights: At each parameter update, the weights are not only moved in the gradient descent direction but also pulled slightly back toward zero.
- Limits complexity: Mathematically, larger weights cause the model function to fluctuate more dramatically in response to input changes (making it highly sensitive to noise). Weight Decay keeps the weights small, resulting in a smoother model function.
- Improves noise robustness: L2 regularization can be interpreted as imposing a Gaussian prior on the weights \(p(w) \propto \exp(-\lambda w^2)\), and is closely related to training with Gaussian noise added to the inputs. Smaller weights mean the model is less sensitive to input perturbations.
- Implicit learning rate adjustment: In networks with BN, WD indirectly increases the effective learning rate by shrinking the weight norm.
L1 Regularization
L1 regularization (also known as Lasso Regularization) adds the sum of absolute weight values to the loss function:
\[
L_{total}(w) = L(w) + \lambda \sum_i |w_i|
\]
The key difference from L2 regularization is that L1 tends to produce sparse weights (many weights become exactly 0), while L2 only makes weights small without driving them precisely to zero.
Why L1 produces sparsity: The gradient of L1 is a constant \(+\lambda\) when \(w > 0\) and \(-\lambda\) when \(w < 0\) (independent of the magnitude of \(w\)). This means that regardless of how small a weight is, L1 applies the same "push" toward zero. In contrast, the gradient of L2 is \(2\lambda w\), which weakens as the weight gets smaller — so weights can approach zero asymptotically but never actually reach it.
Use cases: L1 regularization is commonly used for feature selection (automatically zeroing out weights corresponding to irrelevant features). In deep learning, L2 (Weight Decay) is more prevalent, since sparsity is typically not essential in neural networks.
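A minimal sketch of adding an L1 penalty to the loss in PyTorch (the model, data, and coefficient below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # stand-in model
criterion = nn.MSELoss()
lambda_l1 = 1e-4             # illustrative L1 coefficient

x, y = torch.randn(32, 128), torch.randn(32, 10)
loss = criterion(model(x), y)
# Sum of absolute values of all parameters, added to the task loss
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + lambda_l1 * l1_penalty).backward()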
Early Stopping
Early Stopping is one of the simplest and most practical regularization techniques.
Core idea: During training, monitor both the training loss and the validation loss. When the validation loss stops decreasing (or begins to increase) while the training loss continues to drop, the model is starting to overfit. At this point, training is terminated early, and the model parameters from the epoch with the lowest validation loss are restored.
Implementation details (a minimal sketch follows the list):
- Set a patience parameter (e.g., 10), specifying how many consecutive epochs the validation loss is allowed to not improve
- Whenever the validation loss reaches a new minimum, save the current model parameters as a checkpoint
- When the validation loss has not improved for patience consecutive epochs, stop training
- Use the saved best checkpoint as the final model
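A sketch of this loop, assuming hypothetical helpers train_one_epoch and evaluate plus a model and data loaders defined elsewhere:

```python
best_val_loss = float("inf")
patience, patience_counter = 10, 0
best_state = None
max_epochs = 200  # illustrative upper bound

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)   # placeholder training step
    val_loss = evaluate(model, val_loader)  # placeholder validation step

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        # checkpoint the best parameters seen so far
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break  # stop training early

model.load_state_dict(best_state)  # restore the best checkpoint
```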
Why it is regularization: Early Stopping limits the effective training duration, which is equivalent to limiting the model's "effective capacity." The longer training continues, the more the model can memorize noise and details in the training data; stopping early prevents the model from overfitting to the training set.
Equivalence with L2 regularization: It can be shown mathematically that, under certain conditions, Early Stopping is equivalent to L2 regularization. For a simple quadratic loss function with SGD, training for \(T\) steps is equivalent to L2 regularization with \(\lambda \approx \frac{1}{\eta T}\) (\(\eta\) being the learning rate). Fewer training steps (earlier stopping) correspond to a larger effective regularization strength \(\lambda\), imposing a stronger constraint on model complexity.
Data Augmentation
Data Augmentation artificially expands the training dataset by applying various transformations to the training data, serving as an implicit form of regularization.
Why it is regularization: Data augmentation increases the diversity of training data, forcing the model to learn features that are invariant to various transformations rather than memorizing the specific details of training samples. This directly improves generalization.
Common data augmentation methods:
- Image domain: Random Crop, Horizontal Flip, Rotation, Color Jitter, Random Erasing
- Text domain: Synonym replacement, random word insertion/deletion/swapping, Back Translation
- Advanced methods: Mixup (blending two images by a ratio), CutMix (replacing a region of one image with a region from another), Cutout (randomly masking a rectangular region of an image)
In modern deep learning, data augmentation is nearly standard practice, especially in scenarios with limited data.
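For the image domain, a typical torchvision pipeline might look like the following (the specific transforms and parameter values are illustrative):

```python
from torchvision import transforms

# A common image-classification augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),  # applied on the tensor, after ToTensor
])
```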
Label Smoothing
Label Smoothing is a regularization method that softens the training targets, proposed by Szegedy et al. in "Rethinking the Inception Architecture for Computer Vision" (2016).
Core idea: In standard classification tasks, target labels are one-hot encoded (e.g., \([0, 0, 1, 0]\)), requiring the model to predict the correct class with 100% confidence. Label Smoothing replaces these "hard" targets with "soft" targets:
\[
\tilde{y}_k = (1-\alpha)\, y_k + \frac{\alpha}{K}
\]
where \(\alpha\) is the smoothing coefficient (typically 0.1) and \(K\) is the number of classes. For example, in a 4-class problem with \(\alpha = 0.1\), the label \([0, 0, 1, 0]\) becomes \([0.025, 0.025, 0.925, 0.025]\).
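Recent PyTorch versions expose this directly through the cross-entropy loss; a minimal usage sketch:

```python
import torch.nn as nn

# Cross-entropy with label smoothing (alpha = 0.1)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```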
Why it works:
- Prevents overconfidence: It discourages the model from producing extreme logit values for training samples, resulting in a more "tempered" output probability distribution
- Improves generalization: Soft targets encourage the model to learn relative relationships between classes, rather than focusing solely on the correct class
- Tolerance to label noise: When training data contains labeling errors, Label Smoothing mitigates the negative impact of incorrect labels
Applications: Label Smoothing is widely used in Transformer training (e.g., machine translation, language modeling) and image classification, and is a standard component of modern training pipelines.
Mixup and CutMix
Mixup
Mixup (Zhang et al., 2018) generates new training samples by linearly interpolating between two training samples and their labels:
\[
\tilde{x} = \lambda x_i + (1-\lambda)\, x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)\, y_j
\]
where \(\lambda \sim \text{Beta}(\alpha, \alpha)\), with \(\alpha\) typically set to 0.2–0.4.
Intuition: If \(\lambda = 0.7\), the new sample is "70% cat + 30% dog," and the corresponding label is \([0.7, 0.3]\). This forces the model to learn smooth transitions between samples rather than memorizing discrete data points.
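A minimal sketch of Mixup applied to a batch (an illustrative helper assuming one-hot or soft labels, not the authors' reference code):

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    """Mix a batch with a shuffled copy of itself.
    x: (B, ...) inputs; y_onehot: (B, K) one-hot (or soft) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```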
CutMix
CutMix (Yun et al., 2019) does not blend entire images; instead, it replaces a rectangular region of one image with the corresponding region from another:
- A random rectangular region is selected (with an area ratio of \(1 - \lambda\))
- That region is replaced with content from another image
- Labels are mixed proportionally to the area: \(\tilde{y} = \lambda y_i + (1 - \lambda) y_j\)
Advantages over Mixup: Mixup superimposes two images, producing unnatural blended textures. CutMix preserves local image structure, encouraging the model to classify correctly even under partial occlusion, while also providing a Cutout-like effect.
Gradient Clipping
Gradient Clipping prevents gradient explosion by capping the magnitude of gradients, and is standard practice in RNN and Transformer training.
Two common approaches:
Clip by Norm:
If the L2 norm of the gradient exceeds max_norm, the gradient is proportionally scaled down. This preserves the gradient direction while limiting its magnitude.
# PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Clip by Value:
Each gradient component is clamped to the range \([-\text{clip\_value}, +\text{clip\_value}]\). This may alter the gradient direction.
# PyTorch
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
Is it regularization? Strictly speaking, Gradient Clipping is more of a training stability technique than a regularization method. However, by limiting the magnitude of parameter updates, it indirectly constrains model complexity. In practice, nearly all Transformer models are trained with gradient clipping at max_norm=1.0.
Stochastic Depth
Stochastic Depth, proposed by Huang et al. in 2016, can be understood as "layer-level Dropout."
Core idea: During training, certain residual blocks are randomly skipped (dropped). For a residual block \(x_{l+1} = x_l + f_l(x_l)\), Stochastic Depth simplifies it to \(x_{l+1} = x_l\) (bypassing the layer's computation entirely) with probability \(1-p_l\), where \(p_l\) is the layer's survival probability.
Survival probability schedule: A linear decay strategy is typically used, where shallow layers have a high survival probability (close to 1) and deeper layers have a lower one:
\[
p_l = 1 - \frac{l}{L}\left(1 - p_L\right)
\]
where \(L\) is the total number of layers and \(p_L\) is the survival probability of the last layer (typically 0.5–0.8). This is based on the observation that shallow layers learn fundamental features (more critical), while deeper layers learn higher-level features (with some redundancy).
At inference time: All layers are retained, but each layer's output is multiplied by its training survival probability (analogous to Dropout's inference scaling).
Effect: Stochastic Depth significantly improves generalization in deep networks such as ResNet, while also reducing training time (since some layers' computations are skipped). This idea was later widely adopted in Vision Transformers (ViT/DeiT), becoming a standard component of ViT training (commonly referred to as DropPath in the Transformer context).
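A minimal sketch of the inverted-scaling DropPath variant commonly used in ViT implementations (an illustrative helper; unlike the description above, it rescales during training so that inference needs no change):

```python
import torch

def drop_path(x, drop_prob=0.1, training=True):
    """Randomly zero the residual branch for a whole sample ("layer dropout").
    Used inside a residual block as: x + drop_path(f(x))."""
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    # one Bernoulli draw per sample, broadcast over all remaining dimensions
    shape = (x.size(0),) + (1,) * (x.dim() - 1)
    mask = (torch.rand(shape, device=x.device, dtype=x.dtype) < keep_prob).to(x.dtype)
    return x * mask / keep_prob  # inverted scaling during training
```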
Recommendations for Choosing Regularization Methods
| Task/Model | Recommended Regularization |
|---|---|
| CNN Image Classification | Data Augmentation + Weight Decay + Label Smoothing |
| Transformer NLP | Dropout + Label Smoothing + Weight Decay (AdamW) |
| Vision Transformer | DropPath + Mixup/CutMix + Label Smoothing + Weight Decay |
| RNN/LSTM | Dropout + Gradient Clipping + Weight Decay |
| Small Datasets | Data Augmentation + Dropout + Early Stopping + Weight Decay |
| Large-Scale Pretraining | Weight Decay + Gradient Clipping (regularization needs are lower) |
General principles:
- Weight Decay (AdamW) should be used in virtually all scenarios
- Gradient Clipping is essential for Transformer and RNN training
- Data Augmentation is most effective when data is limited
- Avoid stacking too many regularization methods simultaneously, as this may lead to underfitting