The Pre-Training Paradigm

Why Pre-Training Is Needed

Motivation from Transfer Learning

Traditional supervised learning faces a fundamental contradiction: high-quality labeled data is scarce and expensive, yet deep models require large amounts of data to learn good representations.

The core idea behind pre-training is:

First learn general-purpose representations from massive unlabeled data, then adapt to specific tasks using a small amount of labeled data.

The advantages of this approach include:

  • Data efficiency: Downstream tasks can achieve strong performance with only a small amount of labeled data
  • Representation quality: Features learned by pre-trained models have richer semantic structure than those from randomly initialized models
  • Generalization: Pre-trained representations transfer well to a wide variety of downstream tasks

Formally, pre-training can be viewed as learning a good parameter initialization \(\theta_0\):

\[ \theta_0 = \arg\min_\theta \mathcal{L}_{\text{pretrain}}(\theta; \mathcal{D}_{\text{unlabeled}}) \]

The model is then fine-tuned on a downstream task:

\[ \theta^* = \arg\min_\theta \mathcal{L}_{\text{task}}(\theta; \mathcal{D}_{\text{labeled}}) \quad \text{starting from } \theta_0 \]

Self-Supervised Learning: The Key to Pre-Training

Core Idea

The essence of Self-Supervised Learning (SSL) is:

Constructing supervisory signals from the data itself, without any manual annotation.

In practice, some transformation or masking is applied to the input data, and the model is trained to predict the hidden portion:

General framework of self-supervised learning:

Input x → Transform/Mask → Partial observation x̃
Model objective: Recover/predict the missing parts of x from x̃

Categories of Self-Supervised Signals

| Type | Methods | Prediction Target |
|------|---------|-------------------|
| Generative | GPT, MAE | Predict masked/future tokens |
| Contrastive | SimCLR, MoCo, CLIP | Pull positive pairs closer, push negative pairs apart |
| Predictive | BERT (MLM), BEiT | Predict masked tokens/patches |

Language Model Pre-Training

Language model pre-training is the most successful paradigm for Foundation Models. Based on the pre-training objective, it can be divided into three major categories.

1. Autoregressive Language Models (AR LM)

Representative models: GPT series

Autoregressive models predict the next token from left to right:

\[ \mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta) \]

Architectural characteristics:

  • Uses the Transformer Decoder architecture
  • Causal attention mask: each position can only attend to previous tokens
  • Naturally suited for text generation tasks

Input:   [The] [cat] [sat] [on]  [the]
Target:  [cat] [sat] [on]  [the] [mat]
         ←  Predict next token step by step  →

Strengths: Strong generation capability; supports in-context learning.

Weaknesses: Unidirectional encoding; cannot leverage bidirectional context.
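As a concrete illustration, \(\mathcal{L}_{\text{AR}}\) can be computed directly from per-step logits. The NumPy sketch below is a toy version (the name `autoregressive_nll` is illustrative, not a library API); real implementations use a framework's cross-entropy over a causally masked Transformer's outputs.

```python
import numpy as np

def autoregressive_nll(logits, tokens):
    """Average next-token negative log-likelihood.

    logits: (T, V) array -- logits[t] scores the token at position t+1.
    tokens: (T+1,) int array -- the full sequence x_1 .. x_{T+1}.
    """
    # Numerically stable log-softmax over the vocabulary at each step.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Score each target x_{t+1} given the prefix x_1 .. x_t.
    targets = tokens[1:]
    return -log_probs[np.arange(len(targets)), targets].mean()

tokens = np.array([3, 1, 4, 1, 5])  # toy sequence, vocab size 10
rng = np.random.default_rng(0)
loss = autoregressive_nll(rng.normal(size=(4, 10)), tokens)
```

With all-zero logits the model is uniform over the vocabulary, so the loss reduces to \(\log V\) -- a useful sanity check when debugging.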

2. Masked Language Models (Masked LM)

Representative model: BERT

BERT randomly selects 15% of the input tokens for prediction (replacing 80% of the selected tokens with [MASK], 10% with a random token, and leaving 10% unchanged) and trains the model to recover the originals:

\[ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}}; \theta) \]

where \(\mathcal{M}\) is the set of masked positions and \(x_{\backslash \mathcal{M}}\) denotes the unmasked portion.

Architectural characteristics:

  • Uses the Transformer Encoder architecture
  • Bidirectional attention: each position can attend to all other positions
  • Well-suited for Natural Language Understanding (NLU) tasks

Input:   [The] [cat] [MASK] [on] [the] [MASK]
Target:  Predict [MASK] → "sat", "mat"

Strengths: Bidirectional encoding; strong comprehension ability.

Weaknesses: Weak generation capability; mismatch between pre-training and downstream tasks (no [MASK] tokens at inference time).
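A minimal sketch of the MLM loss in NumPy (names are illustrative): given the model's log-probabilities over the corrupted input, only positions in \(\mathcal{M}\) contribute to the loss.

```python
import numpy as np

def mlm_loss(log_probs, tokens, masked_idx):
    """NLL of the original tokens, averaged over masked positions only."""
    return -log_probs[masked_idx, tokens[masked_idx]].mean()

rng = np.random.default_rng(0)
T, V = 20, 10
tokens = rng.integers(0, V, size=T)
masked_idx = rng.choice(T, size=3, replace=False)  # ~15% of positions

# Toy model outputs: a log-softmax over random logits.
logits = rng.normal(size=(T, V))
z = logits - logits.max(axis=-1, keepdims=True)
log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
loss = mlm_loss(log_probs, tokens, masked_idx)
```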

3. Sequence-to-Sequence Models (Seq2Seq)

Representative models: T5, BART

The pre-training task is unified into a "text-to-text" format:

\[ \mathcal{L}_{\text{S2S}} = -\sum_{t=1}^{T_{\text{out}}} \log P(y_t | y_{<t}, x; \theta) \]

T5's Span Corruption objective: randomly masks contiguous spans in the input and generates the masked content on the decoder side.

Input  (Encoder):  "The <X> sat on <Y> mat"
Output (Decoder):  "<X> cat <Y> the"
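The encoder/decoder pair above can be reproduced with a few lines of plain Python. The helper `span_corrupt` and the fixed sentinel list are illustrative simplifications of T5's actual preprocessing (which samples span positions and lengths randomly):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel <X>, <Y>, ...;
    the target emits each sentinel followed by the removed tokens."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    inp, tgt, prev = [], [], 0
    for s, (start, end) in zip(sentinels, spans):
        inp += tokens[prev:start] + [s]   # keep text up to the span, drop the span
        tgt += [s] + tokens[start:end]    # decoder regenerates the dropped span
        prev = end
    inp += tokens[prev:]
    return inp, tgt

tokens = ["The", "cat", "sat", "on", "the", "mat"]
inp, tgt = span_corrupt(tokens, [(1, 2), (4, 5)])
# inp -> ["The", "<X>", "sat", "on", "<Y>", "mat"]
# tgt -> ["<X>", "cat", "<Y>", "the"]
```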

Architectural characteristics:

  • Encoder-Decoder architecture
  • Combines both comprehension and generation capabilities
  • Unified text-to-text framework

Comparison of Language Pre-Training Objectives

| Method | Architecture | Attention | Suitable Tasks | Representative Models |
|--------|--------------|-----------|----------------|-----------------------|
| Autoregressive | Decoder-only | Causal | Generation, Dialogue | GPT, LLaMA |
| Masked LM | Encoder-only | Bidirectional | Classification, NER | BERT, RoBERTa |
| Seq2Seq | Encoder-Decoder | Bidirectional encoder, causal decoder | Translation, Summarization | T5, BART |

Visual Pre-Training

1. Contrastive Learning

Core idea: Pull together the representations of different augmented views of the same image (positive pairs) while pushing apart representations of different images (negative pairs).

InfoNCE Loss:

\[ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)} \]

where \(\tau\) is the temperature parameter and \(\text{sim}(\cdot, \cdot)\) is typically cosine similarity.
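A NumPy sketch of this loss under the SimCLR batch convention that rows \(2k\) and \(2k+1\) are the two augmented views of image \(k\) (a toy version; real implementations operate on GPU tensors and large batches):

```python
import numpy as np

def info_nce(z, tau=0.5):
    """InfoNCE over 2N embeddings; rows 2k and 2k+1 are views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # sim = cosine similarity
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)                 # the 1[k != i] indicator
    pos = np.arange(len(z)) ^ 1                       # partner view: 0<->1, 2<->3, ...
    log_denom = np.log(np.exp(logits).sum(axis=1))
    return (log_denom - logits[np.arange(len(z)), pos]).mean()

# Identical positive pairs, orthogonal across pairs -> near-zero loss.
z_easy = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
easy = info_nce(z_easy, tau=0.1)
```

Because the positive pair's similarity also appears in the denominator, the loss is always positive; it approaches zero only when positives dominate every negative.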

SimCLR (Chen et al., 2020):

  • Applies two different data augmentations to the same image to form a positive pair
  • Relies on large batch sizes to provide sufficient negative samples
  • Architecture: Encoder → Projection Head → Contrastive Loss

MoCo (He et al., 2020):

  • Introduces a momentum encoder and a dynamic queue
  • Removes SimCLR's dependency on large batch sizes
  • Momentum update: \(\theta_k \leftarrow m \theta_k + (1 - m) \theta_q\), where \(m = 0.999\)

Contrastive learning framework:

Image → Augment1 → Encoder → z_1 ─┐
                                  ├→ Pull closer (positive pair)
Image → Augment2 → Encoder → z_2 ─┘

Other Images → Encoder → z_neg ───→ Push apart (negative pairs)
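MoCo's momentum update is a plain exponential moving average of the key encoder's parameters toward the query encoder's. A toy sketch over a flat parameter vector:

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update: the key encoder drifts slowly toward the query encoder,
    keeping keys already stored in the queue approximately consistent."""
    return m * theta_k + (1 - m) * theta_q

theta_q = np.ones(4)    # query-encoder params (updated by backprop)
theta_k = np.zeros(4)   # key-encoder params (updated only by EMA)
for _ in range(1000):
    theta_k = momentum_update(theta_k, theta_q)
```

After 1000 steps, `theta_k` has moved roughly \(1 - 0.999^{1000} \approx 63\%\) of the way toward `theta_q`, illustrating how slowly the key encoder evolves.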

2. Masked Image Modeling

MAE (He et al., 2022): Masked Autoencoder

  • Randomly masks 75% of image patches
  • The encoder only processes visible patches (saving computation)
  • The decoder reconstructs the masked patches

\[ \mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \| \hat{x}_i - x_i \|_2^2 \]
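A toy sketch of this objective on random "patches" (illustrative shapes; real MAE reconstructs normalized pixel patches predicted by the decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))                # 16 toy patches, 8 dims each
masked = rng.choice(16, size=12, replace=False)   # mask 75% of the patches

def mae_loss(recon, target, masked):
    """Mean per-patch squared L2 error, over masked patches only."""
    d = recon[masked] - target[masked]
    return (d ** 2).sum(axis=1).mean()

perfect = mae_loss(patches, patches, masked)      # perfect reconstruction -> 0
```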

BEiT (Bao et al., 2022):

  • Uses a discrete visual tokenizer (e.g., dVAE) to encode patches into discrete tokens
  • The pre-training objective becomes predicting the discrete tokens corresponding to masked patches
  • Analogous to BERT's MLM, but applied in the visual domain

Multimodal Pre-Training

1. Contrastive Multimodal Pre-Training

CLIP (Radford et al., 2021): Contrastive Language-Image Pretraining

Core idea: Align the representation spaces of images and text.

\[ \mathcal{L}_{\text{CLIP}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_j \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_j \exp(s_{ji}/\tau)} \right] \]

where \(s_{ij} = \text{sim}(f_{\text{image}}(I_i), f_{\text{text}}(T_j))\).

CLIP was trained on 400 million image-text pairs and achieved remarkable zero-shot visual classification capabilities.
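A NumPy sketch of the symmetric loss above (helper names are illustrative; the actual CLIP model also learns \(\tau\) and computes the similarities with separate image and text encoders):

```python
import numpy as np

def clip_loss(img, txt, tau=0.07):
    """Symmetric image->text and text->image InfoNCE over N aligned pairs."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    s = img @ txt.T / tau                      # s[i, j] = sim(I_i, T_j) / tau

    def diag_ce(logits):                       # -mean_i log softmax(logits)[i, i]
        z = logits - logits.max(axis=1, keepdims=True)
        lp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(lp))

    return diag_ce(s) + diag_ce(s.T)           # image->text + text->image

aligned = clip_loss(np.eye(4), np.eye(4))      # perfectly aligned toy pairs
```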

2. Generative Multimodal Pre-Training

Cross-modal alignment is learned by having the model generate content in one modality conditioned on another.

Representative method: CoCa (Contrastive Captioner), which combines both contrastive and generative losses.


Summary of Pre-Training Objectives

| Pre-Training Paradigm | Modality | Objective | Representative Methods | Characteristics |
|-----------------------|----------|-----------|------------------------|-----------------|
| Autoregressive LM | Language | Predict next token | GPT | Strong generation |
| Masked LM | Language | Predict masked tokens | BERT | Strong comprehension |
| Span Corruption | Language | Predict masked spans | T5 | Comprehension + Generation |
| Contrastive Learning | Vision | Positive/negative contrast | SimCLR, MoCo | Discriminative representations |
| Masked Image Modeling | Vision | Reconstruct masked patches | MAE, BEiT | Pixel/token-level reconstruction |
| Cross-modal Contrastive | Multimodal | Image-text alignment | CLIP | Zero-shot transfer |
| Cross-modal Generative | Multimodal | Cross-modal generation | CoCa | Comprehension + Generation |

Fine-Tuning Paradigms

After pre-training, the model needs to be adapted to downstream tasks. The main fine-tuning strategies include:

1. Full Fine-Tuning

Updates all model parameters. Yields the best performance but at the highest cost, and may cause catastrophic forgetting.

\[ \theta^* = \arg\min_\theta \mathcal{L}_{\text{task}}(\theta; \mathcal{D}_{\text{task}}) \]

2. Linear Probing

Freezes the pre-trained model and trains only a linear classification head. Used to evaluate the quality of pre-trained representations.

\[ \theta^*_{\text{head}} = \arg\min_{\theta_{\text{head}}} \mathcal{L}(W \cdot f_{\theta_{\text{frozen}}}(x) + b, y) \]

3. LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method proposed by Hu et al. (2022). Core idea: approximate weight updates using low-rank matrices.

\[ W' = W_0 + \Delta W = W_0 + BA \]

where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll \min(d, k)\).

Only \(A\) and \(B\) are trained, drastically reducing the number of trainable parameters (typically only 0.1%--1% of the original model).
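A minimal NumPy sketch of a LoRA-adapted linear layer, using the common initialization where \(B\) starts at zero so \(\Delta W = 0\) and training begins from the pre-trained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4
W0 = rng.normal(size=(d, k))          # frozen pre-trained weight
A = rng.normal(size=(r, k)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init => Delta W = 0 at start

def lora_forward(x):
    """y = (W0 + B A) x, computed without materializing Delta W."""
    return W0 @ x + B @ (A @ x)       # low-rank path adds only O(r(d + k)) work

x = rng.normal(size=k)
y = lora_forward(x)
trainable_fraction = (A.size + B.size) / W0.size   # 2rk / dk here
```

With these toy dimensions the trainable fraction is 12.5%; for real models with \(d, k\) in the thousands and small \(r\), the same formula yields the 0.1%--1% figure quoted above.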

4. Prompt Tuning

Prepends learnable soft prompt tokens to the input and trains only these prompt parameters.

\[ \hat{y} = f_{\theta_{\text{frozen}}}([\underbrace{p_1, p_2, \ldots, p_m}_{\text{learnable prompts}}; x_1, x_2, \ldots, x_n]) \]
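A sketch of the input construction (dimensions are arbitrary toy values; only `P` would receive gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 8, 16                    # prompt length, input length, embed dim
P = rng.normal(size=(m, d)) * 0.02    # learnable soft-prompt embeddings p_1..p_m
X = rng.normal(size=(n, d))           # frozen embeddings of the input x_1..x_n

# The concatenated sequence [p_1..p_m; x_1..x_n] is fed to the frozen model.
seq = np.concatenate([P, X], axis=0)
```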

5. Adapter

Inserts small trainable modules (typically down-project → nonlinearity → up-project) between Transformer layers, while freezing all other parameters.
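A toy sketch of one adapter block, using the common near-identity initialization (up-projection set to zero, so the adapter is a no-op before training):

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project -> ReLU -> up-project, plus residual."""
    return h + W_up @ np.maximum(W_down @ h, 0.0)

rng = np.random.default_rng(0)
d, r = 32, 4                          # hidden dim, bottleneck dim (r << d)
W_down = rng.normal(size=(r, d)) * 0.1
W_up = np.zeros((d, r))               # zero init: adapter starts as identity
h = rng.normal(size=d)
out = adapter(h, W_down, W_up)        # equals h until W_up is trained
```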

Comparison of Fine-Tuning Strategies

| Method | Trainable Parameters | Performance | Memory | Use Case |
|--------|----------------------|-------------|--------|----------|
| Full Fine-Tuning | 100% | Best | High | Sufficient data |
| Linear Probing | <0.01% | Weaker | Low | Representation evaluation |
| LoRA | 0.1%--1% | Near full fine-tuning | Low | Resource-constrained |
| Prompt Tuning | <0.1% | Moderate | Low | Multi-task switching |
| Adapter | 1%--5% | Good | Medium | Multi-task adaptation |

Conclusion

The core contribution of the pre-training paradigm is:

Decoupling "learning representations from data" and "adapting to specific tasks" into two independent stages.

This unlocks the full value of large-scale unlabeled data while dramatically reducing the need for labeled data in downstream tasks. The pre-training paradigm is the technical cornerstone of Foundation Model success.

