The Pre-Training Paradigm
Why Pre-Training Is Needed
Motivation from Transfer Learning
Traditional supervised learning faces a fundamental tension: high-quality labeled data is scarce and expensive, yet deep models require large amounts of data to learn good representations.
The core idea behind pre-training is:
First learn general-purpose representations from massive unlabeled data, then adapt to specific tasks using a small amount of labeled data.
The advantages of this approach include:
- Data efficiency: Downstream tasks can achieve strong performance with only a small amount of labeled data
- Representation quality: Features learned by pre-trained models have richer semantic structure than those from randomly initialized models
- Generalization: Pre-trained representations transfer well to a wide variety of downstream tasks
Formally, pre-training can be viewed as learning a good parameter initialization \(\theta_0\) from unlabeled data:

\[
\theta_0 = \arg\min_{\theta} \; \mathcal{L}_{\text{pretrain}}(\theta; \mathcal{D}_{\text{unlabeled}})
\]

The model is then fine-tuned on a downstream task, starting from \(\theta_0\):

\[
\theta^{*} = \arg\min_{\theta} \; \mathcal{L}_{\text{task}}(\theta; \mathcal{D}_{\text{labeled}}), \qquad \theta \ \text{initialized at}\ \theta_0
\]
Self-Supervised Learning: The Key to Pre-Training
Core Idea
The essence of Self-Supervised Learning (SSL) is:
Constructing supervisory signals from the data itself, without any manual annotation.
In practice, some transformation or masking is applied to the input data, and the model is trained to predict the hidden portion:
General framework of self-supervised learning:
Input x → Transform/Mask → Partial observation x̃
Model objective: Recover/predict the missing parts of x from x̃
Categories of Self-Supervised Signals
| Type | Methods | Prediction Target |
|---|---|---|
| Generative | GPT, MAE | Predict masked/future tokens |
| Contrastive | SimCLR, MoCo, CLIP | Pull positive pairs closer, push negative pairs apart |
| Predictive | BERT (MLM), BEiT | Predict masked tokens/patches |
Language Model Pre-Training
Language model pre-training is the most successful paradigm for Foundation Models. Based on the pre-training objective, it can be divided into three major categories.
1. Autoregressive Language Models (Autoregressive LM)
Representative models: GPT series
Autoregressive models predict the next token from left to right:

\[
\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})
\]
Architectural characteristics:
- Uses the Transformer Decoder architecture
- Causal attention mask: each position can only attend to previous tokens
- Naturally suited for text generation tasks
Input: [The] [cat] [sat] [on] [the]
Target: [cat] [sat] [on] [the] [mat]
← Predict next token step by step →
Strengths: Strong generation capability; supports in-context learning.
Weaknesses: Unidirectional encoding; cannot leverage bidirectional context.
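A minimal sketch of this objective in PyTorch (the tiny model, vocabulary size, and sequence length are illustrative assumptions, not GPT's actual configuration): targets are the input shifted left by one position, and the attention mask is causal.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Illustrative decoder-only LM: embedding + one causal self-attention block + LM head."""
    def __init__(self, vocab_size=100, d_model=32, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h = self.embed(tokens)
        seq_len = tokens.size(1)
        # Causal mask: position t may attend only to positions <= t.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h, _ = self.attn(h, h, h, attn_mask=causal_mask)
        return self.lm_head(h)                      # logits: (batch, seq_len, vocab)

tokens = torch.randint(0, 100, (2, 8))              # toy batch of token ids
model = TinyCausalLM()
logits = model(tokens)

# Next-token prediction: logits at position t are scored against the token at t+1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```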
2. Masked Language Models (Masked LM)
Representative model: BERT
Randomly masks 15% of the input tokens and trains the model to predict the masked tokens:

\[
\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_{\theta}(x_i \mid x_{\backslash \mathcal{M}})
\]

where \(\mathcal{M}\) is the set of masked positions and \(x_{\backslash \mathcal{M}}\) denotes the unmasked portion.
Architectural characteristics:
- Uses the Transformer Encoder architecture
- Bidirectional attention: each position can attend to all other positions
- Well-suited for Natural Language Understanding (NLU) tasks
Input: [The] [cat] [MASK] [on] [the] [MASK]
Target: Predict [MASK] → "sat", "mat"
Strengths: Bidirectional encoding; strong comprehension ability.
Weaknesses: Weak generation capability; mismatch between pre-training and downstream tasks (no [MASK] tokens at inference time).
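A minimal sketch of the masking step in PyTorch (the `[MASK]` id, vocabulary size, and the random logits standing in for an encoder are assumptions; real BERT also leaves some selected tokens unchanged or replaces them with random tokens): the loss is computed only at masked positions.

```python
import torch

MASK_ID = 103          # assumed [MASK] token id
VOCAB_SIZE = 30522

def mask_tokens(input_ids, mask_prob=0.15):
    """Replace ~15% of tokens with [MASK]; labels are -100 everywhere except masked positions."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob           # choose positions to corrupt
    labels[~mask] = -100                                      # ignored by cross_entropy
    corrupted = input_ids.clone()
    corrupted[mask] = MASK_ID
    return corrupted, labels

input_ids = torch.randint(1000, 2000, (2, 16))                # toy batch of token ids
corrupted, labels = mask_tokens(input_ids)

# `logits` would come from a bidirectional encoder; random values stand in here.
logits = torch.randn(2, 16, VOCAB_SIZE)
loss = torch.nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100
)
print(loss.item())
```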
3. Sequence-to-Sequence Models (Seq2Seq)
Representative models: T5, BART
The pre-training task is unified into a "text-to-text" format:
T5's Span Corruption objective: randomly masks contiguous spans in the input and generates the masked content on the decoder side.
Input (Encoder): "The <X> sat on <Y> mat"
Output (Decoder): "<X> cat <Y> the"
Architectural characteristics:
- Encoder-Decoder architecture
- Combines both comprehension and generation capabilities
- Unified text-to-text framework
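A minimal sketch of span-corruption preprocessing on whitespace-separated words, reproducing the example above (real T5 works on subword ids, samples span positions randomly, and appends a final sentinel to the target; the helper below is a simplified, hypothetical version):

```python
def span_corrupt(words, spans):
    """Replace each (start, length) span with a sentinel in the input;
    the target lists each sentinel followed by the words it replaced."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans))]
    inp, tgt, cursor = [], [], 0
    for sentinel, (start, length) in zip(sentinels, spans):
        inp += words[cursor:start] + [sentinel]
        tgt += [sentinel] + words[start:start + length]
        cursor = start + length
    inp += words[cursor:]
    return " ".join(inp), " ".join(tgt)

words = "The cat sat on the mat".split()
# Corrupt "cat" and the second "the" (spans given as (start index, length)).
print(span_corrupt(words, [(1, 1), (4, 1)]))
# -> ('The <extra_id_0> sat on <extra_id_1> mat', '<extra_id_0> cat <extra_id_1> the')
```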
Comparison of Language Pre-Training Objectives
| Method | Architecture | Attention | Suitable Tasks | Representative Models |
|---|---|---|---|---|
| Autoregressive | Decoder-only | Causal | Generation, Dialogue | GPT, LLaMA |
| Masked LM | Encoder-only | Bidirectional | Classification, NER | BERT, RoBERTa |
| Seq2Seq | Encoder-Decoder | Mixed | Translation, Summarization | T5, BART |
Visual Pre-Training
1. Contrastive Learning
Core idea: Pull together the representations of different augmented views of the same image (positive pairs) while pushing apart representations of different images (negative pairs).
InfoNCE Loss:

\[
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\big(\text{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\text{sim}(z_i, z_k)/\tau\big)}
\]

where \((z_i, z_j)\) is a positive pair, \(\tau\) is the temperature parameter, and \(\text{sim}(\cdot, \cdot)\) is typically cosine similarity.
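A minimal sketch of this loss in PyTorch using in-batch negatives (batch size, embedding dimension, and the one-directional simplification are assumptions; SimCLR's full NT-Xent additionally contrasts each view against all other 2N-2 in-batch views):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Simplified InfoNCE: for anchor z1[i], the positive is z2[i]
    and the negatives are z2[j] for j != i."""
    z1 = F.normalize(z1, dim=-1)          # cosine similarity = dot product of unit vectors
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau              # (N, N) similarity matrix scaled by temperature
    targets = torch.arange(z1.size(0))    # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 images, two augmented views each, 128-d projections.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z1, z2).item())
```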
SimCLR (Chen et al., 2020):
- Applies two different data augmentations to the same image to form a positive pair
- Relies on large batch sizes to provide sufficient negative samples
- Architecture: Encoder → Projection Head → Contrastive Loss
MoCo (He et al., 2020):
- Introduces a momentum encoder and a dynamic queue
- Removes SimCLR's dependency on large batch sizes
- Momentum update: \(\theta_k \leftarrow m \theta_k + (1 - m) \theta_q\), where \(m = 0.999\)
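A minimal sketch of the momentum update over two PyTorch modules (the linear encoders are arbitrary placeholders for the query and key encoders):

```python
import torch
import torch.nn as nn

encoder_q = nn.Linear(32, 16)                        # query encoder, trained by backprop
encoder_k = nn.Linear(32, 16)                        # key encoder, updated only by momentum
encoder_k.load_state_dict(encoder_q.state_dict())    # start from identical weights
for p in encoder_k.parameters():
    p.requires_grad = False                          # the optimizer never touches the key encoder

@torch.no_grad()
def momentum_update(m=0.999):
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)  # θ_k ← m·θ_k + (1 - m)·θ_q

momentum_update()
```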
Contrastive learning framework:
Image → Augment1 → Encoder → z_1 ─┐
├→ Pull closer (positive pair)
Image → Augment2 → Encoder → z_2 ─┘
Other Images → Encoder → z_neg ───→ Push apart (negative pairs)
2. Masked Image Modeling
MAE (He et al., 2022): Masked Autoencoder
- Randomly masks 75% of image patches
- The encoder only processes visible patches (saving computation)
- The decoder reconstructs the masked patches
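A minimal sketch of MAE-style random masking (patch count and embedding size are illustrative; the encoder and decoder are omitted): only the visible 25% of patches would be fed to the encoder.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim). Keep a random 25% subset per image."""
    batch, num_patches, _ = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch, num_patches)               # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]         # indices of the visible patches
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    )
    return visible, keep_idx                               # the encoder sees only `visible`

patches = torch.randn(2, 196, 768)                         # e.g. 14x14 patches from a ViT
visible, keep_idx = random_masking(patches)
print(visible.shape)                                       # torch.Size([2, 49, 768])
```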
BEiT (Bao et al., 2022):
- Uses a discrete visual tokenizer (e.g., dVAE) to encode patches into discrete tokens
- The pre-training objective becomes predicting the discrete tokens corresponding to masked patches
- Analogous to BERT's MLM, but applied in the visual domain
Multimodal Pre-Training
1. Contrastive Multimodal Pre-Training
CLIP (Radford et al., 2021): Contrastive Language-Image Pretraining
Core idea: Align the representation spaces of images and text.
For a batch of \(N\) image-text pairs, the training objective is a symmetric contrastive loss:

\[
\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right]
\]

where \(s_{ij} = \text{sim}(f_{\text{image}}(I_i), f_{\text{text}}(T_j))\) and \(\tau\) is a learnable temperature.
CLIP was trained on 400 million image-text pairs and achieved remarkable zero-shot visual classification capabilities.
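A minimal sketch of this symmetric objective in PyTorch (random features stand in for the image and text encoders, and the learnable temperature is shown as a fixed value):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.
    Matching pairs (i, i) are the positives in both directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / tau                 # s_ij / tau, shape (N, N)
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)         # text -> image direction
    return (loss_i2t + loss_t2i) / 2

image_emb, text_emb = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(image_emb, text_emb).item())
```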
2. Generative Multimodal Pre-Training
Cross-modal alignment is learned by having the model generate content in one modality conditioned on another.
Representative method: CoCa (Contrastive Captioner), which combines both contrastive and generative losses.
Summary of Pre-Training Objectives
| Pre-Training Paradigm | Modality | Objective | Representative Methods | Characteristics |
|---|---|---|---|---|
| Autoregressive LM | Language | Predict next token | GPT | Strong generation |
| Masked LM | Language | Predict masked tokens | BERT | Strong comprehension |
| Span Corruption | Language | Predict masked spans | T5 | Comprehension + Generation |
| Contrastive Learning | Vision | Positive/negative contrast | SimCLR, MoCo | Discriminative representations |
| Masked Image Modeling | Vision | Reconstruct masked patches | MAE, BEiT | Pixel/token-level reconstruction |
| Cross-modal Contrastive | Multimodal | Image-text alignment | CLIP | Zero-shot transfer |
| Cross-modal Generative | Multimodal | Cross-modal generation | CoCa | Comprehension + Generation |
Fine-Tuning Paradigms
After pre-training, the model needs to be adapted to downstream tasks. The main fine-tuning strategies include:
1. Full Fine-Tuning
Updates all model parameters. Yields the best performance but at the highest cost, and may cause catastrophic forgetting.
2. Linear Probing
Freezes the pre-trained model and trains only a linear classification head. Used to evaluate the quality of pre-trained representations.
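A minimal sketch of linear probing with a torchvision backbone (the ResNet-50 backbone and 10-class head are illustrative choices, not tied to any method above):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)    # pre-trained feature extractor
for p in backbone.parameters():
    p.requires_grad = False                              # freeze all pre-trained weights
backbone.fc = nn.Linear(backbone.fc.in_features, 10)     # fresh trainable linear head

trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)           # only the head receives updates
```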
3. LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method proposed by Hu et al. (2022). Core idea: approximate weight updates using low-rank matrices.
\[
W' = W + \Delta W = W + BA
\]

where \(W \in \mathbb{R}^{d \times k}\) is a frozen pre-trained weight matrix, \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll \min(d, k)\).
Only \(A\) and \(B\) are trained, drastically reducing the number of trainable parameters (typically only 0.1%--1% of the original model).
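A minimal sketch of a LoRA-augmented linear layer (the rank, the \(\alpha/r\) scaling, and zero-initializing \(B\) follow common practice; this is not the implementation of any specific library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus trainable low-rank update B @ A."""
    def __init__(self, d_out, d_in, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False              # W is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: (r, k)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: (d, r), zero-init so ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_out=64, d_in=128)
out = layer(torch.randn(4, 128))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)      # only A and B are trainable
```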
4. Prompt Tuning
Prepends learnable soft prompt tokens to the input and trains only these prompt parameters.
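A minimal sketch of soft prompt tuning (prompt length, hidden size, and the stand-in token embeddings are assumptions):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to the (frozen) token embeddings."""
    def __init__(self, prompt_len=20, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeds):                       # (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)    # prompt tokens come first

token_embeds = torch.randn(2, 16, 768)                     # from a frozen embedding layer
extended = SoftPrompt()(token_embeds)
print(extended.shape)                                      # torch.Size([2, 36, 768])
```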
5. Adapter
Inserts small trainable modules (typically down-project → nonlinearity → up-project) into each Transformer layer, usually after the attention and feed-forward sublayers, while freezing all other parameters.
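A minimal sketch of such an adapter block (the bottleneck width and the residual connection are the usual design choices; placement inside the layer is not shown):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual keeps the frozen path intact

hidden = torch.randn(2, 16, 768)                        # output of a frozen sublayer
print(Adapter()(hidden).shape)
```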
Comparison of Fine-Tuning Strategies
| Method | Trainable Parameters | Performance | Memory | Use Case |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | Best | High | Sufficient data |
| Linear Probing | <0.01% | Weaker | Low | Representation evaluation |
| LoRA | 0.1%--1% | Near full fine-tuning | Low | Resource-constrained |
| Prompt Tuning | <0.1% | Moderate | Low | Multi-task switching |
| Adapter | 1%--5% | Good | Medium | Multi-task adaptation |
Conclusion
The core contribution of the pre-training paradigm is:
Decoupling "learning representations from data" and "adapting to specific tasks" into two independent stages.
This unlocks the full value of large-scale unlabeled data while dramatically reducing the need for labeled data in downstream tasks. The pre-training paradigm is the technical cornerstone of Foundation Model success.