The Pre-Training Paradigm
Why Pre-Training Is Needed
Motivation from Transfer Learning
Traditional supervised learning faces a fundamental tension: high-quality labeled data is scarce and expensive, yet deep models require large amounts of data to learn good representations.
The core idea behind pre-training is:
First learn general-purpose representations from massive unlabeled data, then adapt to specific tasks using a small amount of labeled data.
The advantages of this approach include:
- Data efficiency: Downstream tasks can achieve strong performance with only a small amount of labeled data
- Representation quality: Features learned by pre-trained models have richer semantic structure than those from randomly initialized models
- Generalization: Pre-trained representations transfer well to a wide variety of downstream tasks
Formally, pre-training can be viewed as learning a good parameter initialization \(\theta_0\) from unlabeled data:

\[
\theta_0 = \arg\min_{\theta} \; \mathcal{L}_{\text{pretrain}}(\theta; \mathcal{D}_{\text{unlabeled}})
\]

The model is then fine-tuned on a downstream task, starting from \(\theta_0\):

\[
\theta^{*} = \arg\min_{\theta} \; \mathcal{L}_{\text{task}}(\theta; \mathcal{D}_{\text{labeled}}), \qquad \theta \ \text{initialized at}\ \theta_0
\]
Self-Supervised Learning: The Key to Pre-Training
Core Idea
The essence of Self-Supervised Learning (SSL) is:
Constructing supervisory signals from the data itself, without any manual annotation.
In practice, some transformation or masking is applied to the input data, and the model is trained to predict the hidden portion:
General framework of self-supervised learning:
Input x → Transform/Mask → Partial observation x̃
Model objective: Recover/predict the missing parts of x from x̃
Categories of Self-Supervised Signals
| Type | Methods | Prediction Target |
|---|---|---|
| Generative | GPT, MAE | Predict masked/future tokens |
| Contrastive | SimCLR, MoCo, CLIP | Pull positive pairs closer, push negative pairs apart |
| Predictive | BERT (MLM), BEiT | Predict masked tokens/patches |
Language Model Pre-Training
Language model pre-training is the most successful paradigm for Foundation Models. Based on the pre-training objective, it can be divided into three major categories.
1. Autoregressive Language Models (Autoregressive LM)
Representative models: GPT series
Autoregressive models predict the next token from left to right:

\[
\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})
\]
Architectural characteristics:
- Uses the Transformer Decoder architecture
- Causal attention mask: each position can only attend to previous tokens
- Naturally suited for text generation tasks
Input: [The] [cat] [sat] [on] [the]
Target: [cat] [sat] [on] [the] [mat]
← Predict next token step by step →
Strengths: Strong generation capability; supports in-context learning.
Weaknesses: Unidirectional encoding; cannot leverage bidirectional context.
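A minimal sketch of this objective in PyTorch (the tiny model, vocabulary size, and sequence length are illustrative assumptions, not GPT's actual configuration): targets are the input shifted left by one position, and the attention mask is causal.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Illustrative decoder-only LM: embedding + one causal self-attention block + LM head."""
    def __init__(self, vocab_size=100, d_model=32, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        h = self.embed(tokens)
        seq_len = tokens.size(1)
        # Causal mask: position t may attend only to positions <= t.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h, _ = self.attn(h, h, h, attn_mask=causal_mask)
        return self.lm_head(h)                      # logits: (batch, seq_len, vocab)

tokens = torch.randint(0, 100, (2, 8))              # toy batch of token ids
model = TinyCausalLM()
logits = model(tokens)

# Next-token prediction: logits at position t are scored against the token at t+1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```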
2. Masked Language Models (Masked LM)
Representative model: BERT
Randomly masks 15% of the input tokens and trains the model to predict the masked tokens:

\[
\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_{\theta}(x_i \mid x_{\backslash \mathcal{M}})
\]

where \(\mathcal{M}\) is the set of masked positions and \(x_{\backslash \mathcal{M}}\) denotes the unmasked portion.
Architectural characteristics:
- Uses the Transformer Encoder architecture
- Bidirectional attention: each position can attend to all other positions
- Well-suited for Natural Language Understanding (NLU) tasks
Input: [The] [cat] [MASK] [on] [the] [MASK]
Target: Predict [MASK] → "sat", "mat"
Strengths: Bidirectional encoding; strong comprehension ability.
Weaknesses: Weak generation capability; mismatch between pre-training and downstream tasks (no [MASK] tokens at inference time).
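A minimal sketch of the masking step in PyTorch (the `[MASK]` id, vocabulary size, and the random logits standing in for an encoder are assumptions; real BERT also leaves some selected tokens unchanged or replaces them with random tokens): the loss is computed only at masked positions.

```python
import torch

MASK_ID = 103          # assumed [MASK] token id
VOCAB_SIZE = 30522

def mask_tokens(input_ids, mask_prob=0.15):
    """Replace ~15% of tokens with [MASK]; labels are -100 everywhere except masked positions."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob           # choose positions to corrupt
    labels[~mask] = -100                                      # ignored by cross_entropy
    corrupted = input_ids.clone()
    corrupted[mask] = MASK_ID
    return corrupted, labels

input_ids = torch.randint(1000, 2000, (2, 16))                # toy batch of token ids
corrupted, labels = mask_tokens(input_ids)

# `logits` would come from a bidirectional encoder; random values stand in here.
logits = torch.randn(2, 16, VOCAB_SIZE)
loss = torch.nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100
)
print(loss.item())
```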
3. Sequence-to-Sequence Models (Seq2Seq)
Representative models: T5, BART
The pre-training task is unified into a "text-to-text" format:
T5's Span Corruption objective: randomly masks contiguous spans in the input and generates the masked content on the decoder side.
Input (Encoder): "The <X> sat on <Y> mat"
Output (Decoder): "<X> cat <Y> the"
Architectural characteristics:
- Encoder-Decoder architecture
- Combines both comprehension and generation capabilities
- Unified text-to-text framework
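A minimal sketch of span-corruption preprocessing on whitespace-separated words, reproducing the example above (real T5 works on subword ids, samples span positions randomly, and appends a final sentinel to the target; the helper below is a simplified, hypothetical version):

```python
def span_corrupt(words, spans):
    """Replace each (start, length) span with a sentinel in the input;
    the target lists each sentinel followed by the words it replaced."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans))]
    inp, tgt, cursor = [], [], 0
    for sentinel, (start, length) in zip(sentinels, spans):
        inp += words[cursor:start] + [sentinel]
        tgt += [sentinel] + words[start:start + length]
        cursor = start + length
    inp += words[cursor:]
    return " ".join(inp), " ".join(tgt)

words = "The cat sat on the mat".split()
# Corrupt "cat" and the second "the" (spans given as (start index, length)).
print(span_corrupt(words, [(1, 1), (4, 1)]))
# -> ('The <extra_id_0> sat on <extra_id_1> mat', '<extra_id_0> cat <extra_id_1> the')
```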
Comparison of Language Pre-Training Objectives
| Method | Architecture | Attention | Suitable Tasks | Representative Models |
|---|---|---|---|---|
| Autoregressive | Decoder-only | Causal | Generation, Dialogue | GPT, LLaMA |
| Masked LM | Encoder-only | Bidirectional | Classification, NER | BERT, RoBERTa |
| Seq2Seq | Encoder-Decoder | Mixed | Translation, Summarization | T5, BART |
Visual Pre-Training
1. Contrastive Learning
Core idea: Pull together the representations of different augmented views of the same image (positive pairs) while pushing apart representations of different images (negative pairs).
InfoNCE Loss:

\[
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\big(\text{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\text{sim}(z_i, z_k)/\tau\big)}
\]

where \((z_i, z_j)\) is a positive pair, \(\tau\) is the temperature parameter, and \(\text{sim}(\cdot, \cdot)\) is typically cosine similarity.
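A minimal sketch of this loss in PyTorch using in-batch negatives (batch size, embedding dimension, and the one-directional simplification are assumptions; SimCLR's full NT-Xent additionally contrasts each view against all other 2N-2 in-batch views):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Simplified InfoNCE: for anchor z1[i], the positive is z2[i]
    and the negatives are z2[j] for j != i."""
    z1 = F.normalize(z1, dim=-1)          # cosine similarity = dot product of unit vectors
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau              # (N, N) similarity matrix scaled by temperature
    targets = torch.arange(z1.size(0))    # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 images, two augmented views each, 128-d projections.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z1, z2).item())
```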
SimCLR (Chen et al., 2020):
- Applies two different data augmentations to the same image to form a positive pair
- Relies on large batch sizes to provide sufficient negative samples
- Architecture: Encoder → Projection Head → Contrastive Loss
MoCo (He et al., 2020):
- Introduces a momentum encoder and a dynamic queue
- Removes SimCLR's dependency on large batch sizes
- Momentum update: \(\theta_k \leftarrow m \theta_k + (1 - m) \theta_q\), where \(m = 0.999\)
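A minimal sketch of the momentum update over two PyTorch modules (the linear encoders are arbitrary placeholders for the query and key encoders):

```python
import torch
import torch.nn as nn

encoder_q = nn.Linear(32, 16)                        # query encoder, trained by backprop
encoder_k = nn.Linear(32, 16)                        # key encoder, updated only by momentum
encoder_k.load_state_dict(encoder_q.state_dict())    # start from identical weights
for p in encoder_k.parameters():
    p.requires_grad = False                          # the optimizer never touches the key encoder

@torch.no_grad()
def momentum_update(m=0.999):
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)  # θ_k ← m·θ_k + (1 - m)·θ_q

momentum_update()
```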
Contrastive learning framework:
Image → Augment1 → Encoder → z_1 ─┐
├→ Pull closer (positive pair)
Image → Augment2 → Encoder → z_2 ─┘
Other Images → Encoder → z_neg ───→ Push apart (negative pairs)
2. Masked Image Modeling
MAE (He et al., 2022): Masked Autoencoder
- Randomly masks 75% of image patches
- The encoder only processes visible patches (saving computation)
- The decoder reconstructs the masked patches
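A minimal sketch of MAE-style random masking (patch count and embedding size are illustrative; the encoder and decoder are omitted): only the visible 25% of patches would be fed to the encoder.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim). Keep a random 25% subset per image."""
    batch, num_patches, _ = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch, num_patches)               # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]         # indices of the visible patches
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    )
    return visible, keep_idx                               # the encoder sees only `visible`

patches = torch.randn(2, 196, 768)                         # e.g. 14x14 patches from a ViT
visible, keep_idx = random_masking(patches)
print(visible.shape)                                       # torch.Size([2, 49, 768])
```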
BEiT (Bao et al., 2022):
- Uses a discrete visual tokenizer (e.g., dVAE) to encode patches into discrete tokens
- The pre-training objective becomes predicting the discrete tokens corresponding to masked patches
- Analogous to BERT's MLM, but applied in the visual domain
Multimodal Pre-Training
1. Contrastive Multimodal Pre-Training
CLIP (Radford et al., 2021): Contrastive Language-Image Pretraining
Core idea: Align the representation spaces of images and text.
For a batch of \(N\) image-text pairs, the training objective is a symmetric contrastive loss:

\[
\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right]
\]

where \(s_{ij} = \text{sim}(f_{\text{image}}(I_i), f_{\text{text}}(T_j))\) and \(\tau\) is a learnable temperature.
CLIP was trained on 400 million image-text pairs and achieved remarkable zero-shot visual classification capabilities.
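A minimal sketch of this symmetric objective in PyTorch (random features stand in for the image and text encoders, and the learnable temperature is shown as a fixed value):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, tau=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.
    Matching pairs (i, i) are the positives in both directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / tau                 # s_ij / tau, shape (N, N)
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)         # text -> image direction
    return (loss_i2t + loss_t2i) / 2

image_emb, text_emb = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(image_emb, text_emb).item())
```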
2. Generative Multimodal Pre-Training
Cross-modal alignment is learned by having the model generate content in one modality conditioned on another.
Representative method: CoCa (Contrastive Captioner), which combines both contrastive and generative losses.
Summary of Pre-Training Objectives
| Pre-Training Paradigm | Modality | Objective | Representative Methods | Characteristics |
|---|---|---|---|---|
| Autoregressive LM | Language | Predict next token | GPT | Strong generation |
| Masked LM | Language | Predict masked tokens | BERT | Strong comprehension |
| Span Corruption | Language | Predict masked spans | T5 | Comprehension + Generation |
| Contrastive Learning | Vision | Positive/negative contrast | SimCLR, MoCo | Discriminative representations |
| Masked Image Modeling | Vision | Reconstruct masked patches | MAE, BEiT | Pixel/token-level reconstruction |
| Cross-modal Contrastive | Multimodal | Image-text alignment | CLIP | Zero-shot transfer |
| Cross-modal Generative | Multimodal | Cross-modal generation | CoCa | Comprehension + Generation |
Fine-Tuning Paradigms
After pre-training, the model needs to be adapted to downstream tasks. The main fine-tuning strategies include:
1. Full Fine-Tuning
Updates all model parameters. Yields the best performance but at the highest cost, and may cause catastrophic forgetting.
2. Linear Probing
Freezes the pre-trained model and trains only a linear classification head. Used to evaluate the quality of pre-trained representations.
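A minimal sketch of linear probing with a torchvision backbone (the ResNet-50 backbone and 10-class head are illustrative choices, not tied to any method above):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)    # pre-trained feature extractor
for p in backbone.parameters():
    p.requires_grad = False                              # freeze all pre-trained weights
backbone.fc = nn.Linear(backbone.fc.in_features, 10)     # fresh trainable linear head

trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)           # only the head receives updates
```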
3. LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method proposed by Hu et al. (2022). Core idea: approximate weight updates using low-rank matrices.
\[
W' = W + \Delta W = W + BA
\]

where \(W \in \mathbb{R}^{d \times k}\) is a frozen pre-trained weight matrix, \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll \min(d, k)\).
Only \(A\) and \(B\) are trained, drastically reducing the number of trainable parameters (typically only 0.1%--1% of the original model).
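A minimal sketch of a LoRA-augmented linear layer (the rank, the \(\alpha/r\) scaling, and zero-initializing \(B\) follow common practice; this is not the implementation of any specific library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus trainable low-rank update B @ A."""
    def __init__(self, d_out, d_in, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False              # W is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: (r, k)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: (d, r), zero-init so ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_out=64, d_in=128)
out = layer(torch.randn(4, 128))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)      # only A and B are trainable
```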
4. Prompt Tuning
Prepends learnable soft prompt tokens to the input and trains only these prompt parameters.
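A minimal sketch of soft prompt tuning (prompt length, hidden size, and the stand-in token embeddings are assumptions):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to the (frozen) token embeddings."""
    def __init__(self, prompt_len=20, d_model=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeds):                       # (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)    # prompt tokens come first

token_embeds = torch.randn(2, 16, 768)                     # from a frozen embedding layer
extended = SoftPrompt()(token_embeds)
print(extended.shape)                                      # torch.Size([2, 36, 768])
```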
5. Adapter
Inserts small trainable modules (typically down-project → nonlinearity → up-project) into each Transformer layer, usually after the attention and feed-forward sublayers, while freezing all other parameters.
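A minimal sketch of such an adapter block (the bottleneck width and the residual connection are the usual design choices; placement inside the layer is not shown):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual keeps the frozen path intact

hidden = torch.randn(2, 16, 768)                        # output of a frozen sublayer
print(Adapter()(hidden).shape)
```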
Comparison of Fine-Tuning Strategies
| Method | Trainable Parameters | Performance | Memory | Use Case |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | Best | High | Sufficient data |
| Linear Probing | <0.01% | Weaker | Low | Representation evaluation |
| LoRA | 0.1%--1% | Near full fine-tuning | Low | Resource-constrained |
| Prompt Tuning | <0.1% | Moderate | Low | Multi-task switching |
| Adapter | 1%--5% | Good | Medium | Multi-task adaptation |
Conclusion
The core contribution of the pre-training paradigm is:
Decoupling "learning representations from data" and "adapting to specific tasks" into two independent stages.
This unlocks the full value of large-scale unlabeled data while dramatically reducing the need for labeled data in downstream tasks. The pre-training paradigm is the technical cornerstone of Foundation Model success.