Visual Self-Supervised Learning
Overview
Visual Self-Supervised Learning learns powerful visual representations from unlabeled images and serves as the core pretraining paradigm for computer vision foundation models. It is broadly categorized into three schools: contrastive learning, masked image modeling, and self-distillation.
graph TD
A[Visual Self-Supervised Learning] --> B[Contrastive Learning]
A --> C[Masked Image Modeling]
A --> D[Self-Distillation]
B --> B1[SimCLR]
B --> B2[MoCo v1/v2/v3]
B --> B3[BYOL]
B --> B4[SwAV]
B --> B5[Barlow Twins]
C --> C1[MAE]
C --> C2[BEiT]
C --> C3[SimMIM]
C --> C4[I-JEPA]
D --> D1[DINO]
D --> D2[DINOv2]
D --> D3[EMA Teacher]
1. Contrastive Learning
1.1 Core Idea
The goal of contrastive learning is to pull positive pairs closer and push negative pairs apart:
- Positive pairs: Different augmented views of the same image
- Negative pairs: Views from different images
1.2 SimCLR
SimCLR (Chen et al., 2020) is the classic contrastive learning framework.
Pipeline:
- Apply two random augmentations to each image \(x\), yielding \(\tilde{x}_i\) and \(\tilde{x}_j\)
- Encoder \(f(\cdot)\) (e.g., ResNet) extracts representations
- Projection head \(g(\cdot)\) (MLP) maps to contrastive space
- Compute NT-Xent loss in the projection space
NT-Xent Loss (Normalized Temperature-scaled Cross Entropy), for a positive pair \((i, j)\) among the \(2N\) augmented views:

\[
\ell_{i,j} = -\log \frac{\exp\left(\text{sim}(z_i, z_j) / \tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\text{sim}(z_i, z_k) / \tau\right)}
\]
where:
- \(z_i = g(f(\tilde{x}_i))\) is the projected representation
- \(\text{sim}(u, v) = \frac{u^\top v}{\|u\| \|v\|}\) is cosine similarity
- \(\tau\) is the temperature hyperparameter
- \(N\) is the number of images in the batch
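The following is a minimal PyTorch sketch of this loss for a batch where rows \(2k\) and \(2k+1\) of `z` are the two views of image \(k\); the function name and layout are illustrative, not SimCLR's reference code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent over 2N projected views; rows (2k, 2k+1) are the two views of image k."""
    z = F.normalize(z, dim=1)                    # unit vectors, so dot products are cosine similarities
    sim = z @ z.t() / tau                        # (2N, 2N) temperature-scaled similarity matrix
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity from the denominator
    pos = torch.arange(z.shape[0], device=z.device) ^ 1   # positive partner: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, pos)             # -log softmax probability of the positive
```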
Key Findings:
- The composition of data augmentations is crucial (random crop + color distortion is most effective)
- A nonlinear projection head \(g\) improves representation quality, even though downstream tasks use the output of \(f\) and discard \(g\)
- Larger batch sizes yield better results (SimCLR uses 4096+)
1.3 MoCo (Momentum Contrast)
Motivation: SimCLR needs very large batches to supply enough negatives. MoCo decouples the number of negatives from the batch size by pairing a momentum-updated key encoder with a queue of past keys.
MoCo v1 (He et al., 2020):
- Momentum encoder: \(\theta_k \leftarrow m \theta_k + (1-m) \theta_q\), \(m=0.999\)
- Queue: Maintains a large negative sample queue (65,536 entries)
- Query encoder updated by gradient; key encoder updated by momentum
MoCo v2: Adds MLP projection head + stronger augmentations
MoCo v3: Drops the queue and contrasts against in-batch negatives (similar to SimCLR), keeps the momentum encoder, and moves to ViT backbones
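Below is a minimal sketch of the two MoCo ingredients above: the momentum update of the key encoder and the InfoNCE loss against a queue of past keys. `encoder_q`, `encoder_k`, and `queue` (a K×d buffer of normalized keys) are assumed, illustrative names, not MoCo's reference code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; the key encoder receives no gradients
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_loss(q: torch.Tensor, k: torch.Tensor, queue: torch.Tensor, tau: float = 0.07):
    """InfoNCE: one positive key per query, the whole queue as negatives."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)     # (B, 1) logit for the positive pair
    l_neg = q @ queue.t()                        # (B, K) logits against queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)  # positive is at index 0
    return F.cross_entropy(logits, labels)
```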
1.4 BYOL (Bootstrap Your Own Latent)
Breakthrough: Self-supervised learning without negative samples!
Architecture:
- Online network: Encoder + projector + predictor
- Target network: Encoder + projector (EMA update)
Loss Function (mean squared error between the normalized online prediction and the target projection):

\[
\mathcal{L} = \left\| \bar{q} - \bar{z}' \right\|_2^2 = 2 - 2 \cdot \frac{\langle q, z' \rangle}{\|q\|_2 \, \|z'\|_2}
\]

where \(\bar{q}\) and \(\bar{z}'\) are the \(L_2\)-normalized predictor output and target projection; in practice the loss is symmetrized by swapping the two views and summing.
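A minimal sketch of one direction of this loss, assuming `q` is the online predictor output and `z_target` the target projector output (names are illustrative):

```python
import torch.nn.functional as F

def byol_loss(q, z_target):
    """2 - 2 * cosine similarity between the L2-normalized prediction and target."""
    q = F.normalize(q, dim=1)
    z_target = F.normalize(z_target.detach(), dim=1)   # stop-gradient: no gradient into the target network
    return (2 - 2 * (q * z_target).sum(dim=1)).mean()
```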
Why doesn't it collapse?
- Asymmetry of the predictor
- The EMA-updated target network provides slowly changing targets
- Implicit regularization from batch normalization
1.5 Other Contrastive Methods
| Method | Key Innovation | Requires Negatives? |
|---|---|---|
| SimCLR | Simple framework, strong augmentations | Yes (large batch) |
| MoCo | Momentum queue | Yes (queue) |
| BYOL | Predictor + EMA | No |
| SwAV | Online clustering | No (clustering substitute) |
| Barlow Twins | Cross-correlation matrix → identity | No |
| VICReg | Variance-Invariance-Covariance regularization | No |
2. Masked Image Modeling
2.1 Core Idea
Inspired by masked language modeling in NLP (BERT), randomly mask parts of an image and have the model predict the masked content.
2.2 MAE (Masked Autoencoders)
MAE (He et al., 2022) is a milestone in masked image modeling.
Design:
- Divide the image into non-overlapping \(16 \times 16\)-pixel patches
- Randomly mask 75% of patches (extremely high masking ratio)
- Encoder (ViT) processes only visible patches
- Lightweight decoder reconstructs masked pixels
Loss Function: MSE on the masked patches:

\[
\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{x}_i - x_i \right\|_2^2
\]

where \(\mathcal{M}\) is the set of masked patches, \(x_i\) is the original patch, and \(\hat{x}_i\) its reconstruction.
Key Design Choices:
- A 75% masking ratio, far above BERT's 15%, works because images are much more spatially redundant than text
- Encoder does not process mask tokens → significant compute savings
- Decoder is lightweight (a narrow Transformer, 8 blocks of width 512 by default, far smaller than the encoder) and discarded after pretraining → the encoder must carry the semantics
2.3 BEiT
BEiT (Bao et al., 2022): Predicts discrete visual tokens instead of raw pixels.
Two-Stage Pipeline:
- Pretrain a dVAE (discrete VAE) as visual tokenizer
- Mask patches → predict corresponding dVAE tokens
BEiT v2: Replaces dVAE with VQ-KD (Vector-Quantized Knowledge Distillation)
2.4 Other Masked Modeling Methods
| Method | Prediction Target | Encoder |
|---|---|---|
| MAE | Raw pixels | ViT |
| BEiT | dVAE tokens | ViT |
| SimMIM | Raw pixels (simple head) | Swin Transformer |
| I-JEPA | Abstract representations (not pixels) | ViT |
| data2vec | Teacher model representations | ViT (same recipe also applies to speech and text) |
2.5 I-JEPA
I-JEPA (Assran et al., 2023) predicts abstract representations rather than pixels or tokens:
- Does not use data augmentation (unlike contrastive learning)
- Predicts target region representations in representation space
- Avoids the low-level bias of pixel-level prediction
3. Self-Distillation
3.1 DINO
DINO (Caron et al., 2021): self-DIstillation with NO labels, applied to Vision Transformers.
Architecture: Student-teacher framework
- Student network: Updated by gradient
- Teacher network: EMA update \(\theta_t \leftarrow \lambda \theta_t + (1-\lambda) \theta_s\)
Multi-Crop Strategy:
- 2 global views (224×224) → fed to both the teacher and the student
- N local views (96×96) → fed to the student only
- The student's outputs for all views are trained to match the teacher's outputs for the global views
Loss Function (cross-entropy between the teacher and student output distributions):

\[
\mathcal{L} = - \sum_{x \in \{x^g_1,\, x^g_2\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} P_t(x)^\top \log P_s(x')
\]

where \(V\) is the set of all crops; the teacher sees only the global views, the student sees every view.
Centering and sharpening prevent collapse: \(P_t(x) = \text{softmax}\big((g_t(x) - c) / \tau_t\big)\), where \(c\) is an EMA of the teacher's outputs and the low teacher temperature \(\tau_t\) sharpens the distribution.
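A minimal sketch of this objective for one (student view, teacher global view) pair, plus the EMA update of the center \(c\); the temperatures and momentum are typical values and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s: float = 0.1, tau_t: float = 0.04):
    """Cross-entropy H(P_t, P_s): the teacher is centered and sharpened, the student stays soft."""
    p_t = F.softmax((teacher_out.detach() - center) / tau_t, dim=1)   # stop-gradient through the teacher
    log_p_s = F.log_softmax(student_out / tau_s, dim=1)
    return -(p_t * log_p_s).sum(dim=1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum: float = 0.9):
    """EMA of teacher outputs; subtracting it keeps any one dimension from dominating."""
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)
```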
Emergent Properties of DINO:
- Attention maps automatically segment foreground objects
- Learns semantic segmentation in an unsupervised manner
- k-NN classifier achieves strong results
3.2 DINOv2
DINOv2 (Oquab et al., 2023): Visual foundation model.
Key Improvements:
- Combines DINO self-distillation + iBOT masked modeling
- Large-scale automatically curated dataset (LVD-142M)
- Larger models (ViT-g, 1.1B parameters)
- KoLeo regularization maintains uniform representation distribution
DINOv2 Loss (image-level self-distillation + patch-level masked prediction + a spreading regularizer):

\[
\mathcal{L} = \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + \lambda \, \mathcal{L}_{\text{KoLeo}}
\]
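A minimal sketch of the KoLeo term, which penalizes features that sit too close to their nearest neighbor in the batch and thereby spreads the representation out; this follows the published formulation rather than DINOv2's training code, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def koleo_loss(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """-mean(log distance to the nearest other sample), on L2-normalized features of shape (n, d)."""
    x = F.normalize(x, dim=1)
    dist = torch.cdist(x, x)                     # (n, n) pairwise Euclidean distances
    dist.fill_diagonal_(float("inf"))            # ignore each point's distance to itself
    nn_dist = dist.min(dim=1).values             # distance to the nearest neighbor
    return -torch.log(nn_dist + eps).mean()
```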
DINOv2 as Feature Extractor:
- Frozen DINOv2 + linear probing achieves SOTA on multiple tasks
- Including: classification, segmentation, depth estimation, retrieval
4. Method Comparison
4.1 Overall Comparison
| Dimension | Contrastive | Masked Modeling | Self-Distillation |
|---|---|---|---|
| Pretext Task | Instance discrimination | Pixel/token reconstruction | Knowledge distillation |
| Data Augmentation | Critical | Less dependent | Multi-crop |
| Negative Samples | Required (or replaced by clustering / redundancy-reduction objectives) | Not needed | Not needed |
| Suitable Architectures | CNN/ViT | Mainly ViT | Mainly ViT |
| Learned Features | Global discriminative | Local + global | Global semantic |
| Representative | SimCLR, MoCo | MAE, BEiT | DINO, DINOv2 |
4.2 Downstream Task Performance
| Method | ImageNet Linear Probe (top-1 %) | ImageNet k-NN (top-1 %) | ADE20K Seg (mIoU) |
|---|---|---|---|
| MoCo v3 (ViT-B) | 76.7 | - | - |
| MAE (ViT-B) | 68.0 | - | 48.1 |
| DINO (ViT-B) | 78.2 | 76.1 | - |
| DINOv2 (ViT-g) | 86.5 | 83.5 | 49.0 |
4.3 Selection Guide
graph TD
A[Choose SSL Method] --> B{Data Scale?}
B -->|Small| C[MAE/DINO]
B -->|Large| D{Target Task?}
D -->|Classification/Retrieval| E[DINOv2]
D -->|Dense Prediction| F{Using ViT?}
F -->|Yes| G[MAE + Fine-tuning]
F -->|No/CNN| H[MoCo/BYOL]
5. Data Augmentation Strategies
Data augmentation is central to the success of contrastive learning; a typical pipeline is sketched after the table:
| Augmentation | SimCLR | MoCo v2 | DINO |
|---|---|---|---|
| Random Crop | Yes | Yes | Yes (multi-crop) |
| Color Jitter | Yes | Yes | Yes |
| Gaussian Blur | Yes | Yes | Yes |
| Horizontal Flip | Yes | Yes | Yes |
| Grayscale | Yes | Yes | - |
| Solarization | - | - | Yes |
| Multi-scale Crop | - | - | Yes |
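The sketch below shows one branch of such a pipeline in torchvision, roughly in the SimCLR/MoCo v2 style of the table; the crop scale, jitter strengths, and probabilities are typical values rather than the exact settings of any one paper.

```python
from torchvision import transforms

# Apply this pipeline twice to the same image to obtain a positive pair of views.
contrastive_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
])
```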
6. Practical Guide
6.1 Pretraining Setup
# MAE pretraining example (simplified pseudocode; patchify and random_mask are assumed helpers)
import torch.nn as nn
import torch.nn.functional as F

class MAE(nn.Module):
    def __init__(self, encoder, decoder, mask_ratio=0.75):
        super().__init__()
        self.encoder = encoder          # ViT backbone, sees only the visible patches
        self.decoder = decoder          # lightweight Transformer, used only during pretraining
        self.mask_ratio = mask_ratio

    def forward(self, x):
        # 1. Patch embedding
        patches = self.patchify(x)
        # 2. Random masking (75% of patches hidden by default)
        visible, masked, ids = self.random_mask(patches)
        # 3. Encode the visible patches only
        latent = self.encoder(visible)
        # 4. Decode: reinsert mask tokens at their original positions and reconstruct
        pred = self.decoder(latent, ids)
        # 5. MSE loss, computed only on the masked patches
        loss = F.mse_loss(pred[masked], patches[masked])
        return loss
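The `random_mask` helper above is left abstract; one way to implement per-image random masking in the MAE style (shuffle patch indices, keep the first \(1 - \text{mask\_ratio}\) fraction) is sketched below. Shapes and names are illustrative, not the reference implementation.

```python
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, L, D). Returns visible patches, masked indices, and the shuffle order."""
    B, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(B, L, device=patches.device)          # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                       # patches with the lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    ids_masked = ids_shuffle[:, len_keep:]
    return visible, ids_masked, ids_shuffle
```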
6.2 Fine-tuning Recommendations
- Linear probing: Freeze the encoder and train only a linear layer on top (evaluates representation quality; a minimal sketch follows this list)
- End-to-end fine-tuning: Unfreeze encoder with smaller learning rate
- Adapter fine-tuning: Parameter-efficient methods like LoRA
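A minimal sketch of the linear-probing setup from the first bullet, assuming `encoder(x)` returns a (B, D) feature tensor; the optimizer and learning rate are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the encoder; only the linear head will receive gradients."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    return head, optimizer

def probe_step(encoder, head, optimizer, images, labels):
    with torch.no_grad():                        # frozen encoder: features computed without gradients
        feats = encoder(images)
    loss = F.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```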
7. Summary and Outlook
Historical Evolution:
- 2020: SimCLR, MoCo → Contrastive learning explosion
- 2020: BYOL → Methods without negatives
- 2021: DINO → Self-distillation + ViT emergent properties
- 2022: MAE → Rise of masked image modeling
- 2023: DINOv2, I-JEPA → Visual foundation models
- 2024+: Multimodal SSL, video SSL
Future Directions:
- Video self-supervised learning (temporal consistency)
- Multimodal joint self-supervision (image + text + audio)
- More efficient pretraining methods
- Fusion of self-supervision with large-scale generative models
References
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations," ICML 2020
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning," CVPR 2020
- Grill et al., "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning," NeurIPS 2020
- He et al., "Masked Autoencoders Are Scalable Vision Learners," CVPR 2022
- Caron et al., "Emerging Properties in Self-Supervised Vision Transformers," ICCV 2021
- Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," TMLR 2024