
Visual Self-Supervised Learning

Overview

Visual self-supervised learning (SSL) extracts powerful visual representations from unlabeled images and serves as the core pretraining paradigm for computer vision foundation models. It is broadly categorized into three schools: contrastive learning, masked image modeling, and self-distillation.

graph TD
    A[Visual Self-Supervised Learning] --> B[Contrastive Learning]
    A --> C[Masked Image Modeling]
    A --> D[Self-Distillation]

    B --> B1[SimCLR]
    B --> B2[MoCo v1/v2/v3]
    B --> B3[BYOL]
    B --> B4[SwAV]
    B --> B5[Barlow Twins]

    C --> C1[MAE]
    C --> C2[BEiT]
    C --> C3[SimMIM]
    C --> C4[I-JEPA]

    D --> D1[DINO]
    D --> D2[DINOv2]
    D --> D3[EMA Teacher]

1. Contrastive Learning

1.1 Core Idea

The goal of contrastive learning is to pull positive pairs closer and push negative pairs apart:

  • Positive pairs: Different augmented views of the same image
  • Negative pairs: Views from different images

1.2 SimCLR

SimCLR (Chen et al., 2020) is the classic contrastive learning framework.

Pipeline:

  1. Apply two random augmentations to each image \(x\), yielding \(\tilde{x}_i\) and \(\tilde{x}_j\)
  2. Encoder \(f(\cdot)\) (e.g., ResNet) extracts representations
  3. Projection head \(g(\cdot)\) (MLP) maps to contrastive space
  4. Compute NT-Xent loss in the projection space

NT-Xent Loss (Normalized Temperature-scaled Cross Entropy):

\[ \ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)} \]

where:

  • \(z_i = g(f(\tilde{x}_i))\) is the projected representation
  • \(\text{sim}(u, v) = \frac{u^\top v}{\|u\| \|v\|}\) is cosine similarity
  • \(\tau\) is the temperature hyperparameter
  • \(N\) is the number of images in the batch
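
In code, the loss reduces to a cross-entropy over rows of a similarity matrix. Below is a minimal PyTorch sketch; the function name and batching convention are illustrative, not taken from the SimCLR codebase:

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    # z1, z2: (N, d) projections of two augmented views of the same N images
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / tau                               # cosine similarities / temperature
    sim.fill_diagonal_(float('-inf'))                   # exclude self-pairs (k != i)
    N = z1.size(0)
    # the positive for row i is row i + N, and vice versa
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)

The softmax inside F.cross_entropy reproduces the denominator of \(\ell_{i,j}\): every other sample in the batch acts as a negative.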

Key Findings:

  • The composition of data augmentations is crucial (random crop + color distortion is most effective)
  • The nonlinear projection head \(g\) is important during pretraining, but downstream tasks use the output of \(f\) (representations before the head transfer better)
  • Larger batch sizes yield better results (SimCLR uses 4096+)

1.3 MoCo (Momentum Contrast)

Motivation: SimCLR requires extremely large batch sizes. MoCo decouples batch size from the number of negatives through a momentum queue.

MoCo v1 (He et al., 2020):

  • Momentum encoder: \(\theta_k \leftarrow m \theta_k + (1-m) \theta_q\), \(m=0.999\)
  • Queue: Maintains a large negative sample queue (65,536 entries)
  • Query encoder updated by gradient; key encoder updated by momentum
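
A minimal sketch of the two MoCo update rules follows; the names and the fixed-pointer queue are simplified relative to the official implementation:

import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # the key encoder follows the query encoder as a slow EMA
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, ptr, keys):
    # queue: (d, K) buffer of negatives; keys: (N, d) from the key encoder
    # assumes the queue size K is divisible by the batch size N
    n = keys.size(0)
    queue[:, ptr:ptr + n] = keys.t()
    return (ptr + n) % queue.size(1)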

MoCo v2: Adds MLP projection head + stronger augmentations

MoCo v3: Removes queue, uses in-batch contrastive (similar to SimCLR) with ViT

1.4 BYOL (Bootstrap Your Own Latent)

Breakthrough: Self-supervised learning without negative samples!

Architecture:

  • Online network: Encoder + projector + predictor
  • Target network: Encoder + projector (EMA update)

Loss Function:

\[ \mathcal{L}_{\text{BYOL}} = \| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \|_2^2 \]

where \(\bar{q}\) and \(\bar{z}'\) are \(L_2\)-normalized vectors.
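
Because both vectors are \(L_2\)-normalized, the loss is equivalent to \(2 - 2\cos(q_\theta, z'_\xi)\). A minimal sketch (names are illustrative):

import torch.nn.functional as F

def byol_loss(p, z_target):
    # p: online predictor output; z_target: target-network projection (no gradient)
    p = F.normalize(p, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()  # == ||p - z||_2^2 for unit vectors

In practice the loss is symmetrized by swapping which view goes through the online and target networks.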

Why doesn't it collapse?

  • Asymmetry of the predictor
  • The EMA-updated target network provides slowly changing targets
  • Batch normalization was initially conjectured to provide implicit regularization (later work showed BYOL can be trained without it)

1.5 Other Contrastive Methods

| Method | Key Innovation | Requires Negatives? |
| --- | --- | --- |
| SimCLR | Simple framework, strong augmentations | Yes (large batch) |
| MoCo | Momentum queue | Yes (queue) |
| BYOL | Predictor + EMA | No |
| SwAV | Online clustering | No (clustering substitute) |
| Barlow Twins | Cross-correlation matrix → identity | No |
| VICReg | Variance-Invariance-Covariance regularization | No |
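
To make the negative-free idea concrete, here is a minimal sketch of the Barlow Twins objective, which pushes the cross-correlation matrix of the two views' (batch-standardized) embeddings toward the identity; the weight lambd follows the paper's order of magnitude, other names are illustrative:

import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    # z1, z2: (N, d) embeddings of two views, standardized along the batch
    N = z1.size(0)
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / N                                           # (d, d) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # redundancy reduction
    return on_diag + lambd * off_diag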

2. Masked Image Modeling

2.1 Core Idea

Inspired by masked language modeling in NLP (BERT), randomly mask parts of an image and have the model predict the masked content.

2.2 MAE (Masked Autoencoders)

MAE (He et al., 2022) is a milestone in masked image modeling.

Design:

  1. Divide the image into non-overlapping \(16 \times 16\)-pixel patches
  2. Randomly mask 75% of patches (extremely high masking ratio)
  3. Encoder (ViT) processes only visible patches
  4. Lightweight decoder reconstructs masked pixels

Loss Function: MSE on masked patches:

\[ \mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \| \hat{x}_i - x_i \|_2^2 \]

where \(\mathcal{M}\) is the set of masked patches.

Key Design Choices:

  • 75% masking ratio >> 15% in NLP, because images have higher redundancy
  • Encoder does not process mask tokens → significant compute savings
  • Decoder is lightweight (narrow Transformer blocks, 8 by default) and discarded after pretraining → the encoder must carry the semantics
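
Random masking can be implemented with a per-sample random permutation, in the spirit of the official MAE code; the following sketch is simplified, and shapes and names are illustrative:

import torch

def random_mask(patches, mask_ratio=0.75):
    # patches: (B, L, D) patch embeddings; keep L * (1 - mask_ratio) per sample
    B, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(B, L)                      # one random score per patch
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, L)
    mask.scatter_(1, ids_keep, 0)                 # 0 = visible, 1 = masked
    ids_restore = ids_shuffle.argsort(dim=1)      # to unshuffle in the decoder
    return visible, mask, ids_restore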

2.3 BEiT

BEiT (Bao et al., 2022): Predicts discrete visual tokens instead of raw pixels.

Two-Stage Pipeline:

  1. Obtain a dVAE (discrete VAE) visual tokenizer (BEiT reuses the DALL-E tokenizer)
  2. Mask patches → predict corresponding dVAE tokens
\[ \mathcal{L}_{\text{BEiT}} = -\sum_{i \in \mathcal{M}} \log p(z_i | x_{\backslash \mathcal{M}}) \]
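
In implementation terms this is a cross-entropy over the tokenizer's vocabulary at the masked positions; a minimal sketch (names are illustrative):

import torch.nn.functional as F

def beit_loss(logits, token_ids, mask):
    # logits: (B, L, V) predictions over the visual vocabulary
    # token_ids: (B, L) ground-truth dVAE token indices; mask: (B, L), 1 = masked
    loss = F.cross_entropy(logits.transpose(1, 2), token_ids, reduction='none')  # (B, L)
    return (loss * mask).sum() / mask.sum()  # average over masked positions only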

BEiT v2: Replaces dVAE with VQ-KD (Vector-Quantized Knowledge Distillation)

2.4 Other Masked Modeling Methods

| Method | Prediction Target | Encoder |
| --- | --- | --- |
| MAE | Raw pixels | ViT |
| BEiT | dVAE tokens | ViT |
| SimMIM | Raw pixels (simple head) | Swin Transformer |
| I-JEPA | Abstract representations (not pixels) | ViT |
| data2vec | Teacher model representations | Modality-specific (ViT for images) |

2.5 I-JEPA

I-JEPA (Assran et al., 2023) predicts abstract representations rather than pixels or tokens:

  • Does not use data augmentation (unlike contrastive learning)
  • Predicts target region representations in representation space
  • Avoids the low-level bias of pixel-level prediction
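
Schematically, the objective is a regression in representation space. The sketch below is heavily simplified: it uses a single target block and illustrative interfaces, whereas the actual method samples multiple target blocks and conditions the predictor on positional embeddings:

import torch
import torch.nn.functional as F

def ijepa_loss(patches, ctx_idx, tgt_idx, context_encoder, predictor, target_encoder):
    # patches: (B, L, D); ctx_idx / tgt_idx: patch indices of the context / target block
    with torch.no_grad():                              # EMA target encoder, no gradient
        target = target_encoder(patches)[:, tgt_idx]   # target-block representations
    ctx = context_encoder(patches[:, ctx_idx])         # encode the context block only
    pred = predictor(ctx, tgt_idx)                     # predict reps at target positions
    return F.mse_loss(pred, target)                    # L2 loss in representation space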

3. Self-Distillation

3.1 DINO

DINO (Caron et al., 2021): self-DIstillation with NO labels, a student-teacher framework applied to ViTs.

Architecture: Student-teacher framework

  • Student network: Updated by gradient
  • Teacher network: EMA update \(\theta_t \leftarrow \lambda \theta_t + (1-\lambda) \theta_s\)

Multi-Crop Strategy:

  • 2 global views (224×224) → teacher and student
  • N local views (96×96) → student only
  • The student's output on every view is trained to match the teacher's output on the global views

Loss Function (cross-entropy):

\[ \mathcal{L} = -\sum_{t \in \{x_1^g, x_2^g\}} \sum_{\substack{s \in V \\ s \neq t}} P_t(t)^\top \log P_s(s) \]

where \(V\) is the set of all views (global and local), and \(P_t\), \(P_s\) are the teacher's and student's output distributions.

Centering and sharpening prevent collapse: \(P_t(x) = \text{softmax}((g_t(x) - c) / \tau_t)\), where the center \(c\) is an EMA of teacher outputs and a low teacher temperature \(\tau_t\) sharpens the distribution.
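
A minimal sketch of the teacher/student loss with centering and sharpening; the temperatures follow typical DINO settings and the names are illustrative:

import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # teacher: center then sharpen with a low temperature; no gradient flows back
    p_t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    log_p_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()         # cross-entropy H(P_t, P_s)

@torch.no_grad()
def update_center(center, teacher_out, m=0.9):
    # the center c is an EMA of teacher outputs over batches
    return m * center + (1 - m) * teacher_out.mean(dim=0)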

Emergent Properties of DINO:

  • Attention maps automatically segment foreground objects
  • Learns semantic segmentation in an unsupervised manner
  • k-NN classifier achieves strong results

3.2 DINOv2

DINOv2 (Oquab et al., 2023) scales self-distillation into a general-purpose visual foundation model.

Key Improvements:

  • Combines DINO self-distillation + iBOT masked modeling
  • Large-scale automatically curated dataset (LVD-142M)
  • Larger models (ViT-g, 1.1B parameters)
  • KoLeo regularization maintains uniform representation distribution

DINOv2 Loss:

\[ \mathcal{L} = \mathcal{L}_{\text{DINO}} + \lambda \mathcal{L}_{\text{iBOT}} \]

DINOv2 as Feature Extractor:

  • Frozen DINOv2 + linear probing achieves SOTA on multiple tasks
  • Including: classification, segmentation, depth estimation, retrieval

4. Method Comparison

4.1 Overall Comparison

| Dimension | Contrastive | Masked Modeling | Self-Distillation |
| --- | --- | --- | --- |
| Pretext Task | Instance discrimination | Pixel/token reconstruction | Knowledge distillation |
| Data Augmentation | Critical | Less dependent | Multi-crop |
| Negative Samples | Required (or alternatives) | Not needed | Not needed |
| Suitable Architectures | CNN/ViT | Mainly ViT | Mainly ViT |
| Learned Features | Global, discriminative | Local + global | Global, semantic |
| Representative Methods | SimCLR, MoCo | MAE, BEiT | DINO, DINOv2 |

4.2 Downstream Task Performance

| Method | ImageNet Linear (top-1) | ImageNet k-NN (top-1) | ADE20K Seg (mIoU) |
| --- | --- | --- | --- |
| MoCo v3 (ViT-B) | 76.7 | - | - |
| MAE (ViT-B) | 68.0 | - | 48.1 |
| DINO (ViT-B) | 78.2 | 76.1 | - |
| DINOv2 (ViT-g) | 86.5 | 83.5 | 49.0 |

4.3 Selection Guide

graph TD
    A[Choose SSL Method] --> B{Data Scale?}
    B -->|Small| C[MAE/DINO]
    B -->|Large| D{Target Task?}
    D -->|Classification/Retrieval| E[DINOv2]
    D -->|Dense Prediction| F{Using ViT?}
    F -->|Yes| G[MAE + Fine-tuning]
    F -->|No/CNN| H[MoCo/BYOL]

5. Data Augmentation Strategies

Data augmentation is central to the success of contrastive learning:

| Augmentation | SimCLR | MoCo v2 | DINO |
| --- | --- | --- | --- |
| Random Crop | Yes | Yes | Yes (multi-crop) |
| Color Jitter | Yes | Yes | Yes |
| Gaussian Blur | Yes | Yes | Yes |
| Horizontal Flip | Yes | Yes | Yes |
| Grayscale | Yes | Yes | - |
| Solarization | - | - | Yes |
| Multi-scale Crop | - | - | Yes |
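
A SimCLR-style pipeline can be written directly with torchvision; the parameter values below follow the commonly used recipe and should be treated as a starting point:

import torchvision.transforms as T

simclr_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),   # color distortion
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    T.ToTensor(),
])

# two independent draws give the two views of one image:
# view1, view2 = simclr_aug(img), simclr_aug(img)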

6. Practical Guide

6.1 Pretraining Setup

# MAE pretraining example (simplified; patchify/random_mask are helper methods)
import torch
import torch.nn as nn

class MAE(nn.Module):
    def __init__(self, encoder, decoder, mask_ratio=0.75):
        super().__init__()
        self.encoder = encoder        # ViT, sees visible patches only
        self.decoder = decoder        # lightweight Transformer
        self.mask_ratio = mask_ratio

    def forward(self, x):
        # 1. Patch embedding: (B, C, H, W) -> (B, L, D)
        patches = self.patchify(x)
        # 2. Random masking: keep (1 - mask_ratio) of the patches
        visible, mask, ids_restore = self.random_mask(patches)
        # 3. Encode visible patches only
        latent = self.encoder(visible)
        # 4. Decoder inserts mask tokens and reconstructs all patches
        pred = self.decoder(latent, ids_restore)
        # 5. MSE loss, averaged over masked patches only
        loss = (pred - patches).pow(2).mean(dim=-1)   # per-patch reconstruction error
        return (loss * mask).sum() / mask.sum()

6.2 Fine-tuning Recommendations

  • Linear probing: Freeze encoder, train only a linear layer (evaluates representation quality)
  • End-to-end fine-tuning: Unfreeze encoder with smaller learning rate
  • Adapter fine-tuning: Parameter-efficient methods like LoRA
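
A linear probe in PyTorch looks like the sketch below, where encoder, feat_dim, num_classes, and loader are placeholders for your setup:

import torch
import torch.nn as nn

encoder.eval()                                   # pretrained SSL backbone (assumed given)
for p in encoder.parameters():
    p.requires_grad = False                      # freeze all backbone weights

probe = nn.Linear(feat_dim, num_classes)         # the only trainable layer
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)

for images, labels in loader:
    with torch.no_grad():
        feats = encoder(images)                  # frozen features
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()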

7. Summary and Outlook

Historical Evolution:

  1. 2020: SimCLR, MoCo → Contrastive learning explosion
  2. 2020: BYOL → Methods without negatives
  3. 2021: DINO → Self-distillation + ViT emergent properties
  4. 2022: MAE → Rise of masked image modeling
  5. 2023: DINOv2, I-JEPA → Visual foundation models
  6. 2024+: Multimodal SSL, video SSL

Future Directions:

  • Video self-supervised learning (temporal consistency)
  • Multimodal joint self-supervision (image + text + audio)
  • More efficient pretraining methods
  • Fusion of self-supervision with large-scale generative models

References

  • Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations," ICML 2020
  • He et al., "Momentum Contrast for Unsupervised Visual Representation Learning," CVPR 2020
  • Grill et al., "Bootstrap Your Own Latent," NeurIPS 2020
  • He et al., "Masked Autoencoders Are Scalable Vision Learners," CVPR 2022
  • Caron et al., "Emerging Properties in Self-Supervised Vision Transformers," ICCV 2021
  • Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," TMLR 2024
