Visual Self-Supervised Learning
Overview
Visual Self-Supervised Learning learns powerful visual representations from unlabeled images and serves as the core pretraining paradigm for computer vision foundation models. It is broadly categorized into three schools: contrastive learning, masked image modeling, and self-distillation.
graph TD
A[Visual Self-Supervised Learning] --> B[Contrastive Learning]
A --> C[Masked Image Modeling]
A --> D[Self-Distillation]
B --> B1[SimCLR]
B --> B2[MoCo v1/v2/v3]
B --> B3[BYOL]
B --> B4[SwAV]
B --> B5[Barlow Twins]
C --> C1[MAE]
C --> C2[BEiT]
C --> C3[SimMIM]
C --> C4[I-JEPA]
D --> D1[DINO]
D --> D2[DINOv2]
D --> D3[EMA Teacher]
1. Contrastive Learning
1.1 Core Idea
The goal of contrastive learning is to pull positive pairs closer and push negative pairs apart:
- Positive pairs: Different augmented views of the same image
- Negative pairs: Views from different images
1.2 SimCLR
SimCLR (Chen et al., 2020) is the classic contrastive learning framework.
Pipeline:
- Apply two random augmentations to each image \(x\), yielding \(\tilde{x}_i\) and \(\tilde{x}_j\)
- Encoder \(f(\cdot)\) (e.g., ResNet) extracts representations
- Projection head \(g(\cdot)\) (MLP) maps to contrastive space
- Compute NT-Xent loss in the projection space
NT-Xent Loss (Normalized Temperature-scaled Cross Entropy), for a positive pair \((i, j)\) among the \(2N\) augmented views:

\[
\ell_{i,j} = -\log \frac{\exp\left(\text{sim}(z_i, z_j) / \tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\text{sim}(z_i, z_k) / \tau\right)}
\]
where:
- \(z_i = g(f(\tilde{x}_i))\) is the projected representation
- \(\text{sim}(u, v) = \frac{u^\top v}{\|u\| \|v\|}\) is cosine similarity
- \(\tau\) is the temperature hyperparameter
- \(N\) is the number of images in the batch
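The following is a minimal PyTorch sketch of this loss for a batch where rows \(2k\) and \(2k+1\) of `z` are the two views of image \(k\); the function name and layout are illustrative, not SimCLR's reference code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent over 2N projected views; rows (2k, 2k+1) are the two views of image k."""
    z = F.normalize(z, dim=1)                    # unit vectors, so dot products are cosine similarities
    sim = z @ z.t() / tau                        # (2N, 2N) temperature-scaled similarity matrix
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity from the denominator
    pos = torch.arange(z.shape[0], device=z.device) ^ 1   # positive partner: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, pos)             # -log softmax probability of the positive
```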
Key Findings:
- The composition of data augmentations is crucial (random crop + color distortion is most effective)
- A nonlinear projection head \(g\) improves representation quality, even though downstream tasks use the output of \(f\) and discard \(g\)
- Larger batch sizes yield better results (SimCLR uses 4096+)
1.3 MoCo (Momentum Contrast)
Motivation: SimCLR needs very large batches to supply enough negatives. MoCo decouples the number of negatives from the batch size by pairing a momentum-updated key encoder with a queue of past keys.
MoCo v1 (He et al., 2020):
- Momentum encoder: \(\theta_k \leftarrow m \theta_k + (1-m) \theta_q\), \(m=0.999\)
- Queue: Maintains a large negative sample queue (65,536 entries)
- Query encoder updated by gradient; key encoder updated by momentum
MoCo v2: Adds MLP projection head + stronger augmentations
MoCo v3: Drops the queue and contrasts against in-batch negatives (similar to SimCLR), keeps the momentum encoder, and moves to ViT backbones
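Below is a minimal sketch of the two MoCo ingredients above: the momentum update of the key encoder and the InfoNCE loss against a queue of past keys. `encoder_q`, `encoder_k`, and `queue` (a K×d buffer of normalized keys) are assumed, illustrative names, not MoCo's reference code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; the key encoder receives no gradients
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_loss(q: torch.Tensor, k: torch.Tensor, queue: torch.Tensor, tau: float = 0.07):
    """InfoNCE: one positive key per query, the whole queue as negatives."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)     # (B, 1) logit for the positive pair
    l_neg = q @ queue.t()                        # (B, K) logits against queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)  # positive is at index 0
    return F.cross_entropy(logits, labels)
```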
1.4 BYOL (Bootstrap Your Own Latent)
Breakthrough: Self-supervised learning without negative samples!
Architecture:
- Online network: Encoder + projector + predictor
- Target network: Encoder + projector (EMA update)
Loss Function (mean squared error between the normalized online prediction and the target projection):

\[
\mathcal{L} = \left\| \bar{q} - \bar{z}' \right\|_2^2 = 2 - 2 \cdot \frac{\langle q, z' \rangle}{\|q\|_2 \, \|z'\|_2}
\]

where \(\bar{q}\) and \(\bar{z}'\) are the \(L_2\)-normalized predictor output and target projection; in practice the loss is symmetrized by swapping the two views and summing.
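A minimal sketch of one direction of this loss, assuming `q` is the online predictor output and `z_target` the target projector output (names are illustrative):

```python
import torch.nn.functional as F

def byol_loss(q, z_target):
    """2 - 2 * cosine similarity between the L2-normalized prediction and target."""
    q = F.normalize(q, dim=1)
    z_target = F.normalize(z_target.detach(), dim=1)   # stop-gradient: no gradient into the target network
    return (2 - 2 * (q * z_target).sum(dim=1)).mean()
```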
Why doesn't it collapse?
- Asymmetry of the predictor
- The EMA-updated target network provides slowly changing targets
- Implicit regularization from batch normalization
1.5 Other Contrastive Methods
| Method | Key Innovation | Requires Negatives? |
|---|---|---|
| SimCLR | Simple framework, strong augmentations | Yes (large batch) |
| MoCo | Momentum queue | Yes (queue) |
| BYOL | Predictor + EMA | No |
| SwAV | Online clustering | No (clustering substitute) |
| Barlow Twins | Cross-correlation matrix → identity | No |
| VICReg | Variance-Invariance-Covariance regularization | No |
2. Masked Image Modeling
2.1 Core Idea
Inspired by masked language modeling in NLP (BERT), randomly mask parts of an image and have the model predict the masked content.
2.2 MAE (Masked Autoencoders)
MAE (He et al., 2022) is a milestone in masked image modeling.
Design:
- Divide the image into non-overlapping \(16 \times 16\)-pixel patches
- Randomly mask 75% of patches (extremely high masking ratio)
- Encoder (ViT) processes only visible patches
- Lightweight decoder reconstructs masked pixels
Loss Function: MSE on the masked patches:

\[
\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{x}_i - x_i \right\|_2^2
\]

where \(\mathcal{M}\) is the set of masked patches, \(x_i\) is the original patch, and \(\hat{x}_i\) its reconstruction.
Key Design Choices:
- A 75% masking ratio, far above BERT's 15%, works because images are much more spatially redundant than text
- Encoder does not process mask tokens → significant compute savings
- Decoder is lightweight (a narrow Transformer, 8 blocks of width 512 by default, far smaller than the encoder) and discarded after pretraining → the encoder must carry the semantics
2.3 BEiT
BEiT (Bao et al., 2022): Predicts discrete visual tokens instead of raw pixels.
Two-Stage Pipeline:
- Pretrain a dVAE (discrete VAE) as visual tokenizer
- Mask patches → predict corresponding dVAE tokens
BEiT v2: Replaces dVAE with VQ-KD (Vector-Quantized Knowledge Distillation)
2.4 Other Masked Modeling Methods
| Method | Prediction Target | Encoder |
|---|---|---|
| MAE | Raw pixels | ViT |
| BEiT | dVAE tokens | ViT |
| SimMIM | Raw pixels (simple head) | Swin Transformer |
| I-JEPA | Abstract representations (not pixels) | ViT |
| data2vec | Teacher model representations | ViT (same recipe also applies to speech and text) |
2.5 I-JEPA
I-JEPA (Assran et al., 2023) predicts abstract representations rather than pixels or tokens:
- Does not use data augmentation (unlike contrastive learning)
- Predicts target region representations in representation space
- Avoids the low-level bias of pixel-level prediction
3. Self-Distillation
3.1 DINO
DINO (Caron et al., 2021): self-DIstillation with NO labels, applied to Vision Transformers.
Architecture: Student-teacher framework
- Student network: Updated by gradient
- Teacher network: EMA update \(\theta_t \leftarrow \lambda \theta_t + (1-\lambda) \theta_s\)
Multi-Crop Strategy:
- 2 global views (224×224) → fed to both the teacher and the student
- N local views (96×96) → fed to the student only
- The student's outputs for all views are trained to match the teacher's outputs for the global views
Loss Function (cross-entropy between the teacher and student output distributions):

\[
\mathcal{L} = - \sum_{x \in \{x^g_1,\, x^g_2\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} P_t(x)^\top \log P_s(x')
\]

where \(V\) is the set of all crops; the teacher sees only the global views, the student sees every view.
Centering and sharpening prevent collapse: \(P_t(x) = \text{softmax}\big((g_t(x) - c) / \tau_t\big)\), where \(c\) is an EMA of the teacher's outputs and the low teacher temperature \(\tau_t\) sharpens the distribution.
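A minimal sketch of this objective for one (student view, teacher global view) pair, plus the EMA update of the center \(c\); the temperatures and momentum are typical values and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s: float = 0.1, tau_t: float = 0.04):
    """Cross-entropy H(P_t, P_s): the teacher is centered and sharpened, the student stays soft."""
    p_t = F.softmax((teacher_out.detach() - center) / tau_t, dim=1)   # stop-gradient through the teacher
    log_p_s = F.log_softmax(student_out / tau_s, dim=1)
    return -(p_t * log_p_s).sum(dim=1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum: float = 0.9):
    """EMA of teacher outputs; subtracting it keeps any one dimension from dominating."""
    return momentum * center + (1 - momentum) * teacher_out.mean(dim=0)
```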
Emergent Properties of DINO:
- Attention maps automatically segment foreground objects
- Learns semantic segmentation in an unsupervised manner
- k-NN classifier achieves strong results
3.2 DINOv2
DINOv2 (Oquab et al., 2023): Visual foundation model.
Key Improvements:
- Combines DINO self-distillation + iBOT masked modeling
- Large-scale automatically curated dataset (LVD-142M)
- Larger models (ViT-g, 1.1B parameters)
- KoLeo regularization maintains uniform representation distribution
DINOv2 Loss (image-level self-distillation + patch-level masked prediction + a spreading regularizer):

\[
\mathcal{L} = \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + \lambda \, \mathcal{L}_{\text{KoLeo}}
\]
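A minimal sketch of the KoLeo term, which penalizes features that sit too close to their nearest neighbor in the batch and thereby spreads the representation out; this follows the published formulation rather than DINOv2's training code, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def koleo_loss(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """-mean(log distance to the nearest other sample), on L2-normalized features of shape (n, d)."""
    x = F.normalize(x, dim=1)
    dist = torch.cdist(x, x)                     # (n, n) pairwise Euclidean distances
    dist.fill_diagonal_(float("inf"))            # ignore each point's distance to itself
    nn_dist = dist.min(dim=1).values             # distance to the nearest neighbor
    return -torch.log(nn_dist + eps).mean()
```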
DINOv2 as Feature Extractor:
- Frozen DINOv2 + linear probing achieves SOTA on multiple tasks
- Including: classification, segmentation, depth estimation, retrieval
4. Method Comparison
4.1 Overall Comparison
| Dimension | Contrastive | Masked Modeling | Self-Distillation |
|---|---|---|---|
| Pretext Task | Instance discrimination | Pixel/token reconstruction | Knowledge distillation |
| Data Augmentation | Critical | Less dependent | Multi-crop |
| Negative Samples | Required (or replaced by clustering / redundancy-reduction objectives) | Not needed | Not needed |
| Suitable Architectures | CNN/ViT | Mainly ViT | Mainly ViT |
| Learned Features | Global discriminative | Local + global | Global semantic |
| Representative | SimCLR, MoCo | MAE, BEiT | DINO, DINOv2 |
4.2 Downstream Task Performance
| Method | ImageNet Linear Probe (top-1 %) | ImageNet k-NN (top-1 %) | ADE20K Seg (mIoU) |
|---|---|---|---|
| MoCo v3 (ViT-B) | 76.7 | - | - |
| MAE (ViT-B) | 68.0 | - | 48.1 |
| DINO (ViT-B) | 78.2 | 76.1 | - |
| DINOv2 (ViT-g) | 86.5 | 83.5 | 49.0 |
4.3 Selection Guide
graph TD
A[Choose SSL Method] --> B{Data Scale?}
B -->|Small| C[MAE/DINO]
B -->|Large| D{Target Task?}
D -->|Classification/Retrieval| E[DINOv2]
D -->|Dense Prediction| F{Using ViT?}
F -->|Yes| G[MAE + Fine-tuning]
F -->|No/CNN| H[MoCo/BYOL]
5. Data Augmentation Strategies
Data augmentation is central to the success of contrastive learning; a typical pipeline is sketched after the table:
| Augmentation | SimCLR | MoCo v2 | DINO |
|---|---|---|---|
| Random Crop | Yes | Yes | Yes (multi-crop) |
| Color Jitter | Yes | Yes | Yes |
| Gaussian Blur | Yes | Yes | Yes |
| Horizontal Flip | Yes | Yes | Yes |
| Grayscale | Yes | Yes | - |
| Solarization | - | - | Yes |
| Multi-scale Crop | - | - | Yes |
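The sketch below shows one branch of such a pipeline in torchvision, roughly in the SimCLR/MoCo v2 style of the table; the crop scale, jitter strengths, and probabilities are typical values rather than the exact settings of any one paper.

```python
from torchvision import transforms

# Apply this pipeline twice to the same image to obtain a positive pair of views.
contrastive_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
])
```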
6. Practical Guide
6.1 Pretraining Setup
# MAE pretraining example (simplified pseudocode; patchify and random_mask are assumed helpers)
import torch.nn as nn
import torch.nn.functional as F

class MAE(nn.Module):
    def __init__(self, encoder, decoder, mask_ratio=0.75):
        super().__init__()
        self.encoder = encoder          # ViT backbone, sees only the visible patches
        self.decoder = decoder          # lightweight Transformer, used only during pretraining
        self.mask_ratio = mask_ratio

    def forward(self, x):
        # 1. Patch embedding
        patches = self.patchify(x)
        # 2. Random masking (75% of patches hidden by default)
        visible, masked, ids = self.random_mask(patches)
        # 3. Encode the visible patches only
        latent = self.encoder(visible)
        # 4. Decode: reinsert mask tokens at their original positions and reconstruct
        pred = self.decoder(latent, ids)
        # 5. MSE loss, computed only on the masked patches
        loss = F.mse_loss(pred[masked], patches[masked])
        return loss
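The `random_mask` helper above is left abstract; one way to implement per-image random masking in the MAE style (shuffle patch indices, keep the first \(1 - \text{mask\_ratio}\) fraction) is sketched below. Shapes and names are illustrative, not the reference implementation.

```python
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, L, D). Returns visible patches, masked indices, and the shuffle order."""
    B, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(B, L, device=patches.device)          # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                       # patches with the lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    ids_masked = ids_shuffle[:, len_keep:]
    return visible, ids_masked, ids_shuffle
```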
6.2 Fine-tuning Recommendations
- Linear probing: Freeze the encoder and train only a linear layer on top (evaluates representation quality; a minimal sketch follows this list)
- End-to-end fine-tuning: Unfreeze encoder with smaller learning rate
- Adapter fine-tuning: Parameter-efficient methods like LoRA
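A minimal sketch of the linear-probing setup from the first bullet, assuming `encoder(x)` returns a (B, D) feature tensor; the optimizer and learning rate are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int):
    """Freeze the encoder; only the linear head will receive gradients."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    return head, optimizer

def probe_step(encoder, head, optimizer, images, labels):
    with torch.no_grad():                        # frozen encoder: features computed without gradients
        feats = encoder(images)
    loss = F.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```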
7. Summary and Outlook
Historical Evolution:
- 2020: SimCLR, MoCo → Contrastive learning explosion
- 2020: BYOL → Methods without negatives
- 2021: DINO → Self-distillation + ViT emergent properties
- 2022: MAE → Rise of masked image modeling
- 2023: DINOv2, I-JEPA → Visual foundation models
- 2024+: Multimodal SSL, video SSL
Future Directions:
- Video self-supervised learning (temporal consistency)
- Multimodal joint self-supervision (image + text + audio)
- More efficient pretraining methods
- Fusion of self-supervision with large-scale generative models
References
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations," ICML 2020
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning," CVPR 2020
- Grill et al., "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning," NeurIPS 2020
- He et al., "Masked Autoencoders Are Scalable Vision Learners," CVPR 2022
- Caron et al., "Emerging Properties in Self-Supervised Vision Transformers," ICCV 2021
- Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision," TMLR 2024