
Representation Learning

📁 1_Representation/

  • Overview of representation learning
  • Geometry of representation spaces
  • Linear Probing
  • Contrastive Learning (InfoNCE)
  • Masked Modeling (MAE)
  • Joint Embedding vs Generative Modeling

This chapter lays the groundwork for foundation models.


The following topics are also touched upon:

📁 8_Reasoning/

  • Representation vs Reasoning
  • Emergent Abilities
  • The World Model Hypothesis
  • Is Representation Equivalent to Knowledge?
  • Statistical Learning vs Structural Reasoning

The goal of representation learning is:

To learn a good feature representation space in which semantic structure becomes simple.

Representation learning is the foundation of Foundation Models.

Core characteristics of foundation models:

  1. Large-scale pretraining
  2. General-purpose capabilities
  3. Transferability to multiple tasks
  4. Downstream fine-tuning or zero-shot inference

Regardless of the specific model—

  • BERT
  • GPT
  • CLIP
  • ViT
  • Multimodal LLMs

—they all share one common trait:

They first learn a general-purpose representation space.

This is because the goal of a foundation model is not to solve a single task.

It must be able to:

  • Classify
  • Generate
  • Retrieve
  • Reason
  • Transfer

To support all of these capabilities, one condition must first be met:

Semantics must be organized into a stable structure within the internal space.

In other words:

The capabilities of a foundation model stem from the "semantic geometry" it has learned.

If the internal representations are chaotic, no decoder, however powerful, can help.

Representation Learning (objective)
    ├── Supervised Learning
    ├── Unsupervised Learning
    └── Self-Supervised Learning (modern mainstream)

Historical progression:

  1. Feature Engineering Era (1990s–2012): SIFT, HOG, LBP; humans designed features, then applied SVM / classifiers. Features were hand-crafted, resulting in weak generalization.
  2. Rise of Representation Learning (2006–2014): Autoencoders, RBMs, Deep Belief Networks. No need for hand-crafted features—models learned representations automatically. The term "representation learning" began to gain currency around 2006. AlexNet's success in 2012 made CNN-based automatic feature learning the mainstream approach, and in the years that followed, representation learning was essentially synonymous with supervised deep learning.
  3. Unsupervised Representation Learning (2014–2018): Researchers realized that labeled data was too expensive, and began exploring autoencoders, GANs, context prediction, word2vec, etc. This phase was mainly impactful in NLP; results in the vision domain remained limited.
  4. Self-Supervised Learning Era (2018–2021): Key methods include CPC, SimCLR, MoCo, BYOL, and MAE. The goal was to learn transferable representations without labels. Application domains included image pretraining, video, and speech. A consensus emerged during this period: self-supervised pretraining can match or surpass supervised pretraining for transfer.
  5. Cross-Modal Representation Learning (2021–present): Key models include CLIP, ALIGN, Flamingo, and GPT-4V. The focus shifted to learning joint representations. Application domains include zero-shot classification, multimodal QA, and text-to-image generation. Representation learning has now become the core of foundation models.

Contrastive Learning

The core idea of contrastive learning is to pull positive pairs closer and push negative pairs apart in embedding space. Rather than relying on labels, it constructs positive pairs through data augmentation.

SimCLR (Chen et al., 2020)

SimCLR is one of the most influential contrastive learning frameworks, proposed by Google Brain.

Pipeline:

  1. Apply two different data augmentations to the same image, producing \(x_i\) and \(x_j\) (a positive pair)
  2. Extract representations through an encoder \(f(\cdot)\) (e.g., ResNet)
  3. Project into a lower-dimensional space through a projection head \(g(\cdot)\) (MLP)
  4. Compute the NT-Xent loss for contrastive learning

NT-Xent Loss (Normalized Temperature-scaled Cross Entropy):

\[\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)}\]

where \(\text{sim}(u, v) = \frac{u^\top v}{\|u\| \|v\|}\) is cosine similarity, \(\tau\) is a temperature parameter, and \(N\) is the batch size.
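A minimal PyTorch sketch of this loss (the function name `nt_xent_loss` and the batch shapes are illustrative; the paper additionally relies on very large, distributed batches):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    # z1, z2: (N, D) projections of the two augmented views; row i of each
    # forms a positive pair. Concatenate to 2N samples as in the formula.
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # unit norm -> dot = cosine
    sim = z @ z.t() / tau                               # (2N, 2N) similarities / temperature
    sim.fill_diagonal_(float("-inf"))                   # implements the k != i indicator
    # The positive for sample i is sample i + N (and vice versa).
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Setting the diagonal to negative infinity zeroes out the self-similarity term in the denominator, so the cross-entropy over rows reproduces \(\mathcal{L}_{i,j}\) exactly.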

Key findings:

  • Large batch sizes are essential (the paper used 4096 or even 8192)
  • The combination of data augmentations (especially random crop + color jitter) matters more than the network architecture
  • The projection head \(g(\cdot)\) substantially improves representation quality during pretraining, but downstream tasks use the output of \(f(\cdot)\), not \(g(\cdot)\)

MoCo (He et al., 2020)

MoCo (Momentum Contrast) addresses SimCLR's dependence on large batch sizes.

Core design:

  • Momentum Encoder: Maintains a slowly updated encoder \(f_k\) whose parameters are an exponential moving average of the query encoder \(f_q\):
\[\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q, \quad m = 0.999\]
  • Queue: Maintains a large queue of negative samples (e.g., 65,536), decoupling the number of negatives from the batch size
  • The queue is updated in FIFO fashion, with old samples gradually replaced

Advantage: Only a standard batch size (256) is needed to access a large pool of negatives, drastically reducing GPU requirements.
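A sketch of the two core pieces under these assumptions (`f_q`, `f_k` are the query and key encoders; the queue bookkeeping is reduced to a comment):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

def moco_loss(q, k, queue, tau=0.07):
    # q: (N, D) queries, k: (N, D) keys, queue: (K, D) stored negatives;
    # all rows assumed L2-normalized, k and queue detached from the graph.
    l_pos = (q * k).sum(dim=1, keepdim=True)   # (N, 1) positive logits
    l_neg = q @ queue.t()                      # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)

# FIFO queue update after each step (oldest keys fall off the end):
# queue = torch.cat([k.detach(), queue], dim=0)[:K]
```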

BYOL (Grill et al., 2020)

BYOL (Bootstrap Your Own Latent) achieved a breakthrough: it requires no negative samples at all.

Architecture:

  • Online network: encoder + projection head + prediction head
  • Target network: encoder + projection head (momentum-updated, no prediction head)

Loss function:

\[\mathcal{L} = \| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \|_2^2\]

where \(\bar{\cdot}\) denotes L2 normalization.
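A minimal sketch of this loss, assuming `p_online` is the prediction-head output and `z_target` the target-network projection; the full method symmetrizes by swapping the two augmented views and summing both losses:

```python
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    # p_online: prediction-head output of the online network, (N, D)
    # z_target: projection from the momentum target network, (N, D)
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)   # no gradient through the target
    # For unit vectors, ||p - z||^2 = 2 - 2 <p, z>
    return (2 - 2 * (p * z).sum(dim=1)).mean()
```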

Why doesn't it collapse? This is the most surprising aspect of BYOL. The core reasons include:

  1. The asymmetric design of the prediction head breaks symmetry
  2. Momentum updates provide a slowly changing target
  3. Batch Normalization implicitly introduces a negative-sample effect (early follow-up experiments suggested that removing BN causes collapse, though later work showed BYOL can also be trained without batch statistics)

CLIP: Contrastive Language-Image Pre-training

CLIP (Radford et al., 2021)

CLIP extended contrastive learning to the vision-language multimodal domain.

Architecture:

  • Image encoder: ViT or ResNet
  • Text encoder: Transformer
  • Both encoders map images and text into a shared embedding space

Training objective: Given a batch of \(N\) image-text pairs, maximize the similarity of matching pairs and minimize the similarity of non-matching pairs:

\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{i,i}/\tau)}{\sum_j \exp(s_{i,j}/\tau)} + \log\frac{\exp(s_{i,i}/\tau)}{\sum_j \exp(s_{j,i}/\tau)}\right]\]

where \(s_{i,j} = \text{sim}(\text{img}_i, \text{txt}_j)\).
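A compact sketch of this symmetric objective (in the actual model, \(\tau\) is a learned logit scale rather than a fixed constant):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    # img_emb, txt_emb: (N, D); row i of each side is a matching pair.
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / tau                    # logits[i, j] = s_{i,j} / tau
    labels = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```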

Zero-shot classification:

  1. Convert class names into text prompts (e.g., "a photo of a {class}")
  2. Encode the image and all class texts separately
  3. Select the class with the highest cosine similarity as the prediction
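The three steps above as a sketch, in which `image_encoder`, `text_encoder`, and `tokenize` are placeholders for whatever CLIP implementation is in use:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # image: (3, H, W) preprocessed tensor; encoders map into the shared space.
    prompts = [f"a photo of a {c}" for c in class_names]          # step 1
    txt = F.normalize(text_encoder(tokenize(prompts)), dim=1)     # step 2: (C, D)
    img = F.normalize(image_encoder(image.unsqueeze(0)), dim=1)   # step 2: (1, D)
    sims = (img @ txt.t()).squeeze(0)                             # cosine similarities
    return class_names[sims.argmax().item()]                      # step 3
```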

CLIP's far-reaching impact:

  • Demonstrated the power of natural language supervision (400 million image-text pairs)
  • Pioneered the zero-shot visual understanding paradigm
  • Became a foundational component for subsequent models such as DALL-E, Stable Diffusion, and GPT-4V

DINO / DINOv2: Self-Distillation and Emergent Visual Features

DINO (Caron et al., 2021)

DINO (Self-DIstillation with NO labels) revealed remarkable emergent properties in self-supervised ViTs.

Method:

  • Teacher-student self-distillation framework; the teacher network is updated via EMA
  • The student sees all crops (both global and local); the teacher sees only global crops
  • The loss function is cross-entropy (soft-label distillation)
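A simplified sketch of the distillation loss, assuming single student and teacher outputs (the full method averages this over all student-crop / teacher-global-crop pairs):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # student_out, teacher_out: (N, K) outputs over K prototype dimensions.
    # Teacher targets are sharpened (small tau_t) and centered to resist collapse.
    p_t = F.softmax((teacher_out.detach() - center) / tau_t, dim=1)
    log_p_s = F.log_softmax(student_out / tau_s, dim=1)
    loss = -(p_t * log_p_s).sum(dim=1).mean()     # soft-label cross-entropy
    # EMA update of the center (0.9 is a typical momentum choice)
    center = 0.9 * center + 0.1 * teacher_out.detach().mean(dim=0)
    return loss, center
```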

Emergent properties:

  • The attention maps of self-supervised ViTs automatically learn object segmentation capabilities
  • Without any pixel-level annotation, attention maps can precisely delineate target objects
  • The self-supervised ViT's [CLS] token encodes global semantic information, while patch tokens encode local information

DINOv2 (Oquab et al., 2024)

  • Trained on a larger-scale dataset (LVD-142M, an automatically curated dataset)
  • Combines DINO's self-distillation with iBOT's masked image modeling
  • Produces features that can be used directly for various downstream tasks (segmentation, depth estimation, retrieval) without fine-tuning
  • Regarded as one of the strongest general-purpose visual feature extractors available today
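As an illustration of "features without fine-tuning", a usage sketch via `torch.hub`; the entry-point name follows the public facebookresearch/dinov2 repository and should be checked against the current repo:

```python
import torch

# Input spatial size must be a multiple of the patch size (14).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
    feats = model(x)                  # (1, 384) global image feature
# Frozen `feats` can feed a linear probe, k-NN retrieval, or a depth head
# without any fine-tuning of the backbone.
```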

Embedding Space Geometry

Cosine Similarity and Metric Learning

In representation learning, cosine similarity is the most commonly used similarity metric:

\[\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}\]

Why cosine rather than Euclidean distance?

  • Cosine similarity is insensitive to vector magnitude, focusing only on direction
  • In high-dimensional spaces, Euclidean distance suffers from the "curse of dimensionality" (all pairwise distances converge to the same value)
  • Cosine similarity operates on the normalized hypersphere, making it better suited for contrastive learning

The Isotropy Problem

An ideal embedding space should be isotropic: representations are uniformly distributed on the hypersphere with roughly equal variance in all directions.

Anisotropy problem: In practice, embeddings tend to occupy a narrow conical region (anisotropic), leading to:

  • Uniformly high similarity scores with reduced discriminability
  • Compression of the effective range of cosine similarity

Solutions:

  • Post-processing: mean removal + whitening
  • During training: adding a uniformity loss
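A sketch of the post-processing route (mean removal followed by ZCA whitening; `eps` guards against near-zero singular values):

```python
import torch

def whiten(X, eps=1e-5):
    # X: (N, D) embedding matrix. Remove the mean, then apply ZCA whitening
    # so the transformed embeddings have (approximately) identity covariance.
    X = X - X.mean(dim=0, keepdim=True)
    cov = X.t() @ X / (X.size(0) - 1)
    U, S, _ = torch.linalg.svd(cov)
    W = U @ torch.diag(1.0 / torch.sqrt(S + eps)) @ U.t()
    return X @ W
```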

Representation Collapse

Representation collapse is the central failure mode of self-supervised representation learning: all inputs are mapped to the same point (or a low-dimensional subspace).

Types of collapse:

  1. Complete collapse: All representations are identical, \(f(x) = c, \forall x\)
  2. Dimensional collapse: Representations use only a small subset of the available dimensions

Anti-collapse strategies:

| Strategy | Representative Methods | Mechanism |
| --- | --- | --- |
| Negative samples | SimCLR, MoCo | Explicitly push apart different samples |
| Asymmetric architecture | BYOL, SimSiam | Prediction head breaks symmetry |
| Information maximization | Barlow Twins, VICReg | Maximize independence between feature dimensions |
| Clustering | SwAV, DINO | Avoid collapse via online clustering |
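Dimensional collapse can be diagnosed by inspecting the singular-value spectrum of a batch of embeddings; one common summary is the entropy-based effective rank, sketched here:

```python
import torch

def effective_rank(Z, eps=1e-12):
    # Z: (N, D) embeddings. Entropy of the normalized singular-value
    # spectrum; exp(entropy) far below D signals dimensional collapse.
    Z = Z - Z.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(Z)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()
```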

Multimodal Representation Alignment

Cross-Modal Retrieval

A core application of multimodal representation learning is cross-modal retrieval:

  • Image-to-text retrieval: Given an image, retrieve the most relevant text descriptions
  • Text-to-image retrieval: Given text, retrieve the most matching images

Implementation: map different modalities into a shared embedding space and rank by cosine similarity.
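A minimal sketch of that ranking step, assuming the two encoders have already produced embeddings in the shared space:

```python
import torch
import torch.nn.functional as F

def cross_modal_retrieve(query_emb, gallery_emb, top_k=5):
    # query_emb: (Q, D), e.g. image embeddings; gallery_emb: (G, D), e.g.
    # text embeddings, both from encoders trained into a shared space.
    q = F.normalize(query_emb, dim=1)
    g = F.normalize(gallery_emb, dim=1)
    sims = q @ g.t()                        # (Q, G) cosine similarities
    scores, idx = sims.topk(top_k, dim=1)   # top-k gallery items per query
    return scores, idx
```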

ALIGN (Jia et al., 2021)

  • Trained on 1.8 billion "noisy" image-text pairs (without careful curation)
  • Demonstrated that data scale can compensate for data quality
  • Dual-encoder architecture (EfficientNet + BERT) with a contrastive learning objective
  • Competitive with CLIP on zero-shot and retrieval tasks

Challenges of Alignment

  1. Modality Gap: Representations from different modalities tend to occupy distinct regions in the shared space, exhibiting a systematic offset (see the measurement sketch after this list)
  2. Fine-grained alignment: Aligning local image regions with local text phrases (e.g., GLIP, Grounding DINO)
  3. Many-to-many relationships: A single image can correspond to multiple text descriptions and vice versa; naive one-to-one training objectives may be insufficient
  4. Cross-modal hallucination: Models may learn spurious cross-modal correlations and degrade on out-of-distribution data
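The modality gap (challenge 1) can be quantified directly; a minimal sketch using the distance between modality centroids on the unit sphere, one common choice of measurement:

```python
import torch
import torch.nn.functional as F

def modality_gap(img_emb, txt_emb):
    # Distance between the two modality centroids on the unit sphere;
    # a simple way to quantify the systematic offset described above.
    img_c = F.normalize(img_emb, dim=1).mean(dim=0)
    txt_c = F.normalize(txt_emb, dim=1).mean(dim=0)
    return (img_c - txt_c).norm().item()
```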
