Vision Foundation Models
Overview
The development trajectory of vision foundation models mirrors that of NLP: from task-specific CNN models, to general-purpose pretrained Vision Transformers, and then to cross-modal vision-language models.
Core paradigm shift: from "one backbone per task" to "one foundation for many tasks."
Vision Foundation Model Taxonomy:
Supervised Pretraining (ImageNet) → ResNet, ViT
Self-Supervised Pretraining (Contrastive) → SimCLR, MoCo, DINO
Self-Supervised Pretraining (Masked Modeling) → MAE, BEiT
Cross-Modal Pretraining (Image-Text Alignment) → CLIP, SigLIP
Task-Level Foundation (Segmentation) → SAM
Task-Level Foundation (Detection) → Grounding DINO
ViT Recap
Vision Transformer (ViT) serves as the backbone architecture for vision foundation models. For details, refer to the ViT notes.
Core idea: Split an image into fixed-size patches, and feed each patch as a token into a Transformer.
The patch sequence is linearly projected and combined with position embeddings:

\[ z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{\text{pos}} \]

where \(E \in \mathbb{R}^{(P^2 \cdot C) \times D}\) is the patch embedding matrix and \(P\) is the patch size.
The success of ViT demonstrated that Transformers are equally effective in the vision domain and outperform CNNs when trained on large-scale data.
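As a concrete illustration of the patch-embedding step above, here is a minimal PyTorch sketch; the hyperparameters are the standard ViT-Base values, and the class name is illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each to a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel = stride = P is equivalent to flattening each
        # P*P*C patch and multiplying by E in R^{(P^2 * C) x D}.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```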
Self-Supervised Visual Pretraining
MAE (Masked Autoencoder)
MAE, proposed by He et al. (2022), is a landmark work in masked modeling for vision.
Core idea: Randomly mask a large proportion (75%) of image patches, then use an encoder-decoder architecture to reconstruct the masked patches.
MAE Architecture:
Original Image → Split into Patches → Randomly Mask 75%
↓
Visible Patches → ViT Encoder (processes only visible patches, saving compute)
↓
Encodings + Mask Tokens → Lightweight Decoder → Pixel Reconstruction
↓
Compute MSE Loss (only on masked patches)
Reconstruction loss (MSE over masked patches only):

\[ \mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{x}_i - x_i \right\|_2^2 \]

where \(\mathcal{M}\) is the set of masked patch indices, \(x_i\) the original patch pixels, and \(\hat{x}_i\) the reconstruction.
Key design choices in MAE:
- High masking ratio (75%): Forces the model to learn global semantics rather than local texture interpolation
- Asymmetric architecture: Heavy encoder + lightweight decoder; the encoder processes only visible patches
- Pixel reconstruction: Directly predicts pixel values without requiring a tokenizer
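A minimal PyTorch sketch of the two MAE ingredients described above, random masking and the masked-patch MSE loss (illustrative, not the reference implementation):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patches; return the visible subset and the mask."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)   # one score per patch
    ids_shuffle = noise.argsort(dim=1)                # random permutation
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)    # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask

def mae_loss(pred, target, mask):
    """MSE computed only on the masked patches, as in MAE."""
    loss = ((pred - target) ** 2).mean(dim=-1)        # (B, N) per-patch error
    return (loss * mask).sum() / mask.sum()
```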
DINO / DINOv2
DINO (Caron et al., 2021): Self-Distillation with No Labels
Core idea: A student-teacher self-distillation framework that requires neither negative samples nor labels.
DINO Framework:
Image → Global Crops → Teacher Network → p_t
Image → All Crops (global + local) → Student Network → p_s
Loss: Cross-entropy H(p_t, p_s), with teacher outputs centered and sharpened
Teacher: EMA update (θ_t ← m·θ_t + (1-m)·θ_s); no gradients flow to the teacher
A notable property of DINO features is that the self-attention maps automatically delineate semantic parts of objects, giving the learned representations strong interpretability.
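A short PyTorch sketch of the two mechanisms above, the EMA teacher update and the self-distillation loss; temperatures and centering are simplified, and the function names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """theta_t <- m * theta_t + (1 - m) * theta_s; the teacher gets no gradients."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between sharpened teacher targets and student predictions."""
    p_t = F.softmax((teacher_out - center) / t_t, dim=-1).detach()  # centered + sharpened
    log_p_s = F.log_softmax(student_out / t_s, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()
```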
DINOv2 (Oquab et al., 2023):
- Significantly scaled up training data (142M images, automatically curated)
- Combined DINO's self-distillation with iBOT's masked modeling
- Produces features that match or even surpass supervised pretraining on classification, segmentation, depth estimation, and other tasks
- Widely regarded as one of the strongest general-purpose visual feature extractors to date
Contrastive Vision-Language Models
CLIP (Contrastive Language-Image Pretraining)
CLIP, proposed by Radford et al. (2021), is a milestone in cross-modal foundation models.
Core idea: Train with contrastive learning on 400 million image-text pairs to align the representation spaces of images and text.
CLIP Architecture:
Image → Image Encoder (ViT/ResNet) → Image Embedding → L2 Norm → I
Text → Text Encoder (Transformer) → Text Embedding → L2 Norm → T
Contrastive Learning:
T_1 T_2 T_3 ... T_N
I_1 [ cos cos cos ... cos ]
I_2 [ cos cos cos ... cos ]
...
I_N [ cos cos cos ... cos ]
Objective: Maximize diagonal elements, minimize off-diagonal elements
Loss function (symmetric over image-to-text and text-to-image directions):

\[ \mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_i \cdot T_j / \tau)} + \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_j \cdot T_i / \tau)} \right] \]

where \(\tau\) is a learnable temperature.
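In code, the symmetric objective is two cross-entropies over the same N×N similarity matrix; a minimal PyTorch sketch assuming already L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an N x N cosine-similarity matrix.

    img_emb, txt_emb: (N, D), L2-normalized; row i of each is a matched pair.
    """
    logits = img_emb @ txt_emb.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(len(logits))               # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```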
Applications of CLIP:
- Zero-shot classification: Construct class names as "a photo of a {class}" text prompts and compute similarity with images (see the sketch after this list)
- Image-text retrieval: Perform nearest-neighbor retrieval directly in the embedding space
- Multimodal foundation: Provides the visual encoder for models such as LLaVA and Stable Diffusion
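A sketch of the zero-shot classification recipe from the first bullet; `encode_text` stands in for any CLIP-style text encoder that returns L2-normalized embeddings (an assumption for illustration, not a specific API):

```python
import torch

def zero_shot_classify(image_emb, class_names, encode_text):
    """Score an image against 'a photo of a {class}' prompts.

    image_emb: (D,) L2-normalized image embedding.
    encode_text: callable mapping a list of strings to (K, D) normalized embeddings.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = encode_text(prompts)          # (K, D)
    sims = text_emb @ image_emb              # cosine similarities, (K,)
    probs = sims.softmax(dim=-1)             # scores over the K class names
    return class_names[probs.argmax().item()], probs
```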
SigLIP (Sigmoid Loss for Language-Image Pretraining)
An improvement on CLIP by Zhai et al. (2023):
- Replaces Softmax with Sigmoid, turning contrastive learning into independent binary classification problems
- Eliminates the need for global in-batch softmax normalization
- Better suited for large-scale distributed training
Loss:

\[ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \Big[ y_{ij} \log \sigma(t\, I_i \cdot T_j + b) + (1 - y_{ij}) \log\big(1 - \sigma(t\, I_i \cdot T_j + b)\big) \Big] \]

where \(y_{ij} = 1\) when \(i = j\) (matched pair) and \(y_{ij} = 0\) otherwise, \(\sigma\) is the sigmoid, and \(t\), \(b\) are a learnable temperature and bias.
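In code, each image-text pair becomes an independent binary term; a minimal PyTorch sketch using the equivalent ±1 labeling (t and b are the normally learnable temperature and bias):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Independent sigmoid binary classification per (image, text) pair.

    img_emb, txt_emb: (N, D), L2-normalized; row i of each is a matched pair,
    and every off-diagonal combination is a negative.
    """
    logits = img_emb @ txt_emb.t() * t + b        # (N, N), no global softmax needed
    labels = 2 * torch.eye(len(logits)) - 1       # +1 on diagonal, -1 elsewhere
    # log sigmoid(label * logit), summed over all N^2 pairs, averaged per image
    return -F.logsigmoid(labels * logits).sum() / len(logits)
```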
OpenCLIP
An open-source reproduction of CLIP that supports a wider range of model scales and training datasets (e.g., LAION-5B).
Segmentation Foundation: SAM (Segment Anything Model)
SAM, proposed by Kirillov et al. (2023), is the first general-purpose foundation model for image segmentation.
Core Idea
Build a "promptable" segmentation model: given any prompt (points, boxes, or text), output the corresponding segmentation masks.
SAM Architecture:
Image → ViT Encoder (MAE pretrained) → Image Embedding
↓
Prompt (point/box/text) → Prompt Encoder → Prompt Embedding
↓
Lightweight Mask Decoder
↓
Segmentation Masks
Key Design Choices
- Image Encoder: Uses an MAE-pretrained ViT-H; each image only needs to be encoded once
- Prompt Encoder: Supports multiple prompt types including points, boxes, coarse masks, and text
- Mask Decoder: A lightweight Transformer decoder that outputs multiple candidate masks with confidence scores
- Training Data: The SA-1B dataset, containing 11M images and 1.1B masks
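A usage sketch of the promptable interface with the official segment_anything package; the checkpoint path and image are placeholders, and the API names follow the original release (verify against your installed version):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load an MAE-pretrained ViT-H SAM checkpoint (path is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real HxWx3 RGB image
predictor.set_image(image)                       # encode once, reuse for many prompts

# One foreground point prompt; multimask_output returns candidate masks + scores.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),                  # 1 = foreground, 0 = background
    multimask_output=True,
)
```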
SAM 2 (2024)
Extends SAM to the video domain, supporting object segmentation and tracking in videos. Introduces a Memory Bank mechanism to achieve cross-frame consistency.
Detection Foundation
Grounding DINO
An open-vocabulary detection model proposed by Liu et al. (2023) that fuses the DINO detector (a DETR-style detector, unrelated to the self-supervised DINO above) with language features.
Core capability: Given any text description, localize the corresponding objects in an image.
Grounding DINO:
Image → Image Backbone → Image Features ─┐
├→ Cross-modal Fusion → Detection Head → Boxes
Text → Text Encoder → Text Features ─┘
Open-Vocabulary Detection
Traditional object detection can only detect predefined categories. Open-vocabulary detection leverages language features to detect objects of arbitrary categories, as sketched after the list below.
Key approaches:
- OWL-ViT (Google): Uses CLIP's image-text alignment for zero-shot detection
- YOLO-World (2024): Brings open-vocabulary capability to real-time detection frameworks
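An illustrative sketch of the mechanism shared by these approaches, not any specific model's API: score each detected region's embedding against text embeddings of arbitrary category names (`encode_text` is a placeholder for a CLIP-style text encoder):

```python
import torch

def open_vocab_scores(region_embs, class_names, encode_text, temperature=0.07):
    """Label detected regions with arbitrary text categories.

    region_embs: (R, D) L2-normalized embeddings of R detected regions.
    encode_text: placeholder for a text encoder returning (K, D) normalized
                 embeddings; any aligned image-text model works in principle.
    """
    text_embs = encode_text([f"a photo of a {c}" for c in class_names])  # (K, D)
    logits = region_embs @ text_embs.t() / temperature                   # (R, K)
    return logits.softmax(dim=-1)   # per-region distribution over the K names
```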
Vision Foundation Model Comparison
| Model | Type | Pretraining Method | Parameters | Core Capability | Training Data |
|---|---|---|---|---|---|
| ViT-L | Backbone | Supervised (ImageNet-21K) | 307M | Image Classification | 14M images |
| MAE ViT-H | Backbone | Masked Reconstruction | 632M | General Features | ImageNet |
| DINOv2 ViT-g | Backbone | Self-Distillation + Masking | 1.1B | General Features | 142M images |
| CLIP ViT-L/14 | Dual Encoder | Image-Text Contrastive | 428M | Zero-shot Classification | 400M image-text pairs |
| SigLIP | Dual Encoder | Improved Contrastive | Various scales | Zero-shot Classification | WebLI |
| SAM ViT-H | Segmentation Foundation | Interactive Segmentation | 641M | Universal Segmentation | SA-1B (1.1B masks) |
| Grounding DINO | Detection Foundation | Open-Vocabulary Detection | 172M | Text-Guided Localization | Multi-dataset mixture |
Development Trends
Vision foundation models are evolving along the following trends:
- From discriminative to unified: A single model simultaneously supports classification, segmentation, detection, and generation
- From unimodal to multimodal: Visual encoders are increasingly integrated with language models
- From fixed to dynamic resolution: Support for arbitrary resolutions and aspect ratios
- From closed-set to open-vocabulary: Leveraging language features to recognize arbitrary concepts
The future direction of vision foundations is to become the "eyes" of multimodal large models, deeply integrated with LLMs.