Vision Foundation Models
Overview
The development trajectory of vision foundation models mirrors that of NLP: from task-specific CNN models, to general-purpose pretrained Vision Transformers, and then to cross-modal vision-language models.
Core paradigm shift: from "one backbone per task" to "one foundation for many tasks."
Vision Foundation Model Taxonomy:
Supervised Pretraining (ImageNet) → ResNet, ViT
Self-Supervised Pretraining (Contrastive) → SimCLR, MoCo, DINO
Self-Supervised Pretraining (Masked Modeling) → MAE, BEiT
Cross-Modal Pretraining (Image-Text Alignment) → CLIP, SigLIP
Task-Level Foundation (Segmentation) → SAM
Task-Level Foundation (Detection) → Grounding DINO
ViT Recap
Vision Transformer (ViT) serves as the backbone architecture for vision foundation models. For details, refer to the ViT notes.
Core idea: Split an image into fixed-size patches, and feed each patch as a token into a Transformer.
The patch sequence is linearly projected and combined with position embeddings:

\[ z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{\text{pos}} \]

where \(E \in \mathbb{R}^{(P^2 \cdot C) \times D}\) is the patch embedding matrix and \(P\) is the patch size.
The success of ViT demonstrated that Transformers are equally effective in the vision domain and outperform CNNs when trained on large-scale data.
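As a concrete illustration of the patch-embedding step above, here is a minimal PyTorch sketch; the hyperparameters are the standard ViT-Base values, and the class name is illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each to a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel = stride = P is equivalent to flattening each
        # P*P*C patch and multiplying by E in R^{(P^2 * C) x D}.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```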
Self-Supervised Visual Pretraining
MAE (Masked Autoencoder)
MAE, proposed by He et al. (2022), is a landmark work in masked modeling for vision.
Core idea: Randomly mask a large proportion (75%) of image patches, then use an encoder-decoder architecture to reconstruct the masked patches.
MAE Architecture:
Original Image → Split into Patches → Randomly Mask 75%
↓
Visible Patches → ViT Encoder (processes only visible patches, saving compute)
↓
Encodings + Mask Tokens → Lightweight Decoder → Pixel Reconstruction
↓
Compute MSE Loss (only on masked patches)
Reconstruction loss (MSE over masked patches only):

\[ \mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{x}_i - x_i \right\|_2^2 \]

where \(\mathcal{M}\) is the set of masked patch indices, \(x_i\) the original patch pixels, and \(\hat{x}_i\) the reconstruction.
Key design choices in MAE:
- High masking ratio (75%): Forces the model to learn global semantics rather than local texture interpolation
- Asymmetric architecture: Heavy encoder + lightweight decoder; the encoder processes only visible patches
- Pixel reconstruction: Directly predicts pixel values without requiring a tokenizer
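A minimal PyTorch sketch of the two MAE ingredients described above, random masking and the masked-patch MSE loss (illustrative, not the reference implementation):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patches; return the visible subset and the mask."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)   # one score per patch
    ids_shuffle = noise.argsort(dim=1)                # random permutation
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)    # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask

def mae_loss(pred, target, mask):
    """MSE computed only on the masked patches, as in MAE."""
    loss = ((pred - target) ** 2).mean(dim=-1)        # (B, N) per-patch error
    return (loss * mask).sum() / mask.sum()
```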
DINO / DINOv2
DINO (Caron et al., 2021): Self-Distillation with No Labels
Core idea: A student-teacher self-distillation framework that requires neither negative samples nor labels.
DINO Framework:
Image → Global Crops → Teacher Network → p_t
Image → All Crops (global + local) → Student Network → p_s
Loss: Cross-entropy H(p_t, p_s), with teacher outputs centered and sharpened
Teacher: EMA update (θ_t ← m·θ_t + (1-m)·θ_s); no gradients flow to the teacher
A notable property of DINO features is that the self-attention maps automatically delineate semantic parts of objects, giving the learned representations strong interpretability.
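A short PyTorch sketch of the two mechanisms above, the EMA teacher update and the self-distillation loss; temperatures and centering are simplified, and the function names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """theta_t <- m * theta_t + (1 - m) * theta_s; the teacher gets no gradients."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between sharpened teacher targets and student predictions."""
    p_t = F.softmax((teacher_out - center) / t_t, dim=-1).detach()  # centered + sharpened
    log_p_s = F.log_softmax(student_out / t_s, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()
```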
DINOv2 (Oquab et al., 2023):
- Significantly scaled up training data (142M images, automatically curated)
- Combined DINO's self-distillation with iBOT's masked modeling
- Produces features that match or even surpass supervised pretraining on classification, segmentation, depth estimation, and other tasks
- Widely regarded as one of the strongest general-purpose visual feature extractors to date
Contrastive Vision-Language Models
CLIP (Contrastive Language-Image Pretraining)
CLIP, proposed by Radford et al. (2021), is a milestone in cross-modal foundation models.
Core idea: Train with contrastive learning on 400 million image-text pairs to align the representation spaces of images and text.
CLIP Architecture:
Image → Image Encoder (ViT/ResNet) → Image Embedding → L2 Norm → I
Text → Text Encoder (Transformer) → Text Embedding → L2 Norm → T
Contrastive Learning:
T_1 T_2 T_3 ... T_N
I_1 [ cos cos cos ... cos ]
I_2 [ cos cos cos ... cos ]
...
I_N [ cos cos cos ... cos ]
Objective: Maximize diagonal elements, minimize off-diagonal elements
Loss function (symmetric over image-to-text and text-to-image directions):

\[ \mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_i \cdot T_j / \tau)} + \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_j \cdot T_i / \tau)} \right] \]

where \(\tau\) is a learnable temperature.
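In code, the symmetric objective is two cross-entropies over the same N×N similarity matrix; a minimal PyTorch sketch assuming already L2-normalized embeddings:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an N x N cosine-similarity matrix.

    img_emb, txt_emb: (N, D), L2-normalized; row i of each is a matched pair.
    """
    logits = img_emb @ txt_emb.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(len(logits))               # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```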
Applications of CLIP:
- Zero-shot classification: Construct class names as "a photo of a {class}" text prompts and compute similarity with images (see the sketch after this list)
- Image-text retrieval: Perform nearest-neighbor retrieval directly in the embedding space
- Multimodal foundation: Provides the visual encoder for models such as LLaVA and Stable Diffusion
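A sketch of the zero-shot classification recipe from the first bullet; `encode_text` stands in for any CLIP-style text encoder that returns L2-normalized embeddings (an assumption for illustration, not a specific API):

```python
import torch

def zero_shot_classify(image_emb, class_names, encode_text):
    """Score an image against 'a photo of a {class}' prompts.

    image_emb: (D,) L2-normalized image embedding.
    encode_text: callable mapping a list of strings to (K, D) normalized embeddings.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = encode_text(prompts)          # (K, D)
    sims = text_emb @ image_emb              # cosine similarities, (K,)
    probs = sims.softmax(dim=-1)             # scores over the K class names
    return class_names[probs.argmax().item()], probs
```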
SigLIP (Sigmoid Loss for Language-Image Pretraining)
An improvement on CLIP by Zhai et al. (2023):
- Replaces Softmax with Sigmoid, turning contrastive learning into independent binary classification problems
- Eliminates the need for global in-batch softmax normalization
- Better suited for large-scale distributed training
Loss:

\[ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \Big[ y_{ij} \log \sigma(t\, I_i \cdot T_j + b) + (1 - y_{ij}) \log\big(1 - \sigma(t\, I_i \cdot T_j + b)\big) \Big] \]

where \(y_{ij} = 1\) when \(i = j\) (matched pair) and \(y_{ij} = 0\) otherwise, \(\sigma\) is the sigmoid, and \(t\), \(b\) are a learnable temperature and bias.
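In code, each image-text pair becomes an independent binary term; a minimal PyTorch sketch using the equivalent ±1 labeling (t and b are the normally learnable temperature and bias):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Independent sigmoid binary classification per (image, text) pair.

    img_emb, txt_emb: (N, D), L2-normalized; row i of each is a matched pair,
    and every off-diagonal combination is a negative.
    """
    logits = img_emb @ txt_emb.t() * t + b        # (N, N), no global softmax needed
    labels = 2 * torch.eye(len(logits)) - 1       # +1 on diagonal, -1 elsewhere
    # log sigmoid(label * logit), summed over all N^2 pairs, averaged per image
    return -F.logsigmoid(labels * logits).sum() / len(logits)
```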
OpenCLIP
An open-source reproduction of CLIP that supports a wider range of model scales and training datasets (e.g., LAION-5B).
Segmentation Foundation: SAM (Segment Anything Model)
SAM, proposed by Kirillov et al. (2023), is the first general-purpose foundation model for image segmentation.
Core Idea
Build a "promptable" segmentation model: given any prompt (points, boxes, or text), output the corresponding segmentation masks.
SAM Architecture:
Image → ViT Encoder (MAE pretrained) → Image Embedding
↓
Prompt (point/box/text) → Prompt Encoder → Prompt Embedding
↓
Lightweight Mask Decoder
↓
Segmentation Masks
Key Design Choices
- Image Encoder: Uses an MAE-pretrained ViT-H; each image only needs to be encoded once
- Prompt Encoder: Supports multiple prompt types including points, boxes, coarse masks, and text
- Mask Decoder: A lightweight Transformer decoder that outputs multiple candidate masks with confidence scores
- Training Data: The SA-1B dataset, containing 11M images and 1.1B masks
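A usage sketch of the promptable interface with the official segment_anything package; the checkpoint path and image are placeholders, and the API names follow the original release (verify against your installed version):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load an MAE-pretrained ViT-H SAM checkpoint (path is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real HxWx3 RGB image
predictor.set_image(image)                       # encode once, reuse for many prompts

# One foreground point prompt; multimask_output returns candidate masks + scores.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),                  # 1 = foreground, 0 = background
    multimask_output=True,
)
```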
SAM 2 (2024)
Extends SAM to the video domain, supporting object segmentation and tracking in videos. Introduces a Memory Bank mechanism to achieve cross-frame consistency.
Detection Foundation
Grounding DINO
An open-vocabulary detection model proposed by Liu et al. (2023) that fuses the DINO detector (a DETR-style detector, unrelated to the self-supervised DINO above) with language features.
Core capability: Given any text description, localize the corresponding objects in an image.
Grounding DINO:
Image → Image Backbone → Image Features ─┐
├→ Cross-modal Fusion → Detection Head → Boxes
Text → Text Encoder → Text Features ─┘
Open-Vocabulary Detection
Traditional object detection can only detect predefined categories. Open-vocabulary detection leverages language features to detect objects of arbitrary categories, as sketched after the list below.
Key approaches:
- OWL-ViT (Google): Uses CLIP's image-text alignment for zero-shot detection
- YOLO-World (2024): Brings open-vocabulary capability to real-time detection frameworks
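An illustrative sketch of the mechanism shared by these approaches, not any specific model's API: score each detected region's embedding against text embeddings of arbitrary category names (`encode_text` is a placeholder for a CLIP-style text encoder):

```python
import torch

def open_vocab_scores(region_embs, class_names, encode_text, temperature=0.07):
    """Label detected regions with arbitrary text categories.

    region_embs: (R, D) L2-normalized embeddings of R detected regions.
    encode_text: placeholder for a text encoder returning (K, D) normalized
                 embeddings; any aligned image-text model works in principle.
    """
    text_embs = encode_text([f"a photo of a {c}" for c in class_names])  # (K, D)
    logits = region_embs @ text_embs.t() / temperature                   # (R, K)
    return logits.softmax(dim=-1)   # per-region distribution over the K names
```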
Vision Foundation Model Comparison
| Model | Type | Pretraining Method | Parameters | Core Capability | Training Data |
|---|---|---|---|---|---|
| ViT-L | Backbone | Supervised (ImageNet-21K) | 307M | Image Classification | 14M images |
| MAE ViT-H | Backbone | Masked Reconstruction | 632M | General Features | ImageNet |
| DINOv2 ViT-g | Backbone | Self-Distillation + Masking | 1.1B | General Features | 142M images |
| CLIP ViT-L/14 | Dual Encoder | Image-Text Contrastive | 428M | Zero-shot Classification | 400M image-text pairs |
| SigLIP | Dual Encoder | Improved Contrastive | Various scales | Zero-shot Classification | WebLI |
| SAM ViT-H | Segmentation Foundation | Interactive Segmentation | 641M | Universal Segmentation | SA-1B (1.1B masks) |
| Grounding DINO | Detection Foundation | Open-Vocabulary Detection | 172M | Text-Guided Localization | Multi-dataset mixture |
Development Trends
Vision foundation models are evolving along the following trends:
- From discriminative to unified: A single model simultaneously supports classification, segmentation, detection, and generation
- From unimodal to multimodal: Visual encoders are increasingly integrated with language models
- From fixed to dynamic resolution: Support for arbitrary resolutions and aspect ratios
- From closed-set to open-vocabulary: Leveraging language features to recognize arbitrary concepts
The future direction of vision foundations is to become the "eyes" of multimodal large models, deeply integrated with LLMs.