Multimodal Learning
Overview
Multimodal learning aims to jointly understand and generate information from different modalities (text, images, audio, video, etc.). From CLIP's contrastive pretraining to LLaVA's visual instruction tuning, multimodal models are moving toward unified agent architectures.
```mermaid
graph TD
    A[Multimodal Learning] --> B[Contrastive Pretraining]
    A --> C[Generative Pretraining]
    A --> D[Multimodal LLMs]
    A --> E[Audio Models]
    B --> B1[CLIP]
    B --> B2[SigLIP]
    B --> B3[ALIGN]
    C --> C1[BLIP]
    C --> C2[BLIP-2]
    C --> C3[CoCa]
    D --> D1[LLaVA]
    D --> D2[GPT-4V]
    D --> D3[Gemini]
    E --> E1[Whisper]
    E --> E2[AudioLM]
    subgraph Fusion Strategies
        F1[Early Fusion]
        F2[Late Fusion]
        F3[Cross-Attention]
    end
```
1. CLIP: Contrastive Language-Image Pre-training
1.1 Core Idea
CLIP (Radford et al., 2021) maps images and text into a shared embedding space through large-scale image-text contrastive learning.
Training Objective:
For a batch of \(N\) image-text pairs \((I_i, T_i)\), CLIP minimizes a symmetric cross-entropy over the \(N \times N\) similarity matrix:

\[
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{i,i}/\tau)}{\sum_{j=1}^{N} \exp(s_{i,j}/\tau)} + \log \frac{\exp(s_{i,i}/\tau)}{\sum_{j=1}^{N} \exp(s_{j,i}/\tau)} \right]
\]

where \(s_{i,j} = f_I(I_i)^\top f_T(T_j)\) is the similarity between image and text embeddings, and \(\tau\) is a learnable temperature.
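A minimal PyTorch sketch of this loss, assuming precomputed embedding batches; the fixed `temperature` stands in for CLIP's learnable log-temperature parameter:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs."""
    # Normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # s_ij / tau, shape (N, N)
    # Matched pairs sit on the diagonal of the similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```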
1.2 Architecture
- Image encoder: ViT (e.g., ViT-L/14) or ResNet
- Text encoder: Transformer
- Projection layer: Maps both encoder outputs to shared space
1.3 Zero-Shot Classification
Image classification without fine-tuning:
- Construct a text prompt for each class: "a photo of a {class}"
- Compute the similarity between the image embedding and every class-prompt embedding
- Select the class with the highest similarity (see the sketch below)
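A sketch of the procedure; `encode_image` and `encode_text` are assumed stand-ins for the CLIP encoders plus projection heads:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, encode_image, encode_text):
    # encode_image: image -> (1, d); encode_text: list[str] -> (C, d)  (assumed)
    prompts = [f"a photo of a {c}" for c in class_names]
    img = F.normalize(encode_image(image), dim=-1)
    txt = F.normalize(encode_text(prompts), dim=-1)
    sims = (img @ txt.T).squeeze(0)        # cosine similarity to each class prompt
    return class_names[sims.argmax().item()]
```

Prompt engineering helps here: the CLIP paper reports that ensembling several templates (e.g., "a drawing of a {class}") improves zero-shot accuracy.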
1.4 CLIP Variants
| Model | Improvement |
|---|---|
| OpenCLIP | Open-source implementation, larger-scale training |
| SigLIP | Sigmoid replaces Softmax, no large batch needed |
| ALIGN | Larger-scale noisy data |
| EVA-CLIP | Stronger visual encoder |
| MetaCLIP | Optimized data curation strategy |
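Among these, SigLIP's change is easy to show in code: each image-text pair becomes an independent binary classification, so no batch-wide softmax (and hence no huge batch) is required. A sketch with fixed `t` and `b`, which are learnable in the paper:

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T * t + b              # scaled pairwise similarities
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1  # +1 for matches, -1 otherwise
    # Every (i, j) pair is scored independently with a sigmoid
    return -F.logsigmoid(labels * logits).sum() / n
```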
2. BLIP Series
2.1 BLIP
BLIP (Bootstrapping Language-Image Pre-training): Unified understanding and generation.
Three Pretraining Tasks (see the code sketch after this list):
- ITC (Image-Text Contrastive): Contrastive learning
- ITM (Image-Text Matching): Binary classification matching
- LM (Language Modeling): Image-conditioned text generation
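A compact sketch of how the three objectives combine; the tensor names and `itm_head` are illustrative assumptions, not BLIP's actual interfaces:

```python
import torch
import torch.nn.functional as F

def blip_losses(img_emb, txt_emb, fused, match_labels, lm_logits, captions, itm_head):
    # ITC: the same symmetric contrastive loss as CLIP (section 1.1)
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
    # ITM: binary match / no-match classifier over fused multimodal features
    itm = F.cross_entropy(itm_head(fused), match_labels)   # fused: (B, d)
    # LM: next-token cross-entropy for image-conditioned caption generation
    lm = F.cross_entropy(lm_logits.flatten(0, 1), captions.flatten())
    return itc + itm + lm
```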
CapFilt (Captioning and Filtering): a captioner generates synthetic captions for web images and a filter removes noisy image-text pairs, bootstrapping higher-quality training data.
2.2 BLIP-2
BLIP-2 (Li et al., 2023): Bridges frozen vision and language models with a lightweight Q-Former.
Architecture: frozen image encoder → Q-Former → frozen LLM.
Q-Former (sketched in code below):
- A set of 32 learnable query tokens
- Extracts information from visual features via cross-attention
- Two-stage training:
  - Stage 1: vision-language representation learning (ITC + ITM + ITG, image-grounded text generation)
  - Stage 2: vision-to-language generative learning (connecting the Q-Former output to the frozen LLM)
Advantage: only the Q-Former (188M parameters) is trained; both the vision and language models stay frozen.
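A stripped-down sketch of the mechanism; the real Q-Former is a BERT-sized stack that also shares its self-attention with a text branch, so treat the dimensions and structure here as illustrative:

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    def __init__(self, d_model=768, n_queries=32, n_heads=12):
        super().__init__()
        # 32 learnable query tokens, shared across all images
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, visual_feats):            # (B, N_patches, d) from a frozen ViT
        q = self.queries.expand(visual_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]      # queries interact with each other
        q = q + self.cross_attn(q, visual_feats, visual_feats)[0]  # pull in visual info
        return q + self.ffn(q)                  # (B, 32, d): compressed visual tokens
```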
3. LLaVA: Visual Instruction Tuning
3.1 LLaVA
LLaVA (Liu et al., 2023): A simple yet effective multimodal large model.
Architecture: a pretrained CLIP ViT vision encoder \(g\), a projection layer \(W\), and an LLM:

\[
H_v = W \cdot Z_v, \qquad Z_v = g(X_v)
\]

where \(X_v\) is the input image, \(Z_v\) its visual features, and \(W\) is a simple linear projection layer that maps them into the LLM's word-embedding space (a code sketch follows the training recipe below).
Two-Stage Training:
- Pretraining: Freeze ViT and LLM, train only projection \(W\) (558K image-text pairs)
- Instruction tuning: Unfreeze LLM, fine-tune on visual instruction data (665K samples)
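The connector really is this small. A sketch assuming ViT-L features (1024-d) and a 4096-d LLM embedding space:

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen ViT patch features Z_v into the LLM's token-embedding space."""
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.W = nn.Linear(vit_dim, llm_dim)   # LLaVA-1.5 swaps this for a 2-layer MLP

    def forward(self, z_v):                    # z_v: (B, 576, vit_dim) patch features
        return self.W(z_v)                     # (B, 576, llm_dim) "visual tokens" H_v
```

During stage-1 pretraining, only `W` receives gradients; the ViT and LLM stay frozen.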
3.2 LLaVA-1.5
Improvements:
- Projection layer upgraded from linear to 2-layer MLP
- Higher-resolution input (336×336)
- More instruction tuning data
- Vicuna 13B as the LLM
3.3 LLaVA-NeXT / LLaVA-OneVision
- Dynamic high resolution: split images into multiple tiles, each encoded separately (see the sketch after this list)
- Video understanding support
- Stronger LLM backbone
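A sketch of the tiling idea, assuming the image dimensions are already multiples of the tile size; real implementations typically also pad and keep a downscaled global view of the whole image:

```python
import torch

def split_into_tiles(image, tile=336):
    # image: (C, H, W) tensor; H and W assumed to be multiples of `tile`
    C, H, W = image.shape
    tiles = [image[:, i:i + tile, j:j + tile]
             for i in range(0, H, tile)
             for j in range(0, W, tile)]
    return torch.stack(tiles)   # (n_tiles, C, tile, tile), each encoded by the ViT
```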
4. Multimodal Fusion Strategies
4.1 Three Fusion Approaches
```mermaid
graph TD
    subgraph "Early Fusion"
        A1[Image Tokens] --> M1[Combined Input]
        A2[Text Tokens] --> M1
        M1 --> T1[Unified Transformer]
    end
    subgraph "Late Fusion"
        B1[Image] --> E1[Image Encoder]
        B2[Text] --> E2[Text Encoder]
        E1 --> F1[Fusion Layer]
        E2 --> F1
    end
    subgraph "Cross-Attention"
        C1[Image Features] --> CA[Cross-Attention]
        C2[Text Features] --> CA
        CA --> O1[Output]
    end
```
4.2 Detailed Comparison
| Fusion Type | Description | Pros | Cons | Representative Models |
|---|---|---|---|---|
| Early Fusion | Concatenate all modality tokens, process jointly | Full interaction | High compute | Fuyu, Gemini |
| Late Fusion | Encode modalities independently, fuse at end | Efficient, modular | Shallow interaction | CLIP |
| Cross-Attention | One modality queries another | Balances efficiency and interaction | Requires attention design | Flamingo, BLIP-2 |
| Projection | Linear/MLP projection to LLM space | Simple, efficient | Information compression | LLaVA |
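To make the cross-attention row concrete, here is a Flamingo-style gated cross-attention layer (a sketch): text hidden states query visual features, and a zero-initialized tanh gate lets the pretrained LLM start out ignoring vision and blend it in gradually during training:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: vision is off at init

    def forward(self, text_h, visual_feats):
        # text_h: (B, T, d) queries; visual_feats: (B, N, d) keys/values
        attended, _ = self.attn(text_h, visual_feats, visual_feats)
        return text_h + torch.tanh(self.gate) * attended
```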
4.3 Visual Token Processing
| Method | Visual Tokens | Information Retention | Compute Overhead |
|---|---|---|---|
| Full retention | 576+ (ViT-L/14@336) | Most complete | Highest |
| Q-Former | 32-64 | Medium | Low |
| Perceiver Resampler | 64-256 | Medium | Medium |
| Downsampling/Pooling | Adjustable | Adjustable | Low |
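The downsampling/pooling row is simple enough to sketch: reshape the 576 ViT-L/14@336 patch tokens back into their 24×24 grid and average-pool it down:

```python
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens, out_hw=8):
    # tokens: (B, 576, d) patch features laid out as a 24x24 grid
    B, N, d = tokens.shape
    hw = int(N ** 0.5)                                   # 24
    grid = tokens.transpose(1, 2).reshape(B, d, hw, hw)  # (B, d, 24, 24)
    pooled = F.adaptive_avg_pool2d(grid, out_hw)         # (B, d, 8, 8)
    return pooled.flatten(2).transpose(1, 2)             # (B, 64, d) visual tokens
```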
5. Audio Multimodal: Whisper
5.1 Whisper
Whisper (Radford et al., 2023): Large-scale weakly supervised speech recognition.
Key Design:
- 680,000 hours of multilingual audio data
- Encoder-decoder Transformer architecture
- Multi-task training: transcription, translation, language detection, timestamps
Input Processing (front end sketched in code below):
- Audio → Mel spectrogram (80 channels, 30-second window)
- Two 1D convolution layers
- Sinusoidal positional encoding
- Transformer encoder
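A sketch of this front end with torchaudio; the STFT parameters (n_fft=400, hop length 160 at 16 kHz) follow the paper's setup, while the log scaling and normalization are simplified:

```python
import torch
import torch.nn.functional as F
import torchaudio

def log_mel_input(waveform, sr=16000, n_mels=80, chunk_s=30):
    # Pad or trim the mono waveform (1, T) to a fixed 30-second window
    n = sr * chunk_s
    waveform = F.pad(waveform, (0, max(0, n - waveform.shape[-1])))[..., :n]
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=400, hop_length=160, n_mels=n_mels)(waveform)
    return torch.log10(mel.clamp(min=1e-10))   # (1, 80, ~3000) log-Mel frames
```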
5.2 Audio-Language Models
| Model | Capability |
|---|---|
| Whisper | Speech recognition, translation |
| AudioLM | Audio generation |
| MusicLM | Music generation |
| Qwen-Audio | Audio understanding + dialogue |
| SALMONN | Speech + audio + music understanding |
6. Unified Multimodal Architectures
6.1 Current Mainstream Architectures
| Model | Vision Encoder | Connector | LLM | Highlights |
|---|---|---|---|---|
| GPT-4V/4o | Undisclosed | Undisclosed | GPT-4 | Strongest commercial model |
| Gemini | Native multimodal | Not needed | - | End-to-end multimodal |
| Claude 3 | Undisclosed | Undisclosed | Claude | Strong visual understanding |
| LLaVA-1.5 | CLIP ViT-L | MLP | Vicuna | Open-source benchmark |
| InternVL | InternViT | QLLaMA | InternLM | Strong open-source model |
| Qwen-VL | ViT + compression | Cross-attention | Qwen | Chinese language advantage |
6.2 Development Trends
- Native multimodal: Moving beyond "vision encoder + LLM" concatenation to unified architectures
- Any-to-any: Support arbitrary modality combinations for input/output
- More modalities: 3D, haptics, robot actions
- Real-time interaction: Streaming multimodal dialogue (GPT-4o)
7. Summary
| Method | Year | Core Innovation | Impact |
|---|---|---|---|
| CLIP | 2021 | Contrastive image-text pretraining | Enabled zero-shot visual recognition |
| BLIP-2 | 2023 | Q-Former bridging | Efficient multimodal |
| LLaVA | 2023 | Visual instruction tuning | Open-source multimodal benchmark |
| GPT-4V | 2023 | Commercial multimodal | Demonstrated multimodal potential |
| Gemini | 2024 | Native multimodal | Unified architecture trend |
References
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," ICML 2021
- Li et al., "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," ICML 2023
- Liu et al., "Visual Instruction Tuning," NeurIPS 2023
- Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision," ICML 2023