Multimodal Learning

Overview

Multimodal learning aims to jointly understand and generate information from different modalities (text, images, audio, video, etc.). From CLIP's contrastive pretraining to LLaVA's visual instruction tuning, multimodal models are moving toward unified agent architectures.

```mermaid
graph TD
    A[Multimodal Learning] --> B[Contrastive Pretraining]
    A --> C[Generative Pretraining]
    A --> D[Multimodal LLMs]
    A --> E[Audio Models]

    B --> B1[CLIP]
    B --> B2[SigLIP]
    B --> B3[ALIGN]

    C --> C1[BLIP]
    C --> C2[BLIP-2]
    C --> C3[CoCa]

    D --> D1[LLaVA]
    D --> D2[GPT-4V]
    D --> D3[Gemini]

    E --> E1[Whisper]
    E --> E2[AudioLM]

    subgraph Fusion Strategies
    F1[Early Fusion]
    F2[Late Fusion]
    F3[Cross-Attention]
    end
```

1. CLIP: Contrastive Language-Image Pre-training

1.1 Core Idea

CLIP (Radford et al., 2021) maps images and text into a shared embedding space through large-scale image-text contrastive learning.

Training Objective:

For a batch of \(N\) image-text pairs \((I_i, T_i)\):

\[ \mathcal{L}_{\text{CLIP}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{i,i}/\tau)}{\sum_{j=1}^{N}\exp(s_{i,j}/\tau)} + \log\frac{\exp(s_{i,i}/\tau)}{\sum_{j=1}^{N}\exp(s_{j,i}/\tau)}\right] \]

where \(s_{i,j} = f_I(I_i)^\top f_T(T_j)\) is the cosine similarity between the L2-normalized image and text embeddings, and \(\tau\) is a learnable temperature.
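
A minimal PyTorch sketch of this symmetric loss (function name and batch size are illustrative; embeddings are L2-normalized so the dot product equals the cosine similarity):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N image-text pairs.

    img_emb, txt_emb: [N, d] embeddings; normalized here so the dot product
    is the cosine similarity s_{i,j}.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                 # s_{i,j} / tau, shape [N, N]
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```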

1.2 Architecture

  • Image encoder: ViT (e.g., ViT-L/14) or ResNet
  • Text encoder: Transformer
  • Projection layer: Maps both encoder outputs to shared space

1.3 Zero-Shot Classification

Image classification without fine-tuning:

  1. Construct text prompts: "a photo of a {class}"
  2. Compute similarity between image and all class texts
  3. Select the class with highest similarity
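
A minimal sketch of this procedure, assuming the image and prompt embeddings have already been produced by a CLIP-style model; `encode_image` / `encode_text` in the usage comment are placeholders, not a specific library API:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor,
                       class_text_embs: torch.Tensor,
                       class_names: list[str],
                       tau: float = 0.01) -> str:
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb:        [d]    embedding of the query image
    class_text_embs:  [C, d] embeddings of prompts like "a photo of a {class}"
    """
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    probs = (class_text_embs @ image_emb / tau).softmax(dim=-1)  # [C] class probabilities
    return class_names[int(probs.argmax())]

# class_names = ["cat", "dog", "car"]
# prompts = [f"a photo of a {c}" for c in class_names]
# pred = zero_shot_classify(encode_image(img), encode_text(prompts), class_names)
#   (encode_image / encode_text stand in for any CLIP-style encoders)
```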

1.4 CLIP Variants

| Model | Improvement |
| --- | --- |
| OpenCLIP | Open-source reimplementation, larger-scale training |
| SigLIP | Pairwise sigmoid loss instead of softmax; no large batch required |
| ALIGN | Larger-scale noisy web data |
| EVA-CLIP | Stronger visual encoder |
| MetaCLIP | Optimized data curation strategy |
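
For reference, a minimal sketch of the SigLIP-style pairwise sigmoid loss: every (image, text) pair becomes an independent binary classification, so no batch-wide softmax (and hence no very large batch) is needed. The temperature `t` and bias `b` are learnable scalars in the paper, fixed constants here:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss over L2-normalized [N, d] embeddings.

    Matched pairs (the diagonal) get label +1, all other pairs -1;
    the loss is averaged over all N*N pairs in this sketch.
    """
    logits = img_emb @ txt_emb.t() * t + b                              # [N, N]
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1    # +1 on diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()
```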

2. BLIP Series

2.1 BLIP

BLIP (Bootstrapping Language-Image Pre-training): Unified understanding and generation.

Three Pretraining Tasks:

  1. ITC (Image-Text Contrastive): Contrastive learning
  2. ITM (Image-Text Matching): Binary classification matching
  3. LM (Language Modeling): Image-conditioned text generation

CapFilt: A bootstrapping step in which a captioner generates synthetic captions and a filter removes noisy web captions, yielding higher-quality training data.
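
A schematic sketch of how the three pretraining objectives combine into one loss (the real BLIP additionally uses momentum encoders, soft labels, and hard-negative mining for ITM; tensor shapes here are assumptions):

```python
import torch
import torch.nn.functional as F

def blip_pretraining_loss(img_emb, txt_emb, itm_logits, itm_labels,
                          lm_logits, caption_ids, tau: float = 0.07):
    """Combine ITC + ITM + LM as in BLIP-style pretraining.

    img_emb, txt_emb : [N, d]    unimodal embeddings (for ITC), L2-normalized below
    itm_logits       : [M, 2]    match/no-match logits from the fusion encoder
    itm_labels       : [M]       1 for matched pairs, 0 for negatives (long dtype)
    lm_logits        : [N, L, V] next-token logits from the caption decoder
    caption_ids      : [N, L]    target caption token ids (long dtype)
    """
    # ITC: symmetric image-text contrastive loss (same form as CLIP)
    img_emb, txt_emb = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / tau
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # ITM: binary classification on fused image-text representations
    itm = F.cross_entropy(itm_logits, itm_labels)

    # LM: image-conditioned autoregressive captioning loss
    lm = F.cross_entropy(lm_logits.flatten(0, 1), caption_ids.flatten())

    return itc + itm + lm
```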

2.2 BLIP-2

BLIP-2 (Li et al., 2023): Bridges frozen vision and language models with a lightweight Q-Former.

Architecture:

\[ \text{Image} \xrightarrow{\text{Frozen ViT}} \text{Features} \xrightarrow{\text{Q-Former}} \text{Visual Tokens} \xrightarrow{\text{Frozen LLM}} \text{Output} \]

Q-Former:

  • A set of learnable query tokens (32)
  • Extracts information from visual features via cross-attention
  • Two-stage training:
    1. Vision-language representation learning (ITC + ITM + LM)
    2. Vision-to-language generative learning (connecting to LLM)

Advantage: Only trains Q-Former (188M parameters), freezing both vision and language models.
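
A minimal sketch of the query/cross-attention idea behind the Q-Former. The real module is a BERT-style transformer with alternating self- and cross-attention layers and shared text inputs; the dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Learnable queries pull information out of frozen ViT features via cross-attention."""

    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: [B, num_patches, dim] from the frozen image encoder
        q = self.queries.unsqueeze(0).expand(vit_features.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, vit_features, vit_features)  # queries attend to image
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return q  # [B, 32, dim] visual tokens to feed into the frozen LLM

# tokens = QFormerSketch()(torch.randn(2, 257, 768))  # -> shape [2, 32, 768]
```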


3. LLaVA: Visual Instruction Tuning

3.1 LLaVA

LLaVA (Liu et al., 2023): A simple yet effective multimodal large model.

Architecture:

\[ \text{Image} \xrightarrow{\text{CLIP ViT}} \text{Visual Features} \xrightarrow{W} \text{Visual Tokens} \xrightarrow{\text{LLM}} \text{Response} \]

where \(W\) is a simple linear projection layer.
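
A sketch of this connector with hypothetical dimensions (CLIP ViT-L features of 1024-d projected to a 4096-d LLM hidden size); LLaVA-1.5 (Section 3.2) replaces the single linear layer with a 2-layer MLP, also shown:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096          # illustrative: CLIP ViT-L hidden -> LLM hidden

# Original LLaVA connector: a single linear projection W.
projector = nn.Linear(vision_dim, llm_dim)

# LLaVA-1.5 connector: a 2-layer MLP with GELU.
projector_v15 = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

visual_features = torch.randn(1, 576, vision_dim)   # e.g. 24x24 patch grid from ViT-L/14 @ 336px
visual_tokens = projector(visual_features)          # [1, 576, 4096], prepended to the text tokens
```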

Two-Stage Training:

  1. Pretraining: Freeze ViT and LLM, train only projection \(W\) (558K image-text pairs)
  2. Instruction tuning: Unfreeze LLM, fine-tune on visual instruction data (665K samples)

3.2 LLaVA-1.5

Improvements:

  • Projection layer upgraded from linear to 2-layer MLP
  • Higher resolution input (336x336)
  • More instruction tuning data
  • Vicuna 13B as the LLM

3.3 LLaVA-NeXT / LLaVA-OneVision

  • Dynamic high resolution: Split the image into multiple tiles (see the sketch after this list)
  • Video understanding support
  • Stronger LLM backbone
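
A simplified sketch of the tile splitting behind dynamic high resolution. LLaVA-NeXT additionally selects a best-fit grid and appends a downsized global view; the tile size here is illustrative:

```python
import torch

def split_into_tiles(image: torch.Tensor, tile: int = 336) -> torch.Tensor:
    """Cut a [C, H, W] image into non-overlapping tile x tile crops.

    Each tile is then encoded by the ViT separately and the resulting visual
    tokens are concatenated. Assumes H and W are multiples of `tile`.
    """
    c, h, w = image.shape
    tiles = image.unfold(1, tile, tile).unfold(2, tile, tile)      # [C, H/t, W/t, t, t]
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)

# split_into_tiles(torch.randn(3, 672, 672)).shape  # -> torch.Size([4, 3, 336, 336])
```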

4. Multimodal Fusion Strategies

4.1 Three Fusion Approaches

```mermaid
graph TD
    subgraph "Early Fusion"
    A1[Image Tokens] --> M1[Combined Input]
    A2[Text Tokens] --> M1
    M1 --> T1[Unified Transformer]
    end

    subgraph "Late Fusion"
    B1[Image] --> E1[Image Encoder]
    B2[Text] --> E2[Text Encoder]
    E1 --> F1[Fusion Layer]
    E2 --> F1
    end

    subgraph "Cross-Attention"
    C1[Image Features] --> CA[Cross-Attention]
    C2[Text Features] --> CA
    CA --> O1[Output]
    end
```

4.2 Detailed Comparison

| Fusion Type | Description | Pros | Cons | Representative Models |
| --- | --- | --- | --- | --- |
| Early fusion | Concatenate all modality tokens and process them jointly | Full cross-modal interaction | High compute | Fuyu, Gemini |
| Late fusion | Encode modalities independently, fuse at the end | Efficient, modular | Shallow interaction | CLIP |
| Cross-attention | One modality queries another | Balances efficiency and interaction | Requires attention design | Flamingo, BLIP-2 |
| Projection | Linear/MLP projection into the LLM token space | Simple, efficient | Information compression | LLaVA |
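
A minimal sketch of the cross-attention pattern from the table above, where text (or query) tokens attend to visual features; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text-queries-image cross-attention with a residual connection."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: [B, T, dim] queries; image_feats: [B, P, dim] keys/values
        fused, _ = self.attn(text_feats, image_feats, image_feats)
        return self.norm(text_feats + fused)   # text tokens enriched with visual context

# CrossAttentionFusion()(torch.randn(2, 16, 768), torch.randn(2, 257, 768)).shape  # [2, 16, 768]
```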

4.3 Visual Token Processing

| Method | Visual Tokens | Information Retention | Compute Overhead |
| --- | --- | --- | --- |
| Full retention | 576+ (ViT-L/14 @ 336px) | Most complete | Highest |
| Q-Former | 32-64 | Medium | Low |
| Perceiver Resampler | 64-256 | Medium | Medium |
| Downsampling / pooling | Adjustable | Adjustable | Low |
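
A minimal sketch of the downsampling/pooling row, shrinking a visual token sequence by average pooling. Real implementations usually pool over the 2D patch grid; this 1D version over the flattened sequence is a simplification:

```python
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens: torch.Tensor, out_len: int = 144) -> torch.Tensor:
    """Reduce [B, N, d] visual tokens to [B, out_len, d] by average pooling
    along the sequence (e.g. 576 ViT-L/14@336 tokens -> 144)."""
    pooled = F.adaptive_avg_pool1d(tokens.transpose(1, 2), out_len)  # pool over the token axis
    return pooled.transpose(1, 2)

# pool_visual_tokens(torch.randn(2, 576, 1024)).shape  # -> torch.Size([2, 144, 1024])
```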

5. Audio Multimodal: Whisper

5.1 Whisper

Whisper (Radford et al., 2023): Large-scale weakly supervised speech recognition.

Key Design:

  • 680,000 hours of multilingual audio data
  • Encoder-decoder Transformer architecture
  • Multi-task training: transcription, translation, language detection, timestamps

Input Processing:

  1. Audio → log-Mel spectrogram (80 mel channels, 30-second windows)
  2. Two 1D convolution layers
  3. Sinusoidal positional encoding
  4. Transformer encoder
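
A sketch of this convolutional stem, loosely following the base model's dimensions (sinusoidal positional encodings and the Transformer encoder blocks are omitted; sizes are illustrative):

```python
import torch
import torch.nn as nn

class AudioFrontEnd(nn.Module):
    """Log-Mel spectrogram -> two 1D convs; the second conv's stride of 2 halves the time axis."""

    def __init__(self, n_mels: int = 80, dim: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.act = nn.GELU()

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: [B, 80, 3000] log-Mel frames for a 30-second window
        x = self.act(self.conv1(mel))
        x = self.act(self.conv2(x))          # [B, dim, 1500]
        return x.transpose(1, 2)             # [B, 1500, dim] tokens for the Transformer encoder

# AudioFrontEnd()(torch.randn(1, 80, 3000)).shape  # -> torch.Size([1, 1500, 512])
```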

5.2 Audio-Language Models

| Model | Capability |
| --- | --- |
| Whisper | Speech recognition, translation |
| AudioLM | Audio generation |
| MusicLM | Music generation |
| Qwen-Audio | Audio understanding + dialogue |
| SALMONN | Speech + audio + music understanding |

6. Unified Multimodal Architectures

6.1 Current Mainstream Architectures

| Model | Vision Encoder | Connector | LLM | Highlights |
| --- | --- | --- | --- | --- |
| GPT-4V/4o | Undisclosed | Undisclosed | GPT-4 | Strongest commercial model |
| Gemini | Native multimodal | Not needed | - | End-to-end multimodal |
| Claude 3 | Undisclosed | Undisclosed | Claude | Strong visual understanding |
| LLaVA-1.5 | CLIP ViT-L | MLP | Vicuna | Open-source benchmark |
| InternVL | InternViT | QLLaMA | InternLM | Strong open-source model |
| Qwen-VL | ViT + compression | Cross-attention | Qwen | Strong Chinese-language support |
6.2 Trends

  1. Native multimodality: Moving beyond "vision encoder + LLM" concatenation toward unified architectures
  2. Any-to-any: Support for arbitrary modality combinations on both input and output
  3. More modalities: 3D, haptics, robot actions
  4. Real-time interaction: Streaming multimodal dialogue (GPT-4o)

7. Summary

| Method | Year | Core Innovation | Impact |
| --- | --- | --- | --- |
| CLIP | 2021 | Contrastive image-text pretraining | Enabled zero-shot visual recognition |
| BLIP-2 | 2023 | Q-Former bridging frozen models | Efficient multimodal training |
| LLaVA | 2023 | Visual instruction tuning | Open-source multimodal benchmark |
| GPT-4V | 2023 | Commercial multimodal model | Demonstrated multimodal potential |
| Gemini | 2024 | Native multimodal design | Unified architecture trend |

References

  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," ICML 2021
  • Li et al., "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models," ICML 2023
  • Liu et al., "Visual Instruction Tuning," NeurIPS 2023
  • Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision," ICML 2023
