Multimodal Large Language Models (MLLMs)
Overview
The core goal of Multimodal Large Language Models (MLLMs) is to endow LLMs with the ability to process multiple modalities (images, video, audio, etc.). The dominant approach today is to connect a vision encoder to an LLM, enabling the LLM to understand both textual and visual information simultaneously.
Core question: How do we project visual features into the semantic space of an LLM?
Based on how the vision encoder is connected to the LLM, MLLM architectures can be divided into three major categories:
MLLM Architecture Taxonomy:
1. Projection-based: Vision Encoder → MLP Projector → LLM
(Simple and direct. Representative: LLaVA)
2. Compression-based: Vision Encoder → Q-Former / Resampler → LLM
(Compresses the number of visual tokens. Representatives: BLIP-2, Flamingo)
3. One-Backbone: Image + Text tokens → Unified Transformer
(Natively multimodal. Representatives: Fuyu, Gemini)
Projection-Based Architecture
Core Idea
A simple projection layer (typically an MLP) maps the vision encoder's output into the LLM's embedding space. The resulting visual tokens are then concatenated with text tokens and fed into the LLM.
Projection-based Architecture:
Image → Vision Encoder (CLIP ViT) → Visual Features [N tokens]
↓
MLP Projector
↓
Visual Tokens [N tokens]
↓
User Text → Tokenizer → Text Tokens → [Visual Tokens; Text Tokens] → LLM → Response
Advantages: Simple structure, efficient training, and full preservation of visual information.
Disadvantages: Large number of visual tokens (e.g., ViT-L/14 at 336x336 resolution outputs 576 tokens), which consumes a significant portion of the LLM's context window.
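To make the data flow concrete, here is a minimal PyTorch-style sketch of the projection step. The dimensions (1024 for a CLIP ViT-L feature, 4096 for a 7B-class LLM) and the module name are illustrative, not taken from any specific codebase: visual features from the frozen encoder are mapped into the LLM embedding space and simply concatenated with the text embeddings.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer MLP that maps vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats):           # [B, N, vision_dim]
        return self.net(visual_feats)          # [B, N, llm_dim]

# Toy forward pass: 576 visual tokens from a ViT-L/14 @ 336px, 32 text tokens.
B, N_vis, N_txt = 1, 576, 32
visual_feats = torch.randn(B, N_vis, 1024)     # output of the frozen vision encoder
text_embeds  = torch.randn(B, N_txt, 4096)     # LLM word embeddings of the prompt

projector = MLPProjector()
visual_tokens = projector(visual_feats)        # [1, 576, 4096]

# Visual tokens are prepended to the text tokens before entering the LLM.
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # [1, 608, 4096]
```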
LLaVA (Large Language and Vision Assistant)
LLaVA, proposed by Liu et al. (2023), is the most representative model of the projection-based architecture.
Architecture: CLIP ViT-L/14 (vision encoder) + linear projection layer (upgraded to a 2-layer MLP in LLaVA-1.5) + Vicuna/LLaMA (LLM)
Two-Stage Training:
| Stage | Training Data | Trainable Modules | Purpose |
|---|---|---|---|
| Stage 1: Pretrain | 558K image-text pairs (filtered LAION/CC/SBU subset) | Projector only | Align the vision-language spaces |
| Stage 2: Finetune | 158K instruction-following data | Projector + LLM | Learn multimodal instruction following |
In Stage 1, both the Vision Encoder and the LLM are frozen; only the MLP Projector is trained to align visual features with the LLM's word embedding space. In Stage 2, the LLM is unfrozen and fine-tuned on multimodal instruction data generated by GPT-4.
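A minimal sketch of how the two stages differ in terms of which parameters receive gradients (the module definitions are placeholders standing in for whatever vision encoder, projector, and LLM are used, not LLaVA's actual classes):

```python
import torch.nn as nn

# Placeholder modules standing in for the real vision encoder, projector, and LLM.
vision_encoder = nn.Linear(1024, 1024)
projector      = nn.Linear(1024, 4096)
llm            = nn.Linear(4096, 4096)

def set_trainable(module: nn.Module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature alignment -- only the projector receives gradients.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: multimodal instruction tuning -- unfreeze the LLM, keep the encoder frozen.
set_trainable(llm, True)
```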
LLaVA-1.5 Improvements:
- Projection upgraded from a single linear layer to a 2-layer MLP with GELU activation
- Higher input resolution (336x336)
- Incorporation of additional academic task data (VQA, OCR, etc.)
LLaVA-NeXT / LLaVA-OneVision:
- Dynamic resolution support: high-resolution images are split into multiple sub-images plus a global thumbnail (see the tiling sketch after this list)
- Video input support: video frames are treated as multi-image input
- Consistently leading performance among open-source MLLMs
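Below is a rough, illustrative sketch of the tiling idea. The tile size and grid choice here are simplified; the actual LLaVA-NeXT "AnyRes" logic selects a grid configuration from a predefined set and handles padding and aspect ratio more carefully.

```python
from PIL import Image

def split_into_tiles(img: Image.Image, tile: int = 336):
    """Split an image into tile x tile crops plus a global thumbnail.

    Illustrative only: real implementations first pick a target grid
    (e.g. 2x2, 1x3, ...) from a predefined set and resize/pad to match it.
    """
    w, h = img.size
    cols, rows = max(1, w // tile), max(1, h // tile)
    resized = img.resize((cols * tile, rows * tile))

    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((tile, tile))       # downscaled global view
    return tiles + [thumbnail]                 # each crop is encoded separately

crops = split_into_tiles(Image.new("RGB", (1008, 672)))
print(len(crops))    # 3 x 2 tiles + 1 thumbnail = 7 sub-images
```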
DeepSeek-VL
- Uses SigLIP as the vision encoder
- Hybrid resolution strategy: low-resolution global view + high-resolution local crops
- Excels at document understanding and OCR tasks
Qwen-VL / Qwen2-VL
- Alibaba's multimodal large model series
- Qwen2-VL supports dynamic resolution and arbitrary aspect ratios
- Introduces Naive Dynamic Resolution: dynamically determines the number of tokens based on the image's native resolution
- Supports video understanding, capable of processing videos over 20 minutes long
Compression-Based Architecture
Core Idea
A "compression module" is inserted between the vision encoder and the LLM to compress a large number of visual tokens into a small set of "query tokens" before feeding them into the LLM.
Compression-based Architecture:
Image → Vision Encoder → Visual Features [N tokens, e.g. 576]
↓
Compression Module (Q-Former / Resampler)
+ Learnable Queries [M tokens, e.g. 32]
↓
Compressed Visual Tokens [M tokens]
↓
[Compressed Visual Tokens; Text Tokens] → LLM → Response
Advantages: Dramatically reduces the number of visual tokens, lowering the computational burden on the LLM.
Disadvantages: Information compression may cause loss of detail, especially on tasks requiring fine-grained understanding.
BLIP-2 (Q-Former)
BLIP-2, proposed by Li et al. (2023), introduces the Q-Former (Querying Transformer) as its core innovation.
Q-Former Architecture:
Q-Former Structure:
Learnable Queries (32)
↓
[Self-Attention] ←→ [Cross-Attention] ← Visual Features (from frozen ViT)
↓
Compressed Visual Representation (32 tokens)
↓
FC Layer → LLM Input Space
The Q-Former is essentially a lightweight Transformer consisting of:
- A set of learnable query embeddings (32)
- Self-attention layers: interaction among queries
- Cross-attention layers: queries extract information from visual features
- Self-attention layers shared with a text branch, allowing queries and text to interact during pretraining
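A simplified, single-block sketch of this compression mechanism (layer norms, the text branch, and the pretraining losses are omitted; dimensions are illustrative): a fixed set of learnable queries attends to the frozen ViT features via cross-attention, so the output length is set by the number of queries rather than by the number of image patches.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One simplified Q-Former block: self-attention over the queries,
    cross-attention from queries to frozen visual features, then an FFN."""
    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, visual_feats):
        q, _ = self.self_attn(queries, queries, queries)            # queries talk to each other
        queries = queries + q
        c, _ = self.cross_attn(queries, visual_feats, visual_feats) # queries read the image
        queries = queries + c
        return queries + self.ffn(queries)

num_queries, dim = 32, 768
learnable_queries = nn.Parameter(torch.randn(1, num_queries, dim))
visual_feats = torch.randn(1, 257, dim)            # frozen ViT output (256 patches + CLS)

block = QFormerBlock(dim)
queries = learnable_queries.expand(visual_feats.size(0), -1, -1)
compressed = block(queries, visual_feats)
print(compressed.shape)                            # [1, 32, 768] -> FC layer -> LLM space
```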
Three Pretraining Objectives (first-stage vision-language representation learning):
- Image-Text Contrastive (ITC): Aligns visual query representations with text representations
- Image-Text Matching (ITM): Binary classification to determine whether image-text pairs match
- Image-grounded Text Generation (ITG): Generates text descriptions conditioned on visual input
In a second pretraining stage, the Q-Former output is passed through the FC layer into a frozen LLM for generative learning. This design allows BLIP-2 to connect to any frozen LLM (e.g., OPT, FlanT5) in a plug-and-play fashion.
Flamingo (Perceiver Resampler)
Flamingo, proposed by Alayrac et al. (2022), is a pioneer in multimodal few-shot learning.
Perceiver Resampler:
Visual Features [N tokens] + Learnable Latent Queries [M tokens]
↓ Cross-Attention (queries aggregate information from the visual features)
Fixed-size Visual Tokens [M tokens]
Difference from Q-Former: The Perceiver Resampler has a simpler structure, using only cross-attention for information compression.
Flamingo's Other Innovation: Gated Cross-Attention
Gated cross-attention layers are interleaved between the frozen LLM's layers, allowing the model to attend to visual information when generating each token:
\[
y = x + \tanh(\alpha)\cdot \text{CrossAttention}(x,\, V)
\]
Here \(x\) denotes the text hidden states, \(V\) the visual tokens from the Perceiver Resampler, and the gate \(\alpha\) is a learnable scalar initialized to 0, so the model starts from pure LLM behavior and gradually learns to leverage visual information.
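A minimal sketch of the tanh-gating mechanism (the gated FFN and the per-image attention masking used in the Flamingo paper are omitted; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention from text hidden states to visual tokens,
    scaled by tanh(alpha) with alpha initialized to 0."""
    def __init__(self, dim=4096, n_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))    # gate starts closed

    def forward(self, text_hidden, visual_tokens):
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        # At initialization tanh(0) = 0, so this layer is an identity mapping
        # and the frozen LLM's behavior is unchanged.
        return text_hidden + torch.tanh(self.alpha) * attended

layer = GatedCrossAttention(dim=64, n_heads=4)
out = layer(torch.randn(1, 10, 64), torch.randn(1, 64, 64))
print(out.shape)    # torch.Size([1, 10, 64])
```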
Flamingo's strength: support for multi-image and multi-turn dialogue in few-shot scenarios.
One-Backbone Architecture
Core Idea
Rather than stitching together a vision encoder and an LLM, the approach is:
Image tokens and text tokens are mixed and jointly processed within a single Transformer from the very first layer.
This is a truly unified "one model for all modalities" approach.
One-Backbone Architecture:
Image → Patch Embedding → Image Tokens ─┐
├→ Unified Transformer → Output
Text → Token Embedding → Text Tokens ─┘
Representative Models
Fuyu (Adept, 2023):
- Eliminates the standalone vision encoder; image patches are linearly projected directly into the Transformer's input space (see the sketch after this list)
- Extremely simple architecture
- Can handle images of arbitrary resolution
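A sketch of this input path (the model dimension is illustrative; the 30x30 patch size is what the released Fuyu-8B reportedly uses): raw patches are flattened and pushed through a single linear layer straight into the decoder's embedding space, so there is no separate pretrained vision tower.

```python
import torch
import torch.nn as nn

patch, d_model = 30, 4096               # 30x30 pixel patches; d_model is illustrative
patch_embed = nn.Linear(patch * patch * 3, d_model)    # the only "vision module"

def image_to_tokens(img: torch.Tensor) -> torch.Tensor:
    """img: [3, H, W] with H and W multiples of the patch size."""
    c, h, w = img.shape
    patches = (
        img.unfold(1, patch, patch)      # [3, H/p, W, p]
           .unfold(2, patch, patch)      # [3, H/p, W/p, p, p]
           .permute(1, 2, 0, 3, 4)       # [H/p, W/p, 3, p, p]
           .reshape(-1, c * patch * patch)
    )
    return patch_embed(patches)          # [num_patches, d_model] -> fed to the decoder

tokens = image_to_tokens(torch.randn(3, 300, 450))
print(tokens.shape)                      # torch.Size([150, 4096]): 10 x 15 patches
```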
Gemini (Google, 2024):
- A natively multimodal Foundation Model
- Images, video, audio, and text are jointly modeled within a single model from the very beginning of training
- Rather than "train an LLM first, then add vision," it is multimodal from the start
Advantages: Richer cross-modal information interaction; avoids the information bottleneck introduced by a projector.
Disadvantages: Requires large-scale multimodal pretraining; pretrained vision encoders cannot be plugged in directly, and existing pretrained LLMs are harder to reuse.
Notable Closed-Source Models
GPT-4V / GPT-4o
GPT-4V (GPT-4 with Vision, 2023):
- OpenAI's multimodal large model with image input support
- Outstanding performance on complex visual reasoning, chart understanding, OCR, and more
- Architecture not publicly disclosed
GPT-4o (2024):
- "o" stands for "omni"
- Natively multimodal: unified processing of text, images, and audio
- Supports real-time voice conversation with extremely low response latency
- Speculated to use a One-Backbone or deep-fusion architecture
Gemini Series
| Model | Key Features |
|---|---|
| Gemini 1.0 Pro/Ultra | Natively multimodal; supports text, image, audio, and video |
| Gemini 1.5 Pro | 1-million-token context window; capable of processing long videos |
| Gemini 2.0 Flash | Low-latency inference; supports real-time multimodal interaction |
Gemini's core differentiator is "native multimodality": multimodal capabilities are not bolted on after the fact but are built in from the pretraining stage.
Notable Open-Source Models
| Model | Organization | Architecture Type | Vision Encoder | LLM Backbone | Key Features |
|---|---|---|---|---|---|
| LLaVA-NeXT | Multi-institution | Projection-based | CLIP/SigLIP | LLaMA/Qwen | Dynamic resolution, continuous iteration |
| InternVL 2.5 | Shanghai AI Lab | Projection-based | InternViT-6B | InternLM2 | Very large vision encoder |
| Qwen2-VL | Alibaba | Projection-based | ViT (in-house) | Qwen2 | Dynamic resolution, video understanding |
| DeepSeek-VL2 | DeepSeek | Projection-based | SigLIP | DeepSeek MoE | MoE + multimodal |
| Phi-3 Vision | Microsoft | Projection-based | CLIP | Phi-3 | Strong performance from a small model |
| CogVLM2 | Zhipu AI | Projection + visual experts | EVA2-CLIP | Llama-3 | Visual expert layers added to the LLM |
MoE for Multimodal Models
Mixture of Experts (MoE) architectures have been incorporated into multimodal large models to achieve more efficient parameter utilization. For a detailed discussion of MoE, see the Scaling and Architecture notes.
Core ideas:
- Inputs from different modalities can be routed to different experts
- Visual tokens and text tokens may activate different FFN experts (a toy routing sketch follows this list)
- Large total parameter counts can be maintained while keeping inference costs under control
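A toy sketch of token-level routing (top-1 routing with no gating weights or load-balancing loss, unlike production MoE layers): visual and text tokens pass through the same layer, and a learned router decides per token which expert FFN runs.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, tokens):                        # [num_tokens, dim]
        logits = self.router(tokens)                  # [num_tokens, n_experts]
        expert_idx = logits.argmax(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(tokens[mask])      # only selected tokens pay this cost
        return out

# Visual and text tokens flow through the same layer; the router decides per token.
moe = TinyMoELayer()
mixed_tokens = torch.cat([torch.randn(576, 512), torch.randn(32, 512)])  # vision + text
print(moe(mixed_tokens).shape)    # torch.Size([608, 512])
```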
Generative Multimodal Models
Generative multimodal models go beyond "look at an image and describe it." They can:
- Understand images, video, and audio
- Generate images, video, and audio
- Edit and modify multimodal content
Core Ideas (Two Approaches)
Approach A: LLM as the brain + external generative models (Diffusion / Flow)
Approach A:
Input (text/image) → Understanding Module → LLM (planning + reasoning) → Generation instructions
↓
External Generative Model (SD / DALL-E) → Generated output
Representatives: Visual ChatGPT, NExT-GPT
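A schematic of Approach A in code form; the <gen_image> tag format and both stand-in functions are invented for illustration and do not correspond to any particular system's API.

```python
import re

def external_image_generator(prompt: str) -> str:
    """Stand-in for an external diffusion / flow model (e.g. a Stable Diffusion pipeline)."""
    return f"<image generated for: {prompt}>"

def run_llm(user_request: str) -> str:
    """Stand-in for the LLM 'brain'. Here it simply emits a hypothetical generation tag."""
    return f'Sure, here is the picture. <gen_image prompt="{user_request}">'

def respond(user_request: str) -> str:
    llm_output = run_llm(user_request)
    # Parse generation instructions and dispatch them to the external generator.
    match = re.search(r'<gen_image prompt="(.*?)">', llm_output)
    if match:
        image = external_image_generator(match.group(1))
        llm_output = llm_output.replace(match.group(0), image)
    return llm_output

print(respond("a cat wearing a space suit"))
```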
Approach B: Unified discrete token (VQ) framework where the LLM directly predicts multimodal tokens
Approach B:
Image → VQ Tokenizer → Discrete Tokens ─┐
├→ Unified Transformer → Output Tokens → Detokenizer
Text → BPE Tokenizer → Text Tokens ─┘
Representatives: Chameleon (Meta), Emu3
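A sketch of the Approach B token flow (vocabulary and codebook sizes are illustrative): image patches become discrete codebook indices that share one vocabulary with text tokens, so a single Transformer can autoregressively predict either kind of token.

```python
import torch

text_vocab_size  = 32_000        # BPE vocabulary
image_vocab_size = 8_192         # VQ codebook entries (illustrative size)
vocab_size = text_vocab_size + image_vocab_size

# Pretend outputs of the two tokenizers.
text_tokens  = torch.randint(0, text_vocab_size, (12,))       # "Draw a cat ..."
image_tokens = torch.randint(0, image_vocab_size, (256,))     # 16x16 grid of VQ codes

# Image codes are shifted into their own id range so both live in one vocabulary.
sequence = torch.cat([text_tokens, image_tokens + text_vocab_size])

# A single decoder-only Transformer is trained with next-token prediction on this
# mixed sequence; at inference, generated ids >= text_vocab_size are routed to the
# VQ detokenizer to reconstruct pixels, and the rest go to the text detokenizer.
print(sequence.shape, sequence.max() < vocab_size)
```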
Representative Models
- GPT-4o: The leading omni model — unified understanding and generation of text, images, and audio
- Gemini series: A full multimodal suite with native support for multimodal input and output
- Janus (DeepSeek): Decouples understanding and generation into two pathways while sharing an LLM backbone
Evaluation Benchmarks
| Benchmark | What It Evaluates | Key Features |
|---|---|---|
| MMBench | Multi-dimensional visual understanding | Covers 20+ dimensions including perception, reasoning, and knowledge |
| MMMU | Multi-discipline multimodal understanding | College-level multimodal Q&A |
| MME | Perception + cognition capabilities | Yes/No question format with clear scoring |
| SEED-Bench | Multimodal understanding (image + video) | Covers 12 evaluation dimensions |
| MathVista | Mathematical visual reasoning | Charts, geometry, and other math-related visual reasoning |
| OCRBench | OCR and document understanding | Text recognition, table understanding, document Q&A |
| RealWorldQA | Real-world understanding | Visual Q&A in real-world scenarios |
| Video-MME | Video understanding | Multi-dimensional evaluation across videos of varying lengths |
Summary
The development trajectory of multimodal large models is clear:
- Projection-based architectures have become the mainstream thanks to their simplicity and efficiency. LLaVA and its follow-up work have demonstrated the effectiveness of MLP projectors.
- Compression-based architectures remain valuable in scenarios where token count must be controlled.
- One-Backbone architectures represent the future direction — native multimodality avoids information bottlenecks between modules.
- Generative multimodal models are evolving from "understanding only" toward a unified "understanding + generation" paradigm.
Current key trends in MLLMs: higher resolution, longer video, more modalities, and stronger reasoning.