Multimodal Large Language Models (MLLMs)
Overview
The core goal of Multimodal Large Language Models (MLLMs) is to endow LLMs with the ability to process multiple modalities (images, video, audio, etc.). The dominant approach today is to connect a vision encoder to an LLM, enabling the LLM to understand both textual and visual information simultaneously.
Core question: How do we project visual features into the semantic space of an LLM?
Based on how the vision encoder is connected to the LLM, MLLM architectures can be divided into three major categories:
MLLM Architecture Taxonomy:
1. Projection-based: Vision Encoder → MLP Projector → LLM
(Simple and direct. Representative: LLaVA)
2. Compression-based: Vision Encoder → Q-Former / Resampler → LLM
(Compresses the number of visual tokens. Representatives: BLIP-2, Flamingo)
3. One-Backbone: Image + Text tokens → Unified Transformer
(Natively multimodal. Representatives: Fuyu, Gemini)
Projection-Based Architecture
Core Idea
A simple projection layer (typically an MLP) maps the vision encoder's output into the LLM's embedding space. The resulting visual tokens are then concatenated with text tokens and fed into the LLM.
Projection-based Architecture:
Image → Vision Encoder (CLIP ViT) → Visual Features [N tokens]
↓
MLP Projector
↓
Visual Tokens [N tokens]
↓
User Text → Tokenizer → Text Tokens → [Visual Tokens; Text Tokens] → LLM → Response
Advantages: Simple structure, efficient training, and full preservation of visual information.
Disadvantages: Large number of visual tokens (e.g., ViT-L/14 at 336x336 resolution outputs 576 tokens), which consumes a significant portion of the LLM's context window.
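To make the data flow concrete, here is a minimal PyTorch-style sketch of the projection step. The dimensions (1024 for a CLIP ViT-L feature, 4096 for a 7B-class LLM) and the module name are illustrative, not taken from any specific codebase: visual features from the frozen encoder are mapped into the LLM embedding space and simply concatenated with the text embeddings.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer MLP that maps vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats):           # [B, N, vision_dim]
        return self.net(visual_feats)          # [B, N, llm_dim]

# Toy forward pass: 576 visual tokens from a ViT-L/14 @ 336px, 32 text tokens.
B, N_vis, N_txt = 1, 576, 32
visual_feats = torch.randn(B, N_vis, 1024)     # output of the frozen vision encoder
text_embeds  = torch.randn(B, N_txt, 4096)     # LLM word embeddings of the prompt

projector = MLPProjector()
visual_tokens = projector(visual_feats)        # [1, 576, 4096]

# Visual tokens are prepended to the text tokens before entering the LLM.
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # [1, 608, 4096]
```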
LLaVA (Large Language and Vision Assistant)
LLaVA, proposed by Liu et al. (2023), is the most representative model of the projection-based architecture.
Architecture: CLIP ViT-L/14 (vision encoder) + linear projection layer (upgraded to a 2-layer MLP in LLaVA-1.5) + Vicuna/LLaMA (LLM)
Two-Stage Training:
| Stage | Training Data | Trainable Modules | Purpose |
|---|---|---|---|
| Stage 1: Pretrain | 558K image-text pairs (filtered LAION/CC/SBU subset) | Projector only | Align the vision-language spaces |
| Stage 2: Finetune | 158K instruction-following data | Projector + LLM | Learn multimodal instruction following |
In Stage 1, both the Vision Encoder and the LLM are frozen; only the MLP Projector is trained to align visual features with the LLM's word embedding space. In Stage 2, the LLM is unfrozen and fine-tuned on multimodal instruction data generated by GPT-4.
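A minimal sketch of how the two stages differ in terms of which parameters receive gradients (the module definitions are placeholders standing in for whatever vision encoder, projector, and LLM are used, not LLaVA's actual classes):

```python
import torch.nn as nn

# Placeholder modules standing in for the real vision encoder, projector, and LLM.
vision_encoder = nn.Linear(1024, 1024)
projector      = nn.Linear(1024, 4096)
llm            = nn.Linear(4096, 4096)

def set_trainable(module: nn.Module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature alignment -- only the projector receives gradients.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: multimodal instruction tuning -- unfreeze the LLM, keep the encoder frozen.
set_trainable(llm, True)
```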
LLaVA-1.5 Improvements:
- Projection upgraded from a single linear layer to a 2-layer MLP with GELU activation
- Higher input resolution (336x336)
- Incorporation of additional academic task data (VQA, OCR, etc.)
LLaVA-NeXT / LLaVA-OneVision:
- Dynamic resolution support: high-resolution images are split into multiple sub-images plus a global thumbnail (see the tiling sketch after this list)
- Video input support: video frames are treated as multi-image input
- Consistently leading performance among open-source MLLMs
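Below is a rough, illustrative sketch of the tiling idea. The tile size and grid choice here are simplified; the actual LLaVA-NeXT "AnyRes" logic selects a grid configuration from a predefined set and handles padding and aspect ratio more carefully.

```python
from PIL import Image

def split_into_tiles(img: Image.Image, tile: int = 336):
    """Split an image into tile x tile crops plus a global thumbnail.

    Illustrative only: real implementations first pick a target grid
    (e.g. 2x2, 1x3, ...) from a predefined set and resize/pad to match it.
    """
    w, h = img.size
    cols, rows = max(1, w // tile), max(1, h // tile)
    resized = img.resize((cols * tile, rows * tile))

    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((tile, tile))       # downscaled global view
    return tiles + [thumbnail]                 # each crop is encoded separately

crops = split_into_tiles(Image.new("RGB", (1008, 672)))
print(len(crops))    # 3 x 2 tiles + 1 thumbnail = 7 sub-images
```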
DeepSeek-VL
- Uses SigLIP as the vision encoder
- Hybrid resolution strategy: low-resolution global view + high-resolution local crops
- Excels at document understanding and OCR tasks
Qwen-VL / Qwen2-VL
- Alibaba's multimodal large model series
- Qwen2-VL supports dynamic resolution and arbitrary aspect ratios
- Introduces Naive Dynamic Resolution: dynamically determines the number of tokens based on the image's native resolution
- Supports video understanding, capable of processing videos over 20 minutes long
Compression-Based Architecture
Core Idea
A "compression module" is inserted between the vision encoder and the LLM to compress a large number of visual tokens into a small set of "query tokens" before feeding them into the LLM.
Compression-based Architecture:
Image → Vision Encoder → Visual Features [N tokens, e.g. 576]
↓
Compression Module (Q-Former / Resampler)
+ Learnable Queries [M tokens, e.g. 32]
↓
Compressed Visual Tokens [M tokens]
↓
[Compressed Visual Tokens; Text Tokens] → LLM → Response
Advantages: Dramatically reduces the number of visual tokens, lowering the computational burden on the LLM.
Disadvantages: Information compression may cause loss of detail, especially on tasks requiring fine-grained understanding.
BLIP-2 (Q-Former)
BLIP-2, proposed by Li et al. (2023), introduces the Q-Former (Querying Transformer) as its core innovation.
Q-Former Architecture:
Q-Former Structure:
Learnable Queries (32)
↓
[Self-Attention] ←→ [Cross-Attention] ← Visual Features (from frozen ViT)
↓
Compressed Visual Representation (32 tokens)
↓
FC Layer → LLM Input Space
The Q-Former is essentially a lightweight Transformer consisting of:
- A set of learnable query embeddings (32)
- Self-attention layers: interaction among queries
- Cross-attention layers: queries extract information from visual features
- Self-attention layers shared with a text branch, allowing queries and text to interact during pretraining
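A simplified, single-block sketch of this compression mechanism (layer norms, the text branch, and the pretraining losses are omitted; dimensions are illustrative): a fixed set of learnable queries attends to the frozen ViT features via cross-attention, so the output length is set by the number of queries rather than by the number of image patches.

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One simplified Q-Former block: self-attention over the queries,
    cross-attention from queries to frozen visual features, then an FFN."""
    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, visual_feats):
        q, _ = self.self_attn(queries, queries, queries)            # queries talk to each other
        queries = queries + q
        c, _ = self.cross_attn(queries, visual_feats, visual_feats) # queries read the image
        queries = queries + c
        return queries + self.ffn(queries)

num_queries, dim = 32, 768
learnable_queries = nn.Parameter(torch.randn(1, num_queries, dim))
visual_feats = torch.randn(1, 257, dim)            # frozen ViT output (256 patches + CLS)

block = QFormerBlock(dim)
queries = learnable_queries.expand(visual_feats.size(0), -1, -1)
compressed = block(queries, visual_feats)
print(compressed.shape)                            # [1, 32, 768] -> FC layer -> LLM space
```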
Three Pretraining Objectives (first-stage vision-language representation learning):
- Image-Text Contrastive (ITC): Aligns visual query representations with text representations
- Image-Text Matching (ITM): Binary classification to determine whether image-text pairs match
- Image-grounded Text Generation (ITG): Generates text descriptions conditioned on visual input
In a second pretraining stage, the Q-Former output is passed through the FC layer into a frozen LLM for generative learning. This design allows BLIP-2 to connect to any frozen LLM (e.g., OPT, FlanT5) in a plug-and-play fashion.
Flamingo (Perceiver Resampler)
Flamingo, proposed by Alayrac et al. (2022), is a pioneer in multimodal few-shot learning.
Perceiver Resampler:
Visual Features [N tokens] + Learnable Latent Queries [M tokens]
↓ Cross-Attention (queries aggregate information from the visual features)
Fixed-size Visual Tokens [M tokens]
Difference from Q-Former: The Perceiver Resampler has a simpler structure, using only cross-attention for information compression.
Flamingo's Other Innovation: Gated Cross-Attention
Gated cross-attention layers are interleaved between the frozen LLM's layers, allowing the model to attend to visual information when generating each token:
\[
y = x + \tanh(\alpha)\cdot \text{CrossAttention}(x,\, V)
\]
Here \(x\) denotes the text hidden states, \(V\) the visual tokens from the Perceiver Resampler, and the gate \(\alpha\) is a learnable scalar initialized to 0, so the model starts from pure LLM behavior and gradually learns to leverage visual information.
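A minimal sketch of the tanh-gating mechanism (the gated FFN and the per-image attention masking used in the Flamingo paper are omitted; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention from text hidden states to visual tokens,
    scaled by tanh(alpha) with alpha initialized to 0."""
    def __init__(self, dim=4096, n_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))    # gate starts closed

    def forward(self, text_hidden, visual_tokens):
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        # At initialization tanh(0) = 0, so this layer is an identity mapping
        # and the frozen LLM's behavior is unchanged.
        return text_hidden + torch.tanh(self.alpha) * attended

layer = GatedCrossAttention(dim=64, n_heads=4)
out = layer(torch.randn(1, 10, 64), torch.randn(1, 64, 64))
print(out.shape)    # torch.Size([1, 10, 64])
```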
Flamingo's strength: support for multi-image and multi-turn dialogue in few-shot scenarios.
One-Backbone Architecture
Core Idea
Rather than stitching together a vision encoder and an LLM, the approach is:
Image tokens and text tokens are mixed and jointly processed within a single Transformer from the very first layer.
This is a truly unified "one model for all modalities" approach.
One-Backbone Architecture:
Image → Patch Embedding → Image Tokens ─┐
├→ Unified Transformer → Output
Text → Token Embedding → Text Tokens ─┘
Representative Models
Fuyu (Adept, 2023):
- Eliminates the standalone vision encoder; image patches are linearly projected directly into the Transformer's input space (see the sketch after this list)
- Extremely simple architecture
- Can handle images of arbitrary resolution
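A sketch of this input path (the model dimension is illustrative; the 30x30 patch size is what the released Fuyu-8B reportedly uses): raw patches are flattened and pushed through a single linear layer straight into the decoder's embedding space, so there is no separate pretrained vision tower.

```python
import torch
import torch.nn as nn

patch, d_model = 30, 4096               # 30x30 pixel patches; d_model is illustrative
patch_embed = nn.Linear(patch * patch * 3, d_model)    # the only "vision module"

def image_to_tokens(img: torch.Tensor) -> torch.Tensor:
    """img: [3, H, W] with H and W multiples of the patch size."""
    c, h, w = img.shape
    patches = (
        img.unfold(1, patch, patch)      # [3, H/p, W, p]
           .unfold(2, patch, patch)      # [3, H/p, W/p, p, p]
           .permute(1, 2, 0, 3, 4)       # [H/p, W/p, 3, p, p]
           .reshape(-1, c * patch * patch)
    )
    return patch_embed(patches)          # [num_patches, d_model] -> fed to the decoder

tokens = image_to_tokens(torch.randn(3, 300, 450))
print(tokens.shape)                      # torch.Size([150, 4096]): 10 x 15 patches
```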
Gemini (Google, 2024):
- A natively multimodal Foundation Model
- Images, video, audio, and text are jointly modeled within a single model from the very beginning of training
- Rather than "train an LLM first, then add vision," it is multimodal from the start
Advantages: Richer cross-modal information interaction; avoids the information bottleneck introduced by a projector.
Disadvantages: Requires large-scale multimodal pretraining; pretrained vision encoders cannot be plugged in directly, and existing pretrained LLMs are harder to reuse.
Notable Closed-Source Models
GPT-4V / GPT-4o
GPT-4V (GPT-4 with Vision, 2023):
- OpenAI's multimodal large model with image input support
- Outstanding performance on complex visual reasoning, chart understanding, OCR, and more
- Architecture not publicly disclosed
GPT-4o (2024):
- "o" stands for "omni"
- Natively multimodal: unified processing of text, images, and audio
- Supports real-time voice conversation with extremely low response latency
- Speculated to use a One-Backbone or deep-fusion architecture
Gemini Series
| Model | Key Features |
|---|---|
| Gemini 1.0 Pro/Ultra | Natively multimodal; supports text, image, audio, and video |
| Gemini 1.5 Pro | 1-million-token context window; capable of processing long videos |
| Gemini 2.0 Flash | Low-latency inference; supports real-time multimodal interaction |
Gemini's core differentiator is "native multimodality": multimodal capabilities are not bolted on after the fact but are built in from the pretraining stage.
Notable Open-Source Models
| Model | Organization | Architecture Type | Vision Encoder | LLM Backbone | Key Features |
|---|---|---|---|---|---|
| LLaVA-NeXT | Multi-institution | Projection-based | CLIP/SigLIP | LLaMA/Qwen | Dynamic resolution, continuous iteration |
| InternVL 2.5 | Shanghai AI Lab | Projection-based | InternViT-6B | InternLM2 | Very large vision encoder |
| Qwen2-VL | Alibaba | Projection-based | ViT (in-house) | Qwen2 | Dynamic resolution, video understanding |
| DeepSeek-VL2 | DeepSeek | Projection-based | SigLIP | DeepSeek MoE | MoE + multimodal |
| Phi-3 Vision | Microsoft | Projection-based | CLIP | Phi-3 | Strong performance from a small model |
| CogVLM2 | Zhipu AI | Projection + visual experts | EVA2-CLIP | Llama-3 | Visual expert layers added to the LLM |
MoE for Multimodal Models
Mixture of Experts (MoE) architectures have been incorporated into multimodal large models to achieve more efficient parameter utilization. For a detailed discussion of MoE, see the Scaling and Architecture notes.
Core ideas:
- Inputs from different modalities can be routed to different experts
- Visual tokens and text tokens may activate different FFN experts (a toy routing sketch follows this list)
- Large total parameter counts can be maintained while keeping inference costs under control
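A toy sketch of token-level routing (top-1 routing with no gating weights or load-balancing loss, unlike production MoE layers): visual and text tokens pass through the same layer, and a learned router decides per token which expert FFN runs.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, tokens):                        # [num_tokens, dim]
        logits = self.router(tokens)                  # [num_tokens, n_experts]
        expert_idx = logits.argmax(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(tokens[mask])      # only selected tokens pay this cost
        return out

# Visual and text tokens flow through the same layer; the router decides per token.
moe = TinyMoELayer()
mixed_tokens = torch.cat([torch.randn(576, 512), torch.randn(32, 512)])  # vision + text
print(moe(mixed_tokens).shape)    # torch.Size([608, 512])
```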
Generative Multimodal Models
Generative multimodal models go beyond "look at an image and describe it." They can:
- Understand images, video, and audio
- Generate images, video, and audio
- Edit and modify multimodal content
Core Ideas (Two Approaches)
Approach A: LLM as the brain + external generative models (Diffusion / Flow)
Approach A:
Input (text/image) → Understanding Module → LLM (planning + reasoning) → Generation instructions
↓
External Generative Model (SD / DALL-E) → Generated output
Representatives: Visual ChatGPT, NExT-GPT
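A schematic of Approach A in code form; the <gen_image> tag format and both stand-in functions are invented for illustration and do not correspond to any particular system's API.

```python
import re

def external_image_generator(prompt: str) -> str:
    """Stand-in for an external diffusion / flow model (e.g. a Stable Diffusion pipeline)."""
    return f"<image generated for: {prompt}>"

def run_llm(user_request: str) -> str:
    """Stand-in for the LLM 'brain'. Here it simply emits a hypothetical generation tag."""
    return f'Sure, here is the picture. <gen_image prompt="{user_request}">'

def respond(user_request: str) -> str:
    llm_output = run_llm(user_request)
    # Parse generation instructions and dispatch them to the external generator.
    match = re.search(r'<gen_image prompt="(.*?)">', llm_output)
    if match:
        image = external_image_generator(match.group(1))
        llm_output = llm_output.replace(match.group(0), image)
    return llm_output

print(respond("a cat wearing a space suit"))
```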
Approach B: Unified discrete token (VQ) framework where the LLM directly predicts multimodal tokens
Approach B:
Image → VQ Tokenizer → Discrete Tokens ─┐
├→ Unified Transformer → Output Tokens → Detokenizer
Text → BPE Tokenizer → Text Tokens ─┘
Representatives: Chameleon (Meta), Emu3
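A sketch of the Approach B token flow (vocabulary and codebook sizes are illustrative): image patches become discrete codebook indices that share one vocabulary with text tokens, so a single Transformer can autoregressively predict either kind of token.

```python
import torch

text_vocab_size  = 32_000        # BPE vocabulary
image_vocab_size = 8_192         # VQ codebook entries (illustrative size)
vocab_size = text_vocab_size + image_vocab_size

# Pretend outputs of the two tokenizers.
text_tokens  = torch.randint(0, text_vocab_size, (12,))       # "Draw a cat ..."
image_tokens = torch.randint(0, image_vocab_size, (256,))     # 16x16 grid of VQ codes

# Image codes are shifted into their own id range so both live in one vocabulary.
sequence = torch.cat([text_tokens, image_tokens + text_vocab_size])

# A single decoder-only Transformer is trained with next-token prediction on this
# mixed sequence; at inference, generated ids >= text_vocab_size are routed to the
# VQ detokenizer to reconstruct pixels, and the rest go to the text detokenizer.
print(sequence.shape, sequence.max() < vocab_size)
```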
Representative Models
- GPT-4o: The leading omni model — unified understanding and generation of text, images, and audio
- Gemini series: A full multimodal suite with native support for multimodal input and output
- Janus (DeepSeek): Decouples understanding and generation into two pathways while sharing an LLM backbone
Evaluation Benchmarks
| Benchmark | What It Evaluates | Key Features |
|---|---|---|
| MMBench | Multi-dimensional visual understanding | Covers 20+ dimensions including perception, reasoning, and knowledge |
| MMMU | Multi-discipline multimodal understanding | College-level multimodal Q&A |
| MME | Perception + cognition capabilities | Yes/No question format with clear scoring |
| SEED-Bench | Multimodal understanding (image + video) | Covers 12 evaluation dimensions |
| MathVista | Mathematical visual reasoning | Charts, geometry, and other math-related visual reasoning |
| OCRBench | OCR and document understanding | Text recognition, table understanding, document Q&A |
| RealWorldQA | Real-world understanding | Visual Q&A in real-world scenarios |
| Video-MME | Video understanding | Multi-dimensional evaluation across videos of varying lengths |
Summary
The development trajectory of multimodal large models is clear:
- Projection-based architectures have become the mainstream thanks to their simplicity and efficiency. LLaVA and its follow-up work have demonstrated the effectiveness of MLP projectors.
- Compression-based architectures remain valuable in scenarios where token count must be controlled.
- One-Backbone architectures represent the future direction — native multimodality avoids information bottlenecks between modules.
- Generative multimodal models are evolving from "understanding only" toward a unified "understanding + generation" paradigm.
Current key trends in MLLMs: higher resolution, longer video, more modalities, and stronger reasoning.