Multimodal Large Language Models (MLLMs)

Overview

The core goal of Multimodal Large Language Models (MLLMs) is to endow LLMs with the ability to process multiple modalities (images, video, audio, etc.). The dominant approach today is to connect a vision encoder to an LLM, enabling the LLM to understand both textual and visual information simultaneously.

Core question: How do we project visual features into the semantic space of an LLM?

Based on how the vision encoder is connected to the LLM, MLLM architectures can be divided into three major categories:

MLLM Architecture Taxonomy:

1. Projection-based: Vision Encoder → MLP Projector → LLM
   (Simple and direct. Representative: LLaVA)

2. Compression-based: Vision Encoder → Q-Former / Resampler → LLM
   (Compresses the number of visual tokens. Representatives: BLIP-2, Flamingo)

3. One-Backbone: Image + Text tokens → Unified Transformer
   (Natively multimodal. Representatives: Fuyu, Gemini)

Projection-Based Architecture

Core Idea

A simple projection layer (typically an MLP) maps the vision encoder's output into the LLM's embedding space. The resulting visual tokens are then concatenated with text tokens and fed into the LLM.

Projection-based Architecture:

Image → Vision Encoder (CLIP ViT) → Visual Features [N tokens]
                                          ↓
                                    MLP Projector
                                          ↓
                                   Visual Tokens [N tokens]
                                          ↓
User Text → Tokenizer → Text Tokens → [Visual Tokens; Text Tokens] → LLM → Response

Advantages: Simple structure, efficient training, and full preservation of visual information.

Disadvantages: Large number of visual tokens (e.g., CLIP ViT-L/14 at 336×336 input yields 576 tokens), which consumes a significant portion of the LLM's context window.
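
The whole pipeline reduces to a few lines. A minimal PyTorch sketch, with illustrative module names and dimensions (not LLaVA's actual code):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.net(visual_feats)

# Toy dimensions: 1024-d ViT features, a 4096-d LLM.
projector = MLPProjector(vision_dim=1024, llm_dim=4096)
visual_feats = torch.randn(1, 576, 1024)   # [batch, N visual tokens, vision_dim]
text_embeds = torch.randn(1, 32, 4096)     # [batch, T text tokens, llm_dim]

visual_tokens = projector(visual_feats)                     # [1, 576, 4096]
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # [1, 608, 4096] -> LLM
```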

LLaVA (Large Language and Vision Assistant)

LLaVA, proposed by Liu et al. (2023), is the most representative model of the projection-based architecture.

Architecture: CLIP ViT-L/14 (vision encoder) + projection layer (a single linear layer in the original LLaVA) + Vicuna/LLaMA (LLM)

Two-Stage Training:

| Stage | Training Data | Trainable Modules | Purpose |
| --- | --- | --- | --- |
| Stage 1: Pretrain | 558K image-text pairs (CC3M subset) | Projector only | Align the vision-language spaces |
| Stage 2: Finetune | 158K instruction-following examples | Projector + LLM | Learn multimodal instruction following |

In Stage 1, both the Vision Encoder and the LLM are frozen; only the MLP Projector is trained to align visual features with the LLM's word embedding space. In Stage 2, the LLM is unfrozen and fine-tuned on multimodal instruction data generated by GPT-4.
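
A sketch of this freezing schedule with toy stand-in modules (module definitions, shapes, and learning rates are illustrative, not the released recipe):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components (the real ones are the CLIP ViT
# encoder, the projector, and Vicuna/LLaMA).
vision_encoder = nn.Linear(768, 1024)
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))
llm = nn.Linear(4096, 4096)

# Stage 1: freeze the vision encoder and the LLM; train the projector only.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False
stage1_opt = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2: unfreeze the LLM; fine-tune projector + LLM on instruction data.
for p in llm.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.AdamW(
    list(projector.parameters()) + list(llm.parameters()), lr=2e-5
)
```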

LLaVA-1.5 Improvements:

  • The projection is upgraded from a single linear layer to a 2-layer MLP with GELU activation
  • Higher input resolution (336×336)
  • Incorporation of additional academic task data (VQA, OCR, etc.)

LLaVA-NeXT / LLaVA-OneVision:

  • Dynamic resolution support: high-resolution images are split into multiple sub-images plus a global thumbnail (a tiling sketch follows this list)
  • Video input support: video frames are treated as multi-image input
  • Consistently leading performance among open-source MLLMs
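
A simplified tiling sketch of the dynamic-resolution idea (the real LLaVA-NeXT implementation also selects the best grid shape for each aspect ratio from a predefined set; the tile size assumes a 336-pixel encoder):

```python
import torch
import torch.nn.functional as F

def anyres_tiles(image: torch.Tensor, tile: int = 336) -> torch.Tensor:
    """Split an image into tile x tile crops plus a global thumbnail."""
    _, h, w = image.shape
    # Pad so height and width are multiples of the tile size.
    image = F.pad(image, (0, (-w) % tile, 0, (-h) % tile))
    c, H, W = image.shape
    # Cut the padded image into a grid of sub-images.
    tiles = (image.unfold(1, tile, tile).unfold(2, tile, tile)
                  .permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile))
    # Global thumbnail at the encoder's base resolution.
    thumb = F.interpolate(image[None], size=(tile, tile), mode="bilinear")[0]
    # Each crop is then encoded independently by the vision encoder.
    return torch.cat([thumb[None], tiles], dim=0)

crops = anyres_tiles(torch.randn(3, 672, 1008))
print(crops.shape)  # torch.Size([7, 3, 336, 336]): 1 thumbnail + 2x3 grid
```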

DeepSeek-VL

  • Uses SigLIP as the vision encoder
  • Hybrid resolution strategy: low-resolution global view + high-resolution local crops
  • Excels at document understanding and OCR tasks

Qwen-VL / Qwen2-VL

  • Alibaba's multimodal large model series
  • Qwen2-VL supports dynamic resolution and arbitrary aspect ratios
  • Introduces Naive Dynamic Resolution: dynamically determines the number of tokens based on the image's native resolution
  • Supports video understanding, capable of processing videos over 20 minutes long

Compression-Based Architecture

Core Idea

A "compression module" is inserted between the vision encoder and the LLM to compress a large number of visual tokens into a small set of "query tokens" before feeding them into the LLM.

Compression-based Architecture:

Image → Vision Encoder → Visual Features [N tokens, e.g. 576]
                              ↓
                    Compression Module (Q-Former / Resampler)
                    + Learnable Queries [M tokens, e.g. 32]
                              ↓
                    Compressed Visual Tokens [M tokens]
                              ↓
           [Compressed Visual Tokens; Text Tokens] → LLM → Response

Advantages: Dramatically reduces the number of visual tokens, lowering the computational burden on the LLM.

Disadvantages: Information compression may cause loss of detail, especially on tasks requiring fine-grained understanding.

BLIP-2 (Q-Former)

BLIP-2, proposed by Li et al. (2023), introduces the Q-Former (Querying Transformer) as its core innovation.

Q-Former Architecture:

Learnable Queries (32)
        ↓
[Self-Attention]  ←→  [Cross-Attention] ← Visual Features (from frozen ViT)
        ↓
Compressed Visual Representation (32 tokens)
        ↓
FC Layer → LLM Input Space

The Q-Former is essentially a lightweight Transformer (see the sketch after this list) consisting of:

  • A set of learnable query embeddings (32)
  • Self-attention layers: interaction among queries
  • Cross-attention layers: queries extract information from visual features
  • Shared self-attention layers that can also interact with text
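
A minimal single-block sketch of the query-compression idea; the real Q-Former stacks BERT-initialized blocks and shares its self-attention with text, and all dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    """One Q-Former-style block: learnable queries self-attend, then
    cross-attend into frozen visual features."""
    def __init__(self, dim: int = 768, num_queries: int = 32, heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(visual_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]                         # queries interact
        q = q + self.cross_attn(q, visual_feats, visual_feats)[0]  # extract visual info
        return q + self.ffn(q)

feats = torch.randn(2, 257, 768)       # frozen ViT output (256 patches + CLS)
print(QueryCompressor()(feats).shape)  # torch.Size([2, 32, 768])
```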

Three-Stage Pretraining:

  1. Image-Text Contrastive (ITC): Aligns visual query representations with text representations
  2. Image-Text Matching (ITM): Binary classification to determine whether image-text pairs match
  3. Image-grounded Text Generation (ITG): Generates text descriptions conditioned on visual input

The Q-Former design allows BLIP-2 to connect to any frozen LLM (e.g., OPT, FlanT5) in a plug-and-play fashion.

Flamingo (Perceiver Resampler)

Flamingo, proposed by Alayrac et al. (2022), is a pioneer in multimodal few-shot learning.

Perceiver Resampler:

Visual Features [N tokens] →
    Learnable Latent Queries [M tokens]
    Aggregate information from visual features via Cross-Attention
→ Fixed-size Visual Tokens [M tokens]

Difference from Q-Former: The Perceiver Resampler has a simpler structure, using only cross-attention for information compression.

Flamingo's Other Innovation: Gated Cross-Attention

Gated cross-attention layers are interleaved between the frozen LLM's layers, allowing the model to attend to visual information when generating each token:

\[ y = \tanh(\alpha) \cdot \text{CrossAttn}(x, v) + x \]

Here \(\alpha\) is initialized to 0, so the model starts from pure LLM behavior and gradually learns to leverage visual information.
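
A sketch of the gating mechanism, simplified from the paper (Flamingo's actual layer also gates a following FFN with its own learned scalar; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention over visual tokens."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # alpha starts at 0, so tanh(alpha) = 0 and the layer is an identity
        # at initialization: training starts from pure LLM behavior.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.alpha) * self.attn(x, v, v)[0]

layer = GatedCrossAttention()
x = torch.randn(1, 16, 512)          # text hidden states
v = torch.randn(1, 64, 512)          # resampled visual tokens
assert torch.allclose(layer(x, v), x)  # identity before training
```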

Flamingo's strength: support for multi-image and multi-turn dialogue in few-shot scenarios.


One-Backbone Architecture

Core Idea

Rather than stitching together a vision encoder and an LLM, the approach is:

Image tokens and text tokens are mixed and jointly processed within a single Transformer from the very first layer.

This is a truly unified "one model for all modalities" approach.

One-Backbone Architecture:

Image → Patch Embedding → Image Tokens ─┐
                                          ├→ Unified Transformer → Output
Text  → Token Embedding → Text Tokens  ─┘

Representative Models

Fuyu (Adept, 2023):

  • Eliminates the standalone vision encoder; image patches are linearly projected directly into the Transformer's input space (sketched below)
  • Extremely simple architecture
  • Can handle images of arbitrary resolution
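
A sketch of this patch-projection input path (the 30-pixel patch size follows the Fuyu-8B release; other dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Raw image patches go through a single linear projection straight into
# the decoder's embedding space -- no separate vision encoder.
patch, dim = 30, 4096
patch_proj = nn.Linear(patch * patch * 3, dim)

image = torch.randn(3, 300, 600)          # any resolution divisible into patches
patches = (image.unfold(1, patch, patch).unfold(2, patch, patch)
                .permute(1, 2, 0, 3, 4).reshape(-1, patch * patch * 3))
image_tokens = patch_proj(patches)[None]  # [1, 200, dim]
text_tokens = torch.randn(1, 12, dim)     # embedded text tokens
sequence = torch.cat([image_tokens, text_tokens], dim=1)  # one stream, one Transformer
```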

Gemini (Google, 2023):

  • A natively multimodal Foundation Model
  • Images, video, audio, and text are jointly modeled within a single model from the very beginning of training
  • Rather than "train an LLM first, then add vision," it is multimodal from the start

Advantages: Richer cross-modal information interaction; avoids the information bottleneck introduced by a projector.

Disadvantages: Typically requires training from scratch; cannot directly reuse existing pretrained LLMs or vision encoders.


Notable Closed-Source Models

GPT-4V / GPT-4o

GPT-4V (GPT-4 with Vision, 2023):

  • OpenAI's multimodal large model with image input support
  • Outstanding performance on complex visual reasoning, chart understanding, OCR, and more
  • Architecture not publicly disclosed

GPT-4o (2024):

  • "o" stands for "omni"
  • Natively multimodal: unified processing of text, images, and audio
  • Supports real-time voice conversation with extremely low response latency
  • Speculated to use a One-Backbone or deep-fusion architecture

Gemini Series

| Model | Key Features |
| --- | --- |
| Gemini 1.0 Pro/Ultra | Natively multimodal; supports text, image, audio, and video |
| Gemini 1.5 Pro | 1-million-token context window; capable of processing long videos |
| Gemini 2.0 Flash | Low-latency inference; supports real-time multimodal interaction |

Gemini's core differentiator is "native multimodality": multimodal capabilities are not bolted on after the fact but are built in from the pretraining stage.


Notable Open-Source Models

| Model | Organization | Architecture Type | Vision Encoder | LLM Backbone | Key Features |
| --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT | Multi-institution | Projection-based | CLIP/SigLIP | LLaMA/Qwen | Dynamic resolution, continuous iteration |
| InternVL 2.5 | Shanghai AI Lab | Projection-based | InternViT-6B | InternLM2 | Very large vision encoder |
| Qwen2-VL | Alibaba | Projection-based | ViT (in-house) | Qwen2 | Dynamic resolution, video understanding |
| DeepSeek-VL2 | DeepSeek | Projection-based | SigLIP | DeepSeek MoE | MoE + multimodal |
| Phi-3 Vision | Microsoft | Projection-based | CLIP | Phi-3 | Strong performance from a small model |
| CogVLM2 | Zhipu AI | Projection + visual expert | EVA2-CLIP | CogLM | Visual expert layers |

MoE for Multimodal Models

Mixture of Experts (MoE) architectures have been incorporated into multimodal large models to achieve more efficient parameter utilization. For a detailed discussion of MoE, see the Scaling and Architecture notes.

Core ideas:

  • Inputs from different modalities can be routed to different experts (see the sketch after this list)
  • Visual tokens and text tokens may activate different FFN experts
  • Large total parameter counts can be maintained while keeping inference costs under control
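
A minimal top-1 routing sketch of the MoE FFN idea (real systems use top-k routing plus load-balancing losses; this is not any particular model's implementation):

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """A learned router sends each token (visual or text alike) to one
    expert FFN, so different modalities can specialize different experts."""
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        weights, choice = self.router(tokens).softmax(-1).max(-1)  # top-1 per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = choice == e               # tokens routed to expert e
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(tokens[mask])
        return out

moe = Top1MoE()
# 576 visual tokens + 32 text tokens in one sequence, routed token by token.
mixed = torch.cat([torch.randn(1, 576, 512), torch.randn(1, 32, 512)], dim=1)
print(moe(mixed).shape)  # torch.Size([1, 608, 512])
```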

Generative Multimodal Models

Generative multimodal models go beyond "look at an image and describe it." They can:

  • Understand images, video, and audio
  • Generate images, video, and audio
  • Edit and modify multimodal content

Core Ideas (Two Approaches)

Approach A: LLM as the brain + external generative models (Diffusion / Flow)

Approach A:

Input (text/image) → Understanding Module → LLM (planning + reasoning) → Generation instructions
                                                     ↓
                                    External Generative Model (SD / DALL-E) → Generated output

Representatives: Visual ChatGPT, NExT-GPT

Approach B: Unified discrete token (VQ) framework where the LLM directly predicts multimodal tokens

Approach B:

Image → VQ Tokenizer → Discrete Tokens ─┐
                                          ├→ Unified Transformer → Output Tokens → Detokenizer
Text  → BPE Tokenizer → Text Tokens    ─┘

Representatives: Chameleon (Meta), Emu3
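
A toy sketch of the unified-vocabulary idea behind Approach B (vocabulary sizes are hypothetical): VQ image codes are offset past the text vocabulary, so one Transformer with a single softmax predicts text and image tokens alike.

```python
TEXT_VOCAB = 32_000        # BPE text vocabulary size (hypothetical)
IMAGE_CODEBOOK = 8_192     # VQ tokenizer codebook size (hypothetical)
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

def to_unified(text_ids: list[int], image_codes: list[int]) -> list[int]:
    """Map both modalities into one id space and join into one sequence."""
    image_ids = [TEXT_VOCAB + c for c in image_codes]  # shift codes past text ids
    return text_ids + image_ids

seq = to_unified([17, 42, 256], [5, 900, 8191])
assert max(seq) < UNIFIED_VOCAB  # every token fits the unified softmax
```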

Representative Models

  • GPT-4o: The leading omni model — unified understanding and generation of text, images, and audio
  • Gemini series: A full multimodal suite with native support for multimodal input and output
  • Janus (DeepSeek): Decouples understanding and generation into two pathways while sharing an LLM backbone

Evaluation Benchmarks

| Benchmark | What It Evaluates | Key Features |
| --- | --- | --- |
| MMBench | Multi-dimensional visual understanding | Covers 20+ dimensions including perception, reasoning, and knowledge |
| MMMU | Multi-discipline multimodal understanding | College-level multimodal Q&A |
| MME | Perception + cognition capabilities | Yes/No question format with clear scoring |
| SEED-Bench | Multimodal understanding (image + video) | Covers 12 evaluation dimensions |
| MathVista | Mathematical visual reasoning | Charts, geometry, and other math-related visual reasoning |
| OCRBench | OCR and document understanding | Text recognition, table understanding, document Q&A |
| RealWorldQA | Real-world understanding | Visual Q&A in real-world scenarios |
| Video-MME | Video understanding | Multi-dimensional evaluation across videos of varying lengths |

Summary

The development trajectory of multimodal large models is clear:

  1. Projection-based architectures have become the mainstream thanks to their simplicity and efficiency. LLaVA and its follow-up work have demonstrated the effectiveness of MLP projectors.
  2. Compression-based architectures remain valuable in scenarios where token count must be controlled.
  3. One-Backbone architectures represent the future direction — native multimodality avoids information bottlenecks between modules.
  4. Generative multimodal models are evolving from "understanding only" toward a unified "understanding + generation" paradigm.

Current key trends in MLLMs: higher resolution, longer video, more modalities, and stronger reasoning.

