
LLMs as Foundation Models

Evolution of the GPT Series

The GPT series represents the most iconic development trajectory of Foundation Models, illustrating the complete path of "scaling + alignment."

GPT-1 (2018)

  • Parameters: 117M
  • Core idea: Use a Transformer Decoder for autoregressive language model pretraining, then fine-tune on downstream tasks
  • Training data: BookCorpus (~5GB of text)
  • Significance: First demonstrated the effectiveness of "pretrain + finetune" in NLP

GPT-2 (2019)

  • Parameters: 1.5B (~13x GPT-1)
  • Core idea: A language model can serve as an unsupervised multitask learner
  • Training data: WebText (~40GB)
  • Key finding: The model could perform certain tasks without fine-tuning (zero-shot)

GPT-3 (2020)

  • Parameters: 175B
  • Core breakthrough: In-context Learning -- no fine-tuning needed; simply provide a few examples in the prompt
  • Training data: ~570GB of filtered text
  • Significance: Scale brought a qualitative shift, giving rise to few-shot capabilities
\[ P(y | x_{\text{prompt}}) = P(y | [\text{examples}; x_{\text{query}}]; \theta_{\text{frozen}}) \]

InstructGPT (2022)

  • Built on GPT-3 + RLHF
  • Core idea: Align model behavior through human feedback
  • Significantly improved helpfulness and safety
  • The 1.3B InstructGPT outperformed the 175B GPT-3 in human evaluations

ChatGPT (2022)

  • A conversational model trained using the InstructGPT methodology
  • Optimized for multi-turn dialogue scenarios
  • Drove large-scale commercial adoption of large language models

GPT-4 (2023)

  • Multimodal input (text + images)
  • Substantially improved reasoning, achieving human-level performance on various professional exams
  • Architecture not publicly disclosed; speculated to use a MoE architecture

GPT-4o (2024)

  • Natively multimodal: unified processing of text, images, and audio
  • Real-time voice conversation capabilities
  • Faster inference speed

GPT Series Evolution:

GPT-1     GPT-2      GPT-3    InstructGPT   ChatGPT    GPT-4        GPT-4o
117M      1.5B       175B     1.3B/175B     -          MoE?         -
Finetune  Zero-shot  ICL      RLHF          Dialogue   Multimodal   Omni-modal
2018      2019       2020     2022          2022       2023         2024

Architecture: Decoder-only Transformer

Modern LLMs have almost universally adopted the Decoder-only Transformer architecture.

Basic Structure

Input Tokens → Embedding → [Decoder Block x N] → LM Head → Next Token

Decoder Block:
    ├── Causal Self-Attention (with causal mask)
    ├── LayerNorm (Pre-Norm)
    └── Feed-Forward Network (SwiGLU)
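
A minimal sketch of one decoder block in PyTorch may make the structure concrete. For brevity it uses standard LayerNorm, multi-head attention, and a GELU feed-forward network in place of RMSNorm, GQA, and SwiGLU; all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: causal self-attention followed by a feed-forward network."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        T = x.size(1)
        # Causal mask: True marks future positions that may not be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)                      # pre-norm before attention
        a, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + a                              # residual connection
        x = x + self.ffn(self.norm2(x))        # pre-norm + residual for the FFN
        return x
```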

Next-Token Prediction

The training objective of an LLM is to predict the next token:

\[ \mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t}) \]

Despite the extreme simplicity of this objective, a sufficiently large model trained on sufficiently abundant data implicitly learns grammar, semantics, reasoning, and even world knowledge.
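
The objective maps directly onto a cross-entropy loss over shifted targets; a minimal PyTorch sketch (tensor shapes and the function name are illustrative):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Cross-entropy for next-token prediction.

    logits: (batch, T, vocab) -- model outputs at each position
    tokens: (batch, T)        -- the input token ids
    Position t's logits are scored against token t+1, giving the average of
    -log P(x_t | x_{<t}) over the sequence.
    """
    pred = logits[:, :-1, :]            # predictions made at positions 1..T-1
    target = tokens[:, 1:]              # the "next token" at each of those positions
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        target.reshape(-1),
    )
```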

Architectural Improvements in Modern LLMs

Compared to the original Transformer, modern LLMs incorporate several improvements:

Improvement           Original Transformer   Modern LLMs
Normalization         Post-Norm              Pre-Norm (RMSNorm)
Activation Function   ReLU                   SwiGLU
Positional Encoding   Sinusoidal             RoPE (Rotary Position Embedding)
Attention             Multi-Head             GQA (Grouped Query Attention)

The core idea of RoPE is to encode positional information as rotation matrices:

\[ f(q, m) = R_m q, \quad \text{where } R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \]

This ensures that attention scores depend only on the relative position \(m - n\).
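
In practice the rotation is applied independently to each pair of feature dimensions, each with its own frequency. A minimal sketch of this common implementation (not tied to any specific library's RoPE code):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair of x by a position-dependent angle.

    x: (batch, seq_len, dim) with dim even. The angle for pair i at position m
    is m * base**(-2i/dim), matching the 2x2 rotation-matrix form above.
    """
    b, T, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2 / d)       # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # even / odd feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin         # [cos -sin; sin cos] applied per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```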

GQA shares key and value heads across groups, reducing the memory footprint of the KV cache and improving inference efficiency.
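
A minimal sketch of the grouping idea, with the causal mask omitted for brevity (head counts and shapes are illustrative):

```python
import torch

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2):
    """Scaled dot-product attention where groups of query heads share K/V heads.

    q:    (batch, T, n_q_heads * d_head)
    k, v: (batch, T, n_kv_heads * d_head)
    With n_kv_heads < n_q_heads, the KV cache shrinks by the same factor.
    """
    b, T, _ = q.shape
    d_head = q.size(-1) // n_q_heads
    group = n_q_heads // n_kv_heads

    q = q.view(b, T, n_q_heads, d_head).transpose(1, 2)   # (b, Hq,  T, d)
    k = k.view(b, T, n_kv_heads, d_head).transpose(1, 2)  # (b, Hkv, T, d)
    v = v.view(b, T, n_kv_heads, d_head).transpose(1, 2)

    # Each K/V head serves `group` consecutive query heads
    k = k.repeat_interleave(group, dim=1)                  # (b, Hq, T, d)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5       # (b, Hq, T, T)
    out = scores.softmax(dim=-1) @ v                       # (b, Hq, T, d)
    return out.transpose(1, 2).reshape(b, T, -1)
```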


Scaling Law

Scaling Laws describe the relationship between model performance and scale. For details, see Scaling and Architecture.

Kaplan Scaling Law (OpenAI, 2020)

The model's test loss follows a power-law relationship with the number of parameters \(N\), the amount of data \(D\), and the compute budget \(C\):

\[ L(N) \propto N^{-\alpha_N}, \quad L(D) \propto D^{-\alpha_D}, \quad L(C) \propto C^{-\alpha_C} \]

Conclusion: Under a fixed budget, priority should be given to increasing the number of model parameters.

Chinchilla Scaling Law (DeepMind, 2022)

Chinchilla revised Kaplan's conclusion, proposing that parameter count and data volume should be scaled proportionally:

\[ N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5} \]

Practical guideline: Approximately 20 training tokens are needed per parameter. For example, a 70B model requires roughly 1.4T tokens.
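
The heuristic is easy to apply as a back-of-envelope calculation; the sketch below also uses the commonly cited C ≈ 6ND approximation for training FLOPs, which is an assumption not stated in this section:

```python
def chinchilla_budget(n_params):
    """Back-of-envelope compute-optimal sizing from the ~20 tokens/parameter rule.

    Uses the common C ~= 6 * N * D approximation for training FLOPs.
    """
    tokens = 20 * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(70e9)   # a 70B-parameter model
print(f"tokens ~= {tokens / 1e12:.1f}T, compute ~= {flops:.2e} FLOPs")
# tokens ~= 1.4T, compute ~= 5.88e+23 FLOPs
```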


Emergent Abilities

In-context Learning (ICL)

ICL refers to the model's ability to perform tasks directly from examples provided in the context, without any parameter updates:

Prompt:
  Translate English to French:
  sea otter => loutre de mer
  plush giraffe => girafe en peluche
  cheese => ?

Output: fromage

The mechanisms behind ICL are still under active research. One theory suggests that the Transformer implicitly performs gradient descent during its forward pass.

Chain-of-Thought (CoT)

CoT prompting guides the model to reason step by step, significantly improving performance on mathematical and logical tasks:

\[ P(y | x) \approx P(y | r_1, r_2, \ldots, r_k, x) \cdot P(r_1, r_2, \ldots, r_k | x) \]

where \(r_1, \ldots, r_k\) are intermediate reasoning steps.
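
A worked instance of this decomposition, with intermediate steps \(r_1, r_2\) generated before the final answer \(y\):

Question: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more.
          How many apples do they have now?

r_1: 23 - 20 = 3 apples remain after lunch
r_2: 3 + 6 = 9 apples after buying more

Answer (y): 9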

Instruction Following

After instruction tuning, the model can understand and execute natural language instructions, rather than merely completing text.


RLHF Pipeline

RLHF (Reinforcement Learning from Human Feedback) is the core method for aligning LLMs with human preferences.

Three-Stage Process

Stage 1: SFT (Supervised Fine-Tuning)
    Collect high-quality (instruction, response) data pairs
    Perform supervised fine-tuning on the pretrained LLM

           ↓

Stage 2: Reward Model (RM) Training
    Human annotators rank multiple model responses
    Train a reward model to learn human preferences
    RM Loss: L = -log(sigma(r(y_w) - r(y_l)))

           ↓

Stage 3: PPO Optimization
    Use the RM score as the reward signal
    Optimize the LLM's policy via the PPO algorithm
    Apply a KL penalty to prevent excessive divergence from the SFT model
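
The Stage 2 pairwise loss above fits in a few lines; a minimal PyTorch sketch (function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss from Stage 2: -log sigmoid(r(y_w) - r(y_l)).

    r_chosen:   (batch,) reward-model scores for the preferred responses y_w
    r_rejected: (batch,) reward-model scores for the rejected responses y_l
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: the RM currently scores the rejected answer higher -> large loss
loss = reward_model_loss(torch.tensor([0.2]), torch.tensor([1.1]))
print(loss)  # ~1.24
```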

The PPO optimization objective:

\[ \max_\theta \mathbb{E}_{x, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}}) \right] \]

where \(r_\phi\) is the Reward Model, \(\pi_{\text{ref}}\) is the SFT model, and \(\beta\) controls the strength of the KL penalty.
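
One common way to realize this objective is to subtract a sample-based KL estimate from the RM score before running PPO; a minimal sketch under that assumption (names are illustrative, not taken from any specific RLHF library):

```python
import torch

def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward used in Stage 3: RM score minus a KL penalty to the SFT model.

    rm_score:    scalar reward r_phi(x, y) for the full response
    logp_policy: (T,) per-token log-probs under the current policy pi_theta
    logp_ref:    (T,) per-token log-probs under the frozen SFT model pi_ref
    The summed log-prob difference is a sample estimate of KL(pi_theta || pi_ref).
    """
    kl_estimate = (logp_policy - logp_ref).sum()
    return rm_score - beta * kl_estimate
```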

For further discussion on alignment methods, see Safety and Alignment.


Open-Source LLM Ecosystem

LLaMA Series (Meta)

  • LLaMA (2023): 7B/13B/33B/65B, trained on publicly available data, demonstrating the viability of open-source LLMs
  • LLaMA 2 (2023): Expanded training data to 2T tokens, incorporated RLHF alignment
  • LLaMA 3 (2024): 8B/70B/405B, trained on 15T+ tokens, performance approaching GPT-4

Mistral Series (Mistral AI)

  • Mistral 7B (2023): Introduced Sliding Window Attention + GQA; the strongest-performing model at the 7B scale when released
  • Mixtral 8x7B (2024): MoE architecture with 8 experts per layer and 2 routed per token; 12.9B active parameters

Qwen Series (Alibaba)

  • Qwen (2023): Optimized for Chinese-English bilingual performance
  • Qwen2 (2024): Multiple sizes (0.5B--72B), supporting 128K context length
  • Qwen2.5 (2024): Further improvements in coding and mathematical capabilities

DeepSeek Series (DeepSeek)

  • DeepSeek-V2 (2024): MLA (Multi-head Latent Attention) + DeepSeekMoE
  • DeepSeek-V3 (2025): 671B total parameters, 37B active parameters, FP8 training
  • DeepSeek-R1 (2025): Reinforcement learning-driven reasoning model

Major LLM Comparison

Model           Parameters                     Architecture     Training Data   Context Length   Open Source
GPT-4           Undisclosed (speculated MoE)   Decoder-only     Undisclosed     128K             No
LLaMA 3 405B    405B                           Dense Decoder    15T+ tokens     128K             Yes
Mistral Large   Undisclosed                    Speculated MoE   Undisclosed     128K             No
Mixtral 8x7B    46.7B (12.9B active)           MoE              Undisclosed     32K              Yes
Qwen2.5 72B     72B                            Dense Decoder    18T+ tokens     128K             Yes
DeepSeek-V3     671B (37B active)              MoE + MLA        14.8T tokens    128K             Yes

Summary

The success of LLMs as Foundation Models stems from several key factors:

  1. A simple yet powerful objective: Next-token prediction is sufficient to learn rich world knowledge
  2. Scaling effects: Scaling laws show that performance improves predictably with scale
  3. Emergent abilities: Capabilities such as ICL and CoT arise naturally at large scale
  4. Alignment techniques: RLHF aligns model behavior with human expectations

Current trends in LLM development include: more efficient architectures (MoE, MLA), longer context windows, stronger reasoning capabilities, and multimodal unification.

