LLMs as Foundation Models
Evolution of the GPT Series
The GPT series represents the most iconic development trajectory of Foundation Models, illustrating the complete path of "scaling + alignment."
GPT-1 (2018)
- Parameters: 117M
- Core idea: Use a Transformer Decoder for autoregressive language model pretraining, then fine-tune on downstream tasks
- Training data: BookCorpus (~5GB of text)
- Significance: First demonstrated the effectiveness of "pretrain + finetune" in NLP
GPT-2 (2019)
- Parameters: 1.5B (10x GPT-1)
- Core idea: A language model can serve as an unsupervised multitask learner
- Training data: WebText (~40GB)
- Key finding: The model could perform certain tasks without fine-tuning (zero-shot)
GPT-3 (2020)
- Parameters: 175B
- Core breakthrough: In-context Learning -- no fine-tuning needed; simply provide a few examples in the prompt
- Training data: ~570GB of filtered text
- Significance: Scale brought a qualitative shift, giving rise to few-shot capabilities
InstructGPT (2022)
- Built on GPT-3 + RLHF
- Core idea: Align model behavior through human feedback
- Significantly improved helpfulness and safety
- The 1.3B InstructGPT outperformed the 175B GPT-3 in human evaluations
ChatGPT (2022)
- A conversational model trained using the InstructGPT methodology
- Optimized for multi-turn dialogue scenarios
- Drove large-scale commercial adoption of large language models
GPT-4 (2023)
- Multimodal input (text + images)
- Substantially improved reasoning, achieving human-level performance on various professional exams
- Architecture not publicly disclosed; speculated to use a MoE architecture
GPT-4o (2024)
- Natively multimodal: unified processing of text, images, and audio
- Real-time voice conversation capabilities
- Faster inference speed
GPT Series Evolution:
| Model | Year | Parameters | Hallmark |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Pretrain + finetune |
| GPT-2 | 2019 | 1.5B | Zero-shot |
| GPT-3 | 2020 | 175B | In-context Learning |
| InstructGPT | 2022 | 1.3B/175B | RLHF |
| ChatGPT | 2022 | - | Dialogue |
| GPT-4 | 2023 | MoE (speculated) | Multimodal |
| GPT-4o | 2024 | - | Omni-modal |
Architecture: Decoder-only Transformer
Modern LLMs have almost universally adopted the Decoder-only Transformer architecture.
Basic Structure
Input Tokens → Embedding → [Decoder Block x N] → LM Head → Next Token
Decoder Block:
├── Causal Self-Attention (with causal mask)
├── LayerNorm (Pre-Norm)
└── Feed-Forward Network (SwiGLU)
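To make the causal-attention component concrete, here is a minimal numpy sketch of a single attention head. Projection matrices, multi-head splitting, and the FFN are omitted for brevity, so the input doubles as query, key, and value; this is an illustration of the masking logic, not a faithful implementation of any particular model.

```python
import numpy as np

def causal_self_attention(x):
    # Single head, identity projections (q = k = v = x) to keep the sketch short.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                     # (T, T) dot-product scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions in the future
    scores = np.where(mask, -1e9, scores)             # causal mask blocks them
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ x, w

rng = np.random.default_rng(0)
out, w = causal_self_attention(rng.standard_normal((4, 8)))
# w is lower-triangular with rows summing to 1: each token attends
# only to itself and earlier tokens.
```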
Next-Token Prediction
The training objective of an LLM is to predict the next token by minimizing the negative log-likelihood:

\[
\mathcal{L}(\theta) = -\sum_{t} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})
\]

Despite the extreme simplicity of this objective, when the model is large enough and the data abundant enough, the model implicitly learns grammar, semantics, reasoning, and even world knowledge.
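A toy numerical version of this objective (the token distributions below are invented purely for illustration):

```python
import math

# probs[t] is a hypothetical model distribution over the vocabulary at
# step t; targets[t] is the token that actually came next.
probs = [
    {"the": 0.6, "a": 0.3, "cat": 0.1},
    {"cat": 0.7, "dog": 0.2, "the": 0.1},
]
targets = ["the", "cat"]

# Average negative log-likelihood of the actual continuation.
nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
# Lower loss = higher probability assigned to what actually came next.
```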
Architectural Improvements in Modern LLMs
Compared to the original Transformer, modern LLMs incorporate several improvements:
| Improvement | Original Transformer | Modern LLMs |
|---|---|---|
| Normalization | Post-Norm | Pre-Norm (RMSNorm) |
| Activation Function | ReLU | SwiGLU |
| Positional Encoding | Sinusoidal | RoPE (Rotary Position Embedding) |
| Attention | Multi-Head | GQA (Grouped Query Attention) |
The core idea of RoPE is to encode positional information by rotating queries and keys with position-dependent rotation matrices:

\[
\langle \mathbf{R}_m \mathbf{q}, \mathbf{R}_n \mathbf{k} \rangle = \mathbf{q}^\top \mathbf{R}_{n-m} \mathbf{k}
\]

This ensures that attention scores depend only on the relative position \(m - n\).
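This relative-position property can be verified numerically in two dimensions (a minimal sketch; real RoPE applies such rotations pairwise across the head dimension with a spectrum of frequencies):

```python
import numpy as np

def rotate(v, pos, theta=1.0):
    # 2-D RoPE: rotate the query/key vector by an angle proportional to pos.
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]]) @ v

q = np.array([1.0, 0.5])
k = np.array([0.3, 0.8])

# Attention score between q at position m and k at position n:
s1 = rotate(q, 5) @ rotate(k, 3)     # m=5,  n=3   (offset 2)
s2 = rotate(q, 12) @ rotate(k, 10)   # m=12, n=10  (same offset 2)
# s1 == s2: the score depends only on the relative position m - n.
```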
GQA shares key and value heads across groups, reducing the memory footprint of the KV cache and improving inference efficiency.
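A back-of-the-envelope calculation shows the saving; the head counts, sequence length, and precision below are illustrative assumptions, not any specific model's configuration:

```python
# Per-layer KV-cache size: K and V tensors, one per KV head.
def kv_cache_bytes(n_kv_heads, seq_len=8192, head_dim=128, bytes_per=2):
    return 2 * n_kv_heads * seq_len * head_dim * bytes_per  # fp16 = 2 bytes

mha = kv_cache_bytes(n_kv_heads=32)  # Multi-Head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8)   # GQA: 4 query heads share each KV head
# With these assumptions, GQA shrinks the per-layer cache by 4x.
```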
Scaling Laws
Scaling Laws describe the relationship between model performance and scale. For details, see Scaling and Architecture.
Kaplan Scaling Law (OpenAI, 2020)
The model's test loss follows a power-law relationship with the number of parameters \(N\), the amount of data \(D\), and the compute budget \(C\):

\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\]

Conclusion: under a fixed compute budget, Kaplan et al. recommended prioritizing growth in model parameters over data.
Chinchilla Scaling Law (DeepMind, 2022)
The Chinchilla study revised Kaplan's conclusion, proposing that parameter count and data volume should be scaled in roughly equal proportion with compute:

\[
N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}
\]

Practical guideline: approximately 20 training tokens are needed per parameter. For example, a 70B model requires roughly 1.4T tokens.
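The rule of thumb is simple enough to check directly:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens
# per model parameter.
def chinchilla_tokens(n_params):
    return 20 * n_params

tokens = chinchilla_tokens(70e9)  # 70B-parameter model
# → 1.4e12 tokens, i.e. roughly 1.4T, matching the guideline above.
```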
Emergent Abilities
In-context Learning (ICL)
ICL refers to the model's ability to perform tasks directly from examples provided in the context, without any parameter updates:
Prompt:
Translate English to French:
sea otter => loutre de mer
plush giraffe => girafe en peluche
cheese => ?
Output: fromage
The mechanisms behind ICL are still under active research. One theory suggests that the Transformer implicitly performs gradient descent during its forward pass.
Chain-of-Thought (CoT)
CoT prompting guides the model to reason step by step before answering, significantly improving performance on mathematical and logical tasks:

\[
x \;\to\; r_1, r_2, \ldots, r_k \;\to\; y
\]

where \(r_1, \ldots, r_k\) are intermediate reasoning steps generated before the final answer \(y\).
Instruction Following
After instruction tuning, the model can understand and execute natural language instructions, rather than merely completing text.
RLHF Pipeline
RLHF (Reinforcement Learning from Human Feedback) is the core method for aligning LLMs with human preferences.
Three-Stage Process
Stage 1: SFT (Supervised Fine-Tuning)
Collect high-quality (instruction, response) data pairs
Perform supervised fine-tuning on the pretrained LLM
↓
Stage 2: Reward Model (RM) Training
Human annotators rank multiple model responses
Train a reward model to learn human preferences
RM Loss: L = -log(sigma(r(y_w) - r(y_l)))
↓
Stage 3: PPO Optimization
Use the RM score as the reward signal
Optimize the LLM's policy via the PPO algorithm
Apply a KL penalty to prevent excessive divergence from the SFT model
The PPO optimization objective:

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\!\left[ r_\phi(x, y) \right]
- \beta \, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)
\]

where \(r_\phi\) is the Reward Model, \(\pi_{\text{ref}}\) is the SFT model, and \(\beta\) controls the strength of the KL penalty.
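The two training signals in stages 2 and 3 can be sketched in a few lines, using scalars as stand-ins for the actual per-sequence computations:

```python
import math

def rm_pairwise_loss(r_chosen, r_rejected):
    # Stage 2, Bradley-Terry style reward-model loss: -log sigmoid(r_w - r_l).
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

def rlhf_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    # Stage 3 (sketch): the policy is rewarded by the RM but penalized
    # for drifting from the SFT reference model (the KL term).
    return rm_score - beta * (logprob_policy - logprob_ref)

# Loss is small when the preferred response is scored higher...
good = rm_pairwise_loss(2.0, -1.0)
# ...and large when the ranking is inverted.
bad = rm_pairwise_loss(-1.0, 2.0)
```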
For further discussion on alignment methods, see Safety and Alignment.
Open-Source LLM Ecosystem
LLaMA Series (Meta)
- LLaMA (2023): 7B/13B/33B/65B, trained on publicly available data, demonstrating the viability of open-source LLMs
- LLaMA 2 (2023): Expanded training data to 2T tokens, incorporated RLHF alignment
- LLaMA 3 (2024): 8B/70B/405B, trained on 15T+ tokens, performance approaching GPT-4
Mistral Series (Mistral AI)
- Mistral 7B (2023): Introduced Sliding Window Attention + GQA; the strongest 7B-scale model at release
- Mixtral 8x7B (2024): Sparse MoE with 8 experts per layer and top-2 routing; 46.7B total, 12.9B active parameters
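A sketch of top-2 expert routing in the spirit of Mixtral (random gate weights and linear "experts" for illustration only):

```python
import numpy as np

def top2_route(x, gate_w, experts):
    # Score all experts, keep the two best, renormalize their gate
    # weights with a softmax, and mix only those two experts' outputs.
    logits = gate_w @ x                       # one gating score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two best experts
    w = np.exp(logits[top2] - logits[top2].max())
    w = w / w.sum()                           # renormalized gate weights
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

rng = np.random.default_rng(0)
experts = [lambda v, W=rng.standard_normal((4, 4)): W @ v for _ in range(8)]
gate_w = rng.standard_normal((8, 4))
y = top2_route(rng.standard_normal(4), gate_w, experts)
# Only 2 of the 8 experts run per token, which is why the active
# parameter count is far below the total.
```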
Qwen Series (Alibaba)
- Qwen (2023): Optimized for Chinese-English bilingual performance
- Qwen2 (2024): Multiple sizes (0.5B--72B), supporting 128K context length
- Qwen2.5 (2024): Further improvements in coding and mathematical capabilities
DeepSeek Series (DeepSeek)
- DeepSeek-V2 (2024): MLA (Multi-head Latent Attention) + DeepSeekMoE
- DeepSeek-V3 (2025): 671B total parameters, 37B active parameters, FP8 training
- DeepSeek-R1 (2025): Reinforcement learning-driven reasoning model
Major LLM Comparison
| Model | Parameters | Architecture | Training Data | Context Length | Open Source |
|---|---|---|---|---|---|
| GPT-4 | Undisclosed (speculated MoE) | Decoder-only | Undisclosed | 128K | No |
| LLaMA 3 405B | 405B | Dense Decoder | 15T+ tokens | 128K | Yes |
| Mistral Large | Undisclosed | Speculated MoE | Undisclosed | 128K | No |
| Mixtral 8x7B | 46.7B (12.9B active) | MoE | Undisclosed | 32K | Yes |
| Qwen2.5 72B | 72B | Dense Decoder | 18T+ tokens | 128K | Yes |
| DeepSeek-V3 | 671B (37B active) | MoE + MLA | 14.8T tokens | 128K | Yes |
Summary
The success of LLMs as Foundation Models stems from several key factors:
- A simple yet powerful objective: Next-token prediction is sufficient to learn rich world knowledge
- Scaling effects: Scaling Laws guarantee that performance improves predictably with scale
- Emergent abilities: Capabilities such as ICL and CoT arise naturally at large scale
- Alignment techniques: RLHF aligns model behavior with human expectations
Current trends in LLM development include: more efficient architectures (MoE, MLA), longer context windows, stronger reasoning capabilities, and multimodal unification.