LLMs as Foundation Models
Evolution of the GPT Series
The GPT series represents the most iconic development trajectory of Foundation Models, illustrating the complete path of "scaling + alignment."
GPT-1 (2018)
- Parameters: 117M
- Core idea: Use a Transformer Decoder for autoregressive language model pretraining, then fine-tune on downstream tasks
- Training data: BookCorpus (~5GB of text)
- Significance: First demonstrated the effectiveness of "pretrain + finetune" in NLP
GPT-2 (2019)
- Parameters: 1.5B (10x GPT-1)
- Core idea: A language model can serve as an unsupervised multitask learner
- Training data: WebText (~40GB)
- Key finding: The model could perform certain tasks without fine-tuning (zero-shot)
GPT-3 (2020)
- Parameters: 175B
- Core breakthrough: In-context Learning -- no fine-tuning needed; simply provide a few examples in the prompt
- Training data: ~570GB of filtered text
- Significance: Scale brought a qualitative shift, giving rise to few-shot capabilities
InstructGPT (2022)
- Built on GPT-3 + RLHF
- Core idea: Align model behavior through human feedback
- Significantly improved helpfulness and safety
- The 1.3B InstructGPT outperformed the 175B GPT-3 in human evaluations
ChatGPT (2022)
- A conversational model trained using the InstructGPT methodology
- Optimized for multi-turn dialogue scenarios
- Drove large-scale commercial adoption of large language models
GPT-4 (2023)
- Multimodal input (text + images)
- Substantially improved reasoning, achieving human-level performance on various professional exams
- Architecture not publicly disclosed; speculated to use a MoE architecture
GPT-4o (2024)
- Natively multimodal: unified processing of text, images, and audio
- Real-time voice conversation capabilities
- Faster inference speed
GPT Series Evolution:
| Model | Year | Parameters | Hallmark |
|---|---|---|---|
| GPT-1 | 2018 | 117M | Pretrain + finetune |
| GPT-2 | 2019 | 1.5B | Zero-shot |
| GPT-3 | 2020 | 175B | In-context Learning |
| InstructGPT | 2022 | 1.3B/175B | RLHF |
| ChatGPT | 2022 | - | Dialogue |
| GPT-4 | 2023 | MoE (speculated) | Multimodal |
| GPT-4o | 2024 | - | Omni-modal |
Architecture: Decoder-only Transformer
Modern LLMs have almost universally adopted the Decoder-only Transformer architecture.
Basic Structure
Input Tokens → Embedding → [Decoder Block x N] → LM Head → Next Token
Decoder Block:
├── Causal Self-Attention (with causal mask)
├── LayerNorm (Pre-Norm)
└── Feed-Forward Network (SwiGLU)
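To make the causal-attention component concrete, here is a minimal numpy sketch of a single attention head. Projection matrices, multi-head splitting, and the FFN are omitted for brevity, so the input doubles as query, key, and value; this is an illustration of the masking logic, not a faithful implementation of any particular model.

```python
import numpy as np

def causal_self_attention(x):
    # Single head, identity projections (q = k = v = x) to keep the sketch short.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                     # (T, T) dot-product scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions in the future
    scores = np.where(mask, -1e9, scores)             # causal mask blocks them
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ x, w

rng = np.random.default_rng(0)
out, w = causal_self_attention(rng.standard_normal((4, 8)))
# w is lower-triangular with rows summing to 1: each token attends
# only to itself and earlier tokens.
```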
Next-Token Prediction
The training objective of an LLM is to predict the next token by minimizing the negative log-likelihood:

\[
\mathcal{L}(\theta) = -\sum_{t} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})
\]

Despite the extreme simplicity of this objective, when the model is large enough and the data abundant enough, the model implicitly learns grammar, semantics, reasoning, and even world knowledge.
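A toy numerical version of this objective (the token distributions below are invented purely for illustration):

```python
import math

# probs[t] is a hypothetical model distribution over the vocabulary at
# step t; targets[t] is the token that actually came next.
probs = [
    {"the": 0.6, "a": 0.3, "cat": 0.1},
    {"cat": 0.7, "dog": 0.2, "the": 0.1},
]
targets = ["the", "cat"]

# Average negative log-likelihood of the actual continuation.
nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
# Lower loss = higher probability assigned to what actually came next.
```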
Architectural Improvements in Modern LLMs
Compared to the original Transformer, modern LLMs incorporate several improvements:
| Improvement | Original Transformer | Modern LLMs |
|---|---|---|
| Normalization | Post-Norm | Pre-Norm (RMSNorm) |
| Activation Function | ReLU | SwiGLU |
| Positional Encoding | Sinusoidal | RoPE (Rotary Position Embedding) |
| Attention | Multi-Head | GQA (Grouped Query Attention) |
The core idea of RoPE is to encode positional information by rotating queries and keys with position-dependent rotation matrices:

\[
\langle \mathbf{R}_m \mathbf{q}, \mathbf{R}_n \mathbf{k} \rangle = \mathbf{q}^\top \mathbf{R}_{n-m} \mathbf{k}
\]

This ensures that attention scores depend only on the relative position \(m - n\).
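This relative-position property can be verified numerically in two dimensions (a minimal sketch; real RoPE applies such rotations pairwise across the head dimension with a spectrum of frequencies):

```python
import numpy as np

def rotate(v, pos, theta=1.0):
    # 2-D RoPE: rotate the query/key vector by an angle proportional to pos.
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]]) @ v

q = np.array([1.0, 0.5])
k = np.array([0.3, 0.8])

# Attention score between q at position m and k at position n:
s1 = rotate(q, 5) @ rotate(k, 3)     # m=5,  n=3   (offset 2)
s2 = rotate(q, 12) @ rotate(k, 10)   # m=12, n=10  (same offset 2)
# s1 == s2: the score depends only on the relative position m - n.
```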
GQA shares key and value heads across groups, reducing the memory footprint of the KV cache and improving inference efficiency.
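A back-of-the-envelope calculation shows the saving; the head counts, sequence length, and precision below are illustrative assumptions, not any specific model's configuration:

```python
# Per-layer KV-cache size: K and V tensors, one per KV head.
def kv_cache_bytes(n_kv_heads, seq_len=8192, head_dim=128, bytes_per=2):
    return 2 * n_kv_heads * seq_len * head_dim * bytes_per  # fp16 = 2 bytes

mha = kv_cache_bytes(n_kv_heads=32)  # Multi-Head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8)   # GQA: 4 query heads share each KV head
# With these assumptions, GQA shrinks the per-layer cache by 4x.
```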
Scaling Laws
Scaling Laws describe the relationship between model performance and scale. For details, see Scaling and Architecture.
Kaplan Scaling Law (OpenAI, 2020)
The model's test loss follows a power-law relationship with the number of parameters \(N\), the amount of data \(D\), and the compute budget \(C\):

\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\]

Conclusion: under a fixed compute budget, Kaplan et al. recommended prioritizing growth in model parameters over data.
Chinchilla Scaling Law (DeepMind, 2022)
The Chinchilla study revised Kaplan's conclusion, proposing that parameter count and data volume should be scaled in roughly equal proportion with compute:

\[
N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}
\]

Practical guideline: approximately 20 training tokens are needed per parameter. For example, a 70B model requires roughly 1.4T tokens.
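The rule of thumb is simple enough to check directly:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens
# per model parameter.
def chinchilla_tokens(n_params):
    return 20 * n_params

tokens = chinchilla_tokens(70e9)  # 70B-parameter model
# → 1.4e12 tokens, i.e. roughly 1.4T, matching the guideline above.
```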
Emergent Abilities
In-context Learning (ICL)
ICL refers to the model's ability to perform tasks directly from examples provided in the context, without any parameter updates:
Prompt:
Translate English to French:
sea otter => loutre de mer
plush giraffe => girafe en peluche
cheese => ?
Output: fromage
The mechanisms behind ICL are still under active research. One theory suggests that the Transformer implicitly performs gradient descent during its forward pass.
Chain-of-Thought (CoT)
CoT prompting guides the model to reason step by step before answering, significantly improving performance on mathematical and logical tasks:

\[
x \;\to\; r_1, r_2, \ldots, r_k \;\to\; y
\]

where \(r_1, \ldots, r_k\) are intermediate reasoning steps generated before the final answer \(y\).
Instruction Following
After instruction tuning, the model can understand and execute natural language instructions, rather than merely completing text.
RLHF Pipeline
RLHF (Reinforcement Learning from Human Feedback) is the core method for aligning LLMs with human preferences.
Three-Stage Process
Stage 1: SFT (Supervised Fine-Tuning)
Collect high-quality (instruction, response) data pairs
Perform supervised fine-tuning on the pretrained LLM
↓
Stage 2: Reward Model (RM) Training
Human annotators rank multiple model responses
Train a reward model to learn human preferences
RM Loss: L = -log(sigma(r(y_w) - r(y_l)))
↓
Stage 3: PPO Optimization
Use the RM score as the reward signal
Optimize the LLM's policy via the PPO algorithm
Apply a KL penalty to prevent excessive divergence from the SFT model
The PPO optimization objective:

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\!\left[ r_\phi(x, y) \right]
- \beta \, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)
\]

where \(r_\phi\) is the Reward Model, \(\pi_{\text{ref}}\) is the SFT model, and \(\beta\) controls the strength of the KL penalty.
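The two training signals in stages 2 and 3 can be sketched in a few lines, using scalars as stand-ins for the actual per-sequence computations:

```python
import math

def rm_pairwise_loss(r_chosen, r_rejected):
    # Stage 2, Bradley-Terry style reward-model loss: -log sigmoid(r_w - r_l).
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

def rlhf_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    # Stage 3 (sketch): the policy is rewarded by the RM but penalized
    # for drifting from the SFT reference model (the KL term).
    return rm_score - beta * (logprob_policy - logprob_ref)

# Loss is small when the preferred response is scored higher...
good = rm_pairwise_loss(2.0, -1.0)
# ...and large when the ranking is inverted.
bad = rm_pairwise_loss(-1.0, 2.0)
```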
For further discussion on alignment methods, see Safety and Alignment.
Open-Source LLM Ecosystem
LLaMA Series (Meta)
- LLaMA (2023): 7B/13B/33B/65B, trained on publicly available data, demonstrating the viability of open-source LLMs
- LLaMA 2 (2023): Expanded training data to 2T tokens, incorporated RLHF alignment
- LLaMA 3 (2024): 8B/70B/405B, trained on 15T+ tokens, performance approaching GPT-4
Mistral Series (Mistral AI)
- Mistral 7B (2023): Introduced Sliding Window Attention + GQA; the strongest 7B-scale model at release
- Mixtral 8x7B (2024): Sparse MoE with 8 experts per layer and top-2 routing; 46.7B total, 12.9B active parameters
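A sketch of top-2 expert routing in the spirit of Mixtral (random gate weights and linear "experts" for illustration only):

```python
import numpy as np

def top2_route(x, gate_w, experts):
    # Score all experts, keep the two best, renormalize their gate
    # weights with a softmax, and mix only those two experts' outputs.
    logits = gate_w @ x                       # one gating score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two best experts
    w = np.exp(logits[top2] - logits[top2].max())
    w = w / w.sum()                           # renormalized gate weights
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

rng = np.random.default_rng(0)
experts = [lambda v, W=rng.standard_normal((4, 4)): W @ v for _ in range(8)]
gate_w = rng.standard_normal((8, 4))
y = top2_route(rng.standard_normal(4), gate_w, experts)
# Only 2 of the 8 experts run per token, which is why the active
# parameter count is far below the total.
```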
Qwen Series (Alibaba)
- Qwen (2023): Optimized for Chinese-English bilingual performance
- Qwen2 (2024): Multiple sizes (0.5B--72B), supporting 128K context length
- Qwen2.5 (2024): Further improvements in coding and mathematical capabilities
DeepSeek Series (DeepSeek)
- DeepSeek-V2 (2024): MLA (Multi-head Latent Attention) + DeepSeekMoE
- DeepSeek-V3 (2025): 671B total parameters, 37B active parameters, FP8 training
- DeepSeek-R1 (2025): Reinforcement learning-driven reasoning model
Major LLM Comparison
| Model | Parameters | Architecture | Training Data | Context Length | Open Source |
|---|---|---|---|---|---|
| GPT-4 | Undisclosed (speculated MoE) | Decoder-only | Undisclosed | 128K | No |
| LLaMA 3 405B | 405B | Dense Decoder | 15T+ tokens | 128K | Yes |
| Mistral Large | Undisclosed | Speculated MoE | Undisclosed | 128K | No |
| Mixtral 8x7B | 46.7B (12.9B active) | MoE | Undisclosed | 32K | Yes |
| Qwen2.5 72B | 72B | Dense Decoder | 18T+ tokens | 128K | Yes |
| DeepSeek-V3 | 671B (37B active) | MoE + MLA | 14.8T tokens | 128K | Yes |
Summary
The success of LLMs as Foundation Models stems from several key factors:
- A simple yet powerful objective: Next-token prediction is sufficient to learn rich world knowledge
- Scaling effects: Scaling Laws guarantee that performance improves predictably with scale
- Emergent abilities: Capabilities such as ICL and CoT arise naturally at large scale
- Alignment techniques: RLHF aligns model behavior with human expectations
Current trends in LLM development include: more efficient architectures (MoE, MLA), longer context windows, stronger reasoning capabilities, and multimodal unification.