GPT (Generative Pre-trained Transformer)
GPT is a series of autoregressive language models based on the Transformer Decoder, proposed by OpenAI. From GPT-1 in 2018 to GPT-4 and beyond, this model family has profoundly transformed the research paradigm of natural language processing and the broader field of artificial intelligence. This note provides a systematic overview of GPT's architectural design, mathematical foundations, evolution across versions, and its far-reaching impact on AI development.
Background and Motivation
From Feature Engineering to Pre-training: Why Does NLP Need Pre-training?
Before the deep learning era, NLP relied heavily on hand-crafted features (such as TF-IDF, n-grams, POS tags, etc.). This feature engineering was not only labor-intensive but also struggled to capture the deeper semantic structures of language.
Representation learning in NLP has undergone the following evolution:
| Stage | Method | Limitations |
|---|---|---|
| Discrete Representations | One-Hot, Bag-of-Words | Cannot express semantic similarity; curse of dimensionality |
| Static Word Vectors | Word2Vec, GloVe | One vector per word; cannot handle polysemy |
| Contextualized Representations | ELMo (Bidirectional LSTM) | Feature extraction-based; shallow bidirectionality; RNN-based, no parallelism |
| Pre-train + Fine-tune | GPT-1, BERT | Deep Transformer; end-to-end fine-tuning; significant performance gains |
The core problem is: labeled data is scarce and expensive, while unlabeled text is virtually unlimited. How can we leverage massive amounts of unlabeled text to learn general-purpose language representations and then transfer them to downstream tasks? This is the motivation behind pre-training.
The Core Idea of GPT-1: Unsupervised Pre-training + Supervised Fine-tuning
GPT-1 (2018) proposed a simple yet effective two-stage paradigm:
- Stage 1: Unsupervised Pre-training
  - Train a language model on large-scale unlabeled corpora
  - Objective: predict the next word given the preceding words
  - Through this simple task, the model learns grammar, semantics, and world knowledge
- Stage 2: Supervised Fine-tuning
  - Fine-tune on labeled data for specific downstream tasks (classification, QA, reasoning, etc.)
  - Retain the parameters learned during pre-training; only a small amount of labeled data is needed to achieve strong performance
Key Insight
The title of the GPT-1 paper is "Improving Language Understanding by Generative Pre-Training." The core insight is: generative pre-training can improve language understanding. Although the pre-training objective is generative (predicting the next word), the learned representations are equally effective for understanding tasks (classification, reasoning, etc.).
The Decoder-only Choice: Why Autoregressive Instead of Bidirectional?
In the era of GPT-1, researchers faced a critical design choice:
- Bidirectional models (such as the later BERT): each token can see the full context (both left and right), well-suited for understanding tasks
- Autoregressive models (such as GPT): each token can only see preceding tokens, naturally suited for generation tasks
GPT chose the autoregressive (Decoder-only) approach for the following reasons:
- Natural generation capability: Autoregressive models can directly generate text token by token without requiring an additional decoding mechanism
- Unified training objective: Language modeling (predicting the next word) is an unsupervised task that does not require designing complex pre-training objectives
- Scaling-friendly: Subsequent practice has shown that the Decoder-only architecture exhibits better scaling behavior as model size increases
Historical Perspective
In 2018, GPT-1 and ELMo proposed the pre-training paradigm almost simultaneously. Shortly after, BERT used a bidirectional Encoder to significantly outperform GPT-1 on understanding tasks, which initially led many to believe that bidirectional models were the right approach. However, the emergence of GPT-2 and GPT-3 demonstrated that with sufficient model size and data, autoregressive models can achieve remarkable results on both understanding and generation tasks.
GPT Architecture in Detail
Overall Architecture
GPT uses the Decoder portion of the Transformer, but with one key modification: the Cross-Attention layer is removed. Since GPT has no Encoder, there is no need for Cross-Attention to attend to Encoder outputs. Each layer retains only Masked Self-Attention and the Feed-Forward Network.
Input sequence: "I like deep learning" (4 tokens)
            |
            v
+---------------------------+
|      Token Embedding      |  map each token to a vector
|   + Position Embedding    |  add positional information
+---------------------------+
            |
            v
+---------------------------+
|   Masked Self-Attention   |  each position sees only itself and earlier tokens
|     + Add & LayerNorm     |  residual connection + layer normalization
|---------------------------|
|   Feed-Forward Network    |  position-wise two-layer fully connected network
|     + Add & LayerNorm     |  residual connection + layer normalization
+---------------------------+
            |  x L layers (GPT-1: L = 12)
            v
+---------------------------+
|     Linear + Softmax      |  project to vocabulary size, predict the next token
+---------------------------+
            |
            v
Output: probability distribution over the next token
Differences from the Standard Transformer Decoder
The standard Transformer (Vaswani et al., 2017) Decoder has three sub-layers per layer:
| Sub-layer | Standard Transformer Decoder | GPT (Decoder-only) |
|---|---|---|
| Masked Self-Attention | Yes | Yes |
| Cross-Attention | Yes (attends to Encoder output) | No (no Encoder) |
| Feed-Forward Network | Yes | Yes |
GPT is essentially a "Transformer Decoder with Cross-Attention removed." More precisely, the per-layer structure of GPT (Post-Norm form, as in GPT-1) is:

\[
\begin{aligned}
h' &= \text{LayerNorm}\big(h + \text{MaskedSelfAttn}(h)\big) \\
h'' &= \text{LayerNorm}\big(h' + \text{FFN}(h')\big)
\end{aligned}
\]

Simplified notation (Pre-Norm form, adopted from GPT-2 onward):

\[
\begin{aligned}
h' &= h + \text{MaskedSelfAttn}\big(\text{LayerNorm}(h)\big) \\
h'' &= h' + \text{FFN}\big(\text{LayerNorm}(h')\big)
\end{aligned}
\]
Change in Norm Placement
GPT-1 uses Post-Norm (attention/FFN first, then LayerNorm), consistent with the original Transformer. GPT-2 switched to Pre-Norm (LayerNorm first, then attention/FFN), which makes training more stable, especially in deep networks.
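As a concrete sketch of the two orderings, here is a minimal PyTorch rendering; `attn` and `ffn` are placeholders for any masked self-attention and feed-forward modules, not a specific library API:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """GPT-1 / original-Transformer ordering: sublayer, residual, then LayerNorm."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h):
        h = self.ln1(h + self.attn(h))    # Attention -> Add -> LayerNorm
        return self.ln2(h + self.ffn(h))  # FFN -> Add -> LayerNorm

class PreNormBlock(nn.Module):
    """GPT-2+ ordering: LayerNorm first; the residual path stays unnormalized."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h):
        h = h + self.attn(self.ln1(h))    # LayerNorm -> Attention -> Add
        return h + self.ffn(self.ln2(h))  # LayerNorm -> FFN -> Add
```

Keeping the residual path free of normalization (Pre-Norm) is what stabilizes gradient flow in very deep stacks.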
Masked Self-Attention (Causal Attention)
Masked Self-Attention is the core mechanism of GPT. The key idea is: a causal mask ensures that each position can only attend to itself and preceding positions, preventing any "peeking" at future tokens.
For a sequence of length \(n\), the causal mask matrix \(M\) is a lower-triangular matrix:

\[
M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases}
\]

The Masked Self-Attention computation:

\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V
\]
By adding \(-\infty\) to the attention scores at future positions, these positions receive zero weight after softmax, thereby achieving the "look only at the past" effect.
Causal mask visualization (4 tokens):

         token1 token2 token3 token4
token1 [   1      0      0      0   ]   token1 sees only itself
token2 [   1      1      0      0   ]   token2 sees token1 and itself
token3 [   1      1      1      0   ]   token3 sees token1 through token3
token4 [   1      1      1      1   ]   token4 sees all tokens

(1 = visible, 0 = masked)
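The mask and the attention computation above fit in a few lines of NumPy; a minimal sketch, where `causal_mask` and `masked_self_attention` are illustrative helpers rather than any particular library's API:

```python
import numpy as np

def causal_mask(n):
    """Additive mask M: 0 on and below the diagonal, -inf strictly above it."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask (no batching)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax; masked entries -> 0
    return weights @ V
```

Because `exp(-inf) = 0`, the masked positions drop out of the softmax exactly as described above.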
Token Embedding and Position Embedding
GPT's input processing is similar to the standard Transformer, but with one distinction: GPT uses learned positional embeddings rather than sinusoidal positional encodings:

\[
h_0 = x W_e + W_p
\]

Where:
- \(W_e \in \mathbb{R}^{V \times d}\) is the token embedding matrix, \(V\) is the vocabulary size, \(d\) is the model dimension
- \(W_p \in \mathbb{R}^{T \times d}\) is the positional embedding matrix, \(T\) is the maximum sequence length
- \(x\) is the one-hot representation of the input token sequence
Learned Positional Embeddings vs. Sinusoidal Positional Encodings
The original Transformer uses fixed sinusoidal positional encodings, which have the theoretical advantage of extrapolating to longer sequences. GPT chose learned positional embeddings; experiments showed that the two approaches perform comparably, but learned embeddings are simpler to implement. The downside is that the sequence length is limited to the maximum length seen during training.
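A minimal PyTorch sketch of this input layer; the sizes shown roughly match GPT-2 small but are illustrative placeholders:

```python
import torch
import torch.nn as nn

class GPTEmbedding(nn.Module):
    """h_0 = x W_e + W_p with a *learned* position table (sizes are illustrative)."""
    def __init__(self, vocab_size=50257, max_len=1024, d_model=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # W_e: V x d
        self.pos = nn.Embedding(max_len, d_model)     # W_p: T x d (learned, not sinusoidal)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)
```

The `max_len` table is exactly the limitation mentioned above: positions beyond it have no learned embedding.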
Feed-Forward Network
The FFN in each layer is the same as in the standard Transformer -- a position-wise two-layer fully connected network:

\[
\text{FFN}(x) = \text{GELU}(x W_1 + b_1)\, W_2 + b_2
\]
Note that GPT uses the GELU activation function instead of ReLU. GELU (Gaussian Error Linear Unit) is defined as:

\[
\text{GELU}(x) = x \cdot \Phi(x)
\]

where \(\Phi(x)\) is the cumulative distribution function of the standard normal distribution; in practice it is often computed via the tanh approximation \(0.5x\big(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)]\big)\).
Compared to ReLU, GELU is smoother, has non-zero gradients near zero, and leads to more stable training in practice.
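Both the exact definition and the tanh approximation are easy to check numerically; a small sketch using SciPy's normal CDF:

```python
import numpy as np
from scipy.stats import norm

def gelu_exact(x):
    """GELU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return x * norm.cdf(x)

def gelu_tanh(x):
    """The tanh approximation commonly used in GPT-style implementations."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # deviation is tiny (< 1e-3)
```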
Mathematical Foundations of Autoregressive Generation
Chain Rule Decomposition
The core idea of GPT is to decompose the joint probability into a product of conditional probabilities (the chain rule of probability):

\[
P(u_1, u_2, \ldots, u_n) = \prod_{i=1}^{n} P(u_i \mid u_1, \ldots, u_{i-1})
\]

At each step, the model predicts the conditional probability distribution of the next token:

\[
P(u_{i+1} \mid u_1, \ldots, u_i) = \text{softmax}\big(h_L^{(i)} W_e^T\big)
\]
Where \(h_L^{(i)}\) is the hidden state at position \(i\) in the final layer, and \(W_e\) is the token embedding matrix (weight tying).
Training Objective: Maximum Likelihood
Given a corpus \(\mathcal{U} = \{u_1, u_2, \ldots, u_n\}\), the training objective is to maximize the log-likelihood:

\[
L(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_1, \ldots, u_{i-1}; \Theta)
\]

Equivalently, minimize the cross-entropy loss:

\[
\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n} \log P(u_i \mid u_{<i})
\]
Perplexity
Language models are typically evaluated using perplexity (PPL), the exponentiated form of cross-entropy:

\[
\text{PPL} = \exp(\mathcal{L}_{\text{CE}}) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P(u_i \mid u_{<i})\right)
\]
Perplexity can be intuitively understood as: on average, how many equally probable candidate words the model must choose from at each step. Lower PPL indicates a better model.
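A tiny sketch makes this intuition concrete: a model that assigns every token probability \(1/4\) is "choosing among 4 equally likely candidates," so its PPL is exactly 4:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp(average negative log-likelihood per token)."""
    return float(np.exp(-np.mean(token_log_probs)))

print(perplexity(np.log([0.25, 0.25, 0.25, 0.25])))  # -> 4.0
```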
Fine-tuning Objective (GPT-1)
During fine-tuning, GPT-1 uses the language modeling loss as an auxiliary objective, jointly optimized with the task-specific loss:

\[
\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \mathcal{L}_{\text{LM}}
\]
Where \(\lambda\) is a weighting hyperparameter. This auxiliary loss helps to:
- Improve generalization during fine-tuning
- Accelerate convergence
- Prevent catastrophic forgetting of knowledge learned during pre-training
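As a rough sketch of the joint objective (assuming hypothetical `task_logits`, `lm_logits`, and their targets from a fine-tuning forward pass; the GPT-1 paper reports \(\lambda = 0.5\)):

```python
import torch.nn.functional as F

def finetune_loss(task_logits, task_labels, lm_logits, lm_targets, lam=0.5):
    """Joint GPT-1 fine-tuning objective: task loss + lambda * LM loss."""
    task_loss = F.cross_entropy(task_logits, task_labels)           # supervised objective
    lm_loss = F.cross_entropy(                                      # auxiliary LM objective
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1))
    return task_loss + lam * lm_loss
```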
Decoding Strategies
After training, generating text requires selecting tokens from the probability distribution. Common decoding strategies include:
1. Greedy Decoding
Select the highest-probability token at each step:

\[
u_t = \arg\max_{u} P(u \mid u_{<t})
\]
Pros: simple and fast. Cons: prone to repetition and cannot backtrack.
2. Beam Search
Maintain \(k\) (beam width) best candidate sequences; at each step, expand all candidates and keep the top \(k\) by total log-probability:

\[
\text{score}(u_{1:t}) = \sum_{i=1}^{t} \log P(u_i \mid u_{<i})
\]
Length normalization is typically applied to avoid a bias toward shorter sequences.
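A toy sketch of the procedure; `step_log_probs` is a hypothetical callable returning next-token log-probabilities for a prefix, and EOS handling is omitted for brevity:

```python
import numpy as np

def beam_search(step_log_probs, beam_width=3, max_len=10, alpha=0.7):
    """Toy beam search over a hypothetical next-token log-probability function."""
    beams = [([], 0.0)]                                  # (token sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            logp = step_log_probs(seq)
            for tok in np.argsort(logp)[-beam_width:]:   # expand only the top tokens
                candidates.append((seq + [int(tok)], score + float(logp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Length normalization: divide total log-prob by len^alpha
    return max(beams, key=lambda c: c[1] / len(c[0]) ** alpha)
```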
3. Top-k Sampling
Sample from the \(k\) highest-probability tokens according to their renormalized probabilities:

\[
u_t \sim \frac{P(u \mid u_{<t})}{\sum_{u' \in V_k} P(u' \mid u_{<t})}, \qquad u \in V_k
\]

where \(V_k\) is the set of the \(k\) most probable next tokens.
4. Top-p (Nucleus) Sampling
Select the smallest set of tokens \(V_p\) whose cumulative probability exceeds a threshold \(p\), then sample from this set after renormalization:

\[
V_p = \operatorname*{arg\,min}_{V' \subseteq V} |V'| \quad \text{s.t.} \quad \sum_{u \in V'} P(u \mid u_{<t}) \ge p
\]
The advantage of Top-p is that it adaptively adjusts the candidate set size: fewer tokens when the distribution is concentrated, more tokens when the distribution is flat.
5. Temperature Scaling
Adjust the "sharpness" of the probability distribution using a temperature parameter \(\tau\) applied to the logits \(z_i\):

\[
P_\tau(u_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
\]
- \(\tau < 1\): sharper distribution, approaching greedy (more deterministic)
- \(\tau = 1\): original distribution
- \(\tau > 1\): flatter distribution, increased randomness (more diverse)
- \(\tau \to 0\): equivalent to greedy decoding
- \(\tau \to \infty\): equivalent to a uniform distribution
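These strategies compose naturally into one function. The sketch below (an illustration, not any specific library's API) applies temperature, then top-k, then top-p filtering to a logits vector:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature + top-k + top-p sampling over a (vocab_size,) logits array."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k > 0:                                 # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p < 1.0:                               # nucleus: smallest set with cum. prob >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask

    probs /= probs.sum()                          # renormalize and sample
    return int(np.random.choice(len(probs), p=probs))
```

With `temperature` near 0 the function approaches greedy decoding; with `top_k=0, top_p=1.0` it samples from the full distribution.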
Forward Pass Numerical Example
To build concrete intuition for GPT's computation process, let us walk through a minimal example by hand.
Setup
- Sequence length \(n = 4\), token sequence \([t_1, t_2, t_3, t_4]\)
- Model dimension \(d_{\text{model}} = 4\)
- Single-head attention (\(h = 1\)), \(d_k = 4\)
Assume that after Token Embedding + Position Embedding we obtain an input matrix \(H_0 \in \mathbb{R}^{4 \times 4}\), where each row is the 4-dimensional representation of one token. The first two rows, which Step 5 below uses explicitly, are \([1.0, 0.5, 0.3, 0.1]\) for \(t_1\) and \([0.2, 1.0, 0.7, 0.4]\) for \(t_2\).
Step 1: Compute Q, K, V
For simplicity, assume \(W_Q = W_K = W_V = I\) (identity matrix), so \(Q = K = V = H_0\).
Step 2: Compute Attention Scores
First compute \(H_0 H_0^T\) (symmetric, since both factors are \(H_0\)):

\[
H_0 H_0^T = \begin{bmatrix} 1.35 & 0.95 & 1.07 & 1.04 \\ 0.95 & 1.69 & 1.34 & 1.43 \\ 1.07 & 1.34 & 2.09 & 1.67 \\ 1.04 & 1.43 & 1.67 & 1.90 \end{bmatrix}
\]

Divide by \(\sqrt{d_k} = \sqrt{4} = 2\):

\[
S = \begin{bmatrix} 0.675 & 0.475 & 0.535 & 0.520 \\ 0.475 & 0.845 & 0.670 & 0.715 \\ 0.535 & 0.670 & 1.045 & 0.835 \\ 0.520 & 0.715 & 0.835 & 0.950 \end{bmatrix}
\]
Step 3: Apply the Causal Mask
Set the upper-triangular entries to \(-\infty\) (in practice, a very large negative number):

\[
S_{\text{masked}} = \begin{bmatrix} 0.675 & -\infty & -\infty & -\infty \\ 0.475 & 0.845 & -\infty & -\infty \\ 0.535 & 0.670 & 1.045 & -\infty \\ 0.520 & 0.715 & 0.835 & 0.950 \end{bmatrix}
\]
Step 4: Softmax (Row-wise)
- Row 1: \(\text{softmax}([0.675]) = [1.0, 0, 0, 0]\) -- \(t_1\) can only attend to itself
- Row 2: \(\text{softmax}([0.475, 0.845]) \approx [0.41, 0.59, 0, 0]\) -- \(t_2\) attends more to itself
- Row 3: \(\text{softmax}([0.535, 0.670, 1.045]) \approx [0.26, 0.30, 0.44, 0]\) -- \(t_3\) attends most to itself
- Row 4: \(\text{softmax}([0.520, 0.715, 0.835, 0.950]) \approx [0.19, 0.23, 0.26, 0.29]\) -- \(t_4\) attends relatively evenly but slightly favors itself (the most recent token)
Step 5: Weighted Sum
- Row 1 (output for \(t_1\)): \(= 1.0 \times [1.0, 0.5, 0.3, 0.1] = [1.0, 0.5, 0.3, 0.1]\)
- Row 2 (output for \(t_2\)): \(= 0.41 \times [1.0, 0.5, 0.3, 0.1] + 0.59 \times [0.2, 1.0, 0.7, 0.4] \approx [0.53, 0.80, 0.54, 0.28]\)
- And so on...
Observation
The attention matrix \(A\) clearly shows the effect of the causal mask: it is lower-triangular. The output of \(t_1\) is determined entirely by itself; \(t_2\) fuses information from \(t_1\) and itself; and so on. The output at each position contains only information from "past" and "current" positions, never leaking any "future" information. This is the foundation of autoregressive generation.
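The first two rows of the example can be verified mechanically; this sketch reproduces the Row 1/Row 2 attention weights and the Step 5 output for \(t_2\):

```python
import numpy as np

# First two rows of H_0 (the ones used in Step 5); W_Q = W_K = W_V = I, so Q = K = V = H
H = np.array([[1.0, 0.5, 0.3, 0.1],
              [0.2, 1.0, 0.7, 0.4]])

scores = H @ H.T / np.sqrt(4)                     # Steps 1-2: scaled dot products
scores += np.triu(np.full((2, 2), -np.inf), 1)    # Step 3: causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # Step 4: row-wise softmax
print(weights.round(2))        # [[1.   0.  ]
                               #  [0.41 0.59]]
print((weights @ H).round(2))  # Step 5: row 2 -> [0.53 0.8  0.54 0.28]
```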
Evolution of the GPT Series
Parameter Comparison
| Version | Year | Layers \(L\) | Hidden Dim \(d\) | Attention Heads \(h\) | Parameters | Context Length | Training Data |
|---|---|---|---|---|---|---|---|
| GPT-1 | 2018 | 12 | 768 | 12 | 117M | 512 | BookCorpus (~5GB) |
| GPT-2 | 2019 | 48 | 1600 | 25 | 1.5B | 1024 | WebText (~40GB) |
| GPT-3 | 2020 | 96 | 12288 | 96 | 175B | 2048 | Mixed datasets (~570GB) |
| GPT-4 | 2023 | Undisclosed | Undisclosed | Undisclosed | Undisclosed (rumored ~1.8T) | 8K/32K/128K | Undisclosed |
GPT-1 (2018): Pioneer of the Pre-train + Fine-tune Paradigm
Paper: "Improving Language Understanding by Generative Pre-Training"
The core contribution of GPT-1 was proposing and validating the "unsupervised pre-training + supervised fine-tuning" paradigm:
- Pre-training: Language modeling on BookCorpus (approximately 7,000 books)
- Fine-tuning: Adapting to different downstream tasks through simple input format transformations
Input format transformations for different task types in GPT-1:
Text classification:  [BOS] text [EOS] --> classification head
Textual entailment:   [BOS] premise [SEP] hypothesis [EOS] --> classification head
Similarity:           [BOS] sentence A [SEP] sentence B [EOS] --> classification head
                      + [BOS] sentence B [SEP] sentence A [EOS] (symmetric processing)
Multiple choice:      for each option k: [BOS] context [SEP] option_k [EOS] --> classification head
GPT-1 achieved state-of-the-art results on 9 out of 12 tasks, demonstrating the effectiveness of the pre-training paradigm.
GPT-2 (2019): Zero-shot and the Power of Scale
Paper: "Language Models are Unsupervised Multitask Learners"
The core insight of GPT-2 was: a sufficiently large language model can perform downstream tasks without any fine-tuning (zero-shot).
Key improvements:
- Larger model: Scaled from 117M to 1.5B parameters
- Larger and better data: WebText (web pages crawled from highly upvoted Reddit links, quality-filtered)
- No more fine-tuning needed: Tasks are guided through natural language descriptions (the precursor to prompting)
- Architectural refinements:
  - Moved LayerNorm to the front of each sub-layer (Pre-Norm)
  - Added an extra LayerNorm after the final block
  - Scaled the initialization of residual-path weights by \(1/\sqrt{N}\) (\(N\) = number of residual layers)
A Unified View of Tasks
GPT-2 put forth an important idea: all NLP tasks can be viewed as conditional text generation. For example, translation can be formulated as "translate English to French: [English text] = [French text]." This idea was further developed in GPT-3 and T5.
GPT-3 (2020): The Rise of In-Context Learning
Paper: "Language Models are Few-Shot Learners"
GPT-3 was a landmark model with 175B parameters, demonstrating remarkable In-Context Learning capabilities:
- Zero-shot: Only a task description, no examples provided
- One-shot: One example provided
- Few-shot: Several examples provided (typically 10-100)
Few-shot example (sentiment analysis):
"Review: This movie was fantastic! Sentiment: Positive
Review: Terrible acting and boring plot. Sentiment: Negative
Review: A masterpiece of modern cinema. Sentiment:"
Model generates: " Positive"
Key findings from GPT-3:
- Scale brings qualitative change: The larger the model, the stronger the in-context learning ability (emergent abilities)
- No gradient updates required: Few-shot requires no fine-tuning; model parameters remain entirely unchanged; the model "learns" solely through examples in the prompt
- Task generalization: Can handle tasks never explicitly seen during training
GPT-3's training data was a mixture from multiple sources:
| Dataset | Size (tokens) | Training Weight |
|---|---|---|
| Common Crawl (filtered) | 410B | 60% |
| WebText2 | 19B | 22% |
| Books1 | 12B | 8% |
| Books2 | 55B | 8% |
| Wikipedia | 3B | 3% |
GPT-4 (2023): Multimodality and Further Scaling
OpenAI did not disclose the technical details of GPT-4 (the Technical Report contains almost no architectural information). What is known includes:
- Multimodal: Can process both text and image inputs
- Longer context: Supports context windows of 8K, 32K, and even 128K tokens
- Stronger reasoning: Achieves human-level performance on multiple professional exams
- Rumored architecture: Reportedly uses a Mixture of Experts (MoE) architecture with total parameters of approximately 1.8T, but only a subset of experts is activated during each forward pass
About GPT-4 Information
Since OpenAI has not disclosed GPT-4's architectural details, the "rumored" information above comes from unofficial sources and is not guaranteed to be accurate. OpenAI's Technical Report mentions that they chose not to disclose this information due to competitive and safety considerations.
In-Context Learning and Prompt Engineering
Definition of In-Context Learning
In-Context Learning (ICL) is a capability demonstrated by GPT-3: the model can perform new tasks by providing task descriptions and a few examples in the prompt, without updating any model parameters.
Three modes:
Zero-shot:
"将以下句子翻译成英文:今天天气真好。"
One-shot:
"将以下句子翻译成英文。
例子:我喜欢猫。 -> I like cats.
今天天气真好。 ->"
Few-shot:
"将以下句子翻译成英文。
例子1:我喜欢猫。 -> I like cats.
例子2:他在读书。 -> He is reading.
例子3:她很开心。 -> She is very happy.
今天天气真好。 ->"
Why Can GPT-3 Do In-Context Learning?
This remains an active research question. Several theories have been proposed:
1. Implicit Bayesian Inference
The model learns the distribution of tasks present in text during pre-training. Given few-shot examples, the model implicitly performs Bayesian inference -- inferring what the current task is based on the examples, and then executing that task.
2. Implicit Gradient Descent
Research has shown (Akyurek et al., 2022; von Oswald et al., 2023) that the forward pass of a Transformer can simulate gradient descent. Few-shot examples serve as "training data," and the model implicitly performs one (or several) gradient descent steps on this data during its forward pass.
3. Connection to Meta-Learning
In-Context Learning can be viewed as a form of meta-learning: the pre-training phase is meta-training (learning "how to learn"), and few-shot inference at test time is meta-testing (applying the learned "learning ability" to rapidly adapt to new tasks).
Chain-of-Thought (CoT) Prompting
Chain-of-Thought (Wei et al., 2022) is an important prompting technique that guides the model to perform step-by-step reasoning by demonstrating the reasoning process in examples:
Standard Prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, with 3 balls per can. How many tennis balls does he have now?
A: 11

CoT Prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, with 3 balls per can. How many tennis balls does he have now?
A: Roger started with 5 tennis balls. He bought 2 cans with 3 balls each, so he bought 2 x 3 = 6 balls.
5 + 6 = 11. The answer is 11.
CoT is particularly effective with large models (typically >100B parameters) and is considered a way of leveraging the model's own computational capacity for "slow thinking."
GPT vs. BERT Comparison
GPT and BERT represent the two major paradigms of pre-trained language models, embodying fundamentally different design philosophies.
| Dimension | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Architecture | Transformer Decoder (Masked Self-Attention) | Transformer Encoder (Bidirectional Self-Attention) |
| Attention Direction | Unidirectional (left context only) | Bidirectional (full context) |
| Pre-training Objective | Autoregressive language modeling (predict next token) | Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) |
| Best Suited For | Generation tasks (text generation, dialogue, translation) | Understanding tasks (classification, NER, extractive QA) |
| Usage | GPT-1: fine-tuning; GPT-2/3: Prompting / In-Context Learning | Fine-tuning (with task-specific heads) |
| Scaling Trend | Consistently benefits from increased scale | Diminishing returns at larger scales (relatively) |
GPT (autoregressive, unidirectional):        BERT (masked LM, bidirectional):
[I] --> [like] --> [cats]                    [I] <-> [MASK] <-> [cats]
 |         |          |                       |        |          |
 v         v          v                       v        v          v
P(like)  P(cats)   P(EOS)                         P([MASK] = like)
                                             (sees both "I" and "cats" at once)
Cross-Reference
For a detailed analysis of BERT's architecture and pre-training strategies, see the BERT-related notes in the Transformer series. For a discussion of the pre-training paradigms behind both models (autoregressive vs. autoencoding), see the pre-training notes in the Foundation Models section.
Why Did Decoder-only Ultimately Prevail?
Although BERT dominated NLP leaderboards from 2018 to 2020, the Decoder-only architecture (GPT series) gradually became the mainstream choice for large models. The reasons include:
- Unified generative paradigm: All tasks can be cast as text generation, eliminating the need to design task-specific head structures
- Zero/Few-shot capability: New tasks can be completed without fine-tuning, significantly lowering the barrier to application
- Better scaling behavior: Experiments show that Decoder-only models exhibit smoother and more sustained performance improvement as parameters increase
- Training efficiency: The autoregressive objective computes the loss on every token, whereas MLM only computes the loss on masked tokens (typically 15%)
- Emergent abilities: Capabilities such as in-context learning and chain-of-thought reasoning primarily emerge in large-scale autoregressive models
Scaling Laws and Insights from GPT
The evolution of the GPT series has profoundly revealed the role of scaling laws in language models.
Kaplan Scaling Law
OpenAI's research (Kaplan et al., 2020) found that language model performance (cross-entropy loss) follows a power-law relationship with three factors:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
Where \(N\) is the number of parameters, \(D\) is the amount of data, and \(C\) is the compute budget.
Key findings:
- Model performance improves smoothly and predictably as a power law with increasing scale
- Under a fixed compute budget, scaling up the model is more efficient than adding more data (later revised by Chinchilla)
- Architectural details (layer-to-width ratio, number of attention heads, etc.) have a relatively minor impact on performance
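As a rough illustration of the \(L(N)\) term, the sketch below plugs GPT-scale parameter counts into the power law with the constants reported by Kaplan et al. (2020), \(N_c \approx 8.8 \times 10^{13}\) and \(\alpha_N \approx 0.076\); the outputs are illustrative only, since actual losses depend on the tokenizer, data, and evaluation set:

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, constants from Kaplan et al. (2020)."""
    return (n_c / n_params) ** alpha_n

for name, n in [("GPT-1", 117e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: predicted loss ~ {kaplan_loss(n):.2f} nats/token")
```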
Observing Scaling Effects Across the GPT Series
Parameter count (log scale):

GPT-1          GPT-2          GPT-3            GPT-4
117M           1.5B           175B             ~1.8T (rumored)
  |              |              |                |
  +-----13x------+-----117x-----+------~10x------+
  |              |              |                |
Fine-tuning    Zero-shot      In-Context       Multimodal
SOTA on        early          Learning         strong reasoning
9/12 tasks     emergent       Chain-of-        expert-exam-level
               abilities      Thought          performance
Each leap in scale has brought qualitative -- not merely quantitative -- changes. This is precisely what makes scaling laws so compelling.
Reflections and Discussion
GPT's Impact on AI Development
- Paradigm shift: From "training a specialized model for each task" to "solving all tasks with a single general-purpose model," GPT has led the foundation model paradigm
- Scaling above all: The success of the GPT series has reinforced the belief that "bigger is better," fueling the compute arms race
- Rise of prompt engineering: Unlocking model capabilities no longer depends on fine-tuning, but on how prompts are designed
- AI safety and alignment: The powerful capabilities of the GPT series have also raised safety concerns (misinformation, bias, misuse, etc.), driving alignment research (RLHF, etc.)
Open Questions
- Why is the Transformer's scaling behavior so smooth? The theoretical explanation for this phenomenon remains incomplete
- What is the true nature of In-Context Learning? Is it genuine "learning" or pattern matching?
- The data ceiling: When high-quality text data is nearly exhausted, can scaling continue? Is synthetic data the way forward?
- Efficiency concerns: Autoregressive generation is inherently sequential, producing only one token at a time. Is there a more efficient generation paradigm?
- Predictability of emergent abilities: Can we predict in advance at what scale a model will develop a particular capability?
Summary
From a 117M-parameter model in 2018 to today's ultra-large-scale multimodal models, the GPT series has not only pushed the boundaries of natural language processing technically, but has fundamentally changed how we think about AI systems. The paradigm of "pre-trained large models + prompting/alignment" has become the central pathway in current AI research. Understanding GPT's architectural design, training methodology, and scaling behavior is an essential foundation for understanding modern AI development.