GPT (Generative Pre-trained Transformer)
GPT is a series of autoregressive language models based on the Transformer Decoder, proposed by OpenAI. From GPT-1 in 2018 to GPT-4 and beyond, this model family has profoundly transformed the research paradigm of natural language processing and the broader field of artificial intelligence. This note provides a systematic overview of GPT's architectural design, mathematical foundations, evolution across versions, and its far-reaching impact on AI development.
Background and Motivation
From Feature Engineering to Pre-training: Why Does NLP Need Pre-training?
Before the deep learning era, NLP relied heavily on hand-crafted features (such as TF-IDF, n-grams, POS tags, etc.). This feature engineering was not only labor-intensive but also struggled to capture the deeper semantic structures of language.
Representation learning in NLP has undergone the following evolution:
| Stage | Method | Limitations |
|---|---|---|
| Discrete Representations | One-Hot, Bag-of-Words | Cannot express semantic similarity; curse of dimensionality |
| Static Word Vectors | Word2Vec, GloVe | One vector per word; cannot handle polysemy |
| Contextualized Representations | ELMo (Bidirectional LSTM) | Feature extraction-based; shallow bidirectionality; RNN-based, no parallelism |
| Pre-train + Fine-tune | GPT-1, BERT | Deep Transformer; end-to-end fine-tuning; significant performance gains |
The core problem is: labeled data is scarce and expensive, while unlabeled text is virtually unlimited. How can we leverage massive amounts of unlabeled text to learn general-purpose language representations and then transfer them to downstream tasks? This is the motivation behind pre-training.
The Core Idea of GPT-1: Unsupervised Pre-training + Supervised Fine-tuning
GPT-1 (2018) proposed a simple yet effective two-stage paradigm:
- Stage 1: Unsupervised Pre-training
  - Train a language model on large-scale unlabeled corpora
  - Objective: predict the next word given the preceding words
  - Through this simple task, the model learns grammar, semantics, and world knowledge
- Stage 2: Supervised Fine-tuning
  - Fine-tune on labeled data for specific downstream tasks (classification, QA, reasoning, etc.)
  - Retain the parameters learned during pre-training; only a small amount of labeled data is needed to achieve strong performance
Key Insight
The title of the GPT-1 paper is "Improving Language Understanding by Generative Pre-Training." The core insight is: generative pre-training can improve language understanding. Although the pre-training objective is generative (predicting the next word), the learned representations are equally effective for understanding tasks (classification, reasoning, etc.).
The Decoder-only Choice: Why Autoregressive Instead of Bidirectional?
In the era of GPT-1, researchers faced a critical design choice:
- Bidirectional models (such as the later BERT): each token can see the full context (both left and right), well-suited for understanding tasks
- Autoregressive models (such as GPT): each token can only see preceding tokens, naturally suited for generation tasks
GPT chose the autoregressive (Decoder-only) approach for the following reasons:
- Natural generation capability: Autoregressive models can directly generate text token by token without requiring an additional decoding mechanism
- Unified training objective: Language modeling (predicting the next word) is an unsupervised task that does not require designing complex pre-training objectives
- Scaling-friendly: Subsequent practice has shown that the Decoder-only architecture exhibits better scaling behavior as model size increases
Historical Perspective
In 2018, GPT-1 and ELMo proposed the pre-training paradigm almost simultaneously. Shortly after, BERT used a bidirectional Encoder to significantly outperform GPT-1 on understanding tasks, which initially led many to believe that bidirectional models were the right approach. However, the emergence of GPT-2 and GPT-3 demonstrated that with sufficient model size and data, autoregressive models can achieve remarkable results on both understanding and generation tasks.
GPT Architecture in Detail
Overall Architecture
GPT uses the Decoder portion of the Transformer, but with one key modification: the Cross-Attention layer is removed. Since GPT has no Encoder, there is no need for Cross-Attention to attend to Encoder outputs. Each layer retains only Masked Self-Attention and the Feed-Forward Network.
Input sequence: "I like deep learning" (4 tokens)
            |
            v
+---------------------------+
|      Token Embedding      |  map each token to a vector
|   + Position Embedding    |  add positional information
+---------------------------+
            |
            v
+---------------------------+
|   Masked Self-Attention   |  each position sees only itself and earlier tokens
|     + Add & LayerNorm     |  residual connection + layer normalization
|---------------------------|
|   Feed-Forward Network    |  position-wise two-layer fully connected network
|     + Add & LayerNorm     |  residual connection + layer normalization
+---------------------------+
            |  x L layers (GPT-1: L = 12)
            v
+---------------------------+
|     Linear + Softmax      |  project to vocabulary size, predict the next token
+---------------------------+
            |
            v
Output: probability distribution over the next token
Differences from the Standard Transformer Decoder
The standard Transformer (Vaswani et al., 2017) Decoder has three sub-layers per layer:
| Sub-layer | Standard Transformer Decoder | GPT (Decoder-only) |
|---|---|---|
| Masked Self-Attention | Yes | Yes |
| Cross-Attention | Yes (attends to Encoder output) | No (no Encoder) |
| Feed-Forward Network | Yes | Yes |
GPT is essentially a "Transformer Decoder with Cross-Attention removed." More precisely, the per-layer structure of GPT (Post-Norm form, as in GPT-1) is:

\[
\begin{aligned}
h' &= \text{LayerNorm}\big(h + \text{MaskedSelfAttn}(h)\big) \\
h'' &= \text{LayerNorm}\big(h' + \text{FFN}(h')\big)
\end{aligned}
\]

Simplified notation (Pre-Norm form, adopted from GPT-2 onward):

\[
\begin{aligned}
h' &= h + \text{MaskedSelfAttn}\big(\text{LayerNorm}(h)\big) \\
h'' &= h' + \text{FFN}\big(\text{LayerNorm}(h')\big)
\end{aligned}
\]
Change in Norm Placement
GPT-1 uses Post-Norm (attention/FFN first, then LayerNorm), consistent with the original Transformer. GPT-2 switched to Pre-Norm (LayerNorm first, then attention/FFN), which makes training more stable, especially in deep networks.
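As a concrete sketch of the two orderings, here is a minimal PyTorch rendering; `attn` and `ffn` are placeholders for any masked self-attention and feed-forward modules, not a specific library API:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """GPT-1 / original-Transformer ordering: sublayer, residual, then LayerNorm."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h):
        h = self.ln1(h + self.attn(h))    # Attention -> Add -> LayerNorm
        return self.ln2(h + self.ffn(h))  # FFN -> Add -> LayerNorm

class PreNormBlock(nn.Module):
    """GPT-2+ ordering: LayerNorm first; the residual path stays unnormalized."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h):
        h = h + self.attn(self.ln1(h))    # LayerNorm -> Attention -> Add
        return h + self.ffn(self.ln2(h))  # LayerNorm -> FFN -> Add
```

Keeping the residual path free of normalization (Pre-Norm) is what stabilizes gradient flow in very deep stacks.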
Masked Self-Attention (Causal Attention)
Masked Self-Attention is the core mechanism of GPT. The key idea is: a causal mask ensures that each position can only attend to itself and preceding positions, preventing any "peeking" at future tokens.
For a sequence of length \(n\), the causal mask matrix \(M\) is a lower-triangular matrix:

\[
M_{ij} = \begin{cases} 0, & j \le i \\ -\infty, & j > i \end{cases}
\]

The Masked Self-Attention computation:

\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V
\]
By adding \(-\infty\) to the attention scores at future positions, these positions receive zero weight after softmax, thereby achieving the "look only at the past" effect.
Causal mask visualization (4 tokens):

         token1 token2 token3 token4
token1 [   1      0      0      0   ]   token1 sees only itself
token2 [   1      1      0      0   ]   token2 sees token1 and itself
token3 [   1      1      1      0   ]   token3 sees token1 through token3
token4 [   1      1      1      1   ]   token4 sees all tokens

(1 = visible, 0 = masked)
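The mask and the attention computation above fit in a few lines of NumPy; a minimal sketch, where `causal_mask` and `masked_self_attention` are illustrative helpers rather than any particular library's API:

```python
import numpy as np

def causal_mask(n):
    """Additive mask M: 0 on and below the diagonal, -inf strictly above it."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask (no batching)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax; masked entries -> 0
    return weights @ V
```

Because `exp(-inf) = 0`, the masked positions drop out of the softmax exactly as described above.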
Token Embedding and Position Embedding
GPT's input processing is similar to the standard Transformer, but with one distinction: GPT uses learned positional embeddings rather than sinusoidal positional encodings:

\[
h_0 = x W_e + W_p
\]

Where:
- \(W_e \in \mathbb{R}^{V \times d}\) is the token embedding matrix, \(V\) is the vocabulary size, \(d\) is the model dimension
- \(W_p \in \mathbb{R}^{T \times d}\) is the positional embedding matrix, \(T\) is the maximum sequence length
- \(x\) is the one-hot representation of the input token sequence
Learned Positional Embeddings vs. Sinusoidal Positional Encodings
The original Transformer uses fixed sinusoidal positional encodings, which have the theoretical advantage of extrapolating to longer sequences. GPT chose learned positional embeddings; experiments showed that the two approaches perform comparably, but learned embeddings are simpler to implement. The downside is that the sequence length is limited to the maximum length seen during training.
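A minimal PyTorch sketch of this input layer; the sizes shown roughly match GPT-2 small but are illustrative placeholders:

```python
import torch
import torch.nn as nn

class GPTEmbedding(nn.Module):
    """h_0 = x W_e + W_p with a *learned* position table (sizes are illustrative)."""
    def __init__(self, vocab_size=50257, max_len=1024, d_model=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # W_e: V x d
        self.pos = nn.Embedding(max_len, d_model)     # W_p: T x d (learned, not sinusoidal)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)
```

The `max_len` table is exactly the limitation mentioned above: positions beyond it have no learned embedding.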
Feed-Forward Network
The FFN in each layer is the same as in the standard Transformer -- a position-wise two-layer fully connected network:

\[
\text{FFN}(x) = \text{GELU}(x W_1 + b_1)\, W_2 + b_2
\]
Note that GPT uses the GELU activation function instead of ReLU. GELU (Gaussian Error Linear Unit) is defined as:

\[
\text{GELU}(x) = x \cdot \Phi(x)
\]

where \(\Phi(x)\) is the cumulative distribution function of the standard normal distribution; in practice it is often computed via the tanh approximation \(0.5x\big(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)]\big)\).
Compared to ReLU, GELU is smoother, has non-zero gradients near zero, and leads to more stable training in practice.
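Both the exact definition and the tanh approximation are easy to check numerically; a small sketch using SciPy's normal CDF:

```python
import numpy as np
from scipy.stats import norm

def gelu_exact(x):
    """GELU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return x * norm.cdf(x)

def gelu_tanh(x):
    """The tanh approximation commonly used in GPT-style implementations."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # deviation is tiny (< 1e-3)
```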
Mathematical Foundations of Autoregressive Generation
Chain Rule Decomposition
The core idea of GPT is to decompose the joint probability into a product of conditional probabilities (the chain rule of probability):

\[
P(u_1, u_2, \ldots, u_n) = \prod_{i=1}^{n} P(u_i \mid u_1, \ldots, u_{i-1})
\]

At each step, the model predicts the conditional probability distribution of the next token:

\[
P(u_{i+1} \mid u_1, \ldots, u_i) = \text{softmax}\big(h_L^{(i)} W_e^T\big)
\]
Where \(h_L^{(i)}\) is the hidden state at position \(i\) in the final layer, and \(W_e\) is the token embedding matrix (weight tying).
Training Objective: Maximum Likelihood
Given a corpus \(\mathcal{U} = \{u_1, u_2, \ldots, u_n\}\), the training objective is to maximize the log-likelihood:

\[
L(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_1, \ldots, u_{i-1}; \Theta)
\]

Equivalently, minimize the cross-entropy loss:

\[
\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n} \log P(u_i \mid u_{<i})
\]
Perplexity
Language models are typically evaluated using perplexity (PPL), the exponentiated form of cross-entropy:

\[
\text{PPL} = \exp(\mathcal{L}_{\text{CE}}) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n} \log P(u_i \mid u_{<i})\right)
\]
Perplexity can be intuitively understood as: on average, how many equally probable candidate words the model must choose from at each step. Lower PPL indicates a better model.
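A tiny sketch makes this intuition concrete: a model that assigns every token probability \(1/4\) is "choosing among 4 equally likely candidates," so its PPL is exactly 4:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp(average negative log-likelihood per token)."""
    return float(np.exp(-np.mean(token_log_probs)))

print(perplexity(np.log([0.25, 0.25, 0.25, 0.25])))  # -> 4.0
```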
Fine-tuning Objective (GPT-1)
During fine-tuning, GPT-1 uses the language modeling loss as an auxiliary objective, jointly optimized with the task-specific loss:

\[
\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \mathcal{L}_{\text{LM}}
\]
Where \(\lambda\) is a weighting hyperparameter. This auxiliary loss helps to:
- Improve generalization during fine-tuning
- Accelerate convergence
- Prevent catastrophic forgetting of knowledge learned during pre-training
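As a rough sketch of the joint objective (assuming hypothetical `task_logits`, `lm_logits`, and their targets from a fine-tuning forward pass; the GPT-1 paper reports \(\lambda = 0.5\)):

```python
import torch.nn.functional as F

def finetune_loss(task_logits, task_labels, lm_logits, lm_targets, lam=0.5):
    """Joint GPT-1 fine-tuning objective: task loss + lambda * LM loss."""
    task_loss = F.cross_entropy(task_logits, task_labels)           # supervised objective
    lm_loss = F.cross_entropy(                                      # auxiliary LM objective
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1))
    return task_loss + lam * lm_loss
```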
Decoding Strategies
After training, generating text requires selecting tokens from the probability distribution. Common decoding strategies include:
1. Greedy Decoding
Select the highest-probability token at each step:

\[
u_t = \arg\max_{u} P(u \mid u_{<t})
\]
Pros: simple and fast. Cons: prone to repetition and cannot backtrack.
2. Beam Search
Maintain \(k\) (beam width) best candidate sequences; at each step, expand all candidates and keep the top \(k\) by total log-probability:

\[
\text{score}(u_{1:t}) = \sum_{i=1}^{t} \log P(u_i \mid u_{<i})
\]
Length normalization is typically applied to avoid a bias toward shorter sequences.
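A toy sketch of the procedure; `step_log_probs` is a hypothetical callable returning next-token log-probabilities for a prefix, and EOS handling is omitted for brevity:

```python
import numpy as np

def beam_search(step_log_probs, beam_width=3, max_len=10, alpha=0.7):
    """Toy beam search over a hypothetical next-token log-probability function."""
    beams = [([], 0.0)]                                  # (token sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            logp = step_log_probs(seq)
            for tok in np.argsort(logp)[-beam_width:]:   # expand only the top tokens
                candidates.append((seq + [int(tok)], score + float(logp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Length normalization: divide total log-prob by len^alpha
    return max(beams, key=lambda c: c[1] / len(c[0]) ** alpha)
```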
3. Top-k Sampling
Sample from the \(k\) highest-probability tokens according to their renormalized probabilities:

\[
u_t \sim \frac{P(u \mid u_{<t})}{\sum_{u' \in V_k} P(u' \mid u_{<t})}, \qquad u \in V_k
\]

where \(V_k\) is the set of the \(k\) most probable next tokens.
4. Top-p (Nucleus) Sampling
Select the smallest set of tokens \(V_p\) whose cumulative probability exceeds a threshold \(p\), then sample from this set after renormalization:

\[
V_p = \operatorname*{arg\,min}_{V' \subseteq V} |V'| \quad \text{s.t.} \quad \sum_{u \in V'} P(u \mid u_{<t}) \ge p
\]
The advantage of Top-p is that it adaptively adjusts the candidate set size: fewer tokens when the distribution is concentrated, more tokens when the distribution is flat.
5. Temperature Scaling
Adjust the "sharpness" of the probability distribution using a temperature parameter \(\tau\) applied to the logits \(z_i\):

\[
P_\tau(u_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
\]
- \(\tau < 1\): sharper distribution, approaching greedy (more deterministic)
- \(\tau = 1\): original distribution
- \(\tau > 1\): flatter distribution, increased randomness (more diverse)
- \(\tau \to 0\): equivalent to greedy decoding
- \(\tau \to \infty\): equivalent to a uniform distribution
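These strategies compose naturally into one function. The sketch below (an illustration, not any specific library's API) applies temperature, then top-k, then top-p filtering to a logits vector:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature + top-k + top-p sampling over a (vocab_size,) logits array."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k > 0:                                 # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p < 1.0:                               # nucleus: smallest set with cum. prob >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask

    probs /= probs.sum()                          # renormalize and sample
    return int(np.random.choice(len(probs), p=probs))
```

With `temperature` near 0 the function approaches greedy decoding; with `top_k=0, top_p=1.0` it samples from the full distribution.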
Forward Pass Numerical Example
To build concrete intuition for GPT's computation process, let us walk through a minimal example by hand.
Setup
- Sequence length \(n = 4\), token sequence \([t_1, t_2, t_3, t_4]\)
- Model dimension \(d_{\text{model}} = 4\)
- Single-head attention (\(h = 1\)), \(d_k = 4\)
Assume that after Token Embedding + Position Embedding we obtain an input matrix \(H_0 \in \mathbb{R}^{4 \times 4}\), where each row is the 4-dimensional representation of one token. The first two rows, which Step 5 below uses explicitly, are \([1.0, 0.5, 0.3, 0.1]\) for \(t_1\) and \([0.2, 1.0, 0.7, 0.4]\) for \(t_2\).
Step 1: Compute Q, K, V
For simplicity, assume \(W_Q = W_K = W_V = I\) (identity matrix), so \(Q = K = V = H_0\).
Step 2: Compute Attention Scores
First compute \(H_0 H_0^T\) (symmetric, since both factors are \(H_0\)):

\[
H_0 H_0^T = \begin{bmatrix} 1.35 & 0.95 & 1.07 & 1.04 \\ 0.95 & 1.69 & 1.34 & 1.43 \\ 1.07 & 1.34 & 2.09 & 1.67 \\ 1.04 & 1.43 & 1.67 & 1.90 \end{bmatrix}
\]

Divide by \(\sqrt{d_k} = \sqrt{4} = 2\):

\[
S = \begin{bmatrix} 0.675 & 0.475 & 0.535 & 0.520 \\ 0.475 & 0.845 & 0.670 & 0.715 \\ 0.535 & 0.670 & 1.045 & 0.835 \\ 0.520 & 0.715 & 0.835 & 0.950 \end{bmatrix}
\]
Step 3: Apply the Causal Mask
Set the upper-triangular entries to \(-\infty\) (in practice, a very large negative number):

\[
S_{\text{masked}} = \begin{bmatrix} 0.675 & -\infty & -\infty & -\infty \\ 0.475 & 0.845 & -\infty & -\infty \\ 0.535 & 0.670 & 1.045 & -\infty \\ 0.520 & 0.715 & 0.835 & 0.950 \end{bmatrix}
\]
Step 4: Softmax (Row-wise)
- Row 1: \(\text{softmax}([0.675]) = [1.0, 0, 0, 0]\) -- \(t_1\) can only attend to itself
- Row 2: \(\text{softmax}([0.475, 0.845]) \approx [0.41, 0.59, 0, 0]\) -- \(t_2\) attends more to itself
- Row 3: \(\text{softmax}([0.535, 0.670, 1.045]) \approx [0.26, 0.30, 0.44, 0]\) -- \(t_3\) attends most to itself
- Row 4: \(\text{softmax}([0.520, 0.715, 0.835, 0.950]) \approx [0.19, 0.23, 0.26, 0.29]\) -- \(t_4\) attends relatively evenly but slightly favors itself (the most recent token)
Step 5: Weighted Sum
- Row 1 (output for \(t_1\)): \(= 1.0 \times [1.0, 0.5, 0.3, 0.1] = [1.0, 0.5, 0.3, 0.1]\)
- Row 2 (output for \(t_2\)): \(= 0.41 \times [1.0, 0.5, 0.3, 0.1] + 0.59 \times [0.2, 1.0, 0.7, 0.4] \approx [0.53, 0.80, 0.54, 0.28]\)
- And so on...
Observation
The attention matrix \(A\) clearly shows the effect of the causal mask: it is lower-triangular. The output of \(t_1\) is determined entirely by itself; \(t_2\) fuses information from \(t_1\) and itself; and so on. The output at each position contains only information from "past" and "current" positions, never leaking any "future" information. This is the foundation of autoregressive generation.
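The first two rows of the example can be verified mechanically; this sketch reproduces the Row 1/Row 2 attention weights and the Step 5 output for \(t_2\):

```python
import numpy as np

# First two rows of H_0 (the ones used in Step 5); W_Q = W_K = W_V = I, so Q = K = V = H
H = np.array([[1.0, 0.5, 0.3, 0.1],
              [0.2, 1.0, 0.7, 0.4]])

scores = H @ H.T / np.sqrt(4)                     # Steps 1-2: scaled dot products
scores += np.triu(np.full((2, 2), -np.inf), 1)    # Step 3: causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # Step 4: row-wise softmax
print(weights.round(2))        # [[1.   0.  ]
                               #  [0.41 0.59]]
print((weights @ H).round(2))  # Step 5: row 2 -> [0.53 0.8  0.54 0.28]
```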
Evolution of the GPT Series
Parameter Comparison
| Version | Year | Layers \(L\) | Hidden Dim \(d\) | Attention Heads \(h\) | Parameters | Context Length | Training Data |
|---|---|---|---|---|---|---|---|
| GPT-1 | 2018 | 12 | 768 | 12 | 117M | 512 | BookCorpus (~5GB) |
| GPT-2 | 2019 | 48 | 1600 | 25 | 1.5B | 1024 | WebText (~40GB) |
| GPT-3 | 2020 | 96 | 12288 | 96 | 175B | 2048 | Mixed datasets (~570GB) |
| GPT-4 | 2023 | Undisclosed | Undisclosed | Undisclosed | Undisclosed (rumored ~1.8T) | 8K/32K/128K | Undisclosed |
GPT-1 (2018): Pioneer of the Pre-train + Fine-tune Paradigm
Paper: "Improving Language Understanding by Generative Pre-Training"
The core contribution of GPT-1 was proposing and validating the "unsupervised pre-training + supervised fine-tuning" paradigm:
- Pre-training: Language modeling on BookCorpus (approximately 7,000 books)
- Fine-tuning: Adapting to different downstream tasks through simple input format transformations
Input format transformations for different task types in GPT-1:
Text classification:  [BOS] text [EOS] --> classification head
Textual entailment:   [BOS] premise [SEP] hypothesis [EOS] --> classification head
Similarity:           [BOS] sentence A [SEP] sentence B [EOS] --> classification head
                      + [BOS] sentence B [SEP] sentence A [EOS] (symmetric processing)
Multiple choice:      for each option k: [BOS] context [SEP] option_k [EOS] --> classification head
GPT-1 achieved state-of-the-art results on 9 out of 12 tasks, demonstrating the effectiveness of the pre-training paradigm.
GPT-2 (2019): Zero-shot and the Power of Scale
Paper: "Language Models are Unsupervised Multitask Learners"
The core insight of GPT-2 was: a sufficiently large language model can perform downstream tasks without any fine-tuning (zero-shot).
Key improvements:
- Larger model: Scaled from 117M to 1.5B parameters
- Larger and better data: WebText (web pages crawled from highly upvoted Reddit links, quality-filtered)
- No more fine-tuning needed: Tasks are guided through natural language descriptions (the precursor to prompting)
- Architectural refinements:
  - Moved LayerNorm to the front of each sub-layer (Pre-Norm)
  - Added an extra LayerNorm after the final block
  - Scaled the initialization of residual-path weights by \(1/\sqrt{N}\) (\(N\) = number of residual layers)
A Unified View of Tasks
GPT-2 put forth an important idea: all NLP tasks can be viewed as conditional text generation. For example, translation can be formulated as "translate English to French: [English text] = [French text]." This idea was further developed in GPT-3 and T5.
GPT-3 (2020): The Rise of In-Context Learning
Paper: "Language Models are Few-Shot Learners"
GPT-3 was a landmark model with 175B parameters, demonstrating remarkable In-Context Learning capabilities:
- Zero-shot: Only a task description, no examples provided
- One-shot: One example provided
- Few-shot: Several examples provided (typically 10-100)
Few-shot example (sentiment analysis):
"Review: This movie was fantastic! Sentiment: Positive
Review: Terrible acting and boring plot. Sentiment: Negative
Review: A masterpiece of modern cinema. Sentiment:"
Model generates: " Positive"
Key findings from GPT-3:
- Scale brings qualitative change: The larger the model, the stronger the in-context learning ability (emergent abilities)
- No gradient updates required: Few-shot requires no fine-tuning; model parameters remain entirely unchanged; the model "learns" solely through examples in the prompt
- Task generalization: Can handle tasks never explicitly seen during training
GPT-3's training data was a mixture from multiple sources:
| Dataset | Size (tokens) | Training Weight |
|---|---|---|
| Common Crawl (filtered) | 410B | 60% |
| WebText2 | 19B | 22% |
| Books1 | 12B | 8% |
| Books2 | 55B | 8% |
| Wikipedia | 3B | 3% |
GPT-4 (2023): Multimodality and Further Scaling
OpenAI did not disclose the technical details of GPT-4 (the Technical Report contains almost no architectural information). What is known includes:
- Multimodal: Can process both text and image inputs
- Longer context: Supports context windows of 8K, 32K, and even 128K tokens
- Stronger reasoning: Achieves human-level performance on multiple professional exams
- Rumored architecture: Reportedly uses a Mixture of Experts (MoE) architecture with total parameters of approximately 1.8T, but only a subset of experts is activated during each forward pass
About GPT-4 Information
Since OpenAI has not disclosed GPT-4's architectural details, the "rumored" information above comes from unofficial sources and is not guaranteed to be accurate. OpenAI's Technical Report mentions that they chose not to disclose this information due to competitive and safety considerations.
In-Context Learning and Prompt Engineering
Definition of In-Context Learning
In-Context Learning (ICL) is a capability demonstrated by GPT-3: the model can perform new tasks by providing task descriptions and a few examples in the prompt, without updating any model parameters.
Three modes:
Zero-shot:
"将以下句子翻译成英文:今天天气真好。"
One-shot:
"将以下句子翻译成英文。
例子:我喜欢猫。 -> I like cats.
今天天气真好。 ->"
Few-shot:
"将以下句子翻译成英文。
例子1:我喜欢猫。 -> I like cats.
例子2:他在读书。 -> He is reading.
例子3:她很开心。 -> She is very happy.
今天天气真好。 ->"
Why Can GPT-3 Do In-Context Learning?
This remains an active research question. Several theories have been proposed:
1. Implicit Bayesian Inference
The model learns the distribution of tasks present in text during pre-training. Given few-shot examples, the model implicitly performs Bayesian inference -- inferring what the current task is based on the examples, and then executing that task.
2. Implicit Gradient Descent
Research has shown (Akyurek et al., 2022; von Oswald et al., 2023) that the forward pass of a Transformer can simulate gradient descent. Few-shot examples serve as "training data," and the model implicitly performs one (or several) gradient descent steps on this data during its forward pass.
3. Connection to Meta-Learning
In-Context Learning can be viewed as a form of meta-learning: the pre-training phase is meta-training (learning "how to learn"), and few-shot inference at test time is meta-testing (applying the learned "learning ability" to rapidly adapt to new tasks).
Chain-of-Thought (CoT) Prompting
Chain-of-Thought (Wei et al., 2022) is an important prompting technique that guides the model to perform step-by-step reasoning by demonstrating the reasoning process in examples:
Standard Prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, with 3 balls per can. How many tennis balls does he have now?
A: 11

CoT Prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, with 3 balls per can. How many tennis balls does he have now?
A: Roger started with 5 tennis balls. He bought 2 cans with 3 balls each, so he bought 2 x 3 = 6 balls.
5 + 6 = 11. The answer is 11.
CoT is particularly effective with large models (typically >100B parameters) and is considered a way of leveraging the model's own computational capacity for "slow thinking."
GPT vs. BERT Comparison
GPT and BERT represent the two major paradigms of pre-trained language models, embodying fundamentally different design philosophies.
| Dimension | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Architecture | Transformer Decoder (Masked Self-Attention) | Transformer Encoder (Bidirectional Self-Attention) |
| Attention Direction | Unidirectional (left context only) | Bidirectional (full context) |
| Pre-training Objective | Autoregressive language modeling (predict next token) | Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) |
| Best Suited For | Generation tasks (text generation, dialogue, translation) | Understanding tasks (classification, NER, extractive QA) |
| Usage | GPT-1: fine-tuning; GPT-2/3: Prompting / In-Context Learning | Fine-tuning (with task-specific heads) |
| Scaling Trend | Consistently benefits from increased scale | Diminishing returns at larger scales (relatively) |
GPT (autoregressive, unidirectional):        BERT (masked LM, bidirectional):
[I] --> [like] --> [cats]                    [I] <-> [MASK] <-> [cats]
 |         |          |                       |        |          |
 v         v          v                       v        v          v
P(like)  P(cats)   P(EOS)                         P([MASK] = like)
                                             (sees both "I" and "cats" at once)
Cross-Reference
For a detailed analysis of BERT's architecture and pre-training strategies, see the BERT-related notes in the Transformer series. For a discussion of the pre-training paradigms behind both models (autoregressive vs. autoencoding), see the pre-training notes in the Foundation Models section.
Why Did Decoder-only Ultimately Prevail?
Although BERT dominated NLP leaderboards from 2018 to 2020, the Decoder-only architecture (GPT series) gradually became the mainstream choice for large models. The reasons include:
- Unified generative paradigm: All tasks can be cast as text generation, eliminating the need to design task-specific head structures
- Zero/Few-shot capability: New tasks can be completed without fine-tuning, significantly lowering the barrier to application
- Better scaling behavior: Experiments show that Decoder-only models exhibit smoother and more sustained performance improvement as parameters increase
- Training efficiency: The autoregressive objective computes the loss on every token, whereas MLM only computes the loss on masked tokens (typically 15%)
- Emergent abilities: Capabilities such as in-context learning and chain-of-thought reasoning primarily emerge in large-scale autoregressive models
Scaling Laws and Insights from GPT
The evolution of the GPT series has profoundly revealed the role of scaling laws in language models.
Kaplan Scaling Law
OpenAI's research (Kaplan et al., 2020) found that language model performance (cross-entropy loss) follows a power-law relationship with three factors:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
\]
Where \(N\) is the number of parameters, \(D\) is the amount of data, and \(C\) is the compute budget.
Key findings:
- Model performance improves smoothly and predictably as a power law with increasing scale
- Under a fixed compute budget, scaling up the model is more efficient than adding more data (later revised by Chinchilla)
- Architectural details (layer-to-width ratio, number of attention heads, etc.) have a relatively minor impact on performance
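As a rough illustration of the \(L(N)\) term, the sketch below plugs GPT-scale parameter counts into the power law with the constants reported by Kaplan et al. (2020), \(N_c \approx 8.8 \times 10^{13}\) and \(\alpha_N \approx 0.076\); the outputs are illustrative only, since actual losses depend on the tokenizer, data, and evaluation set:

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, constants from Kaplan et al. (2020)."""
    return (n_c / n_params) ** alpha_n

for name, n in [("GPT-1", 117e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: predicted loss ~ {kaplan_loss(n):.2f} nats/token")
```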
Observing Scaling Effects Across the GPT Series
Parameter count (log scale):

GPT-1          GPT-2          GPT-3            GPT-4
117M           1.5B           175B             ~1.8T (rumored)
  |              |              |                |
  +-----13x------+-----117x-----+------~10x------+
  |              |              |                |
Fine-tuning    Zero-shot      In-Context       Multimodal
SOTA on        early          Learning         strong reasoning
9/12 tasks     emergent       Chain-of-        expert-exam-level
               abilities      Thought          performance
Each leap in scale has brought qualitative -- not merely quantitative -- changes. This is precisely what makes scaling laws so compelling.
Reflections and Discussion
GPT's Impact on AI Development
- Paradigm shift: From "training a specialized model for each task" to "solving all tasks with a single general-purpose model," GPT has led the foundation model paradigm
- Scaling above all: The success of the GPT series has reinforced the belief that "bigger is better," fueling the compute arms race
- Rise of prompt engineering: Unlocking model capabilities no longer depends on fine-tuning, but on how prompts are designed
- AI safety and alignment: The powerful capabilities of the GPT series have also raised safety concerns (misinformation, bias, misuse, etc.), driving alignment research (RLHF, etc.)
Open Questions
- Why is the Transformer's scaling behavior so smooth? The theoretical explanation for this phenomenon remains incomplete
- What is the true nature of In-Context Learning? Is it genuine "learning" or pattern matching?
- The data ceiling: When high-quality text data is nearly exhausted, can scaling continue? Is synthetic data the way forward?
- Efficiency concerns: Autoregressive generation is inherently sequential, producing only one token at a time. Is there a more efficient generation paradigm?
- Predictability of emergent abilities: Can we predict in advance at what scale a model will develop a particular capability?
Summary
From a 117M-parameter model in 2018 to today's ultra-large-scale multimodal models, the GPT series has not only pushed the boundaries of natural language processing technically, but has fundamentally changed how we think about AI systems. The paradigm of "pre-trained large models + prompting/alignment" has become the central pathway in current AI research. Understanding GPT's architectural design, training methodology, and scaling behavior is an essential foundation for understanding modern AI development.