BERT: Bidirectional Encoder Representations from Transformers
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model proposed by Google in 2018. It marked the formal transition of NLP from the era of "feature engineering" to the "pre-training + fine-tuning" paradigm. BERT set new state-of-the-art results on 11 NLP benchmark tasks and profoundly reshaped the direction of natural language processing research.
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" -- Devlin et al., Google AI Language, 2018
Background and Motivation
Evolution of NLP Pre-training
Before BERT, the NLP field followed a clear trajectory of pre-training techniques:
| Stage | Method | Year | Core Idea | Limitation |
|---|---|---|---|---|
| 1 | Word2Vec / GloVe | 2013–2014 | Static word vectors with a fixed representation per word | Cannot handle polysemy ("apple" — fruit or company?) |
| 2 | ELMo | 2018 | Context-dependent word representations via bidirectional LSTMs | Feature concatenation, not deeply bidirectional; LSTM-based |
| 3 | GPT-1 | 2018 | Unidirectional language model pre-training with Transformer Decoder | Unidirectional (left-to-right), cannot see both sides of context simultaneously |
| 4 | BERT | 2018 | Bidirectional pre-training with Transformer Encoder | Pretrain-finetune discrepancy (the [MASK] problem) |
Why Is Bidirectionality Needed?
GPT-1 uses a standard autoregressive language model that can only see the preceding \(t-1\) tokens when predicting the \(t\)-th token:

\[P(w_1, w_2, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1})\]
This means the model can only leverage the left context when encoding a token's representation. Consider the following two sentences:
(1) I went to the bank to deposit money. → "bank" = financial institution
(2) I took a walk along the bank of the river. → "bank" = riverbank
In sentence (1), if the model has only seen "I went to the", it cannot determine the meaning of "bank"; but if it can also see "to deposit money" on the right, disambiguation becomes straightforward. This is the value of bidirectional context.
ELMo's 'bidirectionality' differs from BERT's
ELMo trains a forward LSTM and a backward LSTM separately, then concatenates the hidden states from both directions. This is a concatenation of two unidirectional models, not true deep bidirectional interaction. In BERT, the Self-Attention at every layer allows each token to attend to all positions in the sequence simultaneously, achieving genuine bidirectionality.
BERT's Core Innovation
Traditional language models can only be unidirectional because allowing the model to see both left and right context would let it "see the answer" when predicting a given token. BERT proposed an elegant solution: the Masked Language Model (MLM).
The key idea: instead of predicting the next token, randomly mask some tokens in the input and have the model predict the masked tokens based on bidirectional context. This avoids the "seeing the answer" problem while enabling truly bidirectional pre-training.
BERT Architecture in Detail
Overall Architecture
BERT uses only the Encoder portion of the Transformer (no Decoder), stacking multiple Transformer Encoder layers to build deep bidirectional representations.
BERT Overall Architecture
═══════════════════════════════════════════════════════
Input text: [CLS] I like natural language processing [SEP]
| | | | | | |
v v v v v v v
┌─────────────────────────────────────────┐
│           Token Embedding                │
│         + Segment Embedding              │
│         + Position Embedding             │
└─────────────────────────────────────────┘
| | | | | | |
v v v v v v v
┌─────────────────────────────────────────┐
│ Transformer Encoder Layer 1 │
│ [Multi-Head Self-Attention + FFN] │
└─────────────────────────────────────────┘
| | | | | | |
v v v v v v v
┌─────────────────────────────────────────┐
│ Transformer Encoder Layer 2 │
│ [Multi-Head Self-Attention + FFN] │
└─────────────────────────────────────────┘
| | | | | | |
: : : : : : :
| | | | | | |
v v v v v v v
┌─────────────────────────────────────────┐
│ Transformer Encoder Layer L │
│ [Multi-Head Self-Attention + FFN] │
└─────────────────────────────────────────┘
| | | | | | |
v v v v v v v
T_CLS T_1 T_2 T_3 T_4 T_5 T_SEP
|
v
Classification / pooled output (for downstream tasks)
Why does BERT use only the Encoder and not the Decoder?
The Transformer Decoder includes a causal mask that restricts each position to attending only to preceding positions — exactly what BERT aims to avoid. BERT's goal is to let every token attend to all positions in the sequence, so only the Encoder's unmasked Self-Attention is needed.
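To make this Encoder-only stacking concrete, here is a minimal PyTorch sketch using BERT-Base hyperparameters (an illustrative skeleton built from `torch.nn` modules, not Google's released implementation):

```python
import torch
import torch.nn as nn

# Illustrative Encoder-only stack with BERT-Base hyperparameters.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size H
    nhead=12,              # attention heads A
    dim_feedforward=3072,  # FFN size 4H
    activation="gelu",     # BERT uses GELU, not ReLU
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # L = 12 layers

x = torch.randn(2, 16, 768)  # (batch, seq_len, hidden) input embeddings
out = encoder(x)             # no causal mask: every token attends to all positions
print(out.shape)             # torch.Size([2, 16, 768])
```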
Input Representation
BERT's input representation is the element-wise sum of three embeddings:
Token:       [CLS]  I     like    cats   [SEP]  cats   are    animals  [SEP]
               |    |      |       |      |      |      |        |       |
Token Emb:   E_CLS  E_I   E_like  E_cats E_SEP  E_cats E_are  E_animals E_SEP
               +    +      +       +      +      +      +        +       +
Segment:     E_A    E_A   E_A     E_A    E_A    E_B    E_B      E_B     E_B
               +    +      +       +      +      +      +        +       +
Position:    E_0    E_1   E_2     E_3    E_4    E_5    E_6      E_7     E_8
               =    =      =       =      =      =      =        =       =
Final input: [ each position yields a d_model-dimensional vector ]
Description of the three embeddings:
| Embedding | Dimensions | Description |
|---|---|---|
| Token Embedding | \(V \times d_{\text{model}}\) | Word embeddings after WordPiece tokenization. Vocabulary size \(V = 30522\) |
| Segment Embedding | \(2 \times d_{\text{model}}\) | Distinguishes sentence A from sentence B. Only two possible values: \(E_A\) and \(E_B\) |
| Position Embedding | \(L_{\max} \times d_{\text{model}}\) | Learned positional embeddings (unlike the sinusoidal positional encoding in the original Transformer), \(L_{\max} = 512\) |
Difference in Position Embedding
The original Transformer uses fixed sinusoidal/cosine positional encoding, whereas BERT uses learned positional embeddings. Experiments show little difference in performance between the two approaches, but the learned version is simpler to implement. The downside is that the maximum sequence length is bounded by \(L_{\max}\) used during training and cannot extrapolate to longer sequences.
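A minimal sketch of how the three embeddings are summed (dimensions follow BERT-Base; the function and variable names such as `embed`, `token_ids`, and `segment_ids` are illustrative):

```python
import torch
import torch.nn as nn

V, L_max, d_model = 30522, 512, 768

tok_emb = nn.Embedding(V, d_model)      # Token Embedding
seg_emb = nn.Embedding(2, d_model)      # Segment Embedding (sentence A / B)
pos_emb = nn.Embedding(L_max, d_model)  # learned Position Embedding

def embed(token_ids, segment_ids):
    # token_ids, segment_ids: (batch, seq_len)
    positions = torch.arange(token_ids.size(1), device=token_ids.device)
    x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
    return x  # real BERT additionally applies LayerNorm and dropout here

token_ids = torch.tensor([[101, 2023, 2003, 1037, 7953, 102]])  # illustrative ids
segment_ids = torch.zeros_like(token_ids)
print(embed(token_ids, segment_ids).shape)  # torch.Size([1, 6, 768])
```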
Special Tokens
BERT introduces two key special tokens:
[CLS] (Classification Token)
- Always placed at the very beginning of the input sequence
- After passing through \(L\) Transformer Encoder layers, the output vector at the [CLS] position \(T_{\text{CLS}} \in \mathbb{R}^{d_{\text{model}}}\) serves as the aggregate representation of the entire sequence
- For classification tasks, a classification head is applied directly on top of \(T_{\text{CLS}}\)
- Why can [CLS] represent the entire sequence? Because Self-Attention allows it to attend to every token in the sequence
[SEP] (Separator Token)
- Placed at the end of each sentence to separate two sentences
- Single-sentence input: [CLS] Sentence A [SEP]
- Sentence-pair input: [CLS] Sentence A [SEP] Sentence B [SEP]
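With the Hugging Face `transformers` library (assuming it is installed and the `bert-base-uncased` checkpoint can be downloaded), the special tokens and segment IDs of a sentence pair can be inspected directly:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence-pair input: [CLS] A [SEP] B [SEP]
enc = tokenizer("The cat sat on the mat.", "It looked very comfortable.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# roughly: ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', ..., '[SEP]']
# (exact wordpieces depend on the vocabulary)
print(enc["token_type_ids"])
# 0s for sentence A (including [CLS] and the first [SEP]), 1s for sentence B
```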
Model Specifications
BERT was released in two versions:
| Parameter | BERT-Base | BERT-Large |
|---|---|---|
| Number of Transformer layers \(L\) | 12 | 24 |
| Hidden size \(H\) (\(d_{\text{model}}\)) | 768 | 1024 |
| Number of attention heads \(A\) | 12 | 16 |
| Dimension per head \(d_k = H/A\) | 64 | 64 |
| Feed-Forward dimension \(d_{ff}\) | 3072 (\(4H\)) | 4096 (\(4H\)) |
| Total parameters | 110M | 340M |
| Maximum sequence length | 512 | 512 |
BERT-Base has a parameter count comparable to GPT-1 (110M vs. 117M), enabling a fair comparison.
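As a rough sanity check on the 110M figure, BERT-Base's parameter count can be estimated from the table above. The back-of-the-envelope sketch below ignores biases, LayerNorm parameters, and the pooler, so it lands slightly under the official number:

```python
V, L_max, H, L, d_ff = 30522, 512, 768, 12, 3072

embeddings = (V + L_max + 2) * H   # token + position + segment embedding tables
attention_per_layer = 4 * H * H    # W_Q, W_K, W_V, W_O
ffn_per_layer = 2 * H * d_ff       # W_1 and W_2
per_layer = attention_per_layer + ffn_per_layer

total = embeddings + L * per_layer
print(f"{total / 1e6:.1f}M parameters")  # roughly 109M, close to the reported 110M
```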
Pre-training Tasks
BERT is pre-trained with two unsupervised tasks: MLM and NSP.
Task 1: Masked Language Model (MLM)
MLM is BERT's most central innovation. It randomly masks 15% of the input tokens and trains the model to predict the original tokens at those positions.
Masking strategy:
For the selected 15% of tokens, rather than simply replacing all of them with [MASK], the following strategy is used:
| Probability | Operation | Example (original word: "cats") | Purpose |
|---|---|---|---|
| 80% | Replace with [MASK] | "I like [MASK]" | Train the model to infer the masked word from context |
| 10% | Replace with a random word | "I like cars" | Prevent the model from making predictions only when it sees [MASK] |
| 10% | Keep unchanged | "I like cats" | Teach the model to preserve correct representations, biasing it toward the real input |
Why not use [MASK] for everything?
If all selected tokens were replaced with [MASK], a serious problem would arise: pretrain-finetune discrepancy. During fine-tuning, [MASK] tokens never appear in the input, and the model would never have learned to handle "normal" input. The 80/10/10 mixed strategy mitigates this: the 10% kept unchanged teaches the model to represent normal input, while the 10% random replacement prevents the model from activating its prediction capability only upon seeing [MASK].
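A minimal sketch of the 80/10/10 strategy (simplified to sample token by token; the original data pipeline also handles WordPiece boundaries and caps the number of masked tokens per sequence):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels is None for unselected positions."""
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)              # not selected: contributes nothing to the loss
            continue
        labels.append(tok)                   # selected: model must predict the original token
        r = random.random()
        if r < 0.8:
            masked.append("[MASK]")          # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(random.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(tok)               # 10%: keep unchanged
    return masked, labels

vocab = ["cats", "cars", "language", "natural", "apples"]
print(mask_tokens("[CLS] i like natural language processing [SEP]".split(), vocab))
```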
MLM loss function:
Only the cross-entropy loss over masked tokens is computed (non-masked positions do not contribute to the loss):

\[\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\backslash \mathcal{M}})\]

where \(\mathcal{M}\) is the set of masked token positions and \(\mathbf{x}_{\backslash \mathcal{M}}\) denotes all input tokens except the masked ones.
Detailed computation process:
Original input:  [CLS] I like natural language processing [SEP]
                      ↓ "natural" and "processing" are randomly selected
Masked input:    [CLS] I like [MASK] language processing [SEP]
                      ↓ "processing" is replaced with a random word
Modified input:  [CLS] I like [MASK] language apples [SEP]
                      ↓ fed into the BERT Encoder
Output:          T_CLS  T_1  T_2  T_3  T_4  T_5  T_SEP
                                   |         |
                                   v         v
                          predict "natural"  predict "processing"
                              (softmax)          (softmax)
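With a pre-trained checkpoint, this fill-in-the-blank behavior can be observed directly via the `transformers` fill-mask pipeline (assumes network access to download `bert-base-uncased`; exact predictions and scores depend on the checkpoint):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("I like [MASK] language processing."):
    print(cand["token_str"], round(cand["score"], 3))
# Expect high-probability fillers such as "natural"; outputs vary by checkpoint.
```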
Task 2: Next Sentence Prediction (NSP)
NSP trains the model to understand inter-sentence relationships. Sentence pairs are constructed during training:
| Label | Construction | Proportion | Example |
|---|---|---|---|
| IsNext | Sentence B is the actual next sentence after A | 50% | A="The cat sat on the mat." B="It looked very comfortable." |
| NotNext | Sentence B is randomly sampled from the corpus | 50% | A="The cat sat on the mat." B="The stock market surged today." |
Binary classification is performed using the output at the [CLS] position \(T_{\text{CLS}}\):

\[P(\text{IsNext} \mid A, B) = \text{softmax}(T_{\text{CLS}} W_{\text{NSP}}), \quad W_{\text{NSP}} \in \mathbb{R}^{d_{\text{model}} \times 2}\]

Total loss:

\[\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}\]
The NSP Controversy
Subsequent research (notably RoBERTa) found that NSP provides little benefit or is even harmful for most downstream tasks. Possible reasons include: (1) the "random negative sample" task is too easy — the model only needs to determine whether two sentences share the same topic; (2) NSP and MLM may have conflicting objectives. RoBERTa dropped NSP entirely and achieved better results.
Training Details
| Item | Setting |
|---|---|
| Training data | BooksCorpus (800M words) + English Wikipedia (2500M words) |
| Total tokens | Approximately 3.3 billion |
| Batch Size | 256 sequences |
| Training steps | 1,000,000 steps |
| Learning rate | \(1 \times 10^{-4}\) (Adam, warmup + linear decay) |
| Training time | 4 days (16 Cloud TPU chips / BERT-Base), 4 days (64 Cloud TPU chips / BERT-Large) |
| Sequence length | Length 128 for the first 90% of steps, length 512 for the last 10% |
Forward Pass: A Numerical Example
To build intuition for BERT's computation, we walk through a minimal example.
Setup: Sequence length \(n = 4\), \(d_{\text{model}} = 4\), single-head attention (\(A = 1\)), single Encoder layer.
Step 1: Input Embedding
Assume the input is [CLS] I like [SEP]. Summing the three embeddings (token + segment + position) yields the input matrix \(X \in \mathbb{R}^{4 \times 4}\); suppose its first row (the [CLS] position) is \((0.21,\ 0.22,\ 0.41,\ 0.42)\).
Step 2: Self-Attention Computation
Assume the weight matrices \(W^Q, W^K, W^V \in \mathbb{R}^{4 \times 4}\) are all simplified identity matrices (for demonstration purposes only), so \(Q = X, K = X, V = X\).
(a) Compute the attention scores \(S = QK^\top\). For example, \(S_{11} = 0.21^2 + 0.22^2 + 0.41^2 + 0.42^2 = 0.04 + 0.05 + 0.17 + 0.18 = 0.44\)
(b) Scale: divide by \(\sqrt{d_k} = \sqrt{4} = 2\)
(c) Softmax: apply softmax along each row
The attention weights in each row sum to 1. Note that in BERT's Encoder there is no causal mask, so every position can attend to all positions (including itself and subsequent positions).
(d) Weighted sum:
The key difference between BERT and GPT lies right here
In GPT, after computing the attention scores, a causal mask is applied to set all values above the diagonal to \(-\infty\), ensuring each position can only attend to itself and previous positions. BERT uses no such mask (aside from padding masks), allowing every position to freely attend to the entire sequence — this is what "bidirectional" means.
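A small NumPy sketch of this contrast (illustrative random inputs; the only difference between the two computations is the causal mask applied to the scores):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.randn(4, 4)      # 4 tokens, d_model = 4; W_Q = W_K = W_V = I as above
Q, K, V = X, X, X
scores = Q @ K.T / np.sqrt(4)  # scaled dot-product attention scores

# BERT (Encoder): no causal mask -- every position attends to all positions
bert_weights = softmax(scores)

# GPT (Decoder): causal mask -- positions above the diagonal are set to -inf
causal = np.triu(np.ones((4, 4), dtype=bool), k=1)
gpt_weights = softmax(np.where(causal, -np.inf, scores))

print(bert_weights.round(2))   # dense rows, each summing to 1
print(gpt_weights.round(2))    # lower-triangular rows, each summing to 1
```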
Step 3: FFN + Residual + LayerNorm
The output from Self-Attention is then passed through a position-wise Feed-Forward Network:

\[\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2, \quad W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}, \; W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}\]

Note that BERT uses GELU (Gaussian Error Linear Unit) as its activation function, rather than the ReLU used in the original Transformer:

\[\text{GELU}(x) = x \cdot \Phi(x)\]

where \(\Phi(x)\) is the standard normal cumulative distribution function. Each sub-layer (Self-Attention and FFN) is wrapped with a residual connection and Layer Normalization:

\[\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))\]
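Continuing the sketch with random stand-in values, Step 3 can be written out as follows (the `gelu` here uses the tanh approximation from the original BERT code, and the learnable LayerNorm scale/shift is omitted):

```python
import numpy as np

d_model, d_ff = 4, 16
rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of GELU, as used in the original BERT implementation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-12):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # gamma/beta omitted for brevity

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

x_in = rng.standard_normal((4, d_model))       # stand-in for the layer input
attn_out = rng.standard_normal((4, d_model))   # stand-in for the Self-Attention output
x = layer_norm(x_in + attn_out)                # residual + LayerNorm after attention

W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
out = layer_norm(x + ffn(x, W1, b1, W2, b2))   # residual + LayerNorm after FFN
print(out.shape)                               # (4, 4)
```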
Fine-tuning
BERT's pre-trained model can be adapted to a wide variety of downstream tasks by simply adding an output layer, requiring only a small amount of labeled data for fine-tuning.
Core Idea of Fine-tuning
Pre-trained BERT (unsupervised, large-scale corpus)
            |
            v
┌─────────────────────────────────────────┐
│ Freeze or fine-tune BERT's parameters   │
│ + add a task-specific output layer      │
└─────────────────────────────────────────┘
            |
            v
Downstream task (small amount of labeled data)
All parameters (both BERT's original parameters and the newly added output layer) are fine-tuned end-to-end using labeled data from the downstream task. The learning rate is typically set very low (\(2 \times 10^{-5}\) to \(5 \times 10^{-5}\)) to avoid destroying the knowledge acquired during pre-training.
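A condensed sketch of this recipe with the Hugging Face `transformers` library (dataset loading, batching, scheduling, and evaluation are omitted; the toy texts and labels are illustrative):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # small LR protects pre-trained weights

texts = ["this movie is so good!", "what a waste of time"]   # toy labeled examples
labels = torch.tensor([1, 0])

model.train()
for _ in range(3):                                           # a few epochs are usually enough
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=labels)                      # classification head on the [CLS] representation
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```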
Task 1: Text Classification
Input: [CLS] this movie is so good ! [SEP]
        |     |    |    |  |   |  |    |
        v     v    v    v  v   v  v    v
┌──────────────────────────────────────┐
│             BERT Encoder             │
└──────────────────────────────────────┘
        |     |    |    |  |   |  |    |
        v
     T_[CLS]   (outputs at the other positions are ignored)
        |
        v
┌──────────────────┐
│ Linear + Softmax │
│  W ∈ R^(H x C)   │
└──────────────────┘
        |
        v
Sentiment label: Positive (p=0.92)
\[P(y \mid \text{input}) = \text{softmax}(T_{\text{[CLS]}} W), \quad W \in \mathbb{R}^{H \times C}\]

where \(C\) is the number of classes and \(H\) is the hidden dimension.
Task 2: Sequence Labeling (NER / POS Tagging)
Input: [CLS] Steve Jobs founded Apple in California [SEP]
        |     |     |      |      |    |      |       |
        v     v     v      v      v    v      v       v
┌──────────────────────────────────────────┐
│               BERT Encoder               │
└──────────────────────────────────────────┘
        |     |     |      |      |    |      |       |
              v     v      v      v    v      v
             T_1   T_2    T_3    T_4  T_5    T_6
              |     |      |      |    |      |
              v     v      v      v    v      v
┌─────────────────────────────┐
│ per token: Linear + Softmax │
└─────────────────────────────┘
              |     |      |      |    |      |
              v     v      v      v    v      v
           B-PER  I-PER    O    B-ORG  O    B-LOC
Each token's output \(T_i\) independently passes through a classification head to predict its label:

\[P(y_i \mid \text{input}) = \text{softmax}(T_i W), \quad W \in \mathbb{R}^{H \times K}\]

where \(K\) is the number of label types.
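A corresponding sketch for sequence labeling with `transformers` (the label set is illustrative, and the classification head here is untrained, so the printed labels are essentially random until fine-tuning):

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "B-ORG"]   # illustrative label set
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased",
                                                   num_labels=len(labels))

enc = tokenizer("Steve Jobs founded Apple in California", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                     # (1, seq_len, num_labels)

pred = logits.argmax(-1)[0]                          # one label id per token
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
print(list(zip(tokens, [labels[int(i)] for i in pred])))
```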
Task 3: Question Answering (Extractive QA)
Given a passage (context) and a question, the model must locate the start position and end position of the answer within the passage.
Input: [CLS] question tokens [SEP] context tokens [SEP]
        |        |             |        |            |
        v        v             v        v            v
┌──────────────────────────────────────────┐
│               BERT Encoder               │
└──────────────────────────────────────────┘
        |        |             |        |            |
                               v
               output at each context position
                               |
                       ┌───────┴───────┐
                       v               v
                Start vector S    End vector E
                 (S ∈ R^H)         (E ∈ R^H)
                       |               |
                       v               v
               dot product with   dot product with
               each token output  each token output
                       |               |
                       v               v
                    softmax         softmax
                       |               |
                       v               v
                start position     end position
\[P_{\text{start}}(i) = \frac{\exp(S \cdot T_i)}{\sum_j \exp(S \cdot T_j)}, \qquad P_{\text{end}}(i) = \frac{\exp(E \cdot T_i)}{\sum_j \exp(E \cdot T_j)}\]

where \(S, E \in \mathbb{R}^H\) are two learnable vectors. The final answer is the span \([i, j]\) that maximizes \(P_{\text{start}}(i) \times P_{\text{end}}(j)\) (subject to \(j \geq i\)).
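A sketch of this span-selection logic with `transformers` (in practice a checkpoint already fine-tuned on SQuAD would be loaded; the raw `bert-base-uncased` weights used here will give arbitrary spans):

```python
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")  # normally a SQuAD-fine-tuned checkpoint

question = "Where was the company founded?"
context = "The company was founded in California in 1976."
enc = tokenizer(question, context, return_tensors="pt")  # [CLS] question [SEP] context [SEP]

with torch.no_grad():
    out = model(**enc)

# Pick the span [i, j] with the highest start + end score, subject to j >= i
# (equivalent to maximizing P_start(i) * P_end(j), since softmax is monotonic).
start, end = out.start_logits[0], out.end_logits[0]
scores = start[:, None] + end[None, :]
scores.masked_fill_(torch.tril(torch.ones_like(scores), -1).bool(), float("-inf"))
i, j = divmod(scores.argmax().item(), scores.size(1))
print(tokenizer.decode(enc["input_ids"][0][i : j + 1]))
```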
Task 4: Sentence-Pair Relationship (NLI / Semantic Similarity)
Input: [CLS] Sentence A [SEP] Sentence B [SEP]
        |        |        |       |        |
        v        v        v       v        v
┌──────────────────────────────────────┐
│             BERT Encoder             │
└──────────────────────────────────────┘
                    |
                    v
                 T_[CLS]
                    |
                    v
        ┌──────────────────┐
        │ Linear + Softmax │
        └──────────────────┘
                    |
                    v
    Entailment / Contradiction / Neutral
Recommended Fine-tuning Hyperparameters
| Parameter | Recommended Range |
|---|---|
| Learning rate | \(2 \times 10^{-5}\), \(3 \times 10^{-5}\), \(5 \times 10^{-5}\) |
| Batch Size | 16, 32 |
| Epochs | 2, 3, 4 |
| Warmup | 10% of total steps |
| Maximum sequence length | 128 (short text) or 512 (long text) |
The Key Advantage of Fine-tuning
Compared to training from scratch, BERT fine-tuning requires very little labeled data and compute. A classification task fine-tuned on a 16GB GPU with BERT-Base typically needs only a few thousand labeled examples and less than an hour of training to achieve strong results. This dramatically lowers the barrier to entry for NLP.
Limitations of BERT and Subsequent Improvements
Main Limitations
1. Pretrain-Finetune Discrepancy
During pre-training, the input contains [MASK] tokens, but [MASK] never appears during fine-tuning or inference. Although the 80/10/10 strategy partially mitigates this issue, the inconsistency remains.
2. Independence Assumption
MLM assumes that masked tokens are mutually independent. For example, given the input "New [MASK] [MASK]", MLM predicts each masked token separately without considering their interdependence (when in reality "York" and "City" should be predicted jointly).
3. Poorly Designed NSP Task
Randomly sampled negative examples are too easy, and the model may only need to assess topic consistency rather than perform genuine inter-sentence reasoning.
4. Low Training Efficiency
Only 15% of tokens contribute to the loss at each step, whereas autoregressive models (e.g., GPT) compute the loss over every token, providing a denser training signal.
5. Unsuitable for Generation Tasks
BERT is an Encoder-only architecture that lacks autoregressive generation capability, making it ill-suited for text generation, translation, and similar tasks.
Subsequent Improvements
| Model | Year | Core Improvement | Key Changes |
|---|---|---|---|
| RoBERTa | 2019 | More robust training strategy | Removed NSP; dynamic masking; larger batch size; more data; longer training |
| ALBERT | 2019 | Parameter efficiency | Embedding factorization (\(V \times H \to V \times E + E \times H\)); cross-layer parameter sharing |
| DistilBERT | 2019 | Knowledge distillation | 6 layers instead of 12, retaining 97% performance at 60% faster speed |
| ELECTRA | 2020 | Replaced Token Detection | A generator produces replacement tokens; a discriminator determines whether each token has been replaced. All tokens participate in training |
| DeBERTa | 2020 | Disentangled attention | Content and position compute attention separately; enhanced mask decoder |
| SpanBERT | 2020 | Contiguous span masking | Masks contiguous spans instead of random tokens; removes NSP; adds Span Boundary Objective |
ELECTRA's Clever Design
ELECTRA fundamentally solves BERT's 15% training efficiency problem. It uses a small generator (similar to a small BERT) to "fill in" masked tokens, then has a discriminator (the main model) determine whether each token in the sequence is original or was replaced by the generator. This way, 100% of tokens participate in training, greatly improving efficiency. Under equivalent compute budgets, ELECTRA significantly outperforms BERT.
BERT vs. GPT Comparison
These represent the two most important technical trajectories in NLP pre-training. Understanding their differences helps grasp the overall development of the field.
| Dimension | BERT | GPT |
|---|---|---|
| Architecture | Transformer Encoder | Transformer Decoder |
| Context direction | Bidirectional | Unidirectional (Left-to-right) |
| Pre-training objective | MLM + NSP | Autoregressive LM \(P(w_t \| w_{<t})\) |
| Attention mask | No causal mask (fully visible) | Causal mask (sees only the past) |
| Special input tokens | [CLS], [SEP], Segment Emb | Only BOS/EOS |
| Strengths | Understanding tasks: classification, NER, QA, NLI | Generation tasks: text generation, dialogue, translation |
| Fine-tuning approach | Add output layer, end-to-end fine-tuning | Early: add output layer for fine-tuning; Later (GPT-3+): in-context learning |
| Training efficiency | Low (only 15% of tokens contribute to loss) | High (every token contributes to loss) |
| Scaling trend | Relatively slow parameter growth | Rapid parameter growth (GPT-3: 175B, GPT-4: undisclosed) |
| Notable successors | RoBERTa, DeBERTa, ELECTRA | GPT-2, GPT-3, GPT-4, ChatGPT |
BERT (Encoder-only)                GPT (Decoder-only)
┌──────────────────┐               ┌──────────────────┐
│  Fully visible   │               │ Causally masked  │
│  Self-Attention  │               │  Self-Attention  │
│                  │               │       ╲          │
│   ○ ○ ○ ○        │               │   ○ ╲ ╲ ╲        │
│   each token     │               │   each token     │
│   sees all       │               │   sees only left │
└──────────────────┘               └──────────────────┘
        |                                  |
        v                                  v
   Understanding                       Generation
   classification, matching,           continuation, dialogue,
   extraction                          translation
Reflections and Discussion
Why Is BERT Poor at Generation Tasks?
BERT's fundamental design philosophy is understanding, not generation. The specific reasons are:
- Lack of autoregressive capability: BERT has no causal mask and cannot generate tokens sequentially. When generating the \(t\)-th token, it would need information from tokens \(t+1, t+2, \ldots\), which have not yet been generated.
- Training objective mismatch: MLM trains a "fill-in-the-blank" capability (predicting missing words given context), whereas generation tasks require a "continuation" capability (generating subsequent text given a prefix). These are fundamentally different abilities.
- Conditional independence assumption: MLM assumes that multiple masked tokens are independent, making it unable to model the sequential dependencies between tokens during generation.
The Philosophical Difference: Encoder-only vs. Decoder-only
Encoder-only (BERT)                       Decoder-only (GPT)
─────────────────                         ──────────────────
"I want to understand the whole input"    "I want to generate one token at a time"
        |                                          |
All information is visible at once         See only what is known; predict the next token
        |                                          |
Suited to tasks needing global             Suited to tasks needing sequential
understanding (classification,             generation (dialogue, writing,
matching, extraction)                      translation)
        |                                          |
Bidirectional attention                    Unidirectional (causal) attention
        |                                          |
Efficient encoding, but cannot generate    Can generate, but one-sided encoding
From an information-theoretic perspective:
- BERT models \(P(x_{\text{mask}} | x_{\text{observed}})\): given context, predict the missing parts
- GPT models \(P(x_t | x_{<t})\): given history, predict the future
Both are fundamentally learning the probability distribution of language, but in different ways: BERT follows the approach of a Denoising Autoencoder, while GPT follows that of an Autoregressive Model.
History's Choice
Interestingly, although BERT dominated the NLP academic leaderboards from 2018 to 2020, the mainstream path for large-scale language models ultimately converged on GPT's Decoder-only architecture. Reasons include: (1) Decoder-only scales up more easily; (2) the autoregressive objective naturally supports generation; (3) in-context learning enables Decoder-only models to handle understanding tasks without fine-tuning; (4) it is simpler and more unified from an engineering perspective. Nevertheless, BERT-family models remain widely used in search engines (e.g., Google Search), recommendation systems, and short-text classification, where their efficiency and effectiveness on understanding tasks continue to excel.
BERT's Historical Significance
BERT's contribution extends far beyond its specific technical details — it established the "pre-training + fine-tuning" paradigm for NLP:
- Paradigm shift: From "designing a task-specific model for each task" to "one general pre-trained model + lightweight fine-tuning"
- Validation of scaling effects: Demonstrated that pre-training on large-scale unlabeled data can learn general linguistic knowledge
- Success of transfer learning: Showed that representations learned in one domain/task can be effectively transferred to other tasks
- The power of open source: Google open-sourced the pre-trained models and code, greatly democratizing NLP research
BERT's impact on NLP is analogous to the impact of ImageNet pre-training on computer vision — it made "pre-training + fine-tuning" the default approach. Even in today's GPT-dominated era, BERT's core ideas (bidirectional understanding, pre-trained transfer) continue to profoundly influence the design of every new model.
Summary
BERT Knowledge Map
═══════════════════════════════════════════════════════
Word2Vec ──→ ELMo ──→ GPT-1 ──→ BERT ──→ RoBERTa
(static)   (shallow    (deep     (deep     (better
            bidir.)   unidir.)   bidir.)   training)
                                   |
                                   ├──→ ALBERT (parameter efficiency)
                                   ├──→ DistilBERT (distillation / compression)
                                   ├──→ ELECTRA (replaced token detection)
                                   ├──→ SpanBERT (span masking)
                                   └──→ DeBERTa (disentangled attention)
BERT's core contributions:
┌──────────────────────────────────────────────────────────────────┐
│ 1. Masked LM                ──→ truly bidirectional pre-training  │
│ 2. [CLS] + output layer     ──→ a general fine-tuning paradigm    │
│ 3. Large-scale pre-training ──→ strong results from little data   │
│ 4. Open-sourced models      ──→ democratized NLP research         │
└──────────────────────────────────────────────────────────────────┘