
BERT: Bidirectional Encoder Representations from Transformers

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model proposed by Google in 2018. It marked the formal transition of NLP from the era of "feature engineering" to the "pre-training + fine-tuning" paradigm. BERT set new state-of-the-art results on 11 NLP benchmark tasks and profoundly reshaped the direction of natural language processing research.

"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" -- Devlin et al., Google AI Language, 2018


Background and Motivation

Evolution of NLP Pre-training

Before BERT, the NLP field followed a clear trajectory of pre-training techniques:

| Stage | Method | Year | Core Idea | Limitation |
|---|---|---|---|---|
| 1 | Word2Vec / GloVe | 2013 / 2014 | Static word vectors: one fixed representation per word | Cannot handle polysemy ("苹果" — fruit or company?) |
| 2 | ELMo | 2018 | Context-dependent word representations via bidirectional LSTMs | Feature concatenation, not deeply bidirectional; LSTM-based |
| 3 | GPT-1 | 2018 | Unidirectional language-model pre-training with the Transformer Decoder | Unidirectional (left-to-right); cannot see both sides of the context at once |
| 4 | BERT | 2018 | Bidirectional pre-training with the Transformer Encoder | Pretrain-finetune discrepancy (the [MASK] problem) |

Why Is Bidirectionality Needed?

GPT-1 uses a standard autoregressive language model that can only see the preceding \(t-1\) tokens when predicting the \(t\)-th token:

\[ P(w_t | w_1, w_2, \ldots, w_{t-1}) \]

This means the model can only leverage the left context when encoding a token's representation. Consider the following two sentences:

(1) 我去银行存钱。("I went to the bank to deposit money.")         → "银行" = financial institution
(2) 我去河的银行散步。("I went for a walk along the river bank.")  → "银行" = riverbank (bank)

In sentence (1), if the model can only see "我去" (I went to), it cannot determine the meaning of "银行" (bank); but if it can also see "存钱" (deposit money) on the right, disambiguation becomes straightforward. This is the value of bidirectional context.

ELMo's 'bidirectionality' differs from BERT's

ELMo trains a forward LSTM and a backward LSTM separately, then concatenates the hidden states from both directions. This is a concatenation of two unidirectional models, not true deep bidirectional interaction. In BERT, the Self-Attention at every layer allows each token to attend to all positions in the sequence simultaneously, achieving genuine bidirectionality.

BERT's Core Innovation

Traditional language models can only be unidirectional because allowing the model to see both left and right context would let it "see the answer" when predicting a given token. BERT proposed an elegant solution: the Masked Language Model (MLM).

The key idea: instead of predicting the next token, randomly mask some tokens in the input and have the model predict the masked tokens based on bidirectional context. This avoids the "seeing the answer" problem while enabling truly bidirectional pre-training.


BERT Architecture in Detail

Overall Architecture

BERT uses only the Encoder portion of the Transformer (no Decoder), stacking multiple Transformer Encoder layers to build deep bidirectional representations.

                        BERT Overall Architecture
 ═══════════════════════════════════════════════════════

 Input text: [CLS]  我  喜欢  自然  语言  处理  [SEP]   ("I like natural language processing")
               |    |    |     |    |    |     |
               v    v    v     v    v    v     v
          ┌─────────────────────────────────────────┐
          │  Token Embedding                        │
          │  +  Segment Embedding                   │
          │  +  Position Embedding                  │
          └─────────────────────────────────────────┘
               |    |    |     |    |    |     |
               v    v    v     v    v    v     v
          ┌─────────────────────────────────────────┐
          │         Transformer Encoder Layer 1      │
          │   [Multi-Head Self-Attention + FFN]      │
          └─────────────────────────────────────────┘
               |    |    |     |    |    |     |
               v    v    v     v    v    v     v
          ┌─────────────────────────────────────────┐
          │         Transformer Encoder Layer 2      │
          │   [Multi-Head Self-Attention + FFN]      │
          └─────────────────────────────────────────┘
               |    |    |     |    |    |     |
               :    :    :     :    :    :     :
               |    |    |     |    |    |     |
               v    v    v     v    v    v     v
          ┌─────────────────────────────────────────┐
          │         Transformer Encoder Layer L      │
          │   [Multi-Head Self-Attention + FFN]      │
          └─────────────────────────────────────────┘
               |    |    |     |    |    |     |
               v    v    v     v    v    v     v
            T_CLS  T_1  T_2   T_3  T_4  T_5  T_SEP
               |
               v
          Classification/pooled output (for downstream tasks)

Why does BERT use only the Encoder and not the Decoder?

The Transformer Decoder includes a causal mask that restricts each position to attending only to preceding positions — exactly what BERT aims to avoid. BERT's goal is to let every token attend to all positions in the sequence, so only the Encoder's unmasked Self-Attention is needed.

Input Representation

BERT's input representation is the element-wise sum of three embeddings:

\[ \text{Input}(x_i) = \text{TokenEmb}(x_i) + \text{SegmentEmb}(x_i) + \text{PositionEmb}(i) \]
 Token:     [CLS]  我  喜欢  猫  [SEP]  猫  是  动物  [SEP]
              |    |    |    |    |      |   |    |     |
 Token Emb:  E_CLS E_我 E_喜欢 E_猫 E_SEP  E_猫 E_是 E_动物 E_SEP
     +        +    +    +    +    +      +   +    +     +
 Segment:    E_A  E_A  E_A  E_A  E_A   E_B E_B  E_B   E_B
     +        +    +    +    +    +      +   +    +     +
 Position:   E_0  E_1  E_2  E_3  E_4   E_5 E_6  E_7   E_8
     =        =    =    =    =    =      =   =    =     =
 Final input: [  each position gets one d_model-dimensional vector  ]

Description of the three embeddings:

| Embedding | Shape | Description |
|---|---|---|
| Token Embedding | \(V \times d_{\text{model}}\) | Word embeddings after WordPiece tokenization; vocabulary size \(V = 30522\) |
| Segment Embedding | \(2 \times d_{\text{model}}\) | Distinguishes sentence A from sentence B; only two possible values, \(E_A\) and \(E_B\) |
| Position Embedding | \(L_{\max} \times d_{\text{model}}\) | Learned positional embeddings (unlike the sinusoidal encoding in the original Transformer); \(L_{\max} = 512\) |
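The lookup-and-sum can be sketched with numpy. Shapes follow the table above; the embedding values and token ids below are random/hypothetical placeholders, not real BERT weights:

```python
import numpy as np

V, L_max, H = 30522, 512, 768               # vocabulary, max length, hidden size
rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(V, H))      # V x d_model
segment_emb  = rng.normal(size=(2, H))      # 2 x d_model
position_emb = rng.normal(size=(L_max, H))  # L_max x d_model

token_ids   = np.array([101, 2023, 2003, 102])  # hypothetical WordPiece ids
segment_ids = np.array([0, 0, 0, 0])            # all sentence A
positions   = np.arange(4)

# Element-wise sum of the three lookups: one d_model vector per position
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (4, 768)
```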

Difference in Position Embedding

The original Transformer uses fixed sinusoidal/cosine positional encoding, whereas BERT uses learned positional embeddings. Experiments show little difference in performance between the two approaches, but the learned version is simpler to implement. The downside is that the maximum sequence length is bounded by \(L_{\max}\) used during training and cannot extrapolate to longer sequences.

Special Tokens

BERT introduces two key special tokens:

[CLS] (Classification Token)

  • Always placed at the very beginning of the input sequence
  • After passing through \(L\) Transformer Encoder layers, the output vector at the [CLS] position \(T_{\text{CLS}} \in \mathbb{R}^{d_{\text{model}}}\) serves as the aggregate representation of the entire sequence
  • For classification tasks, a classification head is applied directly on top of \(T_{\text{CLS}}\)
  • Why can [CLS] represent the entire sequence? Because Self-Attention allows it to attend to every token in the sequence

[SEP] (Separator Token)

  • Placed at the end of each sentence to separate two sentences
  • Single-sentence input: [CLS] Sentence A [SEP]
  • Sentence-pair input: [CLS] Sentence A [SEP] Sentence B [SEP]

Model Specifications

BERT was released in two versions:

| Parameter | BERT-Base | BERT-Large |
|---|---|---|
| Number of Transformer layers \(L\) | 12 | 24 |
| Hidden size \(H\) (\(d_{\text{model}}\)) | 768 | 1024 |
| Number of attention heads \(A\) | 12 | 16 |
| Dimension per head \(d_k = H/A\) | 64 | 64 |
| Feed-forward dimension \(d_{ff}\) | 3072 (\(4H\)) | 4096 (\(4H\)) |
| Total parameters | 110M | 340M |
| Maximum sequence length | 512 | 512 |

BERT-Base has a parameter count comparable to GPT-1 (110M vs. 117M), enabling a fair comparison.
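The 110M figure can be roughly reproduced from the table with a back-of-the-envelope sketch (ours; it ignores biases, LayerNorm parameters, and the pooler, hence the small gap):

```python
# BERT-Base hyperparameters from the table above
V, L_max, H, num_layers, d_ff = 30522, 512, 768, 12, 3072

embeddings = V * H + L_max * H + 2 * H   # token + position + segment
per_layer = 4 * H * H + 2 * H * d_ff     # Q, K, V, O projections + two FFN matrices
total = embeddings + num_layers * per_layer

print(f"{total / 1e6:.1f}M")             # 108.8M, close to the reported 110M
```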


Pre-training Tasks

BERT is pre-trained with two unsupervised tasks: MLM and NSP.

Task 1: Masked Language Model (MLM)

MLM is BERT's most central innovation. It randomly masks 15% of the input tokens and trains the model to predict the original tokens at those positions.

Masking strategy:

For the selected 15% of tokens, rather than simply replacing all of them with [MASK], the following strategy is used:

| Probability | Operation | Example (original word: "猫", "cat") | Purpose |
|---|---|---|---|
| 80% | Replace with [MASK] | "我 喜欢 [MASK]" | Train the model to infer the masked word from context |
| 10% | Replace with a random word | "我 喜欢 汽车" | Prevent the model from making predictions only when it sees [MASK] |
| 10% | Keep unchanged | "我 喜欢 猫" | Teach the model to preserve correct representations for real input |

Why not use [MASK] for everything?

If all selected tokens were replaced with [MASK], a serious problem would arise: pretrain-finetune discrepancy. During fine-tuning, [MASK] tokens never appear in the input, and the model would never have learned to handle "normal" input. The 80/10/10 mixed strategy mitigates this: the 10% kept unchanged teaches the model to represent normal input, while the 10% random replacement prevents the model from activating its prediction capability only upon seeing [MASK].
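The 80/10/10 procedure can be sketched in a few lines. This is a simplification we wrote for illustration: real BERT selects the masked positions per sequence over WordPiece ids, not strings, and never masks special tokens:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style MLM masking sketch: select ~15% of tokens; of those,
    80% -> [MASK], 10% -> a random word, 10% -> left unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)    # None = position contributes nothing to the loss
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok              # the model must recover the original token here
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"                  # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)      # 10%: replace with a random word
        # remaining 10%: keep the token unchanged
    return masked, labels
```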

MLM loss function:

Only the cross-entropy loss over masked tokens is computed (non-masked positions do not contribute to the loss):

\[ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i | \mathbf{x}_{\backslash \mathcal{M}}) \]

where \(\mathcal{M}\) is the set of masked token positions and \(\mathbf{x}_{\backslash \mathcal{M}}\) denotes all input tokens except the masked ones.
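Numerically, this is ordinary cross-entropy evaluated only at the masked indices. A numpy sketch (we average over the masked positions, as implementations typically do; the paper's formula above is written as a sum):

```python
import numpy as np

def mlm_loss(logits, labels, masked_positions):
    """Cross-entropy over masked positions only.
    logits: (n, V) scores per position; labels: (n,) original token ids;
    masked_positions: the index set M from the formula above."""
    loss = 0.0
    for i in masked_positions:
        probs = np.exp(logits[i] - logits[i].max())  # numerically stable softmax
        probs /= probs.sum()
        loss -= np.log(probs[labels[i]])
    return loss / len(masked_positions)
```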

Detailed computation process:

Original input: [CLS] 我 喜欢 自然 语言 处理 [SEP]
                                ↓ randomly select "自然" and "处理"
Masked input:   [CLS] 我 喜欢 [MASK] 语言 处理 [SEP]
                                                ↓ "处理" replaced by a random word
Modified input: [CLS] 我 喜欢 [MASK] 语言 苹果 [SEP]
                         ↓ fed into the BERT Encoder
Output:         T_CLS  T_1  T_2  T_3   T_4  T_5  T_SEP
                                |             |
                                v             v
                          predict "自然"  predict "处理"
                            (softmax)       (softmax)

Task 2: Next Sentence Prediction (NSP)

NSP trains the model to understand inter-sentence relationships. Sentence pairs are constructed during training:

| Label | Construction | Proportion | Example |
|---|---|---|---|
| IsNext | Sentence B is the actual next sentence after A | 50% | A="猫坐在垫子上。" ("The cat sat on the mat.") B="它看起来很舒服。" ("It looks comfortable.") |
| NotNext | Sentence B is randomly sampled from the corpus | 50% | A="猫坐在垫子上。" ("The cat sat on the mat.") B="股市今天大涨。" ("The stock market surged today.") |

Binary classification is performed using the output at the [CLS] position \(T_{\text{CLS}}\):

\[ P(\text{IsNext} | T_{\text{CLS}}) = \text{softmax}(T_{\text{CLS}} \cdot W_{\text{NSP}}) \]
\[ \mathcal{L}_{\text{NSP}} = -\left[y \log P(\text{IsNext}) + (1 - y) \log(1 - P(\text{IsNext}))\right] \]

Total loss:

\[ \mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}} \]

The NSP Controversy

Subsequent research (notably RoBERTa) found that NSP provides little benefit or is even harmful for most downstream tasks. Possible reasons include: (1) the "random negative sample" task is too easy — the model only needs to determine whether two sentences share the same topic; (2) NSP and MLM may have conflicting objectives. RoBERTa dropped NSP entirely and achieved better results.

Training Details

| Item | Setting |
|---|---|
| Training data | BooksCorpus (800M words) + English Wikipedia (2,500M words) |
| Total tokens | Approximately 3.3 billion |
| Batch size | 256 sequences |
| Training steps | 1,000,000 |
| Learning rate | \(1 \times 10^{-4}\) (Adam, warmup + linear decay) |
| Training time | 4 days on 16 TPU chips (BERT-Base); 4 days on 64 TPU chips (BERT-Large) |
| Sequence length | 128 for the first 90% of steps, 512 for the last 10% |

Forward Pass: A Numerical Example

To build intuition for BERT's computation, we walk through a minimal example.

Setup: Sequence length \(n = 4\), \(d_{\text{model}} = 4\), single-head attention (\(A = 1\)), single Encoder layer.

Step 1: Input Embedding

Assume the input is [CLS] 我 喜欢 [SEP], with the three embeddings summed:

\[ \text{TokenEmb} = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.1 & 0.2 & 0.3 \\ 0.3 & 0.4 & 0.1 & 0.5 \\ 0.2 & 0.3 & 0.4 & 0.1 \end{bmatrix} \]
\[ \text{SegmentEmb} = \begin{bmatrix} 0.01 & 0.02 & 0.01 & 0.02 \\ 0.01 & 0.02 & 0.01 & 0.02 \\ 0.01 & 0.02 & 0.01 & 0.02 \\ 0.01 & 0.02 & 0.01 & 0.02 \end{bmatrix} \quad \text{(all belonging to sentence A)} \]
\[ \text{PositionEmb} = \begin{bmatrix} 0.10 & 0.00 & 0.10 & 0.00 \\ 0.04 & 0.08 & 0.04 & 0.08 \\ 0.01 & 0.05 & 0.01 & 0.05 \\ 0.00 & 0.03 & 0.00 & 0.03 \end{bmatrix} \]

Summing all three yields the input matrix \(X \in \mathbb{R}^{4 \times 4}\):

\[ X = \begin{bmatrix} 0.21 & 0.22 & 0.41 & 0.42 \\ 0.55 & 0.20 & 0.25 & 0.40 \\ 0.32 & 0.47 & 0.12 & 0.57 \\ 0.21 & 0.35 & 0.41 & 0.15 \end{bmatrix} \]

Step 2: Self-Attention Computation

Assume the weight matrices \(W^Q, W^K, W^V \in \mathbb{R}^{4 \times 4}\) are all simplified identity matrices (for demonstration purposes only), so \(Q = X, K = X, V = X\).

(a) Compute attention scores \(QK^T\):

\[ S = QK^T = X \cdot X^T \in \mathbb{R}^{4 \times 4} \]

For example, \(S_{11} = 0.21^2 + 0.22^2 + 0.41^2 + 0.42^2 = 0.0441 + 0.0484 + 0.1681 + 0.1764 \approx 0.44\)

(b) Scale: divide by \(\sqrt{d_k} = \sqrt{4} = 2\)

\[ S_{\text{scaled}} = \frac{S}{2} \]

(c) Softmax: apply softmax along each row

\[ \alpha_{ij} = \frac{\exp(S_{\text{scaled},ij})}{\sum_{k=1}^{4} \exp(S_{\text{scaled},ik})} \]

The attention weights in each row sum to 1. Note that in BERT's Encoder there is no causal mask, so every position can attend to all positions (including itself and subsequent positions).

(d) Weighted sum:

\[ \text{Output} = \alpha \cdot V = \alpha \cdot X \in \mathbb{R}^{4 \times 4} \]
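Steps (a)–(d) can be checked numerically with numpy, using the same \(X\) and identity-weight simplification as above:

```python
import numpy as np

# Worked example with W^Q = W^K = W^V = I, so Q = K = V = X
X = np.array([[0.21, 0.22, 0.41, 0.42],
              [0.55, 0.20, 0.25, 0.40],
              [0.32, 0.47, 0.12, 0.57],
              [0.21, 0.35, 0.41, 0.15]])

S = X @ X.T                                # (a) attention scores QK^T
S_scaled = S / np.sqrt(4)                  # (b) scale by sqrt(d_k) = 2
alpha = np.exp(S_scaled)
alpha /= alpha.sum(axis=1, keepdims=True)  # (c) row-wise softmax; rows sum to 1
output = alpha @ X                         # (d) weighted sum of V

print(round(S[0, 0], 3))                   # 0.437, the S_11 ≈ 0.44 computed above
```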

The key difference between BERT and GPT lies right here

In GPT, after computing the attention scores, a causal mask is applied to set all values above the diagonal to \(-\infty\), ensuring each position can only attend to itself and previous positions. BERT uses no such mask (aside from padding masks), allowing every position to freely attend to the entire sequence — this is what "bidirectional" means.

Step 3: FFN + Residual + LayerNorm

The output from Self-Attention is then passed through a Feed-Forward Network:

\[ \text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2 \]

Note that BERT uses GELU (Gaussian Error Linear Unit) as its activation function, rather than the ReLU used in the original Transformer:

\[ \text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \]
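The exact form maps directly onto the standard library's error function. A minimal sketch (many implementations instead use a faster tanh approximation):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Unlike ReLU, GELU is smooth and slightly negative for small x < 0
print(gelu(0.0))   # 0.0
```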

Each sub-layer (Self-Attention and FFN) is wrapped with a residual connection and Layer Normalization:

\[ \text{output} = \text{LayerNorm}(x + \text{SubLayer}(x)) \]

Fine-tuning

BERT's pre-trained model can be adapted to a wide variety of downstream tasks by simply adding an output layer, requiring only a small amount of labeled data for fine-tuning.

Core Idea of Fine-tuning

 Pre-trained BERT (unsupervised, large-scale corpus)
              |
              v
 ┌─────────────────────────────────┐
 │  Freeze or fine-tune BERT's     │
 │  parameters, plus a new task-   │
 │  specific output layer          │
 └─────────────────────────────────┘
              |
              v
 Downstream task (small amount of labeled data)

All parameters (both BERT's original parameters and the newly added output layer) are fine-tuned end-to-end using labeled data from the downstream task. The learning rate is typically set very low (\(2 \times 10^{-5}\) to \(5 \times 10^{-5}\)) to avoid destroying the knowledge acquired during pre-training.

Task 1: Text Classification

 Input:  [CLS]  这  部  电影  太  好看  了  [SEP]   ("This movie is really good")
           |    |   |    |   |    |   |    |
           v    v   v    v   v    v   v    v
        ┌──────────────────────────────────────┐
        │              BERT Encoder             │
        └──────────────────────────────────────┘
           |    |   |    |   |    |   |    |
           v
        T_[CLS]  (outputs at the other positions are ignored)
           |
           v
        ┌───────────────────┐
        │  Linear + Softmax │
        │  W ∈ R^(H x C)    │
        └───────────────────┘
           |
           v
        Sentiment label: positive (p=0.92)
\[ P(c | \text{text}) = \text{softmax}(T_{\text{CLS}} \cdot W + b), \quad W \in \mathbb{R}^{H \times C} \]

where \(C\) is the number of classes and \(H\) is the hidden dimension.
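As a sketch, the entire task-specific head is a single linear layer plus softmax. Names and shapes here are ours; `t_cls` stands in for BERT's output at the [CLS] position:

```python
import numpy as np

def classify(t_cls, W, b):
    """Classification head on T_[CLS].
    t_cls: (H,), W: (H, C), b: (C,) -> probability over C classes."""
    logits = t_cls @ W + b
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()
```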

Task 2: Sequence Labeling (NER / POS Tagging)

 Input:  [CLS]  乔布斯  在  加州  创办  了  苹果  [SEP]   ("Jobs founded Apple in California")
           |      |    |    |    |    |    |     |
           v      v    v    v    v    v    v     v
        ┌──────────────────────────────────────────┐
        │                BERT Encoder               │
        └──────────────────────────────────────────┘
           |      |    |    |    |    |    |     |
                  v    v    v    v    v    v
               T_1   T_2  T_3  T_4  T_5  T_6
                |     |    |    |    |    |
                v     v    v    v    v    v
              ┌───────────────────────────────┐
              │  per token: Linear + Softmax  │
              └───────────────────────────────┘
                |     |    |    |    |    |
                v     v    v    v    v    v
              B-PER  O  B-LOC  O    O  B-ORG

Each token's output \(T_i\) independently passes through a classification head to predict its label:

\[ P(y_i | x) = \text{softmax}(T_i \cdot W + b), \quad i = 1, 2, \ldots, n \]

Task 3: Question Answering (Extractive QA)

Given a passage (context) and a question, the model must locate the start position and end position of the answer within the passage.

 Input:  [CLS] question tokens [SEP] context tokens [SEP]
           |       |         |        |          |
           v       v         v        v          v
        ┌──────────────────────────────────────────┐
        │                BERT Encoder               │
        └──────────────────────────────────────────┘
           |       |         |        |          |
                                      v
                        output at each context position
                                      |
                              ┌───────┴───────┐
                              v               v
                         Start vector S   End vector E
                         (S ∈ R^H)        (E ∈ R^H)
                              |               |
                              v               v
                         dot product      dot product
                         with each        with each
                         token output     token output
                              |               |
                              v               v
                          softmax          softmax
                              |               |
                              v               v
                        start position   end position
\[ P_{\text{start}}(i) = \frac{\exp(S \cdot T_i)}{\sum_j \exp(S \cdot T_j)}, \quad P_{\text{end}}(i) = \frac{\exp(E \cdot T_i)}{\sum_j \exp(E \cdot T_j)} \]

where \(S, E \in \mathbb{R}^H\) are two learnable vectors. The final answer is the span \([i, j]\) that maximizes \(P_{\text{start}}(i) \times P_{\text{end}}(j)\) (subject to \(j \geq i\)).
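Decoding the best span can be sketched as a small search. The `max_len` cap is a common practical constraint, not part of the formula above:

```python
def best_span(p_start, p_end, max_len=30):
    """Return (i, j), j >= i, maximizing P_start(i) * P_end(j)."""
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        # only consider spans of bounded length starting at i
        for j in range(i, min(i + max_len, len(p_end))):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```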

Task 4: Sentence-Pair Relationship (NLI / Semantic Similarity)

 Input:  [CLS]  Sentence A  [SEP]  Sentence B  [SEP]
           |        |         |        |         |
           v        v         v        v         v
        ┌──────────────────────────────────────┐
        │              BERT Encoder             │
        └──────────────────────────────────────┘
           |
           v
        T_[CLS]
           |
           v
        ┌───────────────────┐
        │  Linear + Softmax │
        └───────────────────┘
           |
           v
        Entailment / Contradiction / Neutral
Recommended fine-tuning hyperparameter ranges:

| Parameter | Recommended Range |
|---|---|
| Learning rate | \(2 \times 10^{-5}\), \(3 \times 10^{-5}\), \(5 \times 10^{-5}\) |
| Batch size | 16, 32 |
| Epochs | 2, 3, 4 |
| Warmup | 10% of total steps |
| Maximum sequence length | 128 (short text) or 512 (long text) |
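The warmup-then-linear-decay schedule can be sketched as a plain function. The name and defaults are ours, mirroring the recommended values above:

```python
def lr_schedule(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to peak_lr over the first warmup_frac of training,
    then linear decay to 0 over the remaining steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```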

The Key Advantage of Fine-tuning

Compared to training from scratch, BERT fine-tuning requires very little labeled data and compute. A classification task fine-tuned on a 16GB GPU with BERT-Base typically needs only a few thousand labeled examples and less than an hour of training to achieve strong results. This dramatically lowers the barrier to entry for NLP.


Limitations of BERT and Subsequent Improvements

Main Limitations

1. Pretrain-Finetune Discrepancy

During pre-training, the input contains [MASK] tokens, but [MASK] never appears during fine-tuning or inference. Although the 80/10/10 strategy partially mitigates this issue, the inconsistency remains.

2. Independence Assumption

MLM assumes that masked tokens are mutually independent. For example, given the input "New [MASK] [MASK]", MLM predicts each masked token separately without considering their interdependence (when in reality "York" and "City" should be predicted jointly).

3. Poorly Designed NSP Task

Randomly sampled negative examples are too easy, and the model may only need to assess topic consistency rather than perform genuine inter-sentence reasoning.

4. Low Training Efficiency

Only 15% of tokens contribute to the loss at each step, whereas autoregressive models (e.g., GPT) compute the loss over every token, providing a denser training signal.

5. Unsuitable for Generation Tasks

BERT is an Encoder-only architecture that lacks autoregressive generation capability, making it ill-suited for text generation, translation, and similar tasks.

Subsequent Improvements

| Model | Year | Core Improvement | Key Changes |
|---|---|---|---|
| RoBERTa | 2019 | More robust training recipe | Removed NSP; dynamic masking; larger batches; more data; longer training |
| ALBERT | 2019 | Parameter efficiency | Embedding factorization (\(V \times H \to V \times E + E \times H\)); cross-layer parameter sharing |
| DistilBERT | 2019 | Knowledge distillation | 6 layers instead of 12, retaining ~97% of performance while being ~60% faster |
| ELECTRA | 2020 | Replaced-token detection | A generator produces replacement tokens; a discriminator decides whether each token was replaced; all tokens contribute to training |
| DeBERTa | 2020 | Disentangled attention | Content and position attended to separately; enhanced mask decoder |
| SpanBERT | 2020 | Contiguous span masking | Masks contiguous spans instead of random tokens; removes NSP; adds a Span Boundary Objective |

ELECTRA's Clever Design

ELECTRA fundamentally solves BERT's 15% training efficiency problem. It uses a small generator (similar to a small BERT) to "fill in" masked tokens, then has a discriminator (the main model) determine whether each token in the sequence is original or was replaced by the generator. This way, 100% of tokens participate in training, greatly improving efficiency. Under equivalent compute budgets, ELECTRA significantly outperforms BERT.


BERT vs. GPT Comparison

These represent the two most important technical trajectories in NLP pre-training. Understanding their differences helps grasp the overall development of the field.

| Dimension | BERT | GPT |
|---|---|---|
| Architecture | Transformer Encoder | Transformer Decoder |
| Context direction | Bidirectional | Unidirectional (left-to-right) |
| Pre-training objective | MLM + NSP | Autoregressive LM \(P(w_t \mid w_{<t})\) |
| Attention mask | No causal mask (fully visible) | Causal mask (sees only the past) |
| Special input tokens | [CLS], [SEP], segment embeddings | Only BOS/EOS |
| Strengths | Understanding tasks: classification, NER, QA, NLI | Generation tasks: text generation, dialogue, translation |
| Fine-tuning approach | Add output layer, fine-tune end-to-end | Early: add output layer and fine-tune; later (GPT-3+): in-context learning |
| Training efficiency | Lower (only 15% of tokens contribute to the loss) | Higher (every token contributes to the loss) |
| Scaling trend | Relatively slow parameter growth | Rapid parameter growth (GPT-3: 175B; GPT-4: undisclosed) |
| Notable successors | RoBERTa, DeBERTa, ELECTRA | GPT-2, GPT-3, GPT-4, ChatGPT |
 BERT (Encoder-only)              GPT (Decoder-only)
 ┌────────────────┐              ┌────────────────┐
 │ fully visible  │              │ causally masked│
 │ Self-Attention │              │ Self-Attention │
 │                │              │   ╲            │
 │  ○ ○ ○ ○      │              │  ○ ╲ ╲ ╲      │
 │  each token    │              │  each token    │
 │  sees all      │              │  sees only     │
 │  positions     │              │  the left      │
 └────────────────┘              └────────────────┘
        |                               |
        v                               v
   Understanding                    Generation
   classification, matching,        continuation, dialogue,
   extraction                       translation

Reflections and Discussion

Why Is BERT Poor at Generation Tasks?

BERT's fundamental design philosophy is understanding, not generation. The specific reasons are:

  1. Lack of autoregressive capability: BERT has no causal mask and cannot generate tokens sequentially. When generating the \(t\)-th token, it would need information from the \(t+1, t+2, \ldots\) tokens — which have not yet been generated.

  2. Training objective mismatch: MLM trains a "fill-in-the-blank" capability (predicting missing words given context), whereas generation tasks require a "continuation" capability (generating subsequent text given a prefix). These are fundamentally different abilities.

  3. Conditional independence assumption: MLM assumes that multiple masked tokens are independent, making it unable to model the sequential dependencies between tokens during generation.

The Philosophical Difference: Encoder-only vs. Decoder-only

  Encoder-only (BERT)                    Decoder-only (GPT)
  ───────────────────                    ──────────────────
  "Understand the entire input"          "Generate one token at a time"
       |                                       |
  all information visible at once        sees only the past, predicts the next
       |                                       |
  suited to tasks needing a global       suited to tasks needing sequential
  view (classification, matching,        generation (dialogue, writing,
  extraction)                            translation)
       |                                       |
  bidirectional attention                unidirectional (causal) attention
       |                                       |
  efficient encoding, cannot generate    can generate, encoding is one-sided

From an information-theoretic perspective:

  • BERT models \(P(x_{\text{mask}} | x_{\text{observed}})\): given context, predict the missing parts
  • GPT models \(P(x_t | x_{<t})\): given history, predict the future

Both are fundamentally learning the probability distribution of language, but in different ways: BERT follows the approach of a Denoising Autoencoder, while GPT follows that of an Autoregressive Model.

History's Choice

Interestingly, although BERT dominated the NLP academic leaderboards from 2018 to 2020, the mainstream path for large-scale language models ultimately converged on GPT's Decoder-only architecture. Reasons include: (1) Decoder-only scales up more easily; (2) the autoregressive objective naturally supports generation; (3) in-context learning enables Decoder-only models to handle understanding tasks without fine-tuning; (4) it is simpler and more unified from an engineering perspective. Nevertheless, BERT-family models remain widely used in search engines (e.g., Google Search), recommendation systems, and short-text classification, where their efficiency and effectiveness on understanding tasks continue to excel.

BERT's Historical Significance

BERT's contribution extends far beyond its specific technical details — it established the "pre-training + fine-tuning" paradigm for NLP:

  1. Paradigm shift: From "designing a task-specific model for each task" to "one general pre-trained model + lightweight fine-tuning"
  2. Validation of scaling effects: Demonstrated that pre-training on large-scale unlabeled data can learn general linguistic knowledge
  3. Success of transfer learning: Showed that representations learned in one domain/task can be effectively transferred to other tasks
  4. The power of open source: Google open-sourced the pre-trained models and code, greatly democratizing NLP research

BERT's impact on NLP is analogous to the impact of ImageNet pre-training on computer vision — it made "pre-training + fine-tuning" the default approach. Even in today's GPT-dominated era, BERT's core ideas (bidirectional understanding, pre-trained transfer) continue to profoundly influence the design of every new model.


Summary

 BERT Knowledge Map
 ═══════════════════════════════════════════════════════

 Word2Vec ──→ ELMo ──→ GPT-1 ──→ BERT ──→ RoBERTa
 (static)   (shallow   (deep,    (deep,   (better
             bidir.)   unidir.)  bidir.)   training)
                                    |
                                    ├──→ ALBERT     (parameter-efficient)
                                    ├──→ DistilBERT (distillation)
                                    ├──→ ELECTRA    (replaced-token detection)
                                    ├──→ SpanBERT   (span masking)
                                    └──→ DeBERTa    (disentangled attention)

 BERT's core contributions:
 ┌──────────────────────────────────────────────────────┐
 │ 1. Masked LM ──→ truly bidirectional pre-training    │
 │ 2. [CLS] + output layer ──→ a universal fine-tuning  │
 │    recipe                                            │
 │ 3. Large-scale pre-training ──→ strong results from  │
 │    little labeled data                               │
 │ 4. Open-sourced models ──→ democratized NLP research │
 └──────────────────────────────────────────────────────┘
