Traditional NLP
The Transformer was originally proposed to solve NLP problems, and the original paper demonstrated it on a translation task. Let us first examine what problems traditional NLP was trying to solve and what methods it used. Understanding the limitations of traditional NLP is a key prerequisite for understanding why the Transformer was revolutionary.
Core Tasks in Natural Language Processing
The goal of Natural Language Processing (NLP) is to enable machines to "understand" and "generate" human language. Core tasks include:
| Task | Description | Example |
|---|---|---|
| Text Classification | Assign labels to text | Sentiment analysis: Is this review positive or negative? |
| Named Entity Recognition (NER) | Identify proper nouns in text | "Jobs founded Apple in California" → Person, Location, Organization |
| Machine Translation | Translate from one language to another | French → English |
| Question Answering | Answer questions given context | Given a Wikipedia passage, answer "Who invented the telephone?" |
| Text Generation | Generate coherent natural language | Automatic summarization, dialogue systems |
| Sequence Labeling | Label each word in a sequence | Part-of-speech tagging (noun, verb, adjective...) |
The first question all these tasks face is: How do we represent a "word" for a computer? Computers only understand numbers, not text.
The Evolution of Text Representation
One-Hot Encoding
The most naive idea: suppose the vocabulary has \(V\) words. Each word is represented as a \(V\)-dimensional vector with a 1 at its corresponding position and 0s everywhere else.
Fatal problems:
- Curse of dimensionality: If the vocabulary has 50,000 words, each word becomes a 50,000-dimensional sparse vector, making computation extremely inefficient
- Cannot express semantic similarity: The dot product of any two different word vectors is 0 — under this representation, the similarity between "cat" and "dog" = the similarity between "cat" and "airplane" = 0
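A minimal numpy sketch of both problems (the 5-word toy vocabulary and the word choices are made up for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "airplane", "the", "runs"]   # toy vocabulary, V = 5
V = len(vocab)

def one_hot(word):
    """Return the V-dimensional one-hot vector for a word."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

cat, dog, airplane = one_hot("cat"), one_hot("dog"), one_hot("airplane")

# Every pair of distinct words is orthogonal: the representation says nothing about meaning.
print(cat @ dog)        # 0.0: "cat" vs "dog"
print(cat @ airplane)   # 0.0: "cat" vs "airplane", exactly the same "similarity"
print(cat.shape)        # (5,): with a 50,000-word vocabulary this would be (50000,)
```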
Bag of Words & TF-IDF
The Bag of Words model represents a sentence/document as a word frequency vector: it counts how many times each word appears, completely ignoring word order.
TF-IDF (Term Frequency–Inverse Document Frequency) adds a weighting layer on top of this: if a word appears in nearly all documents (e.g., "the", "is"), its weight is low; if it appears in only a few documents, its weight is high.
- \(\text{TF}(t,d)\): frequency of term \(t\) in document \(d\)
- \(\text{DF}(t)\): number of documents containing term \(t\)
- \(N\): total number of documents

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log \frac{N}{\text{DF}(t)}\]
Limitation: "Dog bites man" and "Man bites dog" are identical under the bag-of-words model — word order information is completely lost.
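A small sketch (hand-rolled counting, no particular library assumed) showing the word-order problem directly:

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

vocab = ["dog", "bites", "man"]
v1 = bag_of_words("Dog bites man", vocab)
v2 = bag_of_words("Man bites dog", vocab)

print(v1)        # [1, 1, 1]
print(v2)        # [1, 1, 1]
print(v1 == v2)  # True: the two sentences are indistinguishable under bag-of-words
```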
Word Embeddings
Core breakthrough: Map each word to a low-dimensional dense vector space, so that semantically similar words are close together in vector space.
What Are Word Vectors?
A word vector is a word's coordinates in a low-dimensional embedding space — a set of concrete floating-point numbers. For example, a 4-dimensional (extremely simplified) word vector might look like this:
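(The numbers below are invented purely for illustration; they come from no trained model.)

```python
# Hypothetical 4-dimensional word vectors, made up for illustration only.
cat = [ 0.72,  0.31, -0.19,  0.05]
dog = [ 0.68,  0.27, -0.12,  0.09]
car = [-0.41,  0.85,  0.63, -0.30]
```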
"Cat" and "dog" have similar vectors (both are animals), while "car" is far away from them.
Dense vectors vs. sparse vectors
One-hot encoding is also a kind of "vector," but it is high-dimensional and sparse — only 1 out of 50,000 dimensions has a nonzero value. Word embeddings are low-dimensional and dense — every dimension has a nonzero value and contributes to encoding semantics.
One-Hot (50,000-dim): [0, 0, 0, ..., 1, ..., 0, 0, 0]   ← only 1 dimension carries information
                       ├──────── 49,999 zeros ────────┤
Word2Vec (300-dim):   [0.82, -0.31, 0.56, ..., 0.12]    ← every dimension is useful
                       ├──── 300 meaningful values ────┤
The core difference: one-hot is a symbolic marker ("this is the 3,724th word in the dictionary," carrying no semantic information), whereas a word vector is a semantic description (each dimension encodes some latent feature).
What does each dimension encode?
The dimensions of a word vector are not manually defined — they are learned automatically by the model. However, post-hoc analysis reveals that some dimensions do loosely correspond to interpretable semantic directions:
| Semantic Direction (conceptual) | High-value words | Low-value words |
|---|---|---|
| "Animacy" | cat, dog, bird | table, chair, computer |
| "Gender" | queen, sister, she | king, brother, he |
| "Size" | elephant, building, ocean | ant, atom, cell |
| "Sentiment polarity" | happy, beautiful, great | sad, terrible, bad |
It is important to note: these "semantic directions" typically do not align precisely with single dimensions, but are distributed across combinations of multiple dimensions. This distributed representation is precisely what makes word vectors powerful — they can encode infinitely many semantic relationships using a finite number of dimensions.
Common word vector dimensions
| Model/Scenario | Typical Dimensions | Notes |
|---|---|---|
| Word2Vec | 100 ~ 300 | Google's pretrained version uses 300 dimensions |
| GloVe | 50 / 100 / 200 / 300 | Multiple pretrained dimension options available |
| Transformer Base | 512 | Original "Attention is All You Need" paper |
| GPT-2 | 768 ~ 1600 | Small to XL |
| GPT-3 | 12288 | 175B parameter model |
| LLaMA-7B | 4096 | Meta's open-source model |
Higher dimensions yield stronger expressive power but also higher computational cost. The 300 dimensions used in early models were already sufficient to capture rich semantic relationships; in the era of large language models, larger dimensions are needed to support more complex contextual understanding.
Similarity between word vectors
Since word vectors are coordinates in a space, we can measure the distance between words. The most commonly used metric is cosine similarity:

\[\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}\]
- Range is \([-1, 1]\): 1 means identical direction, 0 means orthogonal (unrelated), -1 means opposite direction
- Only compares direction, not magnitude, so it is unaffected by vector norms
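A minimal numpy sketch (reusing the illustrative vectors from earlier; the numbers are still made up):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| |b|); compares direction only, not magnitude."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = [0.72, 0.31, -0.19, 0.05]
dog = [0.68, 0.27, -0.12, 0.09]
car = [-0.41, 0.85, 0.63, -0.30]

print(cosine_similarity(cat, dog))  # close to 1: nearly the same direction
print(cosine_similarity(cat, car))  # close to 0 (here slightly negative): unrelated
```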
Examples of actual cosine similarity (based on pretrained Word2Vec):
| Word Pair | Cosine Similarity | Notes |
|---|---|---|
| (cat, dog) | ~0.76 | Both are pets, highly similar |
| (king, queen) | ~0.65 | Both are rulers, related |
| (cat, car) | ~0.05 | Nearly unrelated |
| (good, bad) | ~0.40 | Although semantically opposite, they frequently appear in similar contexts |
Note the last pair: "good" and "bad" have a non-trivial similarity! This is because Word2Vec is based on the distributional hypothesis — "good" and "bad" often appear in the same contexts ("This movie is very ___"), so their vectors end up being relatively close. This is a well-known limitation of word embeddings: they capture co-occurrence relationships rather than semantic opposition.
The Distributional Hypothesis
"You shall know a word by the company it keeps." — J.R. Firth, 1957
A word's meaning is determined by its context. "Cat" and "dog" frequently appear in similar contexts ("feed the ___", "the ___ is sleeping on the sofa"), so their vectors should be close.
Word2Vec (Mikolov et al., 2013)
Two training modes:
CBOW (Continuous Bag of Words): predict the center word from context
Given context words within a window of size \(c\), predict the center word \(w_t\), i.e., maximize \(P(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c})\).
Skip-gram: predict the context from the center word (works better in practice)
Given the center word \(w_t\), predict each word within the surrounding window, i.e., maximize \(\sum_{-c \le j \le c,\ j \neq 0} \log P(w_{t+j} \mid w_t)\).
The training process is essentially a shallow neural network (with only one hidden layer). After training, the hidden layer's weight matrix is used as the word vectors.
Word2Vec's classic discovery — vector arithmetic:

\[\vec{v}(\text{king}) - \vec{v}(\text{man}) + \vec{v}(\text{woman}) \approx \vec{v}(\text{queen})\]

This demonstrates that the word embedding space captures certain "semantic directions," such as a "gender direction," a "tense direction," and so on.
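If a pretrained Word2Vec model is available locally, this can be checked with gensim; the file name below is an assumption, so substitute whichever pretrained vectors you have:

```python
from gensim.models import KeyedVectors

# Path is an assumption: point it at your local pretrained word2vec file.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ≈ ?
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns something like [('queen', 0.71)]
```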
GloVe (Pennington et al., 2014)
GloVe (Global Vectors) combines global statistical information with local context. It directly fits the word-word co-occurrence matrix with a weighted least-squares objective:

\[J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2\]

where \(X_{ij}\) is the co-occurrence count of words \(i\) and \(j\) within a context window, and \(f\) is a weighting function (to prevent high-frequency words from dominating).
Limitations of Word Embeddings
The static embedding problem: Each word has only one fixed vector, making it impossible to handle polysemy:
- "I ate an apple" → fruit
- "I bought an Apple phone" → company
In both sentences, the Word2Vec/GloVe vector for "apple" is exactly the same. We need context-dependent dynamic representations — this is precisely what later models like ELMo, BERT, and GPT were designed to address.
Sequence Models
Natural language is fundamentally a sequence — word order is crucial. To handle this sequential structure, recurrent neural networks were proposed.
RNN (Recurrent Neural Network)
Core idea: The model reads the sequence word by word, maintaining a "hidden state" \(h_t\) at each step as a memory of all previous information.
x₁ → [RNN] → h₁
↓
x₂ → [RNN] → h₂
↓
x₃ → [RNN] → h₃
↓
...
Mathematical formulation:

\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b), \qquad y_t = W_{hy} h_t\]
- \(x_t\): input at time step \(t\) (word embedding vector)
- \(h_t\): hidden state at time step \(t\)
- \(W_{hh}, W_{xh}, W_{hy}\): weight matrices
- Each \(h_t\) is a function of \(h_{t-1}\) (historical memory) and \(x_t\) (current input)
Fatal problem — vanishing and exploding gradients:
Backpropagation must be unrolled through time (Backpropagation Through Time, BPTT), and the gradient involves a chain of multiplications:

\[\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \operatorname{diag}\!\left(\tanh'(\cdot)\right) W_{hh}\]
- If the largest eigenvalue of \(W_{hh}\) < 1: the gradient decays exponentially → vanishing gradient → the model cannot learn long-range dependencies
- If the largest eigenvalue of \(W_{hh}\) > 1: the gradient grows exponentially → exploding gradient → unstable training
In practice, vanishing gradients are more common. This means that when processing long sentences, information from early words can barely influence gradient updates for later words.
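A small numeric illustration of why this happens (a random 64×64 matrix rescaled to a chosen spectral radius; the tanh-derivative factor, which only shrinks the product further, is ignored here):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
radius = np.max(np.abs(np.linalg.eigvals(W)))   # current spectral radius
W_small = 0.9 * W / radius                      # largest eigenvalue magnitude 0.9
W_large = 1.1 * W / radius                      # largest eigenvalue magnitude 1.1

for name, M in [("radius 0.9", W_small), ("radius 1.1", W_large)]:
    v = np.ones(64)                  # stand-in for a gradient flowing backward through time
    for _ in range(100):             # 100 time steps of backpropagation through time
        v = M.T @ v
    print(name, np.linalg.norm(v))
# radius 0.9 -> essentially 0 (vanishing); radius 1.1 -> astronomically large (exploding)
```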
Concrete Example: Sentiment Analysis with RNN
Task: Determine whether the movie review "这部电影太好看了" ("This movie is really great") is positive or negative.
Step 1: Word Embedding
Each word is converted to a word vector through an embedding matrix (assume dimension = 4).
Step 2: RNN processes words sequentially
x₁ "这部" (this)    → [RNN] → h₁   (has seen "这部"; no idea yet what the sentence is about)
         ↓
x₂ "电影" (movie)   → [RNN] → h₂   (has seen "这部电影"; context is starting to form)
         ↓
x₃ "太" (so)        → [RNN] → h₃   (an intensifier; now expecting a sentiment word to follow)
         ↓
x₄ "好看" (great)   → [RNN] → h₄   (the key step! a positive sentiment signal enters the hidden state)
         ↓
x₅ "了" (particle)  → [RNN] → h₅   (sentence-final particle; h₅ should contain the whole sentence's information)
Each step executes \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)\), with the hidden state accumulating information like a snowball.
Step 3: Classification
Take the final hidden state \(h_5\) and pass it through a fully connected layer + Softmax to output class probabilities:

\[\hat{y} = \text{softmax}(W_{\text{out}} h_5 + b_{\text{out}})\]
If \(P(\text{positive}) = 0.92\), the model classifies it as a positive review.
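A minimal numpy sketch of the whole pipeline: embedding lookup, the recurrence, and the final classification. All weights here are random placeholders just to show the shapes and the data flow; a trained model would of course have learned values:

```python
import numpy as np

rng = np.random.default_rng(42)
d, dh = 4, 4                                       # embedding dim and hidden dim, as in the example
tokens = ["这部", "电影", "太", "好看", "了"]
embed = {w: rng.normal(size=d) for w in tokens}    # stand-in embedding table

W_hh = rng.normal(scale=0.5, size=(dh, dh))
W_xh = rng.normal(scale=0.5, size=(dh, d))
b = np.zeros(dh)
W_out = rng.normal(scale=0.5, size=(2, dh))        # 2 classes: negative / positive

h = np.zeros(dh)
for w in tokens:                                   # Step 2: read the words left to right
    h = np.tanh(W_hh @ h + W_xh @ embed[w] + b)

logits = W_out @ h                                 # Step 3: classify from the final hidden state h₅
probs = np.exp(logits) / np.exp(logits).sum()      # softmax
print(probs)   # [P(negative), P(positive)]; arbitrary here, since the weights are untrained
```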
The problem with RNN for sentiment analysis: If the review is long — "Although the beginning was a bit boring and the pacing in the middle was slow, the final twist was truly spectacular, and overall this movie is very good" — by the time the model reaches "good," information about "although" and "beginning" has long been squeezed out of the hidden state, and the model may fail to correctly understand the "although...but..." contrastive structure. This is the long-range dependency problem caused by vanishing gradients.
LSTM (Long Short-Term Memory)
Core improvement: Introduce a gating mechanism and a cell state \(C_t\), allowing information to be transmitted directly like a "highway," bypassing the compression of nonlinear transformations.
┌──────────────────────────────────────────────┐
│ Cell State Cₜ │
│ Cₜ₋₁ ──→ [×forget] ──→ [+input] ──→ Cₜ ──→│
└──────────────────────────────────────────────┘
↑ ↑ ↓
Forget Gate Input Gate Output Gate
fₜ iₜ oₜ
Four key equations:
Forget Gate — decides which old information to discard:

\[f_t = \sigma(W_f \cdot [h_{t-1};\ x_t] + b_f)\]

Input Gate — decides which new information to write:

\[i_t = \sigma(W_i \cdot [h_{t-1};\ x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1};\ x_t] + b_C)\]

Cell state update — the key! The additive structure allows gradients to flow through directly:

\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

Output Gate — decides which information to output:

\[o_t = \sigma(W_o \cdot [h_{t-1};\ x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t)\]
where \(\sigma\) is the Sigmoid function (outputs 0 to 1, controlling the "openness" of the gate), and \(\odot\) is element-wise multiplication.
Why does LSTM mitigate vanishing gradients?
The key lies in the additive structure \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\). When the gradient propagates along \(C\):

\[\frac{\partial C_t}{\partial C_{t-1}} \approx f_t\]
As long as the forget gate \(f_t\) is close to 1 (i.e., the network learns "don't forget"), the gradient can pass through nearly unattenuated over long distances. Compared to the RNN, where the gradient must be repeatedly multiplied by \(W_{hh}\), the LSTM's gradient pathway is much more "spacious."
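A single LSTM step written out in numpy, mirroring the four equations above (weights are random placeholders; the \([h_{t-1};\ x_t]\) concatenation gives the same 4 × 8 gate shapes used in the worked Seq2Seq example later):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_h, d_x = 4, 4
W_f, W_i, W_C, W_o = (rng.normal(scale=0.5, size=(d_h, d_h + d_x)) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(d_h) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM time step: four gates plus the additive cell-state update."""
    hx = np.concatenate([h_prev, x_t])       # [h_{t-1}; x_t], 8-dimensional
    f_t = sigmoid(W_f @ hx + b_f)            # forget gate
    i_t = sigmoid(W_i @ hx + b_i)            # input gate
    C_tilde = np.tanh(W_C @ hx + b_C)        # candidate memory
    o_t = sigmoid(W_o @ hx + b_o)            # output gate
    C_t = f_t * C_prev + i_t * C_tilde       # additive update: the gradient "highway"
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(3, d_x)):        # run over a toy 3-step input sequence
    h, C = lstm_step(x_t, h, C)
print(h, C)
```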
Concrete Example: Named Entity Recognition (NER) with LSTM
Task: Identify all named entities (person names, locations, organization names) in the sentence "乔布斯 在 加州 创办 了 苹果 公司" (Jobs founded Apple in California).
NER is a sequence labeling task — it outputs a label for each word, rather than a single label for the entire sentence. The commonly used labeling scheme is BIO tagging:
- B-PER: Beginning of a person name (Begin-Person)
- I-PER: Inside a person name (Inside-Person)
- B-LOC: Beginning of a location
- B-ORG: Beginning of an organization name
- I-ORG: Inside an organization name
- O: Not part of any entity (Outside)
Step 1: Word Embedding
Each word is converted to a word vector and fed into the LSTM.
Step 2: LSTM processes words sequentially, outputting a label at each step
Unlike sentiment analysis, NER requires output at every position, not just the last step:
Words:  乔布斯    在     加州    创办    了     苹果    公司
          ↓       ↓       ↓       ↓      ↓       ↓       ↓
       [LSTM]→[LSTM]→[LSTM]→[LSTM]→[LSTM]→[LSTM]→[LSTM]
          ↓       ↓       ↓       ↓      ↓       ↓       ↓
Labels: B-PER     O     B-LOC     O      O     B-ORG   I-ORG
Step 3: Classification at each position
The hidden state \(h_t\) at each step is passed through a fully connected layer to predict a label:

\[\hat{y}_t = \text{softmax}(W_{\text{out}} h_t + b_{\text{out}}), \qquad t = 1, \dots, 7\]
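A sketch of the per-position classification step (the hidden states `H` are random placeholders standing in for the LSTM outputs \(h_1 \dots h_7\); with trained weights the argmax would give the BIO labels shown above):

```python
import numpy as np

labels = ["O", "B-PER", "I-PER", "B-LOC", "B-ORG", "I-ORG"]
rng = np.random.default_rng(1)

T, d_h = 7, 4                        # 7 tokens, hidden size 4
H = rng.normal(size=(T, d_h))        # placeholder for the LSTM hidden states h₁ ... h₇
W_out = rng.normal(size=(len(labels), d_h))
b_out = np.zeros(len(labels))

for t in range(T):                   # unlike sentiment analysis, a label is predicted at EVERY position
    logits = W_out @ H[t] + b_out
    probs = np.exp(logits) / np.exp(logits).sum()
    print(t, labels[int(np.argmax(probs))])
```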
The key advantage of LSTM in NER: To recognize "苹果 公司" (Apple Company) as an organization, the model needs to remember the verb "创办" (founded) from earlier — this signals that what follows is an organization. If the sentence were "吃 了 苹果" (ate an apple), the same word "苹果" would not be an entity. The LSTM's gating mechanism allows it to retain these long-range contextual cues.
Why would RNN perform worse on NER? In long sentences such as "According to Xinhua News Agency's report this morning, the famous entrepreneur Jobs founded Apple Company in San Francisco, California" — many words separate "founded" from "Apple Company." The RNN's vanishing gradient problem makes it difficult for the model to learn the influence of "founded" on the label of "Apple," whereas the LSTM can retain this signal through the forget gate.
A better approach in practice: BiLSTM-CRF
A unidirectional LSTM can only see preceding context. But in NER, the following context is also important — seeing "Company" helps confirm that "Apple" is an organization name. Therefore, in practice, a Bidirectional LSTM (BiLSTM) is commonly used:
Forward:  乔布斯 → 在 → 加州 → 创办 → 了 → 苹果 → 公司
                                              → h⃗₆
Backward: 乔布斯 ← 在 ← 加州 ← 创办 ← 了 ← 苹果 ← 公司
                                           h⃖₆ ←
Adding a CRF (Conditional Random Field) layer for global label decoding ensures the label sequence is valid (e.g., I-ORG must be preceded by B-ORG or I-ORG, and cannot appear out of nowhere). BiLSTM-CRF was the standard approach for NER tasks before the Transformer era.
GRU (Gated Recurrent Unit)
GRU (Cho et al., 2014) is a simplified version of LSTM that merges the forget gate and input gate into a single update gate, resulting in fewer parameters and faster training:
Update Gate — controls how much old information to retain:

\[z_t = \sigma(W_z \cdot [h_{t-1};\ x_t])\]

Reset Gate — controls how much old information to ignore:

\[r_t = \sigma(W_r \cdot [h_{t-1};\ x_t])\]

Candidate hidden state:

\[\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1};\ x_t])\]

Final hidden state — interpolates between the old state and the candidate state:

\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
When \(z_t \approx 0\), information is copied directly (similar to the LSTM forget gate being 1); when \(z_t \approx 1\), the state is completely replaced with new information.
LSTM vs. GRU: Performance is usually very similar. GRU has fewer parameters and is better suited for small datasets, while LSTM has a slight advantage on very long sequences. In practice, the difference between the two is minimal.
Seq2Seq Architecture and the Fixed-Vector Bottleneck
Seq2Seq (Sutskever et al., 2014)
The core paradigm for machine translation: the Encoder-Decoder architecture.
        Encoder                                     Decoder
┌─────────────────────┐              ┌─────────────────────────┐
│  Je      → h₁       │              │        → I              │
│  suis    → h₂       │      c       │  I     → am             │
│  heureux → h₃  ─────┼──────→───────│  am    → happy          │
│                     │              │  happy → <EOS>          │
└─────────────────────┘              └─────────────────────────┘
     French input       context vector       English output
Workflow:
- Encoding phase: The encoder (an RNN/LSTM) reads the source-language sentence word by word, and uses the final hidden state \(h_T\) as the context vector \(c\)
- Decoding phase: The decoder (another RNN/LSTM) uses \(c\) as its initial hidden state and generates the target-language sentence word by word
Concrete Example: Chinese-to-English Translation with Seq2Seq (with Detailed Numerical Values)
Task: Translate "我 喜欢 猫" (I like cats) into "I like cats."
To make the process clear, we use an extremely simplified setup:
- Chinese vocabulary: {我, 喜欢, 猫, ...} — 1,000 words total
- English vocabulary: {I, like, cats, dog, the, am, happy, ..., <BOS>, <EOS>} — 30,000 words total
- Word embedding dimension: \(d = 4\) (typically 256–512 in real scenarios)
- LSTM hidden state dimension: \(d_h = 4\) (typically 256–512 in real scenarios)
Phase 1: Encoding — Compressing the Chinese sentence into a vector
The encoder is just forward propagation. It has no special "compression" or "understanding" mechanism — it simply runs an LSTM through the input sequence via forward propagation (matrix multiplication + activation function), updating the hidden state each time a word is read. The final hidden state is the context vector. If you understand forward propagation in a fully connected network (\(y = \sigma(Wx + b)\)), then the encoder just adds "recurrence" and "gating" on top of that.
Step 1: Word Embedding — Converting text to numbers
Each Chinese word is looked up in an embedding matrix \(E_{\text{zh}} \in \mathbb{R}^{1000 \times 4}\) (1,000 words, 4 dimensions each) to obtain its word vector:
These values are not manually specified — they are learned automatically during model training.
Step 2: LSTM processes words sequentially — the core of the encoder
The LSTM reads in each word vector from left to right, updating the hidden state \(h_t\) and cell state \(C_t\) at each step. Below we fully expand the internal LSTM computation — each step is just four matrix multiplications plus some element-wise operations, nothing more complex.
The LSTM has four sets of learnable parameters (weight matrices), one for each "gate":
- \(W_f \in \mathbb{R}^{4 \times 8}\), \(b_f \in \mathbb{R}^{4}\): forget gate parameters
- \(W_i \in \mathbb{R}^{4 \times 8}\), \(b_i \in \mathbb{R}^{4}\): input gate parameters
- \(W_C \in \mathbb{R}^{4 \times 8}\), \(b_C \in \mathbb{R}^{4}\): candidate memory parameters
- \(W_o \in \mathbb{R}^{4 \times 8}\), \(b_o \in \mathbb{R}^{4}\): output gate parameters
Why \(4 \times 8\)? Because the input at each step is \([h_{t-1};\ x_t]\) (the previous hidden state concatenated with the current word vector): \(h\) is 4-dimensional + \(x\) is 4-dimensional = 8 dimensions. The output is 4-dimensional (the hidden state dimension).
Processing the 1st word: "我" (I)
Initial state: \(h_0 = [0, 0, 0, 0]\), \(C_0 = [0, 0, 0, 0]\) (all zeros)
(a) Concatenate input
Simply concatenating two 4-dimensional vectors end-to-end into one 8-dimensional vector.
(b) Four matrix multiplications — one for each gate
Each gate does exactly the same thing: 8-dim input × weight matrix + bias → activation function → 4-dim output. Identical to a standard fully connected layer \(y = \sigma(Wx + b)\):

\[f_1 = \sigma(W_f [h_0;\ x_1] + b_f) \qquad i_1 = \sigma(W_i [h_0;\ x_1] + b_i) \qquad \tilde{C}_1 = \tanh(W_C [h_0;\ x_1] + b_C) \qquad o_1 = \sigma(W_o [h_0;\ x_1] + b_o)\]
where \(\sigma\) is the Sigmoid function (outputs 0–1, controlling gate openness) and \(\tanh\) outputs values in the range -1 to 1.
(c) Update cell state — element-wise multiplication and addition
Expanded computation (\(\odot\) means element-wise multiplication):

\[C_1 = f_1 \odot C_0 + i_1 \odot \tilde{C}_1 = i_1 \odot \tilde{C}_1\]
Since \(C_0\) is all zeros (no old memory to forget), the forget gate \(f_1\) has no effect at this step. The cell state is determined entirely by the input gate \(i_1\) and candidate memory \(\tilde{C}_1\).
(d) Output hidden state

\[h_1 = o_1 \odot \tanh(C_1) = [0.09,\ -0.11,\ 0.04,\ 0.02]\]

This is the hidden state after reading "我" (I).
Processing the 2nd word: "喜欢" (like)
(a) Concatenation: \([h_1;\ x_2] = [0.09,\ -0.11,\ 0.04,\ 0.02,\ 0.65,\ 0.33,\ -0.18,\ 0.51]\)
(b) Four matrix multiplications — using the same set of weight matrices (\(W_f, W_i, W_C, W_o\) are shared across all time steps):
(c) Update cell state — now the forget gate starts playing a role: \(C_2 = f_2 \odot C_1 + i_2 \odot \tilde{C}_2\)
Note the third dimension: \(f_2[3] = 0.42\) means the forget gate retained only 42% of the old information (0.06 × 0.42 = 0.03), while simultaneously writing in new information (-0.43). This is the LSTM's "selective memory" process in action.
(d) Output hidden state: \(h_2 = o_2 \odot \tanh(C_2)\)
Processing the 3rd word: "猫" (cat)
Exactly the same process: concatenate \([h_2; x_3]\) → four matrix multiplications → update \(C_3\) → output \(h_3\).
Step 3: Extract the context vector

\[c = h_3 = [0.58,\ -0.07,\ 0.39,\ 0.44]\]

This is the context vector. All the information of the entire sentence "我喜欢猫" is compressed into these 4 numbers (256–512 numbers in real scenarios).
Recap of what the encoder did:
"我"   → embedding lookup → x₁ ─┐
                                ├→ [h₀; x₁] → Wf×+b→sigmoid → f₁ ─┐
                                │            Wi×+b→sigmoid → i₁   ├→ C₁, h₁
                                │            Wc×+b→tanh    → C̃₁   │
                                │            Wo×+b→sigmoid → o₁ ──┘
                                │                                   ↓
"喜欢" → embedding lookup → x₂ ─┤→ [h₁; x₂] → same four matrix multiplications → C₂, h₂
                                │                                   ↓
"猫"   → embedding lookup → x₃ ─┘→ [h₂; x₃] → same four matrix multiplications → C₃, h₃ = c
                                                                            ↑
                                                                     context vector
That's it. Three words, four matrix multiplications per word (\(W \times \text{input} + b\), then an activation function), for a total of 12 matrix multiplications. All time steps share the same set of weights. The encoder performs no operations beyond forward propagation.
The key point to understand: \(c\) is not "some kind of encoding of the Chinese sentence," nor is it "an index for looking up matches in English." It is an abstract semantic representation — a set of numbers that, through training, have been adjusted so that "the decoder can extract enough information from them to generate the correct translation." Each dimension has no human-readable meaning, but together they contain all the signals needed for generation.
Where does the "understanding" come from? Not from the encoder's architecture, but from training. During backpropagation, gradients tell the weight matrices \(W_f, W_i, W_C, W_o\) how to adjust so that \(h_3\) contains exactly the information the decoder needs to generate the correct translation. After training on hundreds of thousands of sentence pairs, these weights are tuned to a state where "forward propagation happens to produce useful semantic representations."
Phase 2: Decoding — Generating English words one by one from the vector
The decoder is a separate LSTM (its parameters are not shared with the encoder). Its operation is not "retrieval" or "matching" but rather sequential generation — at each step, it selects the word with the highest probability from among the 30,000 English words.
The decoder has two key components:
- Decoder LSTM: maintains the hidden state during the decoding process
- Output projection layer: a matrix \(W_{\text{vocab}} \in \mathbb{R}^{30000 \times 4}\) that maps the 4-dimensional hidden state to 30,000 dimensions (one per English word)
Step 1: Generate the first English word
Input:   the word vector of <BOS> (beginning-of-sentence token): y₀ = Embed(<BOS>) = [0.01, 0.02, -0.01, 0.03]
Initial: h₀ᵈᵉᶜ = c = [0.58, -0.07, 0.39, 0.44]  ← initialized with the context vector!
(a) The decoder LSTM computes a new hidden state:

\[h_1^{\text{dec}} = \text{LSTM}(y_0,\ h_0^{\text{dec}}) = [0.71,\ 0.15,\ -0.23,\ 0.52]\]
(b) Output projection — map the 4-dimensional hidden state onto 30,000 words:

\[\text{scores} = W_{\text{vocab}} \cdot h_1^{\text{dec}} \in \mathbb{R}^{30000}\]
This step performs a matrix multiplication. \(W_{\text{vocab}}\) is a \(30000 \times 4\) matrix where each row corresponds to an English word. The essence of this matrix multiplication is: computing a "relevance score" between the current hidden state and each English word.
(c) Softmax — convert scores to a probability distribution:

\[P(w) = \frac{\exp(\text{scores}_w)}{\sum_{w'} \exp(\text{scores}_{w'})}\]
All 30,000 probabilities sum to 1. "I" has the highest probability (0.72), so the model outputs "I."
Why does "I" have the highest probability? Because the context vector \(c\) contains a "first-person subject" signal. Through extensive training data, the model has learned: when \(c\) contains this type of signal and the decoder is at the sentence start position, "I" is the most likely first word.
Step 2: Generate the second English word
Input: the word vector of "I": y₁ = Embed("I") = [0.55, 0.18, -0.42, 0.33]
State: h₁ᵈᵉᶜ = [0.71, 0.15, -0.23, 0.52]  ← the hidden state from the previous step
(a) The LSTM updates the state: \(h_2^{\text{dec}} = \text{LSTM}(y_1,\ h_1^{\text{dec}})\)
At this point, \(h_2^{\text{dec}}\) encodes two kinds of information: the original Chinese semantics (inherited from \(c\)) + the fact that "I" has already been generated (from the input \(y_1\)).
(b) Projection + Softmax: \(P(\cdot) = \text{softmax}(W_{\text{vocab}} \cdot h_2^{\text{dec}})\)
"like" has the highest probability (0.68), so the model outputs "like." Note that "love" also has a non-negligible probability (0.15), since "喜欢" can also be translated as "love" — the model is making a probabilistic choice here.
Step 3: Generate the third English word
Input: the word vector of "like": y₂ = Embed("like")
State: h₂ᵈᵉᶜ
"cats" has the highest probability, so the model outputs "cats."
Step 4: Generate the end-of-sentence token
The model outputs <EOS>, and translation ends. Final result: "I like cats".
Complete diagram of the decoding process:
Context vector c ──→ initializes the decoder
                             │
                             ↓
<BOS>  ─→ Embed ─→ [LSTM] ─→ h₁ ─→ W_vocab × h₁ ─→ softmax ─→ [0.72, 0.03, ...] ─→ "I"
   ↑                                          probabilities over 30,000 words       pick the max
   ↓
"I"    ─→ Embed ─→ [LSTM] ─→ h₂ ─→ W_vocab × h₂ ─→ softmax ─→ [0.01, 0.68, ...] ─→ "like"
                                                                                     pick the max
   ↓
"like" ─→ Embed ─→ [LSTM] ─→ h₃ ─→ W_vocab × h₃ ─→ softmax ─→ [..., 0.61, ...] ─→ "cats"
                                                                                     pick the max
   ↓
"cats" ─→ Embed ─→ [LSTM] ─→ h₄ ─→ W_vocab × h₄ ─→ softmax ─→ [..., 0.91, ...] ─→ <EOS>
                                                                                     stop
A common misconception needs to be clarified: The decoder is not "searching" or "matching" — it does not look for a word in the English vocabulary that is most similar to \(c\). What it does is conditional generation: given \(c\) (the source meaning) and the words already generated, it computes a probability distribution over the next word through the learned mapping \(W_{\text{vocab}}\). Each row of \(W_{\text{vocab}}\) can be thought of as "what kind of hidden state is needed for a particular English word to be selected," while the LSTM is responsible for producing the appropriate hidden state at each step. The entire process is a conditional language model:

\[P(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, c)\]
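A hedged sketch of that conditional-generation loop with greedy decoding. The tiny vocabulary, the embedding table, and the plain-tanh `decoder_step` are all stand-ins for the trained components described above. The point is only the shape of the loop: feed the previous word, update the state, project onto the vocabulary, pick the most probable word, repeat until <EOS>:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<BOS>", "<EOS>", "I", "like", "cats", "dog", "the"]   # tiny stand-in vocabulary
d = 4
embed = {w: rng.normal(size=d) for w in vocab}                  # stand-in embedding table
W_vocab = rng.normal(size=(len(vocab), d))                      # output projection (one row per word)
W_h = rng.normal(scale=0.5, size=(d, d))
W_y = rng.normal(scale=0.5, size=(d, d))

def decoder_step(y_prev, h_prev):
    """Stand-in for the decoder LSTM step (a plain tanh update, for illustration only)."""
    return np.tanh(W_h @ h_prev + W_y @ y_prev)

c = rng.normal(size=d)          # pretend this is the encoder's context vector
h = c                           # the decoder's hidden state is initialized with c
word, output = "<BOS>", []
for _ in range(10):             # generate at most 10 tokens
    h = decoder_step(embed[word], h)
    scores = W_vocab @ h                           # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax
    word = vocab[int(np.argmax(probs))]            # greedy choice: most probable next word
    if word == "<EOS>":
        break
    output.append(word)
print(output)   # arbitrary with random weights; a trained model would produce ["I", "like", "cats"]
```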
Training vs. Inference
During training, Teacher Forcing is used: regardless of what the model itself predicted, the ground-truth target word is fed to the next step. This accelerates convergence and prevents early prediction errors from snowballing.
Training:   <BOS>  → [LSTM] → predict "I"     ✓
            "I"    → [LSTM] → predict "like"  ✓  ← the ground-truth label "I" is fed in, not the model's prediction
            "like" → [LSTM] → predict "dogs"  ✗  ← even when the prediction is wrong,
            "cats" → [LSTM] → predict "<EOS>"    ← the ground-truth label "cats" is still fed in and training continues
Inference:  <BOS>  → [LSTM] → predict "I"
            "I"    → [LSTM] → predict "like"  ← the model's own prediction from the previous step is fed in
            "like" → [LSTM] → predict "cats"
            "cats" → [LSTM] → predict "<EOS>" ← if a prediction goes wrong midway, every later step inherits the error
This also exposes another problem with Seq2Seq: the mismatch between training and inference input distributions (Exposure Bias). During training, the decoder always sees the correct answer; during inference, it must rely on its own potentially incorrect predictions.
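A schematic contrast of the two loops (placeholder-style Python; `decoder_step` and `predict_word` stand in for the trained model, and the loss computation is omitted):

```python
def train_step(decoder_step, predict_word, context, target_words):
    """Teacher forcing: the ground-truth previous word is fed in, whatever the model predicted."""
    h, prev = context, "<BOS>"
    for gold in target_words:          # e.g. ["I", "like", "cats", "<EOS>"]
        h = decoder_step(prev, h)
        pred = predict_word(h)         # compared against `gold` to compute the loss (omitted)
        prev = gold                    # <-- feed the TRUE word, not `pred`

def generate(decoder_step, predict_word, context, max_len=50):
    """Inference: the model must feed itself its own (possibly wrong) predictions."""
    h, prev, out = context, "<BOS>", []
    for _ in range(max_len):
        h = decoder_step(prev, h)
        prev = predict_word(h)         # <-- feed the model's OWN prediction
        if prev == "<EOS>":
            break
        out.append(prev)
    return out
```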
The Fixed-Vector Bottleneck (Information Bottleneck)
Core problem: No matter how long the source-language sentence is (10 words or 100 words), all information is compressed into a single fixed-length vector \(c\).
This is like asking you to read an entire book and then summarize it in a single sentence, and then another person has to translate the entire book based on that one sentence.
Empirical evidence: Cho et al. (2014) found that when sentence length exceeds 20–30 words, the BLEU score of Seq2Seq drops sharply.
This bottleneck directly gave rise to the Attention Mechanism — allowing the decoder to no longer rely solely on a single compressed vector, but instead "look back" at all of the encoder's hidden states when generating each word.
Summary of Traditional NLP Limitations
| Method | Core Limitation | Subsequent Solution |
|---|---|---|
| One-Hot / Bag of Words / TF-IDF | No semantic information, word order lost | Word2Vec / GloVe |
| Word2Vec / GloVe | Static embeddings, cannot handle polysemy | ELMo / BERT (contextual embeddings) |
| RNN | Vanishing gradients, difficulty with long-range dependencies | LSTM / GRU |
| LSTM / GRU | Still sequential processing, cannot parallelize; still struggles with very long sequences | Attention Mechanism |
| Seq2Seq | Fixed-vector bottleneck | Attention |
| RNN + Attention | Still cannot parallelize training | Transformer (purely Attention-based) |
The logic behind the Transformer's creation: Since Attention had already become the most important component and RNNs had become the bottleneck for parallelization, why not completely discard RNNs and use only Attention? This is the core idea of the 2017 paper "Attention is All You Need."
Next: → Attention Mechanism (detailed mathematical derivation of Attention)