Attention Mechanism
The attention mechanism was not introduced by the Transformer; it was already a popular and influential line of research before the Transformer appeared. It serves as the critical bridge connecting classical deep learning to modern generative AI.
Learning path: Seq2Seq bottleneck → Bahdanau Attention → Luong Attention → Self-Attention → Scaled Dot-Product → Multi-Head Attention
Starting from the Fixed-Vector Bottleneck
As we saw in the traditional NLP notes, the Seq2Seq architecture compresses the entire source sentence into a single fixed-length context vector \(c = h_T^{\text{enc}}\). This bottleneck causes the following problem:
"Le chat noir est assis sur le tapis rouge dans le salon"(黑猫坐在客厅的红地毯上)
↓
整个句子 → 压缩为一个向量 c
↓
"The black cat is sitting on the red carpet in the living room"
All information from 12 words is squeezed into a vector of a few hundred dimensions. The longer the sentence, the more severe the information loss.
The core idea of Attention: Instead of relying on a single compressed vector, the decoder can dynamically attend to information at different positions in the source sentence when generating each target word, assigning a different weight to each position.
Bahdanau Attention (Additive Attention, 2014)
"Neural Machine Translation by Jointly Learning to Align and Translate" — Bahdanau et al., 2014
This is the seminal work on attention mechanisms.
Architecture
In traditional Seq2Seq, the decoder receives only the final hidden state. Bahdanau Attention improves upon this by allowing the decoder to access all encoder hidden states at every step.
Encoder (BiRNN)                                    Decoder
┌────────────┐                                   ┌──────────┐
│ h₁ h₂ h₃ h₄│     α₁ α₂ α₃ α₄                   │          │
│ ↓  ↓  ↓  ↓ │──→  weighted sum  ──→  = cₜ  ──→  │ cₜ + sₜ  │ → yₜ
└────────────┘                                   └──────────┘
        ↑
  the attention weights α are recomputed at every decoder step t
Mathematical Derivation
Step 1: Compute the Alignment Score
The "relevance" between the decoder hidden state \(s_{t-1}\) at step \(t\) and the encoder hidden state \(h_j\) at position \(j\):
Here \(W_a, U_a, v_a\) are all learnable parameters. This scoring function is called Additive Attention because it first applies separate linear transformations to the two vectors and then adds them together.
Step 2: Compute the Attention Weights
Apply softmax normalization over the scores for all source positions to obtain a probability distribution:

\(\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}\)
\(\alpha_{tj}\) represents "how much the decoder attends to the \(j\)-th source word when generating the \(t\)-th target word." All \(\alpha_{tj}\) sum to 1.
Step 3: Compute the Context Vector
Take a weighted sum over the encoder hidden states:

\(c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j\)
Note that a different \(c_t\) is generated at each step \(t\) — it is no longer a single fixed vector. When translating "cat," \(c_t\) may primarily attend to "chat"; when translating "red," \(c_t\) may primarily attend to "rouge."
Step 4: Decoding
Combine the context vector \(c_t\) with the current decoder state to predict the next word:

\(s_t = f(s_{t-1}, y_{t-1}, c_t), \qquad P(y_t \mid y_{<t}, x) = g(y_{t-1}, s_t, c_t)\)
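To make the four steps concrete, here is a minimal NumPy sketch of a single decoder step of additive attention. All names, shapes, and the random toy data are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def bahdanau_attention_step(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau) attention.

    s_prev : (d_dec,)      previous decoder hidden state s_{t-1}
    H      : (T_x, d_enc)  all encoder hidden states h_1 .. h_Tx
    """
    # Step 1: alignment scores e_{tj} = v_a^T tanh(W_a s_{t-1} + U_a h_j)
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a      # (T_x,)
    # Step 2: softmax over source positions -> attention weights
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                           # (T_x,), sums to 1
    # Step 3: context vector = weighted sum of encoder states
    c_t = alpha @ H                                # (d_enc,)
    return c_t, alpha

# Toy dimensions and random "learned" parameters (purely illustrative)
T_x, d_enc, d_dec, d_att = 5, 8, 8, 6
rng = np.random.default_rng(0)
H = rng.normal(size=(T_x, d_enc))
s_prev = rng.normal(size=(d_dec,))
W_a = rng.normal(size=(d_dec, d_att))
U_a = rng.normal(size=(d_enc, d_att))
v_a = rng.normal(size=(d_att,))

c_t, alpha = bahdanau_attention_step(s_prev, H, W_a, U_a, v_a)
print(alpha, alpha.sum())   # weights over the 5 source positions, summing to 1
```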
Intuitive Understanding
Bahdanau Attention is like a translator who, instead of trying to memorize the entire source text, refers back to the original document while translating each word, focusing on the most relevant parts.
Luong Attention (Multiplicative Attention, 2015)
"Effective Approaches to Attention-based Neural Machine Translation" — Luong et al., 2015
Luong Attention simplified Bahdanau Attention and proposed three scoring functions:
| Method | Formula | Description |
|---|---|---|
| dot | \(e_{tj} = s_t^T h_j\) | Simplest, no extra parameters |
| general | \(e_{tj} = s_t^T W_a h_j\) | Adds a learnable matrix |
| concat | \(e_{tj} = v_a^T \tanh(W_a [s_t; h_j])\) | Similar to Bahdanau |
Dot-product attention was later adopted in the Transformer because it is the most computationally efficient (matrix multiplication can be highly parallelized).
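A compact way to compare the three scoring functions in code (NumPy sketch; parameter names and shapes are assumptions for illustration):

```python
import numpy as np

def luong_score(s_t, h_j, method="dot", W_a=None, v_a=None, W_c=None):
    """Alignment score e_{tj} between decoder state s_t and encoder state h_j."""
    if method == "dot":        # s_t^T h_j, no extra parameters
        return s_t @ h_j
    if method == "general":    # s_t^T W_a h_j, one learnable matrix
        return s_t @ W_a @ h_j
    if method == "concat":     # v_a^T tanh(W_c [s_t; h_j]), Bahdanau-style
        return v_a @ np.tanh(W_c @ np.concatenate([s_t, h_j]))
    raise ValueError(method)
```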
Key differences between Bahdanau and Luong:
- Bahdanau uses \(s_{t-1}\) (previous hidden state) to compute attention; Luong uses \(s_t\) (current step)
- Bahdanau uses a bidirectional RNN encoder; Luong uses a unidirectional one
- Luong's dot-product scoring function is more concise and efficient
Self-Attention
The attention mechanisms discussed above are all forms of cross-sequence attention (Cross-Attention): the decoder attends to the encoder. Self-Attention, on the other hand, is where a sequence attends to itself.
From Cross-Attention to Self-Attention
| | Cross-Attention | Self-Attention |
|---|---|---|
| Query source | Decoder | The sequence itself |
| Key/Value source | Encoder | The sequence itself |
| Purpose | Align two sequences | Capture intra-sequence dependencies |
Core Idea
For each word in the input sequence, Self-Attention lets it "ask" every other word in the sequence: "How relevant are you to me?" It then aggregates information based on these relevance scores through weighted summation.
Example: "The animal didn't cross the street because it was too tired."
- What does "it" refer to? Humans immediately recognize it as "animal."
- Self-Attention allows "it" to directly attend to "animal," regardless of how many words separate them.
This is the core advantage of Attention over RNNs: the path length between any two words is \(O(1)\), compared to \(O(n)\) in RNNs.
The Origin of Q, K, V
In Self-Attention, Q, K, and V all come from the same input sequence, but are obtained through different linear transformations:

\(Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V\)
where \(X \in \mathbb{R}^{n \times d_{\text{model}}}\) is the embedding matrix of the input sequence (\(n\) words, each of dimension \(d_{\text{model}}\)), and \(W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_k}\) are learnable projection matrices.
Intuitive Understanding of Q, K, V:
Think of Self-Attention as an information retrieval system:
- Query: "What information am I looking for?" — the information the current word wants to obtain
- Key: "What information can I provide?" — a label each word uses to be matched against
- Value: "What information do I actually contain?" — the actual content of each word
An analogy: you walk into a library with a question (Query). Every book has a label (Key). You decide which books to read based on how well the question matches each label, and then synthesize the content (Value) of those books to get your answer.
Why are three separate matrices needed?
If Q = K = V = X, then each word would always attend most strongly to itself (since a word is most similar to itself). With different projection matrices, the model can learn:
- In Q-space, "it" expresses "I am looking for a nominal antecedent"
- In K-space, "animal" expresses "I am a noun that can be referred to by a pronoun"
- In V-space, "animal" provides its semantic information
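A minimal sketch of the projections \(Q = XW^Q\), \(K = XW^K\), \(V = XW^V\) (toy sizes and random weights are assumptions):

```python
import numpy as np

n, d_model, d_k = 4, 8, 8               # toy sizes, illustrative only
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))        # one embedding per word

# Three separate learnable projections of the same input sequence
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each (n, d_k)
print(Q.shape, K.shape, V.shape)
```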
Scaled Dot-Product Attention
This is the specific attention computation used in the Transformer.
Full Mathematical Derivation
Inputs:
- \(Q \in \mathbb{R}^{n \times d_k}\) (Query matrix)
- \(K \in \mathbb{R}^{n \times d_k}\) (Key matrix)
- \(V \in \mathbb{R}^{n \times d_v}\) (Value matrix)
Step 1: Compute Attention Scores

\(S = QK^T\)

\(S_{ij}\) is the dot product of the \(i\)-th Query and the \(j\)-th Key, representing their "similarity." This step produces an \(n \times n\) attention matrix.
Step 2: Scaling

\(S' = \frac{S}{\sqrt{d_k}} = \frac{QK^T}{\sqrt{d_k}}\)

Why divide by \(\sqrt{d_k}\)? This is a critical detail:
Assume each element of \(Q\) and \(K\) is an independent random variable with mean 0 and variance 1. Then the variance of their dot product \(q \cdot k = \sum_{i=1}^{d_k} q_i k_i\) is \(d_k\) (each term has variance 1, and \(d_k\) independent terms are summed).
When \(d_k\) is large (e.g., 64 or 512), the dot product values become very large, pushing the softmax inputs into the saturation region where gradients are extremely small. For example, \(\text{softmax}([20, 5, 1]) \approx [1.000, 0.000, 0.000]\): nearly all attention concentrates on a single position, the gradient approaches zero, and the model cannot learn. Dividing by \(\sqrt{d_k}\) renormalizes the variance to 1, keeping the softmax in a region with meaningful gradients.
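A quick empirical check of this argument (NumPy sketch; the sample size and dimensions are arbitrary choices): the variance of the raw dot product grows roughly like \(d_k\), and dividing by \(\sqrt{d_k}\) brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    # 10,000 random query/key pairs with unit-variance entries
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)
    # raw variance ~ d_k; scaled variance ~ 1
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
```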
Step 3: Softmax Normalization

\(A = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)\)

Softmax is applied to each row, so that the attention weights from each Query over all Keys sum to 1.
Step 4: Weighted Summation

\(\text{Attention}(Q, K, V) = AV = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
Full computation flow:
Q ──→ ┐
      ├──→ QKᵀ ──→ ÷√dₖ ──→ (Mask) ──→ Softmax ──→ × V ──→ Output
K ──→ ┘                       ↑                      ↑
            (optional: causal mask in the Decoder)   │
V ────────────────────────────────────────────────────┘
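Putting the four steps together, a minimal NumPy sketch of scaled dot-product attention with an optional additive mask (a simplified illustration, not the Transformer reference code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k) + mask) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) scaled scores
    if mask is not None:
        scores = scores + mask                       # -inf at blocked positions
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # output and attention matrix
```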
Numerical Example
Consider the sentence "I love cats" with \(d_k = 2\) (greatly simplified); suppose the Query and Key projections yield the following score matrix:
Step 1: \(QK^T = \begin{bmatrix} 1 & 0 & 0.5 \\ 0 & 1 & 0.5 \\ 1 & 1 & 1 \end{bmatrix}\)
Step 2: \(\frac{QK^T}{\sqrt{2}} = \begin{bmatrix} 0.71 & 0 & 0.35 \\ 0 & 0.71 & 0.35 \\ 0.71 & 0.71 & 0.71 \end{bmatrix}\)
Step 3: Row-wise softmax → \(A \approx \begin{bmatrix} 0.46 & 0.22 & 0.32 \\ 0.22 & 0.46 & 0.32 \\ 0.33 & 0.33 & 0.33 \end{bmatrix}\)
Step 4: \(\text{Output} = A \cdot V\), where each row is a weighted average of the rows of V.
"I" (first row) attends most to itself (0.43), followed by "cats" (0.30), and least to "love" (0.21).
Multi-Head Attention
Motivation
A single set of \(W^Q, W^K, W^V\) can only learn one attention pattern. However, relationships in natural language are multi-dimensional:
- Syntactic relations: subject → verb ("cat" → "sat")
- Coreference relations: pronoun → antecedent ("it" → "cat")
- Modification relations: adjective → noun ("black" → "cat")
- Local relations: collocations between adjacent words
Multi-Head Attention enables the model to simultaneously attend to different types of relational patterns from multiple perspectives.
Mathematical Definition
Project the \(d_{\text{model}}\)-dimensional Q, K, V into \(h\) different lower-dimensional subspaces, perform attention independently in each, and then concatenate:

\(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O\)

where each head is:

\(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
Parameter dimensions:
- Input dimension: \(d_{\text{model}}\) (e.g., 512)
- Number of heads: \(h\) (e.g., 8)
- Dimension per head: \(d_k = d_v = d_{\text{model}} / h\) (e.g., 64)
- Projection matrices: \(W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}\)
- Output projection: \(W^O \in \mathbb{R}^{h \cdot d_v \times d_{\text{model}}}\)
Computation Flow
                  ┌→ head₁: Attention(QW₁^Q, KW₁^K, VW₁^V) ─┐
                  │                                          │
Input Q,K,V ────→ ├→ head₂: Attention(QW₂^Q, KW₂^K, VW₂^V) ─┼→ Concat → W^O → Output
                  │                                          │
                  ├→ ...                                     │
                  │                                          │
                  └→ headₕ: Attention(QWₕ^Q, KWₕ^K, VWₕ^V) ─┘
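A minimal self-attention version of this flow in NumPy (the head-splitting-by-reshape trick and all toy sizes are implementation assumptions; production code batches this differently):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """MultiHead(X) = Concat(head_1 .. head_h) W_O, with Q, K, V all from X.

    W_Q, W_K, W_V : (d_model, d_model), viewed as h stacked (d_model, d_k) blocks
    W_O           : (d_model, d_model)
    """
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # each (n, d_model)

    def split(M):                                       # (n, d_model) -> (h, n, d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores) @ Vh                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                                 # (n, d_model)

# Toy usage
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h).shape)   # (5, 16)
```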
Parameter Count Analysis
Parameter count for single-head attention using the full dimension \(d_{\text{model}}\): the three projections \(W^Q, W^K, W^V\) are each \(d_{\text{model}} \times d_{\text{model}}\), giving \(3 \times d_{\text{model}}^2\) parameters.

Parameter count for multi-head attention (\(h = 8\)): each \(W_i^Q, W_i^K, W_i^V\) has shape \(d_{\text{model}} \times d_k\), so across \(h\) heads the projections contribute \(h \times 3 \times d_{\text{model}} \times d_k = 3 \times d_{\text{model}}^2\) (since \(h \cdot d_k = d_{\text{model}}\)). Adding the output projection \(W^O\) with \(d_{\text{model}}^2\) parameters gives a total of \(4 \times d_{\text{model}}^2\).
Key insight: The parameter count of multi-head attention is comparable to that of single-head full-dimension attention (because the per-head dimension \(d_k = d_{\text{model}}/h\) is reduced), yet it has greater expressive power since it can capture multiple relational patterns simultaneously.
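A quick sanity check of these counts (counting only \(W^Q, W^K, W^V\) for the single-head case, as in the text; whether an output projection is included there is a matter of convention):

```python
d_model, h = 512, 8
d_k = d_model // h

single_head = 3 * d_model * d_model                        # W_Q, W_K, W_V at full width
multi_head  = h * 3 * d_model * d_k + d_model * d_model    # per-head projections + W_O
print(single_head, multi_head)   # 786432 vs 1048576, i.e. 3*d_model^2 vs 4*d_model^2
```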
Visualization of Multi-Head Attention
Different heads learn different attention patterns. Here is a conceptual example:
Sentence: "The cat sat on the mat because it was tired"

Head 1 (syntactic relations):        Head 2 (coreference):
  "sat" ←──── "cat"                    "it"  ←──── "cat"
  "sat" ←──── "mat"                    "was" ←──── "it"

Head 3 (local relations):            Head 4 (long-range relations):
  "the" ←──── "cat"                    "tired" ←── "cat"
  "the" ←──── "mat"                    "sat"   ←── "tired"
Each head acts as an "expert," analyzing the sentence from a different angle. The outputs of all heads are concatenated to aggregate information from all perspectives.
Attention Masking
In practice, attention computation often requires masking:
Padding Mask
Since sentences in a batch have different lengths, shorter sentences need to be padded. The attention weights at padding positions should be zero, which is achieved by setting the corresponding scores to \(-\infty\) before softmax:

\(S_{ij} \leftarrow -\infty \quad \text{if position } j \text{ is padding}\)

After softmax, a score of \(-\infty\) becomes a weight of 0, ensuring the model does not attend to padding positions.
Causal Mask (Look-ahead Mask)
In the Decoder, when generating the \(t\)-th word, the model must not see words after position \(t\) (otherwise it would be "cheating"). An upper-triangular matrix is used to mask future positions:

\(M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}, \qquad \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\)
This guarantees the causality of autoregressive generation: each position can only attend to itself and preceding positions.
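Both masks can be built as additive matrices of \(-\infty\) that are added to the scaled scores before softmax, e.g. via the `mask` argument of the attention sketch above (function names and the batch layout here are assumptions):

```python
import numpy as np

def padding_mask(lengths, n):
    """(batch, 1, n) additive mask: -inf at padded key positions."""
    mask = np.zeros((len(lengths), 1, n))
    for b, L in enumerate(lengths):
        mask[b, :, L:] = -np.inf
    return mask

def causal_mask(n):
    """(n, n) additive mask: -inf strictly above the diagonal (future positions)."""
    return np.triu(np.full((n, n), -np.inf), k=1)

print(causal_mask(4))
# row t can attend to columns 0..t only; all future columns are -inf
```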
Complexity Analysis
| | Self-Attention | RNN | CNN |
|---|---|---|---|
| Computation per layer | \(O(n^2 \cdot d)\) | \(O(n \cdot d^2)\) | \(O(k \cdot n \cdot d^2)\) |
| Sequential operations | \(O(1)\) | \(O(n)\) | \(O(1)\) |
| Maximum path length | \(O(1)\) | \(O(n)\) | \(O(\log_k n)\) |
- \(n\): sequence length, \(d\): model dimension, \(k\): kernel size
Key trade-offs:
- The computational cost of Self-Attention scales quadratically with sequence length (\(n^2\)), which is its main bottleneck. Processing a 100,000-token document is extremely expensive.
- However, Self-Attention has \(O(1)\) sequential operations — all positions can be computed in parallel, whereas RNNs must process step by step. This makes Attention significantly faster than RNNs in practice on GPUs.
- Self-Attention has a maximum path length of \(O(1)\) — any two positions can interact directly in a single step, unlike RNNs which require \(O(n)\) steps of information propagation. This is the fundamental reason why Attention can capture long-range dependencies.
From Attention to the Transformer
The development trajectory of the attention mechanism:
RNN Encoder-Decoder (2014)
    ↓ fixed-vector bottleneck
Bahdanau Attention (2014) - lets the Decoder look back at the Encoder
    ↓ simplified scoring function
Luong Attention (2015) - dot-product scoring, more efficient
    ↓ attention within a single sequence
Self-Attention - positions in one sequence attend to each other
    ↓ multiple perspectives in parallel
Multi-Head Attention - several attention heads working at once
    ↓ RNNs discarded entirely
Transformer (2017) - "Attention Is All You Need"
The core contribution of the Transformer was not inventing the attention mechanism, but proving that: using only attention mechanisms (along with simple feed-forward networks and residual connections), without any recurrent or convolutional structures, one can achieve state-of-the-art performance on sequence modeling tasks.
Next: → Transformer Architecture (detailed walkthrough of the full Encoder-Decoder architecture)