Graph Attention Networks (GAT)
Graph Attention Networks (GAT) were proposed by Velickovic et al. (ICLR 2018). The core innovation of GAT is an attention mechanism that adaptively assigns different weights to different neighbors, rather than the fixed, degree-based normalization used in GCN. This design lets the model implicitly learn the importance of each neighbor, yielding stronger expressive power.
Learning path: GCN/GraphSAGE fundamentals → Attention mechanism motivation → GAT attention computation → Multi-head attention → GATv2 improvements → Experiments and applications
Motivation: Adaptive vs. Fixed Weights
The Weight Problem in GCN
Recall GCN's message passing rule. The update for node \(v\) at layer \(l\) is:

\[ h_v^{(l+1)} = \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{d_v d_u}} W^{(l)} h_u^{(l)} \Big) \]
The weights \(\frac{1}{\sqrt{d_v d_u}}\) are entirely determined by the graph topology (node degrees), which has the following limitations:
| Issue | Description |
|---|---|
| Fixed weights | Neighbor weights depend solely on degree, not on features |
| Indiscriminate treatment | All neighbors are treated equally; important neighbors cannot be distinguished |
| Lack of flexibility | The same node may need to attend to different neighbors under different tasks/contexts |
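To make the "fixed weights" point concrete, here is a tiny NumPy sketch (illustrative only) of GCN's symmetric normalization on a 4-node star graph: the coefficient for edge \((v, u)\) is \(1/\sqrt{d_v d_u}\) and is computed purely from degrees, so it cannot change no matter what the node features are.

```python
import numpy as np

# Star graph: node 0 connected to nodes 1, 2, 3
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)

A_hat = A + np.eye(4)                    # add self-loops
d = A_hat.sum(axis=1)                    # degrees with self-loops: [4, 2, 2, 2]
norm = A_hat / np.sqrt(np.outer(d, d))   # entry (v, u) = 1/sqrt(d_v * d_u)

print(norm[0, 1])                        # 1/sqrt(4*2) ≈ 0.3536 -- features never enter
```

Note that nodes 1, 2, 3 all receive the identical weight from node 0; GAT's attention mechanism is designed precisely to break this tie using features.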
GAT's solution: Let the model automatically learn how much weight each neighbor should receive, analogous to the self-attention mechanism in Transformers.
Attention Mechanism
Attention Coefficient Computation
GAT computes the attention coefficient between node \(v\) and its neighbor \(u\) through the following steps:
Step 1: Linear Transformation
Apply a shared linear transformation to all node features:

\[ z_v = W h_v \]
where \(W \in \mathbb{R}^{F' \times F}\) is a learnable weight matrix.
Step 2: Compute Raw Attention Scores
Concatenate the transformed features and pass them through a shared attention vector \(\mathbf{a} \in \mathbb{R}^{2F'}\) with LeakyReLU activation to obtain raw attention scores:

\[ e_{vu} = \mathrm{LeakyReLU}\big( \mathbf{a}^\top [\, W h_v \,\|\, W h_u \,] \big) \]
where \(\|\) denotes vector concatenation and the negative slope of LeakyReLU is typically set to 0.2.
Step 3: Softmax Normalization
Apply softmax over all neighbors of node \(v\) to obtain normalized attention weights:

\[ \alpha_{vu} = \mathrm{softmax}_u(e_{vu}) = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})} \]
Step 4: Weighted Aggregation
Compute a weighted sum of neighbor features using the attention weights:

\[ h_v' = \sigma\Big( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u \Big) \]
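The four steps above can be sketched in a few lines of NumPy. This is a didactic, dense-matrix illustration of one attention head (real implementations operate on sparse edge lists); the shapes and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, Fp = 4, 5, 3                            # nodes, input dim F, output dim F'

h = rng.normal(size=(N, F))                   # node features
W = rng.normal(size=(Fp, F))                  # shared weight matrix (F' x F)
a = rng.normal(size=(2 * Fp,))                # shared attention vector (2F')
adj = np.array([[1, 1, 0, 1],                 # adjacency with self-loops
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [1, 0, 1, 1]])

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Step 1: shared linear transform, z_u = W h_u
z = h @ W.T                                   # (N, F')

# Step 2: e_vu = LeakyReLU(a^T [z_v || z_u]) = LeakyReLU(a_l . z_v + a_r . z_u)
e = leaky_relu((z @ a[:Fp])[:, None] + (z @ a[Fp:])[None, :])

# Step 3: softmax restricted to first-order neighbors (masked attention)
e = np.where(adj > 0, e, -np.inf)
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# Step 4: weighted aggregation (output nonlinearity sigma omitted for clarity)
h_new = alpha @ z                             # (N, F')
```

Each row of `alpha` sums to 1 and is zero outside the node's neighborhood, matching the masked-softmax definition above.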
Intuition Behind the Attention Mechanism
| Aspect | Explanation |
|---|---|
| Information source | Attention weights are jointly determined by both nodes' features |
| Locality | Attention is computed only within first-order neighbors (masked attention) |
| Adaptiveness | Different nodes can assign different weights to the same neighbor |
| Parameter efficiency | Attention parameters \(\mathbf{a}\) are shared across all node pairs |
Multi-Head Attention
To enhance expressive power and training stability, GAT employs multi-head attention, similar to the Transformer:
Intermediate layers use concatenation:

\[ h_v' = \Big\Vert_{k=1}^{K} \sigma\Big( \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(k)} W^{(k)} h_u \Big) \]
where \(K\) is the number of heads and \(\|\) denotes concatenation. The output dimension is \(K \times F'\).
The final layer uses averaging:

\[ h_v' = \sigma\Big( \frac{1}{K} \sum_{k=1}^{K} \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(k)} W^{(k)} h_u \Big) \]
| Multi-head attention parameter | Original paper setting |
|---|---|
| Number of heads \(K\) | 8 (intermediate layers) / 1 (output layer) |
| Per-head output dimension \(F'\) | 8 |
| Attention dropout | 0.6 |
| Feature dropout | 0.6 |
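A minimal NumPy sketch of the two combination modes, using the paper's hyperparameters (\(K = 8\) heads, \(F' = 8\) per head). The helper `gat_head` and the toy fully-connected graph are our own illustrative assumptions; note the output dimension is \(K \times F'\) when concatenating and \(F'\) when averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F, Fp, K = 4, 5, 8, 8
adj = np.ones((N, N))                 # toy fully-connected graph (with self-loops)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, W, a):
    """One attention head: returns (N, F') aggregated features."""
    z = h @ W.T
    e = leaky_relu((z @ a[:z.shape[1]])[:, None] + (z @ a[z.shape[1]:])[None, :])
    e = np.where(adj > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z

h = rng.normal(size=(N, F))
heads = [gat_head(h, rng.normal(size=(Fp, F)), rng.normal(size=(2 * Fp,)))
         for _ in range(K)]

h_hidden = np.concatenate(heads, axis=1)   # intermediate layer: (N, K*F')
h_out = np.mean(heads, axis=0)             # final layer: (N, F')
print(h_hidden.shape, h_out.shape)         # (4, 64) (4, 8)
```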
Comparison with GCN and GraphSAGE
| Dimension | GCN | GraphSAGE | GAT |
|---|---|---|---|
| Neighbor weights | Fixed (degree normalization) | Fixed (uniform/Pool) | Adaptive (attention-learned) |
| Weight basis | Graph structure | Graph structure + aggregator | Node features |
| Expressiveness | Low | Medium | High |
| Interpretability | Low | Low | High (attention weights can be visualized) |
| Computational complexity | \(O(\|V\| \cdot F \cdot F' + \|E\| \cdot F')\) | \(O(N \cdot S \cdot F \cdot F')\) | \(O(\|V\| \cdot F \cdot F' + \|E\| \cdot F')\) per head |
| Inductive capability | No (transductive) | Yes | Yes |
A unique advantage of GAT is interpretability: attention weights can be visualized to understand which neighbors the model "focuses on," which is highly valuable in bioinformatics and social network analysis.
GATv2: Fixing the Static Attention Problem
The Static Attention Issue in GAT
Brody et al. (2022) discovered that the original GAT suffers from a static attention problem.
The attention computation in the original GAT can be decomposed as:

\[ e_{vu} = \mathrm{LeakyReLU}\big( \mathbf{a}_l^\top W h_v + \mathbf{a}_r^\top W h_u \big) \]

where \(\mathbf{a} = [\mathbf{a}_l \| \mathbf{a}_r]\). The issue is that the score is a sum of two independent terms, one depending only on \(v\) and one only on \(u\), with LeakyReLU applied after the summation.
Because LeakyReLU is monotone, the query term \(\mathbf{a}_l^\top W h_v\) merely shifts every score in node \(v\)'s row by the same constant. Consequently, every query node ranks its neighbors in the same order -- GAT's attention is static and cannot rank neighbors differently for different queries.
GATv2's Fix
GATv2 resolves this by changing the order of operations -- first concatenating and transforming the pair, applying the nonlinearity, and only then projecting with \(\mathbf{a}\):

\[ e_{vu} = \mathbf{a}^\top \mathrm{LeakyReLU}\big( W [\, h_v \,\|\, h_u \,] \big) \]
| Comparison | GAT | GATv2 |
|---|---|---|
| Nonlinearity placement | \(\mathrm{LeakyReLU}(\mathbf{a}^\top [W h_v \| W h_u])\) -- after the final projection | \(\mathbf{a}^\top \mathrm{LeakyReLU}(W [h_v \| h_u])\) -- before the final projection |
| Attention type | Static (fixed ranking) | Dynamic (ranking varies with query) |
| Expressiveness | Limited | Fully dynamic attention |
| Computational overhead | Slightly lower | Slightly higher (extra matrix multiplication) |
| Empirical performance | Baseline | Outperforms GAT on most tasks |
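The static-attention argument can be verified numerically. In the sketch below (our own toy setup with random parameters), GAT's rankings over keys are provably identical for every query, because the query term is an additive per-row constant under a monotone LeakyReLU; GATv2's scores pass through the nonlinearity between the two learned projections, so its rankings can depend on the query.

```python
import numpy as np

rng = np.random.default_rng(2)
N, F, Fp = 6, 5, 4
h = rng.normal(size=(N, F))

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# GAT scoring: e_vu = LeakyReLU(a_l^T W h_v + a_r^T W h_u)
W = rng.normal(size=(Fp, F))
a_l, a_r = rng.normal(size=Fp), rng.normal(size=Fp)
z = h @ W.T
e_gat = leaky_relu((z @ a_l)[:, None] + (z @ a_r)[None, :])

# The query only shifts each row by a constant; LeakyReLU is monotone,
# so every query ranks the keys identically:
r_gat = np.argsort(e_gat, axis=1)
print((r_gat == r_gat[0]).all())          # True: static attention

# GATv2 scoring: e_vu = a2^T LeakyReLU(W2 [h_v || h_u])
W2 = rng.normal(size=(Fp, 2 * F))
a2 = rng.normal(size=Fp)
pairs = np.concatenate([np.repeat(h, N, axis=0), np.tile(h, (N, 1))], axis=1)
e_v2 = (leaky_relu(pairs @ W2.T) @ a2).reshape(N, N)
r_v2 = np.argsort(e_v2, axis=1)
print((r_v2 == r_v2[0]).all())            # typically False: ranking varies per query
```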
Experimental Results and Applications
Performance on Standard Benchmarks
GAT's node classification performance on commonly used citation network datasets:
| Dataset | Nodes | Feature dim | Classes | GAT accuracy |
|---|---|---|---|---|
| Cora | 2,708 | 1,433 | 7 | 83.0 ± 0.7% |
| Citeseer | 3,327 | 3,703 | 6 | 72.5 ± 0.7% |
| Pubmed | 19,717 | 500 | 3 | 79.0 ± 0.3% |
Application Scenarios
| Domain | Task | GAT's advantage |
|---|---|---|
| Molecular property prediction | Atom-level interaction modeling | Attention weights reflect chemical bond importance |
| Knowledge graphs | Entity-relation reasoning | Adaptively focuses on different relation types |
| Text classification | Classification on document graphs | Captures semantic association weights between words/documents |
| Traffic prediction | Flow prediction on road networks | Attention distinguishes upstream/downstream road segment influence |
| Recommender systems | User-item interactions | Different items have varying appeal to users |
Connection to Transformers
GAT and the Transformer's self-attention share a deep connection:
- A Transformer can be viewed as GAT on a complete graph (every token is connected to all other tokens)
- GAT is local attention on a sparse graph
- Both use a query-key mechanism to compute attention weights
- This connection has inspired subsequent Graph Transformer research
GAT elegantly introduced the attention mechanism into graph neural networks, laying the foundation for many subsequent GNN variants (such as GATv2 and Graph Transformers). It stands as one of the most influential works in the field of graph deep learning.