Graph Attention Networks (GAT)
Graph Attention Networks (GAT) were proposed by Velickovic et al. (ICLR 2018). The core innovation of GAT is an attention mechanism that adaptively assigns different weights to different neighbors, rather than the fixed, degree-based normalization used in GCN. This design lets the model implicitly learn the importance of each neighbor, yielding stronger expressive power.
Learning path: GCN/GraphSAGE fundamentals → Attention mechanism motivation → GAT attention computation → Multi-head attention → GATv2 improvements → Experiments and applications
Motivation: Adaptive vs. Fixed Weights
The Weight Problem in GCN
Recall GCN's message passing rule. The update for node \(v\) at layer \(l\) is:

\[ h_v^{(l+1)} = \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{d_v d_u}} W^{(l)} h_u^{(l)} \Big) \]
The weights \(\frac{1}{\sqrt{d_v d_u}}\) are entirely determined by the graph topology (node degrees), which has the following limitations:
| Issue | Description |
|---|---|
| Fixed weights | Neighbor weights depend solely on degree, not on features |
| Indiscriminate treatment | All neighbors are treated equally; important neighbors cannot be distinguished |
| Lack of flexibility | The same node may need to attend to different neighbors under different tasks/contexts |
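To make the "fixed weights" point concrete, here is a tiny NumPy sketch (illustrative only) of GCN's symmetric normalization on a 4-node star graph: the coefficient for edge \((v, u)\) is \(1/\sqrt{d_v d_u}\) and is computed purely from degrees, so it cannot change no matter what the node features are.

```python
import numpy as np

# Star graph: node 0 connected to nodes 1, 2, 3
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)

A_hat = A + np.eye(4)                    # add self-loops
d = A_hat.sum(axis=1)                    # degrees with self-loops: [4, 2, 2, 2]
norm = A_hat / np.sqrt(np.outer(d, d))   # entry (v, u) = 1/sqrt(d_v * d_u)

print(norm[0, 1])                        # 1/sqrt(4*2) ≈ 0.3536 -- features never enter
```

Note that nodes 1, 2, 3 all receive the identical weight from node 0; GAT's attention mechanism is designed precisely to break this tie using features.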
GAT's solution: Let the model automatically learn how much weight each neighbor should receive, analogous to the self-attention mechanism in Transformers.
Attention Mechanism
Attention Coefficient Computation
GAT computes the attention coefficient between node \(v\) and its neighbor \(u\) through the following steps:
Step 1: Linear Transformation
Apply a shared linear transformation to all node features:

\[ z_v = W h_v \]
where \(W \in \mathbb{R}^{F' \times F}\) is a learnable weight matrix.
Step 2: Compute Raw Attention Scores
Concatenate the transformed features and pass them through a shared attention vector \(\mathbf{a} \in \mathbb{R}^{2F'}\) with LeakyReLU activation to obtain raw attention scores:

\[ e_{vu} = \mathrm{LeakyReLU}\big( \mathbf{a}^\top [\, W h_v \,\|\, W h_u \,] \big) \]
where \(\|\) denotes vector concatenation and the negative slope of LeakyReLU is typically set to 0.2.
Step 3: Softmax Normalization
Apply softmax over all neighbors of node \(v\) to obtain normalized attention weights:

\[ \alpha_{vu} = \mathrm{softmax}_u(e_{vu}) = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v)} \exp(e_{vk})} \]
Step 4: Weighted Aggregation
Compute a weighted sum of neighbor features using the attention weights:

\[ h_v' = \sigma\Big( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u \Big) \]
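The four steps above can be sketched in a few lines of NumPy. This is a didactic, dense-matrix illustration of one attention head (real implementations operate on sparse edge lists); the shapes and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, Fp = 4, 5, 3                            # nodes, input dim F, output dim F'

h = rng.normal(size=(N, F))                   # node features
W = rng.normal(size=(Fp, F))                  # shared weight matrix (F' x F)
a = rng.normal(size=(2 * Fp,))                # shared attention vector (2F')
adj = np.array([[1, 1, 0, 1],                 # adjacency with self-loops
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [1, 0, 1, 1]])

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Step 1: shared linear transform, z_u = W h_u
z = h @ W.T                                   # (N, F')

# Step 2: e_vu = LeakyReLU(a^T [z_v || z_u]) = LeakyReLU(a_l . z_v + a_r . z_u)
e = leaky_relu((z @ a[:Fp])[:, None] + (z @ a[Fp:])[None, :])

# Step 3: softmax restricted to first-order neighbors (masked attention)
e = np.where(adj > 0, e, -np.inf)
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# Step 4: weighted aggregation (output nonlinearity sigma omitted for clarity)
h_new = alpha @ z                             # (N, F')
```

Each row of `alpha` sums to 1 and is zero outside the node's neighborhood, matching the masked-softmax definition above.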
Intuition Behind the Attention Mechanism
| Aspect | Explanation |
|---|---|
| Information source | Attention weights are jointly determined by both nodes' features |
| Locality | Attention is computed only within first-order neighbors (masked attention) |
| Adaptiveness | Different nodes can assign different weights to the same neighbor |
| Parameter efficiency | Attention parameters \(\mathbf{a}\) are shared across all node pairs |
Multi-Head Attention
To enhance expressive power and training stability, GAT employs multi-head attention, similar to the Transformer:
Intermediate layers use concatenation:

\[ h_v' = \Big\Vert_{k=1}^{K} \sigma\Big( \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(k)} W^{(k)} h_u \Big) \]
where \(K\) is the number of heads and \(\|\) denotes concatenation. The output dimension is \(K \times F'\).
The final layer uses averaging:

\[ h_v' = \sigma\Big( \frac{1}{K} \sum_{k=1}^{K} \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(k)} W^{(k)} h_u \Big) \]
| Multi-head attention parameter | Original paper setting |
|---|---|
| Number of heads \(K\) | 8 (intermediate layers) / 1 (output layer) |
| Per-head output dimension \(F'\) | 8 |
| Attention dropout | 0.6 |
| Feature dropout | 0.6 |
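A minimal NumPy sketch of the two combination modes, using the paper's hyperparameters (\(K = 8\) heads, \(F' = 8\) per head). The helper `gat_head` and the toy fully-connected graph are our own illustrative assumptions; note the output dimension is \(K \times F'\) when concatenating and \(F'\) when averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F, Fp, K = 4, 5, 8, 8
adj = np.ones((N, N))                 # toy fully-connected graph (with self-loops)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, W, a):
    """One attention head: returns (N, F') aggregated features."""
    z = h @ W.T
    e = leaky_relu((z @ a[:z.shape[1]])[:, None] + (z @ a[z.shape[1]:])[None, :])
    e = np.where(adj > 0, e, -np.inf)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z

h = rng.normal(size=(N, F))
heads = [gat_head(h, rng.normal(size=(Fp, F)), rng.normal(size=(2 * Fp,)))
         for _ in range(K)]

h_hidden = np.concatenate(heads, axis=1)   # intermediate layer: (N, K*F')
h_out = np.mean(heads, axis=0)             # final layer: (N, F')
print(h_hidden.shape, h_out.shape)         # (4, 64) (4, 8)
```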
Comparison with GCN and GraphSAGE
| Dimension | GCN | GraphSAGE | GAT |
|---|---|---|---|
| Neighbor weights | Fixed (degree normalization) | Fixed (uniform/Pool) | Adaptive (attention-learned) |
| Weight basis | Graph structure | Graph structure + aggregator | Node features |
| Expressiveness | Low | Medium | High |
| Interpretability | Low | Low | High (attention weights can be visualized) |
| Computational complexity | \(O(\|V\| \cdot F \cdot F' + \|E\| \cdot F')\) | \(O(N \cdot S \cdot F \cdot F')\) | \(O(\|V\| \cdot F \cdot F' + \|E\| \cdot F')\) per head |
| Inductive capability | No (transductive) | Yes | Yes |
A unique advantage of GAT is interpretability: attention weights can be visualized to understand which neighbors the model "focuses on," which is highly valuable in bioinformatics and social network analysis.
GATv2: Fixing the Static Attention Problem
The Static Attention Issue in GAT
Brody et al. (2022) discovered that the original GAT suffers from a static attention problem.
The attention computation in the original GAT can be decomposed as:

\[ e_{vu} = \mathrm{LeakyReLU}\big( \mathbf{a}_l^\top W h_v + \mathbf{a}_r^\top W h_u \big) \]

where \(\mathbf{a} = [\mathbf{a}_l \| \mathbf{a}_r]\). The issue is that the score is a sum of two independent terms, one depending only on \(v\) and one only on \(u\), with LeakyReLU applied after the summation.
Because LeakyReLU is monotone, the query term \(\mathbf{a}_l^\top W h_v\) merely shifts every score in node \(v\)'s row by the same constant. Consequently, every query node ranks its neighbors in the same order -- GAT's attention is static and cannot rank neighbors differently for different queries.
GATv2's Fix
GATv2 resolves this by changing the order of operations -- first concatenating and transforming the pair, applying the nonlinearity, and only then projecting with \(\mathbf{a}\):

\[ e_{vu} = \mathbf{a}^\top \mathrm{LeakyReLU}\big( W [\, h_v \,\|\, h_u \,] \big) \]
| Comparison | GAT | GATv2 |
|---|---|---|
| Nonlinearity placement | \(\mathrm{LeakyReLU}(\mathbf{a}^\top [W h_v \| W h_u])\) -- after the final projection | \(\mathbf{a}^\top \mathrm{LeakyReLU}(W [h_v \| h_u])\) -- before the final projection |
| Attention type | Static (fixed ranking) | Dynamic (ranking varies with query) |
| Expressiveness | Limited | Fully dynamic attention |
| Computational overhead | Slightly lower | Slightly higher (extra matrix multiplication) |
| Empirical performance | Baseline | Outperforms GAT on most tasks |
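The static-attention argument can be verified numerically. In the sketch below (our own toy setup with random parameters), GAT's rankings over keys are provably identical for every query, because the query term is an additive per-row constant under a monotone LeakyReLU; GATv2's scores pass through the nonlinearity between the two learned projections, so its rankings can depend on the query.

```python
import numpy as np

rng = np.random.default_rng(2)
N, F, Fp = 6, 5, 4
h = rng.normal(size=(N, F))

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# GAT scoring: e_vu = LeakyReLU(a_l^T W h_v + a_r^T W h_u)
W = rng.normal(size=(Fp, F))
a_l, a_r = rng.normal(size=Fp), rng.normal(size=Fp)
z = h @ W.T
e_gat = leaky_relu((z @ a_l)[:, None] + (z @ a_r)[None, :])

# The query only shifts each row by a constant; LeakyReLU is monotone,
# so every query ranks the keys identically:
r_gat = np.argsort(e_gat, axis=1)
print((r_gat == r_gat[0]).all())          # True: static attention

# GATv2 scoring: e_vu = a2^T LeakyReLU(W2 [h_v || h_u])
W2 = rng.normal(size=(Fp, 2 * F))
a2 = rng.normal(size=Fp)
pairs = np.concatenate([np.repeat(h, N, axis=0), np.tile(h, (N, 1))], axis=1)
e_v2 = (leaky_relu(pairs @ W2.T) @ a2).reshape(N, N)
r_v2 = np.argsort(e_v2, axis=1)
print((r_v2 == r_v2[0]).all())            # typically False: ranking varies per query
```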
Experimental Results and Applications
Performance on Standard Benchmarks
GAT's node classification performance on commonly used citation network datasets:
| Dataset | Nodes | Feature dim | Classes | GAT accuracy |
|---|---|---|---|---|
| Cora | 2,708 | 1,433 | 7 | 83.0 ± 0.7% |
| Citeseer | 3,327 | 3,703 | 6 | 72.5 ± 0.7% |
| Pubmed | 19,717 | 500 | 3 | 79.0 ± 0.3% |
Application Scenarios
| Domain | Task | GAT's advantage |
|---|---|---|
| Molecular property prediction | Atom-level interaction modeling | Attention weights reflect chemical bond importance |
| Knowledge graphs | Entity-relation reasoning | Adaptively focuses on different relation types |
| Text classification | Classification on document graphs | Captures semantic association weights between words/documents |
| Traffic prediction | Flow prediction on road networks | Attention distinguishes upstream/downstream road segment influence |
| Recommender systems | User-item interactions | Different items have varying appeal to users |
Connection to Transformers
GAT and the Transformer's self-attention share a deep connection:
- A Transformer can be viewed as GAT on a complete graph (every token is connected to all other tokens)
- GAT is local attention on a sparse graph
- Both use a query-key mechanism to compute attention weights
- This connection has inspired subsequent Graph Transformer research
GAT elegantly introduced the attention mechanism into graph neural networks, laying the foundation for many subsequent GNN variants (such as GATv2 and Graph Transformers). It stands as one of the most influential works in the field of graph deep learning.