
Graph Attention Networks (GAT)

Graph Attention Networks (GAT) were proposed by Veličković et al. (ICLR 2018). The core innovation of GAT is an attention mechanism that adaptively assigns different weights to different neighbors, rather than using the fixed, degree-normalized weights of GCN. This design lets the model implicitly learn the importance of each neighbor, yielding stronger expressive power.

Learning path: GCN/GraphSAGE fundamentals → Attention mechanism motivation → GAT attention computation → Multi-head attention → GATv2 improvements → Experiments and applications


Motivation: Adaptive vs. Fixed Weights

The Weight Problem in GCN

Recall GCN's message passing rule. The update for node \(v\) at layer \(l\) is:

\[ \mathbf{h}_v^{(l)} = \sigma\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{d_v d_u}} W^{(l)} \mathbf{h}_u^{(l-1)}\right) \]

The weights \(\frac{1}{\sqrt{d_v d_u}}\) are entirely determined by the graph topology (node degrees), which has the following limitations:

| Issue | Description |
| --- | --- |
| Fixed weights | Neighbor weights depend solely on node degrees, not on features |
| Indiscriminate treatment | All neighbors are treated the same regardless of content; important neighbors cannot be distinguished |
| Lack of flexibility | The same node may need to attend to different neighbors under different tasks or contexts |
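
As a concrete illustration (a minimal NumPy sketch, not part of the original formulation), the symmetric normalization weights \(1/\sqrt{d_v d_u}\) can be computed from the adjacency matrix alone; they never look at the node features:

```python
import numpy as np

# Toy undirected graph with self-loops added (4 nodes)
A = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

deg = A.sum(axis=1)                        # degrees including self-loops: [3, 2, 3, 2]
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
weights = D_inv_sqrt @ A @ D_inv_sqrt      # entry (v, u) = 1 / sqrt(d_v * d_u) for each edge

# Node 0 always weights node 1 by 1/sqrt(3 * 2) ≈ 0.408,
# regardless of how similar or dissimilar their features are.
print(weights.round(3))
```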

GAT's solution: Let the model automatically learn how much weight each neighbor should receive, analogous to the self-attention mechanism in Transformers.


Attention Mechanism

Attention Coefficient Computation

GAT computes the attention coefficient between node \(v\) and its neighbor \(u\) through the following steps:

Step 1: Linear Transformation

Apply a shared linear transformation to all node features:

\[ \mathbf{z}_v = W \mathbf{h}_v, \quad \mathbf{z}_u = W \mathbf{h}_u \]

where \(W \in \mathbb{R}^{F' \times F}\) is a learnable weight matrix.

Step 2: Compute Raw Attention Scores

Concatenate the transformed features, take their inner product with a shared attention vector \(\mathbf{a} \in \mathbb{R}^{2F'}\), and apply a LeakyReLU activation to obtain the raw attention score:

\[ e_{vu} = \text{LeakyReLU}\left(\mathbf{a}^T \left[\mathbf{z}_v \| \mathbf{z}_u\right]\right) \]

where \(\|\) denotes vector concatenation and the negative slope of LeakyReLU is typically set to 0.2.

Step 3: Softmax Normalization

Apply softmax over all neighbors of node \(v\) to obtain normalized attention weights:

\[ \alpha_{vu} = \text{softmax}_u(e_{vu}) = \frac{\exp(e_{vu})}{\sum_{k \in \mathcal{N}(v) \cup \{v\}} \exp(e_{vk})} \]

Step 4: Weighted Aggregation

Compute a weighted sum of neighbor features using the attention weights:

\[ \mathbf{h}_v^{(l)} = \sigma\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu} \mathbf{z}_u\right) \]
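
Putting Steps 1–4 together, the following is a minimal single-head GAT layer in PyTorch. It is a didactic sketch that works on a dense adjacency matrix with self-loops (production implementations such as PyTorch Geometric operate on sparse edge lists); all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT layer (dense adjacency, for exposition only)."""

    def __init__(self, in_dim, out_dim, negative_slope=0.2):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # Step 1: shared linear transform
        self.a = nn.Parameter(torch.empty(2 * out_dim))   # attention vector a ∈ R^{2F'}
        nn.init.normal_(self.a, std=0.1)
        self.negative_slope = negative_slope

    def forward(self, h, adj):
        # h:   [N, F]  input node features
        # adj: [N, N]  adjacency matrix with self-loops (nonzero where an edge exists)
        z = self.W(h)                                     # [N, F']
        Fp = z.size(1)
        # Step 2: e_vu = LeakyReLU(a^T [z_v || z_u]) = LeakyReLU(a_l^T z_v + a_r^T z_u)
        e = (z @ self.a[:Fp]).unsqueeze(1) + (z @ self.a[Fp:]).unsqueeze(0)   # [N, N]
        e = F.leaky_relu(e, self.negative_slope)
        # Step 3: masked softmax over each node's neighborhood (including itself)
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                   # [N, N] attention weights
        # Step 4: attention-weighted aggregation of transformed neighbor features
        return F.elu(alpha @ z)                           # [N, F']
```

Here ELU plays the role of the nonlinearity \(\sigma\), matching the activation used in the hidden layers of the original paper.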

Intuition Behind the Attention Mechanism

| Aspect | Explanation |
| --- | --- |
| Information source | Attention weights are jointly determined by the features of both endpoint nodes |
| Locality | Attention is computed only within first-order neighborhoods (masked attention) |
| Adaptiveness | Different nodes can assign different weights to the same neighbor |
| Parameter efficiency | The attention parameters \(\mathbf{a}\) are shared across all node pairs |

Multi-Head Attention

To enhance expressive power and training stability, GAT employs multi-head attention, similar to the Transformer:

Intermediate layers use concatenation:

\[ \mathbf{h}_v^{(l)} = \overset{K}{\underset{k=1}{\Big\|}} \; \sigma\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu}^k \; W^k \mathbf{h}_u^{(l-1)}\right) \]

where \(K\) is the number of heads and \(\|\) denotes concatenation. The output dimension is \(K \times F'\).

The final layer uses averaging:

\[ \mathbf{h}_v^{(L)} = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu}^k \; W^k \mathbf{h}_u^{(L-1)}\right) \]

| Multi-head attention parameter | Original paper setting |
| --- | --- |
| Number of heads \(K\) | 8 (intermediate layers) / 1 (output layer) |
| Per-head output dimension \(F'\) | 8 |
| Attention dropout | 0.6 |
| Feature dropout | 0.6 |
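
As a reference sketch of these settings, assuming PyTorch Geometric's GATConv (its dropout argument is the attention dropout; feature dropout is applied separately in forward):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    """Two-layer GAT roughly following the original Cora configuration."""

    def __init__(self, in_dim, hidden_dim=8, num_classes=7, heads=8, dropout=0.6):
        super().__init__()
        self.dropout = dropout
        # Intermediate layer: K=8 heads with F'=8 features each, concatenated -> 64 dims
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads, dropout=dropout)
        # Output layer: a single head (no concatenation) producing class scores
        self.conv2 = GATConv(hidden_dim * heads, num_classes, heads=1,
                             concat=False, dropout=dropout)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=self.dropout, training=self.training)   # feature dropout
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return self.conv2(x, edge_index)
```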

Comparison with GCN and GraphSAGE

| Dimension | GCN | GraphSAGE | GAT |
| --- | --- | --- | --- |
| Neighbor weights | Fixed (degree normalization) | Fixed (uniform / aggregator-dependent) | Adaptive (learned by attention) |
| Weight basis | Graph structure | Graph structure + aggregator | Node features |
| Expressiveness | Low | Medium | High |
| Interpretability | Low | Low | High (attention weights can be visualized) |
| Computational complexity | \(O(|E| \cdot F \cdot F')\) | \(O(N \cdot S \cdot F \cdot F')\) with \(S\) sampled neighbors | \(O(|V| \cdot F \cdot F' + |E| \cdot F')\) |
| Learning setting | Transductive | Inductive | Inductive |

A unique advantage of GAT is interpretability: attention weights can be visualized to understand which neighbors the model "focuses on," which is highly valuable in bioinformatics and social network analysis.
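
For example, PyTorch Geometric's GATConv can return the learned coefficients via return_attention_weights, which makes this kind of inspection straightforward (a minimal sketch with made-up data):

```python
import torch
from torch_geometric.nn import GATConv

# Toy graph: 4 nodes with 5-dimensional features and a small directed edge list
x = torch.randn(4, 5)
edge_index = torch.tensor([[0, 1, 2, 3, 0],    # source nodes
                           [1, 0, 0, 2, 2]])   # target nodes

conv = GATConv(in_channels=5, out_channels=8, heads=2)
out, (att_edge_index, alpha) = conv(x, edge_index, return_attention_weights=True)

# alpha: [num_edges + num_self_loops, num_heads]; row i is the weight each head
# assigns to the edge att_edge_index[:, i] (source -> target), so coefficients can
# be grouped per target node and visualized, e.g. as a heat map over neighbors.
print(att_edge_index.shape, alpha.shape)
```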


GATv2: Fixing the Static Attention Problem

The Static Attention Issue in GAT

Brody et al. (2022) discovered that the original GAT suffers from a static attention problem.

The attention computation in the original GAT can be decomposed as:

\[ e_{vu} = \text{LeakyReLU}\left(\mathbf{a}_l^T W \mathbf{h}_v + \mathbf{a}_r^T W \mathbf{h}_u\right) \]

where \(\mathbf{a} = [\mathbf{a}_l \,\|\, \mathbf{a}_r]\). Inside the LeakyReLU, the score is the sum of a term that depends only on \(v\) and a term that depends only on \(u\).

Because LeakyReLU is monotonic, the ordering of the scores \(e_{vu}\) over neighbors \(u\) is determined entirely by the \(u\)-only term \(\mathbf{a}_r^T W \mathbf{h}_u\). Every query node therefore ranks its neighbors in the same order. Brody et al. call this static attention, as opposed to dynamic attention, where the most-attended neighbor can change with the query.

GATv2's Fix

GATv2 resolves this by reordering the operations: concatenate first, apply the shared linear transformation and the LeakyReLU, and only then score with the attention vector:

\[ e_{vu} = \mathbf{a}^T \text{LeakyReLU}\left(W \left[\mathbf{h}_v \,\|\, \mathbf{h}_u\right]\right) \]

where \(W \in \mathbb{R}^{F' \times 2F}\) now acts on the concatenated features and \(\mathbf{a} \in \mathbb{R}^{F'}\).

| Comparison | GAT | GATv2 |
| --- | --- | --- |
| Operation order | Linear (\(W\)) → concat → linear (\(\mathbf{a}^T\)) → LeakyReLU | Concat → linear (\(W\)) → LeakyReLU → linear (\(\mathbf{a}^T\)) |
| Attention type | Static (neighbor ranking shared across queries) | Dynamic (ranking varies with the query) |
| Expressiveness | Limited | Can express any dynamic attention |
| Computational overhead | Slightly lower | Slightly higher (\(W\) acts on concatenated features) |
| Empirical performance | Baseline | Outperforms GAT on most tasks |
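
The difference is easiest to see side by side. Below is a minimal sketch of the two scoring functions for a single node pair and a single head (all tensors are illustrative, not a full layer):

```python
import torch
import torch.nn.functional as F

F_in, F_out = 16, 8
W  = torch.randn(F_out, F_in)        # GAT: shared linear transform
a  = torch.randn(2 * F_out)          # GAT: attention vector
W2 = torch.randn(F_out, 2 * F_in)    # GATv2: acts on the concatenated inputs
a2 = torch.randn(F_out)              # GATv2: attention vector

def gat_score(h_v, h_u):
    # GAT: transform, concatenate, linear score, THEN LeakyReLU.
    # The u-dependent part a[F_out:] @ (W @ h_u) is identical for every query v,
    # so all queries rank their neighbors the same way (static attention).
    return F.leaky_relu(a @ torch.cat([W @ h_v, W @ h_u]), 0.2)

def gatv2_score(h_v, h_u):
    # GATv2: concatenate, transform, LeakyReLU, THEN linear score,
    # so v and u interact inside the nonlinearity (dynamic attention).
    return a2 @ F.leaky_relu(W2 @ torch.cat([h_v, h_u]), 0.2)

h_v, h_u = torch.randn(F_in), torch.randn(F_in)
print(gat_score(h_v, h_u).item(), gatv2_score(h_v, h_u).item())
```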

Experimental Results and Applications

Performance on Standard Benchmarks

GAT's node classification performance on commonly used citation network datasets:

| Dataset | Nodes | Feature dim | Classes | GAT accuracy |
| --- | --- | --- | --- | --- |
| Cora | 2,708 | 1,433 | 7 | 83.0 ± 0.7% |
| Citeseer | 3,327 | 3,703 | 6 | 72.5 ± 0.7% |
| Pubmed | 19,717 | 500 | 3 | 79.0 ± 0.3% |
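
These numbers come from the transductive setup of the original paper (for Cora: Adam, learning rate 0.005, weight decay 5e-4, early stopping). A minimal training sketch using PyTorch Geometric's Planetoid loader and the two-layer GAT model sketched earlier (without early stopping, so results will vary slightly):

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

# `GAT` refers to the two-layer model sketched in the multi-head attention section
model = GAT(in_dim=dataset.num_features, num_classes=dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

model.eval()
pred = model(data.x, data.edge_index).argmax(dim=-1)
test_acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()
print(f"Test accuracy: {test_acc:.3f}")
```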

Application Scenarios

| Domain | Task | GAT's advantage |
| --- | --- | --- |
| Molecular property prediction | Atom-level interaction modeling | Attention weights reflect the importance of chemical bonds |
| Knowledge graphs | Entity-relation reasoning | Adaptively focuses on different relation types |
| Text classification | Classification on document graphs | Captures semantic association weights between words/documents |
| Traffic prediction | Flow prediction on road networks | Attention distinguishes the influence of upstream/downstream road segments |
| Recommender systems | User-item interaction modeling | Different items have varying appeal to different users |

Connection to Transformers

GAT and the Transformer's self-attention share a deep connection:

  • A Transformer can be viewed as GAT operating on a complete graph, where every token is connected to every other token (see the sketch after this list)
  • GAT is local attention on a sparse graph
  • Both compute pairwise attention weights from the representations of the two endpoints (scaled dot-product attention in the Transformer, additive attention in GAT)
  • This connection has inspired subsequent Graph Transformer research
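
To make the first point concrete, a GAT layer fed the edge list of a complete graph attends globally over all nodes, mirroring the connectivity of Transformer self-attention. This sketch only illustrates the connectivity analogy; GAT still uses additive rather than scaled dot-product attention.

```python
import torch
from torch_geometric.nn import GATConv

N, F_in = 6, 16
x = torch.randn(N, F_in)   # "tokens" represented as node features

# Complete graph: every ordered pair (u, v) with u != v is an edge
src, dst = torch.meshgrid(torch.arange(N), torch.arange(N), indexing="ij")
mask = src != dst
edge_index = torch.stack([src[mask], dst[mask]])   # shape [2, N*(N-1)]

# On this edge list every node attends to all others, as in self-attention
layer = GATConv(F_in, 8, heads=4)
out = layer(x, edge_index)
print(out.shape)   # torch.Size([6, 32]) -- 4 heads * 8 features, concatenated
```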

GAT elegantly introduced the attention mechanism into graph neural networks, laying the foundation for many subsequent GNN variants (such as GATv2 and Graph Transformers). It stands as one of the most influential works in the field of graph deep learning.

