Graph Convolutional Networks (GCN)
Graph Convolutional Networks (GCN) are a semi-supervised node classification method proposed by Kipf & Welling in 2017. GCN generalizes the convolution operation from regular grid-structured data (such as images) to graph data with arbitrary topologies, and stands as one of the most influential foundational works in the field of Graph Neural Networks (GNN).
Learning path: Basic graph concepts → Spectral graph convolution → ChebNet → GCN simplification → Spatial interpretation → GNN variants (GraphSAGE, GAT, GIN)
Background and Motivation
Why Do We Need Graph Neural Networks?
Deep learning has achieved tremendous success over the past decade, but the domains where it excels share a common trait: the data has regular structure.
- CNNs handle grid-structured data (images are 2D grids of pixels)
- RNNs / Transformers handle sequential data (text is a 1D sequence of tokens)
However, a vast amount of real-world data naturally exists in the form of graphs:
| Domain | Nodes | Edges |
|---|---|---|
| Social networks | Users | Friendships |
| Molecular structures | Atoms | Chemical bonds |
| Knowledge graphs | Entities | Relations |
| Citation networks | Papers | Citations |
| Transportation networks | Intersections | Roads |
Such data has variable numbers of neighbors, no natural ordering of nodes, and irregular topology. The core assumptions of CNNs and RNNs -- locality with translation invariance, and sequential order -- no longer apply. We need a new paradigm capable of learning representations directly on graph structures.
Basic Graph Concepts
A graph \(G = (V, E)\) consists of a node set \(V\) and an edge set \(E\). Assuming the graph has \(N\) nodes, each with \(F\)-dimensional features, it can be described by the following matrices:
- Adjacency matrix \(A \in \mathbb{R}^{N \times N}\): \(A_{ij} = 1\) indicates an edge between nodes \(i\) and \(j\), and 0 otherwise (for unweighted, undirected graphs)
- Degree matrix \(D \in \mathbb{R}^{N \times N}\): a diagonal matrix where \(D_{ii} = \sum_j A_{ij}\), i.e., the number of neighbors of node \(i\)
- Feature matrix \(X \in \mathbb{R}^{N \times F}\): row \(i\) is the feature vector of node \(i\)
Example: an undirected graph with 3 nodes
0 --- 1
 |   /
 |  /
  2
Adjacency matrix A:      Degree matrix D:      Feature matrix X (2-dim features):
┌ 0 1 1 ┐               ┌ 2 0 0 ┐             ┌ 1.0 0.5 ┐  Node 0
│ 1 0 1 │               │ 0 2 0 │             │ 0.3 0.8 │  Node 1
└ 1 1 0 ┘               └ 0 0 2 ┘             └ 0.7 0.2 ┘  Node 2
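The matrices above can be written down directly in a few lines of NumPy. The sketch below (variable names and layout are illustrative, not from the original text) reproduces \(A\), \(D\), and \(X\) for this 3-node graph.

```python
import numpy as np

# The 3-node undirected graph from the example above.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)   # adjacency matrix

D = np.diag(A.sum(axis=1))               # degree matrix: D_ii = sum_j A_ij

X = np.array([[1.0, 0.5],                # node 0
              [0.3, 0.8],                # node 1
              [0.7, 0.2]])               # node 2 (2-dimensional features)

print(D)   # diag(2, 2, 2)
```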
Machine Learning Tasks on Graphs
Learning tasks on graphs can be broadly divided into three levels:
- Node Classification: Predict the label of each node. For example, determining the research field of papers in a citation network. The original GCN paper addresses exactly this problem.
- Link Prediction: Predict whether an edge exists between two nodes. For example, friend recommendations in social networks.
- Graph Classification: Predict a property of an entire graph. For example, determining whether a molecule exhibits a certain pharmacological activity.
Limitations of Traditional Methods
Before GCN, representation learning on graphs primarily relied on the following approaches:
- Handcrafted features: Statistical metrics such as node degree, clustering coefficient, and PageRank, which require domain expert design
- Graph embedding methods (DeepWalk, Node2Vec, LINE): Learn node embeddings via random walks + Skip-gram. However, these methods are transductive -- they cannot generalize to new nodes unseen during training, and the embedding learning is decoupled from downstream tasks rather than being end-to-end
GCN aims to provide an end-to-end learning framework that jointly leverages graph structure and node features.
Intuition Behind Graph Convolution
From CNN Convolution to Graph Convolution
Recall how CNNs work: a \(3 \times 3\) convolution kernel slides across an image, computing a weighted sum over each pixel's local neighborhood to extract local features.
CNN convolution: neighborhood aggregation on a regular grid
┌───┬───┬───┐
│ a │ b │ c │   New value of the center pixel
├───┼───┼───┤   = weighted sum of all pixels in the neighborhood
│ d │ e │ f │     (weights determined by the convolution kernel)
├───┼───┼───┤
│ g │ h │ i │
└───┴───┴───┘
Graph convolution: neighborhood aggregation on an irregular graph
     (v2)        New representation of node v1
    / |  \       = aggregation of v1's own information and that of v2, v3, v4
(v1)─┼──(v3)       (weights determined by the graph structure)
    \ |  /
     (v4)
The core idea is the same: aggregate information over a local neighborhood. The difference is that in CNNs the neighborhood is a fixed-size grid, whereas in graph convolution the neighborhood is determined by the graph's topology and varies in size and shape.
The Message Passing Paradigm
Graph convolution can be understood through the unifying lens of message passing. Each GNN layer performs the following three steps:
- Message construction: Each node sends its own features as a "message" to its neighbors
- Message aggregation: Each node collects messages from all its neighbors and aggregates them via some function (sum, mean, max, etc.)
- State update: The aggregated information is combined with the node's own features and transformed (linear layer + activation function) to update the node's representation
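These three steps can be expressed as one short, generic sketch. The function below assumes sum aggregation and a ReLU update; it is only meant to illustrate the paradigm, not any particular GNN.

```python
import numpy as np

def message_passing_layer(A, H, W):
    """One generic message-passing step (sum aggregation) -- a sketch only.

    A: (N, N) adjacency matrix, H: (N, F) node features, W: (F, F') weights.
    """
    messages = H                          # 1. each node's features are its message
    aggregated = A @ messages             # 2. sum the messages from all neighbors
    combined = aggregated + H             # 3. combine with the node's own features
    return np.maximum(combined @ W, 0.0)  # linear transform + ReLU update
```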
An Intuitive Analogy
"You are defined by the company you keep." A person's characteristics depend not only on their own attributes but also on their social circle. Similarly, a node's representation in a graph should incorporate both its own features and those of its neighbors. Each GCN layer does exactly this: it lets every node "look at" its neighbors and then update its representation. Stacking \(k\) GCN layers allows each node to access information from neighbors up to \(k\) hops away.
Mathematical Derivation of GCN
Spectral Approach
The mathematical foundation of GCN originates from spectral graph theory in graph signal processing. The derivation is somewhat lengthy, but understanding it helps clarify where the GCN formula comes from.
Graph Laplacian Matrix
The graph Laplacian is the central tool in graph signal processing:
\[
L = D - A
\]
where \(D\) is the degree matrix and \(A\) is the adjacency matrix. \(L\) is a positive semi-definite symmetric matrix with desirable mathematical properties.
For the 3-node example above:
\[
L = D - A = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}
\]
The normalized Laplacian:
\[
L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}
\]
Normalization eliminates scale discrepancies caused by varying node degrees, constraining the eigenvalues to the interval \([0, 2]\).
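A quick NumPy check of these definitions on the 3-node example graph (a sketch; the variable names are not from the original text):

```python
import numpy as np

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))

L = D - A                                        # combinatorial Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_sym = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian

print(np.linalg.eigvalsh(L_sym))                 # all eigenvalues lie in [0, 2]
```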
Graph Fourier Transform
Since \(L_{\text{sym}}\) is a real symmetric matrix, it admits an eigendecomposition:
\[
L_{\text{sym}} = U \Lambda U^T
\]
where \(U = [u_1, u_2, \dots, u_N]\) is the matrix of eigenvectors (an orthogonal matrix) and \(\Lambda = \text{diag}(\lambda_1, \dots, \lambda_N)\) is the diagonal matrix of eigenvalues.
The eigenvectors \(u_i\) serve as the "frequency basis" on the graph, analogous to the sine/cosine basis functions in the classical Fourier transform. The graph Fourier transform of a graph signal \(x \in \mathbb{R}^N\) is:
\[
\hat{x} = U^T x
\]
The inverse transform is \(x = U \hat{x}\).
Spectral Convolution
In classical signal processing, convolution in the time domain is equivalent to multiplication in the frequency domain. Analogously, the convolution of a signal \(x\) with a filter \(g_\theta\) on a graph can be defined as:
\[
g_\theta \star x = U \, g_\theta(\Lambda) \, U^T x
\]
where \(g_\theta(\Lambda) = \text{diag}(\theta_1, \dots, \theta_N)\) is a spectral filter and \(\theta_i\) are learnable parameters.
Problems with Spectral Convolution
This definition has two critical drawbacks: (1) The eigendecomposition has \(O(N^3)\) computational complexity, making it infeasible for large graphs; (2) The filter has \(N\) parameters, lacks spatial locality, and is tied to a specific graph structure (since \(U\) is graph-dependent), preventing transfer to other graphs.
ChebNet: Chebyshev Polynomial Approximation
Hammond et al. (2011) proposed approximating the filter function using Chebyshev polynomials to avoid explicit eigendecomposition:
\[
g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k \, T_k(\tilde{\Lambda})
\]
where \(\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I\) rescales the eigenvalues to \([-1, 1]\), and \(T_k\) is the \(k\)-th order Chebyshev polynomial defined by the recurrence relation:
\[
T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x \, T_{k-1}(x) - T_{k-2}(x)
\]
Substituting into the spectral convolution formula:
\[
g_\theta \star x \approx \sum_{k=0}^{K} \theta_k \, T_k(\tilde{L}_{\text{sym}}) \, x
\]
where \(\tilde{L}_{\text{sym}} = \frac{2}{\lambda_{\max}} L_{\text{sym}} - I\). The key advantage is that \(T_k(\tilde{L}_{\text{sym}})\) can be computed recursively via matrix multiplication without eigendecomposition. Moreover, a \(K\)-th order polynomial corresponds to a \(K\)-hop neighborhood, so the filter inherently possesses spatial locality. Defferrard et al. (2016) built ChebNet on this foundation.
GCN Simplification (Kipf & Welling, 2017)
The core contribution of Kipf & Welling was to drastically simplify ChebNet, resulting in a remarkably elegant model.
First-Order Approximation with K=1
Setting \(K = 1\) (retaining only the 0th- and 1st-order Chebyshev polynomials) and further approximating \(\lambda_{\max} \approx 2\):
\[
g_\theta \star x \approx \theta_0 x + \theta_1 (L_{\text{sym}} - I) x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x
\]
To further reduce parameters, setting \(\theta = \theta_0 = -\theta_1\) yields:
\[
g_\theta \star x \approx \theta \left( I + D^{-1/2} A D^{-1/2} \right) x
\]
Renormalization Trick
The eigenvalues of \(I + D^{-1/2} A D^{-1/2}\) lie in \([0, 2]\), and repeated multiplication can cause numerical instability (exploding or vanishing gradients). Kipf & Welling introduced an elegant renormalization trick:
\[
I + D^{-1/2} A D^{-1/2} \;\longrightarrow\; \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}
\]
where:
- \(\tilde{A} = A + I_N\) (the adjacency matrix with added self-loops, i.e., each node treats itself as a neighbor)
- \(\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}\) (the degree matrix after adding self-loops, i.e., \(\tilde{D} = D + I\))
The Significance of Self-Loops
Adding self-loops via \(A + I\) means that during message passing, each node aggregates information from its neighbors while also retaining its own information. This is very natural: when updating a node's representation, we should not discard the node's own features.
Core Propagation Formula
Extending the above simplification to multi-channel features (\(F\)-dimensional input, \(F'\)-dimensional output), we obtain the layer-wise propagation rule of GCN:
\[
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right) = \sigma\!\left( \hat{A} H^{(l)} W^{(l)} \right)
\]
where:
- \(\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}\) is the normalized adjacency matrix with self-loops (precomputed once)
- \(H^{(l)} \in \mathbb{R}^{N \times F_l}\) is the node feature matrix at layer \(l\), with \(H^{(0)} = X\)
- \(W^{(l)} \in \mathbb{R}^{F_l \times F_{l+1}}\) is the learnable weight matrix at layer \(l\)
- \(\sigma(\cdot)\) is the activation function (typically ReLU)
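A minimal NumPy sketch of this propagation rule, assuming a dense adjacency matrix (function names are illustrative, not from the original text):

```python
import numpy as np

def normalize_adjacency(A):
    """Compute A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}; done once up front."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W, activation=lambda x: np.maximum(x, 0.0)):
    """One GCN layer: H' = sigma(A_hat @ H @ W)."""
    return activation(A_hat @ H @ W)
```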
Spatial Interpretation
From a spatial (node-level) perspective, the formula above performs the following intuitive operation for each node \(i\):
\[
h_i^{(l+1)} = \sigma\!\left( \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}} \, h_j^{(l)} W^{(l)} \right)
\]
In words: the new features of node \(i\) are obtained by computing a weighted average of its own features and those of all its neighbors (with weights determined by the degrees of both endpoint nodes), applying a linear transformation, and then passing through an activation function.
The weight \(\frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}}\) is a symmetric normalization coefficient -- neighbors with higher degree contribute less, which prevents high-degree nodes from overwhelming low-degree nodes with their information.
Numerical Example of Forward Propagation
Using the 3-node graph from above, we walk through a complete hand calculation of one GCN layer's forward pass.
Step 1: Compute \(\tilde{A}\) and \(\tilde{D}\)
\[
\tilde{A} = A + I = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad \tilde{D} = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{pmatrix}
\]
Step 2: Compute \(\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}\)
\[
\hat{A} = \frac{1}{3} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}
\]
A Note on This Special Case
In this example, all nodes happen to have the same degree (all 2, becoming 3 after adding self-loops), so every entry of \(\hat{A}\) equals \(1/3\). In a general graph, different entries will differ: \(\hat{A}_{ij} = \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}}\) (when \(\tilde{A}_{ij} = 1\)).
Step 3: Neighborhood Aggregation \(\hat{A} H^{(0)}\)
\[
\hat{A} X = \frac{1}{3} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1.0 & 0.5 \\ 0.3 & 0.8 \\ 0.7 & 0.2 \end{pmatrix} = \begin{pmatrix} 0.667 & 0.5 \\ 0.667 & 0.5 \\ 0.667 & 0.5 \end{pmatrix}
\]
Since this is a complete graph (all nodes are connected to each other), the aggregated result is the mean of all node features, and all three nodes end up with the same aggregated representation. In non-complete graphs, different nodes will generally have different aggregation results.
Step 4: Linear Transformation + Activation
Suppose \(W^{(0)} \in \mathbb{R}^{2 \times 3}\) (mapping from 2 dimensions to 3). Multiplying the aggregated features by \(W^{(0)}\) and applying the ReLU activation \(\sigma(x) = \max(0, x)\) yields the layer output \(H^{(1)} = \sigma(\hat{A} X W^{(0)}) \in \mathbb{R}^{3 \times 3}\); a numerical sketch with an arbitrary choice of \(W^{(0)}\) is given below.
Because this is a complete graph, this example reveals an interesting phenomenon: after just one GCN layer, all node representations become identical. This is an extreme manifestation of the over-smoothing problem.
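To make the arithmetic reproducible, the following sketch runs the full forward pass with a hypothetical weight matrix `W0` (the weights are chosen only for illustration; they are not from the original example):

```python
import numpy as np

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X = np.array([[1.0, 0.5], [0.3, 0.8], [0.7, 0.2]])

A_tilde = A + np.eye(3)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # every entry equals 1/3 here

H_agg = A_hat @ X                              # each row is the mean feature [0.667, 0.5]

W0 = np.array([[ 1.0, -0.5,  0.2],             # hypothetical weights,
               [ 0.4,  0.3, -1.0]])            # chosen only for illustration
H1 = np.maximum(H_agg @ W0, 0.0)               # ReLU(A_hat X W^(0))
print(H1)                                      # all three rows are identical
```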
Properties and Limitations of GCN
Semi-Supervised Learning
The most central application of GCN is semi-supervised node classification: only a small fraction of nodes in the graph have labels, but all nodes have features and connectivity information. GCN propagates label information through the graph structure, enabling correct classification of unlabeled nodes.
A typical two-layer GCN for node classification:
\[
Z = \text{softmax}\!\left( \hat{A} \, \text{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right)
\]
The loss function computes cross-entropy only over labeled nodes:
\[
\mathcal{L} = - \sum_{l \in \mathcal{Y}_L} \sum_{f} Y_{lf} \ln Z_{lf}
\]
where \(\mathcal{Y}_L\) is the set of labeled nodes. Although the loss is computed on only a subset of nodes, information from unlabeled nodes enters the predictions at labeled nodes through \(\hat{A}\) in the forward pass, so gradients also flow through the representations of unlabeled nodes.
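A minimal PyTorch sketch of this setup, assuming `A_hat`, `X`, `labels`, and a boolean `train_mask` are already prepared as tensors (a simplified illustration, not the reference implementation):

```python
import torch
import torch.nn.functional as F

class TwoLayerGCN(torch.nn.Module):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1); softmax is folded into the loss."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.W0 = torch.nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = torch.nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, A_hat, X):
        H = F.relu(A_hat @ self.W0(X))
        return A_hat @ self.W1(H)                  # class logits per node

def train_step(model, optimizer, A_hat, X, labels, train_mask):
    """Cross-entropy is computed only on the labeled (masked) nodes."""
    model.train()
    optimizer.zero_grad()
    logits = model(A_hat, X)
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
```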
Over-Smoothing
The Core Bottleneck of GCN
As the number of GCN layers increases, each node's receptive field grows exponentially. When enough layers are stacked, every node's receptive field covers the entire graph, causing all node representations to converge toward the same value and losing discriminative power. This is the over-smoothing problem.
Intuitively, repeated multiplication by \(\hat{A}\) is akin to performing Laplacian smoothing: each node continually averages with its neighbors. After many rounds of smoothing, the signal across the graph becomes flat and all node feature vectors converge to a single value.
In practice, GCN typically uses only 2--3 layers. This stands in stark contrast to CNNs and Transformers, which can be stacked to tens or even hundreds of layers.
Methods to mitigate over-smoothing include: residual connections (similar to ResNet), DropEdge (randomly dropping edges), and JKNet (Jumping Knowledge Networks, which aggregate outputs from all layers).
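As one concrete example, a DropEdge step might look like the sketch below (dense matrices, illustrative only); the normalized \(\hat{A}\) is then recomputed from the thinned adjacency matrix each epoch:

```python
import numpy as np

def drop_edge(A, drop_rate=0.2, rng=None):
    """DropEdge sketch: randomly remove a fraction of undirected edges each epoch."""
    rng = rng or np.random.default_rng()
    rows, cols = np.triu_indices_from(A, k=1)      # each undirected edge counted once
    keep = (A[rows, cols] > 0) & (rng.random(rows.shape[0]) >= drop_rate)
    A_drop = np.zeros_like(A)
    A_drop[rows[keep], cols[keep]] = A[rows, cols][keep]
    return A_drop + A_drop.T                       # restore symmetry
```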
Limited to Homogeneous Graphs
Standard GCN assumes all nodes and edges are of the same type (homogeneous graph). For heterogeneous graphs with multiple node and edge types, such as knowledge graphs, variants like R-GCN (Relational GCN) are needed.
Scalability Issues
GCN's forward pass requires the full adjacency matrix \(\hat{A}\) and the feature matrix of all nodes, meaning it performs full-batch training over the entire graph. For large graphs with millions or even billions of nodes, this is infeasible in terms of both memory and computation.
Solutions include:
- Mini-batch training: GraphSAGE enables mini-batch training through neighbor sampling
- Cluster-GCN: Partitions the graph into clusters and trains on one subgraph at a time
- FastGCN: Reduces computation via importance sampling
GNN Variants
After GCN pioneered the end-to-end learning paradigm for graph neural networks, numerous variants emerged. The following are among the most important.
GraphSAGE (Hamilton et al., 2017)
GraphSAGE (SAmple and aggreGatE) addresses two major limitations of GCN: scalability and inductive learning capability.
Core ideas:
- Sampling: Instead of using all neighbors, uniformly sample a fixed number of neighbors (e.g., \(K\)) for each node
- Aggregation: Use a learnable aggregation function (rather than the fixed \(\hat{A}\)) to aggregate neighbor information
The aggregation function can be Mean, LSTM, Max Pooling, etc. Since GraphSAGE learns an aggregation function rather than fixed node embeddings, it can generalize to new nodes unseen during training (inductive learning) -- a significant advance over GCN.
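A sketch of one possible mean-aggregator layer with uniform neighbor sampling (a simplified variant with separate self/neighbor weight matrices, not the reference GraphSAGE implementation):

```python
import numpy as np

def sage_mean_layer(neighbors, H, W_self, W_neigh, num_samples=5, seed=0):
    """GraphSAGE-style layer: uniform neighbor sampling + mean aggregation (sketch).

    neighbors: dict mapping node id -> list of neighbor ids.
    """
    rng = np.random.default_rng(seed)
    H_new = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        nbrs = neighbors[v]
        k = min(num_samples, len(nbrs))
        sampled = rng.choice(nbrs, size=k, replace=False) if k > 0 else []
        h_neigh = H[sampled].mean(axis=0) if k > 0 else np.zeros(H.shape[1])
        h = H[v] @ W_self + h_neigh @ W_neigh      # combine self and neighborhood
        H_new[v] = np.maximum(h, 0.0)              # ReLU
    return H_new
```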
GAT (Graph Attention Network, Velickovic et al., 2018)
The core innovation of GAT is replacing GCN's fixed normalization coefficients \(\frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}}\) with an attention mechanism, allowing the model to adaptively learn the importance of each neighbor.
Computing attention coefficients:
\[
\alpha_{ij} = \frac{\exp\!\left( \text{LeakyReLU}\!\left( \mathbf{a}^T [W h_i \,\|\, W h_j] \right) \right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left( \text{LeakyReLU}\!\left( \mathbf{a}^T [W h_i \,\|\, W h_k] \right) \right)}
\]
where \(\|\) denotes concatenation and \(\mathbf{a}\) is a learnable attention vector. GAT also employs multi-head attention to stabilize training, following the same idea as Multi-Head Attention in Transformers.
Key Difference Between GCN and GAT
In GCN, neighbor weights are entirely determined by graph structure (node degrees) and are fixed; in GAT, neighbor weights are dynamically computed by the attention mechanism based on node features. This allows GAT to assign different importance to different neighbors, resulting in greater expressive power.
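A sketch of how the attention coefficients of a single head could be computed with dense matrices (illustrative only; a real implementation would vectorize this and operate on sparse edge lists):

```python
import numpy as np

def gat_attention(H, W, a, A):
    """Compute one head of GAT attention coefficients alpha_ij (sketch).

    H: (N, F) features, W: (F, F') weights, a: (2F',) attention vector,
    A: (N, N) adjacency matrix with self-loops.
    """
    Wh = H @ W
    N = H.shape[0]
    e = np.zeros((N, N))
    for i in range(N):                                  # e_ij = a^T [Wh_i || Wh_j]
        for j in range(N):
            e[i, j] = np.concatenate([Wh[i], Wh[j]]) @ a
    e = np.where(e > 0, e, 0.2 * e)                     # LeakyReLU (slope 0.2)
    e = np.where(A > 0, e, -np.inf)                     # keep only real neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))    # softmax over each row
    return alpha / alpha.sum(axis=1, keepdims=True)     # row i: weights over N(i)
```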
GIN (Graph Isomorphism Network, Xu et al., 2019)
GIN takes a theoretical perspective, proving that most GNNs (including GCN and GraphSAGE) are strictly less expressive than the Weisfeiler-Leman (WL) graph isomorphism test, and designs an architecture with expressive power equivalent to the WL test:
\[
h_v^{(l+1)} = \text{MLP}^{(l)}\!\left( \left(1 + \epsilon^{(l)}\right) h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l)} \right)
\]
where \(\epsilon\) is a learnable parameter or a fixed value. GIN uses summation (rather than mean or max) as the aggregation function, because summation is injective over multisets and can therefore distinguish different neighborhoods.
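A minimal sketch of one GIN layer; the MLP is passed in as a callable, and the names are illustrative:

```python
import numpy as np

def gin_layer(A, H, mlp, eps=0.0):
    """One GIN layer: h_v' = MLP((1 + eps) * h_v + sum_{u in N(v)} h_u).

    A: adjacency matrix without self-loops; mlp: callable mapping (N, F) -> (N, F').
    """
    return mlp((1.0 + eps) * H + A @ H)                 # sum aggregation over neighbors

def make_mlp(W1, W2):
    """Example two-layer MLP with ReLU; W1 and W2 are placeholder weight matrices."""
    return lambda Z: np.maximum(Z @ W1, 0.0) @ W2
```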
Message Passing Neural Networks (MPNN)
The MPNN framework proposed by Gilmer et al. (2017) unifies all the above models under a single general paradigm:
\[
m_v^{(l+1)} = \sum_{u \in \mathcal{N}(v)} M^{(l)}\!\left( h_v^{(l)}, h_u^{(l)}, e_{vu} \right), \qquad h_v^{(l+1)} = U^{(l)}\!\left( h_v^{(l)}, m_v^{(l+1)} \right)
\]
where \(M^{(l)}\) is the message function, \(U^{(l)}\) is the update function, and \(e_{vu}\) is the edge feature. GCN, GraphSAGE, GAT, and others can all be viewed as special cases of MPNN, differing only in the specific choices of message and update functions.
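A sketch of the framework as a higher-order function, where the message and update functions are plugged in as arguments (illustrative only; it assumes every node receives at least one message):

```python
import numpy as np

def mpnn_layer(edges, H, E, message_fn, update_fn):
    """One generic MPNN layer (sketch).

    edges: list of directed pairs (v, u) meaning u sends a message to v;
    H: (N, F) node states; E: dict mapping (v, u) to an edge feature vector;
    message_fn(h_v, h_u, e_vu) -> message; update_fn(h_v, m_v) -> new state.
    """
    agg = {}
    for v, u in edges:
        m = message_fn(H[v], H[u], E[(v, u)])
        agg[v] = m if v not in agg else agg[v] + m      # sum aggregation
    return np.stack([update_fn(H[v], agg[v]) for v in range(H.shape[0])])
```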
Application Domains
Social Network Analysis
In social networks, users are nodes and follow/friend relationships are edges. GCN can be used for user classification (e.g., identifying bot accounts), community detection, and more. By propagating information through the graph structure, even users with inconspicuous individual features can have their attributes inferred from their social connections.
Recommender Systems
User-item interactions form a bipartite graph. PinSage (an industrial-scale application of GraphSAGE) is used by Pinterest to generate item embeddings from billions of pins for personalized recommendations. GNNs can simultaneously leverage content features and collaborative filtering signals.
Molecular Property Prediction and Drug Discovery
Molecules are natural graph structures (atoms as nodes, chemical bonds as edges). GNNs can predict molecular properties such as solubility, toxicity, and pharmacological activity, greatly accelerating the drug screening process. DeepMind's AlphaFold also uses graph neural networks to model interactions between amino acid residues.
Knowledge Graph Reasoning
Entities and relations in knowledge graphs form heterogeneous graphs. Methods such as R-GCN can be used for link prediction (inferring missing relations) and entity classification.
Traffic Flow Prediction
By modeling urban road networks as graphs (intersections as nodes, roads as edges) and combining temporal information, Spatiotemporal Graph Neural Networks (ST-GNNs) can predict future traffic flow and congestion.
GCN vs CNN vs MLP Comparison
| Property | MLP | CNN | GCN |
|---|---|---|---|
| Input data structure | Vectors (1D) | Grids (2D/3D) | Arbitrary graph structures |
| Neighborhood definition | None (fully connected) | Fixed-size local window | Adjacent nodes in the graph |
| Parameter sharing | None | Convolution kernels shared across space | Weight matrix \(W\) shared across all nodes |
| Receptive field | Global (single layer) | Grows linearly with depth | Grows exponentially with depth |
| Translation invariance | N/A | Yes (core assumption) | N/A (permutation invariance instead) |
| Inductive bias | None | Locality + translation invariance | Graph structure prior (homophily assumption) |
| Typical depth | Arbitrary | Tens to hundreds of layers | 2--3 layers |
| Application domains | Tabular data, general | Images, video, audio | Social networks, molecules, knowledge graphs |
Permutation Invariance
An important property of GCN is permutation invariance: regardless of how the node indices are reordered (i.e., applying row and column permutations to the adjacency matrix), the GCN output is unchanged up to the same reordering of its rows. This is analogous to CNN's translation invariance -- both are inductive biases built into the architecture. Mathematically, for any permutation matrix \(P\), we have \(f(PAP^T, PX) = P f(A, X)\), i.e., the node-level output undergoes the same permutation as the input (permutation equivariance); a graph-level readout such as sum pooling is then fully permutation invariant.
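This property is easy to check numerically. The sketch below builds a random graph, applies a single GCN layer, and verifies \(f(PAP^T, PX) = P f(A, X)\) for a random permutation matrix \(P\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F_in, F_out = 5, 4, 3
A = rng.integers(0, 2, size=(N, N))
A = np.triu(A, 1)
A = A + A.T                                  # random undirected adjacency matrix
X = rng.normal(size=(N, F_in))
W = rng.normal(size=(F_in, F_out))

def gcn(A, X, W):
    n = A.shape[0]
    A_t = A + np.eye(n)
    D_is = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    return np.maximum(D_is @ A_t @ D_is @ X @ W, 0.0)

P = np.eye(N)[rng.permutation(N)]            # random permutation matrix
print(np.allclose(gcn(P @ A @ P.T, P @ X, W), P @ gcn(A, X, W)))  # True
```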
Reflections and Discussion
The Expressive Power Ceiling of GNNs: The WL Test
The Weisfeiler-Leman (WL) graph isomorphism test is a classical algorithm for determining whether two graphs are isomorphic. Xu et al. (2019) proved that the expressive power of message passing GNNs is upper-bounded by the 1-WL test. This means there exist certain graph structures that the 1-WL test cannot distinguish, and GNNs likewise cannot distinguish them. For example, certain regular graphs (where all nodes have the same degree and symmetric neighborhood structures) are indistinguishable to GNNs.
Directions for breaking through this ceiling include: higher-order WL tests (k-WL), subgraph GNNs, and equivariant GNNs.
The Root Cause of Over-Smoothing
From a linear algebra perspective, repeated multiplication by \(\hat{A}\) is essentially a low-pass filter. As \(k\) increases, \(\hat{A}^k\) converges toward the eigenvector direction corresponding to the smallest eigenvalue of the graph Laplacian, meaning all node representations are projected onto the same low-dimensional subspace and the information is "smoothed away."
From a random walk perspective, each row of \(\hat{A}^k\) converges to the stationary distribution of the graph. All nodes end up "seeing" the same global distribution, naturally losing their local characteristics.
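A small numerical illustration of this effect: as \(k\) grows, the rows of \(\hat{A}^k X\) become nearly parallel (their pairwise cosine similarity approaches 1), so the node representations lose their distinctiveness. The 5-node graph below is just an example chosen for this sketch.

```python
import numpy as np

# Small connected graph: a path 0-1-2-3-4 plus one extra edge (1, 3).
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]:
    A[i, j] = A[j, i] = 1.0

A_t = A + np.eye(5)
D_is = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
A_hat = D_is @ A_t @ D_is

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
for k in [1, 2, 4, 8, 16, 32]:
    H = np.linalg.matrix_power(A_hat, k) @ X
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    cos = Hn @ Hn.T                                       # pairwise cosine similarities
    print(k, round(cos[np.triu_indices(5, 1)].mean(), 4)) # -> 1.0 as k grows
```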
The Role of GNNs in the Era of Large Models
With the rise of LLMs, a natural question arises: will GNNs be replaced by LLMs?
The current consensus is that for data with inherent graph structure (molecules, proteins, physics simulations), GNNs remain the most appropriate inductive bias. LLMs excel at processing sequential linguistic information, but serializing graph structures for LLM input loses topological information and is inefficient. A more promising direction is the fusion of GNNs and LLMs: using GNNs to encode structural graph information and LLMs to process textual information, with the two complementing each other. For example, in knowledge graph-enhanced question answering systems, GNNs handle reasoning over the graph while LLMs handle natural language understanding and generation.
Another trend is the exploration of Graph Foundation Models: can we train a universal graph model that performs well across graph data from different domains? This remains an open question, as graphs from different domains exhibit vastly different feature spaces and semantics.
References
- Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
- Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NeurIPS 2016.
- Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive Representation Learning on Large Graphs (GraphSAGE). NeurIPS 2017.
- Velickovic, P., et al. (2018). Graph Attention Networks. ICLR 2018.
- Xu, K., et al. (2019). How Powerful are Graph Neural Networks? ICLR 2019.
- Gilmer, J., et al. (2017). Neural Message Passing for Quantum Chemistry (MPNN). ICML 2017.