Graph Neural Networks
Цей контент ще не доступний вашою мовою.
Track: AI/ML Engineering | Complexity: Intermediate | Time: 90-120 minutes Prerequisites: Module 1.4: CNNs & Computer Vision, Module 1.6: Backpropagation Deep Dive, and Module 1.3: Model Evaluation, Validation, Leakage & Calibration.
Graph neural networks, usually shortened to GNNs, are the part of deep learning that handles inputs whose natural structure is a set of entities connected by relationships rather than a regular grid or a sequence. The reason practitioners care is not that graphs are fashionable. It is that for some problems — molecule property prediction, social-network classification, recommendation, knowledge-graph completion, fraud rings — the relationships between examples carry signal that a row-and-column model can only approximate by hand-engineering aggregate features. The dominant pitfall is the opposite of underuse. A team frames a problem as a graph because it can, not because the graph structure actually informs the target, and ends up paying the GNN tax — extra infrastructure, harder evaluation, training-time bottlenecks — without earning the inductive bias premium. This module is built around that decision discipline: when GNNs are the right tool, which family fits the regime, how to evaluate without leaking, and the three failure modes that are specific to graph models and have no direct analog in CNNs or transformers.
Learning Outcomes
Section titled “Learning Outcomes”- Decide whether a problem benefits from graph-structured inductive bias before reaching for a GNN, by comparing against a strong tabular baseline and inspecting whether the graph carries information the features do not.
- Choose between GCN, GraphSAGE, and GAT for a given regime by matching their assumptions about homophily, scale, and inductive versus transductive deployment to the data you actually have.
- Diagnose the three GNN-specific failure modes — over-smoothing as you stack layers, over-squashing on long-range tasks, heterophily that violates the connected-nodes-are-similar assumption — and explain which mitigations apply to each.
- Design train, validation, and test splits for transductive and inductive graph tasks that match the deployment shape, avoid leaking held-out labels (transductive) or held-out structural fragments (inductive), and prevent the link-prediction-specific traps of overlapping positive/negative edges and reverse-edge leakage.
- Build a working PyTorch Geometric pipeline using
Data, theMessagePassingfamily,GCNConv/SAGEConv/GATConv, andNeighborLoaderfor graphs that do not fit in memory.
Why This Module Matters
Section titled “Why This Module Matters”Picture a data-science team at a payments company spinning up a new fraud
project on a Monday morning.
They already have a strong tabular gradient-boosted classifier on engineered
transaction features — velocity, device fingerprint, prior-incident counts —
and a senior reviewer asks the obvious question.
“There is a graph here.
Customers connect to merchants, devices connect to multiple customers,
fraud rings cluster.
Should we be using a graph neural network?”
The good answer is to take the question seriously and the bad answer is to
treat it as rhetorical.
A team that says “yes, let’s add a GCN” without first checking whether the
graph carries signal beyond what the engineered aggregates already encode
will spend a sprint plumbing graph data, fight node ID hygiene, hit
training-time issues with full-batch updates over a multi-million-node graph,
and produce a result that is hard to compare honestly against the tabular
baseline.
A team that says “no, graphs are overkill” too early misses the entire point
of the inductive bias when fraud rings really do look like rings.
The decision needs evidence.
The first move is to define the prediction unit (nodes, edges, or whole
subgraphs) and pick a small benchmark you can iterate on quickly.
The second is to run two baselines: the existing tabular model on whatever
features already exist, and a graph-blind multilayer perceptron on the same
node features ignoring the graph entirely.
If a 2-layer GCN cannot beat both, the graph is not adding information at the
scale of complexity you have introduced and the right answer is to keep the
tabular model.
PyTorch Geometric is the standard library that makes this experiment cheap
(PyTorch Geometric documentation).
Its Data object holds node features and edge indices in a few lines.
Its MessagePassing base class implements the canonical aggregate-then-update
abstraction (PyG MessagePassing tutorial).
Its layer cheatsheet enumerates the dozens of published GNN architectures
behind a single API (PyG GNN cheatsheet).
The reason this module matters is not that GNNs are exotic.
It is that the cost of using them poorly is high and the cost of skipping
them when they would have helped is also high.
Practitioners need a steady framework for choosing,
a vocabulary for the failure modes,
and a small set of working patterns that survive contact with real graphs.
Section 1: What GNNs are, and what they aren’t
Section titled “Section 1: What GNNs are, and what they aren’t”A graph neural network is a model whose forward pass is structured by an explicit graph: it computes node representations by repeatedly aggregating information from each node’s neighbors and updating the node’s own state. The simplest way to think about it is as a generalization of the convolution operation in CNNs. A 2D convolution computes a new feature for a pixel by mixing features in a fixed-shape neighborhood — three by three, five by five — using shared weights. The neighborhood is regular because images are sampled on a grid. A graph convolution computes a new feature for a node by mixing features across the node’s actual graph neighbors — which can be one or one thousand, in any pattern — using shared weights. The neighborhood is irregular because graphs are arbitrary. That generalization is the whole point. GNNs let the model exploit relational structure that a CNN can only see if you re-shape the data into a grid first. GNNs are not the same thing as graph algorithms. Dijkstra, PageRank, community detection, and shortest-path queries are deterministic procedures that read a graph and compute something specific. They are not learning anything from labels. A GNN is a parameterized model trained to fit a target signal — node labels, edge presence, graph-level outcomes — using gradient descent. A GNN is also not the only way to feed graph structure into a learner. You can compute graph features (node degree, clustering coefficient, neighborhood-aggregated statistics) and feed them to an XGBoost model, and this hand-crafted approach often wins on small or feature-rich problems (Module 1.6: XGBoost & Gradient Boosting). The interesting question for practitioners is when the learned aggregation inside a GNN beats the hand-engineered aggregation outside one. The honest answer is “sometimes,” not “always,” and most of this module is about telling those situations apart. Two regimes where GNNs tend to earn their cost are large graphs whose local structure is rich enough that hand-engineering features is prohibitive, and tasks where the same feature value carries different meaning depending on its neighbors (a transaction at \50$ is normal for one merchant, suspicious for another). Two regimes where they tend to lose are small graphs with strong engineered features and graphs whose connectivity does not actually carry signal toward the target, where the graph is decorative rather than informative.
Section 2: Graph data, briefly
Section titled “Section 2: Graph data, briefly”A graph is a set of nodes and a set of edges.
Nodes can carry features (a vector per node), and edges can carry features
too (a vector per edge), although a lot of GNN work uses unweighted, untyped
edges to start.
Edges can be directed or undirected; in PyTorch Geometric, undirected graphs
are usually represented by storing each edge twice, once in each direction
(PyG introduction).
Heterogeneous graphs have multiple node types and multiple edge types — for
instance a knowledge graph with Person, Movie, and Director node types
and acted_in, directed, and friend_of edge types.
PyG handles heterogeneous graphs through a separate API
(PyG heterogeneous tutorial),
because the message-passing rules are different per edge type and stacking
plain GCNConv layers does not respect that.
Three benchmark families recur across the literature, so it is worth
recognizing them.
Citation networks like Cora, CiteSeer, and Pubmed are small,
homophilic, and ship with standard transductive splits — they are the GNN
equivalent of MNIST, useful for quick iteration but not predictive of how a
model will behave on a million-node industrial graph.
Molecule datasets such as those in the Open Graph Benchmark family treat
each molecule as a separate small graph and the task as graph classification
or graph regression — predicting properties of the whole molecule.
Social and recommendation graphs are typically much larger, often
heterogeneous, and require sampled training because they do not fit in GPU
memory.
PyG ships datasets and Data builders for all three families
(PyG dataset tutorial).
The data structure that ties everything together in PyG is the Data
object, which stores x (node features, shape [num_nodes, num_features]),
edge_index (shape [2, num_edges], where row 0 is source nodes and row 1
is target nodes), and optional fields such as y for labels and
edge_attr for edge features.
Carrying that mental model makes the rest of the API much easier to read.
Section 3: Message passing — the core abstraction
Section titled “Section 3: Message passing — the core abstraction”The unifying abstraction behind almost every modern GNN is the
message-passing neural network, or MPNN, framework.
At each layer, every node performs three steps.
First, for each neighbor, compute a message as a function of the source
node’s features, the target node’s features, and the edge features.
Second, aggregate the incoming messages with a permutation-invariant
function — typically sum, mean, or max — so that the result does not depend
on the arbitrary order of neighbors.
Third, update the target node’s representation by combining its previous
state with the aggregated message, usually through a small neural network.
PyTorch Geometric exposes this directly through its MessagePassing base
class, which has three corresponding hook methods: message, aggregate,
and update
(PyG MessagePassing tutorial).
GCN, GraphSAGE, GAT, and most other layers are implemented as
MessagePassing subclasses that override one or two of those hooks.
The abstraction matters for a reason that beginners often miss.
Stacking message-passing layers gives every node access to information
from nodes up to hops away.
That is the GNN equivalent of receptive field in CNNs, and it has direct
practical consequences.
A 2-layer model can only see 2-hop neighborhoods, so any task that requires
information from 5 or 6 hops away will need either a deeper model or a
different architectural choice.
The naive instinct — “if 5 hops is needed, stack 5 layers” — runs straight
into the over-smoothing and over-squashing failure modes covered in
Sections 7 and 8, which is exactly why GNNs do not get arbitrarily deep
the way transformers and CNNs do.
Section 4: GCN — the simplest baseline
Section titled “Section 4: GCN — the simplest baseline”The graph convolutional network of Kipf and Welling is the cleanest starting point (Kipf & Welling, 2017). A GCN layer can be written as
where adds a self-loop to each node, is the
diagonal degree matrix of , is the matrix of node
features at layer , is a learned weight matrix, and
is a nonlinearity.
In plain English, each node’s new feature vector is a normalized average of
its neighbors’ features (including itself), passed through a linear
transform and then a nonlinearity.
The symmetric scaling is borrowed from spectral graph
theory and reflects the first-order approximation to a Chebyshev
polynomial of the graph Laplacian.
GCN is transductive in its original formulation: training and inference both
operate on a single fixed graph, with train and test nodes distinguished by
boolean masks rather than by being separate graphs.
You feed the entire graph in, the model sees all node features and the full
adjacency, and the loss is computed only on the train-mask nodes.
The implicit assumption that makes vanilla GCN work is homophily —
neighbors in the graph are likely to share a label.
Citation networks are highly homophilic, which is why GCN performs well on
Cora and friends.
Section 9 returns to what happens when that assumption breaks.
For a sense of where this lives in PyG, the torch_geometric.nn.GCNConv
class implements exactly this layer, with one practical caveat: by default
it adds self-loops and computes the symmetric normalization on every
forward pass
(PyG nn module reference).
Section 5: GraphSAGE — inductive and sampled
Section titled “Section 5: GraphSAGE — inductive and sampled”GraphSAGE generalizes GCN in two important ways
(Hamilton et al., 2017).
First, it is inductive by design: the same trained model can be applied to
graphs it never saw at training time, because each layer is just an
aggregation function plus a learned transform that operates on
neighborhoods, not on a fixed adjacency matrix.
Second, instead of using every neighbor at every layer, GraphSAGE samples a
fixed-size subset of neighbors per node per layer.
The original paper used something like 25 neighbors at the first layer and
10 at the second.
That sampling decision is the load-bearing trick for scaling.
On a million-node graph, materializing the full adjacency matrix in GPU
memory is impossible.
Sampling fixes the memory cost per node and per layer, which makes
mini-batch training feasible.
PyG implements this end-to-end through SAGEConv for the layer and
NeighborLoader for the sampling
(PyG NeighborLoader tutorial).
GraphSAGE supports several aggregator choices: mean (the default and most
common), max-pool (a small MLP applied to each neighbor’s features followed
by an elementwise max), and LSTM (an order-dependent aggregator that the
authors note technically violates permutation invariance and use only with
random ordering).
The mean aggregator is the right starting point for most teams.
The pool aggregator can help when the per-neighbor features need a learned
projection before aggregation.
The LSTM aggregator is rarely the answer in modern practice and you should
treat it as historical context unless you have a specific reason.
For inductive problems — predicting on a graph the model has not seen, such
as new molecules, new users, or new fraud subgraphs — GraphSAGE is the
default choice.
For transductive problems on a single fixed graph, you can use either
GraphSAGE without sampling or a vanilla GCN; the difference is small and
the choice usually comes down to library familiarity.
Section 6: GAT — attention on graphs
Section titled “Section 6: GAT — attention on graphs”The graph attention network replaces fixed-coefficient aggregation with a learned attention mechanism (Veličković et al., 2018). Where GCN gives each neighbor of a node a weight that depends only on the node degrees, GAT computes a weight per edge as a function of the source and target node features. The attention coefficient between source and target at a given layer is
where is a learned attention vector, is a learned linear
projection, denotes concatenation, and is the set of
neighbors of .
Multi-head attention runs independent attention heads in parallel and
either concatenates their outputs (intermediate layers) or averages them
(final layer), in direct analogy to multi-head self-attention in
transformers.
The conceptual link is worth carrying: GAT applies the same
attention-as-learned-weighting idea that drives transformers, but
restricted to actual graph neighbors instead of every position in a
sequence.
A transformer step attends across all token positions; a GAT layer
attends only across the explicit edges in the graph.
That restriction is what keeps GAT computationally tractable on large
graphs while still giving the model the flexibility to weight neighbors
non-uniformly.
GAT helps most when neighbor importance varies sharply across edges — for
instance when a node has many neighbors but only a few are actually
informative for the target — and when the graph leans somewhat away from
strict homophily, so that uniform averaging is a worse default than learned
weighting.
GAT does not help much on highly homophilic graphs with low-variance
neighborhoods, where GCN’s degree-normalized averaging is already a
reasonable inductive bias.
A GAT layer in PyG is torch_geometric.nn.GATConv with arguments for the
hidden dimension and the number of heads
(PyG nn module reference).
Section 7: Over-smoothing — the depth ceiling
Section titled “Section 7: Over-smoothing — the depth ceiling”Stacking more GNN layers does not give you progressively richer representations the way stacking more CNN or transformer layers does. Past two or three layers, vanilla GCN representations begin to converge toward a fixed point that depends on graph topology but is independent of the original node features. Every node ends up looking like every other node in its connected component. This phenomenon is called over-smoothing, and it is the single most important practical fact about GNN depth. The intuition is that each GCN layer is approximately a step of Laplacian smoothing on the node features. Iterating Laplacian smoothing on a connected graph drives all node representations toward a stationary distribution determined by the graph structure, not the inputs. Oono and Suzuki (2020) made this formal and showed that, under standard conditions, deep GCNs lose expressive power exponentially with depth. That is why Kipf and Welling’s original GCN paper used two layers, why most GraphSAGE pipelines use two or three, and why a beginner’s instinct to “just stack more” backfires.
Pause and predict: You stack a 2-layer GCN, a 4-layer GCN, an 8-layer GCN, and a 16-layer GCN on Cora using the same hidden width and the same training budget. What does the validation accuracy curve look like as a function of depth, and at what depth do you expect the model to start behaving like an MLP that ignores its inputs entirely? (Answer: validation accuracy peaks around 2 to 3 layers, decays steadily beyond that, and approaches a value determined by class imbalance plus graph structure as depth grows large — node embeddings have collapsed and the linear head is classifying near-identical vectors.)
The mitigations are real and worth knowing, but none of them is a free lunch. Skip connections borrowed from ResNet — adding the layer’s input back to its output — let gradient and feature signal bypass the smoothing operator, which delays the collapse but does not avoid it indefinitely. Jumping Knowledge networks (Xu et al., 2018) concatenate or maximize across representations from every layer, letting the model pick the right effective depth per node. GCNII (Chen et al., 2020) combines an initial residual to the layer-zero features with an identity mapping in the weight, making GCNs that are 32 or 64 layers deep tractable. Layer normalization or batch normalization between message-passing layers helps a little. For practitioner work, the right default is two or three vanilla layers plus a strong feature head, and you only reach for deep-GCN tricks when you have evidence that long-range information is genuinely needed and your shallow model is bottlenecked by it.
Section 8: Over-squashing — long-range bottlenecks
Section titled “Section 8: Over-squashing — long-range bottlenecks”Over-squashing is a different and more fundamental failure mode. Even before over-smoothing kills you, long-range information has to flow through narrow cuts in the graph topology, and a fixed-width representation per node simply cannot carry enough bits. Alon and Yahav (2021) framed this formally and showed that on tasks requiring information from many hops away, GNNs degrade sharply because the per-node representation acts as a bottleneck on every long path. The intuition is geometric. Imagine a tree with a deep root and many leaves. A signal at one leaf needs to travel up to the root, mix with signals from every other branch at the root, and then travel back down to a different leaf. Each intermediate node squashes its entire growing receptive field into the same fixed-width vector. By the time the signal reaches a far-away node, it has been averaged with exponentially many other signals, and the relevant bit is lost. Over-squashing is fundamental rather than implementational because the bottleneck lives in the topology and the fixed representation width, not in any particular layer or optimizer. The most striking empirical finding from the same paper is that adding a single fully-adjacent layer — one round in which every node directly sees every other node, like a transformer step — recovered most of the lost performance on long-range tasks. That observation seeded the current line of work on graph transformers, which trade locality for global mixing in select layers. For practitioner work today, the practical implication is that if your problem genuinely needs information from six or more hops away, you should consider graph rewiring, virtual global nodes, or hybrid GNN-plus-attention architectures rather than simply stacking more local GNN layers. Local-only GNNs are not the right tool for fundamentally long-range tasks.
Section 9: Heterophily — the connected-nodes-are-similar trap
Section titled “Section 9: Heterophily — the connected-nodes-are-similar trap”The implicit assumption baked into vanilla GCN, GraphSAGE, and GAT is homophily: nodes connected by an edge tend to share label or be similar in the target sense. Citation networks have this property — papers cite papers in their own subfield. So do many social networks — friends share interests. But not all graphs are homophilic. Some online discussion graphs are heterophilic, where connections form between disagreeing parties. Some chemistry problems involve structures where adjacent atoms tend to be of different elements. And many real-world graphs land somewhere in between, with mixed local patterns. On a strongly heterophilic graph, averaging neighbor features mixes in signal that is actively misleading. Each GCN layer makes node representations look more like their dissimilar neighbors, which is exactly the opposite of what the downstream classifier needs. Zhu et al. (2020) showed empirically that on several heterophilic benchmark graphs — Texas, Wisconsin, and Cornell webpage networks among them — a plain 2-layer multilayer perceptron that ignores the graph entirely outperformed GCN, GraphSAGE, and GAT. That is a striking result and worth carrying in your head: a 2-layer MLP beating a 2-layer GCN means the graph structure was either not informative or actively harmful for the standard architectures.
Pause and decide: A teammate proposes training a vanilla GCN on a graph where edges connect users who flagged each other for problematic behavior, and the prediction target is whether a given user is themselves problematic. Why might that be exactly the wrong default architecture, and what is the one-line baseline you should run first to ground the decision? (Answer: flagging is a heterophilic signal — users tend to flag people unlike themselves, not similar to themselves — so vanilla GCN’s homophily-implicit averaging will smooth in misleading neighbor features. Run a graph-blind MLP on the same node features as the baseline; if GCN cannot beat it, the graph structure is not adding usable signal for this architecture.)
The principled answer when you suspect heterophily is to compute a homophily ratio — the fraction of edges where the two endpoints share the target label — on the training portion of the graph. A high ratio (above roughly 0.7) is a strong indicator that vanilla GCN is a reasonable default. A low ratio (below roughly 0.5) is a strong indicator that you should compare against a graph-blind MLP and consider heterophily-aware architectures such as H2GCN or GPR-GNN, which separate self representations from neighbor representations or learn signed propagation coefficients. Cross-link Module 1.7: Naive Bayes, k-NN & SVMs for the broader principle that the right learner depends on the structure of the data.
Section 10: Practical tasks — node, edge, graph
Section titled “Section 10: Practical tasks — node, edge, graph”Three task categories cover most GNN work in industry, each with a
different evaluation discipline.
Node classification predicts a label for each node — for instance, whether
a user is fraudulent, whether a paper belongs to a topic, whether a protein
has a particular function.
The training signal lives on a subset of nodes, the evaluation signal lives
on a held-out subset of nodes from the same graph (transductive) or from
separate graphs (inductive), and the loss is computed per node.
Link prediction predicts whether a candidate edge exists or scores a
likelihood — recommendation systems, drug-target interaction, knowledge
graph completion.
Training requires both positive examples (real edges) and negative
examples (sampled non-edges), and evaluation needs careful negative
sampling to avoid trivial negatives.
Graph classification predicts a label for an entire graph — molecule
property prediction is the canonical example.
Each example is a separate small graph with its own nodes and edges, and
the model produces a single output per graph by pooling node
representations after several message-passing layers.
PyG ships graph-level pooling layers (global_mean_pool, global_max_pool,
global_add_pool) for exactly this case
(PyG nn module reference).
Each task has its own evaluation failure modes.
Node classification on a single graph is the slipperiest because the
transductive setting deliberately exposes the whole graph at training time;
the practitioner trap is reporting transductive numbers and silently
implying inductive generalization (covered in Section 11).
Link prediction does leak if your negative samples include edges that
exist in your validation or test set, or if your train edges include the
reverse of test edges in an undirected graph.
Graph classification is closest to standard supervised learning because the
graphs are independent, but random splits can still understate brittleness
when scaffold-aware splits would have surfaced it
(Module 1.3: Model Evaluation, Validation, Leakage &
Calibration).
Section 11: Evaluation discipline for graphs
Section titled “Section 11: Evaluation discipline for graphs”The leakage taxonomy from Module 1.3 applies directly to graphs, but with two extra wrinkles worth naming.
The first wrinkle is the transductive-versus-inductive distinction. In a transductive node-classification setup, the model sees the entire graph at training time — full structure, full node features, edges in both directions across the train-test boundary. Only the labels of test nodes are masked. Aggregating across train-test edges during message passing is not leakage in this setup; it is the defining feature of transductive evaluation. This is the standard regime for Cora, CiteSeer, and Pubmed, and it is appropriate when deployment is the same shape: a fixed graph that grows with new labels rather than with new nodes.
In an inductive setup, the test nodes (or test graphs) are completely held out, including their edges and structural positions. The model has to generalize to unseen graph fragments at inference time. Most production fraud, recommendation, and molecular-property problems are inductive in practice. The real and common evaluation failure in published GNN work is not “messages crossed train-test edges” — it is reporting transductive accuracy and silently claiming inductive generalization from it. Match your evaluation to your deployment shape, or both numbers are uninterpretable.
The second wrinkle is that random splits often understate failure modes that a structural split would expose. Adjacent nodes share community, time, or scaffold context, and a random shuffle distributes that context evenly across train, validation, and test, which makes the held-out performance look more robust than it is. On custom graphs, the discipline is to split by a structural unit that matches deployment — connected component, time window, molecular scaffold, user cluster — so that the test set actually exercises the kind of generalization the model will be asked to do in production. The standard Cora/CiteSeer/Pubmed splits are publicly fixed for benchmark comparability rather than for realism, which is why production-leaning papers usually report results on splits that respect graph structure as well.
Cross-link Module 1.3: Model Evaluation, Validation, Leakage & Calibration for the full leakage discipline; it transfers to graphs with the modifications above.
Section 12: PyTorch Geometric in practice
Section titled “Section 12: PyTorch Geometric in practice”PyTorch Geometric is the dominant Python library for GNN work today, and its API is small enough that a working pipeline fits on one page (PyG documentation; PyG installation notes). DGL is the other actively maintained option and remains a reasonable choice in some research environments, but for new practitioner pipelines PyG is the default and that is what this section covers. The core data object holds node features and edges:
import torchfrom torch_geometric.data import Data
# A graph with 4 nodes and 4 directed edges, 1 feature per node.edge_index = torch.tensor( [[0, 1, 1, 2], [1, 0, 2, 3]], dtype=torch.long,)x = torch.tensor([[-1.0], [0.0], [1.0], [2.0]])data = Data(x=x, edge_index=edge_index)A two-layer GCN model is a few more lines, using the same forward shape
as any other PyTorch module:
import torchimport torch.nn.functional as Ffrom torch_geometric.nn import GCNConv
class GCN(torch.nn.Module): def __init__(self, in_dim, hidden, out_dim): super().__init__() self.conv1 = GCNConv(in_dim, hidden) self.conv2 = GCNConv(hidden, out_dim)
def forward(self, x, edge_index): h = F.relu(self.conv1(x, edge_index)) h = F.dropout(h, p=0.5, training=self.training) return self.conv2(h, edge_index)The training loop on a transductive node-classification graph uses train and validation masks to restrict the loss and metric computation:
import torchimport torch.nn.functional as F
model = GCN(in_dim=data.num_node_features, hidden=16, out_dim=7)optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
for epoch in range(200): model.train() optimizer.zero_grad() out = model(data.x, data.edge_index) loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask]) loss.backward() optimizer.step()For graphs that do not fit in GPU memory, the same model architecture
trains on sampled mini-batches via NeighborLoader
(PyG NeighborLoader tutorial):
from torch_geometric.loader import NeighborLoader
loader = NeighborLoader( data, num_neighbors=[25, 10], batch_size=128, input_nodes=data.train_mask,)
for batch in loader: optimizer.zero_grad() out = model(batch.x, batch.edge_index) loss = F.cross_entropy( out[: batch.batch_size], batch.y[: batch.batch_size], ) loss.backward() optimizer.step()Swapping in SAGEConv or GATConv is a one-line change in the model
definition, because PyG keeps the layer-call signature consistent across
the MessagePassing family.
That uniformity is a deliberate design choice and is worth more than any
single layer in the library — it is what makes architecture experiments
cheap.
Section 13: When GNNs are the wrong tool
Section titled “Section 13: When GNNs are the wrong tool”The most important section in this module is also the shortest, because the principle is simple and the rest of the work is in following it. Try a strong non-GNN baseline first. For a node-level task with rich engineered features, an XGBoost model on those features (Module 1.6: XGBoost & Gradient Boosting) is often within a small margin of the best GNN and is dramatically simpler to operate. For a node-level task with weaker features but informative graph structure, a graph-blind MLP on the node features alone is the right sanity check; if that baseline beats your GCN, you have learned something important about the data and you should investigate before adding GNN infrastructure. Concrete situations where GNNs tend to lose include the following. A small graph with a few thousand nodes and well-engineered features — the inductive bias premium does not pay for itself. A heterophilic graph without specialized architecture — Section 9 documented the empirical pattern. A latency-sensitive inference path where you need millisecond predictions and the neighborhood lookup itself becomes the bottleneck — sampling-based inference can mitigate but adds complexity that simpler models avoid. A graph where the relationships are decorative rather than informative — the structure exists but does not carry signal toward the target. The discipline is the same as in every other module of this curriculum. Beat a strong simple baseline before adding complexity. GNNs do real work when the graph structure carries information that hand-engineered features do not, the regime is large enough that the inductive bias amortizes the engineering cost, and you have the patience to evaluate honestly through inductive splits. They are the wrong tool when those conditions do not hold.
Pause and reflect: Your team has a fraud-detection problem with one million users, dense engineered features (transaction velocity, device fingerprint, prior-incident counts), and a strong XGBoost baseline at some target precision. A junior teammate proposes a GraphSAGE model on the user-merchant bipartite graph. What is the smallest, fairest experiment that decides whether to invest, and what is the failure condition under which you go back to XGBoost? (Answer: train a graph-blind MLP on the same engineered features and a GraphSAGE on the same node features plus the graph, evaluating both with the same inductive split that simulates the deployment scenario. If GraphSAGE cannot beat both XGBoost and the graph-blind MLP by a margin that exceeds the operational cost of the new pipeline — feature engineering is shifted, neighborhood serving is added, monitoring is expanded — go back to XGBoost.)
Decision-keyed regime table
Section titled “Decision-keyed regime table”| Regime | Algorithm | Why |
|---|---|---|
| Small homophilic graph, fixed at deployment, transductive | 2-layer GCN | Strong default; minimal infrastructure; matches the assumption that connected nodes share labels. |
| Large graph that does not fit in GPU memory, inductive deployment | GraphSAGE with NeighborLoader | Sampling-based training scales; same model runs on unseen graph fragments. |
| Neighbors carry highly variable importance for the target | GAT | Learned attention earns its cost when uniform averaging is a worse default. |
| Need for many hops of receptive field beyond what 2-3 layers gives | JKNet, GCNII, or graph rewiring with virtual nodes | Vanilla deep GCNs collapse from over-smoothing; deep-GNN tricks address that, global mixing addresses over-squashing. |
| Heterophilic graph (low neighbor-label agreement) | Graph-blind MLP first; H2GCN- or GPR-GNN-style if MLP plus structure helps | Vanilla GCN often loses to MLP here; the right baseline is MLP, not a deeper GCN. |
| Small graph with rich engineered features, latency-sensitive deployment | XGBoost on engineered features (no GNN) | The inductive-bias premium does not amortize; tabular wins on cost and latency. |
| Graph-level prediction (molecules, small subgraphs) | GCN or GAT body plus global_mean_pool head | Pooling collapses node embeddings into a graph embedding for the final head. |
Did You Know?
Section titled “Did You Know?”-
The symmetric normalization in GCN — — is borrowed from spectral graph theory and corresponds to the first-order Chebyshev polynomial approximation of a localized graph filter. Kipf and Welling derived their layer as a simplification of earlier spectral approaches, and the surprising empirical result was that this drastic simplification still works well on standard citation benchmarks (Kipf & Welling, 2017).
-
GraphSAGE stands for “Sample and AggregatE.” Sampling is in the name because it is the load-bearing trick that makes the method scale to large graphs — fixed-size neighborhood sampling at each layer keeps memory bounded regardless of graph size or node degree (Hamilton et al., 2017).
-
Adding a single round of fully-connected mixing — letting every node directly see every other node for one layer — recovered most of the lost performance on long-range tasks in the over-squashing study by Alon and Yahav, suggesting the bottleneck is fundamentally about graph topology rather than depth or capacity (Alon & Yahav, 2021). This finding motivates much of the current work on graph transformers.
-
Oono and Suzuki showed formally that under standard conditions deep GCNs lose expressive power exponentially in depth — node embeddings collapse toward a low-dimensional subspace whose structure depends on the graph rather than the input features. The two-layer convention in most GCN papers is empirically calibrated to stay safely above this collapse (Oono & Suzuki, 2020).
Common Mistakes
Section titled “Common Mistakes”| Mistake | Failure mode | Correct approach |
|---|---|---|
| Stacking 8 or more GCN layers expecting CNN-like depth gains | Over-smoothing collapses node embeddings; validation accuracy decays | Use 2-3 layers as the default; reach for JKNet or GCNII only with evidence that long-range information is the bottleneck |
| Splitting nodes uniformly at random on a custom graph and reporting transductive accuracy as if it were inductive | Random splits hide structural correlation; transductive numbers do not measure inductive generalization | Match the split shape to deployment — split by structural unit (component, time, scaffold, user cluster) when the model must generalize to unseen graph fragments |
| Reaching for a GNN before running a tabular or graph-blind baseline | Wasted infrastructure if XGBoost or MLP already wins on cost | Beat XGBoost on engineered features and an MLP on node features alone before claiming a GNN result |
| Forgetting to scale continuous node features | Distance- and attention-based layers see lopsided per-feature magnitudes | Standardize node features unless you are sure the scales are already comparable, mirroring the discipline in Module 1.4: Feature Engineering & Preprocessing |
| Training a vanilla GCN on a heterophilic graph and reporting only its accuracy | A 2-layer MLP on the same features may beat the GCN; the comparison hides the diagnosis | Check the homophily ratio, run an MLP baseline, and use heterophily-aware architectures if the regime requires it |
| Full-batch training on a multi-million-node graph | GPU OOM and unstable runs | Switch to NeighborLoader and sampling-based mini-batches |
| Treating GAT attention weights as global feature importance | Attention weights are conditional on path and message-passing dynamics | Use proper interpretability tools such as gradient-based attribution or GNNExplainer if the goal is feature attribution |
| Reporting transductive accuracy and claiming inductive generalization | The model never had to generalize structurally; deployment will reveal the gap | Match evaluation to deployment — if the production scenario is inductive, evaluate inductively |
-
A team trains a 2-layer GCN on Cora and gets a strong validation accuracy. They double the depth to 4 layers expecting an improvement, then 6, then 8. Validation accuracy drops monotonically. What is the diagnosis and what is the cheapest mitigation that does not abandon the architecture?
Answer
The diagnosis is over-smoothing: repeated symmetric aggregation collapses node embeddings toward a fixed point dominated by graph structure rather than input features. The cheapest mitigation is to add residual connections so the layer's input is added back to its output, which lets feature signal bypass the smoothing operator. If that is not enough, JKNet-style concatenation across layers or GCNII's initial-residual plus identity-mapping construction allow much deeper stacks. The wrong answer is to keep stacking layers in the hope that a deeper model will find a better optimum. -
A fraud team has a strong XGBoost baseline on engineered transaction features at the desired precision. A teammate proposes adding a GNN over the user-merchant bipartite graph. What is the smallest fair experiment that decides whether to invest, and what failure condition sends the team back to XGBoost?
Answer
Train a graph-blind MLP on the same engineered features and a GraphSAGE on the same node features plus the graph, evaluating both under the same inductive split that simulates deployment. If GraphSAGE cannot beat both the XGBoost baseline and the MLP baseline by a margin that exceeds the operational cost of the new pipeline — feature engineering shifts, neighborhood serving, expanded monitoring — go back to XGBoost. -
A team trains on graph G1 and evaluates on graph G2, where G2 is a completely separate graph the model has never seen. Which family of architecture is the natural fit and why?
Answer
GraphSAGE is the natural fit because it is inductive by design. Each layer is an aggregation function plus a learned transform that operates on local neighborhoods rather than on a fixed adjacency matrix, so the same trained model applies to graphs it never saw at training time. Vanilla GCN as originally formulated is transductive and does not generalize to a new graph in the same way without modification. -
A graph has a low homophily ratio — neighbors typically have different target labels. A team plans to train a 4-layer GCN on it. What is the first baseline they should run instead, and why?
Answer
A graph-blind MLP on the node features alone. On heterophilic graphs, averaging neighbor features actively mixes in misleading signal, and empirical studies have shown that a plain 2-layer MLP outperforms GCN, GraphSAGE, and GAT on several heterophilic benchmarks. If the MLP wins or matches the GCN, the conclusion is that the graph structure is not adding usable signal under the standard architecture and the team should consider heterophily-aware models or stick with the MLP. -
A team needs to train a node classifier on a graph with several million nodes. Full-batch training runs out of GPU memory. What is the right tool from PyTorch Geometric and what does it do conceptually?
Answer
`NeighborLoader` performs GraphSAGE-style sampling: at each layer, for every node in the batch, it samples a fixed number of neighbors — typically 25 at the first layer and 10 at the second. The sampled subgraph fits in GPU memory regardless of the global graph size, and the same trained model can be applied to the full graph at inference time. This is the standard scaling answer for industrial GNN pipelines. -
A team finds that GCN and GAT give nearly the same validation accuracy on a homophilic citation benchmark. Should they ship GAT because it is more expressive, or GCN because it is simpler?
Answer
On a homophilic graph with low-variance neighborhoods, GCN's degree-normalized averaging is already a good inductive bias and GAT's learned attention does not earn its extra cost. Ship GCN. GAT helps when neighbor importance varies sharply — a node has many neighbors and only a few are informative — or when the graph is not strictly homophilic. Picking the more expressive model when the data does not require it costs training time and serving complexity for no measurable gain. -
Alon and Yahav reported that adding a single fully-adjacent layer recovered much of the lost performance on long-range tasks. What broader lesson does this carry for GNN architecture design?
Answer
The lesson is that over-squashing is a topological bottleneck rather than a depth or capacity problem. A fixed-width per-node representation cannot carry enough bits along long paths, no matter how many local message-passing layers you stack. Recovering long-range performance requires breaking locality somewhere — through global mixing layers, virtual fully-connected nodes, graph rewiring, or transformer-style attention over the whole graph at select layers — not through stacking more local layers. -
A team plans to predict molecule properties: each example is a small graph and the prediction is a single number per molecule. Sketch the architecture and identify the one structural component that distinguishes graph-level work from node-level work.
Answer
Stack two or three message-passing layers (GCN or GAT) to compute node embeddings, then apply a graph-level pooling operation — `global_mean_pool`, `global_max_pool`, or `global_add_pool` — to collapse the per-node embeddings into a single per-graph vector, followed by a small MLP head for the final prediction. The distinguishing component is the pooling step: node-level work returns one prediction per node, while graph-level work pools first and returns one prediction per graph.
Hands-On Exercise
Section titled “Hands-On Exercise”The goal of this exercise is to internalize three things at once: the graph-blind baseline, the homophily check, and the over-smoothing diagnostic. You will compare an MLP, a 2-layer GCN, and a deeper GCN on the standard Cora citation dataset shipped with PyTorch Geometric and write a short decision memo at the end.
-
Install
torchandtorch_geometricper the official installation notes (PyG installation). Verify by importing both libraries in a Python shell. -
Load the Cora dataset and inspect the
Dataobject.
from torch_geometric.datasets import Planetoid
dataset = Planetoid(root="/tmp/Cora", name="Cora")data = dataset[0]print(data)print("num_nodes:", data.num_nodes)print("num_edges:", data.num_edges)print("num_features:", data.num_features)print("num_classes:", dataset.num_classes)-
Compute the homophily ratio: the fraction of edges where the two endpoints share a label, restricted to the training subgraph. Note the result. Cora is highly homophilic, which sets your expectations for the baselines below.
-
Train a graph-blind MLP on the node features alone, ignoring
edge_indexentirely. Use the train mask, validate on the val mask, and report final test accuracy. This is your floor. -
Train a 2-layer GCN with
GCNConvon the same data, same loss, same training budget. Report the test accuracy. On Cora this should noticeably outperform the MLP, confirming the graph carries signal beyond the features alone. -
Replace the GCN body with
GATConv(one head is fine to start) and rerun. On Cora the gap to GCN should be small. Note the result and reflect on whether attention earned its extra training cost. -
Stretch goal: train two more GCN configurations with depth 4 and depth 8, holding hidden width and training budget constant. Plot or tabulate test accuracy as a function of depth. You should see validation and test accuracy peak around depth 2 to 3 and degrade as depth increases — this is over-smoothing in practice.
-
Write a short decision memo. State the homophily ratio. State each baseline’s accuracy. Identify which architecture you would ship for Cora-style problems and why. Note explicitly whether the depth experiment confirmed over-smoothing on this dataset, and what you would do differently on a heterophilic graph with the same node features.
Sources
Section titled “Sources”- PyTorch Geometric documentation: the canonical entry point and library reference for the
Data,MessagePassing, and layer APIs used throughout this module. - PyTorch Geometric introduction: hands-on getting-started guide that documents the
Dataobject and the standard transductive Cora workflow. - PyTorch Geometric installation notes: authoritative installation procedure including the wheel-index step that is easy to miss on first install.
- PyTorch Geometric MessagePassing tutorial: explains the
message,aggregate,updatehook structure that underlies most published GNN layers. - PyTorch Geometric nn module reference: full API for
GCNConv,SAGEConv,GATConv, and the graph-level pooling layers used in graph classification. - PyTorch Geometric heterogeneous graph tutorial: documents the separate API and message-passing rules for graphs with multiple node and edge types.
- PyTorch Geometric NeighborLoader tutorial: the standard mini-batch sampling pattern for graphs that do not fit in GPU memory.
- PyTorch Geometric dataset tutorial: documents how PyG ships standard benchmark datasets and how to wrap a custom dataset.
- PyTorch Geometric GNN cheatsheet: reference table of published GNN layers and the
MessagePassingfeatures each one uses. - PyTorch documentation: canonical reference for the underlying tensor, module, autograd, and optimizer APIs that PyG builds on.
- Kipf & Welling, 2017: introduced the GCN layer with symmetric normalized adjacency aggregation as a first-order spectral approximation, plus the standard transductive Cora and CiteSeer benchmarks.
- Hamilton et al., 2017: introduced GraphSAGE with the inductive sampling-based aggregator design that scales to large graphs.
- Veličković et al., 2018: introduced GAT with multi-head attention over graph neighbors, mirroring transformer-style attention restricted to local structure.
- Xu et al., 2018: introduced Jumping Knowledge networks as a structural mitigation for over-smoothing by combining representations across layers.
- Oono & Suzuki, 2020: proved formally that deep GCNs lose expressive power exponentially with depth, the rigorous version of over-smoothing.
- Alon & Yahav, 2021: identified over-squashing as a topological bottleneck for long-range tasks and showed that a single fully-adjacent layer recovered much of the lost performance.
- Zhu et al., 2020: introduced H2GCN and documented the heterophilic-graph regime where a plain MLP outperforms GCN, GraphSAGE, and GAT on standard benchmarks.
- Chen et al., 2020: introduced GCNII with initial residual and identity mapping to enable practically-deep GCN stacks without collapse.
Next Steps
Section titled “Next Steps”This module closes the Tier-2 extension of the Deep Learning Foundations track. The KubeDojo curriculum continues in two adjacent tracks under the same AI/ML Engineering umbrella.
The Reinforcement Learning track introduces RL as a practitioner decision discipline, and a Tier-2 Offline RL & Imitation Learning module is forthcoming. The Machine Learning track’s Tier-1 sequence (modules 1.1 through 1.12) is complete; the Tier-2 sequence covering class imbalance, interpretability, Bayesian ML, recommenders, conformal prediction, fairness, and causal inference is forthcoming.
The unifying theme across all three tracks is the same discipline you have applied throughout this module: pick the right tool for the data structure and the deployment regime, beat a strong simple baseline before adding complexity, and evaluate honestly under a split that matches deployment.