Backpropagation from Scratch
AI/ML Engineering Track | Complexity: Complex | Time: 8-10 hours
Or: The Algorithm That Taught Machines to Learn (And How to Build It Yourself)
Reading Time: 6-7 hours | Prerequisites: Module 30
The Idea That Almost Died Twice
Helsinki, Finland. 1970.
Seppo Linnainmaa, a young Finnish computer scientist, had just finished his master’s thesis. The thesis described an elegant algorithm for computing derivatives automatically—what would later be called “reverse-mode automatic differentiation.” It was mathematically beautiful, but no one cared. Computers were too slow, and the problems it could solve seemed too small to matter.
The idea lay dormant for almost two decades.
Then in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” in Nature. They showed how Linnainmaa’s algorithm could train neural networks with multiple hidden layers—something thought impossible at the time. The paper made backpropagation famous.
But even then, the world wasn’t ready. Neural networks were slow, finicky, and often outperformed by simpler methods. Backprop was nearly forgotten again in the 1990s when support vector machines dominated.
It took until 2012—when Alex Krizhevsky’s AlexNet crushed the ImageNet competition using GPUs—for backpropagation to claim its throne. The algorithm that almost died twice is now the heartbeat of every AI system on Earth.
“Backpropagation is a beautiful algorithm. It’s not deep learning’s breakthrough—it’s its foundation. Everything else is built on top of it.” — Geoffrey Hinton, Nobel Prize acceptance speech, 2024
What You’ll Be Able to Do
By the end of this module, you will:
- Truly understand what backpropagation actually does
- Master the chain rule and see why it’s the key to everything
- Build a working autograd engine from scratch
- Understand computational graphs and how frameworks track operations
- Know how to debug gradient issues (vanishing, exploding, NaN)
- Implement gradient checking to verify your implementations
- Visualize gradient flow through networks
Did You Know? The term “backpropagation” wasn’t always the preferred name. Paul Werbos originally called it “dynamic feedback” in his 1974 PhD thesis. The name “backpropagation” (or “back-propagation”) was popularized by Rumelhart, Hinton, and Williams in their famous 1986 Nature paper. In the control theory community, the same algorithm is known as “adjoint sensitivity analysis.” Meanwhile, in the automatic differentiation community, it’s called “reverse-mode automatic differentiation.” All these names describe the same fundamental algorithm!
The Eureka Moment: What You’re About to Discover
Think of loss.backward() like pressing “show work” on a calculator that solved a complex equation. The calculator (PyTorch) has been keeping notes about every operation it performed. When you press backward, it traces through those notes in reverse, computing exactly how much each input contributed to the final answer.
Every time you call loss.backward() in PyTorch, something magical happens. The framework somehow figures out how to adjust millions of parameters to make the loss smaller. But how?
The answer is backpropagation — short for “backward propagation of errors.” It’s not magic; it’s calculus applied cleverly. By the end of this module, you’ll be able to implement it yourself from scratch.
Here’s the transformative insight: backpropagation is just the chain rule applied systematically. Once you see it, you can’t unsee it. Every neural network, from a simple perceptron to GPT-5, learns using the same fundamental algorithm.
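To make this concrete before we build it ourselves, here is a minimal PyTorch sketch (illustrative only) using the same x = 2, w = 3, b = 1 expression this module dissects later:

```python
import torch

# Minimal sketch: PyTorch records the operations below as a graph,
# and loss.backward() traverses that graph in reverse.
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)

loss = (x * w + 1.0) ** 2   # the (x·w + b)² example used throughout this module
loss.backward()             # reverse traversal of the recorded graph

print(w.grad)               # tensor(28.) — d/dw (x·w + 1)² = 2(x·w + 1)·x = 2·7·2
```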
The Problem: How Do We Update Weights?
Consider a simple network:

```
Input x → [Layer 1] → [Layer 2] → [Layer 3] → Output ŷ → Loss L
           W₁, b₁      W₂, b₂      W₃, b₃
```

We want to minimize the loss L. To do that, we need to know:
- How should we change W₁ to reduce L?
- How should we change W₂ to reduce L?
- How should we change W₃ to reduce L?
Mathematically, we need: ∂L/∂W₁, ∂L/∂W₂, ∂L/∂W₃
But here’s the challenge: W₁ affects the output through a long chain of operations. How do we compute its effect on the loss?
The Chain Rule: The Foundation of Everything
Think of the chain rule like a relay race. When you want to know how fast the final runner is going, you need to know how fast each handoff affected the next. If runner 1 speeds up, runner 2 speeds up, which speeds up runner 3. The total effect is the product of all individual effects.
In backpropagation, we’re asking: “If I wiggle this weight, how much does the loss wiggle?” The answer flows backward through each operation, multiplying effects at each step.
The Single-Variable Chain Rule
If y = f(g(x)), then:

dy/dx = dy/dg · dg/dx

Worked Example:
Let y = (2x + 1)³
- Let g = 2x + 1, so y = g³
- dy/dg = 3g² = 3(2x + 1)²
- dg/dx = 2
- Therefore: dy/dx = 3(2x + 1)² · 2 = 6(2x + 1)²
Let’s verify at x = 1:
- y = (2·1 + 1)³ = 27
- dy/dx = 6(2·1 + 1)² = 6 · 9 = 54
Numerical check: (y(1.001) - y(1))/0.001 = (27.054… - 27)/0.001 ≈ 54
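A quick sketch you can run to reproduce the numerical check above (plain Python, forward difference):

```python
def y(x):
    return (2 * x + 1) ** 3

x, eps = 1.0, 1e-3
numerical = (y(x + eps) - y(x)) / eps    # ≈ 54.04 (forward difference)
analytical = 6 * (2 * x + 1) ** 2        # 54, from the chain rule above

print(numerical, analytical)
```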
The Multi-Variable Chain Rule
If L depends on x through multiple paths:
L = f(a, b) where a = g(x) and b = h(x)
∂L/∂x = (∂L/∂a)(∂a/∂x) + (∂L/∂b)(∂b/∂x)

We sum the contributions from all paths.
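A small sketch (an illustrative example, not from the text) showing the multi-path rule in action: take L = a·b with a = x + 1 and b = 2x, so x reaches L through two paths:

```python
# L = a * b, where a = x + 1 and b = 2x → two paths from x to L
def L(x):
    return (x + 1) * (2 * x)

x = 1.0
a, b = x + 1, 2 * x

# Sum of the per-path contributions:
# (∂L/∂a)(∂a/∂x) + (∂L/∂b)(∂b/∂x) = b·1 + a·2
analytical = b * 1 + a * 2                        # 6.0
numerical = (L(x + 1e-5) - L(x - 1e-5)) / 2e-5    # ≈ 6.0

print(analytical, numerical)
```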
Did You Know? The chain rule was formalized by Gottfried Wilhelm Leibniz in the late 1600s. Over 300 years later, it became the mathematical foundation of deep learning. Leibniz couldn’t have imagined that his calculus would one day power language models and self-driving cars!
Computational Graphs: Tracking Operations
Think of a computational graph like a detailed recipe card that records every step of cooking. When you make a cake, you don’t just remember “I made a cake”—you remember “I mixed flour with sugar, then added eggs, then baked for 30 minutes.” If the cake tastes too sweet, you can trace back through the recipe to find which step added too much sugar and adjust it. Computational graphs work the same way: they record every mathematical operation so we can trace backward and figure out which weights need adjusting.
Modern deep learning frameworks represent computations as directed acyclic graphs (DAGs). Each node is an operation, and edges represent data flow.
Example: Simple Expression
Consider: L = (x · w + b)²
The computational graph looks like:
```
x     w
 \   /
  [*]  ← multiply
   |
   v
  z₁ = x · w
   |
  [+]  ← b
   |
   v
  z₂ = z₁ + b
   |
  [²]  ← square
   |
   v
  L = z₂²
```

Forward Pass

Compute values from inputs to outputs:

```python
# Forward pass
x, w, b = 2.0, 3.0, 1.0

z1 = x * w     # z1 = 6
z2 = z1 + b    # z2 = 7
L = z2 ** 2    # L = 49
```

Backward Pass

Compute gradients from outputs to inputs:

```python
# Backward pass - apply chain rule at each node
dL_dL = 1.0                  # Start with 1

# d(z²)/dz = 2z
dL_dz2 = dL_dL * 2 * z2      # dL/dz2 = 1 * 2 * 7 = 14

# d(z1 + b)/dz1 = 1, d(z1 + b)/db = 1
dL_dz1 = dL_dz2 * 1          # dL/dz1 = 14
dL_db = dL_dz2 * 1           # dL/db = 14

# d(x * w)/dx = w, d(x * w)/dw = x
dL_dx = dL_dz1 * w           # dL/dx = 14 * 3 = 42
dL_dw = dL_dz1 * x           # dL/dw = 14 * 2 = 28
```

Verification:
- L = (xw + b)² = (2·3 + 1)² = 49
- ∂L/∂w = 2(xw + b) · x = 2·7·2 = 28
- ∂L/∂x = 2(xw + b) · w = 2·7·3 = 42
- ∂L/∂b = 2(xw + b) · 1 = 2·7·1 = 14
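If you have PyTorch installed, the same graph can be cross-checked in a few lines (a verification sketch, separate from the autograd engine we build next):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

L = (x * w + b) ** 2
L.backward()

print(L.item(), x.grad.item(), w.grad.item(), b.grad.item())  # 49.0, 42.0, 28.0, 14.0
```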
Building an Autograd Engine from Scratch
Let’s build a simple but complete autograd system. This is how PyTorch works at its core!
The Value Class
```python
class Value:
    """
    A scalar value that tracks its computation history
    for automatic differentiation.

    This is a simplified version of how PyTorch tensors work internally.
    """

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0  # Gradient starts at 0

        # Internal variables for autograd
        self._backward = lambda: None   # Function to compute gradient
        self._prev = set(_children)     # Parent nodes in the graph
        self._op = _op                  # Operation that created this node

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # d(a + b)/da = 1, d(a + b)/db = 1
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # d(a * b)/da = b, d(a * b)/db = a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward

        return out

    def __pow__(self, n):
        assert isinstance(n, (int, float)), "only int/float powers supported"
        out = Value(self.data ** n, (self,), f'**{n}')

        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += out.grad * (n * self.data ** (n - 1))
        out._backward = _backward

        return out

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other ** -1

    def __radd__(self, other):
        return self + other

    def __rmul__(self, other):
        return self * other

    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')

        def _backward():
            # d(ReLU(x))/dx = 1 if x > 0 else 0
            self.grad += out.grad * (1.0 if self.data > 0 else 0.0)
        out._backward = _backward

        return out

    def tanh(self):
        import math
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # d(tanh(x))/dx = 1 - tanh(x)²
            self.grad += out.grad * (1 - t ** 2)
        out._backward = _backward

        return out

    def exp(self):
        import math
        out = Value(math.exp(self.data), (self,), 'exp')

        def _backward():
            # d(e^x)/dx = e^x
            self.grad += out.grad * out.data
        out._backward = _backward

        return out

    def backward(self):
        """
        Compute gradients for all nodes in the graph
        using reverse-mode autodiff.

        We do a topological sort to ensure we process nodes
        in the right order (children before parents).
        """
        # Topological sort
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)

        # Start with gradient of 1 for the output
        self.grad = 1.0

        # Apply chain rule in reverse topological order
        for v in reversed(topo):
            v._backward()
```

Notice how each operation stores a _backward function that knows how to compute its local gradient. When we call backward(), we traverse the graph in reverse order, applying the chain rule at each node.
Testing Our Autograd
```python
# Test 1: Simple expression
x = Value(2.0)
w = Value(3.0)
b = Value(1.0)

# L = (x*w + b)²
L = (x * w + b) ** 2
L.backward()

print(f"L = {L.data}")       # 49.0
print(f"dL/dx = {x.grad}")   # 42.0
print(f"dL/dw = {w.grad}")   # 28.0
print(f"dL/db = {b.grad}")   # 14.0

# Test 2: Neural network forward pass
def neuron(x, w, b):
    return (x * w + b).tanh()

x = Value(1.0)
w = Value(0.5)
b = Value(0.1)
y = neuron(x, w, b)
y.backward()

print(f"\nNeuron output: {y.data:.4f}")
print(f"dy/dw = {w.grad:.4f}")
```

Did You Know? This exact approach — building a computation graph during forward pass and traversing it backward — is how PyTorch’s autograd works. The main difference is that PyTorch handles tensors instead of scalars and has optimized C++ implementations. Andrej Karpathy’s “micrograd” (which inspired this implementation) proved you can build a functional autograd in ~100 lines of Python!
The Backpropagation Algorithm
Think of backpropagation like a blame assignment meeting after a project fails. The final result (loss) was bad, so you need to figure out who’s responsible. You start at the end: “The final report was wrong.” Then you trace back: “Because the calculations were off.” Then: “Because the data entry had mistakes.” Each person (weight) gets assigned exactly their share of the blame (gradient), proportional to how much they contributed to the problem. This “blame” is then used to adjust behavior (update weights) for next time.
Now let’s see how backpropagation works in a full neural network.
Forward Pass Through a Layer
For a layer: y = σ(Wx + b)

```python
# Forward pass
z = W @ x + b        # Linear transformation
y = activation(z)    # Apply activation
```

Backward Pass Through a Layer
Given ∂L/∂y (gradient from the layer above), compute:
- ∂L/∂W (to update weights)
- ∂L/∂b (to update biases)
- ∂L/∂x (to pass to the layer below)
```
∂L/∂z = ∂L/∂y · ∂y/∂z = ∂L/∂y · σ'(z)   # Gradient through activation
∂L/∂W = ∂L/∂z · ∂z/∂W = ∂L/∂z · xᵀ      # Weight gradient
∂L/∂b = ∂L/∂z · ∂z/∂b = ∂L/∂z           # Bias gradient
∂L/∂x = ∂L/∂z · ∂z/∂x = Wᵀ · ∂L/∂z      # Input gradient (for previous layer)
```

Worked Example: Two-Layer Network
Network: x → Linear(2, 3) → ReLU → Linear(3, 1) → MSE Loss

```python
import numpy as np

# Initialize
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1  # 3 neurons, 2 inputs
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3) * 0.1  # 1 output, 3 inputs
b2 = np.zeros((1, 1))

# Single training example
x = np.array([[1.0], [2.0]])   # Input, shape (2, 1)
y_true = np.array([[1.0]])     # Target

# === FORWARD PASS ===
z1 = W1 @ x + b1          # (3, 1)
a1 = np.maximum(0, z1)    # ReLU
z2 = W2 @ a1 + b2         # (1, 1)
y_pred = z2               # No activation on output
loss = 0.5 * (y_pred - y_true) ** 2

print(f"Forward pass:")
print(f"  z1 shape: {z1.shape}, a1 shape: {a1.shape}")
print(f"  y_pred: {y_pred[0,0]:.4f}, loss: {loss[0,0]:.4f}")

# === BACKWARD PASS ===
# Start from loss
dL_dy = y_pred - y_true   # d(0.5(y-t)²)/dy = (y-t)

# Through layer 2
dL_dz2 = dL_dy            # No activation
dL_dW2 = dL_dz2 @ a1.T    # (1, 3)
dL_db2 = dL_dz2           # (1, 1)
dL_da1 = W2.T @ dL_dz2    # (3, 1)

# Through ReLU
dL_dz1 = dL_da1 * (z1 > 0)  # ReLU gradient: 1 if z>0, else 0

# Through layer 1
dL_dW1 = dL_dz1 @ x.T     # (3, 2)
dL_db1 = dL_dz1           # (3, 1)

print(f"\nBackward pass:")
print(f"  dL/dW2 shape: {dL_dW2.shape}")
print(f"  dL/dW1 shape: {dL_dW1.shape}")
```

Gradient Checking: Verifying Your Implementation
Think of gradient checking like double-checking your work on a math test. Your analytical solution (backprop) should give the same answer as the slow but reliable numerical method (finite differences). If they disagree significantly, you made a mistake somewhere.
How do you know your backprop is correct? Gradient checking compares analytical gradients (from backprop) with numerical gradients (from finite differences).
Numerical Gradient
∂f/∂x ≈ [f(x + ε) - f(x - ε)] / (2ε)

This approximates the definition of the derivative with a small ε, using the symmetric (central) difference for better accuracy.
Implementation
```python
def gradient_check(param, loss_fn, epsilon=1e-5):
    """
    Compare analytical gradient with numerical gradient.

    Args:
        param: Parameter to check (numpy array)
        loss_fn: Function that takes param and returns (loss, grad)
        epsilon: Small number for finite differences

    Returns:
        relative_error: Should be < 1e-5 for correct implementation
    """
    # Get analytical gradient
    loss, analytical_grad = loss_fn(param)

    # Compute numerical gradient
    numerical_grad = np.zeros_like(param)

    it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        original = param[idx]

        # f(x + ε)
        param[idx] = original + epsilon
        loss_plus, _ = loss_fn(param)

        # f(x - ε)
        param[idx] = original - epsilon
        loss_minus, _ = loss_fn(param)

        # Numerical gradient
        numerical_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)

        # Restore
        param[idx] = original
        it.iternext()

    # Relative error
    diff = np.abs(analytical_grad - numerical_grad)
    norm = np.abs(analytical_grad) + np.abs(numerical_grad) + 1e-8
    relative_error = np.max(diff / norm)

    return relative_error, analytical_grad, numerical_grad


# Example usage
def simple_loss(W):
    """L = sum(W²)"""
    loss = np.sum(W ** 2)
    grad = 2 * W
    return loss, grad

W = np.random.randn(3, 4)
error, analytical, numerical = gradient_check(W, simple_loss)
print(f"Relative error: {error:.2e}")  # Should be ~1e-10
```

When to Use Gradient Checking
- During development: When implementing new layers or loss functions
- Debugging NaN gradients: To isolate which operation is broken
- Custom autograd operations: To verify correctness
- Never in production: It’s O(n) forward passes per gradient, extremely slow
Did You Know? The backpropagation algorithm was independently discovered several times. Paul Werbos described it in his 1974 PhD thesis, but it wasn’t widely known. In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” in Nature, which finally brought backprop to mainstream attention. Hinton later won the Nobel Prize in Physics (2024) partly for this work!
Common Gradient Problems
Think of gradients like a whispered message in the “telephone game.” In a shallow network (few players), the message arrives mostly intact. But in a deep network (many players), the message can either fade to silence (vanishing gradients) or get wildly exaggerated (exploding gradients). Modern techniques like skip connections are like having players occasionally shout the original message directly across the room, bypassing the chain entirely.
1. Vanishing Gradients
Symptom: Early layers learn very slowly or not at all.
Cause: Gradients shrink as they propagate backward through many layers.
Sigmoid: σ'(x) = σ(x)(1 - σ(x)) ≤ 0.25

After 10 layers: gradient ≤ 0.25¹⁰ ≈ 0.000001 (a numeric sketch follows the solutions list below)

Solutions:
- Use ReLU or variants (Leaky ReLU, GELU)
- Batch normalization
- Residual connections
- Proper initialization (Xavier, He)
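A tiny sketch of the multiplicative shrinkage described above, assuming the most favorable pre-activation (z = 0, where σ' reaches its maximum of 0.25):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Backprop through a chain of sigmoid layers multiplies one σ'(z) factor per layer.
grad = 1.0
for layer in range(1, 11):
    z = 0.0                                   # best case: σ'(0) = 0.25
    grad *= sigmoid(z) * (1 - sigmoid(z))
    print(f"after layer {layer}: gradient factor ≈ {grad:.2e}")
# After 10 layers the factor is ~9.5e-07, matching 0.25¹⁰ above.
```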
Did You Know? The vanishing gradient problem plagued neural networks for nearly two decades! After the initial excitement of backpropagation in the 1980s, researchers found that networks deeper than 2-3 layers were nearly impossible to train. This “AI Winter” period lasted until ~2006 when Geoffrey Hinton showed that layer-by-layer pretraining could help. The real breakthrough came with ReLU activations (2010) and residual connections (2015), finally enabling the 100+ layer networks we use today.
2. Exploding Gradients
Symptom: Loss becomes NaN, weights explode to infinity.
Cause: Gradients grow as they propagate, especially in RNNs.
Solutions:
- Gradient clipping: `grad = clip(grad, -max_val, max_val)` (built-in PyTorch helpers are sketched after this list)
- Proper initialization
- Batch normalization
- Learning rate warmup
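In PyTorch you don’t have to implement clipping yourself; here is a minimal sketch of the built-in utilities (Exercise 3 later in this module has you write the same logic by hand):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10) * 100).pow(2).mean()  # deliberately large activations
loss.backward()

# Scale all gradients so their global L2 norm is at most 1.0
# (call after backward(), before optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Alternative: clamp each gradient element to [-0.5, 0.5]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```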
3. Dead ReLU
Symptom: Some neurons never activate, always output 0.
Cause: If a neuron’s input is always negative, ReLU outputs 0, and gradient is 0, so it never updates.
```python
# Dead neuron: if z < 0 always, then:
# Forward:  ReLU(z) = 0
# Backward: dReLU/dz = 0 → no learning!
```

Solutions (a runnable comparison follows the list below):
- Leaky ReLU: `max(0.01x, x)` — small gradient when negative
- PReLU: `max(αx, x)` — learnable α
- ELU, GELU — smooth alternatives
- Careful initialization (don’t start with large negative biases)
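A short runnable comparison (an illustrative sketch) of why a Leaky ReLU can revive a neuron that plain ReLU leaves dead:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0], requires_grad=True)   # a pre-activation stuck in negative territory

# ReLU: output 0 and gradient 0 → the weights feeding z never get a learning signal
F.relu(z).sum().backward()
print(z.grad)        # tensor([0.])

z.grad = None
# Leaky ReLU: a small negative slope keeps a nonzero gradient flowing
F.leaky_relu(z, negative_slope=0.01).sum().backward()
print(z.grad)        # tensor([0.0100])
```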
4. NaN Gradients
Symptom: Loss becomes NaN during training.
Common causes:
- Division by zero: `1/x` where x→0
- Log of zero: `log(0) = -∞`
- Exploding gradients: eventually overflow
- Learning rate too high: Weights jump to bad regions
Debugging strategy:
```python
# Add hooks to detect NaN
def check_nan(grad):
    if torch.isnan(grad).any():
        print(f"NaN detected! grad stats: min={grad.min()}, max={grad.max()}")
        import pdb; pdb.set_trace()
    return grad

for name, param in model.named_parameters():
    param.register_hook(lambda grad, name=name: check_nan(grad))
```

Gradient Flow Visualization
Think of gradient visualization like an X-ray for your neural network. Just as doctors use X-rays to see what’s happening inside a patient, gradient histograms and flow diagrams reveal the health of your network’s learning process—showing where information flows freely and where it gets blocked or distorted.
Understanding how gradients flow helps debug networks.
Gradient Histograms
```python
def plot_gradient_histograms(model):
    """
    Plot histogram of gradients for each layer.

    Healthy gradients should:
    - Have similar scale across layers
    - Not be all zeros
    - Not have extreme values
    """
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 3, figsize=(12, 8))
    axes = axes.flatten()

    for idx, (name, param) in enumerate(model.named_parameters()):
        if param.grad is not None and idx < 6:
            grad = param.grad.detach().cpu().numpy().flatten()
            axes[idx].hist(grad, bins=50)
            axes[idx].set_title(f'{name}\nmean={grad.mean():.2e}, std={grad.std():.2e}')
            axes[idx].set_xlabel('Gradient value')

    plt.tight_layout()
    return fig
```

Gradient Norms per Layer
```python
def compute_gradient_norms(model):
    """
    Compute gradient L2 norm for each layer.

    Plot over training to detect:
    - Vanishing: norms decrease in early layers
    - Exploding: norms increase in early layers
    """
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.norm().item()
    return norms
```

What to Look For
| Pattern | Interpretation | Action |
|---|---|---|
| Gradients near 0 in early layers | Vanishing gradients | Add skip connections, change activation |
| Gradients 100x larger in early layers | Exploding gradients | Clip gradients, reduce LR |
| Some parameters always 0 | Dead neurons | Use Leaky ReLU, check init |
| Gradients NaN | Numerical instability | Check for log(0), div by 0 |
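One way to put this table to work is to log compute_gradient_norms (defined above) periodically during training and plot each layer’s norm over time. A self-contained sketch with synthetic data for illustration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

history = []
for step in range(200):
    x, y = torch.randn(32, 20), torch.randn(32, 1)   # synthetic batch for illustration

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    if step % 50 == 0:
        history.append(compute_gradient_norms(model))  # dict: layer name -> L2 norm

    optimizer.step()

# Each entry is one snapshot; plotting a layer's norm across snapshots
# makes the vanishing/exploding patterns in the table easy to spot.
print(history[0])
```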
Advanced Topics
Section titled “Advanced Topics”Forward-Mode vs Reverse-Mode Autodiff
There are two ways to compute gradients:
Reverse-mode (backpropagation):
- Compute ∂output/∂all_inputs in one backward pass
- Cost: O(1) backward passes regardless of input dimension
- Best when: many inputs, few outputs (most neural networks!)
Forward-mode:
- Compute ∂all_outputs/∂one_input per pass
- Cost: O(n) forward passes for n inputs
- Best when: few inputs, many outputs (rare in deep learning)
Neural network: 1M parameters → 1 loss value
- Reverse mode: 1 backward pass to get all 1M gradients
- Forward mode: 1M forward passes (one per parameter)

This is why backpropagation dominates deep learning!
Did You Know? Forward-mode autodiff is used in some scientific computing applications where you have few inputs and many outputs. Julia’s ForwardDiff.jl is a popular implementation. Some researchers are exploring “mixed-mode” autodiff that combines both for certain architectures.
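To see the asymmetry in code, PyTorch exposes both modes through torch.autograd.functional (a sketch; jvp computes one directional derivative per call, while vjp yields the full gradient of a scalar output in one pass):

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    return (x ** 2).sum()            # many inputs → one scalar output

x = torch.randn(5)

# Reverse mode (vjp): one pass yields the full gradient w.r.t. all 5 inputs
_, grad = vjp(f, x, torch.tensor(1.0))
print(grad)                          # equals 2*x

# Forward mode (jvp): one pass yields only a directional derivative,
# so recovering the full gradient needs one pass per input direction
_, directional = jvp(f, x, torch.ones_like(x))
print(directional)                   # equals (2*x).sum()
```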
Higher-Order Gradients
Sometimes we need gradients of gradients (Hessians, second derivatives).
```python
import torch

# PyTorch example: computing a Hessian-vector product
x = torch.randn(10, requires_grad=True)
y = (x ** 3).sum()

# First derivative
grad_y = torch.autograd.grad(y, x, create_graph=True)[0]

# Second derivative (gradient of gradient)
grad_grad_y = torch.autograd.grad(grad_y.sum(), x)[0]

print(f"y = sum(x³)")
print(f"dy/dx = 3x² → {grad_y[:3]}")
print(f"d²y/dx² = 6x → {grad_grad_y[:3]}")
```

Uses:
- Second-order optimization (Newton’s method, natural gradient)
- Regularization (penalizing sharp minima)
- Meta-learning (learning to learn)
Checkpointing for Memory Efficiency
Deep networks require storing all activations for backprop. This can exhaust GPU memory.
Gradient checkpointing trades compute for memory:
- Don’t store intermediate activations
- Recompute them during backward pass
- Reduces memory from O(n) to O(√n) with √n checkpoints
```python
# PyTorch gradient checkpointing
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # Recompute forward pass during backward
        return checkpoint(self.block, x, use_reentrant=False)
```

Practical Debugging Guide
Systematic Debugging Steps
1. Verify data pipeline
   - Are inputs normalized?
   - Are labels correct?
   - Any NaN/Inf in inputs?
2. Check initialization
   - Print initial weight stats
   - Verify no NaN in initial forward pass
3. Start simple
   - Can you overfit one batch?
   - If not, something is fundamentally broken
4. Monitor gradient flow
   - Plot gradient norms per layer
   - Look for vanishing/exploding patterns
5. Gradient checking
   - Verify backprop is correct for custom operations
   - Use small networks for speed
Debugging Checklist
```python
def debug_training(model, batch, loss_fn):
    """Comprehensive debugging for one batch."""

    # 1. Check input
    x, y = batch
    print(f"Input: shape={x.shape}, range=[{x.min():.2f}, {x.max():.2f}]")
    print(f"Labels: shape={y.shape}, unique={torch.unique(y).tolist()}")
    assert not torch.isnan(x).any(), "NaN in input!"

    # 2. Forward pass
    model.train()
    output = model(x)
    print(f"Output: shape={output.shape}, range=[{output.min():.2f}, {output.max():.2f}]")
    assert not torch.isnan(output).any(), "NaN in output!"

    # 3. Loss
    loss = loss_fn(output, y)
    print(f"Loss: {loss.item():.4f}")
    assert not torch.isnan(loss), "NaN loss!"

    # 4. Backward pass
    loss.backward()

    # 5. Check gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad = param.grad
            print(f"{name}: grad_norm={grad.norm():.2e}, "
                  f"grad_range=[{grad.min():.2e}, {grad.max():.2e}]")
            assert not torch.isnan(grad).any(), f"NaN gradient in {name}!"

    print("All checks passed!")
```

Hands-On Exercises
These exercises will solidify your understanding of backpropagation by having you implement and debug gradient computations yourself.
Exercise 1: Extend the Autograd Engine
Take the Value class we built earlier and extend it with additional operations.
Your task: Implement these missing methods:
```python
class Value:
    # ... (existing code from above)

    def log(self):
        """Natural logarithm. d(log(x))/dx = 1/x"""
        import math
        # YOUR CODE HERE
        pass

    def sigmoid(self):
        """Sigmoid function. σ(x) = 1/(1+e^(-x)), σ'(x) = σ(x)(1-σ(x))"""
        # YOUR CODE HERE
        pass

    def __matmul__(self, other):
        """
        Implement dot product for two Value vectors.
        This is trickier - think about how gradients flow
        through a sum of products.
        """
        # YOUR CODE HERE
        pass
```

Solution approach:
```python
def log(self):
    import math
    out = Value(math.log(self.data), (self,), 'log')

    def _backward():
        # d(log(x))/dx = 1/x
        self.grad += out.grad * (1.0 / self.data)
    out._backward = _backward

    return out

def sigmoid(self):
    import math
    s = 1.0 / (1.0 + math.exp(-self.data))
    out = Value(s, (self,), 'sigmoid')

    def _backward():
        # d(sigmoid(x))/dx = sigmoid(x) * (1 - sigmoid(x))
        self.grad += out.grad * s * (1 - s)
    out._backward = _backward

    return out
```

Verify your implementation:

```python
# Test log
x = Value(2.0)
y = x.log()
y.backward()
print(f"log(2) = {y.data:.4f}, should be ~0.6931")
print(f"d(log(x))/dx at x=2 is {x.grad:.4f}, should be 0.5")

# Test sigmoid
x = Value(0.0)
y = x.sigmoid()
y.backward()
print(f"sigmoid(0) = {y.data:.4f}, should be 0.5")
print(f"d(sigmoid(x))/dx at x=0 is {x.grad:.4f}, should be 0.25")
```

Exercise 2: Trace Through Backprop by Hand
Manually compute gradients for this computation graph:

L = (a * b + c)² where a=2, b=3, c=1

Step 1: Draw the graph

```
a(2)   b(3)
  \     /
   [*]
    |
    v
  z1(6)   c(1)
     \     /
      [+]
       |
       v
     z2(7)
       |
      [²]
       |
       v
     L(49)
```

Step 2: Forward pass (verify values)
- z1 = a * b = 2 * 3 = 6
- z2 = z1 + c = 6 + 1 = 7
- L = z2² = 49
Step 3: Backward pass (compute gradients)
- dL/dL = 1 (start here)
- dL/dz2 = dL/dL * d(z2²)/dz2 = 1 * 2·z2 = 2·7 = 14
- dL/dz1 = dL/dz2 * d(z1+c)/dz1 = 14 * 1 = 14
- dL/dc = dL/dz2 * d(z1+c)/dc = 14 * 1 = 14
- dL/da = dL/dz1 * d(a*b)/da = 14 * b = 14 * 3 = 42
- dL/db = dL/dz1 * d(a*b)/db = 14 * a = 14 * 2 = 28
Your task: Verify with our autograd engine:
```python
a = Value(2.0)
b = Value(3.0)
c = Value(1.0)
L = (a * b + c) ** 2
L.backward()

print(f"L = {L.data}")      # Should be 49
print(f"dL/da = {a.grad}")  # Should be 42
print(f"dL/db = {b.grad}")  # Should be 28
print(f"dL/dc = {c.grad}")  # Should be 14
```

Exercise 3: Implement Gradient Clipping
Gradient clipping prevents exploding gradients by limiting gradient magnitudes.
Your task: Implement two types of gradient clipping:
```python
def clip_grad_value(parameters, clip_value):
    """
    Clip gradients by value.
    Each gradient component is clipped to [-clip_value, clip_value].

    Example: grad = [5, -10, 3] with clip_value=4 becomes [4, -4, 3]
    """
    # YOUR CODE HERE
    for param in parameters:
        if param.grad is not None:
            param.grad.data.clamp_(-clip_value, clip_value)


def clip_grad_norm(parameters, max_norm):
    """
    Clip gradients by global norm.
    All gradients are scaled so their combined L2 norm <= max_norm.

    Example: If total norm is 10 and max_norm is 5, scale all grads by 0.5
    """
    # YOUR CODE HERE
    # Materialize in case a generator (e.g. model.parameters()) was passed,
    # since we iterate over the parameters twice.
    parameters = list(parameters)

    total_norm = 0.0
    for param in parameters:
        if param.grad is not None:
            total_norm += param.grad.data.pow(2).sum()
    total_norm = total_norm.sqrt()

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for param in parameters:
            if param.grad is not None:
                param.grad.data.mul_(clip_coef)

    return total_norm
```

Test your implementation:
```python
import torch

# Create a model with exploding gradients
model = torch.nn.Linear(10, 10)
x = torch.randn(1, 10) * 100  # Large input
y = model(x)
loss = y.sum()
loss.backward()

# Print gradient norms before clipping
print("Before clipping:")
for name, param in model.named_parameters():
    print(f"  {name}: grad_norm = {param.grad.norm():.2f}")

# Clip gradients
clip_grad_norm(model.parameters(), max_norm=1.0)

# Print gradient norms after clipping
print("After clipping (max_norm=1.0):")
total = 0
for name, param in model.named_parameters():
    print(f"  {name}: grad_norm = {param.grad.norm():.2f}")
    total += param.grad.norm() ** 2
print(f"  Total norm: {total.sqrt():.2f}")  # Should be <= 1.0
```

Exercise 4: Debug a Broken Training Loop
The following training loop has bugs that cause NaN loss. Find and fix them.
```python
import torch
import torch.nn as nn

def broken_training():
    """This training loop will produce NaN. Fix the bugs!"""

    # Bug 1: No seed for reproducibility
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.Sigmoid(),  # Bug 2: Sigmoid can cause vanishing gradients in deep nets
        nn.Linear(50, 50),
        nn.Sigmoid(),
        nn.Linear(50, 1)
    )

    optimizer = torch.optim.SGD(model.parameters(), lr=10.0)  # Bug 3: LR too high

    for epoch in range(100):
        x = torch.randn(32, 10)
        y = torch.randn(32, 1)

        output = model(x)
        loss = torch.log(output)  # Bug 4: log of potentially negative values!
        loss = loss.mean()

        # Bug 5: No zero_grad!
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss = {loss.item():.4f}")

# Run it - watch it fail
broken_training()
```

Your task: Fix all 5 bugs and get the training to work.
Fixed version:
```python
def fixed_training():
    """Corrected training loop."""

    # Fix 1: Set seed for reproducibility
    torch.manual_seed(42)

    # Fix 2: Use ReLU instead of Sigmoid
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 50),
        nn.ReLU(),
        nn.Linear(50, 1)
    )

    # Fix 3: Use reasonable learning rate
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        x = torch.randn(32, 10)
        y = torch.randn(32, 1)

        # Fix 5: Zero gradients at start of each iteration
        optimizer.zero_grad()

        output = model(x)
        # Fix 4: Use proper loss function (MSE instead of log)
        loss = nn.functional.mse_loss(output, y)

        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss = {loss.item():.4f}")

fixed_training()
```

Exercise 5: Visualize Gradient Flow
Build a tool to visualize how gradients flow through a network.
Your task: Implement this gradient visualization:
```python
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

def visualize_gradient_flow(model, sample_input, sample_target, loss_fn):
    """
    Creates a visualization of gradient magnitudes through the network.

    This helps diagnose:
    - Vanishing gradients (later layers have much larger gradients)
    - Exploding gradients (earlier layers have much larger gradients)
    - Dead neurons (gradients are exactly zero)
    """
    # Forward pass
    output = model(sample_input)
    loss = loss_fn(output, sample_target)

    # Backward pass
    loss.backward()

    # Collect gradient statistics
    layer_names = []
    grad_means = []
    grad_stds = []
    grad_maxs = []

    for name, param in model.named_parameters():
        if param.grad is not None:
            grad = param.grad.detach().cpu().numpy()
            layer_names.append(name)
            grad_means.append(abs(grad).mean())
            grad_stds.append(grad.std())
            grad_maxs.append(abs(grad).max())

    # Plot
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    x = range(len(layer_names))

    axes[0].bar(x, grad_means)
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(layer_names, rotation=45, ha='right')
    axes[0].set_title('Mean |Gradient|')
    axes[0].set_yscale('log')

    axes[1].bar(x, grad_stds)
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(layer_names, rotation=45, ha='right')
    axes[1].set_title('Gradient Std Dev')
    axes[1].set_yscale('log')

    axes[2].bar(x, grad_maxs)
    axes[2].set_xticks(x)
    axes[2].set_xticklabels(layer_names, rotation=45, ha='right')
    axes[2].set_title('Max |Gradient|')
    axes[2].set_yscale('log')

    plt.tight_layout()
    return fig


# Test it
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))

fig = visualize_gradient_flow(model, x, y, nn.CrossEntropyLoss())
plt.show()
```

Questions to answer after running:
- Are the gradient magnitudes similar across layers, or do they vary wildly?
- What happens if you add more layers?
- What happens if you replace ReLU with Sigmoid?
- What happens if you add BatchNorm?
Exercise 6: Implement Your Own Autograd for a Neural Network
This is the capstone exercise. Build a 2-layer neural network using only our Value class, train it on XOR, and verify it learns.
```python
# The XOR problem - not linearly separable!
# Input: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0

import random

def train_xor():
    """Train a 2-layer network on XOR using our custom autograd."""
    random.seed(42)

    # Network: 2 -> 4 -> 1
    # Initialize weights
    W1 = [[Value(random.uniform(-1, 1)) for _ in range(2)] for _ in range(4)]
    b1 = [Value(0) for _ in range(4)]
    W2 = [Value(random.uniform(-1, 1)) for _ in range(4)]
    b2 = Value(0)

    def forward(x1, x2):
        """Forward pass through the network."""
        x = [Value(x1), Value(x2)]

        # Hidden layer
        hidden = []
        for i in range(4):
            z = W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]
            hidden.append(z.tanh())

        # Output layer
        out = (W2[0] * hidden[0] + W2[1] * hidden[1] +
               W2[2] * hidden[2] + W2[3] * hidden[3] + b2)
        return out.tanh()

    # Training data
    data = [
        ((0, 0), 0),
        ((0, 1), 1),
        ((1, 0), 1),
        ((1, 1), 0),
    ]

    # Training loop
    learning_rate = 0.5
    params = [w for row in W1 for w in row] + b1 + W2 + [b2]

    for epoch in range(1000):
        total_loss = 0

        for (x1, x2), target in data:
            # Forward
            pred = forward(x1, x2)
            loss = (pred - Value(target)) ** 2
            total_loss += loss.data

            # Backward
            for p in params:
                p.grad = 0  # Zero gradients
            loss.backward()

            # Update
            for p in params:
                p.data -= learning_rate * p.grad

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: loss = {total_loss:.4f}")

    # Test
    print("\nFinal predictions:")
    for (x1, x2), target in data:
        pred = forward(x1, x2)
        print(f"  ({x1}, {x2}) -> {pred.data:.3f} (target: {target})")

train_xor()
```

Expected output (approximately):
```
Epoch 0: loss = 1.8234
Epoch 100: loss = 0.9821
Epoch 200: loss = 0.3456
...
Epoch 900: loss = 0.0123

Final predictions:
  (0, 0) -> 0.012 (target: 0)
  (0, 1) -> 0.987 (target: 1)
  (1, 0) -> 0.991 (target: 1)
  (1, 1) -> 0.015 (target: 0)
```

Quiz: Test Your Understanding
Q1: Why do we process the computation graph in reverse order during backprop?
Answer
We need to compute gradients from the output back to the inputs because:
- The chain rule multiplies local gradients — we need ∂L/∂output first before computing ∂L/∂earlier_nodes
- Each node needs the gradient from its outputs to compute the gradient for its inputs
- Reverse topological order ensures that when we process a node, all its outputs have already been processed
This is why it’s called “back”-propagation — we propagate errors backward through the graph.
Q2: What’s the computational complexity advantage of reverse-mode autodiff over forward-mode?
Answer
For a function with n inputs and m outputs:
- Reverse-mode: O(m) backward passes to get all gradients
- Forward-mode: O(n) forward passes to get all gradients
In neural networks, we typically have millions of parameters (n) but one scalar loss (m=1). So reverse-mode needs just 1 backward pass to get gradients for all parameters, while forward-mode would need millions of passes.
This O(1) vs O(n) difference is why backpropagation (reverse-mode) is used in deep learning.
Q3: A network has vanishing gradients. What are three solutions?
Answer
1. Use ReLU instead of sigmoid/tanh: ReLU has gradient 1 for positive inputs, preventing multiplicative shrinking
2. Add skip/residual connections: Allow gradients to flow directly through addition: ∂(x + f(x))/∂x = 1 + ∂f/∂x, so even if ∂f/∂x→0, gradient is still 1
3. Use batch normalization: Normalizes activations to prevent them from saturating in sigmoid/tanh regions
Other solutions: proper initialization (Xavier/He), gradient clipping (for exploding), LSTM/GRU gating (for RNNs).
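A tiny sketch (illustrative) of the residual-connection argument in solution 2: even with a saturated activation, the skip path keeps the gradient near 1.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

f = torch.sigmoid(10 * x)     # saturated sigmoid: f'(x) is tiny here
y = x + f                     # residual-style connection: y = x + f(x)
y.backward()

print(x.grad)                 # ≈ 1.0005: equals 1 + f'(x), so the gradient survives
```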
Q4: Your loss becomes NaN after 100 epochs. How do you debug?
Answer
Systematic approach:
1. Add gradient hooks to detect when NaN first appears
2. Binary search on epochs to find when it starts
3. Check for:
   - Division by zero: Look for `/` operations with potentially zero denominators
   - Log of zero: `log(x)` where x could be 0 or negative
   - Exploding gradients: Print gradient norms, look for exponential growth
   - Large learning rate: Try 10x smaller LR
4. Reproduce with deterministic seed for consistent debugging
5. Simplify: Remove layers until you find the culprit
Q5: What is gradient checking and when would you use it?
Answer
Gradient checking compares analytical gradients (from backprop) with numerical gradients (from finite differences):
numerical_grad ≈ [f(x+ε) - f(x-ε)] / (2ε)

Use it when:
- Implementing new layers or custom operations
- Debugging unexpected training behavior
- Verifying complex loss functions
Don’t use it:
- In production (too slow — O(n) forward passes)
- On large networks (use on small test cases)
A relative error < 1e-5 indicates correct implementation.
Summary
Section titled “Summary”You’ve learned:
- The chain rule is the mathematical foundation of all neural network learning
- Computational graphs track operations for automatic differentiation
- Reverse-mode autodiff (backprop) computes all gradients in one backward pass
- Building autograd from scratch shows there’s no magic — just calculus
- Gradient checking verifies implementations with numerical approximation
- Common problems (vanishing, exploding, dead ReLU) and their solutions
- Debugging techniques for systematic troubleshooting
Every time you call loss.backward(), you now understand exactly what happens inside!
Further Reading
Section titled “Further Reading”Essential Resources
- Andrej Karpathy’s micrograd — The simplest autograd engine (~100 lines)
- CS231n Backpropagation notes — Stanford’s excellent visual explanation
- “Calculus on Computational Graphs” by Chris Olah
Historical Papers
- “Learning representations by back-propagating errors” (Rumelhart, Hinton & Williams, 1986) — the paper that popularized backpropagation
- “Automatic Differentiation in Machine Learning: a Survey” (Baydin et al., 2018) — comprehensive survey of autodiff methods
The Eureka Moment (Revisited)
Backpropagation is just the chain rule applied systematically.
When you call loss.backward():
- The framework has been recording every operation you performed (building a computational graph)
- It starts at the loss (gradient = 1)
- It walks backward through the graph, applying the chain rule at each operation
- At each node, it multiplies the incoming gradient by the local gradient
- When it reaches a parameter, that accumulated gradient tells us how to update it
This elegant algorithm, combined with GPUs and big data, is why deep learning works. From a simple linear regression to GPT-5, the learning mechanism is fundamentally the same.
You now understand the engine that powers all of modern AI!
Next Steps
Congratulations! You’ve completed Phase 6: Deep Learning Foundations!
You now understand:
- Python for ML (NumPy, Pandas, scikit-learn)
- Neural networks from scratch
- PyTorch fundamentals
- Training techniques (optimizers, regularization, scheduling)
- CNNs for computer vision
- Transformers and attention
- Backpropagation and autograd
Move on to Phase 7: Advanced Generative AI where you’ll learn:
- Fine-tuning and PEFT methods
- Reinforcement Learning from Human Feedback (RLHF)
- Diffusion models and image generation
- Multimodal models
- Efficient inference
Last updated: 2025-11-27 Status: Complete