Skip to content

Transformers from Scratch

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 8-10

Or: The Algorithm That Taught Machines to Learn (And How to Build It Yourself)

Section titled “Or: The Algorithm That Taught Machines to Learn (And How to Build It Yourself)”

Reading Time: 6-7 hours Prerequisites: Module 30


San Diego, California. July 1969. 3:47 PM.

Seppo Linnainmaa, a young Finnish computer scientist, had just finished his master’s thesis. The paper described an elegant algorithm for computing derivatives automatically—what he called “reverse mode automatic differentiation.” It was mathematically beautiful, but no one cared. Computers were too slow, and the problems it could solve seemed too small to matter.

The idea lay dormant for almost two decades.

Then in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” in Nature. They showed how Linnainmaa’s algorithm could train neural networks with multiple hidden layers—something thought impossible at the time. The paper made backpropagation famous.

But even then, the world wasn’t ready. Neural networks were slow, finicky, and often outperformed by simpler methods. Backprop was nearly forgotten again in the 1990s when support vector machines dominated.

It took until 2012—when Alex Krizhevsky’s AlexNet crushed the ImageNet competition using GPUs—for backpropagation to claim its throne. The algorithm that almost died twice is now the heartbeat of every AI system on Earth.

“Backpropagation is a beautiful algorithm. It’s not deep learning’s breakthrough—it’s its foundation. Everything else is built on top of it.” — Geoffrey Hinton, Nobel Prize acceptance speech, 2024


By the end of this module, you will:

  • Truly understand what backpropagation actually does
  • Master the chain rule and see why it’s the key to everything
  • Build a working autograd engine from scratch
  • Understand computational graphs and how frameworks track operations
  • Know how to debug gradient issues (vanishing, exploding, NaN)
  • Implement gradient checking to verify your implementations
  • Visualize gradient flow through networks

** Did You Know?** The term “backpropagation” wasn’t always the preferred name. Paul Werbos originally called it “dynamic feedback” in his 1974 PhD thesis. The name “backpropagation” (or “back-propagation”) was popularized by Rumelhart, Hinton, and Williams in their famous 1986 Nature paper. In the control theory community, the same algorithm is known as “adjoint sensitivity analysis.” Meanwhile, in the automatic differentiation community, it’s called “reverse-mode automatic differentiation.” All these names describe the same fundamental algorithm!


The Heureka Moment: What You’re About to Discover

Section titled “The Heureka Moment: What You’re About to Discover”

Think of loss.backward() like pressing “show work” on a calculator that solved a complex equation. The calculator (PyTorch) has been keeping notes about every operation it performed. When you press backward, it traces through those notes in reverse, computing exactly how much each input contributed to the final answer.

Every time you call loss.backward() in PyTorch, something magical happens. The framework somehow figures out how to adjust millions of parameters to make the loss smaller. But how?

The answer is backpropagation — short for “backward propagation of errors.” It’s not magic; it’s calculus applied cleverly. By the end of this module, you’ll be able to implement it yourself from scratch.

Here’s the transformative insight: backpropagation is just the chain rule applied systematically. Once you see it, you can’t unsee it. Every neural network, from a simple perceptron to gpt-5, learns using the same fundamental algorithm.


Consider a simple network:

Input x → [Layer 1] → [Layer 2] → [Layer 3] → Output ŷ → Loss L
W₁, b₁ W₂, b₂ W₃, b₃

We want to minimize the loss L. To do that, we need to know:

  • How should we change W₁ to reduce L?
  • How should we change W₂ to reduce L?
  • How should we change W₃ to reduce L?

Mathematically, we need: ∂L/∂W₁, ∂L/∂W₂, ∂L/∂W₃

But here’s the challenge: W₁ affects the output through a long chain of operations. How do we compute its effect on the loss?


The Chain Rule: The Foundation of Everything

Section titled “The Chain Rule: The Foundation of Everything”

Think of the chain rule like a relay race. When you want to know how fast the final runner is going, you need to know how fast each handoff affected the next. If runner 1 speeds up, runner 2 speeds up, which speeds up runner 3. The total effect is the product of all individual effects.

In backpropagation, we’re asking: “If I wiggle this weight, how much does the loss wiggle?” The answer flows backward through each operation, multiplying effects at each step.

If y = f(g(x)), then:

dy/dx = dy/dg · dg/dx

Worked Example:

Let y = (2x + 1)³

  • Let g = 2x + 1, so y = g³
  • dy/dg = 3g² = 3(2x + 1)²
  • dg/dx = 2
  • Therefore: dy/dx = 3(2x + 1)² · 2 = 6(2x + 1)²

Let’s verify at x = 1:

  • y = (2·1 + 1)³ = 27
  • dy/dx = 6(2·1 + 1)² = 6 · 9 = 54

Numerical check: (y(1.001) - y(1))/0.001 = (27.162… - 27)/0.001 ≈ 54

If L depends on x through multiple paths:

L = f(a, b) where a = g(x) and b = h(x)
∂L/∂x = (∂L/∂a)(∂a/∂x) + (∂L/∂b)(∂b/∂x)

We sum the contributions from all paths.

Did You Know? The chain rule was formalized by Gottfried Wilhelm Leibniz in the late 1600s. Over 300 years later, it became the mathematical foundation of deep learning. Leibniz couldn’t have imagined that his calculus would one day power language models and self-driving cars!


Think of a computational graph like a detailed recipe card that records every step of cooking. When you make a cake, you don’t just remember “I made a cake”—you remember “I mixed flour with sugar, then added eggs, then baked for 30 minutes.” If the cake tastes too sweet, you can trace back through the recipe to find which step added too much sugar and adjust it. Computational graphs work the same way: they record every mathematical operation so we can trace backward and figure out which weights need adjusting.

Modern deep learning frameworks represent computations as directed acyclic graphs (DAGs). Each node is an operation, and edges represent data flow.

Consider: L = (x · w + b)²

The computational graph looks like:

x w
\ /
\ /
[*] ← multiply
|
v
z₁ = x · w
|
+ ← b
|
v
z₂ = z₁ + b
|
[²] ← square
|
v
L = z₂²

Compute values from inputs to outputs:

# Forward pass
x, w, b = 2.0, 3.0, 1.0
z1 = x * w # z1 = 6
z2 = z1 + b # z2 = 7
L = z2 ** 2 # L = 49

Compute gradients from outputs to inputs:

# Backward pass - apply chain rule at each node
dL_dL = 1.0 # Start with 1
# d(z²)/dz = 2z
dL_dz2 = dL_dL * 2 * z2 # dL/dz2 = 1 * 2 * 7 = 14
# d(z1 + b)/dz1 = 1, d(z1 + b)/db = 1
dL_dz1 = dL_dz2 * 1 # dL/dz1 = 14
dL_db = dL_dz2 * 1 # dL/db = 14
# d(x * w)/dx = w, d(x * w)/dw = x
dL_dx = dL_dz1 * w # dL/dx = 14 * 3 = 42
dL_dw = dL_dz1 * x # dL/dw = 14 * 2 = 28

Verification:

  • L = (xw + b)² = (2·3 + 1)² = 49
  • ∂L/∂w = 2(xw + b) · x = 2·7·2 = 28
  • ∂L/∂x = 2(xw + b) · w = 2·7·3 = 42
  • ∂L/∂b = 2(xw + b) · 1 = 2·7·1 = 14

Let’s build a simple but complete autograd system. This is how PyTorch works at its core!

class Value:
"""
A scalar value that tracks its computation history for automatic differentiation.
This is a simplified version of how PyTorch tensors work internally.
"""
def __init__(self, data, _children=(), _op=''):
self.data = data
self.grad = 0.0 # Gradient starts at 0
# Internal variables for autograd
self._backward = lambda: None # Function to compute gradient
self._prev = set(_children) # Parent nodes in the graph
self._op = _op # Operation that created this node
def __repr__(self):
return f"Value(data={self.data}, grad={self.grad})"
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
# d(a + b)/da = 1, d(a + b)/db = 1
self.grad += out.grad * 1.0
other.grad += out.grad * 1.0
out._backward = _backward
return out
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
# d(a * b)/da = b, d(a * b)/db = a
self.grad += out.grad * other.data
other.grad += out.grad * self.data
out._backward = _backward
return out
def __pow__(self, n):
assert isinstance(n, (int, float)), "only int/float powers supported"
out = Value(self.data ** n, (self,), f'**{n}')
def _backward():
# d(x^n)/dx = n * x^(n-1)
self.grad += out.grad * (n * self.data ** (n - 1))
out._backward = _backward
return out
def __neg__(self):
return self * -1
def __sub__(self, other):
return self + (-other)
def __truediv__(self, other):
return self * other ** -1
def __radd__(self, other):
return self + other
def __rmul__(self, other):
return self * other
def relu(self):
out = Value(max(0, self.data), (self,), 'ReLU')
def _backward():
# d(ReLU(x))/dx = 1 if x > 0 else 0
self.grad += out.grad * (1.0 if self.data > 0 else 0.0)
out._backward = _backward
return out
def tanh(self):
import math
t = math.tanh(self.data)
out = Value(t, (self,), 'tanh')
def _backward():
# d(tanh(x))/dx = 1 - tanh(x)²
self.grad += out.grad * (1 - t ** 2)
out._backward = _backward
return out
def exp(self):
import math
out = Value(math.exp(self.data), (self,), 'exp')
def _backward():
# d(e^x)/dx = e^x
self.grad += out.grad * out.data
out._backward = _backward
return out
def backward(self):
"""
Compute gradients for all nodes in the graph using reverse-mode autodiff.
We do a topological sort to ensure we process nodes in the right order
(children before parents).
"""
# Topological sort
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
# Start with gradient of 1 for the output
self.grad = 1.0
# Apply chain rule in reverse topological order
for v in reversed(topo):
v._backward()

Notice how each operation stores a _backward function that knows how to compute its local gradient. When we call backward(), we traverse the graph in reverse order, applying the chain rule at each node.

# Test 1: Simple expression
x = Value(2.0)
w = Value(3.0)
b = Value(1.0)
# L = (x*w + b)²
L = (x * w + b) ** 2
L.backward()
print(f"L = {L.data}") # 49.0
print(f"dL/dx = {x.grad}") # 42.0
print(f"dL/dw = {w.grad}") # 28.0
print(f"dL/db = {b.grad}") # 14.0
# Test 2: Neural network forward pass
def neuron(x, w, b):
return (x * w + b).tanh()
x = Value(1.0)
w = Value(0.5)
b = Value(0.1)
y = neuron(x, w, b)
y.backward()
print(f"\nNeuron output: {y.data:.4f}")
print(f"dy/dw = {w.grad:.4f}")

Did You Know? This exact approach — building a computation graph during forward pass and traversing it backward — is how PyTorch’s autograd works. The main difference is that PyTorch handles tensors instead of scalars and has optimized C++ implementations. Andrej Karpathy’s “micrograd” (which inspired this implementation) proved you can build a functional autograd in ~100 lines of Python!


Think of backpropagation like a blame assignment meeting after a project fails. The final result (loss) was bad, so you need to figure out who’s responsible. You start at the end: “The final report was wrong.” Then you trace back: “Because the calculations were off.” Then: “Because the data entry had mistakes.” Each person (weight) gets assigned exactly their share of the blame (gradient), proportional to how much they contributed to the problem. This “blame” is then used to adjust behavior (update weights) for next time.

Now let’s see how backpropagation works in a full neural network.

For a layer: y = σ(Wx + b)

# Forward pass
z = W @ x + b # Linear transformation
y = activation(z) # Apply activation

Given ∂L/∂y (gradient from the layer above), compute:

  • ∂L/∂W (to update weights)
  • ∂L/∂b (to update biases)
  • ∂L/∂x (to pass to the layer below)
∂L/∂z = ∂L/∂y · ∂y/∂z = ∂L/∂y · σ'(z) # Gradient through activation
∂L/∂W = ∂L/∂z · ∂z/∂W = ∂L/∂z · x^T # Weight gradient
∂L/∂b = ∂L/∂z · ∂z/∂b = ∂L/∂z # Bias gradient
∂L/∂x = ∂L/∂z · ∂z/∂x = W^T · ∂L/∂z # Input gradient (for previous layer)

Network: x → Linear(2, 3) → ReLU → Linear(3, 1) → MSE Loss

import numpy as np
# Initialize
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1 # 3 neurons, 2 inputs
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3) * 0.1 # 1 output, 3 inputs
b2 = np.zeros((1, 1))
# Single training example
x = np.array([[1.0], [2.0]]) # Input (2,)
y_true = np.array([[1.0]]) # Target
# === FORWARD PASS ===
z1 = W1 @ x + b1 # (3, 1)
a1 = np.maximum(0, z1) # ReLU
z2 = W2 @ a1 + b2 # (1, 1)
y_pred = z2 # No activation on output
loss = 0.5 * (y_pred - y_true) ** 2
print(f"Forward pass:")
print(f" z1 shape: {z1.shape}, a1 shape: {a1.shape}")
print(f" y_pred: {y_pred[0,0]:.4f}, loss: {loss[0,0]:.4f}")
# === BACKWARD PASS ===
# Start from loss
dL_dy = y_pred - y_true # d(0.5(y-t)²)/dy = (y-t)
# Through layer 2
dL_dz2 = dL_dy # No activation
dL_dW2 = dL_dz2 @ a1.T # (1, 3)
dL_db2 = dL_dz2 # (1, 1)
dL_da1 = W2.T @ dL_dz2 # (3, 1)
# Through ReLU
dL_dz1 = dL_da1 * (z1 > 0) # ReLU gradient: 1 if z>0, else 0
# Through layer 1
dL_dW1 = dL_dz1 @ x.T # (3, 2)
dL_db1 = dL_dz1 # (3, 1)
print(f"\nBackward pass:")
print(f" dL/dW2 shape: {dL_dW2.shape}")
print(f" dL/dW1 shape: {dL_dW1.shape}")

Gradient Checking: Verifying Your Implementation

Section titled “Gradient Checking: Verifying Your Implementation”

Think of gradient checking like double-checking your work on a math test. Your analytical solution (backprop) should give the same answer as the slow but reliable numerical method (finite differences). If they disagree significantly, you made a mistake somewhere.

How do you know your backprop is correct? Gradient checking compares analytical gradients (from backprop) with numerical gradients (from finite differences).

∂f/∂x ≈ [f(x + ε) - f(x - ε)] / (2ε)

This is the definition of a derivative, approximated with small ε.

def gradient_check(param, loss_fn, epsilon=1e-5):
"""
Compare analytical gradient with numerical gradient.
Args:
param: Parameter to check (numpy array)
loss_fn: Function that takes param and returns (loss, grad)
epsilon: Small number for finite differences
Returns:
relative_error: Should be < 1e-5 for correct implementation
"""
# Get analytical gradient
loss, analytical_grad = loss_fn(param)
# Compute numerical gradient
numerical_grad = np.zeros_like(param)
it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
idx = it.multi_index
original = param[idx]
# f(x + ε)
param[idx] = original + epsilon
loss_plus, _ = loss_fn(param)
# f(x - ε)
param[idx] = original - epsilon
loss_minus, _ = loss_fn(param)
# Numerical gradient
numerical_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)
# Restore
param[idx] = original
it.iternext()
# Relative error
diff = np.abs(analytical_grad - numerical_grad)
norm = np.abs(analytical_grad) + np.abs(numerical_grad) + 1e-8
relative_error = np.max(diff / norm)
return relative_error, analytical_grad, numerical_grad
# Example usage
def simple_loss(W):
"""L = sum(W²)"""
loss = np.sum(W ** 2)
grad = 2 * W
return loss, grad
W = np.random.randn(3, 4)
error, analytical, numerical = gradient_check(W, simple_loss)
print(f"Relative error: {error:.2e}") # Should be ~1e-10
  1. During development: When implementing new layers or loss functions
  2. Debugging NaN gradients: To isolate which operation is broken
  3. Custom autograd operations: To verify correctness
  4. Never in production: It’s O(n) forward passes per gradient, extremely slow

Did You Know? The backpropagation algorithm was independently discovered several times. Paul Werbos described it in his 1974 PhD thesis, but it wasn’t widely known. In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” in Nature, which finally brought backprop to mainstream attention. Hinton later won the Nobel Prize in Physics (2024) partly for this work!


Think of gradients like a whispered message in the “telephone game.” In a shallow network (few players), the message arrives mostly intact. But in a deep network (many players), the message can either fade to silence (vanishing gradients) or get wildly exaggerated (exploding gradients). Modern techniques like skip connections are like having players occasionally shout the original message directly across the room, bypassing the chain entirely.

Symptom: Early layers learn very slowly or not at all.

Cause: Gradients shrink as they propagate backward through many layers.

Sigmoid: σ'(x) = σ(x)(1 - σ(x)) ≤ 0.25
After 10 layers: gradient ≤ 0.25¹⁰ = 0.000001

Solutions:

  • Use ReLU or variants (Leaky ReLU, GELU)
  • Batch normalization
  • Residual connections
  • Proper initialization (Xavier, He)

Did You Know? The vanishing gradient problem plagued neural networks for nearly two decades! After the initial excitement of backpropagation in the 1980s, researchers found that networks deeper than 2-3 layers were nearly impossible to train. This “AI Winter” period lasted until ~2006 when Geoffrey Hinton showed that layer-by-layer pretraining could help. The real breakthrough came with ReLU activations (2010) and residual connections (2015), finally enabling the 100+ layer networks we use today.

Symptom: Loss becomes NaN, weights explode to infinity.

Cause: Gradients grow as they propagate, especially in RNNs.

Solutions:

  • Gradient clipping: grad = clip(grad, -max_val, max_val)
  • Proper initialization
  • Batch normalization
  • Learning rate warmup

Symptom: Some neurons never activate, always output 0.

Cause: If a neuron’s input is always negative, ReLU outputs 0, and gradient is 0, so it never updates.

# Dead neuron: if z < 0 always, then:
# Forward: ReLU(z) = 0
# Backward: dReLU/dz = 0 → no learning!

Solutions:

  • Leaky ReLU: max(0.01x, x) — small gradient when negative
  • PReLU: max(αx, x) — learnable α
  • ELU, GELU — smooth alternatives
  • Careful initialization (don’t start with large negative biases)

Symptom: Loss becomes NaN during training.

Common causes:

  1. Division by zero: 1/x where x→0
  2. Log of zero: log(0) = -∞
  3. Exploding gradients: Eventually overflow
  4. Learning rate too high: Weights jump to bad regions

Debugging strategy:

# Add hooks to detect NaN
def check_nan(grad):
if torch.isnan(grad).any():
print(f"NaN detected! grad stats: min={grad.min()}, max={grad.max()}")
import pdb; pdb.set_trace()
return grad
for name, param in model.named_parameters():
param.register_hook(lambda grad, name=name: check_nan(grad))

Think of gradient visualization like an X-ray for your neural network. Just as doctors use X-rays to see what’s happening inside a patient, gradient histograms and flow diagrams reveal the health of your network’s learning process—showing where information flows freely and where it gets blocked or distorted.

Understanding how gradients flow helps debug networks.

def plot_gradient_histograms(model):
"""
Plot histogram of gradients for each layer.
Healthy gradients should:
- Have similar scale across layers
- Not be all zeros
- Not have extreme values
"""
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
axes = axes.flatten()
for idx, (name, param) in enumerate(model.named_parameters()):
if param.grad is not None and idx < 6:
grad = param.grad.detach().cpu().numpy().flatten()
axes[idx].hist(grad, bins=50)
axes[idx].set_title(f'{name}\nmean={grad.mean():.2e}, std={grad.std():.2e}')
axes[idx].set_xlabel('Gradient value')
plt.tight_layout()
return fig
def compute_gradient_norms(model):
"""
Compute gradient L2 norm for each layer.
Plot over training to detect:
- Vanishing: norms decrease in early layers
- Exploding: norms increase in early layers
"""
norms = {}
for name, param in model.named_parameters():
if param.grad is not None:
norms[name] = param.grad.norm().item()
return norms
PatternInterpretationAction
Gradients near 0 in early layersVanishing gradientsAdd skip connections, change activation
Gradients 100x larger in early layersExploding gradientsClip gradients, reduce LR
Some parameters always 0Dead neuronsUse Leaky ReLU, check init
Gradients NaNNumerical instabilityCheck for log(0), div by 0

There are two ways to compute gradients:

Reverse-mode (backpropagation):

  • Compute ∂output/∂all_inputs in one backward pass
  • Cost: O(1) backward passes regardless of input dimension
  • Best when: many inputs, few outputs (most neural networks!)

Forward-mode:

  • Compute ∂all_outputs/∂one_input per pass
  • Cost: O(n) forward passes for n inputs
  • Best when: few inputs, many outputs (rare in deep learning)
Neural network: 1M parameters → 1 loss value
- Reverse mode: 1 backward pass to get all 1M gradients
- Forward mode: 1M forward passes (one per parameter)

This is why backpropagation dominates deep learning!

Did You Know? Forward-mode autodiff is used in some scientific computing applications where you have few inputs and many outputs. Julia’s ForwardDiff.jl is a popular implementation. Some researchers are exploring “mixed-mode” autodiff that combines both for certain architectures.

Sometimes we need gradients of gradients (Hessians, second derivatives).

# PyTorch example: computing Hessian-vector product
x = torch.randn(10, requires_grad=True)
y = (x ** 3).sum()
# First derivative
grad_y = torch.autograd.grad(y, x, create_graph=True)[0]
# Second derivative (gradient of gradient)
grad_grad_y = torch.autograd.grad(grad_y.sum(), x)[0]
print(f"y = sum(x³)")
print(f"dy/dx = 3x² → {grad_y[:3]}")
print(f"d²y/dx² = 6x → {grad_grad_y[:3]}")

Uses:

  • Second-order optimization (Newton’s method, natural gradient)
  • Regularization (penalizing sharp minima)
  • Meta-learning (learning to learn)

Deep networks require storing all activations for backprop. This can exhaust GPU memory.

Gradient checkpointing trades compute for memory:

  1. Don’t store intermediate activations
  2. Recompute them during backward pass
  3. Reduces memory from O(n) to O(√n) with √n checkpoints
# PyTorch gradient checkpointing
from torch.utils.checkpoint import checkpoint
class CheckpointedBlock(nn.Module):
def __init__(self, block):
super().__init__()
self.block = block
def forward(self, x):
# Recompute forward pass during backward
return checkpoint(self.block, x, use_reentrant=False)

  1. Verify data pipeline

    • Are inputs normalized?
    • Are labels correct?
    • Any NaN/Inf in inputs?
  2. Check initialization

    • Print initial weight stats
    • Verify no NaN in initial forward pass
  3. Start simple

    • Can you overfit one batch?
    • If not, something is fundamentally broken
  4. Monitor gradient flow

    • Plot gradient norms per layer
    • Look for vanishing/exploding patterns
  5. Gradient checking

    • Verify backprop is correct for custom operations
    • Use small networks for speed
def debug_training(model, batch, loss_fn):
"""Comprehensive debugging for one batch."""
# 1. Check input
x, y = batch
print(f"Input: shape={x.shape}, range=[{x.min():.2f}, {x.max():.2f}]")
print(f"Labels: shape={y.shape}, unique={torch.unique(y).tolist()}")
assert not torch.isnan(x).any(), "NaN in input!"
# 2. Forward pass
model.train()
output = model(x)
print(f"Output: shape={output.shape}, range=[{output.min():.2f}, {output.max():.2f}]")
assert not torch.isnan(output).any(), "NaN in output!"
# 3. Loss
loss = loss_fn(output, y)
print(f"Loss: {loss.item():.4f}")
assert not torch.isnan(loss), "NaN loss!"
# 4. Backward pass
loss.backward()
# 5. Check gradients
for name, param in model.named_parameters():
if param.grad is not None:
grad = param.grad
print(f"{name}: grad_norm={grad.norm():.2e}, "
f"grad_range=[{grad.min():.2e}, {grad.max():.2e}]")
assert not torch.isnan(grad).any(), f"NaN gradient in {name}!"
print("All checks passed!")

These exercises will solidify your understanding of backpropagation by having you implement and debug gradient computations yourself.

Take the Value class we built earlier and extend it with additional operations.

Your task: Implement these missing methods:

class Value:
# ... (existing code from above)
def log(self):
"""Natural logarithm. d(log(x))/dx = 1/x"""
import math
# YOUR CODE HERE
pass
def sigmoid(self):
"""Sigmoid function. σ(x) = 1/(1+e^(-x)), σ'(x) = σ(x)(1-σ(x))"""
# YOUR CODE HERE
pass
def __matmul__(self, other):
"""
Implement dot product for two Value vectors.
This is trickier - think about how gradients flow through a sum of products.
"""
# YOUR CODE HERE
pass

Solution approach:

def log(self):
import math
out = Value(math.log(self.data), (self,), 'log')
def _backward():
# d(log(x))/dx = 1/x
self.grad += out.grad * (1.0 / self.data)
out._backward = _backward
return out
def sigmoid(self):
import math
s = 1.0 / (1.0 + math.exp(-self.data))
out = Value(s, (self,), 'sigmoid')
def _backward():
# d(sigmoid(x))/dx = sigmoid(x) * (1 - sigmoid(x))
self.grad += out.grad * s * (1 - s)
out._backward = _backward
return out

Verify your implementation:

# Test log
x = Value(2.0)
y = x.log()
y.backward()
print(f"log(2) = {y.data:.4f}, should be ~0.6931")
print(f"d(log(x))/dx at x=2 is {x.grad:.4f}, should be 0.5")
# Test sigmoid
x = Value(0.0)
y = x.sigmoid()
y.backward()
print(f"sigmoid(0) = {y.data:.4f}, should be 0.5")
print(f"d(sigmoid(x))/dx at x=0 is {x.grad:.4f}, should be 0.25")

Exercise 2: Trace Through Backprop by Hand

Section titled “Exercise 2: Trace Through Backprop by Hand”

Manually compute gradients for this computation graph:

L = (a * b + c)² where a=2, b=3, c=1

Step 1: Draw the graph

a(2) b(3)
\ /
[*]
|
v
z1(6) c(1)
\ /
\ /
[+]
|
v
z2(7)
|
[²]
|
v
L(49)

Step 2: Forward pass (verify values)

  • z1 = a * b = 2 * 3 = 6
  • z2 = z1 + c = 6 + 1 = 7
  • L = z2² = 49

Step 3: Backward pass (compute gradients)

  • dL/dL = 1 (start here)
  • dL/dz2 = dL/dL * d(z2²)/dz2 = 1 * 2z2 = 27 = 14
  • dL/dz1 = dL/dz2 * d(z1+c)/dz1 = 14 * 1 = 14
  • dL/dc = dL/dz2 * d(z1+c)/dc = 14 * 1 = 14
  • dL/da = dL/dz1 * d(a*b)/da = 14 * b = 14 * 3 = 42
  • dL/db = dL/dz1 * d(a*b)/db = 14 * a = 14 * 2 = 28

Your task: Verify with our autograd engine:

a = Value(2.0)
b = Value(3.0)
c = Value(1.0)
L = (a * b + c) ** 2
L.backward()
print(f"L = {L.data}") # Should be 49
print(f"dL/da = {a.grad}") # Should be 42
print(f"dL/db = {b.grad}") # Should be 28
print(f"dL/dc = {c.grad}") # Should be 14

Gradient clipping prevents exploding gradients by limiting gradient magnitudes.

Your task: Implement two types of gradient clipping:

def clip_grad_value(parameters, clip_value):
"""
Clip gradients by value.
Each gradient component is clipped to [-clip_value, clip_value].
Example: grad = [5, -10, 3] with clip_value=4 becomes [4, -4, 3]
"""
# YOUR CODE HERE
for param in parameters:
if param.grad is not None:
param.grad.data.clamp_(-clip_value, clip_value)
def clip_grad_norm(parameters, max_norm):
"""
Clip gradients by global norm.
All gradients are scaled so their combined L2 norm <= max_norm.
Example: If total norm is 10 and max_norm is 5, scale all grads by 0.5
"""
# YOUR CODE HERE
total_norm = 0.0
for param in parameters:
if param.grad is not None:
total_norm += param.grad.data.pow(2).sum()
total_norm = total_norm.sqrt()
clip_coef = max_norm / (total_norm + 1e-6)
if clip_coef < 1:
for param in parameters:
if param.grad is not None:
param.grad.data.mul_(clip_coef)
return total_norm

Test your implementation:

import torch
# Create a model with exploding gradients
model = torch.nn.Linear(10, 10)
x = torch.randn(1, 10) * 100 # Large input
y = model(x)
loss = y.sum()
loss.backward()
# Print gradient norms before clipping
print("Before clipping:")
for name, param in model.named_parameters():
print(f" {name}: grad_norm = {param.grad.norm():.2f}")
# Clip gradients
clip_grad_norm(model.parameters(), max_norm=1.0)
# Print gradient norms after clipping
print("After clipping (max_norm=1.0):")
total = 0
for name, param in model.named_parameters():
print(f" {name}: grad_norm = {param.grad.norm():.2f}")
total += param.grad.norm()**2
print(f" Total norm: {total.sqrt():.2f}") # Should be <= 1.0

The following training loop has bugs that cause NaN loss. Find and fix them.

import torch
import torch.nn as nn
def broken_training():
"""This training loop will produce NaN. Fix the bugs!"""
# Bug 1: No seed for reproducibility
model = nn.Sequential(
nn.Linear(10, 50),
nn.Sigmoid(), # Bug 2: Sigmoid can cause vanishing gradients in deep nets
nn.Linear(50, 50),
nn.Sigmoid(),
nn.Linear(50, 1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=10.0) # Bug 3: LR too high
for epoch in range(100):
x = torch.randn(32, 10)
y = torch.randn(32, 1)
output = model(x)
loss = torch.log(output) # Bug 4: log of potentially negative values!
loss = loss.mean()
# Bug 5: No zero_grad!
loss.backward()
optimizer.step()
if epoch % 10 == 0:
print(f"Epoch {epoch}: loss = {loss.item():.4f}")
# Run it - watch it fail
broken_training()

Your task: Fix all 5 bugs and get the training to work.

Fixed version:

def fixed_training():
"""Corrected training loop."""
# Fix 1: Set seed for reproducibility
torch.manual_seed(42)
# Fix 2: Use ReLU instead of Sigmoid
model = nn.Sequential(
nn.Linear(10, 50),
nn.ReLU(),
nn.Linear(50, 50),
nn.ReLU(),
nn.Linear(50, 1)
)
# Fix 3: Use reasonable learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(100):
x = torch.randn(32, 10)
y = torch.randn(32, 1)
# Fix 5: Zero gradients at start of each iteration
optimizer.zero_grad()
output = model(x)
# Fix 4: Use proper loss function (MSE instead of log)
loss = nn.functional.mse_loss(output, y)
loss.backward()
optimizer.step()
if epoch % 10 == 0:
print(f"Epoch {epoch}: loss = {loss.item():.4f}")
fixed_training()

Build a tool to visualize how gradients flow through a network.

Your task: Implement this gradient visualization:

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
def visualize_gradient_flow(model, sample_input, sample_target, loss_fn):
"""
Creates a visualization of gradient magnitudes through the network.
This helps diagnose:
- Vanishing gradients (later layers have much larger gradients)
- Exploding gradients (earlier layers have much larger gradients)
- Dead neurons (gradients are exactly zero)
"""
# Forward pass
output = model(sample_input)
loss = loss_fn(output, sample_target)
# Backward pass
loss.backward()
# Collect gradient statistics
layer_names = []
grad_means = []
grad_stds = []
grad_maxs = []
for name, param in model.named_parameters():
if param.grad is not None:
grad = param.grad.detach().cpu().numpy()
layer_names.append(name)
grad_means.append(abs(grad).mean())
grad_stds.append(grad.std())
grad_maxs.append(abs(grad).max())
# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
x = range(len(layer_names))
axes[0].bar(x, grad_means)
axes[0].set_xticks(x)
axes[0].set_xticklabels(layer_names, rotation=45, ha='right')
axes[0].set_title('Mean |Gradient|')
axes[0].set_yscale('log')
axes[1].bar(x, grad_stds)
axes[1].set_xticks(x)
axes[1].set_xticklabels(layer_names, rotation=45, ha='right')
axes[1].set_title('Gradient Std Dev')
axes[1].set_yscale('log')
axes[2].bar(x, grad_maxs)
axes[2].set_xticks(x)
axes[2].set_xticklabels(layer_names, rotation=45, ha='right')
axes[2].set_title('Max |Gradient|')
axes[2].set_yscale('log')
plt.tight_layout()
return fig
# Test it
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10)
)
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
fig = visualize_gradient_flow(model, x, y, nn.CrossEntropyLoss())
plt.show()

Questions to answer after running:

  1. Are the gradient magnitudes similar across layers, or do they vary wildly?
  2. What happens if you add more layers?
  3. What happens if you replace ReLU with Sigmoid?
  4. What happens if you add BatchNorm?

Exercise 6: Implement Your Own Autograd for a Neural Network

Section titled “Exercise 6: Implement Your Own Autograd for a Neural Network”

This is the capstone exercise. Build a 2-layer neural network using only our Value class, train it on XOR, and verify it learns.

# The XOR problem - not linearly separable!
# Input: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0
import random
def train_xor():
"""Train a 2-layer network on XOR using our custom autograd."""
random.seed(42)
# Network: 2 -> 4 -> 1
# Initialize weights
W1 = [[Value(random.uniform(-1, 1)) for _ in range(2)] for _ in range(4)]
b1 = [Value(0) for _ in range(4)]
W2 = [Value(random.uniform(-1, 1)) for _ in range(4)]
b2 = Value(0)
def forward(x1, x2):
"""Forward pass through the network."""
x = [Value(x1), Value(x2)]
# Hidden layer
hidden = []
for i in range(4):
z = W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]
hidden.append(z.tanh())
# Output layer
out = W2[0] * hidden[0] + W2[1] * hidden[1] + W2[2] * hidden[2] + W2[3] * hidden[3] + b2
return out.tanh()
# Training data
data = [
((0, 0), 0),
((0, 1), 1),
((1, 0), 1),
((1, 1), 0),
]
# Training loop
learning_rate = 0.5
params = [w for row in W1 for w in row] + b1 + W2 + [b2]
for epoch in range(1000):
total_loss = 0
for (x1, x2), target in data:
# Forward
pred = forward(x1, x2)
loss = (pred - Value(target)) ** 2
total_loss += loss.data
# Backward
for p in params:
p.grad = 0 # Zero gradients
loss.backward()
# Update
for p in params:
p.data -= learning_rate * p.grad
if epoch % 100 == 0:
print(f"Epoch {epoch}: loss = {total_loss:.4f}")
# Test
print("\nFinal predictions:")
for (x1, x2), target in data:
pred = forward(x1, x2)
print(f" ({x1}, {x2}) -> {pred.data:.3f} (target: {target})")
train_xor()

Expected output (approximately):

Epoch 0: loss = 1.8234
Epoch 100: loss = 0.9821
Epoch 200: loss = 0.3456
...
Epoch 900: loss = 0.0123
Final predictions:
(0, 0) -> 0.012 (target: 0)
(0, 1) -> 0.987 (target: 1)
(1, 0) -> 0.991 (target: 1)
(1, 1) -> 0.015 (target: 0)

Q1: Why do we process the computation graph in reverse order during backprop?

Answer

We need to compute gradients from the output back to the inputs because:

  1. The chain rule multiplies local gradients — we need ∂L/∂output first before computing ∂L/∂earlier_nodes
  2. Each node needs the gradient from its outputs to compute the gradient for its inputs
  3. Reverse topological order ensures that when we process a node, all its outputs have already been processed

This is why it’s called “back”-propagation — we propagate errors backward through the graph.

Q2: What’s the computational complexity advantage of reverse-mode autodiff over forward-mode?

Answer

For a function with n inputs and m outputs:

  • Reverse-mode: O(m) backward passes to get all gradients
  • Forward-mode: O(n) forward passes to get all gradients

In neural networks, we typically have millions of parameters (n) but one scalar loss (m=1). So reverse-mode needs just 1 backward pass to get gradients for all parameters, while forward-mode would need millions of passes.

This O(1) vs O(n) difference is why backpropagation (reverse-mode) is used in deep learning.

Q3: A network has vanishing gradients. What are three solutions?

Answer
  1. Use ReLU instead of sigmoid/tanh: ReLU has gradient 1 for positive inputs, preventing multiplicative shrinking

  2. Add skip/residual connections: Allow gradients to flow directly through addition: ∂(x + f(x))/∂x = 1 + ∂f/∂x, so even if ∂f/∂x→0, gradient is still 1

  3. Use batch normalization: Normalizes activations to prevent them from saturating in sigmoid/tanh regions

Other solutions: proper initialization (Xavier/He), gradient clipping (for exploding), LSTM/GRU gating (for RNNs).

Q4: Your loss becomes NaN after 100 epochs. How do you debug?

Answer

Systematic approach:

  1. Add gradient hooks to detect when NaN first appears

  2. Binary search on epochs to find when it starts

  3. Check for:

    • Division by zero: Look for / operations with potentially zero denominators
    • Log of zero: log(x) where x could be 0 or negative
    • Exploding gradients: Print gradient norms, look for exponential growth
    • Large learning rate: Try 10x smaller LR
  4. Reproduce with deterministic seed for consistent debugging

  5. Simplify: Remove layers until you find the culprit

Q5: What is gradient checking and when would you use it?

Answer

Gradient checking compares analytical gradients (from backprop) with numerical gradients (from finite differences):

numerical_grad ≈ [f(x+ε) - f(x-ε)] / (2ε)

Use it when:

  • Implementing new layers or custom operations
  • Debugging unexpected training behavior
  • Verifying complex loss functions

Don’t use it:

  • In production (too slow — O(n) forward passes)
  • On large networks (use on small test cases)

A relative error < 1e-5 indicates correct implementation.


You’ve learned:

  1. The chain rule is the mathematical foundation of all neural network learning
  2. Computational graphs track operations for automatic differentiation
  3. Reverse-mode autodiff (backprop) computes all gradients in one backward pass
  4. Building autograd from scratch shows there’s no magic — just calculus
  5. Gradient checking verifies implementations with numerical approximation
  6. Common problems (vanishing, exploding, dead ReLU) and their solutions
  7. Debugging techniques for systematic troubleshooting

Every time you call loss.backward(), you now understand exactly what happens inside!


  1. Andrej Karpathy’s micrograd — The simplest autograd engine (~100 lines)

  2. CS231n Backpropagation notes — Stanford’s excellent visual explanation

  3. “Calculus on Computational Graphs” by Chris Olah

  1. “Learning representations by back-propagating errors” (Rumelhart, Hinton, Williams, 1986)

    • The paper that popularized backpropagation
  2. “Automatic Differentiation in Machine Learning: a Survey” (Baydin et al., 2018)

    • Comprehensive survey of autodiff methods

Backpropagation is just the chain rule applied systematically.

When you call loss.backward():

  1. The framework has been recording every operation you performed (building a computational graph)
  2. It starts at the loss (gradient = 1)
  3. It walks backward through the graph, applying the chain rule at each operation
  4. At each node, it multiplies the incoming gradient by the local gradient
  5. When it reaches a parameter, that accumulated gradient tells us how to update it

This elegant algorithm, combined with GPUs and big data, is why deep learning works. From a simple linear regression to gpt-5, the learning mechanism is fundamentally the same.

You now understand the engine that powers all of modern AI!


Congratulations! You’ve completed Phase 6: Deep Learning Foundations!

You now understand:

  • Python for ML (NumPy, Pandas, scikit-learn)
  • Neural networks from scratch
  • PyTorch fundamentals
  • Training techniques (optimizers, regularization, scheduling)
  • CNNs for computer vision
  • Transformers and attention
  • Backpropagation and autograd

Move on to Phase 7: Advanced Generative AI where you’ll learn:

  • Fine-tuning and PEFT methods
  • Reinforcement Learning from Human Feedback (RLHF)
  • Diffusion models and image generation
  • Multimodal models
  • Efficient inference

Last updated: 2025-11-27 Status: Complete