Transformers from Scratch

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 8-10

Or: The Algorithm That Taught Machines to Learn (And How to Build It Yourself)

Reading Time: 6-7 hours Prerequisites: Module 30

The Idea That Almost Died Twice

San Diego, California. July 1969. 3:47 PM.

Seppo Linnainmaa, a young Finnish computer scientist, had just finished his master’s thesis. The paper described an elegant algorithm for computing derivatives automatically—what he called “reverse mode automatic differentiation.” It was mathematically beautiful, but no one cared. Computers were too slow, and the problems it could solve seemed too small to matter.

The idea lay dormant for almost two decades.

Then in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” in Nature. They showed how Linnainmaa’s algorithm could train neural networks with multiple hidden layers—something thought impossible at the time. The paper made backpropagation famous.

But even then, the world wasn’t ready. Neural networks were slow, finicky, and often outperformed by simpler methods. Backprop was nearly forgotten again in the 1990s when support vector machines dominated.

It took until 2012—when Alex Krizhevsky’s AlexNet crushed the ImageNet competition using GPUs—for backpropagation to claim its throne. The algorithm that almost died twice is now the heartbeat of every AI system on Earth.

“Backpropagation is a beautiful algorithm. It’s not deep learning’s breakthrough—it’s its foundation. Everything else is built on top of it.” — Geoffrey Hinton, Nobel Prize acceptance speech, 2024

What You’ll Be Able to Do

By the end of this module, you will:

Truly understand what backpropagation actually does
Master the chain rule and see why it’s the key to everything
Build a working autograd engine from scratch
Understand computational graphs and how frameworks track operations
Know how to debug gradient issues (vanishing, exploding, NaN)
Implement gradient checking to verify your implementations
Visualize gradient flow through networks

** Did You Know?** The term “backpropagation” wasn’t always the preferred name. Paul Werbos originally called it “dynamic feedback” in his 1974 PhD thesis. The name “backpropagation” (or “back-propagation”) was popularized by Rumelhart, Hinton, and Williams in their famous 1986 Nature paper. In the control theory community, the same algorithm is known as “adjoint sensitivity analysis.” Meanwhile, in the automatic differentiation community, it’s called “reverse-mode automatic differentiation.” All these names describe the same fundamental algorithm!

The Heureka Moment: What You’re About to Discover

Think of loss.backward() like pressing “show work” on a calculator that solved a complex equation. The calculator (PyTorch) has been keeping notes about every operation it performed. When you press backward, it traces through those notes in reverse, computing exactly how much each input contributed to the final answer.

Every time you call loss.backward() in PyTorch, something magical happens. The framework somehow figures out how to adjust millions of parameters to make the loss smaller. But how?

The answer is backpropagation — short for “backward propagation of errors.” It’s not magic; it’s calculus applied cleverly. By the end of this module, you’ll be able to implement it yourself from scratch.

Here’s the transformative insight: backpropagation is just the chain rule applied systematically. Once you see it, you can’t unsee it. Every neural network, from a simple perceptron to gpt-5, learns using the same fundamental algorithm.

The Problem: How Do We Update Weights?

Consider a simple network:

Input x → [Layer 1] → [Layer 2] → [Layer 3] → Output ŷ → Loss L
          W₁, b₁      W₂, b₂      W₃, b₃

We want to minimize the loss L. To do that, we need to know:

How should we change W₁ to reduce L?
How should we change W₂ to reduce L?
How should we change W₃ to reduce L?

Mathematically, we need: ∂L/∂W₁, ∂L/∂W₂, ∂L/∂W₃

But here’s the challenge: W₁ affects the output through a long chain of operations. How do we compute its effect on the loss?

The Chain Rule: The Foundation of Everything

Think of the chain rule like a relay race. When you want to know how fast the final runner is going, you need to know how fast each handoff affected the next. If runner 1 speeds up, runner 2 speeds up, which speeds up runner 3. The total effect is the product of all individual effects.

In backpropagation, we’re asking: “If I wiggle this weight, how much does the loss wiggle?” The answer flows backward through each operation, multiplying effects at each step.

The Single-Variable Chain Rule

If y = f(g(x)), then:

dy/dx = dy/dg · dg/dx

Worked Example:

Let y = (2x + 1)³

Let g = 2x + 1, so y = g³
dy/dg = 3g² = 3(2x + 1)²
dg/dx = 2
Therefore: dy/dx = 3(2x + 1)² · 2 = 6(2x + 1)²

Let’s verify at x = 1:

y = (2·1 + 1)³ = 27
dy/dx = 6(2·1 + 1)² = 6 · 9 = 54

Numerical check: (y(1.001) - y(1))/0.001 = (27.162… - 27)/0.001 ≈ 54

The Multi-Variable Chain Rule

If L depends on x through multiple paths:

L = f(a, b) where a = g(x) and b = h(x)

∂L/∂x = (∂L/∂a)(∂a/∂x) + (∂L/∂b)(∂b/∂x)

We sum the contributions from all paths.

Did You Know? The chain rule was formalized by Gottfried Wilhelm Leibniz in the late 1600s. Over 300 years later, it became the mathematical foundation of deep learning. Leibniz couldn’t have imagined that his calculus would one day power language models and self-driving cars!

Computational Graphs: Tracking Operations

Think of a computational graph like a detailed recipe card that records every step of cooking. When you make a cake, you don’t just remember “I made a cake”—you remember “I mixed flour with sugar, then added eggs, then baked for 30 minutes.” If the cake tastes too sweet, you can trace back through the recipe to find which step added too much sugar and adjust it. Computational graphs work the same way: they record every mathematical operation so we can trace backward and figure out which weights need adjusting.

Modern deep learning frameworks represent computations as directed acyclic graphs (DAGs). Each node is an operation, and edges represent data flow.

Example: Simple Expression

Consider: L = (x · w + b)²

The computational graph looks like:

    x       w
     \     /
      \   /
       [*]  ← multiply
        |
        v
        z₁ = x · w
        |
        + ← b
        |
        v
        z₂ = z₁ + b
        |
       [²] ← square
        |
        v
        L = z₂²

Forward Pass

Compute values from inputs to outputs:

# Forward pass
x, w, b = 2.0, 3.0, 1.0

z1 = x * w        # z1 = 6
z2 = z1 + b       # z2 = 7
L = z2 ** 2       # L = 49

Backward Pass

Compute gradients from outputs to inputs:

# Backward pass - apply chain rule at each node
dL_dL = 1.0                    # Start with 1

# d(z²)/dz = 2z
dL_dz2 = dL_dL * 2 * z2        # dL/dz2 = 1 * 2 * 7 = 14

# d(z1 + b)/dz1 = 1, d(z1 + b)/db = 1
dL_dz1 = dL_dz2 * 1            # dL/dz1 = 14
dL_db = dL_dz2 * 1             # dL/db = 14

# d(x * w)/dx = w, d(x * w)/dw = x
dL_dx = dL_dz1 * w             # dL/dx = 14 * 3 = 42
dL_dw = dL_dz1 * x             # dL/dw = 14 * 2 = 28

Verification:

L = (xw + b)² = (2·3 + 1)² = 49
∂L/∂w = 2(xw + b) · x = 2·7·2 = 28
∂L/∂x = 2(xw + b) · w = 2·7·3 = 42
∂L/∂b = 2(xw + b) · 1 = 2·7·1 = 14

Building an Autograd Engine from Scratch

Let’s build a simple but complete autograd system. This is how PyTorch works at its core!

The Value Class

class Value:
    """
    A scalar value that tracks its computation history for automatic differentiation.

    This is a simplified version of how PyTorch tensors work internally.
    """

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0  # Gradient starts at 0

        # Internal variables for autograd
        self._backward = lambda: None  # Function to compute gradient
        self._prev = set(_children)    # Parent nodes in the graph
        self._op = _op                 # Operation that created this node

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # d(a + b)/da = 1, d(a + b)/db = 1
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # d(a * b)/da = b, d(a * b)/db = a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward

        return out

    def __pow__(self, n):
        assert isinstance(n, (int, float)), "only int/float powers supported"
        out = Value(self.data ** n, (self,), f'**{n}')

        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += out.grad * (n * self.data ** (n - 1))
        out._backward = _backward

        return out

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other ** -1

    def __radd__(self, other):
        return self + other

    def __rmul__(self, other):
        return self * other

    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')

        def _backward():
            # d(ReLU(x))/dx = 1 if x > 0 else 0
            self.grad += out.grad * (1.0 if self.data > 0 else 0.0)
        out._backward = _backward

        return out

    def tanh(self):
        import math
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # d(tanh(x))/dx = 1 - tanh(x)²
            self.grad += out.grad * (1 - t ** 2)
        out._backward = _backward

        return out

    def exp(self):
        import math
        out = Value(math.exp(self.data), (self,), 'exp')

        def _backward():
            # d(e^x)/dx = e^x
            self.grad += out.grad * out.data
        out._backward = _backward

        return out

    def backward(self):
        """
        Compute gradients for all nodes in the graph using reverse-mode autodiff.

        We do a topological sort to ensure we process nodes in the right order
        (children before parents).
        """
        # Topological sort
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)

        build_topo(self)

        # Start with gradient of 1 for the output
        self.grad = 1.0

        # Apply chain rule in reverse topological order
        for v in reversed(topo):
            v._backward()

Notice how each operation stores a _backward function that knows how to compute its local gradient. When we call backward(), we traverse the graph in reverse order, applying the chain rule at each node.

Testing Our Autograd

# Test 1: Simple expression
x = Value(2.0)
w = Value(3.0)
b = Value(1.0)

# L = (x*w + b)²
L = (x * w + b) ** 2
L.backward()

print(f"L = {L.data}")        # 49.0
print(f"dL/dx = {x.grad}")    # 42.0
print(f"dL/dw = {w.grad}")    # 28.0
print(f"dL/db = {b.grad}")    # 14.0

# Test 2: Neural network forward pass
def neuron(x, w, b):
    return (x * w + b).tanh()

x = Value(1.0)
w = Value(0.5)
b = Value(0.1)
y = neuron(x, w, b)
y.backward()

print(f"\nNeuron output: {y.data:.4f}")
print(f"dy/dw = {w.grad:.4f}")

Did You Know? This exact approach — building a computation graph during forward pass and traversing it backward — is how PyTorch’s autograd works. The main difference is that PyTorch handles tensors instead of scalars and has optimized C++ implementations. Andrej Karpathy’s “micrograd” (which inspired this implementation) proved you can build a functional autograd in ~100 lines of Python!

The Backpropagation Algorithm

Think of backpropagation like a blame assignment meeting after a project fails. The final result (loss) was bad, so you need to figure out who’s responsible. You start at the end: “The final report was wrong.” Then you trace back: “Because the calculations were off.” Then: “Because the data entry had mistakes.” Each person (weight) gets assigned exactly their share of the blame (gradient), proportional to how much they contributed to the problem. This “blame” is then used to adjust behavior (update weights) for next time.

Now let’s see how backpropagation works in a full neural network.

Forward Pass Through a Layer

For a layer: y = σ(Wx + b)

# Forward pass
z = W @ x + b        # Linear transformation
y = activation(z)    # Apply activation

Backward Pass Through a Layer

Given ∂L/∂y (gradient from the layer above), compute:

∂L/∂W (to update weights)
∂L/∂b (to update biases)
∂L/∂x (to pass to the layer below)

∂L/∂z = ∂L/∂y · ∂y/∂z = ∂L/∂y · σ'(z)     # Gradient through activation
∂L/∂W = ∂L/∂z · ∂z/∂W = ∂L/∂z · x^T       # Weight gradient
∂L/∂b = ∂L/∂z · ∂z/∂b = ∂L/∂z             # Bias gradient
∂L/∂x = ∂L/∂z · ∂z/∂x = W^T · ∂L/∂z       # Input gradient (for previous layer)

Worked Example: Two-Layer Network

Network: x → Linear(2, 3) → ReLU → Linear(3, 1) → MSE Loss

import numpy as np

# Initialize
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1  # 3 neurons, 2 inputs
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3) * 0.1  # 1 output, 3 inputs
b2 = np.zeros((1, 1))

# Single training example
x = np.array([[1.0], [2.0]])       # Input (2,)
y_true = np.array([[1.0]])         # Target

# === FORWARD PASS ===
z1 = W1 @ x + b1                   # (3, 1)
a1 = np.maximum(0, z1)             # ReLU
z2 = W2 @ a1 + b2                  # (1, 1)
y_pred = z2                        # No activation on output
loss = 0.5 * (y_pred - y_true) ** 2

print(f"Forward pass:")
print(f"  z1 shape: {z1.shape}, a1 shape: {a1.shape}")
print(f"  y_pred: {y_pred[0,0]:.4f}, loss: {loss[0,0]:.4f}")

# === BACKWARD PASS ===
# Start from loss
dL_dy = y_pred - y_true            # d(0.5(y-t)²)/dy = (y-t)

# Through layer 2
dL_dz2 = dL_dy                     # No activation
dL_dW2 = dL_dz2 @ a1.T             # (1, 3)
dL_db2 = dL_dz2                    # (1, 1)
dL_da1 = W2.T @ dL_dz2             # (3, 1)

# Through ReLU
dL_dz1 = dL_da1 * (z1 > 0)         # ReLU gradient: 1 if z>0, else 0

# Through layer 1
dL_dW1 = dL_dz1 @ x.T              # (3, 2)
dL_db1 = dL_dz1                    # (3, 1)

print(f"\nBackward pass:")
print(f"  dL/dW2 shape: {dL_dW2.shape}")
print(f"  dL/dW1 shape: {dL_dW1.shape}")

Gradient Checking: Verifying Your Implementation

Think of gradient checking like double-checking your work on a math test. Your analytical solution (backprop) should give the same answer as the slow but reliable numerical method (finite differences). If they disagree significantly, you made a mistake somewhere.

How do you know your backprop is correct? Gradient checking compares analytical gradients (from backprop) with numerical gradients (from finite differences).

Numerical Gradient

∂f/∂x ≈ [f(x + ε) - f(x - ε)] / (2ε)

This is the definition of a derivative, approximated with small ε.

Implementation

def gradient_check(param, loss_fn, epsilon=1e-5):
    """
    Compare analytical gradient with numerical gradient.

    Args:
        param: Parameter to check (numpy array)
        loss_fn: Function that takes param and returns (loss, grad)
        epsilon: Small number for finite differences

    Returns:
        relative_error: Should be < 1e-5 for correct implementation
    """
    # Get analytical gradient
    loss, analytical_grad = loss_fn(param)

    # Compute numerical gradient
    numerical_grad = np.zeros_like(param)

    it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        original = param[idx]

        # f(x + ε)
        param[idx] = original + epsilon
        loss_plus, _ = loss_fn(param)

        # f(x - ε)
        param[idx] = original - epsilon
        loss_minus, _ = loss_fn(param)

        # Numerical gradient
        numerical_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)

        # Restore
        param[idx] = original
        it.iternext()

    # Relative error
    diff = np.abs(analytical_grad - numerical_grad)
    norm = np.abs(analytical_grad) + np.abs(numerical_grad) + 1e-8
    relative_error = np.max(diff / norm)

    return relative_error, analytical_grad, numerical_grad


# Example usage
def simple_loss(W):
    """L = sum(W²)"""
    loss = np.sum(W ** 2)
    grad = 2 * W
    return loss, grad

W = np.random.randn(3, 4)
error, analytical, numerical = gradient_check(W, simple_loss)
print(f"Relative error: {error:.2e}")  # Should be ~1e-10

When to Use Gradient Checking

During development: When implementing new layers or loss functions
Debugging NaN gradients: To isolate which operation is broken
Custom autograd operations: To verify correctness
Never in production: It’s O(n) forward passes per gradient, extremely slow

Did You Know? The backpropagation algorithm was independently discovered several times. Paul Werbos described it in his 1974 PhD thesis, but it wasn’t widely known. In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published “Learning representations by back-propagating errors” in Nature, which finally brought backprop to mainstream attention. Hinton later won the Nobel Prize in Physics (2024) partly for this work!

Common Gradient Problems

Think of gradients like a whispered message in the “telephone game.” In a shallow network (few players), the message arrives mostly intact. But in a deep network (many players), the message can either fade to silence (vanishing gradients) or get wildly exaggerated (exploding gradients). Modern techniques like skip connections are like having players occasionally shout the original message directly across the room, bypassing the chain entirely.

1. Vanishing Gradients

Symptom: Early layers learn very slowly or not at all.

Cause: Gradients shrink as they propagate backward through many layers.

Sigmoid: σ'(x) = σ(x)(1 - σ(x)) ≤ 0.25

After 10 layers: gradient ≤ 0.25¹⁰ = 0.000001

Solutions:

Use ReLU or variants (Leaky ReLU, GELU)
Batch normalization
Residual connections
Proper initialization (Xavier, He)

Did You Know? The vanishing gradient problem plagued neural networks for nearly two decades! After the initial excitement of backpropagation in the 1980s, researchers found that networks deeper than 2-3 layers were nearly impossible to train. This “AI Winter” period lasted until ~2006 when Geoffrey Hinton showed that layer-by-layer pretraining could help. The real breakthrough came with ReLU activations (2010) and residual connections (2015), finally enabling the 100+ layer networks we use today.

2. Exploding Gradients

Symptom: Loss becomes NaN, weights explode to infinity.

Cause: Gradients grow as they propagate, especially in RNNs.

Solutions:

Gradient clipping: grad = clip(grad, -max_val, max_val)
Proper initialization
Batch normalization
Learning rate warmup

3. Dead ReLU

Symptom: Some neurons never activate, always output 0.

Cause: If a neuron’s input is always negative, ReLU outputs 0, and gradient is 0, so it never updates.

# Dead neuron: if z < 0 always, then:
# Forward: ReLU(z) = 0
# Backward: dReLU/dz = 0 → no learning!

Solutions:

Leaky ReLU: max(0.01x, x) — small gradient when negative
PReLU: max(αx, x) — learnable α
ELU, GELU — smooth alternatives
Careful initialization (don’t start with large negative biases)

4. NaN Gradients

Symptom: Loss becomes NaN during training.

Common causes:

Division by zero: 1/x where x→0
Log of zero: log(0) = -∞
Exploding gradients: Eventually overflow
Learning rate too high: Weights jump to bad regions

Debugging strategy:

# Add hooks to detect NaN
def check_nan(grad):
    if torch.isnan(grad).any():
        print(f"NaN detected! grad stats: min={grad.min()}, max={grad.max()}")
        import pdb; pdb.set_trace()
    return grad

for name, param in model.named_parameters():
    param.register_hook(lambda grad, name=name: check_nan(grad))

Gradient Flow Visualization

Think of gradient visualization like an X-ray for your neural network. Just as doctors use X-rays to see what’s happening inside a patient, gradient histograms and flow diagrams reveal the health of your network’s learning process—showing where information flows freely and where it gets blocked or distorted.

Understanding how gradients flow helps debug networks.

Gradient Histograms

def plot_gradient_histograms(model):
    """
    Plot histogram of gradients for each layer.

    Healthy gradients should:
    - Have similar scale across layers
    - Not be all zeros
    - Not have extreme values
    """
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 3, figsize=(12, 8))
    axes = axes.flatten()

    for idx, (name, param) in enumerate(model.named_parameters()):
        if param.grad is not None and idx < 6:
            grad = param.grad.detach().cpu().numpy().flatten()
            axes[idx].hist(grad, bins=50)
            axes[idx].set_title(f'{name}\nmean={grad.mean():.2e}, std={grad.std():.2e}')
            axes[idx].set_xlabel('Gradient value')

    plt.tight_layout()
    return fig

Gradient Norms per Layer

def compute_gradient_norms(model):
    """
    Compute gradient L2 norm for each layer.

    Plot over training to detect:
    - Vanishing: norms decrease in early layers
    - Exploding: norms increase in early layers
    """
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.norm().item()
    return norms

What to Look For

Pattern	Interpretation	Action
Gradients near 0 in early layers	Vanishing gradients	Add skip connections, change activation
Gradients 100x larger in early layers	Exploding gradients	Clip gradients, reduce LR
Some parameters always 0	Dead neurons	Use Leaky ReLU, check init
Gradients NaN	Numerical instability	Check for log(0), div by 0

Advanced Topics

Forward-Mode vs Reverse-Mode Autodiff

There are two ways to compute gradients:

Reverse-mode (backpropagation):

Compute ∂output/∂all_inputs in one backward pass
Cost: O(1) backward passes regardless of input dimension
Best when: many inputs, few outputs (most neural networks!)

Forward-mode:

Compute ∂all_outputs/∂one_input per pass
Cost: O(n) forward passes for n inputs
Best when: few inputs, many outputs (rare in deep learning)

Neural network: 1M parameters → 1 loss value
- Reverse mode: 1 backward pass to get all 1M gradients
- Forward mode: 1M forward passes (one per parameter)

This is why backpropagation dominates deep learning!

Did You Know? Forward-mode autodiff is used in some scientific computing applications where you have few inputs and many outputs. Julia’s ForwardDiff.jl is a popular implementation. Some researchers are exploring “mixed-mode” autodiff that combines both for certain architectures.

Higher-Order Gradients

Sometimes we need gradients of gradients (Hessians, second derivatives).

# PyTorch example: computing Hessian-vector product
x = torch.randn(10, requires_grad=True)
y = (x ** 3).sum()

# First derivative
grad_y = torch.autograd.grad(y, x, create_graph=True)[0]

# Second derivative (gradient of gradient)
grad_grad_y = torch.autograd.grad(grad_y.sum(), x)[0]

print(f"y = sum(x³)")
print(f"dy/dx = 3x² → {grad_y[:3]}")
print(f"d²y/dx² = 6x → {grad_grad_y[:3]}")

Uses:

Second-order optimization (Newton’s method, natural gradient)
Regularization (penalizing sharp minima)
Meta-learning (learning to learn)

Checkpointing for Memory Efficiency

Deep networks require storing all activations for backprop. This can exhaust GPU memory.

Gradient checkpointing trades compute for memory:

Don’t store intermediate activations
Recompute them during backward pass
Reduces memory from O(n) to O(√n) with √n checkpoints

# PyTorch gradient checkpointing
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # Recompute forward pass during backward
        return checkpoint(self.block, x, use_reentrant=False)

Practical Debugging Guide

Systematic Debugging Steps

Verify data pipeline
- Are inputs normalized?
- Are labels correct?
- Any NaN/Inf in inputs?
Check initialization
- Print initial weight stats
- Verify no NaN in initial forward pass
Start simple
- Can you overfit one batch?
- If not, something is fundamentally broken
Monitor gradient flow
- Plot gradient norms per layer
- Look for vanishing/exploding patterns
Gradient checking
- Verify backprop is correct for custom operations
- Use small networks for speed

Debugging Checklist

def debug_training(model, batch, loss_fn):
    """Comprehensive debugging for one batch."""

    # 1. Check input
    x, y = batch
    print(f"Input: shape={x.shape}, range=[{x.min():.2f}, {x.max():.2f}]")
    print(f"Labels: shape={y.shape}, unique={torch.unique(y).tolist()}")
    assert not torch.isnan(x).any(), "NaN in input!"

    # 2. Forward pass
    model.train()
    output = model(x)
    print(f"Output: shape={output.shape}, range=[{output.min():.2f}, {output.max():.2f}]")
    assert not torch.isnan(output).any(), "NaN in output!"

    # 3. Loss
    loss = loss_fn(output, y)
    print(f"Loss: {loss.item():.4f}")
    assert not torch.isnan(loss), "NaN loss!"

    # 4. Backward pass
    loss.backward()

    # 5. Check gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad = param.grad
            print(f"{name}: grad_norm={grad.norm():.2e}, "
                  f"grad_range=[{grad.min():.2e}, {grad.max():.2e}]")
            assert not torch.isnan(grad).any(), f"NaN gradient in {name}!"

    print("All checks passed!")

Hands-On Exercises

These exercises will solidify your understanding of backpropagation by having you implement and debug gradient computations yourself.

Exercise 1: Extend the Autograd Engine

Take the Value class we built earlier and extend it with additional operations.

Your task: Implement these missing methods:

class Value:
    # ... (existing code from above)

    def log(self):
        """Natural logarithm. d(log(x))/dx = 1/x"""
        import math
        # YOUR CODE HERE
        pass

    def sigmoid(self):
        """Sigmoid function. σ(x) = 1/(1+e^(-x)), σ'(x) = σ(x)(1-σ(x))"""
        # YOUR CODE HERE
        pass

    def __matmul__(self, other):
        """
        Implement dot product for two Value vectors.
        This is trickier - think about how gradients flow through a sum of products.
        """
        # YOUR CODE HERE
        pass

Solution approach:

def log(self):
    import math
    out = Value(math.log(self.data), (self,), 'log')

    def _backward():
        # d(log(x))/dx = 1/x
        self.grad += out.grad * (1.0 / self.data)
    out._backward = _backward

    return out

def sigmoid(self):
    import math
    s = 1.0 / (1.0 + math.exp(-self.data))
    out = Value(s, (self,), 'sigmoid')

    def _backward():
        # d(sigmoid(x))/dx = sigmoid(x) * (1 - sigmoid(x))
        self.grad += out.grad * s * (1 - s)
    out._backward = _backward

    return out

Verify your implementation:

# Test log
x = Value(2.0)
y = x.log()
y.backward()
print(f"log(2) = {y.data:.4f}, should be ~0.6931")
print(f"d(log(x))/dx at x=2 is {x.grad:.4f}, should be 0.5")

# Test sigmoid
x = Value(0.0)
y = x.sigmoid()
y.backward()
print(f"sigmoid(0) = {y.data:.4f}, should be 0.5")
print(f"d(sigmoid(x))/dx at x=0 is {x.grad:.4f}, should be 0.25")

Exercise 2: Trace Through Backprop by Hand

Manually compute gradients for this computation graph:

L = (a * b + c)² where a=2, b=3, c=1

Step 1: Draw the graph

    a(2)    b(3)
       \    /
        [*]
         |
         v
        z1(6)     c(1)
          \       /
           \     /
            [+]
             |
             v
            z2(7)
             |
            [²]
             |
             v
            L(49)

Step 2: Forward pass (verify values)

z1 = a * b = 2 * 3 = 6
z2 = z1 + c = 6 + 1 = 7
L = z2² = 49

Step 3: Backward pass (compute gradients)

dL/dL = 1 (start here)
dL/dz2 = dL/dL * d(z2²)/dz2 = 1 * 2z2 = 27 = 14
dL/dz1 = dL/dz2 * d(z1+c)/dz1 = 14 * 1 = 14
dL/dc = dL/dz2 * d(z1+c)/dc = 14 * 1 = 14
dL/da = dL/dz1 * d(a*b)/da = 14 * b = 14 * 3 = 42
dL/db = dL/dz1 * d(a*b)/db = 14 * a = 14 * 2 = 28

Your task: Verify with our autograd engine:

a = Value(2.0)
b = Value(3.0)
c = Value(1.0)
L = (a * b + c) ** 2
L.backward()

print(f"L = {L.data}")         # Should be 49
print(f"dL/da = {a.grad}")     # Should be 42
print(f"dL/db = {b.grad}")     # Should be 28
print(f"dL/dc = {c.grad}")     # Should be 14

Exercise 3: Implement Gradient Clipping

Gradient clipping prevents exploding gradients by limiting gradient magnitudes.

Your task: Implement two types of gradient clipping:

def clip_grad_value(parameters, clip_value):
    """
    Clip gradients by value.
    Each gradient component is clipped to [-clip_value, clip_value].

    Example: grad = [5, -10, 3] with clip_value=4 becomes [4, -4, 3]
    """
    # YOUR CODE HERE
    for param in parameters:
        if param.grad is not None:
            param.grad.data.clamp_(-clip_value, clip_value)


def clip_grad_norm(parameters, max_norm):
    """
    Clip gradients by global norm.
    All gradients are scaled so their combined L2 norm <= max_norm.

    Example: If total norm is 10 and max_norm is 5, scale all grads by 0.5
    """
    # YOUR CODE HERE
    total_norm = 0.0
    for param in parameters:
        if param.grad is not None:
            total_norm += param.grad.data.pow(2).sum()
    total_norm = total_norm.sqrt()

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for param in parameters:
            if param.grad is not None:
                param.grad.data.mul_(clip_coef)

    return total_norm

Test your implementation:

import torch

# Create a model with exploding gradients
model = torch.nn.Linear(10, 10)
x = torch.randn(1, 10) * 100  # Large input
y = model(x)
loss = y.sum()
loss.backward()

# Print gradient norms before clipping
print("Before clipping:")
for name, param in model.named_parameters():
    print(f"  {name}: grad_norm = {param.grad.norm():.2f}")

# Clip gradients
clip_grad_norm(model.parameters(), max_norm=1.0)

# Print gradient norms after clipping
print("After clipping (max_norm=1.0):")
total = 0
for name, param in model.named_parameters():
    print(f"  {name}: grad_norm = {param.grad.norm():.2f}")
    total += param.grad.norm()**2
print(f"  Total norm: {total.sqrt():.2f}")  # Should be <= 1.0

Exercise 4: Debug a Broken Training Loop

The following training loop has bugs that cause NaN loss. Find and fix them.

import torch
import torch.nn as nn

def broken_training():
    """This training loop will produce NaN. Fix the bugs!"""

    # Bug 1: No seed for reproducibility
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.Sigmoid(),  # Bug 2: Sigmoid can cause vanishing gradients in deep nets
        nn.Linear(50, 50),
        nn.Sigmoid(),
        nn.Linear(50, 1)
    )

    optimizer = torch.optim.SGD(model.parameters(), lr=10.0)  # Bug 3: LR too high

    for epoch in range(100):
        x = torch.randn(32, 10)
        y = torch.randn(32, 1)

        output = model(x)
        loss = torch.log(output)  # Bug 4: log of potentially negative values!
        loss = loss.mean()

        # Bug 5: No zero_grad!
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss = {loss.item():.4f}")

# Run it - watch it fail
broken_training()

Your task: Fix all 5 bugs and get the training to work.

Fixed version:

def fixed_training():
    """Corrected training loop."""

    # Fix 1: Set seed for reproducibility
    torch.manual_seed(42)

    # Fix 2: Use ReLU instead of Sigmoid
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 50),
        nn.ReLU(),
        nn.Linear(50, 1)
    )

    # Fix 3: Use reasonable learning rate
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(100):
        x = torch.randn(32, 10)
        y = torch.randn(32, 1)

        # Fix 5: Zero gradients at start of each iteration
        optimizer.zero_grad()

        output = model(x)
        # Fix 4: Use proper loss function (MSE instead of log)
        loss = nn.functional.mse_loss(output, y)

        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: loss = {loss.item():.4f}")

fixed_training()

Exercise 5: Visualize Gradient Flow

Build a tool to visualize how gradients flow through a network.

Your task: Implement this gradient visualization:

import matplotlib.pyplot as plt
import torch
import torch.nn as nn

def visualize_gradient_flow(model, sample_input, sample_target, loss_fn):
    """
    Creates a visualization of gradient magnitudes through the network.

    This helps diagnose:
    - Vanishing gradients (later layers have much larger gradients)
    - Exploding gradients (earlier layers have much larger gradients)
    - Dead neurons (gradients are exactly zero)
    """
    # Forward pass
    output = model(sample_input)
    loss = loss_fn(output, sample_target)

    # Backward pass
    loss.backward()

    # Collect gradient statistics
    layer_names = []
    grad_means = []
    grad_stds = []
    grad_maxs = []

    for name, param in model.named_parameters():
        if param.grad is not None:
            grad = param.grad.detach().cpu().numpy()
            layer_names.append(name)
            grad_means.append(abs(grad).mean())
            grad_stds.append(grad.std())
            grad_maxs.append(abs(grad).max())

    # Plot
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    x = range(len(layer_names))

    axes[0].bar(x, grad_means)
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(layer_names, rotation=45, ha='right')
    axes[0].set_title('Mean |Gradient|')
    axes[0].set_yscale('log')

    axes[1].bar(x, grad_stds)
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(layer_names, rotation=45, ha='right')
    axes[1].set_title('Gradient Std Dev')
    axes[1].set_yscale('log')

    axes[2].bar(x, grad_maxs)
    axes[2].set_xticks(x)
    axes[2].set_xticklabels(layer_names, rotation=45, ha='right')
    axes[2].set_title('Max |Gradient|')
    axes[2].set_yscale('log')

    plt.tight_layout()
    return fig

# Test it
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))

fig = visualize_gradient_flow(model, x, y, nn.CrossEntropyLoss())
plt.show()

Questions to answer after running:

Are the gradient magnitudes similar across layers, or do they vary wildly?
What happens if you add more layers?
What happens if you replace ReLU with Sigmoid?
What happens if you add BatchNorm?

Exercise 6: Implement Your Own Autograd for a Neural Network

This is the capstone exercise. Build a 2-layer neural network using only our Value class, train it on XOR, and verify it learns.

# The XOR problem - not linearly separable!
# Input: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0

import random

def train_xor():
    """Train a 2-layer network on XOR using our custom autograd."""
    random.seed(42)

    # Network: 2 -> 4 -> 1
    # Initialize weights
    W1 = [[Value(random.uniform(-1, 1)) for _ in range(2)] for _ in range(4)]
    b1 = [Value(0) for _ in range(4)]
    W2 = [Value(random.uniform(-1, 1)) for _ in range(4)]
    b2 = Value(0)

    def forward(x1, x2):
        """Forward pass through the network."""
        x = [Value(x1), Value(x2)]

        # Hidden layer
        hidden = []
        for i in range(4):
            z = W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]
            hidden.append(z.tanh())

        # Output layer
        out = W2[0] * hidden[0] + W2[1] * hidden[1] + W2[2] * hidden[2] + W2[3] * hidden[3] + b2
        return out.tanh()

    # Training data
    data = [
        ((0, 0), 0),
        ((0, 1), 1),
        ((1, 0), 1),
        ((1, 1), 0),
    ]

    # Training loop
    learning_rate = 0.5
    params = [w for row in W1 for w in row] + b1 + W2 + [b2]

    for epoch in range(1000):
        total_loss = 0

        for (x1, x2), target in data:
            # Forward
            pred = forward(x1, x2)
            loss = (pred - Value(target)) ** 2
            total_loss += loss.data

            # Backward
            for p in params:
                p.grad = 0  # Zero gradients
            loss.backward()

            # Update
            for p in params:
                p.data -= learning_rate * p.grad

        if epoch % 100 == 0:
            print(f"Epoch {epoch}: loss = {total_loss:.4f}")

    # Test
    print("\nFinal predictions:")
    for (x1, x2), target in data:
        pred = forward(x1, x2)
        print(f"  ({x1}, {x2}) -> {pred.data:.3f} (target: {target})")

train_xor()

Expected output (approximately):

Epoch 0: loss = 1.8234
Epoch 100: loss = 0.9821
Epoch 200: loss = 0.3456
...
Epoch 900: loss = 0.0123

Final predictions:
  (0, 0) -> 0.012 (target: 0)
  (0, 1) -> 0.987 (target: 1)
  (1, 0) -> 0.991 (target: 1)
  (1, 1) -> 0.015 (target: 0)

Quiz: Test Your Understanding

Q1: Why do we process the computation graph in reverse order during backprop?

Answer

We need to compute gradients from the output back to the inputs because:

The chain rule multiplies local gradients — we need ∂L/∂output first before computing ∂L/∂earlier_nodes
Each node needs the gradient from its outputs to compute the gradient for its inputs
Reverse topological order ensures that when we process a node, all its outputs have already been processed

This is why it’s called “back”-propagation — we propagate errors backward through the graph.

Q2: What’s the computational complexity advantage of reverse-mode autodiff over forward-mode?

Answer

For a function with n inputs and m outputs:

Reverse-mode: O(m) backward passes to get all gradients
Forward-mode: O(n) forward passes to get all gradients

In neural networks, we typically have millions of parameters (n) but one scalar loss (m=1). So reverse-mode needs just 1 backward pass to get gradients for all parameters, while forward-mode would need millions of passes.

This O(1) vs O(n) difference is why backpropagation (reverse-mode) is used in deep learning.

Q3: A network has vanishing gradients. What are three solutions?

Answer

Use ReLU instead of sigmoid/tanh: ReLU has gradient 1 for positive inputs, preventing multiplicative shrinking
Add skip/residual connections: Allow gradients to flow directly through addition: ∂(x + f(x))/∂x = 1 + ∂f/∂x, so even if ∂f/∂x→0, gradient is still 1
Use batch normalization: Normalizes activations to prevent them from saturating in sigmoid/tanh regions

Other solutions: proper initialization (Xavier/He), gradient clipping (for exploding), LSTM/GRU gating (for RNNs).

Q4: Your loss becomes NaN after 100 epochs. How do you debug?

Answer

Systematic approach:

Add gradient hooks to detect when NaN first appears
Binary search on epochs to find when it starts
Check for:
- Division by zero: Look for / operations with potentially zero denominators
- Log of zero: log(x) where x could be 0 or negative
- Exploding gradients: Print gradient norms, look for exponential growth
- Large learning rate: Try 10x smaller LR
Reproduce with deterministic seed for consistent debugging
Simplify: Remove layers until you find the culprit

Q5: What is gradient checking and when would you use it?

Answer

Gradient checking compares analytical gradients (from backprop) with numerical gradients (from finite differences):

numerical_grad ≈ [f(x+ε) - f(x-ε)] / (2ε)

Use it when:

Implementing new layers or custom operations
Debugging unexpected training behavior
Verifying complex loss functions

Don’t use it:

In production (too slow — O(n) forward passes)
On large networks (use on small test cases)

A relative error < 1e-5 indicates correct implementation.

Summary

You’ve learned:

The chain rule is the mathematical foundation of all neural network learning
Computational graphs track operations for automatic differentiation
Reverse-mode autodiff (backprop) computes all gradients in one backward pass
Building autograd from scratch shows there’s no magic — just calculus
Gradient checking verifies implementations with numerical approximation
Common problems (vanishing, exploding, dead ReLU) and their solutions
Debugging techniques for systematic troubleshooting

Every time you call loss.backward(), you now understand exactly what happens inside!

The Heureka Moment (Revisited)

Backpropagation is just the chain rule applied systematically.

When you call loss.backward():

The framework has been recording every operation you performed (building a computational graph)
It starts at the loss (gradient = 1)
It walks backward through the graph, applying the chain rule at each operation
At each node, it multiplies the incoming gradient by the local gradient
When it reaches a parameter, that accumulated gradient tells us how to update it

This elegant algorithm, combined with GPUs and big data, is why deep learning works. From a simple linear regression to gpt-5, the learning mechanism is fundamentally the same.

You now understand the engine that powers all of modern AI!

Next Steps

Congratulations! You’ve completed Phase 6: Deep Learning Foundations!

You now understand:

Python for ML (NumPy, Pandas, scikit-learn)
Neural networks from scratch
PyTorch fundamentals
Training techniques (optimizers, regularization, scheduling)
CNNs for computer vision
Transformers and attention
Backpropagation and autograd

Move on to Phase 7: Advanced Generative AI where you’ll learn:

Fine-tuning and PEFT methods
Reinforcement Learning from Human Feedback (RLHF)
Diffusion models and image generation
Multimodal models
Efficient inference

Last updated: 2025-11-27 Status: Complete

Transformers from Scratch

Or: The Algorithm That Taught Machines to Learn (And How to Build It Yourself)

The Idea That Almost Died Twice

What You’ll Be Able to Do

The Heureka Moment: What You’re About to Discover

The Problem: How Do We Update Weights?

The Chain Rule: The Foundation of Everything

The Single-Variable Chain Rule

The Multi-Variable Chain Rule

Computational Graphs: Tracking Operations

Example: Simple Expression

Forward Pass

Backward Pass

Building an Autograd Engine from Scratch

The Value Class

Testing Our Autograd

The Backpropagation Algorithm

Forward Pass Through a Layer

Backward Pass Through a Layer

Worked Example: Two-Layer Network

Gradient Checking: Verifying Your Implementation

Numerical Gradient

Implementation

When to Use Gradient Checking

Common Gradient Problems

1. Vanishing Gradients

2. Exploding Gradients

3. Dead ReLU

4. NaN Gradients

Gradient Flow Visualization

Gradient Histograms

Gradient Norms per Layer

What to Look For

Advanced Topics

Forward-Mode vs Reverse-Mode Autodiff

Higher-Order Gradients

Checkpointing for Memory Efficiency

Practical Debugging Guide

Systematic Debugging Steps

Debugging Checklist

Hands-On Exercises

Exercise 1: Extend the Autograd Engine

Exercise 2: Trace Through Backprop by Hand

Exercise 3: Implement Gradient Clipping

Exercise 4: Debug a Broken Training Loop

Exercise 5: Visualize Gradient Flow

Exercise 6: Implement Your Own Autograd for a Neural Network

Quiz: Test Your Understanding

Summary

Further Reading

Essential Resources

Historical Papers

The Heureka Moment (Revisited)

Next Steps