PyTorch Fundamentals

AI/ML Engineering Track | Complexity: [MEDIUM] | Time: 6-8

Or: Building a Brain with Nothing but Math

Reading Time: 7-8 hours Phase: 6 - Deep Learning Foundations

The Day a Machine Learned to See: When Math Became Magic

Toronto. December 3, 2012. 4:47 PM.

Geoffrey Hinton was staring at his computer screen, trying to process what he was seeing. For thirty years, he’d been a voice in the wilderness, insisting that neural networks could work if we just had enough data and computing power. His colleagues had called him stubborn. Funding agencies had called him delusional. The AI winters had frozen out most of his peers.

But today was different.

The ImageNet Large Scale Visual Recognition Challenge results were in. Hinton’s team—two graduate students, Alex Krizhevsky and Ilya Sutskever—had entered a neural network called “AlexNet.” The best competing systems achieved error rates around 26%. AlexNet achieved 15.3%.

Hinton read the number again: 15.3%.

Not slightly better. Not incrementally improved. A single neural network had reduced the error rate by more than 10 percentage points—a jump so large that reviewers initially assumed it was a mistake.

“I remember thinking: this changes everything. The ideas we’d been developing for decades—they actually work. We just needed scale.” — Geoffrey Hinton, reflecting on the moment in 2017

The phone started ringing. Colleagues who’d dismissed neural networks for years suddenly wanted to know more. Within months, Google acquired Hinton’s startup for $44 million. Within a year, neural networks would be the dominant paradigm in AI. Within a decade, they would power everything from voice assistants to self-driving cars to ChatGPT.

But here’s the remarkable part: the core algorithm that made AlexNet work was invented in 1986. The math hadn’t changed. The architecture was similar to networks from the 1990s. What changed was data, compute, and a few clever tricks.

In this module, you’re going to learn that algorithm from scratch. Not by using PyTorch or TensorFlow, but by building every component yourself in pure Python and NumPy. By the end, you’ll understand exactly what happens when a neural network “learns”—and you’ll have built one that can read handwritten digits with 97% accuracy.

This isn’t just historical education. The engineers who understand what’s happening under the hood—not just how to call .fit()—are the ones who debug the hard problems, design new architectures, and push the field forward. Today, you join their ranks.

What You’ll Be Able to Do

By the end of this module, you will:

Understand what a neural network actually IS (not just how to use one)
Implement forward propagation from scratch
Derive and implement backpropagation by hand
Train a network on MNIST without ANY frameworks
Visualize the learning process
Understand WHY neural networks work

Introduction: What IS a Neural Network?

You’ve used neural networks through APIs and frameworks. You’ve seen them perform magic—generating text, recognizing images, understanding speech. But what’s actually happening inside?

A neural network is just a function.

That’s it. A very complex, parameterized function that takes inputs and produces outputs. The “learning” is just finding the right parameters.

Think of it like a combination lock with millions of dials. Each dial is a parameter. When you “train” a neural network, you’re slowly turning each dial until the lock opens—until the function produces the right outputs for your inputs. The genius of neural networks isn’t the lock itself; it’s the algorithm that figures out how to turn all those dials simultaneously.

Neural Network = f(x; θ)

Where:
  x = input (image, text, numbers)
  θ = parameters (weights and biases)
  f = the function we're learning

The magic isn’t in the architecture - it’s in the learning algorithm that finds good values for θ.

Did You Know? The Tumultuous History of Neural Networks

The story of neural networks is a tale of boom and bust, faith and vindication. Understanding this history helps you appreciate why the field looks the way it does today.

The First Neural Network: 1943

In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts published “A Logical Calculus of the Ideas Immanent in Nervous Activity.” They proposed the first mathematical model of a neuron.

“We propose to show how neurons might be connected to perform logical operations.” — McCulloch & Pitts, 1943

Their model was simple: a neuron either fires (1) or doesn’t (0), based on whether the weighted sum of its inputs exceeds a threshold. This was the birth of artificial neural networks.

The Perceptron: 1958

Frank Rosenblatt at Cornell built the Mark I Perceptron - actual hardware that could learn to recognize simple patterns. The New York Times reported:

“The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

The hype was real. And wildly premature.

The AI Winter: 1969

In 1969, Marvin Minsky and Seymour Papert published “Perceptrons,” a book that mathematically proved single-layer perceptrons couldn’t learn XOR (a simple logical function). The conclusion was devastating:

“The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations.”

Funding for neural networks collapsed. Researchers moved on. This began the first “AI Winter.”

What they missed: Multi-layer networks COULD learn XOR. But training them was the unsolved problem.

The Backpropagation Revolution: 1986

The solution had actually been discovered multiple times:

Paul Werbos (1974) - In his PhD thesis, mostly ignored
David Parker (1985) - Rediscovered independently
Geoffrey Hinton, David Rumelhart, Ronald Williams (1986) - Made it famous

The 1986 paper “Learning representations by back-propagating errors” in Nature showed that multi-layer networks could be trained efficiently using backpropagation. Neural networks were back.

The Second AI Winter: 1990s

Despite backpropagation, neural networks hit walls:

Training was slow (no GPUs)
Vanishing gradients killed deep networks
Support Vector Machines often worked better
Funding dried up again

Geoffrey Hinton was one of the few who kept the faith. He later said:

“I was pretty much the only person in the world who still believed in neural networks.”

The Deep Learning Revolution: 2012

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet competition with “AlexNet” - a deep convolutional neural network trained on GPUs.

They didn’t just win. They crushed the competition, reducing error rates from 26% to 15%.

The world noticed. The deep learning revolution began.

The irony: The core ideas (backpropagation, convolutions) were decades old. What changed was:

Data: ImageNet provided millions of labeled images
Compute: GPUs made training fast
Techniques: ReLU, dropout, batch norm solved training issues

The Building Block: A Single Neuron

Let’s start with the simplest possible neural network: one neuron.

The Biological Inspiration

A biological neuron works like a tiny decision-maker in your brain:

It receives signals from other neurons through dendrites (like antennas collecting radio signals)
If the combined signal exceeds a threshold, it fires (like a voter who only acts when enough arguments convince them)
The signal travels down the axon to other neurons (like sending a message down a wire)

Your brain has about 86 billion of these neurons, each connected to thousands of others. The miracle of intelligence emerges from simple units doing simple things, connected in complex patterns.

    Dendrites          Cell Body         Axon
    (inputs)           (processing)      (output)

    x₁ ─────┐
            │
    x₂ ─────┼──────► [ Σ + f ] ──────► y
            │
    x₃ ─────┘

The Artificial Neuron

An artificial neuron does something similar:

def neuron(inputs, weights, bias):
    # 1. Weighted sum of inputs
    z = sum(x * w for x, w in zip(inputs, weights)) + bias

    # 2. Apply activation function
    output = activation(z)

    return output

Mathematically:

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = w·x + b
y = f(z)

Where:

x = inputs (features)
w = weights (learned parameters)
b = bias (learned parameter)
f = activation function
y = output

Activation Functions

The activation function introduces non-linearity. Without it, stacking neurons would just give you another linear function.

Sigmoid (Historical)

σ(z) = 1 / (1 + e^(-z))

Output range: (0, 1)
Problem: Vanishing gradients for large |z|
Used for: Binary classification outputs

Tanh (Historical)

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Output range: (-1, 1)
Zero-centered (better than sigmoid)
Still has vanishing gradient problem

ReLU (Modern Standard)

ReLU(z) = max(0, z)

Output range: [0, ∞)
No vanishing gradient for positive values
Computationally efficient
Problem: “Dead neurons” (always output 0)

Leaky ReLU

LeakyReLU(z) = z if z > 0 else 0.01 * z

Fixes the dead neuron problem
Small gradient for negative values

Visualization of Activation Functions:

Sigmoid:                ReLU:
    1 ─────────            │     ╱
      │      ╱             │    ╱
  0.5 │─────╱              │   ╱
      │    ╱               │  ╱
    0 ─────                └──────
     -5    0    5         -2  0  2  4

From Neuron to Network

One neuron can only learn linear patterns. To learn complex patterns, we stack neurons into layers.

Network Architecture

Input Layer    Hidden Layer(s)    Output Layer
    x₁ ─────┐
            ├──► h₁ ─────┐
    x₂ ─────┼            ├──► y₁
            ├──► h₂ ─────┤
    x₃ ─────┼            ├──► y₂
            ├──► h₃ ─────┘
    x₄ ─────┘

    Features   Learned         Predictions
               Representations

Fully Connected (Dense) Layer: Every neuron connects to every neuron in the next layer.

Why This Module Matters

Think of it like an assembly line in a factory. Raw materials enter, and each station transforms them into something more refined. By the end, you have a finished product—but no single station could have built it alone.

Each layer learns increasingly abstract representations:

Image Recognition Example:

Layer 1: Edges and colors
Layer 2: Textures and patterns
Layer 3: Parts (eyes, wheels)
Layer 4: Objects (faces, cars)

This is the key insight of deep learning: hierarchical feature learning.

Matrix Formulation

For efficiency, we represent layer computations as matrix operations:

# Single sample
z = W @ x + b     # Linear transformation
a = activation(z)  # Non-linearity

# Batch of samples
Z = W @ X + b     # X is (features, batch_size)
A = activation(Z)

Where:

W = weight matrix (output_neurons × input_neurons)
X = input matrix (input_neurons × batch_size)
b = bias vector (output_neurons × 1)
Z = pre-activation (output_neurons × batch_size)
A = activation (output_neurons × batch_size)

️ Forward Propagation

Forward propagation is computing the output given an input. It’s just applying each layer’s transformation in sequence.

Algorithm

def forward(X, parameters):
    """
    Forward propagation through L layers.

    Args:
        X: Input data (n_features, m_samples)
        parameters: Dict with W1, b1, W2, b2, ..., WL, bL

    Returns:
        AL: Final output
        caches: Intermediate values (needed for backprop)
    """
    caches = []
    A = X
    L = len(parameters) // 2  # Number of layers

    # Hidden layers (with ReLU)
    for l in range(1, L):
        A_prev = A
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']

        Z = W @ A_prev + b
        A = relu(Z)

        caches.append((A_prev, W, b, Z))

    # Output layer (with sigmoid for binary classification)
    W = parameters[f'W{L}']
    b = parameters[f'b{L}']

    Z = W @ A + b
    AL = sigmoid(Z)

    caches.append((A, W, b, Z))

    return AL, caches

Example: 2-Layer Network

Input: X (784 features - flattened 28x28 image)
Hidden: 128 neurons with ReLU
Output: 10 neurons with softmax (digits 0-9)

Forward pass:
1. Z1 = W1 @ X + b1      # (128, 784) @ (784, m) = (128, m)
2. A1 = ReLU(Z1)         # (128, m)
3. Z2 = W2 @ A1 + b2     # (10, 128) @ (128, m) = (10, m)
4. A2 = softmax(Z2)      # (10, m) - probabilities for each class

The Loss Function

How do we know if our network’s predictions are good? We need a loss function (also called cost function or objective function).

Binary Cross-Entropy Loss

For binary classification (yes/no):

L(y, ŷ) = -[y * log(ŷ) + (1-y) * log(1-ŷ)]

For m samples:

J = -(1/m) * Σ[y * log(ŷ) + (1-y) * log(1-ŷ)]

Intuition:

If y=1 and ŷ=1: loss ≈ 0 (correct and confident)
If y=1 and ŷ=0: loss → ∞ (wrong and confident)

Categorical Cross-Entropy Loss

For multi-class classification:

L(y, ŷ) = -Σ yᵢ * log(ŷᵢ)

Where y is one-hot encoded (e.g., [0,0,1,0,0,0,0,0,0,0] for digit “2”).

Mean Squared Error (MSE)

For regression:

J = (1/m) * Σ(y - ŷ)²

⬅️ Backpropagation: The Key Algorithm

Backpropagation is how neural networks learn. It computes the gradient of the loss with respect to every parameter. This is the algorithm that Geoffrey Hinton championed for decades, the one that finally proved its worth in 2012.

Think of it like tracing blame through an organization. When something goes wrong (high loss), you need to figure out who was responsible (which parameters contributed to the error) and by how much. Backpropagation is the accounting system that assigns credit and blame to millions of parameters simultaneously.

The Chain Rule

Backpropagation is just the chain rule from calculus, applied systematically. If you remember anything from calculus, the chain rule lets you compute derivatives of composed functions: how a change in an input ripples through to affect the output.

If y = f(g(x)), then:

dy/dx = dy/dg * dg/dx

In a neural network, the loss depends on parameters through a chain of operations:

Loss ← Output ← Hidden ← ... ← Input ← Parameters

To find how the loss changes with respect to a parameter, we multiply gradients along the path.

Deriving Backpropagation

Let’s derive backprop for a 2-layer network step by step.

Forward pass:

Z1 = W1 @ X + b1
A1 = ReLU(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)
L = cross_entropy(Y, A2)

Backward pass (computing gradients):

Output layer gradient:

dL/dZ2 = A2 - Y     # For sigmoid + cross-entropy

Output layer parameters:

dL/dW2 = (1/m) * dZ2 @ A1.T
dL/db2 = (1/m) * sum(dZ2, axis=1)

Hidden layer gradient:

dL/dA1 = W2.T @ dZ2
dL/dZ1 = dL/dA1 * ReLU'(Z1)    # Element-wise

Hidden layer parameters:

dL/dW1 = (1/m) * dZ1 @ X.T
dL/db1 = (1/m) * sum(dZ1, axis=1)

The Backpropagation Algorithm

def backward(AL, Y, caches):
    """
    Backward propagation through L layers.

    Args:
        AL: Final output from forward prop
        Y: True labels
        caches: Intermediate values from forward prop

    Returns:
        gradients: Dict with dW1, db1, dW2, db2, ..., dWL, dbL
    """
    gradients = {}
    L = len(caches)
    m = AL.shape[1]

    # Output layer (sigmoid + cross-entropy)
    dAL = -(Y / AL) + (1 - Y) / (1 - AL)

    # For sigmoid: dZ = dA * sigmoid'(Z) = dA * A * (1 - A)
    A_prev, W, b, Z = caches[L-1]
    dZ = AL - Y  # Simplified for sigmoid + cross-entropy

    gradients[f'dW{L}'] = (1/m) * dZ @ A_prev.T
    gradients[f'db{L}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ

    # Hidden layers (ReLU)
    for l in reversed(range(L-1)):
        A_prev, W, b, Z = caches[l]

        # ReLU derivative: 1 if Z > 0, else 0
        dZ = dA_prev * (Z > 0)

        gradients[f'dW{l+1}'] = (1/m) * dZ @ A_prev.T
        gradients[f'db{l+1}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)

        if l > 0:
            dA_prev = W.T @ dZ

    return gradients

Visualizing Backpropagation

Forward:  X ───► Z1 ───► A1 ───► Z2 ───► A2 ───► Loss
                ↑        ↑        ↑        ↑
               W1       ---      W2       ---

Backward: dX ◄─── dZ1 ◄─── dA1 ◄─── dZ2 ◄─── dA2 ◄─── dLoss
                  ↓               ↓
                 dW1             dW2
                 db1             db2

Gradient Descent

Now that we can compute gradients, we need to update parameters to minimize the loss.

Basic Gradient Descent

def update_parameters(parameters, gradients, learning_rate):
    """
    Update parameters using gradient descent.

    θ = θ - α * ∂L/∂θ
    """
    L = len(parameters) // 2

    for l in range(1, L + 1):
        parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
        parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']

    return parameters

Why Gradient Descent Works

Imagine you’re blindfolded on a hilly landscape, trying to find the lowest point. You can’t see anything, but you can feel the slope under your feet. Here’s your strategy:

Feel the slope under your feet (compute gradient)
Take a step downhill (update parameters)
Repeat until you can’t go lower

The gradient always points “uphill,” so going in the opposite direction takes you “downhill” toward lower loss. It’s like water flowing downhill—it always finds the path of steepest descent, eventually settling in a valley.

The remarkable thing is that this simple strategy works even in spaces with millions of dimensions. In a neural network with 10 million parameters, you’re navigating a 10-million-dimensional landscape—far beyond human imagination—but the math doesn’t care. Gradient descent still finds the way down.

Learning Rate

The learning rate (α) controls step size:

Too small: Slow convergence, may get stuck
Too large: Overshooting, unstable training
Just right: Smooth convergence to minimum

Loss
  │
  │  ╲      Learning rate too large
  │   ╲╱╲ (overshooting)
  │      ╲
  │       ───── Just right
  │
  └────────────── iterations

Variants of Gradient Descent

Batch Gradient Descent:

Use ALL samples to compute gradient
Slow for large datasets
Smooth but expensive

Stochastic Gradient Descent (SGD):

Use ONE sample per update
Noisy but fast
Can escape local minima

Mini-batch Gradient Descent:

Use a small batch (32, 64, 128 samples)
Best of both worlds
Most common in practice

The Complete Training Loop

Putting it all together:

def train(X, Y, layer_dims, learning_rate=0.01, num_iterations=1000):
    """
    Train a neural network.

    Args:
        X: Training data (n_features, m_samples)
        Y: Labels (n_classes, m_samples)
        layer_dims: List of layer sizes [n_x, n_h1, n_h2, ..., n_y]
        learning_rate: Step size for gradient descent
        num_iterations: Number of training iterations

    Returns:
        parameters: Trained weights and biases
        costs: Loss history
    """
    # Initialize parameters
    parameters = initialize_parameters(layer_dims)
    costs = []

    for i in range(num_iterations):
        # Forward propagation
        AL, caches = forward(X, parameters)

        # Compute cost
        cost = compute_cost(AL, Y)
        costs.append(cost)

        # Backward propagation
        gradients = backward(AL, Y, caches)

        # Update parameters
        parameters = update_parameters(parameters, gradients, learning_rate)

        # Print progress
        if i % 100 == 0:
            print(f"Iteration {i}: cost = {cost:.4f}")

    return parameters, costs

MNIST: Our Training Ground

MNIST is the “Hello World” of deep learning - a dataset of 70,000 handwritten digits.

The Dataset

Training: 60,000 images
Test:     10,000 images
Size:     28x28 pixels (784 features when flattened)
Classes:  10 (digits 0-9)

Sample images:
┌────────┐ ┌────────┐ ┌────────┐
│   ██   │ │    █   │ │  ███   │
│  █  █  │ │   ██   │ │     █  │
│ █    █ │ │    █   │ │   ██   │
│ █    █ │ │    █   │ │     █  │
│  ████  │ │   ███  │ │  ███   │
└────────┘ └────────┘ └────────┘
    0          1          2

Why MNIST?

Small enough: Can train on CPU in minutes
Hard enough: Requires real learning
Well-understood: Benchmark results available
Visual: Easy to interpret results

Target Accuracy

Model	Accuracy
Random guessing	10%
Simple neural net (no hidden layers)	~92%
2-layer neural net	~97%
CNN (state of art)	~99.8%

Our goal: Build a 2-layer network achieving ~97% accuracy.

Did You Know? MNIST Trivia

MNIST has become so central to machine learning education that it’s worth understanding its role and limitations.

The Origin

MNIST was created by Yann LeCun (now at Meta AI) and colleagues in 1998. It combines modified samples from NIST’s original dataset of handwritten digits.

The “Drosophila of Machine Learning”

MNIST is often called the “fruit fly of machine learning” - a simple model organism for experimenting. Just as biologists use fruit flies to understand genetics, ML researchers use MNIST to test new ideas.

It’s Actually Too Easy Now

Modern techniques achieve >99.5% accuracy on MNIST. Researchers have moved on to harder datasets like:

CIFAR-10: 60,000 color images in 10 classes
ImageNet: 14 million images in 20,000+ classes
Fashion-MNIST: Harder clothing items (drop-in MNIST replacement)

Human Performance

Humans achieve about 97.5% accuracy on MNIST (yes, some digits are ambiguous!). A well-tuned neural network can beat human performance.

️ Implementation Details

Weight Initialization

Bad initialization can kill training:

# BAD: All zeros (all neurons learn the same thing)
W = np.zeros((n_out, n_in))

# BAD: Large random (exploding activations)
W = np.random.randn(n_out, n_in) * 100

# GOOD: Xavier/Glorot initialization
W = np.random.randn(n_out, n_in) * np.sqrt(1 / n_in)

# BETTER: He initialization (for ReLU)
W = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)

Why He initialization?

Keeps variance of activations roughly constant across layers
Prevents vanishing/exploding gradients
Named after Kaiming He (Microsoft Research, 2015)

Numerical Stability

# BAD: Can overflow/underflow
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# GOOD: Numerically stable
def sigmoid(z):
    z = np.clip(z, -500, 500)  # Prevent overflow
    return 1 / (1 + np.exp(-z))

# Cross-entropy with stability
def cross_entropy(y, y_hat):
    epsilon = 1e-15
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

Softmax for Multi-class

def softmax(z):
    """
    Softmax function for multi-class classification.
    Converts logits to probabilities that sum to 1.
    """
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

Visualizing Learning

Loss Curve

plt.plot(costs)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training Loss')

A healthy loss curve:

Loss
  │
2.0├───╲
   │    ╲
1.0├─────╲────
   │       ╲──────
0.5├───────────────
   │
   └─────────────────
   0    500   1000  iterations

Decision Boundaries

For 2D data, we can visualize what the network learned:

def plot_decision_boundary(model, X, Y):
    # Create grid
    x_min, x_max = X[0, :].min() - 0.5, X[0, :].max() + 0.5
    y_min, y_max = X[1, :].min() - 0.5, X[1, :].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))

    # Predict on grid
    Z = model(np.c_[xx.ravel(), yy.ravel()].T)
    Z = Z.reshape(xx.shape)

    # Plot
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[0, :], X[1, :], c=Y, edgecolors='black')

Weight Visualization

For MNIST, first-layer weights can be visualized as 28x28 images:

# Reshape and plot weights
for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(W1[i].reshape(28, 28), cmap='gray')
    plt.axis('off')

You’ll see the network has learned to look for digit-like patterns!

Practical Exercises

Exercise 1: Implement a Perceptron

Build a single-layer perceptron that can learn:

AND gate
OR gate
(Show it fails on XOR)

Exercise 2: 2-Layer XOR Network

Build a 2-layer network that learns XOR:

Input: [0,0] → Output: 0
Input: [0,1] → Output: 1
Input: [1,0] → Output: 1
Input: [1,1] → Output: 0

Exercise 3: MNIST Classifier

Build a network that achieves >95% accuracy on MNIST:

Architecture: 784 → 128 → 64 → 10
Activation: ReLU (hidden), Softmax (output)
Loss: Cross-entropy
Optimizer: Mini-batch gradient descent

Exercise 4: Hyperparameter Exploration

Experiment with:

Learning rate: 0.001, 0.01, 0.1, 1.0
Hidden layer sizes: 32, 64, 128, 256
Number of layers: 1, 2, 3
Batch size: 16, 32, 64, 128

Deliverables

For this module, you will build:

1. Pure Python Neural Network

No PyTorch, no TensorFlow, only NumPy
Configurable architecture (any number of layers)
Support for ReLU, Sigmoid, Tanh activations
Forward and backward propagation

2. MNIST Classifier

Train on MNIST dataset
Achieve >95% test accuracy
Visualize training progress
Show misclassified examples

3. Neural Network Toolkit (Main Deliverable)

Complete implementation with all components
Training visualization
Model save/load functionality
Performance benchmarks
CLI interface

Did You Know? Famous Neural Network Facts

These facts illuminate why neural networks work and the breakthroughs that made modern deep learning possible.

The Vanishing Gradient Problem

For decades, training deep networks was nearly impossible. Gradients would shrink exponentially as they propagated backward, making early layers learn incredibly slowly. Solutions:

ReLU activation (2010)
Batch normalization (2015)
Skip connections (ResNet, 2015)

Universal Approximation Theorem

A neural network with just ONE hidden layer can approximate ANY continuous function (given enough neurons). Proved by George Cybenko in 1989.

So why go deep?

Deep networks are exponentially more efficient
A function that needs 2^n neurons in a shallow network might need only n layers

The Lottery Ticket Hypothesis

In 2019, Jonathan Frankle and Michael Carlin showed that large networks contain small subnetworks (“winning tickets”) that, when trained alone, match the full network’s performance. This suggests most of a network’s capacity is redundant!

Key Takeaways

A neural network is just a function. Specifically, it’s a parameterized function f(x; θ) where learning means finding good values for θ. All the magic is in the parameters.
Forward propagation is function composition. Each layer transforms its input: Z = W @ A + b, then A = activation(Z). Stack enough layers, and you can approximate any function.
Loss functions measure wrongness. Cross-entropy for classification, MSE for regression. The lower the loss, the better the predictions match reality.
Backpropagation is just the chain rule. It assigns credit and blame to each parameter by computing how much changing that parameter would change the loss. No magic, just calculus.
Gradient descent navigates the loss landscape. By repeatedly moving in the direction that reduces loss most steeply, we find good parameter values—even in spaces with millions of dimensions.
Activation functions provide non-linearity. Without them, stacking linear layers would just produce another linear function. ReLU is the modern standard because it avoids vanishing gradients.
Initialization matters. He initialization keeps gradients flowing through deep networks. Bad initialization can kill training before it starts.
The core ideas are old; the results are new. Backpropagation (1986), convolutions (1989), and gradient descent (centuries old) power modern AI. What changed was data, compute, and clever tricks.

Next Steps

With neural networks understood from scratch, you’re ready for Module 27: PyTorch Fundamentals.

You’ll learn how PyTorch:

Automates gradient computation (autograd)
Provides GPU acceleration
Offers building blocks for complex architectures
Makes training much more convenient

But now you know what’s happening under the hood. You’re not just using magic - you understand it.

Module 26: Building the foundation of deep learning understanding.