PyTorch Fundamentals
AI/ML Engineering Track | Complexity:
[MEDIUM]| Time: 6-8
Or: Building a Brain with Nothing but Math
Section titled “Or: Building a Brain with Nothing but Math”Reading Time: 7-8 hours Phase: 6 - Deep Learning Foundations
The Day a Machine Learned to See: When Math Became Magic
Section titled “The Day a Machine Learned to See: When Math Became Magic”Toronto. December 3, 2012. 4:47 PM.
Geoffrey Hinton was staring at his computer screen, trying to process what he was seeing. For thirty years, he’d been a voice in the wilderness, insisting that neural networks could work if we just had enough data and computing power. His colleagues had called him stubborn. Funding agencies had called him delusional. The AI winters had frozen out most of his peers.
But today was different.
The ImageNet Large Scale Visual Recognition Challenge results were in. Hinton’s team—two graduate students, Alex Krizhevsky and Ilya Sutskever—had entered a neural network called “AlexNet.” The best competing systems achieved error rates around 26%. AlexNet achieved 15.3%.
Hinton read the number again: 15.3%.
Not slightly better. Not incrementally improved. A single neural network had reduced the error rate by more than 10 percentage points—a jump so large that reviewers initially assumed it was a mistake.
“I remember thinking: this changes everything. The ideas we’d been developing for decades—they actually work. We just needed scale.” — Geoffrey Hinton, reflecting on the moment in 2017
The phone started ringing. Colleagues who’d dismissed neural networks for years suddenly wanted to know more. Within months, Google acquired Hinton’s startup for $44 million. Within a year, neural networks would be the dominant paradigm in AI. Within a decade, they would power everything from voice assistants to self-driving cars to ChatGPT.
But here’s the remarkable part: the core algorithm that made AlexNet work was invented in 1986. The math hadn’t changed. The architecture was similar to networks from the 1990s. What changed was data, compute, and a few clever tricks.
In this module, you’re going to learn that algorithm from scratch. Not by using PyTorch or TensorFlow, but by building every component yourself in pure Python and NumPy. By the end, you’ll understand exactly what happens when a neural network “learns”—and you’ll have built one that can read handwritten digits with 97% accuracy.
This isn’t just historical education. The engineers who understand what’s happening under the hood—not just how to call .fit()—are the ones who debug the hard problems, design new architectures, and push the field forward. Today, you join their ranks.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”By the end of this module, you will:
- Understand what a neural network actually IS (not just how to use one)
- Implement forward propagation from scratch
- Derive and implement backpropagation by hand
- Train a network on MNIST without ANY frameworks
- Visualize the learning process
- Understand WHY neural networks work
Introduction: What IS a Neural Network?
Section titled “Introduction: What IS a Neural Network?”You’ve used neural networks through APIs and frameworks. You’ve seen them perform magic—generating text, recognizing images, understanding speech. But what’s actually happening inside?
A neural network is just a function.
That’s it. A very complex, parameterized function that takes inputs and produces outputs. The “learning” is just finding the right parameters.
Think of it like a combination lock with millions of dials. Each dial is a parameter. When you “train” a neural network, you’re slowly turning each dial until the lock opens—until the function produces the right outputs for your inputs. The genius of neural networks isn’t the lock itself; it’s the algorithm that figures out how to turn all those dials simultaneously.
Neural Network = f(x; θ)
Where: x = input (image, text, numbers) θ = parameters (weights and biases) f = the function we're learningThe magic isn’t in the architecture - it’s in the learning algorithm that finds good values for θ.
Did You Know? The Tumultuous History of Neural Networks
Section titled “Did You Know? The Tumultuous History of Neural Networks”The story of neural networks is a tale of boom and bust, faith and vindication. Understanding this history helps you appreciate why the field looks the way it does today.
The First Neural Network: 1943
Section titled “The First Neural Network: 1943”In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts published “A Logical Calculus of the Ideas Immanent in Nervous Activity.” They proposed the first mathematical model of a neuron.
“We propose to show how neurons might be connected to perform logical operations.” — McCulloch & Pitts, 1943
Their model was simple: a neuron either fires (1) or doesn’t (0), based on whether the weighted sum of its inputs exceeds a threshold. This was the birth of artificial neural networks.
The Perceptron: 1958
Section titled “The Perceptron: 1958”Frank Rosenblatt at Cornell built the Mark I Perceptron - actual hardware that could learn to recognize simple patterns. The New York Times reported:
“The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
The hype was real. And wildly premature.
The AI Winter: 1969
Section titled “The AI Winter: 1969”In 1969, Marvin Minsky and Seymour Papert published “Perceptrons,” a book that mathematically proved single-layer perceptrons couldn’t learn XOR (a simple logical function). The conclusion was devastating:
“The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations.”
Funding for neural networks collapsed. Researchers moved on. This began the first “AI Winter.”
What they missed: Multi-layer networks COULD learn XOR. But training them was the unsolved problem.
The Backpropagation Revolution: 1986
Section titled “The Backpropagation Revolution: 1986”The solution had actually been discovered multiple times:
- Paul Werbos (1974) - In his PhD thesis, mostly ignored
- David Parker (1985) - Rediscovered independently
- Geoffrey Hinton, David Rumelhart, Ronald Williams (1986) - Made it famous
The 1986 paper “Learning representations by back-propagating errors” in Nature showed that multi-layer networks could be trained efficiently using backpropagation. Neural networks were back.
The Second AI Winter: 1990s
Section titled “The Second AI Winter: 1990s”Despite backpropagation, neural networks hit walls:
- Training was slow (no GPUs)
- Vanishing gradients killed deep networks
- Support Vector Machines often worked better
- Funding dried up again
Geoffrey Hinton was one of the few who kept the faith. He later said:
“I was pretty much the only person in the world who still believed in neural networks.”
The Deep Learning Revolution: 2012
Section titled “The Deep Learning Revolution: 2012”In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet competition with “AlexNet” - a deep convolutional neural network trained on GPUs.
They didn’t just win. They crushed the competition, reducing error rates from 26% to 15%.
The world noticed. The deep learning revolution began.
The irony: The core ideas (backpropagation, convolutions) were decades old. What changed was:
- Data: ImageNet provided millions of labeled images
- Compute: GPUs made training fast
- Techniques: ReLU, dropout, batch norm solved training issues
The Building Block: A Single Neuron
Section titled “The Building Block: A Single Neuron”Let’s start with the simplest possible neural network: one neuron.
The Biological Inspiration
Section titled “The Biological Inspiration”A biological neuron works like a tiny decision-maker in your brain:
- It receives signals from other neurons through dendrites (like antennas collecting radio signals)
- If the combined signal exceeds a threshold, it fires (like a voter who only acts when enough arguments convince them)
- The signal travels down the axon to other neurons (like sending a message down a wire)
Your brain has about 86 billion of these neurons, each connected to thousands of others. The miracle of intelligence emerges from simple units doing simple things, connected in complex patterns.
Dendrites Cell Body Axon (inputs) (processing) (output)
x₁ ─────┐ │ x₂ ─────┼──────► [ Σ + f ] ──────► y │ x₃ ─────┘The Artificial Neuron
Section titled “The Artificial Neuron”An artificial neuron does something similar:
def neuron(inputs, weights, bias): # 1. Weighted sum of inputs z = sum(x * w for x, w in zip(inputs, weights)) + bias
# 2. Apply activation function output = activation(z)
return outputMathematically:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = w·x + by = f(z)Where:
- x = inputs (features)
- w = weights (learned parameters)
- b = bias (learned parameter)
- f = activation function
- y = output
Activation Functions
Section titled “Activation Functions”The activation function introduces non-linearity. Without it, stacking neurons would just give you another linear function.
Sigmoid (Historical)
Section titled “Sigmoid (Historical)”σ(z) = 1 / (1 + e^(-z))- Output range: (0, 1)
- Problem: Vanishing gradients for large |z|
- Used for: Binary classification outputs
Tanh (Historical)
Section titled “Tanh (Historical)”tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))- Output range: (-1, 1)
- Zero-centered (better than sigmoid)
- Still has vanishing gradient problem
ReLU (Modern Standard)
Section titled “ReLU (Modern Standard)”ReLU(z) = max(0, z)- Output range: [0, ∞)
- No vanishing gradient for positive values
- Computationally efficient
- Problem: “Dead neurons” (always output 0)
Leaky ReLU
Section titled “Leaky ReLU”LeakyReLU(z) = z if z > 0 else 0.01 * z- Fixes the dead neuron problem
- Small gradient for negative values
Visualization of Activation Functions:
Sigmoid: ReLU: 1 ───────── │ ╱ │ ╱ │ ╱ 0.5 │─────╱ │ ╱ │ ╱ │ ╱ 0 ───── └────── -5 0 5 -2 0 2 4From Neuron to Network
Section titled “From Neuron to Network”One neuron can only learn linear patterns. To learn complex patterns, we stack neurons into layers.
Network Architecture
Section titled “Network Architecture”Input Layer Hidden Layer(s) Output Layer x₁ ─────┐ ├──► h₁ ─────┐ x₂ ─────┼ ├──► y₁ ├──► h₂ ─────┤ x₃ ─────┼ ├──► y₂ ├──► h₃ ─────┘ x₄ ─────┘
Features Learned Predictions RepresentationsFully Connected (Dense) Layer: Every neuron connects to every neuron in the next layer.
Why This Module Matters
Section titled “Why This Module Matters”Think of it like an assembly line in a factory. Raw materials enter, and each station transforms them into something more refined. By the end, you have a finished product—but no single station could have built it alone.
Each layer learns increasingly abstract representations:
Image Recognition Example:
Layer 1: Edges and colorsLayer 2: Textures and patternsLayer 3: Parts (eyes, wheels)Layer 4: Objects (faces, cars)This is the key insight of deep learning: hierarchical feature learning.
Matrix Formulation
Section titled “Matrix Formulation”For efficiency, we represent layer computations as matrix operations:
# Single samplez = W @ x + b # Linear transformationa = activation(z) # Non-linearity
# Batch of samplesZ = W @ X + b # X is (features, batch_size)A = activation(Z)Where:
- W = weight matrix (output_neurons × input_neurons)
- X = input matrix (input_neurons × batch_size)
- b = bias vector (output_neurons × 1)
- Z = pre-activation (output_neurons × batch_size)
- A = activation (output_neurons × batch_size)
️ Forward Propagation
Section titled “️ Forward Propagation”Forward propagation is computing the output given an input. It’s just applying each layer’s transformation in sequence.
Algorithm
Section titled “Algorithm”def forward(X, parameters): """ Forward propagation through L layers.
Args: X: Input data (n_features, m_samples) parameters: Dict with W1, b1, W2, b2, ..., WL, bL
Returns: AL: Final output caches: Intermediate values (needed for backprop) """ caches = [] A = X L = len(parameters) // 2 # Number of layers
# Hidden layers (with ReLU) for l in range(1, L): A_prev = A W = parameters[f'W{l}'] b = parameters[f'b{l}']
Z = W @ A_prev + b A = relu(Z)
caches.append((A_prev, W, b, Z))
# Output layer (with sigmoid for binary classification) W = parameters[f'W{L}'] b = parameters[f'b{L}']
Z = W @ A + b AL = sigmoid(Z)
caches.append((A, W, b, Z))
return AL, cachesExample: 2-Layer Network
Section titled “Example: 2-Layer Network”Input: X (784 features - flattened 28x28 image)Hidden: 128 neurons with ReLUOutput: 10 neurons with softmax (digits 0-9)
Forward pass:1. Z1 = W1 @ X + b1 # (128, 784) @ (784, m) = (128, m)2. A1 = ReLU(Z1) # (128, m)3. Z2 = W2 @ A1 + b2 # (10, 128) @ (128, m) = (10, m)4. A2 = softmax(Z2) # (10, m) - probabilities for each classThe Loss Function
Section titled “The Loss Function”How do we know if our network’s predictions are good? We need a loss function (also called cost function or objective function).
Binary Cross-Entropy Loss
Section titled “Binary Cross-Entropy Loss”For binary classification (yes/no):
L(y, ŷ) = -[y * log(ŷ) + (1-y) * log(1-ŷ)]For m samples:
J = -(1/m) * Σ[y * log(ŷ) + (1-y) * log(1-ŷ)]Intuition:
- If y=1 and ŷ=1: loss ≈ 0 (correct and confident)
- If y=1 and ŷ=0: loss → ∞ (wrong and confident)
Categorical Cross-Entropy Loss
Section titled “Categorical Cross-Entropy Loss”For multi-class classification:
L(y, ŷ) = -Σ yᵢ * log(ŷᵢ)Where y is one-hot encoded (e.g., [0,0,1,0,0,0,0,0,0,0] for digit “2”).
Mean Squared Error (MSE)
Section titled “Mean Squared Error (MSE)”For regression:
J = (1/m) * Σ(y - ŷ)²⬅️ Backpropagation: The Key Algorithm
Section titled “⬅️ Backpropagation: The Key Algorithm”Backpropagation is how neural networks learn. It computes the gradient of the loss with respect to every parameter. This is the algorithm that Geoffrey Hinton championed for decades, the one that finally proved its worth in 2012.
Think of it like tracing blame through an organization. When something goes wrong (high loss), you need to figure out who was responsible (which parameters contributed to the error) and by how much. Backpropagation is the accounting system that assigns credit and blame to millions of parameters simultaneously.
The Chain Rule
Section titled “The Chain Rule”Backpropagation is just the chain rule from calculus, applied systematically. If you remember anything from calculus, the chain rule lets you compute derivatives of composed functions: how a change in an input ripples through to affect the output.
If y = f(g(x)), then:
dy/dx = dy/dg * dg/dxIn a neural network, the loss depends on parameters through a chain of operations:
Loss ← Output ← Hidden ← ... ← Input ← ParametersTo find how the loss changes with respect to a parameter, we multiply gradients along the path.
Deriving Backpropagation
Section titled “Deriving Backpropagation”Let’s derive backprop for a 2-layer network step by step.
Forward pass:
Z1 = W1 @ X + b1A1 = ReLU(Z1)Z2 = W2 @ A1 + b2A2 = sigmoid(Z2)L = cross_entropy(Y, A2)Backward pass (computing gradients):
- Output layer gradient:
dL/dZ2 = A2 - Y # For sigmoid + cross-entropy- Output layer parameters:
dL/dW2 = (1/m) * dZ2 @ A1.TdL/db2 = (1/m) * sum(dZ2, axis=1)- Hidden layer gradient:
dL/dA1 = W2.T @ dZ2dL/dZ1 = dL/dA1 * ReLU'(Z1) # Element-wise- Hidden layer parameters:
dL/dW1 = (1/m) * dZ1 @ X.TdL/db1 = (1/m) * sum(dZ1, axis=1)The Backpropagation Algorithm
Section titled “The Backpropagation Algorithm”def backward(AL, Y, caches): """ Backward propagation through L layers.
Args: AL: Final output from forward prop Y: True labels caches: Intermediate values from forward prop
Returns: gradients: Dict with dW1, db1, dW2, db2, ..., dWL, dbL """ gradients = {} L = len(caches) m = AL.shape[1]
# Output layer (sigmoid + cross-entropy) dAL = -(Y / AL) + (1 - Y) / (1 - AL)
# For sigmoid: dZ = dA * sigmoid'(Z) = dA * A * (1 - A) A_prev, W, b, Z = caches[L-1] dZ = AL - Y # Simplified for sigmoid + cross-entropy
gradients[f'dW{L}'] = (1/m) * dZ @ A_prev.T gradients[f'db{L}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True) dA_prev = W.T @ dZ
# Hidden layers (ReLU) for l in reversed(range(L-1)): A_prev, W, b, Z = caches[l]
# ReLU derivative: 1 if Z > 0, else 0 dZ = dA_prev * (Z > 0)
gradients[f'dW{l+1}'] = (1/m) * dZ @ A_prev.T gradients[f'db{l+1}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
if l > 0: dA_prev = W.T @ dZ
return gradientsVisualizing Backpropagation
Section titled “Visualizing Backpropagation”Forward: X ───► Z1 ───► A1 ───► Z2 ───► A2 ───► Loss ↑ ↑ ↑ ↑ W1 --- W2 ---
Backward: dX ◄─── dZ1 ◄─── dA1 ◄─── dZ2 ◄─── dA2 ◄─── dLoss ↓ ↓ dW1 dW2 db1 db2Gradient Descent
Section titled “Gradient Descent”Now that we can compute gradients, we need to update parameters to minimize the loss.
Basic Gradient Descent
Section titled “Basic Gradient Descent”def update_parameters(parameters, gradients, learning_rate): """ Update parameters using gradient descent.
θ = θ - α * ∂L/∂θ """ L = len(parameters) // 2
for l in range(1, L + 1): parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}'] parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']
return parametersWhy Gradient Descent Works
Section titled “Why Gradient Descent Works”Imagine you’re blindfolded on a hilly landscape, trying to find the lowest point. You can’t see anything, but you can feel the slope under your feet. Here’s your strategy:
- Feel the slope under your feet (compute gradient)
- Take a step downhill (update parameters)
- Repeat until you can’t go lower
The gradient always points “uphill,” so going in the opposite direction takes you “downhill” toward lower loss. It’s like water flowing downhill—it always finds the path of steepest descent, eventually settling in a valley.
The remarkable thing is that this simple strategy works even in spaces with millions of dimensions. In a neural network with 10 million parameters, you’re navigating a 10-million-dimensional landscape—far beyond human imagination—but the math doesn’t care. Gradient descent still finds the way down.
Learning Rate
Section titled “Learning Rate”The learning rate (α) controls step size:
Too small: Slow convergence, may get stuckToo large: Overshooting, unstable trainingJust right: Smooth convergence to minimum
Loss │ │ ╲ Learning rate too large │ ╲╱╲ (overshooting) │ ╲ │ ───── Just right │ └────────────── iterationsVariants of Gradient Descent
Section titled “Variants of Gradient Descent”Batch Gradient Descent:
- Use ALL samples to compute gradient
- Slow for large datasets
- Smooth but expensive
Stochastic Gradient Descent (SGD):
- Use ONE sample per update
- Noisy but fast
- Can escape local minima
Mini-batch Gradient Descent:
- Use a small batch (32, 64, 128 samples)
- Best of both worlds
- Most common in practice
The Complete Training Loop
Section titled “The Complete Training Loop”Putting it all together:
def train(X, Y, layer_dims, learning_rate=0.01, num_iterations=1000): """ Train a neural network.
Args: X: Training data (n_features, m_samples) Y: Labels (n_classes, m_samples) layer_dims: List of layer sizes [n_x, n_h1, n_h2, ..., n_y] learning_rate: Step size for gradient descent num_iterations: Number of training iterations
Returns: parameters: Trained weights and biases costs: Loss history """ # Initialize parameters parameters = initialize_parameters(layer_dims) costs = []
for i in range(num_iterations): # Forward propagation AL, caches = forward(X, parameters)
# Compute cost cost = compute_cost(AL, Y) costs.append(cost)
# Backward propagation gradients = backward(AL, Y, caches)
# Update parameters parameters = update_parameters(parameters, gradients, learning_rate)
# Print progress if i % 100 == 0: print(f"Iteration {i}: cost = {cost:.4f}")
return parameters, costsMNIST: Our Training Ground
Section titled “MNIST: Our Training Ground”MNIST is the “Hello World” of deep learning - a dataset of 70,000 handwritten digits.
The Dataset
Section titled “The Dataset”Training: 60,000 imagesTest: 10,000 imagesSize: 28x28 pixels (784 features when flattened)Classes: 10 (digits 0-9)
Sample images:┌────────┐ ┌────────┐ ┌────────┐│ ██ │ │ █ │ │ ███ ││ █ █ │ │ ██ │ │ █ ││ █ █ │ │ █ │ │ ██ ││ █ █ │ │ █ │ │ █ ││ ████ │ │ ███ │ │ ███ │└────────┘ └────────┘ └────────┘ 0 1 2Why MNIST?
Section titled “Why MNIST?”- Small enough: Can train on CPU in minutes
- Hard enough: Requires real learning
- Well-understood: Benchmark results available
- Visual: Easy to interpret results
Target Accuracy
Section titled “Target Accuracy”| Model | Accuracy |
|---|---|
| Random guessing | 10% |
| Simple neural net (no hidden layers) | ~92% |
| 2-layer neural net | ~97% |
| CNN (state of art) | ~99.8% |
Our goal: Build a 2-layer network achieving ~97% accuracy.
Did You Know? MNIST Trivia
Section titled “Did You Know? MNIST Trivia”MNIST has become so central to machine learning education that it’s worth understanding its role and limitations.
The Origin
Section titled “The Origin”MNIST was created by Yann LeCun (now at Meta AI) and colleagues in 1998. It combines modified samples from NIST’s original dataset of handwritten digits.
The “Drosophila of Machine Learning”
Section titled “The “Drosophila of Machine Learning””MNIST is often called the “fruit fly of machine learning” - a simple model organism for experimenting. Just as biologists use fruit flies to understand genetics, ML researchers use MNIST to test new ideas.
It’s Actually Too Easy Now
Section titled “It’s Actually Too Easy Now”Modern techniques achieve >99.5% accuracy on MNIST. Researchers have moved on to harder datasets like:
- CIFAR-10: 60,000 color images in 10 classes
- ImageNet: 14 million images in 20,000+ classes
- Fashion-MNIST: Harder clothing items (drop-in MNIST replacement)
Human Performance
Section titled “Human Performance”Humans achieve about 97.5% accuracy on MNIST (yes, some digits are ambiguous!). A well-tuned neural network can beat human performance.
️ Implementation Details
Section titled “️ Implementation Details”Weight Initialization
Section titled “Weight Initialization”Bad initialization can kill training:
# BAD: All zeros (all neurons learn the same thing)W = np.zeros((n_out, n_in))
# BAD: Large random (exploding activations)W = np.random.randn(n_out, n_in) * 100
# GOOD: Xavier/Glorot initializationW = np.random.randn(n_out, n_in) * np.sqrt(1 / n_in)
# BETTER: He initialization (for ReLU)W = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)Why He initialization?
- Keeps variance of activations roughly constant across layers
- Prevents vanishing/exploding gradients
- Named after Kaiming He (Microsoft Research, 2015)
Numerical Stability
Section titled “Numerical Stability”# BAD: Can overflow/underflowdef sigmoid(z): return 1 / (1 + np.exp(-z))
# GOOD: Numerically stabledef sigmoid(z): z = np.clip(z, -500, 500) # Prevent overflow return 1 / (1 + np.exp(-z))
# Cross-entropy with stabilitydef cross_entropy(y, y_hat): epsilon = 1e-15 y_hat = np.clip(y_hat, epsilon, 1 - epsilon) return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))Softmax for Multi-class
Section titled “Softmax for Multi-class”def softmax(z): """ Softmax function for multi-class classification. Converts logits to probabilities that sum to 1. """ # Subtract max for numerical stability exp_z = np.exp(z - np.max(z, axis=0, keepdims=True)) return exp_z / np.sum(exp_z, axis=0, keepdims=True)Visualizing Learning
Section titled “Visualizing Learning”Loss Curve
Section titled “Loss Curve”plt.plot(costs)plt.xlabel('Iteration')plt.ylabel('Loss')plt.title('Training Loss')A healthy loss curve:
Loss │2.0├───╲ │ ╲1.0├─────╲──── │ ╲──────0.5├─────────────── │ └───────────────── 0 500 1000 iterationsDecision Boundaries
Section titled “Decision Boundaries”For 2D data, we can visualize what the network learned:
def plot_decision_boundary(model, X, Y): # Create grid x_min, x_max = X[0, :].min() - 0.5, X[0, :].max() + 0.5 y_min, y_max = X[1, :].min() - 0.5, X[1, :].max() + 0.5 xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
# Predict on grid Z = model(np.c_[xx.ravel(), yy.ravel()].T) Z = Z.reshape(xx.shape)
# Plot plt.contourf(xx, yy, Z, alpha=0.8) plt.scatter(X[0, :], X[1, :], c=Y, edgecolors='black')Weight Visualization
Section titled “Weight Visualization”For MNIST, first-layer weights can be visualized as 28x28 images:
# Reshape and plot weightsfor i in range(10): plt.subplot(2, 5, i+1) plt.imshow(W1[i].reshape(28, 28), cmap='gray') plt.axis('off')You’ll see the network has learned to look for digit-like patterns!
Practical Exercises
Section titled “Practical Exercises”Exercise 1: Implement a Perceptron
Section titled “Exercise 1: Implement a Perceptron”Build a single-layer perceptron that can learn:
- AND gate
- OR gate
- (Show it fails on XOR)
Exercise 2: 2-Layer XOR Network
Section titled “Exercise 2: 2-Layer XOR Network”Build a 2-layer network that learns XOR:
Input: [0,0] → Output: 0Input: [0,1] → Output: 1Input: [1,0] → Output: 1Input: [1,1] → Output: 0Exercise 3: MNIST Classifier
Section titled “Exercise 3: MNIST Classifier”Build a network that achieves >95% accuracy on MNIST:
- Architecture: 784 → 128 → 64 → 10
- Activation: ReLU (hidden), Softmax (output)
- Loss: Cross-entropy
- Optimizer: Mini-batch gradient descent
Exercise 4: Hyperparameter Exploration
Section titled “Exercise 4: Hyperparameter Exploration”Experiment with:
- Learning rate: 0.001, 0.01, 0.1, 1.0
- Hidden layer sizes: 32, 64, 128, 256
- Number of layers: 1, 2, 3
- Batch size: 16, 32, 64, 128
Deliverables
Section titled “Deliverables”For this module, you will build:
1. Pure Python Neural Network
Section titled “1. Pure Python Neural Network”- No PyTorch, no TensorFlow, only NumPy
- Configurable architecture (any number of layers)
- Support for ReLU, Sigmoid, Tanh activations
- Forward and backward propagation
2. MNIST Classifier
Section titled “2. MNIST Classifier”- Train on MNIST dataset
- Achieve >95% test accuracy
- Visualize training progress
- Show misclassified examples
3. Neural Network Toolkit (Main Deliverable)
Section titled “3. Neural Network Toolkit (Main Deliverable)”- Complete implementation with all components
- Training visualization
- Model save/load functionality
- Performance benchmarks
- CLI interface
Further Reading
Section titled “Further Reading”Papers
Section titled “Papers”- Backpropagation (Rumelhart et al., 1986)
- Gradient-Based Learning (LeCun et al., 1998)
- Deep Learning (Hinton, 2006)
- “Neural Networks and Deep Learning” by Michael Nielsen (free online)
- “Deep Learning” by Goodfellow, Bengio, Courville (the “bible”)
Videos
Section titled “Videos”- 3Blue1Brown: “Neural Networks” series (excellent visualizations)
- Andrej Karpathy: “Neural Networks: Zero to Hero”
Did You Know? Famous Neural Network Facts
Section titled “Did You Know? Famous Neural Network Facts”These facts illuminate why neural networks work and the breakthroughs that made modern deep learning possible.
The Vanishing Gradient Problem
Section titled “The Vanishing Gradient Problem”For decades, training deep networks was nearly impossible. Gradients would shrink exponentially as they propagated backward, making early layers learn incredibly slowly. Solutions:
- ReLU activation (2010)
- Batch normalization (2015)
- Skip connections (ResNet, 2015)
Universal Approximation Theorem
Section titled “Universal Approximation Theorem”A neural network with just ONE hidden layer can approximate ANY continuous function (given enough neurons). Proved by George Cybenko in 1989.
So why go deep?
- Deep networks are exponentially more efficient
- A function that needs 2^n neurons in a shallow network might need only n layers
The Lottery Ticket Hypothesis
Section titled “The Lottery Ticket Hypothesis”In 2019, Jonathan Frankle and Michael Carlin showed that large networks contain small subnetworks (“winning tickets”) that, when trained alone, match the full network’s performance. This suggests most of a network’s capacity is redundant!
Key Takeaways
Section titled “Key Takeaways”-
A neural network is just a function. Specifically, it’s a parameterized function f(x; θ) where learning means finding good values for θ. All the magic is in the parameters.
-
Forward propagation is function composition. Each layer transforms its input: Z = W @ A + b, then A = activation(Z). Stack enough layers, and you can approximate any function.
-
Loss functions measure wrongness. Cross-entropy for classification, MSE for regression. The lower the loss, the better the predictions match reality.
-
Backpropagation is just the chain rule. It assigns credit and blame to each parameter by computing how much changing that parameter would change the loss. No magic, just calculus.
-
Gradient descent navigates the loss landscape. By repeatedly moving in the direction that reduces loss most steeply, we find good parameter values—even in spaces with millions of dimensions.
-
Activation functions provide non-linearity. Without them, stacking linear layers would just produce another linear function. ReLU is the modern standard because it avoids vanishing gradients.
-
Initialization matters. He initialization keeps gradients flowing through deep networks. Bad initialization can kill training before it starts.
-
The core ideas are old; the results are new. Backpropagation (1986), convolutions (1989), and gradient descent (centuries old) power modern AI. What changed was data, compute, and clever tricks.
Next Steps
Section titled “Next Steps”With neural networks understood from scratch, you’re ready for Module 27: PyTorch Fundamentals.
You’ll learn how PyTorch:
- Automates gradient computation (autograd)
- Provides GPU acceleration
- Offers building blocks for complex architectures
- Makes training much more convenient
But now you know what’s happening under the hood. You’re not just using magic - you understand it.
Module 26: Building the foundation of deep learning understanding.