Skip to content

PyTorch Fundamentals

AI/ML Engineering Track | Complexity: [MEDIUM] | Time: 6-8

Or: Building a Brain with Nothing but Math

Section titled “Or: Building a Brain with Nothing but Math”

Reading Time: 7-8 hours Phase: 6 - Deep Learning Foundations


The Day a Machine Learned to See: When Math Became Magic

Section titled “The Day a Machine Learned to See: When Math Became Magic”

Toronto. December 3, 2012. 4:47 PM.

Geoffrey Hinton was staring at his computer screen, trying to process what he was seeing. For thirty years, he’d been a voice in the wilderness, insisting that neural networks could work if we just had enough data and computing power. His colleagues had called him stubborn. Funding agencies had called him delusional. The AI winters had frozen out most of his peers.

But today was different.

The ImageNet Large Scale Visual Recognition Challenge results were in. Hinton’s team—two graduate students, Alex Krizhevsky and Ilya Sutskever—had entered a neural network called “AlexNet.” The best competing systems achieved error rates around 26%. AlexNet achieved 15.3%.

Hinton read the number again: 15.3%.

Not slightly better. Not incrementally improved. A single neural network had reduced the error rate by more than 10 percentage points—a jump so large that reviewers initially assumed it was a mistake.

“I remember thinking: this changes everything. The ideas we’d been developing for decades—they actually work. We just needed scale.” — Geoffrey Hinton, reflecting on the moment in 2017

The phone started ringing. Colleagues who’d dismissed neural networks for years suddenly wanted to know more. Within months, Google acquired Hinton’s startup for $44 million. Within a year, neural networks would be the dominant paradigm in AI. Within a decade, they would power everything from voice assistants to self-driving cars to ChatGPT.

But here’s the remarkable part: the core algorithm that made AlexNet work was invented in 1986. The math hadn’t changed. The architecture was similar to networks from the 1990s. What changed was data, compute, and a few clever tricks.

In this module, you’re going to learn that algorithm from scratch. Not by using PyTorch or TensorFlow, but by building every component yourself in pure Python and NumPy. By the end, you’ll understand exactly what happens when a neural network “learns”—and you’ll have built one that can read handwritten digits with 97% accuracy.

This isn’t just historical education. The engineers who understand what’s happening under the hood—not just how to call .fit()—are the ones who debug the hard problems, design new architectures, and push the field forward. Today, you join their ranks.


By the end of this module, you will:

  • Understand what a neural network actually IS (not just how to use one)
  • Implement forward propagation from scratch
  • Derive and implement backpropagation by hand
  • Train a network on MNIST without ANY frameworks
  • Visualize the learning process
  • Understand WHY neural networks work

You’ve used neural networks through APIs and frameworks. You’ve seen them perform magic—generating text, recognizing images, understanding speech. But what’s actually happening inside?

A neural network is just a function.

That’s it. A very complex, parameterized function that takes inputs and produces outputs. The “learning” is just finding the right parameters.

Think of it like a combination lock with millions of dials. Each dial is a parameter. When you “train” a neural network, you’re slowly turning each dial until the lock opens—until the function produces the right outputs for your inputs. The genius of neural networks isn’t the lock itself; it’s the algorithm that figures out how to turn all those dials simultaneously.

Neural Network = f(x; θ)
Where:
x = input (image, text, numbers)
θ = parameters (weights and biases)
f = the function we're learning

The magic isn’t in the architecture - it’s in the learning algorithm that finds good values for θ.


Did You Know? The Tumultuous History of Neural Networks

Section titled “Did You Know? The Tumultuous History of Neural Networks”

The story of neural networks is a tale of boom and bust, faith and vindication. Understanding this history helps you appreciate why the field looks the way it does today.

In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts published “A Logical Calculus of the Ideas Immanent in Nervous Activity.” They proposed the first mathematical model of a neuron.

“We propose to show how neurons might be connected to perform logical operations.” — McCulloch & Pitts, 1943

Their model was simple: a neuron either fires (1) or doesn’t (0), based on whether the weighted sum of its inputs exceeds a threshold. This was the birth of artificial neural networks.

Frank Rosenblatt at Cornell built the Mark I Perceptron - actual hardware that could learn to recognize simple patterns. The New York Times reported:

“The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

The hype was real. And wildly premature.

In 1969, Marvin Minsky and Seymour Papert published “Perceptrons,” a book that mathematically proved single-layer perceptrons couldn’t learn XOR (a simple logical function). The conclusion was devastating:

“The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations.”

Funding for neural networks collapsed. Researchers moved on. This began the first “AI Winter.”

What they missed: Multi-layer networks COULD learn XOR. But training them was the unsolved problem.

The solution had actually been discovered multiple times:

  • Paul Werbos (1974) - In his PhD thesis, mostly ignored
  • David Parker (1985) - Rediscovered independently
  • Geoffrey Hinton, David Rumelhart, Ronald Williams (1986) - Made it famous

The 1986 paper “Learning representations by back-propagating errors” in Nature showed that multi-layer networks could be trained efficiently using backpropagation. Neural networks were back.

Despite backpropagation, neural networks hit walls:

  • Training was slow (no GPUs)
  • Vanishing gradients killed deep networks
  • Support Vector Machines often worked better
  • Funding dried up again

Geoffrey Hinton was one of the few who kept the faith. He later said:

“I was pretty much the only person in the world who still believed in neural networks.”

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet competition with “AlexNet” - a deep convolutional neural network trained on GPUs.

They didn’t just win. They crushed the competition, reducing error rates from 26% to 15%.

The world noticed. The deep learning revolution began.

The irony: The core ideas (backpropagation, convolutions) were decades old. What changed was:

  1. Data: ImageNet provided millions of labeled images
  2. Compute: GPUs made training fast
  3. Techniques: ReLU, dropout, batch norm solved training issues

Let’s start with the simplest possible neural network: one neuron.

A biological neuron works like a tiny decision-maker in your brain:

  1. It receives signals from other neurons through dendrites (like antennas collecting radio signals)
  2. If the combined signal exceeds a threshold, it fires (like a voter who only acts when enough arguments convince them)
  3. The signal travels down the axon to other neurons (like sending a message down a wire)

Your brain has about 86 billion of these neurons, each connected to thousands of others. The miracle of intelligence emerges from simple units doing simple things, connected in complex patterns.

Dendrites Cell Body Axon
(inputs) (processing) (output)
x₁ ─────┐
x₂ ─────┼──────► [ Σ + f ] ──────► y
x₃ ─────┘

An artificial neuron does something similar:

def neuron(inputs, weights, bias):
# 1. Weighted sum of inputs
z = sum(x * w for x, w in zip(inputs, weights)) + bias
# 2. Apply activation function
output = activation(z)
return output

Mathematically:

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = w·x + b
y = f(z)

Where:

  • x = inputs (features)
  • w = weights (learned parameters)
  • b = bias (learned parameter)
  • f = activation function
  • y = output

The activation function introduces non-linearity. Without it, stacking neurons would just give you another linear function.

σ(z) = 1 / (1 + e^(-z))
  • Output range: (0, 1)
  • Problem: Vanishing gradients for large |z|
  • Used for: Binary classification outputs
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
  • Output range: (-1, 1)
  • Zero-centered (better than sigmoid)
  • Still has vanishing gradient problem
ReLU(z) = max(0, z)
  • Output range: [0, ∞)
  • No vanishing gradient for positive values
  • Computationally efficient
  • Problem: “Dead neurons” (always output 0)
LeakyReLU(z) = z if z > 0 else 0.01 * z
  • Fixes the dead neuron problem
  • Small gradient for negative values
Visualization of Activation Functions:
Sigmoid: ReLU:
1 ───────── │ ╱
│ ╱ │ ╱
0.5 │─────╱ │ ╱
│ ╱ │ ╱
0 ───── └──────
-5 0 5 -2 0 2 4

One neuron can only learn linear patterns. To learn complex patterns, we stack neurons into layers.

Input Layer Hidden Layer(s) Output Layer
x₁ ─────┐
├──► h₁ ─────┐
x₂ ─────┼ ├──► y₁
├──► h₂ ─────┤
x₃ ─────┼ ├──► y₂
├──► h₃ ─────┘
x₄ ─────┘
Features Learned Predictions
Representations

Fully Connected (Dense) Layer: Every neuron connects to every neuron in the next layer.

Think of it like an assembly line in a factory. Raw materials enter, and each station transforms them into something more refined. By the end, you have a finished product—but no single station could have built it alone.

Each layer learns increasingly abstract representations:

Image Recognition Example:
Layer 1: Edges and colors
Layer 2: Textures and patterns
Layer 3: Parts (eyes, wheels)
Layer 4: Objects (faces, cars)

This is the key insight of deep learning: hierarchical feature learning.

For efficiency, we represent layer computations as matrix operations:

# Single sample
z = W @ x + b # Linear transformation
a = activation(z) # Non-linearity
# Batch of samples
Z = W @ X + b # X is (features, batch_size)
A = activation(Z)

Where:

  • W = weight matrix (output_neurons × input_neurons)
  • X = input matrix (input_neurons × batch_size)
  • b = bias vector (output_neurons × 1)
  • Z = pre-activation (output_neurons × batch_size)
  • A = activation (output_neurons × batch_size)

Forward propagation is computing the output given an input. It’s just applying each layer’s transformation in sequence.

def forward(X, parameters):
"""
Forward propagation through L layers.
Args:
X: Input data (n_features, m_samples)
parameters: Dict with W1, b1, W2, b2, ..., WL, bL
Returns:
AL: Final output
caches: Intermediate values (needed for backprop)
"""
caches = []
A = X
L = len(parameters) // 2 # Number of layers
# Hidden layers (with ReLU)
for l in range(1, L):
A_prev = A
W = parameters[f'W{l}']
b = parameters[f'b{l}']
Z = W @ A_prev + b
A = relu(Z)
caches.append((A_prev, W, b, Z))
# Output layer (with sigmoid for binary classification)
W = parameters[f'W{L}']
b = parameters[f'b{L}']
Z = W @ A + b
AL = sigmoid(Z)
caches.append((A, W, b, Z))
return AL, caches
Input: X (784 features - flattened 28x28 image)
Hidden: 128 neurons with ReLU
Output: 10 neurons with softmax (digits 0-9)
Forward pass:
1. Z1 = W1 @ X + b1 # (128, 784) @ (784, m) = (128, m)
2. A1 = ReLU(Z1) # (128, m)
3. Z2 = W2 @ A1 + b2 # (10, 128) @ (128, m) = (10, m)
4. A2 = softmax(Z2) # (10, m) - probabilities for each class

How do we know if our network’s predictions are good? We need a loss function (also called cost function or objective function).

For binary classification (yes/no):

L(y, ŷ) = -[y * log(ŷ) + (1-y) * log(1-ŷ)]

For m samples:

J = -(1/m) * Σ[y * log(ŷ) + (1-y) * log(1-ŷ)]

Intuition:

  • If y=1 and ŷ=1: loss ≈ 0 (correct and confident)
  • If y=1 and ŷ=0: loss → ∞ (wrong and confident)

For multi-class classification:

L(y, ŷ) = -Σ yᵢ * log(ŷᵢ)

Where y is one-hot encoded (e.g., [0,0,1,0,0,0,0,0,0,0] for digit “2”).

For regression:

J = (1/m) * Σ(y - ŷ)²

Backpropagation is how neural networks learn. It computes the gradient of the loss with respect to every parameter. This is the algorithm that Geoffrey Hinton championed for decades, the one that finally proved its worth in 2012.

Think of it like tracing blame through an organization. When something goes wrong (high loss), you need to figure out who was responsible (which parameters contributed to the error) and by how much. Backpropagation is the accounting system that assigns credit and blame to millions of parameters simultaneously.

Backpropagation is just the chain rule from calculus, applied systematically. If you remember anything from calculus, the chain rule lets you compute derivatives of composed functions: how a change in an input ripples through to affect the output.

If y = f(g(x)), then:

dy/dx = dy/dg * dg/dx

In a neural network, the loss depends on parameters through a chain of operations:

Loss ← Output ← Hidden ← ... ← Input ← Parameters

To find how the loss changes with respect to a parameter, we multiply gradients along the path.

Let’s derive backprop for a 2-layer network step by step.

Forward pass:

Z1 = W1 @ X + b1
A1 = ReLU(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)
L = cross_entropy(Y, A2)

Backward pass (computing gradients):

  1. Output layer gradient:
dL/dZ2 = A2 - Y # For sigmoid + cross-entropy
  1. Output layer parameters:
dL/dW2 = (1/m) * dZ2 @ A1.T
dL/db2 = (1/m) * sum(dZ2, axis=1)
  1. Hidden layer gradient:
dL/dA1 = W2.T @ dZ2
dL/dZ1 = dL/dA1 * ReLU'(Z1) # Element-wise
  1. Hidden layer parameters:
dL/dW1 = (1/m) * dZ1 @ X.T
dL/db1 = (1/m) * sum(dZ1, axis=1)
def backward(AL, Y, caches):
"""
Backward propagation through L layers.
Args:
AL: Final output from forward prop
Y: True labels
caches: Intermediate values from forward prop
Returns:
gradients: Dict with dW1, db1, dW2, db2, ..., dWL, dbL
"""
gradients = {}
L = len(caches)
m = AL.shape[1]
# Output layer (sigmoid + cross-entropy)
dAL = -(Y / AL) + (1 - Y) / (1 - AL)
# For sigmoid: dZ = dA * sigmoid'(Z) = dA * A * (1 - A)
A_prev, W, b, Z = caches[L-1]
dZ = AL - Y # Simplified for sigmoid + cross-entropy
gradients[f'dW{L}'] = (1/m) * dZ @ A_prev.T
gradients[f'db{L}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
dA_prev = W.T @ dZ
# Hidden layers (ReLU)
for l in reversed(range(L-1)):
A_prev, W, b, Z = caches[l]
# ReLU derivative: 1 if Z > 0, else 0
dZ = dA_prev * (Z > 0)
gradients[f'dW{l+1}'] = (1/m) * dZ @ A_prev.T
gradients[f'db{l+1}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
if l > 0:
dA_prev = W.T @ dZ
return gradients
Forward: X ───► Z1 ───► A1 ───► Z2 ───► A2 ───► Loss
↑ ↑ ↑ ↑
W1 --- W2 ---
Backward: dX ◄─── dZ1 ◄─── dA1 ◄─── dZ2 ◄─── dA2 ◄─── dLoss
↓ ↓
dW1 dW2
db1 db2

Now that we can compute gradients, we need to update parameters to minimize the loss.

def update_parameters(parameters, gradients, learning_rate):
"""
Update parameters using gradient descent.
θ = θ - α * ∂L/∂θ
"""
L = len(parameters) // 2
for l in range(1, L + 1):
parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']
return parameters

Imagine you’re blindfolded on a hilly landscape, trying to find the lowest point. You can’t see anything, but you can feel the slope under your feet. Here’s your strategy:

  1. Feel the slope under your feet (compute gradient)
  2. Take a step downhill (update parameters)
  3. Repeat until you can’t go lower

The gradient always points “uphill,” so going in the opposite direction takes you “downhill” toward lower loss. It’s like water flowing downhill—it always finds the path of steepest descent, eventually settling in a valley.

The remarkable thing is that this simple strategy works even in spaces with millions of dimensions. In a neural network with 10 million parameters, you’re navigating a 10-million-dimensional landscape—far beyond human imagination—but the math doesn’t care. Gradient descent still finds the way down.

The learning rate (α) controls step size:

Too small: Slow convergence, may get stuck
Too large: Overshooting, unstable training
Just right: Smooth convergence to minimum
Loss
│ ╲ Learning rate too large
│ ╲╱╲ (overshooting)
│ ╲
│ ───── Just right
└────────────── iterations

Batch Gradient Descent:

  • Use ALL samples to compute gradient
  • Slow for large datasets
  • Smooth but expensive

Stochastic Gradient Descent (SGD):

  • Use ONE sample per update
  • Noisy but fast
  • Can escape local minima

Mini-batch Gradient Descent:

  • Use a small batch (32, 64, 128 samples)
  • Best of both worlds
  • Most common in practice

Putting it all together:

def train(X, Y, layer_dims, learning_rate=0.01, num_iterations=1000):
"""
Train a neural network.
Args:
X: Training data (n_features, m_samples)
Y: Labels (n_classes, m_samples)
layer_dims: List of layer sizes [n_x, n_h1, n_h2, ..., n_y]
learning_rate: Step size for gradient descent
num_iterations: Number of training iterations
Returns:
parameters: Trained weights and biases
costs: Loss history
"""
# Initialize parameters
parameters = initialize_parameters(layer_dims)
costs = []
for i in range(num_iterations):
# Forward propagation
AL, caches = forward(X, parameters)
# Compute cost
cost = compute_cost(AL, Y)
costs.append(cost)
# Backward propagation
gradients = backward(AL, Y, caches)
# Update parameters
parameters = update_parameters(parameters, gradients, learning_rate)
# Print progress
if i % 100 == 0:
print(f"Iteration {i}: cost = {cost:.4f}")
return parameters, costs

MNIST is the “Hello World” of deep learning - a dataset of 70,000 handwritten digits.

Training: 60,000 images
Test: 10,000 images
Size: 28x28 pixels (784 features when flattened)
Classes: 10 (digits 0-9)
Sample images:
┌────────┐ ┌────────┐ ┌────────┐
│ ██ │ │ █ │ │ ███ │
│ █ █ │ │ ██ │ │ █ │
│ █ █ │ │ █ │ │ ██ │
│ █ █ │ │ █ │ │ █ │
│ ████ │ │ ███ │ │ ███ │
└────────┘ └────────┘ └────────┘
0 1 2
  1. Small enough: Can train on CPU in minutes
  2. Hard enough: Requires real learning
  3. Well-understood: Benchmark results available
  4. Visual: Easy to interpret results
ModelAccuracy
Random guessing10%
Simple neural net (no hidden layers)~92%
2-layer neural net~97%
CNN (state of art)~99.8%

Our goal: Build a 2-layer network achieving ~97% accuracy.


MNIST has become so central to machine learning education that it’s worth understanding its role and limitations.

MNIST was created by Yann LeCun (now at Meta AI) and colleagues in 1998. It combines modified samples from NIST’s original dataset of handwritten digits.

MNIST is often called the “fruit fly of machine learning” - a simple model organism for experimenting. Just as biologists use fruit flies to understand genetics, ML researchers use MNIST to test new ideas.

Modern techniques achieve >99.5% accuracy on MNIST. Researchers have moved on to harder datasets like:

  • CIFAR-10: 60,000 color images in 10 classes
  • ImageNet: 14 million images in 20,000+ classes
  • Fashion-MNIST: Harder clothing items (drop-in MNIST replacement)

Humans achieve about 97.5% accuracy on MNIST (yes, some digits are ambiguous!). A well-tuned neural network can beat human performance.


Bad initialization can kill training:

# BAD: All zeros (all neurons learn the same thing)
W = np.zeros((n_out, n_in))
# BAD: Large random (exploding activations)
W = np.random.randn(n_out, n_in) * 100
# GOOD: Xavier/Glorot initialization
W = np.random.randn(n_out, n_in) * np.sqrt(1 / n_in)
# BETTER: He initialization (for ReLU)
W = np.random.randn(n_out, n_in) * np.sqrt(2 / n_in)

Why He initialization?

  • Keeps variance of activations roughly constant across layers
  • Prevents vanishing/exploding gradients
  • Named after Kaiming He (Microsoft Research, 2015)
# BAD: Can overflow/underflow
def sigmoid(z):
return 1 / (1 + np.exp(-z))
# GOOD: Numerically stable
def sigmoid(z):
z = np.clip(z, -500, 500) # Prevent overflow
return 1 / (1 + np.exp(-z))
# Cross-entropy with stability
def cross_entropy(y, y_hat):
epsilon = 1e-15
y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
def softmax(z):
"""
Softmax function for multi-class classification.
Converts logits to probabilities that sum to 1.
"""
# Subtract max for numerical stability
exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
return exp_z / np.sum(exp_z, axis=0, keepdims=True)

plt.plot(costs)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training Loss')

A healthy loss curve:

Loss
2.0├───╲
│ ╲
1.0├─────╲────
│ ╲──────
0.5├───────────────
└─────────────────
0 500 1000 iterations

For 2D data, we can visualize what the network learned:

def plot_decision_boundary(model, X, Y):
# Create grid
x_min, x_max = X[0, :].min() - 0.5, X[0, :].max() + 0.5
y_min, y_max = X[1, :].min() - 0.5, X[1, :].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
np.arange(y_min, y_max, 0.01))
# Predict on grid
Z = model(np.c_[xx.ravel(), yy.ravel()].T)
Z = Z.reshape(xx.shape)
# Plot
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X[0, :], X[1, :], c=Y, edgecolors='black')

For MNIST, first-layer weights can be visualized as 28x28 images:

# Reshape and plot weights
for i in range(10):
plt.subplot(2, 5, i+1)
plt.imshow(W1[i].reshape(28, 28), cmap='gray')
plt.axis('off')

You’ll see the network has learned to look for digit-like patterns!


Build a single-layer perceptron that can learn:

  • AND gate
  • OR gate
  • (Show it fails on XOR)

Build a 2-layer network that learns XOR:

Input: [0,0] → Output: 0
Input: [0,1] → Output: 1
Input: [1,0] → Output: 1
Input: [1,1] → Output: 0

Build a network that achieves >95% accuracy on MNIST:

  • Architecture: 784 → 128 → 64 → 10
  • Activation: ReLU (hidden), Softmax (output)
  • Loss: Cross-entropy
  • Optimizer: Mini-batch gradient descent

Experiment with:

  • Learning rate: 0.001, 0.01, 0.1, 1.0
  • Hidden layer sizes: 32, 64, 128, 256
  • Number of layers: 1, 2, 3
  • Batch size: 16, 32, 64, 128

For this module, you will build:

  • No PyTorch, no TensorFlow, only NumPy
  • Configurable architecture (any number of layers)
  • Support for ReLU, Sigmoid, Tanh activations
  • Forward and backward propagation
  • Train on MNIST dataset
  • Achieve >95% test accuracy
  • Visualize training progress
  • Show misclassified examples

3. Neural Network Toolkit (Main Deliverable)

Section titled “3. Neural Network Toolkit (Main Deliverable)”
  • Complete implementation with all components
  • Training visualization
  • Model save/load functionality
  • Performance benchmarks
  • CLI interface

  • “Neural Networks and Deep Learning” by Michael Nielsen (free online)
  • “Deep Learning” by Goodfellow, Bengio, Courville (the “bible”)
  • 3Blue1Brown: “Neural Networks” series (excellent visualizations)
  • Andrej Karpathy: “Neural Networks: Zero to Hero”

These facts illuminate why neural networks work and the breakthroughs that made modern deep learning possible.

For decades, training deep networks was nearly impossible. Gradients would shrink exponentially as they propagated backward, making early layers learn incredibly slowly. Solutions:

  • ReLU activation (2010)
  • Batch normalization (2015)
  • Skip connections (ResNet, 2015)

A neural network with just ONE hidden layer can approximate ANY continuous function (given enough neurons). Proved by George Cybenko in 1989.

So why go deep?

  • Deep networks are exponentially more efficient
  • A function that needs 2^n neurons in a shallow network might need only n layers

In 2019, Jonathan Frankle and Michael Carlin showed that large networks contain small subnetworks (“winning tickets”) that, when trained alone, match the full network’s performance. This suggests most of a network’s capacity is redundant!


  1. A neural network is just a function. Specifically, it’s a parameterized function f(x; θ) where learning means finding good values for θ. All the magic is in the parameters.

  2. Forward propagation is function composition. Each layer transforms its input: Z = W @ A + b, then A = activation(Z). Stack enough layers, and you can approximate any function.

  3. Loss functions measure wrongness. Cross-entropy for classification, MSE for regression. The lower the loss, the better the predictions match reality.

  4. Backpropagation is just the chain rule. It assigns credit and blame to each parameter by computing how much changing that parameter would change the loss. No magic, just calculus.

  5. Gradient descent navigates the loss landscape. By repeatedly moving in the direction that reduces loss most steeply, we find good parameter values—even in spaces with millions of dimensions.

  6. Activation functions provide non-linearity. Without them, stacking linear layers would just produce another linear function. ReLU is the modern standard because it avoids vanishing gradients.

  7. Initialization matters. He initialization keeps gradients flowing through deep networks. Bad initialization can kill training before it starts.

  8. The core ideas are old; the results are new. Backpropagation (1986), convolutions (1989), and gradient descent (centuries old) power modern AI. What changed was data, compute, and clever tricks.


With neural networks understood from scratch, you’re ready for Module 27: PyTorch Fundamentals.

You’ll learn how PyTorch:

  • Automates gradient computation (autograd)
  • Provides GPU acceleration
  • Offers building blocks for complex architectures
  • Makes training much more convenient

But now you know what’s happening under the hood. You’re not just using magic - you understand it.


Module 26: Building the foundation of deep learning understanding.