Skip to content

Training Neural Networks

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8

Or: The Framework That Made Deep Learning Accessible

Section titled “Or: The Framework That Made Deep Learning Accessible”

Reading Time: 6-7 hours Prerequisites: Module 26


When researcher Soumith Chintala discovered in September 2016 that debugging TensorFlow was like trying to repair a car engine while it was running, he realized something had to change. After watching brilliant AI engineers waste days fighting incomprehensible error messages, he and colleague Adam Paszke found a better way. That night, they started building PyTorch. Within five years, their creation would power everything from gpt-5 to Stable Diffusion, fundamentally changing how the world builds AI.


Did You Know? The Researcher’s Rebellion

Section titled “Did You Know? The Researcher’s Rebellion”

Menlo Park. September 2016. 11:47 PM.

Soumith Chintala was done. For months, he had watched brilliant AI researchers at Facebook waste hours—sometimes days—fighting TensorFlow’s static graphs. Print statements didn’t work. Debuggers were useless. One typo meant recompiling everything.

“This is insane,” he muttered to his colleague Adam Paszke. “We’re supposed to be doing AI research, not fighting our tools.”

They decided to build something better. Not incrementally better—fundamentally different. A framework where Python code was just… Python code. Where you could debug neural networks like any other program. Where ideas could be tested in minutes, not days.

They called it PyTorch. Within three years, it would conquer academic AI. Within five, it would power everything from gpt-5 to Stable Diffusion.

“The best framework is the one that gets out of your way. TensorFlow made you think about graphs. PyTorch just let you think about math.” — Soumith Chintala, PyTorch creator (2020 interview)


By the end of this module, you will:

  • Understand PyTorch tensors and their relationship to NumPy arrays
  • Master automatic differentiation with autograd
  • Build neural networks using nn.Module
  • Implement training loops with optimizers and loss functions
  • Move computations between CPU and GPU
  • Appreciate the elegance of PyTorch compared to manual implementations

In Module 26, you built neural networks from scratch. You computed gradients by hand using the chain rule. You tracked intermediate values in caches. You debugged NaN explosions at 2am.

It was educational. It was also painful.

That pain was the point. You now understand what happens under the hood. But here’s the truth: nobody builds production neural networks from scratch. Think of it like learning to build a car engine from raw metal before being allowed to drive. Valuable knowledge, but not how you’d get to work every day.

PyTorch is the power tool that makes deep learning practical.

Think of it this way: In Module 26, you learned to chop down a tree with a hand axe. Now you get a chainsaw. The chainsaw doesn’t make the hand axe knowledge useless—understanding how to fell a tree helps you use the chainsaw safely and effectively.


In 2015, Google released TensorFlow. It was powerful, backed by Google’s resources, and quickly became the dominant deep learning framework. But researchers had a problem: TensorFlow was painful to use.

TensorFlow 1.x used something called static graphs. You had to:

  1. Define your entire computation as a graph
  2. Compile the graph
  3. Create a “session” to run it
  4. Feed data through placeholders

It was like writing a recipe, translating it to assembly language, compiling it, and only then cooking - for every single meal.

# TensorFlow 1.x - The pain was real
import tensorflow as tf
# Step 1: Define placeholders (not real data yet!)
x = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.float32, shape=[None, 10])
# Step 2: Build the graph (nothing runs!)
W = tf.Variable(tf.zeros([784, 10]))
logits = tf.matmul(x, W)
# Step 3: Create a session
with tf.Session() as sess:
sess.run(tf.global_variables_initializer()) # Initialize
result = sess.run(logits, feed_dict={x: data}) # Finally run!

Debugging was a nightmare. You couldn’t just print a variable - you had to evaluate it in a session. Error messages pointed to the graph construction, not where the actual problem was.

Enter Soumith Chintala at Facebook AI Research. He and his team created PyTorch in 2016 with a radical philosophy: define-by-run.

Instead of building a graph and then running it, PyTorch builds the graph as you run Python code. You write normal Python. You can use print statements. You can use Python debuggers. If statements, for loops - they all just work.

# PyTorch - The relief was immediate
import torch
x = torch.randn(32, 784) # This creates actual data!
W = torch.randn(784, 10, requires_grad=True)
logits = x @ W # This actually computes the result!
print(logits.shape) # You can just print it!

By 2019, PyTorch had conquered academia:

  • NeurIPS 2019: 75% of papers used PyTorch
  • ICLR 2020: 80% PyTorch
  • CVPR 2020: 70% PyTorch

Why? Researchers need to iterate fast. They try crazy ideas. Many don’t work. Static graphs meant recompiling for every experiment. Dynamic graphs meant instant feedback.

TensorFlow noticed. TensorFlow 2.0 (2019) adopted eager execution by default - essentially admitting PyTorch got it right.

Why “PyTorch”? It’s the Python version of Torch, a scientific computing framework written in Lua that was popular in academia during the early 2010s. The original Torch was named after the Olympic torch - a symbol of passing knowledge forward.

The PyTorch logo is a stylized flame - representing both the torch and the “fire” of GPU-accelerated computing.


You’ve heard “tensor” thrown around. Let’s demystify it.

A tensor is just a multi-dimensional array. That’s it. No magic. But this simple concept is the foundation of everything in deep learning.

Think of tensors like containers of different dimensions:

DimensionsMath NameReal ExampleShape
0ScalarThe temperature right now: 72.5°F[]
1VectorToday’s hourly temperatures: [68, 70, 72, 75, 73][5]
2MatrixA grayscale image with pixel values[28, 28]
33D TensorA color image (RGB channels × height × width)[3, 224, 224]
44D TensorA batch of color images[32, 3, 224, 224]
55D TensorA batch of video clips (batch × frames × channels × H × W)[8, 16, 3, 224, 224]

The key insight: Neural networks don’t care about what data means. They just see tensors of numbers. An image, a sentence, a stock price history - all become tensors.

You might wonder: we already have NumPy. Why learn another array type?

Three killer features:

1. Automatic Differentiation

NumPy can do math on arrays. PyTorch can do math on arrays and track how to compute gradients. This is the magic that makes deep learning practical.

# NumPy: Just computation
import numpy as np
x = np.array([2.0, 3.0])
y = x ** 2 # [4, 9]
# Now compute dy/dx manually? Good luck!
# PyTorch: Computation + gradient tracking
import torch
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum() # 13
y.backward() # Compute gradients automatically
print(x.grad) # tensor([4., 6.]) - that's dy/dx = 2x!

2. GPU Acceleration

Moving computation to a GPU in NumPy requires different libraries and painful code changes. In PyTorch, it’s one line:

# CPU tensor
x = torch.randn(1000, 1000)
# GPU tensor - one line!
x_gpu = x.cuda() # or x.to('cuda')

3. Deep Learning Ecosystem

PyTorch tensors integrate seamlessly with neural network layers, optimizers, data loaders, and the entire deep learning workflow.

Let’s get hands-on. There are many ways to create tensors:

From Python Data

The most direct way - convert Python lists:

import torch
# From a simple list
x = torch.tensor([1, 2, 3, 4, 5])
print(x) # tensor([1, 2, 3, 4, 5])
# From nested lists (creates a matrix)
matrix = torch.tensor([[1, 2, 3],
[4, 5, 6]])
print(matrix.shape) # torch.Size([2, 3])

Common Initializations

In practice, you rarely type out values. You create tensors filled with specific patterns:

# Zeros and ones - common for initialization
zeros = torch.zeros(3, 4) # 3×4 matrix of zeros
ones = torch.ones(2, 3, 4) # 2×3×4 tensor of ones
# Random values - essential for weight initialization
uniform = torch.rand(5, 5) # Uniform between [0, 1)
normal = torch.randn(5, 5) # Normal distribution (mean=0, std=1)
# Sequences - useful for indices and positions
sequence = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
# Identity matrix - useful in linear algebra
identity = torch.eye(4) # 4×4 identity matrix

Copying Shape from Another Tensor

Often you need a tensor the same shape as another:

x = torch.randn(3, 4, 5)
# Create zeros/ones with the same shape, dtype, and device
zeros_like_x = torch.zeros_like(x)
ones_like_x = torch.ones_like(x)
random_like_x = torch.randn_like(x)

This is especially useful when you need to create tensors on the same device (CPU or GPU) as your model.

Every tensor has properties you’ll check constantly:

t = torch.randn(3, 4, 5)
# Shape: The dimensions of the tensor
print(t.shape) # torch.Size([3, 4, 5])
print(t.size()) # Same thing, method form
# Data type: What kind of numbers
print(t.dtype) # torch.float32 (default for randn)
# Device: Where the tensor lives
print(t.device) # cpu (or cuda:0, cuda:1, etc.)
# Number of dimensions
print(t.ndim) # 3
# Total number of elements
print(t.numel()) # 60 (3 × 4 × 5)

Did You Know? Data Types Matter More Than You Think

Section titled “Did You Know? Data Types Matter More Than You Think”

PyTorch supports many data types, and choosing the right one affects both correctness and performance.

For Neural Networks (most common):

  • torch.float32 (or torch.float) - The default. Good balance of precision and speed.
  • torch.float16 (or torch.half) - Half precision. 2× faster on modern GPUs, but less precise.
  • torch.bfloat16 - “Brain float”. Better for training than float16 because it has more exponent bits.

For Indices and Counts:

  • torch.int64 (or torch.long) - Required for indices in PyTorch. Most common integer type.
  • torch.int32 - When you know values fit and want to save memory.

For Images:

  • torch.uint8 - Unsigned 8-bit integers (0-255). Raw image format.
# Creating tensors with specific types
weights = torch.randn(100, 100, dtype=torch.float32)
indices = torch.tensor([0, 5, 3, 7], dtype=torch.long)
image = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)
# Converting between types
weights_half = weights.half() # to float16
weights_back = weights_half.float() # back to float32

The fp16 Training Revolution

In 2017, researchers discovered you could train neural networks in half precision (float16) with almost no accuracy loss - but 2× faster and using half the memory. This “mixed precision training” is now standard for large models.

# Modern training uses automatic mixed precision
with torch.cuda.amp.autocast():
output = model(input) # Automatically uses fp16 where safe

PyTorch and NumPy are best friends. They can share memory, making conversion instant:

import numpy as np
# NumPy → PyTorch (shared memory!)
numpy_array = np.array([1, 2, 3, 4, 5])
tensor = torch.from_numpy(numpy_array)
# They share memory - changes propagate!
numpy_array[0] = 100
print(tensor) # tensor([100, 2, 3, 4, 5]) - changed too!
# If you want a copy instead:
tensor_copy = torch.tensor(numpy_array) # Independent copy
# PyTorch → NumPy
tensor = torch.randn(3, 4)
numpy_array = tensor.numpy() # Shared memory (if on CPU)
# Safe conversion (handles GPU tensors too)
numpy_array = tensor.detach().cpu().numpy()

Warning: The shared memory behavior is a feature, not a bug. It’s fast and memory-efficient. But it can surprise you if you modify one and expect the other unchanged!


Part 2: Autograd - The Magic Behind Deep Learning

Section titled “Part 2: Autograd - The Magic Behind Deep Learning”

This is where PyTorch becomes truly magical. Remember computing gradients by hand in Module 26? All those partial derivatives, the chain rule applied recursively, the careful tracking of intermediate values?

PyTorch does all of that automatically.

When you create a tensor with requires_grad=True, PyTorch starts recording every operation. It builds an invisible “computational graph” that tracks how to compute gradients.

Let’s see it in action:

# Create a tensor that tracks gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
# Perform operations - PyTorch records them!
y = x ** 2 # y = [4, 9]
z = y.sum() # z = 13
# Compute gradients
z.backward()
# Gradients are stored in .grad
print(x.grad) # tensor([4., 6.])

What happened? Let’s trace through:

  • z = x[0]² + x[1]²
  • ∂z/∂x[0] = 2·x[0] = 2·2 = 4
  • ∂z/∂x[1] = 2·x[1] = 2·3 = 6

PyTorch computed exactly the gradients we’d compute by hand - but automatically!

In Module 26, you implemented backpropagation manually. For a simple network, it was manageable. But modern networks have:

  • Millions of parameters
  • Hundreds of layers
  • Complex architectures (skip connections, attention, normalization)

Computing gradients by hand for gpt-5? That would be tens of thousands of lines of gradient code. With PyTorch:

loss.backward() # That's it. Gradients for every parameter.

The real power shows with complex computations:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Complex computation with multiple steps
y = x * 2 # y = [2, 4, 6]
z = y ** 2 # z = [4, 16, 36]
loss = z.mean() # loss = 56/3 ≈ 18.67
# One backward call - all gradients computed!
loss.backward()
print(x.grad) # tensor([2.6667, 5.3333, 8.0000])

Let’s verify: The chain rule says:

  • ∂loss/∂x = ∂loss/∂z · ∂z/∂y · ∂y/∂x
  • ∂loss/∂z = 1/3 (derivative of mean)
  • ∂z/∂y = 2y = [4, 8, 12]
  • ∂y/∂x = 2
  • Result: [4·2/3, 8·2/3, 12·2/3] = [8/3, 16/3, 24/3]

Did You Know? The Secret History of Automatic Differentiation

Section titled “Did You Know? The Secret History of Automatic Differentiation”

Automatic differentiation isn’t a deep learning invention. It was developed in the 1960s and 1970s for computational physics and engineering!

Reverse-mode automatic differentiation (what PyTorch uses) was published by Seppo Linnainmaa in 1970 for his master’s thesis. Backpropagation in neural networks was rediscovered independently in the 1980s - it’s the same algorithm applied to neural network computation graphs.

The key insight: forward-mode AD is efficient when you have few inputs and many outputs. Reverse-mode is efficient when you have many inputs and few outputs. Neural networks have millions of inputs (weights) and one output (loss) - reverse mode wins!

This trips up every PyTorch beginner:

x = torch.tensor([1.0], requires_grad=True)
y = x * 2
y.backward()
print(x.grad) # tensor([2.])
y = x * 3
y.backward()
print(x.grad) # tensor([5.]) - Not 3! It's 2 + 3!

By default, calling .backward() adds to existing gradients. This is actually useful for some advanced techniques (like gradient accumulation), but usually you want fresh gradients each time.

The fix: Zero gradients before each backward pass:

x.grad.zero_() # Zero out accumulated gradients
y = x * 4
y.backward()
print(x.grad) # tensor([4.]) - Fresh gradient

In training loops, you’ll see optimizer.zero_grad() - this zeros all parameter gradients.

Sometimes you want to use a computed value without tracking gradients:

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
# Detach: Creates a new tensor, no gradient tracking
z = y.detach()
print(z.requires_grad) # False
# Or use no_grad context
with torch.no_grad():
z = y * 2
print(z.requires_grad) # False

When to use this:

  • Computing metrics (accuracy, etc.) during training
  • Using a frozen pretrained model
  • Preventing gradients from flowing to certain parts of the network

Part 3: Building Neural Networks with nn.Module

Section titled “Part 3: Building Neural Networks with nn.Module”

Now we get to build actual neural networks! PyTorch provides torch.nn, a module specifically designed for deep learning.

Every neural network in PyTorch inherits from nn.Module. This base class provides:

  • Automatic parameter registration
  • Easy device movement (CPU ↔ GPU)
  • Training/evaluation mode switching
  • Model saving and loading

Here’s the pattern you’ll use hundreds of times:

import torch.nn as nn
class SimpleNetwork(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__() # Always call this first!
# Define layers (registered automatically!)
self.fc1 = nn.Linear(input_size, hidden_size)
self.fc2 = nn.Linear(hidden_size, output_size)
self.relu = nn.ReLU()
def forward(self, x):
# Define how data flows through the network
x = self.fc1(x)
x = self.relu(x)
x = self.fc2(x)
return x
# Create the network
model = SimpleNetwork(784, 128, 10)
# Forward pass - just call the model!
x = torch.randn(32, 784) # Batch of 32 images
output = model(x) # Calls forward() automatically
print(output.shape) # torch.Size([32, 10])

Compare this to Module 26 where you manually created weight matrices, implemented forward propagation, and tracked everything yourself. The PyTorch version is almost self-documenting!

That super().__init__() call isn’t just Python formality. It initializes PyTorch’s internal machinery that:

  • Creates a registry for parameters
  • Enables recursive calls like .to(device) on all submodules
  • Sets up hooks for saving/loading

Forget it, and your model silently breaks in confusing ways. Every PyTorch tutorial includes it, and now you know why!

PyTorch provides layers for every architecture. Here are the ones you’ll use most:

Linear (Fully Connected) Layers

These are the basic building blocks - matrix multiplication plus bias:

# Linear: y = xW^T + b
layer = nn.Linear(in_features=784, out_features=128)
# What it creates internally:
# - weight: [128, 784] matrix
# - bias: [128] vector (optional, bias=True by default)

Activation Functions

Activations introduce non-linearity (without them, a deep network is just one linear transformation):

# As modules (use in __init__)
nn.ReLU() # max(0, x) - most common
nn.LeakyReLU(0.01) # Allows small negative values
nn.GELU() # Used in transformers (smoother than ReLU)
nn.Sigmoid() # Squashes to [0, 1]
nn.Tanh() # Squashes to [-1, 1]
# As functions (use in forward)
import torch.nn.functional as F
output = F.relu(x)
output = F.gelu(x)

Normalization Layers

These stabilize training by normalizing intermediate values:

nn.BatchNorm1d(num_features) # Normalize across batch
nn.LayerNorm(normalized_shape) # Normalize across features (transformers)

Dropout

Randomly zeroes elements during training to prevent overfitting:

nn.Dropout(p=0.5) # 50% of elements set to zero during training

For simple architectures, you don’t need a custom class:

model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, 10)
)
# Works exactly like a custom nn.Module
output = model(input)

Use Sequential for prototypes and simple models. Use custom classes when you need complex control flow (if statements, loops, skip connections).

PyTorch makes it easy to see what’s inside:

model = SimpleNetwork(784, 128, 10)
# Print model structure
print(model)
# SimpleNetwork(
# (fc1): Linear(in_features=784, out_features=128, bias=True)
# (fc2): Linear(in_features=128, out_features=10, bias=True)
# (relu): ReLU()
# )
# List all parameters with names
for name, param in model.named_parameters():
print(f"{name}: {param.shape}")
# fc1.weight: torch.Size([128, 784])
# fc1.bias: torch.Size([128])
# fc2.weight: torch.Size([10, 128])
# fc2.bias: torch.Size([10])
# Count total parameters
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}") # 101,770

Now for the payoff - actually training a network!

A loss function measures how wrong your predictions are. Lower is better.

For Classification (choosing between categories):

# Cross-entropy loss - the workhorse of classification
criterion = nn.CrossEntropyLoss()
# It expects:
# - Input: Raw logits (NOT softmaxed!) Shape: [batch, num_classes]
# - Target: Class indices (NOT one-hot!) Shape: [batch]
logits = torch.randn(32, 10) # Raw network output
labels = torch.randint(0, 10, (32,)) # Class labels 0-9
loss = criterion(logits, labels)

Critical: CrossEntropyLoss applies softmax internally. Don’t softmax your outputs first - you’ll get wrong gradients and worse training!

For Regression (predicting numbers):

criterion = nn.MSELoss() # Mean squared error
criterion = nn.L1Loss() # Mean absolute error

Optimizers implement gradient descent algorithms. They take gradients and update parameters:

import torch.optim as optim
# SGD: Simple, but needs tuning
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Adam: Usually works well out of the box
optimizer = optim.Adam(model.parameters(), lr=0.001)
# AdamW: Adam with proper weight decay (recommended for transformers)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

Adam (2014) combined ideas from two earlier optimizers:

  • Momentum: Use exponentially weighted average of past gradients
  • RMSprop: Adapt learning rate per-parameter based on gradient history

The name “Adam” comes from “adaptive moment estimation”. Within a year of publication, it became the default optimizer for most deep learning - it just works in most situations without careful tuning.

But it’s not perfect! Researchers later found that Adam’s weight decay implementation was subtly wrong. AdamW (2017) fixed this, and it’s now preferred for large models.

Here’s the standard PyTorch training pattern you’ll use forever:

model = SimpleNetwork(784, 128, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(num_epochs):
model.train() # Enable training mode (dropout, batchnorm behave differently)
for batch_idx, (data, labels) in enumerate(train_loader):
# 1. Zero gradients from previous batch
optimizer.zero_grad()
# 2. Forward pass
outputs = model(data)
# 3. Compute loss
loss = criterion(outputs, labels)
# 4. Backward pass (compute gradients)
loss.backward()
# 5. Update weights
optimizer.step()
if batch_idx % 100 == 0:
print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

The five steps are always the same:

  1. Zero gradients
  2. Forward pass
  3. Compute loss
  4. Backward pass
  5. Update weights

This pattern works whether you’re training a 2-layer network on MNIST or a billion-parameter language model.

When evaluating (not training), you need to:

  1. Switch to evaluation mode (changes dropout/batchnorm behavior)
  2. Disable gradient computation (faster, uses less memory)
model.eval() # Evaluation mode
correct = 0
total = 0
with torch.no_grad(): # Don't compute gradients
for data, labels in test_loader:
outputs = model(data)
_, predicted = outputs.max(1) # Get predicted class
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
print(f"Accuracy: {accuracy:.2f}%")

GPUs can make training 10-100× faster. PyTorch makes GPU computing almost trivially easy.

# Check if CUDA (GPU) is available
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"GPU Count: {torch.cuda.device_count()}")
else:
print("No GPU available, using CPU")
# The standard pattern: create a device variable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move tensors
x = torch.randn(1000, 1000)
x_gpu = x.to(device)
# Move models (moves all parameters)
model = SimpleNetwork(784, 128, 10).to(device)
# Create tensors directly on GPU
y = torch.randn(1000, 1000, device=device)

The only change from CPU training: move data to the GPU each batch.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNetwork(784, 128, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(num_epochs):
for data, labels in train_loader:
# Move data to GPU
data = data.to(device)
labels = labels.to(device)
optimizer.zero_grad()
outputs = model(data)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

GPU memory is precious and limited. Common mistakes:

Memory Leak #1: Storing Tensors in Python Lists

# BAD - keeps entire computation graph!
losses = []
for batch in loader:
loss = criterion(model(batch), labels)
losses.append(loss) # Full tensor with gradient graph!
# GOOD - just keep the number
losses = []
for batch in loader:
loss = criterion(model(batch), labels)
losses.append(loss.item()) # Just the Python float

Memory Leak #2: Not Using no_grad() During Evaluation

# BAD - builds computation graph unnecessarily
accuracy = (model(data).argmax(1) == labels).float().mean()
# GOOD - no gradient tracking needed
with torch.no_grad():
accuracy = (model(data).argmax(1) == labels).float().mean()

Checking Memory:

print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

PyTorch’s DataLoader handles batching, shuffling, and parallel loading.

from torch.utils.data import DataLoader, TensorDataset
# Create a dataset from tensors
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)
# Create a data loader
loader = DataLoader(
dataset,
batch_size=32, # Samples per batch
shuffle=True, # Shuffle each epoch (for training)
num_workers=4, # Parallel data loading processes
pin_memory=True # Faster GPU transfer
)
# Iterate
for batch_x, batch_y in loader:
print(batch_x.shape, batch_y.shape) # [32, 784], [32]

PyTorch provides standard datasets through torchvision:

from torchvision import datasets, transforms
# Define preprocessing
transform = transforms.Compose([
transforms.ToTensor(), # PIL Image → Tensor
transforms.Normalize((0.1307,), (0.3081,)) # MNIST mean/std
])
# Load MNIST
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, download=True, transform=transform)
# Create loaders
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

Always save your trained models!

# Save just the weights (recommended)
torch.save(model.state_dict(), 'model_weights.pth')
# Load
model = SimpleNetwork(784, 128, 10) # Create architecture first
model.load_state_dict(torch.load('model_weights.pth'))
model.eval() # Set to evaluation mode

For long training runs, save checkpoints to resume later:

# Save checkpoint
checkpoint = {
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

PyTorch model files use Python’s pickle format. Pickle can execute arbitrary code when loading!

# This could run malicious code!
model = torch.load('untrusted_model.pth') # Dangerous!

Only load models from sources you trust. For sharing models publicly, consider the new safetensors format that can’t execute code.


Part 8: PyTorch vs From-Scratch Comparison

Section titled “Part 8: PyTorch vs From-Scratch Comparison”

Let’s appreciate how far we’ve come. Here’s Module 26 versus PyTorch:

# Forward pass - tracking everything manually
def forward(self, X):
self.cache = {'A0': X}
A = X
for l in range(1, len(self.layer_dims)):
Z = self.params[f'W{l}'] @ A + self.params[f'b{l}']
A = self.activation_fn(Z)
self.cache[f'A{l}'] = A
return A
# Backward pass - chain rule by hand
def backward(self, Y):
m = Y.shape[1]
dA = -(Y / self.cache[f'A{L}'])
for l in reversed(range(1, L + 1)):
dZ = dA * self.activation_derivative(self.cache[f'A{l}'])
self.grads[f'dW{l}'] = (1/m) * dZ @ self.cache[f'A{l-1}'].T
self.grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
dA = self.params[f'W{l}'].T @ dZ
class Network(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.fc1 = nn.Linear(input_size, hidden_size)
self.fc2 = nn.Linear(hidden_size, output_size)
def forward(self, x):
x = F.relu(self.fc1(x))
return self.fc2(x)
# Training - all the complexity hidden
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward() # All gradients computed!
optimizer.step() # All weights updated!

What PyTorch automates:

  • Gradient computation for any architecture
  • Cache management
  • Numerical stability
  • GPU support
  • Optimizers
  • Data loading
  • Model serialization

PyTorch 2.0 (2022) introduced something remarkable: torch.compile().

model = MyModel()
model = torch.compile(model) # That's it!

One line turns your dynamic, debuggable PyTorch model into an optimized, compiled version that runs 30-200% faster. The dynamic graph philosophy remains - you can still use Python control flow, print statements, debuggers - but get static-graph performance.

This represents PyTorch’s philosophy: make the right thing easy. Build your model the simple way, debug it, make sure it works. Then compile for production.


PyTorch doesn’t cost money to use, but the ecosystem has significant economic implications:

Development Time Comparison:

TaskManual NumPyPyTorchSavings
Simple MLP1 day1 hour87%
CNN for images3 days4 hours83%
LSTM/Transformer1 week1 day86%
Training loop2 hours15 min88%
GPU support1-2 days5 min99%

At $150/hour senior engineer rate:

  • Manual implementation: $2,400 for CNN
  • PyTorch implementation: $600 for CNN
  • Savings: $1,800 per model

Training neural networks requires GPUs. The economics are stark:

GPUPurchase CostCloud Cost (AWS)MemorySpeed
RTX 3090$1,500-24GB1x
A100 40GB$15,000$3.06/hr40GB3x
A100 80GB$25,000$4.10/hr80GB3.5x
H100$30,000+$5.50/hr80GB5x

The crossover point: At ~250 hours of usage, buying a 3090 beats renting cloud GPUs.

FrameworkGitHub StarsPyPI Downloads/MonthJob Postings
PyTorch85,000+25M+65%
TensorFlow180,000+15M+30%
JAX30,000+3M+5%

The trend: PyTorch dominates research (75%+ of papers) and is rapidly gaining in production. TensorFlow is still strong in production deployments but declining.

Career impact data (from industry surveys):

  • Average salary premium for PyTorch skills: +$15,000/year
  • Time to become productive: 2-4 weeks
  • ROI: 375% in first year (assuming $15k premium / 4 weeks investment)

Q1: “What is automatic differentiation and how does PyTorch implement it?”

Strong Answer: “Automatic differentiation computes gradients by recording operations on tensors and building a computational graph. PyTorch uses reverse-mode autodiff—when you call .backward() on a loss, it traverses the graph backwards applying the chain rule at each node. This is more efficient than numerical differentiation (which requires many forward passes) and less error-prone than symbolic differentiation. In PyTorch, tensors with requires_grad=True track their operations. Each operation creates a grad_fn that knows how to compute its gradient. The graph is dynamic—rebuilt each forward pass—which enables Python control flow like if statements and loops.”

Q2: “Explain the difference between .detach(), .data, and torch.no_grad().”

Strong Answer: “.detach() creates a new tensor that shares storage but doesn’t track gradients—it’s a safe way to stop gradient flow. torch.no_grad() is a context manager that temporarily disables gradient computation for all operations—used during inference for speed and memory savings. .data is legacy and dangerous—it accesses the underlying tensor but can cause silent gradient errors. Modern code should use .detach() for new tensors and torch.no_grad() for inference blocks. In evaluation, always use model.eval() with torch.no_grad().”

Q3: “Why do we call optimizer.zero_grad() before backward()?”

Strong Answer: “PyTorch accumulates gradients by default—calling backward() adds to existing .grad values rather than replacing them. This is useful for gradient accumulation when you want to simulate larger batches than fit in memory. But usually, you want fresh gradients each step, so you zero them first. The typical training loop is: zero_grad → forward → loss → backward → step. Forgetting zero_grad leads to exploding gradients and incorrect updates. Some teams use model.zero_grad() instead, but optimizer.zero_grad() is preferred when using multiple optimizers or gradient accumulation.”

Q4: “How would you debug a neural network that’s not converging?”

Strong Answer: “Systematic debugging approach: First, check the data—visualize inputs, verify labels are correct, ensure proper normalization. Second, check the loss—is it NaN or constant? NaN means gradient explosion (reduce learning rate, add gradient clipping). Constant means gradients aren’t flowing (check activation functions, initialization). Third, overfit on one batch—if you can’t memorize a single batch, the model architecture or training code is broken. Fourth, check gradient flow—print gradient norms per layer. Vanishing gradients suggest ReLU dying or bad initialization. Fifth, try a known-good hyperparameter set before experimenting. The debugging motto: start simple, verify each component, add complexity gradually.”

Q5: “What’s the difference between nn.Module attributes and regular Python attributes?”

Strong Answer: “PyTorch’s nn.Module performs automatic registration. If you assign an nn.Module as an attribute (self.layer = nn.Linear()), it’s registered as a submodule—it appears in .parameters(), moves with .to(device), and saves with state_dict(). Regular Python attributes don’t get this treatment. There’s also nn.Parameter for custom trainable tensors and nn.Buffer for non-trainable state (like batch norm running averages). A common bug: storing layers in a Python list instead of nn.ModuleList—the layers won’t be registered and won’t train. Always use nn.ModuleList or nn.ModuleDict for dynamic layer collections.”

Q: “Design a PyTorch training pipeline for a large dataset that doesn’t fit in memory.”

Strong Answer Structure:

  1. DataLoader with num_workers: “Use multiple worker processes to load and preprocess data in parallel. Set num_workers=4-8 typically. Enable pin_memory=True for faster GPU transfer.”

  2. Memory-mapped datasets: “For huge files, use memory-mapped arrays (np.memmap) or streaming formats (WebDataset, TFDS). Load samples lazily on access.”

  3. Gradient accumulation: “For effective batch sizes larger than GPU memory allows, accumulate gradients over N steps before calling optimizer.step().”

  4. Mixed precision training: “Use torch.cuda.amp.autocast() for automatic fp16 where safe. Halves memory usage, doubles throughput on modern GPUs.”

  5. Checkpointing: “Save regularly. For very long runs, use torch.utils.checkpoint to trade compute for memory—recompute activations during backward.”

  6. Distributed training: “For multiple GPUs, use DistributedDataParallel (DDP), not DataParallel. DDP is faster and scales better.”


PyTorch started as a research framework but has matured for production:

Timeline of Production Features:

  • 2019: TorchScript for model export
  • 2020: TorchServe for serving models
  • 2021: Mobile support (iOS, Android)
  • 2022: torch.compile() for performance
  • 2023: ExecuTorch for edge devices

Who Uses PyTorch in Production?:

  • Tesla: Self-driving neural networks
  • Meta: Instagram recommendations, content moderation
  • Microsoft: Bing search ranking, Azure AI services
  • OpenAI: GPT models (pre-training and fine-tuning)
  • Stability AI: Stable Diffusion

Models trained in PyTorch can run anywhere via ONNX (Open Neural Network Exchange):

# Export to ONNX
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx")
# Run in ONNX Runtime (optimized for production)
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
output = session.run(None, {"input": numpy_input})

ONNX models can run on:

  • C++ applications (no Python dependency)
  • Mobile devices (iOS, Android)
  • Web browsers (ONNX.js)
  • Hardware accelerators (custom chips)

Did You Know? The Million-Dollar Gradient Explosion

Section titled “Did You Know? The Million-Dollar Gradient Explosion”

San Francisco. November 2021. 3:47 AM.

The Slack message woke up the entire ML team at a fintech startup. Their PyTorch model—which had been running perfectly for six months—was suddenly producing garbage predictions. Customer trades were being rejected. Losses were mounting.

The senior engineer’s first thought was a data pipeline bug. But the data looked fine. The model architecture hadn’t changed. The weights… wait. The weights were all NaN.

After four frantic hours, they found it: someone had “optimized” the training script by removing optimizer.zero_grad(). In production, they were running periodic retraining, and without zeroing gradients, they accumulated over 10,000 backward passes. The gradients exploded to infinity, then became NaN, and those NaNs propagated to the entire model.

The fix took one line. The outage cost $1.2M in lost trades and customer compensation.

“The most expensive bugs are the ones in code that seems too simple to be wrong.” — Their post-mortem document

The lesson: PyTorch’s gradient accumulation is a feature, not a bug. But forgetting that feature in production can be catastrophic. Always include optimizer.zero_grad() in your training loops, and add assertions that catch NaN values before they propagate.


Mistake #1: Forgetting to Call model.eval()

Section titled “Mistake #1: Forgetting to Call model.eval()”
# WRONG - dropout and batchnorm are still in training mode!
model.load_state_dict(torch.load('model.pth'))
predictions = model(test_data) # Results will be wrong!
# CORRECT - always switch to eval mode for inference
model.load_state_dict(torch.load('model.pth'))
model.eval() # Critical!
with torch.no_grad():
predictions = model(test_data)

Why it matters: Dropout randomly zeroes 50% of neurons during training. If you forget .eval(), you’re making predictions with half your model disabled. BatchNorm uses running statistics differently between modes.

Mistake #2: In-Place Operations Breaking Autograd

Section titled “Mistake #2: In-Place Operations Breaking Autograd”
# WRONG - in-place operations can break gradient computation
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.relu_() # In-place operation (notice the underscore)
z = y.sum()
z.backward() # RuntimeError: gradient computation requires non-inplace operations
# CORRECT - use out-of-place operations
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.relu() # Out-of-place (returns new tensor)
z = y.sum()
z.backward() # Works!
print(x.grad) # tensor([1., 1.])

The rule: Operations ending with _ modify tensors in-place and can break gradient tracking. Avoid them on tensors that need gradients.

# WRONG - MSELoss for classification
criterion = nn.MSELoss()
loss = criterion(outputs, labels.float()) # Numerically unstable!
# CORRECT - CrossEntropyLoss for classification
criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels) # Proper log-softmax handling

Mistake #4: Sending Model and Data to Different Devices

Section titled “Mistake #4: Sending Model and Data to Different Devices”
# WRONG - model on GPU, data on CPU
model = Model().cuda()
data = torch.randn(32, 784) # On CPU by default!
output = model(data) # RuntimeError: Input and parameter tensors are not on the same device
# CORRECT - ensure everything is on the same device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model().to(device)
data = torch.randn(32, 784).to(device)
output = model(data) # Works!

Mistake #5: Using Python Lists Instead of ModuleList

Section titled “Mistake #5: Using Python Lists Instead of ModuleList”
# WRONG - layers won't be registered as parameters!
class BadModel(nn.Module):
def __init__(self):
super().__init__()
self.layers = [nn.Linear(10, 10) for _ in range(5)] # Python list
# Check registered parameters:
model = BadModel()
print(list(model.parameters())) # Empty! Layers aren't registered!
# CORRECT - use nn.ModuleList
class GoodModel(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(5)])
model = GoodModel()
print(len(list(model.parameters()))) # 10 (5 weights + 5 biases)

Think of PyTorch like an orchestra conductor:

Without PyTorch (Manual Backprop): You’re not just conducting—you’re simultaneously playing every instrument. You have to track every note (forward pass), compute how each instrument should adjust (gradients), and remember the exact moment each note was played (caches). Exhausting and error-prone.

With PyTorch: You’re a conductor with a magical sheet music. You just wave your baton (call loss.backward()), and every musician instantly knows exactly how to adjust. The sheet music (computational graph) records everything automatically. You focus on the music (model architecture), not the mechanics.

With torch.compile(): Now you have an AI assistant analyzing your conducting patterns and pre-positioning the musicians for optimal performance. Same music, 30-200% faster.

This is why PyTorch transformed deep learning research: researchers could finally focus on the science instead of the plumbing.


Build intuition for autograd by experimenting with different computational graphs:

# Create tensors and compute gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Try different operations and predict gradients before running:
# 1. y = x.sum() - what's x.grad?
# 2. y = (x ** 2).sum() - what's x.grad?
# 3. y = x.mean() - what's x.grad?
# 4. y = x.max() - what's x.grad? (hint: sparse!)
# Verify your predictions with backward() and print x.grad

Challenge: Implement a custom function using autograd.Function that computes both forward and backward passes.

Train a complete neural network on MNIST:

# Requirements:
# - 2-3 hidden layers
# - Dropout for regularization
# - Adam optimizer
# - CrossEntropyLoss
# - Training and validation loop
# - Achieve >98% accuracy
# Starter code:
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
# ... complete the implementation

Success Criteria: >98% test accuracy in under 10 epochs.

Compare CPU vs GPU performance:

import time
def benchmark(device, size=4096, iterations=100):
x = torch.randn(size, size, device=device)
y = torch.randn(size, size, device=device)
# Warmup
for _ in range(10):
z = x @ y
if device.type == 'cuda':
torch.cuda.synchronize()
start = time.time()
for _ in range(iterations):
z = x @ y
if device.type == 'cuda':
torch.cuda.synchronize()
elapsed = time.time() - start
return elapsed / iterations
# Compare and create a plot of speedup vs matrix size

Expected Result: 10-50x speedup for large matrices on GPU.

Fix the bugs in this broken training loop:

# This code has 5 bugs. Find and fix them all!
class BuggyModel(nn.Module):
def __init__(self):
# Bug 1: Missing something here
self.layers = [nn.Linear(784, 128), nn.Linear(128, 10)]
def forward(self, x):
for layer in self.layers:
x = layer(x)
return x
model = BuggyModel()
criterion = nn.MSELoss() # Bug 2: Wrong loss for classification
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in range(10):
for data, labels in train_loader:
# Bug 3: Missing something before forward pass
outputs = model(data)
loss = criterion(outputs, labels) # Bug 4: labels need processing
loss.backward()
optimizer.step()
# Bug 5: Evaluation without proper mode switching
accuracy = (model(test_data).argmax(1) == test_labels).float().mean()

Deliverable: Fixed code that trains to >95% accuracy.


  1. PyTorch won the framework wars by prioritizing developer experience. Dynamic graphs and Pythonic design made research iteration 10x faster than static-graph alternatives.

  2. Tensors are the universal container for all data in deep learning. Images, text, audio—everything becomes tensors of floats.

  3. Autograd is magic that you understand. Having built backprop manually, you know what happens when you call loss.backward().

  4. nn.Module is the foundation for all PyTorch models. Always call super().init(), and use ModuleList/ModuleDict for dynamic layers.

  5. The training loop is always the same: zero_grad → forward → loss → backward → step. This works for any model, any scale.

  6. GPU computing is trivially easy: .to(device) moves anything. But watch for memory leaks—use .item() for scalars, no_grad() for inference.

  7. DataLoader handles the plumbing: Batching, shuffling, parallel loading. Set num_workers and pin_memory for maximum throughput.

  8. Save checkpoints religiously. Training failures happen. Don’t lose hours of GPU time to a crash.

  9. torch.compile() is the future. One line for 30-200% speedup, with no code changes.

  10. PyTorch doesn’t replace understanding—it amplifies it. You know what happens inside loss.backward(). That knowledge makes you dangerous.

Having built neural networks from scratch in Module 26, you now deeply appreciate what PyTorch gives you:

# Module 26: Dozens of lines of careful gradient computation
# PyTorch:
loss.backward()

PyTorch doesn’t replace understanding - it amplifies it. You know what happens inside that one line. You can debug it when things go wrong. You can extend it when needed.


In Module 28, you’ll learn Training Deep Networks:

  • Why deep networks are hard to train
  • Batch normalization
  • Dropout and regularization
  • Weight initialization strategies
  • Learning rate scheduling
  • Debugging training failures

The foundation is set. Now let’s learn the art of making networks actually converge!


While PyTorch dominates, there’s a rising challenger: JAX, developed at Google.

JAX started as “NumPy on steroids” but has become a serious deep learning framework. Its philosophy is different: instead of dynamic graphs (PyTorch) or static graphs (TensorFlow), JAX uses functional transformations.

# JAX: Transform functions, not tensors
import jax
import jax.numpy as jnp
def loss_fn(params, x, y):
predictions = predict(params, x)
return jnp.mean((predictions - y) ** 2)
# Get gradient function by transforming loss_fn
grad_fn = jax.grad(loss_fn)
gradients = grad_fn(params, x, y)

Who uses JAX?:

  • Google DeepMind: AlphaFold, Gemini
  • OpenAI: Some internal experiments
  • Research teams: When they need maximum performance on TPUs

The PyTorch vs JAX trade-off:

AspectPyTorchJAX
DebuggingPython debugger worksHarder (functional transforms)
EcosystemMassive (HuggingFace, etc.)Growing
TPU supportExists but limitedExcellent (Google’s TPUs)
GPU supportExcellentGood
Learning curveModerateSteep
Production toolingTorchServe, ONNXLess mature

Bottom line: PyTorch remains the default choice for 90%+ of practitioners. JAX is worth exploring if you need extreme performance, work with TPUs, or do cutting-edge research in areas like neural ODEs or differentiable physics. However, the PyTorch ecosystem’s maturity, especially HuggingFace integration and extensive tooling support, makes it the safer choice for most production applications. Unless you have a specific reason to choose JAX (like TPU-first deployment or cutting-edge functional programming research), start with PyTorch.


Books:

  • Deep Learning with PyTorch (Eli Stevens, Luca Antiga) - Official PyTorch book, free online
  • Programming PyTorch for Deep Learning (Ian Pointer) - O’Reilly practical guide
  • PyTorch Pocket Reference (Joe Papa) - Quick reference for common patterns

Video Courses:

  • Andrej Karpathy’s “Neural Networks: Zero to Hero” - Free YouTube series, builds intuition
  • fast.ai - Practical deep learning, uses PyTorch, emphasizes getting things working
  • NYU Deep Learning (Yann LeCun) - Graduate-level theory, available on YouTube

Interactive:

  • PyTorch Lightning - Framework that reduces boilerplate
  • Weights & Biases - Experiment tracking, integrates seamlessly
  • Hugging Face Transformers - Pre-trained models, all PyTorch-native

Forums and Communities:

  • PyTorch Forums (discuss.pytorch.org) - Official, active, helpful
  • r/pytorch - Reddit community
  • Stack Overflow [pytorch] - 50,000+ questions answered
  • PyTorch Discord - Real-time help

When Debugging:

  1. Check PyTorch version compatibility
  2. Search the error message verbatim
  3. Minimal reproducible example helps others help you
  4. The forums are friendlier than Stack Overflow for beginners

PyTorch is open source with over 3,000 contributors. If you find a bug or want to add a feature:

  1. File an issue on GitHub first
  2. Small PRs are more likely to be merged
  3. Documentation improvements are always welcome
  4. The contributing guide is thorough

  1. Official PyTorch Tutorials: https://pytorch.org/tutorials/
  2. Deep Learning with PyTorch (free book): https://pytorch.org/deep-learning-with-pytorch
  3. Andrej Karpathy’s micrograd: https://github.com/karpathy/micrograd - A tiny autograd engine for educational purposes
  4. PyTorch Internals: http://blog.ezyang.com/2019/05/pytorch-internals/ - How the magic works
  5. The Annotated Transformers: http://nlp.seas.harvard.edu/annotated-transformer - Transformer implementation in PyTorch with explanations

Last updated: 2025-12-11 Status: Complete