Skip to content

CNNs & Computer Vision

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Or: The Art of Making Neural Networks Actually Work

Section titled “Or: The Art of Making Neural Networks Actually Work”

Reading Time: 6-8 hours Prerequisites: Module 27


By the end of this module, you will:

  • Understand why deep networks are notoriously difficult to train (and the historical struggles)
  • Master batch normalization and layer normalization (the techniques that made deep learning possible)
  • Apply dropout correctly (and understand the common mistakes)
  • Choose the right weight initialization for your architecture
  • Implement learning rate schedules that converge faster
  • Use gradient clipping to prevent exploding gradients
  • Implement early stopping and checkpointing like a production ML engineer

The Dark Ages of Deep Learning: Why Training Was Nearly Impossible

Section titled “The Dark Ages of Deep Learning: Why Training Was Nearly Impossible”

Before we dive into the techniques, you need to understand something important: for decades, training networks deeper than 2-3 layers was essentially impossible. Not “difficult” — impossible.

Imagine you’re trying to pass a message through a chain of 100 people playing telephone. By the time the message reaches the last person, it’s completely garbled. That’s what happened to gradients in deep networks — they either exploded into infinity or vanished into nothing.

Did You Know? In 2006, Geoffrey Hinton published a paper called “A Fast Learning Algorithm for Deep Belief Nets” that’s often credited with kickstarting the deep learning revolution. But here’s the thing: his networks were only 3-4 layers deep. Even that was considered “deep” at the time! Today, networks like gpt-5 have hundreds of layers. The techniques in this module are what made that possible.

The Two Nightmares: Vanishing and Exploding Gradients

Section titled “The Two Nightmares: Vanishing and Exploding Gradients”

When you train a neural network, you’re computing gradients through backpropagation. Each layer multiplies the gradient by its weights. Here’s the problem:

Vanishing Gradients: If your weights are small (say, 0.5), multiplying many times gives you:

0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.001

After just 10 layers, your gradient is 1/1000th of what it started. After 50 layers? Essentially zero. The early layers never learn anything.

Exploding Gradients: If your weights are large (say, 2.0):

2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 1024

After 10 layers, your gradient is a thousand times larger. After 50 layers? Your computer gives up and returns NaN (not a number).

Did You Know? The exploding gradient problem was so common in the early days that researchers would joke about “NaN debugging” — spending hours figuring out why their loss suddenly became infinity. One famous story: a PhD student at Stanford spent three months debugging a model only to find that a single wrong activation function was causing gradients to explode after 15 iterations.

The Historical Solutions (That Didn’t Quite Work)

Section titled “The Historical Solutions (That Didn’t Quite Work)”

Before the modern techniques we’ll learn, researchers tried several approaches:

  1. Shallow Networks: Just… don’t go deep. Use 2-3 layers max.
  2. Careful Initialization: Initialize weights to very specific values (we’ll see this still matters)
  3. Layer-by-Layer Pre-training: Train one layer at a time, then fine-tune (tedious!)
  4. Gradient Checking: Manually verify gradients (slow and painful)

None of these scaled to the architectures we use today. What changed everything? The techniques in this module.


Batch Normalization: The Single Most Important Technique in Modern Deep Learning

Section titled “Batch Normalization: The Single Most Important Technique in Modern Deep Learning”

If you learn only one thing from this module, make it batch normalization. Seriously.

In 2015, Sergey Ioffe and Christian Szegedy at Google published a paper that would change deep learning forever: “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.”

Did You Know? The BatchNorm paper has been cited over 60,000 times, making it one of the most influential papers in machine learning history. For context, Einstein’s special relativity paper has about 3,000 citations. BatchNorm literally changed more papers than Einstein’s most famous work!

But here’s the funny thing: the paper’s explanation of why BatchNorm works is probably wrong.

The paper claimed BatchNorm works by reducing “internal covariate shift” — the idea that each layer’s inputs are constantly changing during training. Sounds reasonable, right?

In 2018, researchers at MIT published a paper called “How Does Batch Normalization Help Optimization?” They showed that BatchNorm doesn’t actually reduce internal covariate shift much at all. Instead, it smooths the loss landscape, making optimization easier.

This is a beautiful example of science: you can discover something that works incredibly well without fully understanding why. The theory caught up later.

Imagine you’re training a model and layer 5 outputs values ranging from -1000 to +1000. Layer 6 has to deal with these huge values. Then, during training, suddenly layer 5 starts outputting values from -0.001 to +0.001. Layer 6 is completely confused!

BatchNorm says: “Let’s force each layer’s outputs to be nice and normalized — mean 0, standard deviation 1.”

Here’s the beautiful part: it does this within each mini-batch during training, which is why it’s called batch normalization.

# The idea behind BatchNorm (simplified)
def batch_norm_simplified(x, gamma, beta, eps=1e-5):
"""
x: input tensor of shape (batch_size, features)
gamma: learnable scale parameter
beta: learnable shift parameter
"""
# Calculate statistics across the batch
mean = x.mean(dim=0) # Mean of each feature
var = x.var(dim=0) # Variance of each feature
# Normalize
x_norm = (x - mean) / torch.sqrt(var + eps)
# Scale and shift (learnable!)
return gamma * x_norm + beta

The eps (epsilon) is there to prevent division by zero. It’s typically 1e-5.

You might wonder: “If we normalize everything to mean 0 and std 1, aren’t we removing information?”

Great question! That’s why BatchNorm includes two learnable parameters:

  • gamma (γ): scales the normalized values
  • beta (β): shifts them

This means the network can learn to undo the normalization if that’s helpful. In practice, it usually doesn’t fully undo it, but having the option prevents BatchNorm from limiting the network’s expressiveness.

PyTorch makes this easy:

import torch
import torch.nn as nn
class NetworkWithBatchNorm(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(784, 256),
nn.BatchNorm1d(256), # BatchNorm for 1D data (fully connected)
nn.ReLU(),
nn.Linear(256, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Linear(128, 10)
)
def forward(self, x):
return self.layers(x)

For convolutional networks, use BatchNorm2d:

class CNNWithBatchNorm(nn.Module):
def __init__(self):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64), # BatchNorm for 2D data (images)
nn.ReLU(),
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
)

Here’s something that trips up almost every beginner: BatchNorm behaves differently during training and inference.

During training, BatchNorm uses the statistics (mean, variance) from the current mini-batch.

During inference, you usually process one sample at a time. You can’t compute meaningful statistics from a single sample! So BatchNorm uses running averages of the statistics it saw during training.

This is why you must call model.train() before training and model.eval() before inference:

# Training
model.train() # BatchNorm uses batch statistics
for batch in train_loader:
outputs = model(batch)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
# Inference
model.eval() # BatchNorm uses running statistics
with torch.no_grad():
predictions = model(test_data)

Did You Know? Forgetting to call model.eval() before inference is one of the most common bugs in deep learning. It can cause your model to give wildly different predictions depending on batch size, leading to hours of confused debugging. One famous case: a self-driving car company shipped a model that worked great with batch size 32 but gave garbage predictions with batch size 1. The fix? Adding one line: model.eval().

There’s an important relationship between batch size and BatchNorm effectiveness:

  • Large batches (32+): BatchNorm works great
  • Small batches (8-16): Statistics become noisy, performance degrades
  • Very small batches (1-4): BatchNorm can actually hurt performance!

Why? With a batch size of 2, your “mean” is just the average of 2 numbers. That’s not a meaningful statistic.

This limitation led to alternatives like Layer Normalization, which we’ll cover next.


Layer Normalization: When Batches Don’t Make Sense

Section titled “Layer Normalization: When Batches Don’t Make Sense”

In 2016, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton introduced Layer Normalization. It normalizes across features within a single sample rather than across the batch.

Why This Module Matters Layer Norm solves it by not using batches at all!

Section titled “Why This Module Matters Layer Norm solves it by not using batches at all!”

Instead of computing statistics across different samples, Layer Norm computes statistics across different features within the same sample:

def layer_norm_simplified(x, gamma, beta, eps=1e-5):
"""
x: input tensor of shape (batch_size, features)
Unlike BatchNorm, we normalize across features, not batch
"""
# Calculate statistics across features (for each sample independently)
mean = x.mean(dim=-1, keepdim=True) # Mean across features
var = x.var(dim=-1, keepdim=True) # Variance across features
# Normalize
x_norm = (x - mean) / torch.sqrt(var + eps)
# Scale and shift
return gamma * x_norm + beta

Layer Normalization is the standard choice for:

  1. Transformers/Attention Models: The GPT family, BERT, and virtually all modern NLP models use Layer Norm
  2. Recurrent Networks (RNNs/LSTMs): Where batch statistics don’t make sense over time
  3. Small Batch Training: When you can’t fit large batches in memory
  4. Online Learning: When you process one sample at a time

Did You Know? Every single layer of gpt-5 uses Layer Normalization. When you chat with ChatGPT, your text passes through hundreds of Layer Norm operations. The original GPT paper actually tried BatchNorm first but found Layer Norm worked much better for language modeling.

import torch.nn as nn
# For a fully connected layer with 256 features
layer_norm = nn.LayerNorm(256)
# In a Transformer-style block
class TransformerBlock(nn.Module):
def __init__(self, d_model=512):
super().__init__()
self.attention = nn.MultiheadAttention(d_model, num_heads=8)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.ffn = nn.Sequential(
nn.Linear(d_model, d_model * 4),
nn.ReLU(),
nn.Linear(d_model * 4, d_model)
)
def forward(self, x):
# Pre-norm architecture (modern standard)
x = x + self.attention(self.norm1(x), self.norm1(x), self.norm1(x))[0]
x = x + self.ffn(self.norm2(x))
return x
SituationBest ChoiceWhy
CNNs for imagesBatchNormLarge batches, spatial structure
TransformersLayerNormVariable sequence lengths, small batches
RNNs/LSTMsLayerNormRecurrent structure breaks batch assumptions
Small batches (<8)LayerNormBatch statistics too noisy
Large batches (>32)EitherBoth work well
Single-sample inferenceLayerNormNo batch to compute statistics

Dropout: Randomly Breaking Your Network (On Purpose)

Section titled “Dropout: Randomly Breaking Your Network (On Purpose)”

Dropout is one of those ideas that sounds completely crazy until you realize it works brilliantly.

In 2012, Geoffrey Hinton (yes, him again), along with Nitish Srivastava and others, proposed dropout in “Improving Neural Networks by Preventing Co-adaptation of Feature Detectors.”

The idea: during training, randomly set some neurons to zero.

That’s it. Just… turn things off randomly.

Did You Know? Geoffrey Hinton has said that dropout was inspired by how genes work in biological evolution. Sexual reproduction means each child gets a random combination of genes. This prevents individual genes from becoming too specialized or “co-adapted.” Dropout creates a similar effect in neural networks.

Think of a team where one person does all the work. If that person gets sick, the team fails. But if everyone shares responsibility, losing any one person is survivable.

Dropout forces every neuron to be useful on its own, without relying too heavily on other specific neurons. This creates redundancy and generalization.

Here’s another way to think about it: dropout creates an implicit ensemble of networks. With N neurons that can each be on or off, you’re effectively training 2^N different network configurations!

import torch.nn as nn
class NetworkWithDropout(nn.Module):
def __init__(self):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.5), # 50% of neurons zeroed
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, 10) # No dropout before output!
)
def forward(self, x):
return self.layers(x)

There’s a subtle but important detail: during training, we zero out half the neurons. But during inference, all neurons are active. Doesn’t that change the expected output?

Yes! That’s why dropout scales the remaining activations during training. If dropout rate is 0.5, the remaining neurons are multiplied by 2. This way, the expected sum stays the same.

PyTorch handles this automatically, but it’s good to know what’s happening under the hood.

  1. Using dropout during inference: Always call model.eval() to disable dropout during testing
  2. Dropout after the final layer: Don’t zero out your predictions!
  3. Too much dropout: Values above 0.5 can make training very slow
  4. Dropout with BatchNorm: The combination can be tricky; some argue you should use one or the other

Did You Know? Dropout was so successful that it won the “Test of Time” award at NeurIPS 2022, ten years after its publication. The award committee noted that dropout “has been incorporated into virtually every modern deep learning system.”

While dropout is still widely used, some alternatives have emerged:

  1. DropPath/Stochastic Depth: Drop entire layers instead of neurons (used in ResNets)
  2. DropBlock: For CNNs, drop contiguous regions instead of random pixels
  3. Attention Dropout: Specialized for Transformer attention layers
  4. Dropout-Free Training: Some modern architectures don’t need dropout at all!
# DropPath (Stochastic Depth) example
class DropPath(nn.Module):
def __init__(self, drop_prob=0.1):
super().__init__()
self.drop_prob = drop_prob
def forward(self, x):
if not self.training or self.drop_prob == 0:
return x
keep_prob = 1 - self.drop_prob
# Create random tensor for the batch
shape = (x.shape[0],) + (1,) * (x.ndim - 1)
random_tensor = keep_prob + torch.rand(shape, device=x.device)
random_tensor = random_tensor.floor() # Binarize
return x / keep_prob * random_tensor

Weight Initialization: Where You Start Matters More Than You Think

Section titled “Weight Initialization: Where You Start Matters More Than You Think”

You might think that since neural networks learn their weights, initialization doesn’t matter much. You’d be very wrong.

In the early days of neural networks, people would initialize weights randomly from a uniform distribution like [-1, 1] or a normal distribution with mean 0 and standard deviation 1.

This worked terribly.

The problem? Gradients would either vanish or explode right from the start, before the network could learn anything useful.

Xavier Initialization: The First Good Answer

Section titled “Xavier Initialization: The First Good Answer”

In 2010, Xavier Glorot and Yoshua Bengio published “Understanding the difficulty of training deep feedforward neural networks.” They showed mathematically that weights should be initialized based on the number of input and output connections.

The Xavier formula:

weights ~ Uniform(-sqrt(6/(n_in + n_out)), sqrt(6/(n_in + n_out)))

Or the normal distribution version:

weights ~ Normal(0, sqrt(2/(n_in + n_out)))

Where n_in is the number of input features and n_out is the number of output features.

Did You Know? Xavier Glorot was a PhD student when he published this paper. His advisor, Yoshua Bengio, is now one of the “Godfathers of Deep Learning” and won the Turing Award in 2018. The Xavier initialization is sometimes called “Glorot initialization” after its inventor.

There was one problem with Xavier initialization: it assumed symmetric activations (like tanh or sigmoid). But ReLU, which zeros out negative values, is not symmetric.

In 2015, Kaiming He (then at Microsoft Research) proposed an adjustment specifically for ReLU:

weights ~ Normal(0, sqrt(2/n_in))

Notice the factor is 2/n_in instead of 2/(n_in + n_out). This accounts for ReLU’s asymmetry.

Did You Know? Kaiming He went on to invent ResNets, one of the most influential architectures in deep learning history. He’s won multiple best paper awards and is considered one of the most important researchers in computer vision. His initialization formula, like his networks, is elegantly simple.

PyTorch does reasonable initialization by default, but you can be explicit:

import torch.nn as nn
import torch.nn.init as init
def init_weights_xavier(m):
"""Xavier initialization for Linear and Conv layers"""
if isinstance(m, (nn.Linear, nn.Conv2d)):
init.xavier_uniform_(m.weight)
if m.bias is not None:
init.zeros_(m.bias)
def init_weights_he(m):
"""He (Kaiming) initialization for ReLU networks"""
if isinstance(m, (nn.Linear, nn.Conv2d)):
init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
if m.bias is not None:
init.zeros_(m.bias)
# Apply to model
model = MyNetwork()
model.apply(init_weights_he) # Applies to all layers recursively
Activation FunctionRecommended Initialization
ReLU, Leaky ReLUHe (Kaiming)
tanh, sigmoidXavier (Glorot)
SELULeCun (similar to Xavier)
GELUHe often works well
Linear (no activation)Xavier

Modern Transformers often use special initialization schemes:

# GPT-style initialization
def gpt_init(module):
if isinstance(module, nn.Linear):
# Scale by 1/sqrt(2 * num_layers) for residual connections
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Did You Know? The specific value 0.02 for the standard deviation in GPT comes from experimentation at OpenAI. When training the original GPT, they found this value worked well empirically. It’s now become a standard in Transformer training, even though there’s no deep theoretical justification for this exact number.


Learning Rate Scheduling: The Art of Knowing When to Slow Down

Section titled “Learning Rate Scheduling: The Art of Knowing When to Slow Down”

Learning rate is arguably the most important hyperparameter in deep learning. Too high, and your training explodes. Too low, and you never converge. But here’s the thing: the optimal learning rate changes during training.

Imagine you’re searching for the lowest point in a valley:

  • At the start, you’re far away — take big steps to make progress
  • As you get closer, take smaller steps to avoid overshooting
  • Near the minimum, take tiny steps for fine-tuning

This is exactly what learning rate schedules do.

Step decay is the simplest and oldest learning rate schedule. The idea is straightforward: train at a high learning rate until progress plateaus, then drop the rate and continue. It’s like shifting gears in a car — you start in a high gear for speed, then shift down for precision.

Why does dropping the LR help? Early in training, you want big steps to escape bad regions quickly. But as you approach the minimum, those same big steps cause you to bounce around instead of settling in. Dropping the learning rate is like switching from running to walking when you get close to your destination.

Did You Know? The “divide by 10 at epochs 30, 60, 90” schedule was used to train the original ResNet paper by Kaiming He and colleagues in 2015. It became so standard that it’s still the default in many image classification codebases today — even though smoother schedules often work better. Sometimes the “good enough” solution from a famous paper becomes the industry default.

Implementation (PyTorch/TensorFlow/conceptually the same):

# PyTorch
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# TensorFlow
tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate=0.001, decay_steps=30*steps_per_epoch, decay_rate=0.1
)
# The math (framework-agnostic):
# new_lr = initial_lr * (gamma ^ floor(epoch / step_size))
# At epoch 30: 0.001 * 0.1 = 0.0001
# At epoch 60: 0.001 * 0.01 = 0.00001

When to use step decay:

  • Simple baseline that usually works
  • When you don’t want to tune fancy schedules
  • Legacy codebases that expect this pattern

When to avoid:

  • The sudden drops can destabilize training
  • Cosine annealing usually works as well or better with less tuning

While step decay makes sudden jumps, cosine annealing provides a smooth, continuous decrease. Think of it like a car slowing down gradually as it approaches a red light, rather than slamming on the brakes.

Why cosine specifically? The cosine function has a nice property: it decreases slowly at first, faster in the middle, and slowly again at the end. This means:

  • Early training: LR stays high longer, allowing continued exploration
  • Mid training: LR drops steadily as the model refines
  • Late training: LR decreases very slowly for fine-tuning

This matches our intuition about training: we want to explore broadly at first, then settle into a good minimum carefully.

# PyTorch
from torch.optim.lr_scheduler import CosineAnnealingLR
scheduler = CosineAnnealingLR(optimizer, T_max=100) # Anneal over 100 epochs
# TensorFlow
tf.keras.optimizers.schedules.CosineDecay(
initial_learning_rate=0.001, decay_steps=100*steps_per_epoch
)

The formula (for the curious):

lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(epoch * π / T_max))
Worked example (lr_max=0.001, lr_min=0, T_max=100):
- Epoch 0: 0.5 * 0.001 * (1 + cos(0)) = 0.5 * 0.001 * 2 = 0.001 (max)
- Epoch 25: 0.5 * 0.001 * (1 + cos(π/4)) = 0.5 * 0.001 * 1.7 = 0.00085
- Epoch 50: 0.5 * 0.001 * (1 + cos(π/2)) = 0.5 * 0.001 * 1 = 0.0005 (half)
- Epoch 100: 0.5 * 0.001 * (1 + cos(π)) = 0.5 * 0.001 * 0 = 0 (min)

Notice how the LR drops faster in the middle (0.00085 → 0.0005) than at the extremes. This is the “sweet spot” of cosine annealing.

Modern large-scale training almost always uses warmup: start with a very low learning rate and gradually increase it.

Why? At the beginning of training:

  • Your random weights give garbage outputs
  • Gradients can be unstable
  • Large learning rates can push you into bad regions

Warmup gives the network time to “warm up” before hitting full speed.

def linear_warmup_cosine_decay(epoch, warmup_epochs, total_epochs, base_lr):
if epoch < warmup_epochs:
# Linear warmup
return base_lr * epoch / warmup_epochs
else:
# Cosine decay
progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
# PyTorch implementation
from torch.optim.lr_scheduler import LambdaLR
scheduler = LambdaLR(
optimizer,
lr_lambda=lambda epoch: linear_warmup_cosine_decay(
epoch, warmup_epochs=10, total_epochs=100, base_lr=1.0
)
)

Did You Know? The BERT paper (2018) used warmup for the first 10,000 steps, then linear decay. When researchers tried to train BERT without warmup, training often diverged entirely. Warmup isn’t optional for large language models — it’s essential.

In 2018, Leslie Smith proposed the “1cycle” policy that trains faster and achieves better results:

  1. Start with low learning rate
  2. Increase to maximum
  3. Decrease back down
  4. Drop to very low for final fine-tuning
from torch.optim.lr_scheduler import OneCycleLR
scheduler = OneCycleLR(
optimizer,
max_lr=0.01,
total_steps=total_steps,
pct_start=0.3, # 30% of training for warmup
anneal_strategy='cos'
)
# Call scheduler.step() after EVERY BATCH, not every epoch
for batch in train_loader:
loss = compute_loss(model, batch)
loss.backward()
optimizer.step()
scheduler.step() # Per-batch update

Did You Know? The 1cycle policy is sometimes called “super-convergence” because it can train models 4-10x faster than traditional schedules while achieving equal or better accuracy. fastai popularized this technique, and it’s now standard in many codebases.

How do you choose the maximum learning rate? Try them all!

def find_learning_rate(model, train_loader, start_lr=1e-7, end_lr=10, num_iter=100):
"""
Run training with exponentially increasing LR, plot loss.
Pick LR where loss is decreasing most rapidly.
"""
model_state = model.state_dict() # Save initial state
optimizer = optim.Adam(model.parameters(), lr=start_lr)
lrs, losses = [], []
lr = start_lr
factor = (end_lr / start_lr) ** (1 / num_iter)
for i, (inputs, targets) in enumerate(train_loader):
if i >= num_iter:
break
optimizer.param_groups[0]['lr'] = lr
outputs = model(inputs)
loss = criterion(outputs, targets)
lrs.append(lr)
losses.append(loss.item())
loss.backward()
optimizer.step()
optimizer.zero_grad()
lr *= factor
model.load_state_dict(model_state) # Restore initial state
return lrs, losses
# Plot and pick LR where loss is dropping fastest (not the minimum!)

The rule of thumb: choose a learning rate about 10x smaller than where the loss is minimum, in the steepest part of the descent.


Gradient Clipping: Taming Explosive Updates

Section titled “Gradient Clipping: Taming Explosive Updates”

Even with good initialization and normalization, gradients can sometimes explode. This is especially common with:

  • RNNs and LSTMs (long sequence dependencies)
  • Very deep networks
  • Large learning rates
  • Unusual data (outliers)

The most common approach: if the total gradient norm exceeds a threshold, scale it down.

import torch.nn.utils as nn_utils
# During training
optimizer.zero_grad()
loss.backward()
# Clip gradients before optimizer step
max_grad_norm = 1.0
nn_utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()

This ensures that no matter how large the computed gradients are, the actual update is bounded.

Did You Know? The choice of max_grad_norm = 1.0 comes from the LSTM paper (1997) and has been validated empirically many times since. Some models use different values (GPT-3 uses 1.0, BERT uses 1.0), but 1.0 is a reasonable default for almost any architecture.

An alternative: clip each gradient value independently.

nn_utils.clip_grad_value_(model.parameters(), clip_value=0.5)

This clips each gradient to [-0.5, 0.5]. It’s simpler but can change the direction of the gradient, while norm clipping preserves direction.

SituationRecommendation
Training RNNs/LSTMsAlways use (norm clipping)
Training TransformersUsually use (norm clipping)
Standard CNNsOften unnecessary
Large learning ratesRecommended
Seeing NaN lossesTry clipping as diagnostic
# A robust training loop with gradient clipping
def train_epoch(model, loader, optimizer, criterion, max_grad_norm=1.0):
model.train()
total_loss = 0
for batch in loader:
inputs, targets = batch
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
# Gradient clipping
grad_norm = nn_utils.clip_grad_norm_(model.parameters(), max_grad_norm)
# Optional: log if clipping occurred
if grad_norm > max_grad_norm:
print(f"Gradient clipped: {grad_norm:.2f} -> {max_grad_norm}")
optimizer.step()
total_loss += loss.item()
return total_loss / len(loader)

Training a neural network too long leads to overfitting — the network memorizes the training data instead of learning general patterns.

Track performance on a validation set (data the model doesn’t train on). When validation performance stops improving, stop training.

Training Loss: ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ (keeps decreasing)
Val Loss: ↓ ↓ ↓ ↓ ↓ → → ↑ ↑ ↑ (stops, then increases = overfitting!)
^ STOP HERE

A good early stopping implementation needs three key ingredients:

  1. Patience: How many epochs without improvement before stopping. Think of it like fishing — you don’t leave after one bad cast, but after 10 casts with no bites, it’s time to try another spot.

  2. Minimum Delta: What counts as “improvement”? If validation loss drops from 0.5000 to 0.4999, is that real progress or just noise? A min_delta of 0.001 means we only count improvements larger than 0.1%.

  3. Restore Best: When we stop, should we restore the model to its best state? If patience is 10 and we stopped after 10 epochs of no improvement, the current model is worse than it was 10 epochs ago. We almost always want to restore.

Here’s a reusable implementation:

class EarlyStopping:
"""Stop training when validation loss stops improving."""
def __init__(self, patience=7, min_delta=0.001, restore_best=True):
self.patience = patience
self.min_delta = min_delta
self.restore_best = restore_best
self.best_loss = float('inf')
self.best_model = None
self.counter = 0
self.should_stop = False

The __call__ method makes this class callable like a function. Each epoch, we check if validation loss improved:

def __call__(self, val_loss, model):
if val_loss < self.best_loss - self.min_delta:
# Improvement! Reset patience counter
self.best_loss = val_loss
self.best_model = model.state_dict().copy()
self.counter = 0
else:
# No improvement - increment counter
self.counter += 1
if self.counter >= self.patience:
self.should_stop = True
if self.restore_best:
model.load_state_dict(self.best_model)
return self.should_stop

Notice how we save a copy of the model state dict, not a reference. Without .copy(), we’d just have a pointer that gets overwritten every epoch!

Using it in your training loop:

early_stopping = EarlyStopping(patience=10, min_delta=0.001)
for epoch in range(max_epochs):
train_loss = train_epoch(model, train_loader, optimizer, criterion)
val_loss = validate(model, val_loader, criterion)
print(f"Epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}")
if early_stopping(val_loss, model):
print(f"Early stopping at epoch {epoch}")
break

Did You Know? Early stopping was formalized in the 1990s but the idea goes back to the earliest days of neural networks. It’s sometimes called “regularization for free” because it prevents overfitting without any changes to the loss function or model architecture.

Choosing patience is a trade-off:

  • Too small: Stop before the model has a chance to improve
  • Too large: Waste time training an overfitting model

Rules of thumb:

  • Small datasets: patience = 5-10
  • Large datasets: patience = 10-20
  • With learning rate scheduling: larger patience (the LR drop might help)

Model Checkpointing: Never Lose Your Progress

Section titled “Model Checkpointing: Never Lose Your Progress”

Training large models can take days or weeks. Hardware can fail. Jobs get killed. Always save checkpoints.

A complete checkpoint includes:

  1. Model weights (model.state_dict())
  2. Optimizer state (optimizer.state_dict())
  3. Learning rate scheduler state (scheduler.state_dict())
  4. Current epoch/step
  5. Best validation score
  6. Any other training state (random seeds, etc.)
import torch
import os
def save_checkpoint(state, filename='checkpoint.pt', is_best=False):
"""Save training checkpoint."""
torch.save(state, filename)
if is_best:
best_filename = filename.replace('.pt', '_best.pt')
torch.save(state, best_filename)
def load_checkpoint(filename, model, optimizer=None, scheduler=None):
"""Load training checkpoint."""
checkpoint = torch.load(filename)
model.load_state_dict(checkpoint['model_state_dict'])
if optimizer and 'optimizer_state_dict' in checkpoint:
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
if scheduler and 'scheduler_state_dict' in checkpoint:
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
return checkpoint.get('epoch', 0), checkpoint.get('best_val_loss', float('inf'))
# In training loop
for epoch in range(start_epoch, max_epochs):
train_loss = train_epoch(model, train_loader, optimizer, criterion)
val_loss = validate(model, val_loader, criterion)
scheduler.step()
# Save checkpoint
is_best = val_loss < best_val_loss
if is_best:
best_val_loss = val_loss
save_checkpoint({
'epoch': epoch + 1,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'scheduler_state_dict': scheduler.state_dict(),
'best_val_loss': best_val_loss,
}, filename=f'checkpoint_epoch_{epoch}.pt', is_best=is_best)

Don’t keep every checkpoint forever — you’ll run out of disk space!

import glob
def cleanup_old_checkpoints(checkpoint_dir, keep_last=3, keep_best=True):
"""Remove old checkpoints, keeping only the most recent."""
checkpoints = sorted(
glob.glob(os.path.join(checkpoint_dir, 'checkpoint_epoch_*.pt')),
key=os.path.getmtime
)
# Keep best checkpoint
if keep_best:
best_checkpoint = os.path.join(checkpoint_dir, 'checkpoint_best.pt')
if os.path.exists(best_checkpoint):
checkpoints = [c for c in checkpoints if c != best_checkpoint]
# Delete old checkpoints
for checkpoint in checkpoints[:-keep_last]:
os.remove(checkpoint)
print(f"Removed old checkpoint: {checkpoint}")

Did You Know? Google’s TPU training infrastructure automatically handles checkpointing to Google Cloud Storage. When they trained GPT-3-sized models, they saved checkpoints every 10 minutes because hardware failures were that common. At scale, checkpointing isn’t optional — it’s survival.


Putting It All Together: A Production Training Loop

Section titled “Putting It All Together: A Production Training Loop”

Here’s a complete training script that incorporates everything we’ve learned:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
import torch.nn.utils as nn_utils
import os
from datetime import datetime
class ProductionTrainer:
"""A production-ready training class with all best practices."""
def __init__(
self,
model,
train_loader,
val_loader,
criterion,
learning_rate=1e-3,
max_epochs=100,
patience=10,
max_grad_norm=1.0,
checkpoint_dir='checkpoints',
device='cuda' if torch.cuda.is_available() else 'cpu'
):
self.model = model.to(device)
self.train_loader = train_loader
self.val_loader = val_loader
self.criterion = criterion
self.device = device
self.max_epochs = max_epochs
self.max_grad_norm = max_grad_norm
self.checkpoint_dir = checkpoint_dir
# Create checkpoint directory
os.makedirs(checkpoint_dir, exist_ok=True)
# Optimizer with weight decay (L2 regularization)
self.optimizer = optim.AdamW(
model.parameters(),
lr=learning_rate,
weight_decay=0.01
)
# Learning rate scheduler with warmup
self.scheduler = CosineAnnealingWarmRestarts(
self.optimizer,
T_0=10, # Restart every 10 epochs
T_mult=2 # Double the restart period each time
)
# Early stopping
self.early_stopping = EarlyStopping(patience=patience)
# Tracking
self.best_val_loss = float('inf')
self.history = {'train_loss': [], 'val_loss': [], 'lr': []}
def train_epoch(self):
"""Train for one epoch."""
self.model.train()
total_loss = 0
num_batches = len(self.train_loader)
for batch_idx, (inputs, targets) in enumerate(self.train_loader):
inputs, targets = inputs.to(self.device), targets.to(self.device)
# Forward pass
self.optimizer.zero_grad()
outputs = self.model(inputs)
loss = self.criterion(outputs, targets)
# Backward pass
loss.backward()
# Gradient clipping
nn_utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
# Update weights
self.optimizer.step()
total_loss += loss.item()
# Progress indicator
if batch_idx % 50 == 0:
print(f" Batch {batch_idx}/{num_batches}, Loss: {loss.item():.4f}")
return total_loss / num_batches
@torch.no_grad()
def validate(self):
"""Validate the model."""
self.model.eval()
total_loss = 0
for inputs, targets in self.val_loader:
inputs, targets = inputs.to(self.device), targets.to(self.device)
outputs = self.model(inputs)
loss = self.criterion(outputs, targets)
total_loss += loss.item()
return total_loss / len(self.val_loader)
def save_checkpoint(self, epoch, is_best=False):
"""Save training checkpoint."""
checkpoint = {
'epoch': epoch,
'model_state_dict': self.model.state_dict(),
'optimizer_state_dict': self.optimizer.state_dict(),
'scheduler_state_dict': self.scheduler.state_dict(),
'best_val_loss': self.best_val_loss,
'history': self.history,
}
path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.pt')
torch.save(checkpoint, path)
if is_best:
best_path = os.path.join(self.checkpoint_dir, 'checkpoint_best.pt')
torch.save(checkpoint, best_path)
print(f" New best model saved!")
def train(self):
"""Full training loop."""
print(f"Training on {self.device}")
print(f"Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
print("=" * 50)
for epoch in range(self.max_epochs):
start_time = datetime.now()
# Training
train_loss = self.train_epoch()
# Validation
val_loss = self.validate()
# Learning rate scheduling
current_lr = self.optimizer.param_groups[0]['lr']
self.scheduler.step()
# Track history
self.history['train_loss'].append(train_loss)
self.history['val_loss'].append(val_loss)
self.history['lr'].append(current_lr)
# Check for best model
is_best = val_loss < self.best_val_loss
if is_best:
self.best_val_loss = val_loss
# Save checkpoint
self.save_checkpoint(epoch, is_best)
# Logging
elapsed = datetime.now() - start_time
print(f"Epoch {epoch+1}/{self.max_epochs}")
print(f" Train Loss: {train_loss:.4f}")
print(f" Val Loss: {val_loss:.4f}")
print(f" LR: {current_lr:.6f}")
print(f" Time: {elapsed}")
# Early stopping check
if self.early_stopping(val_loss, self.model):
print(f"\n Early stopping triggered at epoch {epoch+1}")
break
print("\n" + "=" * 50)
print(f"Training complete! Best validation loss: {self.best_val_loss:.4f}")
return self.history
# Example usage
if __name__ == "__main__":
# Create model with all our techniques
model = nn.Sequential(
nn.Linear(784, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, 10)
)
# Initialize with He initialization
def init_weights(m):
if isinstance(m, nn.Linear):
nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
if m.bias is not None:
nn.init.zeros_(m.bias)
model.apply(init_weights)
# Train
trainer = ProductionTrainer(
model=model,
train_loader=train_loader, # You'd create these
val_loader=val_loader,
criterion=nn.CrossEntropyLoss(),
learning_rate=1e-3,
max_epochs=100,
patience=10
)
history = trainer.train()

# WRONG
predictions = model(test_data) # BatchNorm/Dropout still in training mode!
# RIGHT
model.eval()
with torch.no_grad():
predictions = model(test_data)
# WRONG: Starting with huge learning rate
optimizer = optim.Adam(model.parameters(), lr=1.0) # NaN in 3... 2... 1...
# RIGHT: Start conservative
optimizer = optim.Adam(model.parameters(), lr=1e-3) # Standard starting point
# WRONG: RNN without gradient clipping
loss.backward()
optimizer.step() # Gradients might explode!
# RIGHT: Always clip RNN gradients
loss.backward()
nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Mistake 4: Wrong Initialization for Activation

Section titled “Mistake 4: Wrong Initialization for Activation”
# WRONG: Xavier init with ReLU
nn.init.xavier_uniform_(layer.weight) # Suboptimal for ReLU
# RIGHT: He init for ReLU
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
# DEBATABLE: BatchNorm after Dropout
nn.Sequential(
nn.Linear(256, 128),
nn.Dropout(0.5),
nn.BatchNorm1d(128), # Sees different distributions during train/eval
nn.ReLU()
)
# OFTEN BETTER: BatchNorm before Dropout (or skip one)
nn.Sequential(
nn.Linear(256, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Dropout(0.5)
)

Training deep networks pushes hardware to its limits. Understanding memory constraints and performance tradeoffs is essential for real-world training.

Out of Memory (OOM) — The Most Common Error

Section titled “Out of Memory (OOM) — The Most Common Error”

You’ll encounter CUDA out of memory more times than you can count. Here’s how to handle it:

Quick fixes (in order of preference):

  1. Reduce batch size — The most effective solution. If batch 64 fails, try 32, then 16.
  2. Enable gradient checkpointing — Trade compute for memory:
    from torch.utils.checkpoint import checkpoint
    # Instead of: output = self.layer(x)
    output = checkpoint(self.layer, x) # Recomputes forward during backward
  3. Use mixed precision training — Cut memory usage nearly in half:
    from torch.cuda.amp import autocast, GradScaler
    scaler = GradScaler()
    with autocast():
    output = model(input)
    loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
  4. Clear cache between batches — When desperate:
    torch.cuda.empty_cache() # Frees cached memory, but slows training

Root causes to investigate:

  • Storing intermediate activations unnecessarily (use del tensor when done)
  • Accumulating gradients without stepping (check your training loop!)
  • Large embedding tables eating memory
  • Model too big for your GPU — consider model parallelism
Batch SizeProsCons
Small (8-32)Lower memory, noisier gradients act as regularization, better generalizationSlower training, GPU underutilized
Medium (64-256)Balanced memory/speed, stable trainingSweet spot for most tasks
Large (512+)Faster training, smoother gradients, better GPU utilizationHigh memory, may need LR warmup, can hurt generalization

The learning rate scaling rule: When you increase batch size by N, increase learning rate by √N (or N with warmup). This keeps the effective update size similar.

# Example: doubling batch size from 32 to 64
base_lr = 1e-3
batch_multiplier = 64 / 32 # = 2
new_lr = base_lr * (batch_multiplier ** 0.5) # = 1.4e-3

Gradient Accumulation — Big Batches on Small GPUs

Section titled “Gradient Accumulation — Big Batches on Small GPUs”

Can’t fit batch size 64 in memory? Use gradient accumulation to simulate it:

accumulation_steps = 4 # Accumulate 4 mini-batches
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
outputs = model(inputs)
loss = criterion(outputs, targets) / accumulation_steps # Scale loss
loss.backward() # Accumulate gradients
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()

This gives you the gradient statistics of batch 64 while only using memory for batch 16.

When one GPU isn’t enough:

StrategyUse CaseComplexity
DataParallelQuick & dirty multi-GPULow (1 line of code)
DistributedDataParallelProduction trainingMedium (requires setup)
Model ParallelismModels larger than 1 GPUHigh (manual splitting)
FSDPLarge models, efficient memoryMedium-High
# DataParallel — easiest option
model = nn.DataParallel(model) # Uses all available GPUs
# DistributedDataParallel — better performance (requires proper init)
model = nn.parallel.DistributedDataParallel(model)

Did You Know? GPT-3 was trained on thousands of GPUs using tensor parallelism, where individual matrix multiplications are split across GPUs. The communication overhead was so high that they had to invent new parallelism strategies. Most practitioners will never need this level of scale — DataParallel is fine for 2-8 GPUs.

Find the bottleneck before optimizing:

# Simple timing
import time
start = time.time()
output = model(input)
torch.cuda.synchronize() # Important! GPU ops are async
print(f"Forward: {time.time() - start:.3f}s")
# PyTorch profiler for detailed analysis
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
output = model(input)
print(prof.key_averages().table(sort_by="cuda_time_total"))

Common bottlenecks:

  • Data loading — Use num_workers > 0 and pin_memory=True
  • CPU-GPU transfer — Batch your transfers, avoid frequent small copies
  • Synchronization — Minimize .item() and .numpy() calls during training

Train the same network with different initializations and compare:

  1. Random uniform [-1, 1]
  2. Xavier/Glorot
  3. He/Kaiming

Measure: training speed, final accuracy, gradient magnitudes.

Exercise 2: Learning Rate Schedule Comparison

Section titled “Exercise 2: Learning Rate Schedule Comparison”

Implement and compare:

  1. Constant learning rate
  2. Step decay
  3. Cosine annealing
  4. 1cycle

Plot learning rate over time and final accuracy.

Train a simple network on MNIST with:

  1. No normalization
  2. BatchNorm
  3. LayerNorm

Vary batch size from 8 to 512 and measure the effect on each.


A major tech company trained a large language model for 6 weeks on expensive GPU clusters. At week 5, training loss suddenly spiked to infinity and never recovered. Root cause: No gradient clipping, and a rare data batch caused gradient explosion. They had to restart from scratch because their checkpoint from week 4 was corrupted.

Lesson learned: Always use gradient clipping (max_norm=1.0), checkpoint frequently (every 1000 steps), and validate checkpoint integrity.

A computer vision team deployed a model that worked perfectly in training but gave random predictions in production. The model used BatchNorm, but production inference ran with batch_size=1. BatchNorm’s running statistics were wrong because they forgot to call model.eval().

# The bug that cost 3 weeks of debugging
model = load_model(checkpoint)
predictions = model(batch) # WRONG: model still in train mode
# The fix
model = load_model(checkpoint)
model.eval() # Critical for BatchNorm and Dropout!
with torch.no_grad():
predictions = model(batch)

Lesson learned: Always verify model mode. Add assertions in production code:

assert not model.training, "Model must be in eval mode for inference"

Symptom: Loss oscillates wildly or explodes to NaN

Fix: Use learning rate finder, start with 1e-4 for Adam, 1e-2 for SGD

Symptom: Gradients accumulate, training diverges

# Bug: gradients accumulate across batches
for batch in dataloader:
loss = criterion(model(batch), targets)
loss.backward()
optimizer.step() # Gradients keep accumulating!
# Fix: zero gradients each step
for batch in dataloader:
optimizer.zero_grad() # Reset gradients
loss = criterion(model(batch), targets)
loss.backward()
optimizer.step()

Symptom: Dead neurons (ReLU) or vanishing gradients (sigmoid/tanh)

# Wrong: Xavier for ReLU
nn.init.xavier_uniform_(layer.weight) # Assumes linear activation
# Right: He/Kaiming for ReLU
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')

Symptom: Training crashes in first few batches

# Add warmup: start low, ramp up over first 1000 steps
warmup_steps = 1000
for step in range(total_steps):
if step < warmup_steps:
lr = base_lr * (step / warmup_steps)
else:
lr = base_lr
for param_group in optimizer.param_groups:
param_group['lr'] = lr

Q: What’s the difference between BatchNorm and LayerNorm? When would you use each?

BatchNorm normalizes across the batch dimension — great for CNNs with large batches (32+). LayerNorm normalizes across feature dimensions — essential for Transformers and when batch sizes vary. Use BatchNorm for computer vision, LayerNorm for NLP and variable batch sizes.

Q: Why does gradient clipping help training?

Gradient clipping prevents exploding gradients by capping the gradient norm. Without it, a single bad batch can produce huge gradients that destroy learned weights. It’s essential for RNNs and helpful for any deep network. Typical values: max_norm=1.0 for RNNs, max_norm=5.0 for Transformers.

Q: Explain the 1cycle learning rate policy.

1cycle starts with a low learning rate, ramps up to a maximum over 30% of training, then gradually decreases back down. It achieves “super-convergence” — training faster with better final accuracy than constant learning rate. The key insight is that high learning rates help escape local minima early in training.

Q: How would you debug a model that trains well but performs poorly on validation?

This is overfitting. Debugging steps: (1) Add/increase dropout, (2) Use data augmentation, (3) Add L2 regularization (weight decay), (4) Early stopping based on validation loss, (5) Reduce model capacity, (6) Get more training data. Monitor the gap between train and val loss — should be small.

Q: What learning rate would you start with for a new project?

For Adam optimizer, start with 1e-4 (0.0001) — it’s a safe default that works for most architectures. For SGD with momentum, try 1e-2 (0.01). Then use a learning rate finder: train for a few hundred steps while exponentially increasing LR from 1e-7 to 1. Plot loss vs LR and pick a value just before the loss starts climbing. Always add warmup for the first 5-10% of training steps.


  • Training Toolkit: A reusable training class with all best practices
  • Initialization Comparison: Script comparing different initializations
  • Learning Rate Finder: Implementation of LR range test
  • Early Stopping: Production-ready early stopping implementation
  • Checkpointing System: Complete save/load functionality

Success Criteria: Train a network to >98% accuracy on MNIST using all techniques.


  1. “Batch Normalization: Accelerating Deep Network Training” - Ioffe & Szegedy (2015)
  2. “How Does Batch Normalization Help Optimization?” - Santurkar et al. (2018)
  3. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” - Srivastava et al. (2014)
  4. “Understanding the difficulty of training deep feedforward neural networks” - Glorot & Bengio (2010)
  5. “Delving Deep into Rectifiers” - He et al. (2015)
  6. “Super-Convergence” - Leslie Smith (2018)

  1. BatchNorm made deep networks trainable — use it for CNNs
  2. LayerNorm is the standard for Transformers and small batches
  3. Dropout prevents overfitting but remember train/eval modes
  4. Proper initialization (He for ReLU, Xavier for tanh) prevents gradient problems
  5. Learning rate schedules with warmup train faster and better
  6. Gradient clipping is essential for RNNs, helpful elsewhere
  7. Early stopping prevents overfitting for free
  8. Checkpointing is not optional for serious training

You’ve mastered the art of training deep networks. Now it’s time to build specific architectures!

Module 29: Convolutional Neural Networks (CNNs) for images Module 30: Recurrent Neural Networks (RNNs/LSTMs) for sequences Module 31: Transformer Architecture from scratch


“Training deep networks is like raising children: you need patience, consistency, and knowing when to let go.” — Unknown ML practitioner