CNNs & Computer Vision
AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6
Or: The Art of Making Neural Networks Actually Work
Reading Time: 6-8 hours | Prerequisites: Module 27
What You’ll Be Able to Do
By the end of this module, you will:
- Understand why deep networks are notoriously difficult to train (and the historical struggles)
- Master batch normalization and layer normalization (the techniques that made deep learning possible)
- Apply dropout correctly (and understand the common mistakes)
- Choose the right weight initialization for your architecture
- Implement learning rate schedules that converge faster
- Use gradient clipping to prevent exploding gradients
- Implement early stopping and checkpointing like a production ML engineer
The Dark Ages of Deep Learning: Why Training Was Nearly Impossible
Before we dive into the techniques, you need to understand something important: for decades, training networks deeper than 2-3 layers was essentially impossible. Not “difficult”. Impossible.
Imagine you’re trying to pass a message through a chain of 100 people playing telephone. By the time the message reaches the last person, it’s completely garbled. That’s what happened to gradients in deep networks — they either exploded into infinity or vanished into nothing.
Did You Know? In 2006, Geoffrey Hinton published a paper called “A Fast Learning Algorithm for Deep Belief Nets” that’s often credited with kickstarting the deep learning revolution. But here’s the thing: his networks were only 3-4 layers deep. Even that was considered “deep” at the time! Today, models like GPT-3 stack 96 Transformer layers, and ResNets run past 100 layers. The techniques in this module are what made that possible.
The Two Nightmares: Vanishing and Exploding Gradients
When you train a neural network, you’re computing gradients through backpropagation. Each layer multiplies the gradient by its weights. Here’s the problem:
Vanishing Gradients: If your weights are small (say, 0.5), multiplying many times gives you:

    0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 ≈ 0.001

After just 10 layers, your gradient is about 1/1000th of what it started as. After 50 layers? Essentially zero. The early layers never learn anything.

Exploding Gradients: If your weights are large (say, 2.0):

    2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 1024

After 10 layers, your gradient is a thousand times larger. After 50 layers? The values are so large that training soon overflows and your loss turns into NaN (not a number).
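The arithmetic above is easy to check numerically. Here is a minimal sketch in plain Python; `gradient_scale` is a hypothetical helper, and treating every layer as multiplying the gradient by one fixed number is a simplifying assumption (real layers multiply by weight matrices and activation derivatives).

```python
def gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Scale applied to a gradient after flowing back through `depth` layers,
    assuming (as a simplification) each layer multiplies it by the same factor."""
    return per_layer_factor ** depth


for depth in (10, 50):
    print(
        f"depth={depth:>2}  "
        f"factor 0.5 -> {gradient_scale(0.5, depth):.3e}  "
        f"factor 2.0 -> {gradient_scale(2.0, depth):.3e}"
    )
```

Even in this toy model, 50 layers separate the two cases by roughly 30 orders of magnitude, which is why the early layers either stop learning or destabilize everything downstream.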
Did You Know? The exploding gradient problem was so common in the early days that researchers would joke about “NaN debugging” — spending hours figuring out why their loss suddenly became infinity. One famous story: a PhD student at Stanford spent three months debugging a model only to find that a single wrong activation function was causing gradients to explode after 15 iterations.
The Historical Solutions (That Didn’t Quite Work)
Before the modern techniques we’ll learn, researchers tried several approaches:
- Shallow Networks: Just… don’t go deep. Use 2-3 layers max.
- Careful Initialization: Initialize weights to very specific values (we’ll see this still matters)
- Layer-by-Layer Pre-training: Train one layer at a time, then fine-tune (tedious!)
- Gradient Checking: Manually verify gradients (slow and painful)
None of these scaled to the architectures we use today. What changed everything? The techniques in this module.
Batch Normalization: The Single Most Important Technique in Modern Deep Learning
If you learn only one thing from this module, make it batch normalization. Seriously.
The Story of BatchNorm
In 2015, Sergey Ioffe and Christian Szegedy at Google published a paper that would change deep learning forever: “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.”
Did You Know? The BatchNorm paper has been cited over 60,000 times, making it one of the most influential papers in machine learning history. For context, Einstein’s special relativity paper has about 3,000 citations. BatchNorm literally changed more papers than Einstein’s most famous work!
But here’s the funny thing: the paper’s explanation of why BatchNorm works is probably wrong.
The paper claimed BatchNorm works by reducing “internal covariate shift” — the idea that each layer’s inputs are constantly changing during training. Sounds reasonable, right?
In 2018, researchers at MIT published a paper called “How Does Batch Normalization Help Optimization?” They showed that BatchNorm doesn’t actually reduce internal covariate shift much at all. Instead, it smooths the loss landscape, making optimization easier.
This is a beautiful example of science: you can discover something that works incredibly well without fully understanding why. The theory caught up later.
What BatchNorm Actually Does
Imagine you’re training a model and layer 5 outputs values ranging from -1000 to +1000. Layer 6 has to deal with these huge values. Then, during training, suddenly layer 5 starts outputting values from -0.001 to +0.001. Layer 6 is completely confused!
BatchNorm says: “Let’s force each layer’s outputs to be nice and normalized — mean 0, standard deviation 1.”
Here’s the beautiful part: it does this within each mini-batch during training, which is why it’s called batch normalization.
    import torch

    # The idea behind BatchNorm (simplified)
    def batch_norm_simplified(x, gamma, beta, eps=1e-5):
        """
        x: input tensor of shape (batch_size, features)
        gamma: learnable scale parameter
        beta: learnable shift parameter
        """
        # Calculate statistics across the batch
        mean = x.mean(dim=0)                # Mean of each feature
        var = x.var(dim=0, unbiased=False)  # Biased variance of each feature (what BatchNorm normalizes with)

        # Normalize
        x_norm = (x - mean) / torch.sqrt(var + eps)

        # Scale and shift (learnable!)
        return gamma * x_norm + beta

The eps (epsilon) is there to prevent division by zero. It’s typically 1e-5.
The Learnable Parameters: gamma and beta
You might wonder: “If we normalize everything to mean 0 and std 1, aren’t we removing information?”
Great question! That’s why BatchNorm includes two learnable parameters:
- gamma (γ): scales the normalized values
- beta (β): shifts them
This means the network can learn to undo the normalization if that’s helpful. In practice, it usually doesn’t fully undo it, but having the option prevents BatchNorm from limiting the network’s expressiveness.
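To see that gamma and beta really can undo the normalization, here is a small sketch in plain Python. The `normalize` helper and the sample batch are made up for illustration; it handles a single feature and uses the biased variance.

```python
import math


def normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch of scalars, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [10.0, 12.0, 14.0, 20.0]
mean = sum(batch) / len(batch)
var = sum((v - mean) ** 2 for v in batch) / len(batch)

plain = normalize(batch)  # gamma=1, beta=0: mean ~0, std ~1
restored = normalize(batch, gamma=math.sqrt(var + 1e-5), beta=mean)  # recovers the inputs

print(plain)
print(restored)
```

With gamma set to the batch standard deviation and beta set to the batch mean, the output matches the input again, so the layer loses no expressiveness; it only changes which parameterization the optimizer works in.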
BatchNorm in PyTorch
PyTorch makes this easy:

    import torch
    import torch.nn as nn

    class NetworkWithBatchNorm(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(784, 256),
                nn.BatchNorm1d(256),  # BatchNorm for 1D data (fully connected)
                nn.ReLU(),

                nn.Linear(256, 128),
                nn.BatchNorm1d(128),
                nn.ReLU(),

                nn.Linear(128, 10)
            )

        def forward(self, x):
            return self.layers(x)

For convolutional networks, use BatchNorm2d:

    class CNNWithBatchNorm(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),  # BatchNorm for 2D data (images)
                nn.ReLU(),

                nn.Conv2d(64, 128, kernel_size=3, padding=1),
                nn.BatchNorm2d(128),
                nn.ReLU(),
            )

The Train/Eval Mode Gotcha
Here’s something that trips up almost every beginner: BatchNorm behaves differently during training and inference.
During training, BatchNorm uses the statistics (mean, variance) from the current mini-batch.
During inference, you usually process one sample at a time. You can’t compute meaningful statistics from a single sample! So BatchNorm uses running averages of the statistics it saw during training.
This is why you must call model.train() before training and model.eval() before inference:
    # Training
    model.train()  # BatchNorm uses batch statistics
    for inputs, targets in train_loader:
        optimizer.zero_grad()  # Clear gradients left over from the previous step
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

    # Inference
    model.eval()  # BatchNorm uses running statistics
    with torch.no_grad():
        predictions = model(test_data)

Did You Know? Forgetting to call model.eval() before inference is one of the most common bugs in deep learning. It can cause your model to give wildly different predictions depending on batch size, leading to hours of confused debugging. One often-told story: a self-driving car company shipped a model that worked great with batch size 32 but gave garbage predictions with batch size 1. The fix? Adding one line: model.eval().
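The train/eval split is easier to reason about with a toy version. The class below is a hypothetical, mean-only stand-in for BatchNorm written in plain Python; the moving-average update with momentum 0.1 follows the rule PyTorch documents for its BatchNorm layers, while everything else (one feature, no variance, no gamma/beta) is simplified for illustration.

```python
class RunningMeanNorm:
    """Toy one-feature normalizer that mimics BatchNorm's train/eval split (mean only)."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.running_mean = 0.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            batch_mean = sum(batch) / len(batch)
            # Exponential moving average of the batch statistic
            self.running_mean = (
                (1 - self.momentum) * self.running_mean + self.momentum * batch_mean
            )
            mean = batch_mean
        else:
            mean = self.running_mean  # fixed statistic, independent of the current batch
        return [v - mean for v in batch]


norm = RunningMeanNorm()
for _ in range(200):       # "training": every batch has mean 5.0
    norm([4.0, 5.0, 6.0])

norm.training = False      # the equivalent of model.eval()
single = norm([5.0])       # one sample, centered with the learned mean
print(single)
```

In training mode a single-sample batch would always be centered to exactly zero, whatever its value, which is the kind of batch-size-dependent behavior that a forgotten `model.eval()` produces.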
Batch Size and BatchNorm
There’s an important relationship between batch size and BatchNorm effectiveness:
- Large batches (32+): BatchNorm works great
- Small batches (8-16): Statistics become noisy, performance degrades
- Very small batches (1-4): BatchNorm can actually hurt performance!
Why? With a batch size of 2, your “mean” is just the average of 2 numbers. That’s not a meaningful statistic.
This limitation led to alternatives like Layer Normalization, which we’ll cover next.
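How noisy is a batch mean, really? The sketch below estimates it by simulation in plain Python. It assumes activations drawn from a standard normal distribution, which is an idealization, and `spread_of_batch_means` is a hypothetical helper.

```python
import random
import statistics


def spread_of_batch_means(batch_size, trials=2000, seed=0):
    """Standard deviation of the batch mean when samples come from N(0, 1)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


for size in (2, 8, 32, 128):
    print(f"batch_size={size:>3}  std of batch mean ~ {spread_of_batch_means(size):.3f}")
```

The spread shrinks roughly like 1/sqrt(batch_size), so a batch of 2 gives a mean estimate that is several times noisier than a batch of 32.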
Layer Normalization: When Batches Don’t Make Sense
In 2016, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton introduced Layer Normalization. It normalizes across features within a single sample rather than across the batch.
Remember BatchNorm’s trouble with small batches? Layer Norm solves it by not using batch statistics at all! Instead of computing statistics across different samples, Layer Norm computes statistics across different features within the same sample:
    import torch

    def layer_norm_simplified(x, gamma, beta, eps=1e-5):
        """
        x: input tensor of shape (batch_size, features)
        Unlike BatchNorm, we normalize across features, not batch
        """
        # Calculate statistics across features (for each sample independently)
        mean = x.mean(dim=-1, keepdim=True)                # Mean across features
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # Biased variance across features

        # Normalize
        x_norm = (x - mean) / torch.sqrt(var + eps)

        # Scale and shift
        return gamma * x_norm + beta

Where Layer Norm Shines
Layer Normalization is the standard choice for:
- Transformers/Attention Models: The GPT family, BERT, and virtually all modern NLP models use Layer Norm
- Recurrent Networks (RNNs/LSTMs): Where batch statistics don’t make sense over time
- Small Batch Training: When you can’t fit large batches in memory
- Online Learning: When you process one sample at a time
Did You Know? Every Transformer block in GPT-2 and GPT-3 applies Layer Normalization, so when you chat with a model from that family, your text passes through a long chain of Layer Norm operations. Layer Norm has been part of the Transformer design since the original 2017 architecture.
Layer Norm in PyTorch
    import torch.nn as nn

    # For a fully connected layer with 256 features
    layer_norm = nn.LayerNorm(256)

    # In a Transformer-style block
    class TransformerBlock(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            self.attention = nn.MultiheadAttention(d_model, num_heads=8)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_model * 4),
                nn.ReLU(),
                nn.Linear(d_model * 4, d_model)
            )

        def forward(self, x):
            # Pre-norm architecture (modern standard)
            # x has shape (seq_len, batch, d_model) because batch_first defaults to False
            h = self.norm1(x)  # Normalize once, reuse as query, key, and value
            x = x + self.attention(h, h, h)[0]
            x = x + self.ffn(self.norm2(x))
            return x

BatchNorm vs LayerNorm: When to Use Which
| Situation | Best Choice | Why |
|---|---|---|
| CNNs for images | BatchNorm | Large batches, spatial structure |
| Transformers | LayerNorm | Variable sequence lengths, small batches |
| RNNs/LSTMs | LayerNorm | Recurrent structure breaks batch assumptions |
| Small batches (<8) | LayerNorm | Batch statistics too noisy |
| Large batches (>32) | Either | Both work well |
| Single-sample (online) training | LayerNorm | No batch to compute statistics over |
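The only mechanical difference between the two is the axis the statistics are computed over. The following plain-Python sketch makes that concrete on a tiny 2×3 "batch"; the helper names are hypothetical and the learnable gamma/beta are left out for brevity.

```python
import math


def _standardize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm_rows(x):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [_standardize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]


def layer_norm_rows(x):
    """Normalize each sample (row) across its own features."""
    return [_standardize(row) for row in x]


x = [[1.0, 2.0, 3.0],
     [10.0, 20.0, 60.0]]  # 2 samples, 3 features

bn = batch_norm_rows(x)
ln = layer_norm_rows(x)
print(bn)
print(ln)
```

Note that the layer-norm result for the first sample does not change if the second sample is removed, while the batch-norm result depends on everything else in the batch.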
Dropout: Randomly Breaking Your Network (On Purpose)
Dropout is one of those ideas that sounds completely crazy until you realize it works brilliantly.
The Story of Dropout
In 2012, Geoffrey Hinton (yes, him again), along with Nitish Srivastava and others, proposed dropout in “Improving Neural Networks by Preventing Co-adaptation of Feature Detectors.”
The idea: during training, randomly set some neurons to zero.
That’s it. Just… turn things off randomly.
Did You Know? Geoffrey Hinton has said that dropout was inspired by how genes work in biological evolution. Sexual reproduction means each child gets a random combination of genes. This prevents individual genes from becoming too specialized or “co-adapted.” Dropout creates a similar effect in neural networks.
Why Dropout Works
Think of a team where one person does all the work. If that person gets sick, the team fails. But if everyone shares responsibility, losing any one person is survivable.
Dropout forces every neuron to be useful on its own, without relying too heavily on other specific neurons. This creates redundancy and generalization.
Here’s another way to think about it: dropout creates an implicit ensemble of networks. With N neurons that can each be on or off, you’re effectively training 2^N different network configurations!
Dropout in Practice
    import torch.nn as nn

    class NetworkWithDropout(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(784, 256),
                nn.ReLU(),
                nn.Dropout(0.5),  # 50% of neurons zeroed

                nn.Linear(256, 128),
                nn.ReLU(),
                nn.Dropout(0.5),

                nn.Linear(128, 10)  # No dropout before output!
            )

        def forward(self, x):
            return self.layers(x)

The Scaling Trick
There’s a subtle but important detail: during training, we zero out half the neurons. But during inference, all neurons are active. Doesn’t that change the expected output?
Yes! That’s why dropout scales the remaining activations during training. If dropout rate is 0.5, the remaining neurons are multiplied by 2. This way, the expected sum stays the same.
PyTorch handles this automatically, but it’s good to know what’s happening under the hood.
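Here is roughly what that scaling looks like, as a minimal plain-Python sketch of "inverted" dropout. The function name and the all-ones input are illustrative assumptions; in general the survivors are scaled by 1 / (1 - p), which is 2 for p = 0.5.

```python
import random


def inverted_dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 10_000

dropped = inverted_dropout(activations, p=0.5, rng=rng)
train_mean = sum(dropped) / len(dropped)  # stays close to 1.0

kept = inverted_dropout(activations, training=False)
eval_mean = sum(kept) / len(kept)         # exactly 1.0, nothing is dropped

print(train_mean, eval_mean)
```

Because the rescaling happens during training, inference needs no correction at all: the layer simply becomes the identity.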
Common Dropout Mistakes
- Using dropout during inference: Always call model.eval() to disable dropout during testing
- Dropout after the final layer: Don’t zero out your predictions!
- Too much dropout: Values above 0.5 can make training very slow
- Dropout with BatchNorm: The combination can be tricky; some argue you should use one or the other
Did You Know? Dropout was a key ingredient of AlexNet, the network that won the 2012 ImageNet competition, and the AlexNet paper received the NeurIPS Test of Time Award in 2022, ten years after its publication. Dropout itself remains available as a standard layer in every major deep learning framework.
Modern Alternatives to Dropout
While dropout is still widely used, some alternatives have emerged:
- DropPath/Stochastic Depth: Drop entire layers instead of neurons (used in ResNets)
- DropBlock: For CNNs, drop contiguous regions instead of random pixels
- Attention Dropout: Specialized for Transformer attention layers
- Dropout-Free Training: Some modern architectures don’t need dropout at all!
    import torch
    import torch.nn as nn

    # DropPath (Stochastic Depth) example
    class DropPath(nn.Module):
        def __init__(self, drop_prob=0.1):
            super().__init__()
            self.drop_prob = drop_prob

        def forward(self, x):
            if not self.training or self.drop_prob == 0:
                return x

            keep_prob = 1 - self.drop_prob
            # Create one random value per sample, broadcastable over the other dims
            shape = (x.shape[0],) + (1,) * (x.ndim - 1)
            random_tensor = keep_prob + torch.rand(shape, device=x.device)
            random_tensor = random_tensor.floor()  # Binarize: 1 with prob keep_prob, else 0

            # Rescale so the expected output matches the input
            return x / keep_prob * random_tensor

Weight Initialization: Where You Start Matters More Than You Think
You might think that since neural networks learn their weights, initialization doesn’t matter much. You’d be very wrong.
The Bad Old Days
In the early days of neural networks, people would initialize weights randomly from a uniform distribution like [-1, 1] or a normal distribution with mean 0 and standard deviation 1.
This worked terribly.
The problem? Gradients would either vanish or explode right from the start, before the network could learn anything useful.
Xavier Initialization: The First Good Answer
In 2010, Xavier Glorot and Yoshua Bengio published “Understanding the difficulty of training deep feedforward neural networks.” They showed mathematically that weights should be initialized based on the number of input and output connections.
The Xavier formula:

    weights ~ Uniform(-sqrt(6/(n_in + n_out)), sqrt(6/(n_in + n_out)))

Or the normal distribution version:

    weights ~ Normal(0, sqrt(2/(n_in + n_out)))

Where n_in is the number of input features, n_out is the number of output features, and the second argument of Normal is the standard deviation.
Did You Know? Xavier Glorot was a PhD student when he published this paper. His advisor, Yoshua Bengio, is now one of the “Godfathers of Deep Learning” and won the Turing Award in 2018. The Xavier initialization is sometimes called “Glorot initialization” after its inventor.
He Initialization: For ReLU Networks
There was one problem with Xavier initialization: it assumed activations that are symmetric around zero (like tanh). But ReLU, which zeros out negative values, is not symmetric.
In 2015, Kaiming He (then at Microsoft Research) proposed an adjustment specifically for ReLU:

    weights ~ Normal(0, sqrt(2/n_in))

Notice the factor is 2/n_in instead of 2/(n_in + n_out). This accounts for ReLU’s asymmetry: ReLU zeros out roughly half of its inputs, which halves the variance of the signal, so the weight variance is doubled to compensate.
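Plugging numbers into the formulas makes the difference tangible. The helpers below are hypothetical, plain-Python transcriptions of the Xavier and He formulas, evaluated for an assumed 256-to-256 layer.

```python
import math


def xavier_normal_std(n_in, n_out):
    """Standard deviation for the normal-distribution Xavier variant."""
    return math.sqrt(2.0 / (n_in + n_out))


def xavier_uniform_bound(n_in, n_out):
    """Half-width a of the Uniform(-a, a) Xavier variant."""
    return math.sqrt(6.0 / (n_in + n_out))


def he_normal_std(n_in):
    """Standard deviation for He (Kaiming) initialization with ReLU."""
    return math.sqrt(2.0 / n_in)


n_in, n_out = 256, 256
print(f"Xavier normal std:    {xavier_normal_std(n_in, n_out):.4f}")
print(f"Xavier uniform bound: {xavier_uniform_bound(n_in, n_out):.4f}")
print(f"He normal std:        {he_normal_std(n_in):.4f}")
```

Two things are worth noticing: the uniform and normal Xavier variants have the same variance (a Uniform(-a, a) distribution has standard deviation a / sqrt(3)), and for a square layer the He standard deviation is sqrt(2) times the Xavier one.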
Did You Know? Kaiming He went on to invent ResNets, one of the most influential architectures in deep learning history. He’s won multiple best paper awards and is considered one of the most important researchers in computer vision. His initialization formula, like his networks, is elegantly simple.
Initialization in PyTorch
PyTorch does reasonable initialization by default, but you can be explicit:

    import torch.nn as nn
    import torch.nn.init as init

    def init_weights_xavier(m):
        """Xavier initialization for Linear and Conv layers"""
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            init.xavier_uniform_(m.weight)
            if m.bias is not None:
                init.zeros_(m.bias)

    def init_weights_he(m):
        """He (Kaiming) initialization for ReLU networks"""
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
            if m.bias is not None:
                init.zeros_(m.bias)

    # Apply to model (MyNetwork stands in for any nn.Module you have defined)
    model = MyNetwork()
    model.apply(init_weights_he)  # Applies to all layers recursively

Which Initialization to Use?
| Activation Function | Recommended Initialization |
|---|---|
| ReLU, Leaky ReLU | He (Kaiming) |
| tanh, sigmoid | Xavier (Glorot) |
| SELU | LeCun (similar to Xavier) |
| GELU | He often works well |
| Linear (no activation) | Xavier |
Special Cases: Transformers and Attention
Modern Transformers often use special initialization schemes:

    import torch
    import torch.nn as nn

    # GPT-style initialization
    def gpt_init(module):
        if isinstance(module, nn.Linear):
            # Plain N(0, 0.02) init. GPT-2 additionally scales residual-path
            # projection weights by 1/sqrt(N), where N is the number of
            # residual layers; that extra scaling is not applied here.
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Did You Know? The specific value 0.02 for the standard deviation goes back to the original GPT paper, which reports that a simple N(0, 0.02) initialization was sufficient because Layer Norm is used throughout the model. It has since become a common default in Transformer training, even though there’s no deep theoretical justification for this exact number.
Learning Rate Scheduling: The Art of Knowing When to Slow Down
Learning rate is arguably the most important hyperparameter in deep learning. Too high, and your training explodes. Too low, and you never converge. But here’s the thing: the optimal learning rate changes during training.
The Intuition
Imagine you’re searching for the lowest point in a valley:
- At the start, you’re far away — take big steps to make progress
- As you get closer, take smaller steps to avoid overshooting
- Near the minimum, take tiny steps for fine-tuning
This is exactly what learning rate schedules do.
Step Decay: The Classic Approach
Step decay is the simplest and oldest learning rate schedule. The idea is straightforward: train at a high learning rate until progress plateaus, then drop the rate and continue. It’s like shifting gears in a car: you start in a high gear for speed, then shift down for precision.
Why does dropping the LR help? Early in training, you want big steps to escape bad regions quickly. But as you approach the minimum, those same big steps cause you to bounce around instead of settling in. Dropping the learning rate is like switching from running to walking when you get close to your destination.
Did You Know? The “divide by 10 at epochs 30, 60, 90” schedule is closely associated with ResNet-style ImageNet training. The original ResNet paper by Kaiming He and colleagues (2015) divided the learning rate by 10 whenever the error plateaued, and fixed-epoch versions of that recipe became so standard that they’re still the default in many image classification codebases today, even though smoother schedules often work better. Sometimes the “good enough” solution from a famous paper becomes the industry default.
Implementation (PyTorch and TensorFlow; conceptually the same):

    # PyTorch
    from torch.optim.lr_scheduler import StepLR
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

    # TensorFlow
    import tensorflow as tf
    tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.001,
        decay_steps=30*steps_per_epoch,  # steps_per_epoch: batches per epoch in your pipeline
        decay_rate=0.1,
        staircase=True  # Without this the decay is continuous, not stepwise
    )

    # The math (framework-agnostic):
    # new_lr = initial_lr * (gamma ^ floor(epoch / step_size))
    # At epoch 30: 0.001 * 0.1 = 0.0001
    # At epoch 60: 0.001 * 0.01 = 0.00001

When to use step decay:
- Simple baseline that usually works
- When you don’t want to tune fancy schedules
- Legacy codebases that expect this pattern
When to avoid:
- The sudden drops can destabilize training
- Cosine annealing usually works as well or better with less tuning
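The step decay formula fits in one line of plain Python. This is a sketch of the math only, not of any framework's scheduler class; `step_decay_lr` is a hypothetical name.

```python
def step_decay_lr(initial_lr, epoch, step_size=30, gamma=0.1):
    """new_lr = initial_lr * gamma ** floor(epoch / step_size)"""
    return initial_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 59, 60, 90):
    print(f"epoch {epoch:>2}: lr = {step_decay_lr(0.001, epoch):.6f}")
```

Printing the values around epochs 30 and 60 shows the abrupt tenfold drops that give the schedule its name.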
Cosine Annealing: Smooth and Effective
While step decay makes sudden jumps, cosine annealing provides a smooth, continuous decrease. Think of it like a car slowing down gradually as it approaches a red light, rather than slamming on the brakes.
Why cosine specifically? The cosine function has a nice property: it decreases slowly at first, faster in the middle, and slowly again at the end. This means:
- Early training: LR stays high longer, allowing continued exploration
- Mid training: LR drops steadily as the model refines
- Late training: LR decreases very slowly for fine-tuning
This matches our intuition about training: we want to explore broadly at first, then settle into a good minimum carefully.
    # PyTorch
    from torch.optim.lr_scheduler import CosineAnnealingLR
    scheduler = CosineAnnealingLR(optimizer, T_max=100)  # Anneal over 100 epochs

    # TensorFlow
    tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=0.001,
        decay_steps=100*steps_per_epoch
    )

The formula (for the curious):

    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(epoch * π / T_max))

Worked example (lr_max=0.001, lr_min=0, T_max=100):
- Epoch 0: 0.5 * 0.001 * (1 + cos(0)) = 0.5 * 0.001 * 2 = 0.001 (max)
- Epoch 25: 0.5 * 0.001 * (1 + cos(π/4)) ≈ 0.5 * 0.001 * 1.707 ≈ 0.00085
- Epoch 50: 0.5 * 0.001 * (1 + cos(π/2)) = 0.5 * 0.001 * 1 = 0.0005 (half)
- Epoch 100: 0.5 * 0.001 * (1 + cos(π)) = 0.5 * 0.001 * 0 = 0 (min)

Notice how the LR drops faster in the middle (0.00085 → 0.0005 between epochs 25 and 50) than at the start (0.001 → 0.00085 between epochs 0 and 25). This is the “sweet spot” of cosine annealing.
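The worked example can be reproduced with a few lines of plain Python. The function below is a direct transcription of the formula; its name and argument order are illustrative choices, not a framework API.

```python
import math


def cosine_annealing_lr(epoch, t_max, lr_max, lr_min=0.0):
    """lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / t_max))"""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:>3}: lr = {cosine_annealing_lr(epoch, 100, 0.001):.6f}")
```

Comparing consecutive rows shows the shape directly: small changes near epochs 0 and 100, and the largest drops around the midpoint.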
Warmup: Start Slow, Then Speed Up
Modern large-scale training almost always uses warmup: start with a very low learning rate and gradually increase it.
Why? At the beginning of training:
- Your random weights give garbage outputs
- Gradients can be unstable
- Large learning rates can push you into bad regions
Warmup gives the network time to “warm up” before hitting full speed.
    import math

    def linear_warmup_cosine_decay(epoch, warmup_epochs, total_epochs, base_lr):
        if epoch < warmup_epochs:
            # Linear warmup; (epoch + 1) keeps the first epoch from running at LR 0
            return base_lr * (epoch + 1) / warmup_epochs
        else:
            # Cosine decay
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

    # PyTorch implementation
    from torch.optim.lr_scheduler import LambdaLR

    # LambdaLR multiplies the optimizer's LR by the returned factor, hence base_lr=1.0
    scheduler = LambdaLR(
        optimizer,
        lr_lambda=lambda epoch: linear_warmup_cosine_decay(
            epoch, warmup_epochs=10, total_epochs=100, base_lr=1.0
        )
    )

Did You Know? The BERT paper (2018) used warmup for the first 10,000 steps, then linear decay. When researchers tried to train BERT without warmup, training often diverged entirely. For large language models, warmup isn’t optional; it’s essential.
One Cycle Learning Rate: The Fast Path
In 2018, Leslie Smith proposed the “1cycle” policy that trains faster and achieves better results:
- Start with low learning rate
- Increase to maximum
- Decrease back down
- Drop to very low for final fine-tuning
    from torch.optim.lr_scheduler import OneCycleLR

    scheduler = OneCycleLR(
        optimizer,
        max_lr=0.01,
        total_steps=total_steps,  # number of epochs * batches per epoch
        pct_start=0.3,            # 30% of training for warmup
        anneal_strategy='cos'
    )

    # Call scheduler.step() after EVERY BATCH, not every epoch
    for batch in train_loader:
        optimizer.zero_grad()  # Clear gradients from the previous batch
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
        scheduler.step()  # Per-batch update

Did You Know? The 1cycle policy is closely tied to what Smith called “super-convergence”: with it, some models train 4-10x faster than with traditional schedules while achieving equal or better accuracy. fastai popularized this technique, and it’s now standard in many codebases.
Learning Rate Finder: Don’t Guess, Test
How do you choose the maximum learning rate? Try them all!
    import copy
    import torch.optim as optim

    def find_learning_rate(model, train_loader, criterion, start_lr=1e-7, end_lr=10, num_iter=100):
        """
        Run training with exponentially increasing LR, plot loss.
        Pick LR where loss is decreasing most rapidly.
        """
        # Deep-copy the initial weights; state_dict() alone returns live references
        model_state = copy.deepcopy(model.state_dict())
        optimizer = optim.Adam(model.parameters(), lr=start_lr)

        lrs, losses = [], []
        lr = start_lr
        factor = (end_lr / start_lr) ** (1 / num_iter)

        for i, (inputs, targets) in enumerate(train_loader):
            if i >= num_iter:
                break

            optimizer.param_groups[0]['lr'] = lr

            outputs = model(inputs)
            loss = criterion(outputs, targets)

            lrs.append(lr)
            losses.append(loss.item())

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            lr *= factor

        model.load_state_dict(model_state)  # Restore initial state
        return lrs, losses

    # Plot and pick LR where loss is dropping fastest (not the minimum!)

The rule of thumb: choose a learning rate about 10x smaller than the one where the loss bottoms out, which usually lands in the steepest part of the descent.
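If you would rather not eyeball the plot, the "steepest descent" point can be picked programmatically. The sketch below is one simple way to do it in plain Python; `suggest_lr` and the synthetic loss curve are made up for illustration, and real curves are noisy enough that you would usually smooth the losses first.

```python
import math


def suggest_lr(lrs, losses):
    """Return the LR at the point where loss falls fastest per unit of log10(LR)."""
    best_slope, best_lr = 0.0, None
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:
            best_slope, best_lr = slope, lrs[i]
    return best_lr


# Synthetic curve: flat, then a steep drop, then divergence
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.29, 2.20, 1.60, 1.20, 1.50, 9.00]

print(suggest_lr(lrs, losses))
```

On this synthetic curve the loss minimum sits at 1e-2, and the steepest drop ends at 1e-3, which agrees with the "10x smaller than the minimum" rule of thumb.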
Gradient Clipping: Taming Explosive Updates
Even with good initialization and normalization, gradients can sometimes explode. This is especially common with:
- RNNs and LSTMs (long sequence dependencies)
- Very deep networks
- Large learning rates
- Unusual data (outliers)
Gradient Norm Clipping
The most common approach: if the total gradient norm exceeds a threshold, scale it down.

    import torch.nn.utils as nn_utils

    # During training
    optimizer.zero_grad()
    loss.backward()

    # Clip gradients before optimizer step
    max_grad_norm = 1.0
    nn_utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()

This ensures that no matter how large the computed gradients are, the gradient passed to the optimizer has a total norm of at most max_grad_norm, which keeps the update bounded.
Did You Know? Gradient norm clipping was popularized by Pascanu, Mikolov, and Bengio in their 2013 paper “On the difficulty of training recurrent neural networks.” The choice of max_grad_norm = 1.0 is an empirical convention rather than a derived value: GPT-3 and BERT both clip at 1.0, and it is a reasonable default for most architectures, though some models use different values.
Gradient Value Clipping
An alternative: clip each gradient value independently.

    nn_utils.clip_grad_value_(model.parameters(), clip_value=0.5)

This clips each gradient to [-0.5, 0.5]. It’s simpler but can change the direction of the gradient, while norm clipping preserves direction.
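The difference between the two clipping styles is easiest to see on a two-component gradient. The functions below are plain-Python sketches of the ideas, not the PyTorch implementations, and their names are hypothetical.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def clip_by_value(grads, clip_value):
    """Clamp every component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


grads = [3.0, 4.0]                     # norm 5, direction (0.6, 0.8)
by_norm = clip_by_norm(grads, 1.0)     # about [0.6, 0.8]: same direction, norm 1
by_value = clip_by_value(grads, 0.5)   # [0.5, 0.5]: the direction has changed

print(by_norm, by_value)
```

Norm clipping keeps the 3:4 ratio between the components, while value clipping flattens both to the same number and so points the update somewhere else.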
When to Use Gradient Clipping
| Situation | Recommendation |
|---|---|
| Training RNNs/LSTMs | Always use (norm clipping) |
| Training Transformers | Usually use (norm clipping) |
| Standard CNNs | Often unnecessary |
| Large learning rates | Recommended |
| Seeing NaN losses | Try clipping as diagnostic |
    import torch.nn.utils as nn_utils

    # A robust training loop with gradient clipping
    def train_epoch(model, loader, optimizer, criterion, max_grad_norm=1.0):
        model.train()
        total_loss = 0

        for batch in loader:
            inputs, targets = batch

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()

            # Gradient clipping (returns the total norm measured before clipping)
            grad_norm = nn_utils.clip_grad_norm_(model.parameters(), max_grad_norm)

            # Optional: log if clipping occurred
            if grad_norm > max_grad_norm:
                print(f"Gradient clipped: {grad_norm:.2f} -> {max_grad_norm}")

            optimizer.step()
            total_loss += loss.item()

        return total_loss / len(loader)

Early Stopping: Knowing When to Quit
Training a neural network too long leads to overfitting: the network memorizes the training data instead of learning general patterns.
The Concept
Track performance on a validation set (data the model doesn’t train on). When validation performance stops improving, stop training.

    Training Loss: ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓  (keeps decreasing)
    Val Loss:      ↓ ↓ ↓ ↓ ↓ → → ↑ ↑ ↑  (stops, then increases = overfitting!)
                             ^
                         STOP HERE

Implementing Early Stopping
A good early stopping implementation needs three key ingredients:
- Patience: How many epochs without improvement before stopping. Think of it like fishing: you don’t leave after one bad cast, but after 10 casts with no bites, it’s time to try another spot.
- Minimum Delta: What counts as “improvement”? If validation loss drops from 0.5000 to 0.4999, is that real progress or just noise? A min_delta of 0.001 means we only count improvements larger than 0.001 in absolute loss.
- Restore Best: When we stop, should we restore the model to its best state? If patience is 10 and we stopped after 10 epochs of no improvement, the current model is worse than it was 10 epochs ago. We almost always want to restore.
Here’s a reusable implementation:
    import copy

    class EarlyStopping:
        """Stop training when validation loss stops improving."""

        def __init__(self, patience=7, min_delta=0.001, restore_best=True):
            self.patience = patience
            self.min_delta = min_delta
            self.restore_best = restore_best

            self.best_loss = float('inf')
            self.best_model = None
            self.counter = 0
            self.should_stop = False

The __call__ method makes this class callable like a function. Each epoch, we check if validation loss improved:

        def __call__(self, val_loss, model):
            if val_loss < self.best_loss - self.min_delta:
                # Improvement! Reset patience counter
                self.best_loss = val_loss
                # Deep copy: state_dict() holds references to the live tensors
                self.best_model = copy.deepcopy(model.state_dict())
                self.counter = 0
            else:
                # No improvement - increment counter
                self.counter += 1
                if self.counter >= self.patience:
                    self.should_stop = True
                    if self.restore_best:
                        model.load_state_dict(self.best_model)

            return self.should_stop

Notice how we save a deep copy of the model state dict, not a reference. state_dict() returns references to the live parameter tensors, and a shallow .copy() would still share them, so the “best” weights would be silently overwritten as training continues.
Using it in your training loop:
early_stopping = EarlyStopping(patience=10, min_delta=0.001)

for epoch in range(max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss = validate(model, val_loader, criterion)

    print(f"Epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}")

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break

Did You Know? Early stopping was formalized in the 1990s, but the idea goes back to the earliest days of neural networks. It’s sometimes called “regularization for free” because it prevents overfitting without any changes to the loss function or model architecture.
Patience: How Long to Wait
Choosing patience is a trade-off:
- Too small: Stop before the model has a chance to improve
- Too large: Waste time training an overfitting model
Rules of thumb:
- Small datasets: patience = 5-10
- Large datasets: patience = 10-20
- With learning rate scheduling: larger patience (the LR drop might help)
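To get a feel for how patience and min_delta interact, the sketch below replays a made-up validation curve through the same counter logic in plain Python. The helper name `stopping_epoch` and the loss values are illustrative assumptions, not part of the module’s code.

```python
def stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch) under the patience rule.

    stop_epoch is None if the curve ends before early stopping triggers.
    Hypothetical helper, for illustration only.
    """
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Real improvement: remember it and reset the counter
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up curve: improves, plateaus, dips once more at epoch 7, then overfits.
curve = [0.90, 0.70, 0.60, 0.55, 0.549, 0.56, 0.58, 0.54,
         0.60, 0.65, 0.70, 0.72]

print(stopping_epoch(curve, patience=2, min_delta=0.005))  # -> (5, 3)
print(stopping_epoch(curve, patience=4, min_delta=0.005))  # -> (11, 7)
```

With patience=2 the run stops at epoch 5 and never sees the better loss at epoch 7; with patience=4 it finds that epoch but trains six epochs longer. That is the trade-off in miniature.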
Model Checkpointing: Never Lose Your Progress
Training large models can take days or weeks. Hardware can fail. Jobs get killed. Always save checkpoints.
What to Save
A complete checkpoint includes:
- Model weights (model.state_dict())
- Optimizer state (optimizer.state_dict())
- Learning rate scheduler state (scheduler.state_dict())
- Current epoch/step
- Best validation score
- Any other training state (random seeds, etc.)
import os
import torch

def save_checkpoint(state, filename='checkpoint.pt', is_best=False):
    """Save training checkpoint."""
    torch.save(state, filename)
    if is_best:
        # Keep a single best file next to the per-epoch ones,
        # instead of one "_best" copy per improved epoch
        best_filename = os.path.join(os.path.dirname(filename), 'checkpoint_best.pt')
        torch.save(state, best_filename)

def load_checkpoint(filename, model, optimizer=None, scheduler=None):
    """Load training checkpoint and return (epoch, best_val_loss)."""
    checkpoint = torch.load(filename)

    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer is not None and 'optimizer_state_dict' in checkpoint:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

    if scheduler is not None and 'scheduler_state_dict' in checkpoint:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    return checkpoint.get('epoch', 0), checkpoint.get('best_val_loss', float('inf'))

# In training loop
for epoch in range(start_epoch, max_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    val_loss = validate(model, val_loader, criterion)

    scheduler.step()

    # Save checkpoint
    is_best = val_loss < best_val_loss
    if is_best:
        best_val_loss = val_loss

    save_checkpoint({
        'epoch': epoch + 1,  # Epoch to resume from
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'best_val_loss': best_val_loss,
    }, filename=f'checkpoint_epoch_{epoch}.pt', is_best=is_best)

Checkpoint Management
Don’t keep every checkpoint forever, or you’ll run out of disk space!
import glob
import os

def cleanup_old_checkpoints(checkpoint_dir, keep_last=3, keep_best=True):
    """Remove old per-epoch checkpoints, keeping only the most recent."""
    checkpoints = sorted(
        glob.glob(os.path.join(checkpoint_dir, 'checkpoint_epoch_*.pt')),
        key=os.path.getmtime
    )

    # Never delete a best checkpoint that happens to match the pattern
    if keep_best:
        checkpoints = [c for c in checkpoints if not c.endswith('_best.pt')]

    # Delete everything except the keep_last most recent checkpoints
    old_checkpoints = checkpoints[:-keep_last] if keep_last > 0 else checkpoints
    for checkpoint in old_checkpoints:
        os.remove(checkpoint)
        print(f"Removed old checkpoint: {checkpoint}")

Did You Know? Google’s TPU training infrastructure automatically handles checkpointing to Google Cloud Storage. When they trained GPT-3-sized models, they saved checkpoints every 10 minutes because hardware failures were that common. At scale, checkpointing isn’t optional: it’s survival.
Putting It All Together: A Production Training Loop
Here’s a complete training script that incorporates everything we’ve learned:
import os
from datetime import datetime

import torch
import torch.nn as nn
import torch.nn.utils as nn_utils
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

class ProductionTrainer:
    """A production-ready training class with all best practices."""

    def __init__(
        self,
        model,
        train_loader,
        val_loader,
        criterion,
        learning_rate=1e-3,
        max_epochs=100,
        patience=10,
        max_grad_norm=1.0,
        checkpoint_dir='checkpoints',
        device='cuda' if torch.cuda.is_available() else 'cpu'
    ):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.criterion = criterion
        self.device = device
        self.max_epochs = max_epochs
        self.max_grad_norm = max_grad_norm
        self.checkpoint_dir = checkpoint_dir

        # Create checkpoint directory
        os.makedirs(checkpoint_dir, exist_ok=True)

        # Optimizer with decoupled weight decay
        self.optimizer = optim.AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=0.01
        )

        # Learning rate scheduler: cosine annealing with warm restarts
        self.scheduler = CosineAnnealingWarmRestarts(
            self.optimizer,
            T_0=10,    # First restart after 10 epochs
            T_mult=2   # Double the restart period each time
        )

        # Early stopping (the EarlyStopping class defined earlier in this module)
        self.early_stopping = EarlyStopping(patience=patience)

        # Tracking
        self.best_val_loss = float('inf')
        self.history = {'train_loss': [], 'val_loss': [], 'lr': []}

    def train_epoch(self):
        """Train for one epoch."""
        self.model.train()
        total_loss = 0
        num_batches = len(self.train_loader)

        for batch_idx, (inputs, targets) in enumerate(self.train_loader):
            inputs, targets = inputs.to(self.device), targets.to(self.device)

            # Forward pass
            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)

            # Backward pass
            loss.backward()

            # Gradient clipping
            nn_utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)

            # Update weights
            self.optimizer.step()

            total_loss += loss.item()

            # Progress indicator
            if batch_idx % 50 == 0:
                print(f"  Batch {batch_idx}/{num_batches}, Loss: {loss.item():.4f}")

        return total_loss / num_batches

    @torch.no_grad()
    def validate(self):
        """Validate the model."""
        self.model.eval()
        total_loss = 0

        for inputs, targets in self.val_loader:
            inputs, targets = inputs.to(self.device), targets.to(self.device)
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)
            total_loss += loss.item()

        return total_loss / len(self.val_loader)

    def save_checkpoint(self, epoch, is_best=False):
        """Save training checkpoint."""
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'scheduler_state_dict': self.scheduler.state_dict(),
            'best_val_loss': self.best_val_loss,
            'history': self.history,
        }

        path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.pt')
        torch.save(checkpoint, path)

        if is_best:
            best_path = os.path.join(self.checkpoint_dir, 'checkpoint_best.pt')
            torch.save(checkpoint, best_path)
            print("  New best model saved!")

    def train(self):
        """Full training loop."""
        print(f"Training on {self.device}")
        print(f"Model parameters: {sum(p.numel() for p in self.model.parameters()):,}")
        print("=" * 50)

        for epoch in range(self.max_epochs):
            start_time = datetime.now()

            # Training
            train_loss = self.train_epoch()

            # Validation
            val_loss = self.validate()

            # Learning rate scheduling (record the LR used this epoch, then step)
            current_lr = self.optimizer.param_groups[0]['lr']
            self.scheduler.step()

            # Track history
            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_loss)
            self.history['lr'].append(current_lr)

            # Check for best model
            is_best = val_loss < self.best_val_loss
            if is_best:
                self.best_val_loss = val_loss

            # Save checkpoint
            self.save_checkpoint(epoch, is_best)

            # Logging
            elapsed = datetime.now() - start_time
            print(f"Epoch {epoch+1}/{self.max_epochs}")
            print(f"  Train Loss: {train_loss:.4f}")
            print(f"  Val Loss: {val_loss:.4f}")
            print(f"  LR: {current_lr:.6f}")
            print(f"  Time: {elapsed}")

            # Early stopping check
            if self.early_stopping(val_loss, self.model):
                print(f"\n Early stopping triggered at epoch {epoch+1}")
                break

        print("\n" + "=" * 50)
        print(f"Training complete! Best validation loss: {self.best_val_loss:.4f}")

        return self.history

# Example usage
if __name__ == "__main__":
    # Create model with all our techniques
    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.BatchNorm1d(256),
        nn.ReLU(),
        nn.Dropout(0.3),

        nn.Linear(256, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(0.3),

        nn.Linear(128, 10)
    )

    # Initialize with He initialization
    def init_weights(m):
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    model.apply(init_weights)

    # Train
    trainer = ProductionTrainer(
        model=model,
        train_loader=train_loader,  # You'd create these
        val_loader=val_loader,
        criterion=nn.CrossEntropyLoss(),
        learning_rate=1e-3,
        max_epochs=100,
        patience=10
    )

    history = trainer.train()

Common Mistakes and How to Avoid Them
Mistake 1: Forgetting train()/eval()
# WRONG
predictions = model(test_data)  # BatchNorm/Dropout still in training mode!

# RIGHT
model.eval()
with torch.no_grad():
    predictions = model(test_data)

Mistake 2: Wrong Learning Rate
# WRONG: Starting with huge learning rate
optimizer = optim.Adam(model.parameters(), lr=1.0)  # NaN in 3... 2... 1...

# RIGHT: Start conservative
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Standard starting point

Mistake 3: No Gradient Clipping for RNNs
# WRONG: RNN without gradient clipping
loss.backward()
optimizer.step()  # Gradients might explode!

# RIGHT: Always clip RNN gradients
loss.backward()
nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Mistake 4: Wrong Initialization for Activation
# WRONG: Xavier init with ReLU
nn.init.xavier_uniform_(layer.weight)  # Suboptimal for ReLU

# RIGHT: He init for ReLU
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

Mistake 5: BatchNorm After Dropout
# DEBATABLE: BatchNorm after Dropout
nn.Sequential(
    nn.Linear(256, 128),
    nn.Dropout(0.5),
    nn.BatchNorm1d(128),  # Sees different distributions during train/eval
    nn.ReLU()
)

# OFTEN BETTER: BatchNorm before Dropout (or skip one)
nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(0.5)
)

Memory & Performance Notes
Training deep networks pushes hardware to its limits. Understanding memory constraints and performance tradeoffs is essential for real-world training.
Out of Memory (OOM) — The Most Common Error
You’ll encounter CUDA out of memory more times than you can count. Here’s how to handle it:
Quick fixes (in order of preference):
- Reduce batch size: The most effective solution. If batch 64 fails, try 32, then 16.
- Enable gradient checkpointing: Trade compute for memory:
    from torch.utils.checkpoint import checkpoint

    # Instead of: output = self.layer(x)
    output = checkpoint(self.layer, x, use_reentrant=False)  # Recomputes forward during backward
- Use mixed precision training: Activations are stored in 16-bit, which can cut activation memory substantially (weights and optimizer state stay in 32-bit):
    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()
    with autocast():
        output = model(input)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
- Clear cache between batches: When desperate:
    torch.cuda.empty_cache()  # Releases unused cached blocks (not live tensors), but slows training
Root causes to investigate:
- Storing intermediate activations unnecessarily (use del tensor when done)
- Accumulating gradients without stepping (check your training loop!)
- Large embedding tables eating memory
- Model too big for your GPU: consider model parallelism
Batch Size Tradeoffs
| Batch Size | Pros | Cons |
|---|---|---|
| Small (8-32) | Lower memory, noisier gradients act as regularization, better generalization | Slower training, GPU underutilized |
| Medium (64-256) | Balanced memory/speed, stable training | Sweet spot for most tasks |
| Large (512+) | Faster training, smoother gradients, better GPU utilization | High memory, may need LR warmup, can hurt generalization |
The learning rate scaling rule: When you multiply the batch size by N, multiply the learning rate by √N (or by N, combined with warmup). This keeps the effective update size similar.
# Example: doubling batch size from 32 to 64
base_lr = 1e-3
batch_multiplier = 64 / 32                    # = 2
new_lr = base_lr * (batch_multiplier ** 0.5)  # ≈ 1.41e-3

Gradient Accumulation: Big Batches on Small GPUs
Can’t fit batch size 64 in memory? Use gradient accumulation to simulate it:
accumulation_steps = 4  # Accumulate 4 mini-batches of 16 to simulate batch size 64
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    loss = criterion(outputs, targets) / accumulation_steps  # Scale loss
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

This gives you the gradient of a batch of 64 while only using memory for a batch of 16. (BatchNorm statistics are still computed per mini-batch of 16.)
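Why divide the loss by accumulation_steps? A small numeric check makes it concrete. The sketch below uses a one-parameter linear model with hand-written gradients in plain Python (the data, the weight, and the helper `mse_grad` are made up for illustration), and it assumes a mean-reduced loss and equally sized micro-batches.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up data: 8 samples, and an arbitrary current weight.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.0, 1.0]
ys = [1.0, -2.5, 4.2, 2.9, -1.1, 6.3, 0.2, 1.8]
w = 0.3

# Gradient from one big batch of 8 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same data as 4 micro-batches of 2: each micro-batch gradient is divided by
# accumulation_steps and summed, which is what loss / accumulation_steps
# followed by repeated backward() calls does.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated_grad = 0.0
for i in range(accumulation_steps):
    chunk = slice(i * micro, (i + 1) * micro)
    accumulated_grad += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps

print(full_batch_grad, accumulated_grad)  # equal up to floating-point rounding
```

Without the division, the accumulated gradient would be accumulation_steps times too large, which acts like silently multiplying the learning rate by 4.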
Multi-GPU Training
When one GPU isn’t enough:
| Strategy | Use Case | Complexity |
|---|---|---|
| DataParallel | Quick & dirty multi-GPU | Low (1 line of code) |
| DistributedDataParallel | Production training | Medium (requires setup) |
| Model Parallelism | Models larger than 1 GPU | High (manual splitting) |
| FSDP | Large models, efficient memory | Medium-High |
# DataParallel: easiest option
model = nn.DataParallel(model)  # Uses all available GPUs

# DistributedDataParallel: better performance (call torch.distributed.init_process_group first)
model = nn.parallel.DistributedDataParallel(model)

Did You Know? GPT-3 was trained on thousands of GPUs using tensor parallelism, where individual matrix multiplications are split across GPUs. The communication overhead was so high that they had to invent new parallelism strategies. Most practitioners will never need this level of scale. DataParallel is enough for quick single-machine experiments, although the PyTorch documentation recommends DistributedDataParallel even on a single machine because it is usually faster.
Performance Profiling
Find the bottleneck before optimizing:
# Simple timing
import time

torch.cuda.synchronize()  # Make sure earlier GPU work has finished
start = time.perf_counter()
output = model(input)
torch.cuda.synchronize()  # Important! GPU ops are async
print(f"Forward: {time.perf_counter() - start:.3f}s")

# PyTorch profiler for detailed analysis
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(input)
print(prof.key_averages().table(sort_by="cuda_time_total"))

Common bottlenecks:
- Data loading: Use num_workers > 0 and pin_memory=True
- CPU-GPU transfer: Batch your transfers, avoid frequent small copies
- Synchronization: Minimize .item() and .numpy() calls during training
Hands-On Practice
Exercise 1: Compare Initializations
Train the same network with different initializations and compare:
- Random uniform [-1, 1]
- Xavier/Glorot
- He/Kaiming
Measure: training speed, final accuracy, gradient magnitudes.
Exercise 2: Learning Rate Schedule Comparison
Implement and compare:
- Constant learning rate
- Step decay
- Cosine annealing
- 1cycle
Plot learning rate over time and final accuracy.
Exercise 3: BatchNorm vs LayerNorm
Train a simple network on MNIST with:
- No normalization
- BatchNorm
- LayerNorm
Vary batch size from 8 to 512 and measure the effect on each.
Production War Stories
The $2.3 Million Training Collapse
A major tech company trained a large language model for 6 weeks on expensive GPU clusters. At week 5, training loss suddenly spiked to infinity and never recovered. Root cause: No gradient clipping, and a rare data batch caused gradient explosion. They had to restart from scratch because their checkpoint from week 4 was corrupted.
Lesson learned: Always use gradient clipping (max_norm=1.0), checkpoint frequently (every 1000 steps), and validate checkpoint integrity.
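What might “validate checkpoint integrity” look like in practice? One possible approach, sketched below with the standard library only, is to write the file atomically (temp file, then rename) and store a SHA-256 checksum next to it. The helper names `atomic_write_with_checksum` and `is_intact` and the `.sha256` sidecar convention are assumptions made for this example, not an established API.

```python
import hashlib
import os
import tempfile


def atomic_write_with_checksum(payload: bytes, path: str) -> str:
    """Write payload to path atomically and record its SHA-256 in path + '.sha256'."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so a crash mid-write never leaves a half-written checkpoint at `path`.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    digest = hashlib.sha256(payload).hexdigest()
    with open(path + ".sha256", "w") as f:
        f.write(digest)
    return digest


def is_intact(path: str) -> bool:
    """Return True if the file still matches the checksum recorded at write time."""
    try:
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        with open(path + ".sha256") as f:
            expected = f.read().strip()
    except OSError:
        return False
    return actual == expected


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "checkpoint_step_1000.pt")
    atomic_write_with_checksum(b"pretend these are serialized weights", ckpt)
    print(is_intact(ckpt))  # True
    with open(ckpt, "ab") as f:
        f.write(b"simulated corruption")
    print(is_intact(ckpt))  # False
```

With PyTorch, one way to obtain the payload is to call torch.save(state, buffer) on an io.BytesIO buffer and pass buffer.getvalue(). Checking is_intact before resuming lets you fall back to an older checkpoint instead of discovering the corruption weeks later.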
The BatchNorm Batch Size Bug
A computer vision team deployed a model that worked perfectly in training but gave random predictions in production. The model used BatchNorm, and production inference ran with batch_size=1. Because nobody called model.eval(), BatchNorm normalized each input with the statistics of that single-sample batch instead of the running statistics learned during training.
# The bug that cost 3 weeks of debugging
model = load_model(checkpoint)
predictions = model(batch)  # WRONG: model still in train mode

# The fix
model = load_model(checkpoint)
model.eval()  # Critical for BatchNorm and Dropout!
with torch.no_grad():
    predictions = model(batch)

Lesson learned: Always verify model mode. Add assertions in production code:
assert not model.training, "Model must be in eval mode for inference"

Common Mistakes and Fixes
1. Learning Rate Too High
Symptom: Loss oscillates wildly or explodes to NaN
Fix: Use learning rate finder, start with 1e-4 for Adam, 1e-2 for SGD
2. Forgetting to Zero Gradients
Symptom: Gradients accumulate, training diverges
# Bug: gradients accumulate across batches
for inputs, targets in dataloader:
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()  # Gradients keep accumulating!

# Fix: zero gradients each step
for inputs, targets in dataloader:
    optimizer.zero_grad()  # Reset gradients
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

3. Wrong Initialization for Activation
Symptom: Dead neurons (ReLU) or vanishing gradients (sigmoid/tanh)
# Wrong: Xavier for ReLU
nn.init.xavier_uniform_(layer.weight)  # Derived for roughly linear, zero-centered activations such as tanh

# Right: He/Kaiming for ReLU
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')

4. No Warmup for Large Learning Rates
Symptom: Training crashes in first few batches
# Add warmup: start low, ramp up over first 1000 steps
warmup_steps = 1000
for step in range(total_steps):
    if step < warmup_steps:
        lr = base_lr * (step + 1) / warmup_steps  # step + 1 avoids a learning rate of exactly 0
    else:
        lr = base_lr
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # ... run one training step here ...

Interview Preparation
Q: What’s the difference between BatchNorm and LayerNorm? When would you use each?
BatchNorm normalizes across the batch dimension — great for CNNs with large batches (32+). LayerNorm normalizes across feature dimensions — essential for Transformers and when batch sizes vary. Use BatchNorm for computer vision, LayerNorm for NLP and variable batch sizes.
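If the “which axis?” distinction feels abstract, a toy version may help. The sketch below is plain Python on a 4-sample, 3-feature batch with made-up numbers. It deliberately leaves out the learnable scale/shift and the running statistics that the real layers have, and the function names are just for this example.

```python
from statistics import fmean, pstdev


def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a sequence of floats."""
    mean = fmean(values)
    std = (pstdev(values) ** 2 + eps) ** 0.5
    return [(v - mean) / std for v in values]


def batch_norm(batch):
    """Normalize each feature (column) with statistics taken across the batch."""
    columns = [normalize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize each sample (row) with statistics taken across its own features."""
    return [normalize(row) for row in batch]


batch = [
    [1.0, 200.0, -3.0],
    [2.0, 180.0, -1.0],
    [3.0, 220.0, -2.0],
    [6.0, 240.0, -6.0],
]
bn = batch_norm(batch)
ln = layer_norm(batch)

# LayerNorm output for a sample does not depend on what else is in the batch...
print(layer_norm([batch[0]])[0] == ln[0])  # True
# ...while batch statistics over a single sample collapse every feature to zero.
print(batch_norm([batch[0]])[0])  # [0.0, 0.0, 0.0]
```

The last two lines are the practical reason behind the rule of thumb: LayerNorm behaves identically at any batch size, while BatchNorm in training mode depends on having enough samples per batch.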
Q: Why does gradient clipping help training?
Gradient clipping prevents exploding gradients by capping the gradient norm. Without it, a single bad batch can produce huge gradients that destroy learned weights. It’s essential for RNNs and helpful for any deep network. A typical starting value is max_norm=1.0, for RNNs and Transformers alike; treat it as a hyperparameter and tune it (values up to about 5.0 are also used).
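“Capping the gradient norm” can be made concrete in a few lines. The sketch below is a plain-Python illustration of clipping by global norm on nested lists, the same idea that clip_grad_norm_ applies to tensors. The function name `clip_by_global_norm` and the example gradients are made up for this example.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient vectors so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale factor is capped at 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm


# Two "parameter tensors" with a combined norm of sqrt(9 + 16 + 0 + 144) = 13.
exploding = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=1.0)
print(norm_before)  # 13.0
print(clipped)      # same direction, combined norm just under 1.0

# Gradients already below the threshold are left alone.
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(small)  # [[0.3, 0.4]]
```

Note that every component is scaled by the same factor, so the direction of the update is preserved and only its length changes. That is the difference from clipping each value independently.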
Q: Explain the 1cycle learning rate policy.
1cycle starts with a low learning rate, ramps up to a maximum over 30% of training, then gradually decreases back down. It achieves “super-convergence” — training faster with better final accuracy than constant learning rate. The key insight is that high learning rates help escape local minima early in training.
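To see the shape of that schedule, here is a simplified sketch in plain Python. It is not a reimplementation of any library scheduler: the function `one_cycle_lr` is hypothetical, it uses cosine ramps in both phases, and its defaults (pct_start=0.3, div_factor=25, final_div_factor=1e4) are chosen to resemble those of PyTorch’s OneCycleLR.

```python
import math


def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0-based) for a simplified one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = max(1, round(pct_start * total_steps))

    if step < warmup_steps:
        # Phase 1: ramp up from initial_lr to max_lr
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        # Phase 2: anneal from max_lr down to min_lr
        progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
        start, end = max_lr, min_lr

    # Cosine interpolation: start at progress=0, end at progress=1
    return end + (start - end) * (1.0 + math.cos(math.pi * progress)) / 2.0


total_steps = 100
schedule = [one_cycle_lr(s, total_steps, max_lr=1e-2) for s in range(total_steps)]
for s in (0, 15, 30, 65, 99):
    print(f"step {s:3d}: lr = {schedule[s]:.2e}")
```

With these settings the learning rate starts at max_lr / 25, peaks at step 30 (30% of training), and ends far below where it started. Plotting `schedule` is a quick way to sanity-check any scheduler configuration before a long run.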
Q: How would you debug a model that trains well but performs poorly on validation?
This is overfitting. Debugging steps: (1) Add/increase dropout, (2) Use data augmentation, (3) Add L2 regularization (weight decay), (4) Early stopping based on validation loss, (5) Reduce model capacity, (6) Get more training data. Monitor the gap between train and val loss — should be small.
Q: What learning rate would you start with for a new project?
For Adam optimizer, start with 1e-4 (0.0001) — it’s a safe default that works for most architectures. For SGD with momentum, try 1e-2 (0.01). Then use a learning rate finder: train for a few hundred steps while exponentially increasing LR from 1e-7 to 1. Plot loss vs LR and pick a value just before the loss starts climbing. Always add warmup for the first 5-10% of training steps.
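The learning rate sweep described in that answer is easy to generate. The sketch below is plain Python: `lr_sweep` produces the exponentially spaced learning rates, and `suggest_lr` applies one common rule of thumb (take the learning rate at the lowest loss, then back off by 10x). Both names, the back-off factor, and the loss values are assumptions for illustration; in a real range test the losses would come from training one step at each learning rate.

```python
def lr_sweep(num_steps, lr_min=1e-7, lr_max=1.0):
    """Exponentially spaced learning rates from lr_min to lr_max, inclusive."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def suggest_lr(lrs, losses, back_off=10.0):
    """Heuristic pick: the LR at the lowest recorded loss, divided by back_off."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_index] / back_off


# 8 steps from 1e-7 to 1: each step multiplies the learning rate by 10.
lrs = lr_sweep(8)

# Made-up losses: flat at tiny LRs, improving, then diverging at large LRs.
losses = [2.30, 2.30, 2.28, 2.10, 1.60, 1.20, 2.90, 9.00]

for lr, loss in zip(lrs, losses):
    print(f"lr = {lr:.0e}  loss = {loss:.2f}")
print(f"suggested starting lr: {suggest_lr(lrs, losses):.0e}")
```

A real sweep would use a few hundred steps rather than 8, and the loss curve is usually smoothed before picking a value, since single-batch losses are noisy.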
Deliverables
- Training Toolkit: A reusable training class with all best practices
- Initialization Comparison: Script comparing different initializations
- Learning Rate Finder: Implementation of LR range test
- Early Stopping: Production-ready early stopping implementation
- Checkpointing System: Complete save/load functionality
Success Criteria: Train a network to >98% accuracy on MNIST using all techniques.
Further Reading
- “Batch Normalization: Accelerating Deep Network Training” - Ioffe & Szegedy (2015)
- “How Does Batch Normalization Help Optimization?” - Santurkar et al. (2018)
- “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” - Srivastava et al. (2014)
- “Understanding the difficulty of training deep feedforward neural networks” - Glorot & Bengio (2010)
- “Delving Deep into Rectifiers” - He et al. (2015)
- “Super-Convergence” - Leslie Smith (2018)
Key Takeaways
- BatchNorm made deep networks trainable: use it for CNNs
- LayerNorm is the standard for Transformers and small batches
- Dropout prevents overfitting but remember train/eval modes
- Proper initialization (He for ReLU, Xavier for tanh) prevents gradient problems
- Learning rate schedules with warmup train faster and better
- Gradient clipping is essential for RNNs, helpful elsewhere
- Early stopping prevents overfitting for free
- Checkpointing is not optional for serious training
Next Steps
You’ve mastered the art of training deep networks. Now it’s time to build specific architectures!
- Module 29: Convolutional Neural Networks (CNNs) for images
- Module 30: Recurrent Neural Networks (RNNs/LSTMs) for sequences
- Module 31: Transformer Architecture from scratch
“Training deep networks is like raising children: you need patience, consistency, and knowing when to let go.” — Unknown ML practitioner