RNNs & Sequence Models
AI/ML Engineering Track | Complexity:
[COMPLEX]| Time: 5-6
Or: How AI Learned to See (And Why It Still Can’t Find Waldo)
Section titled “Or: How AI Learned to See (And Why It Still Can’t Find Waldo)”Reading Time: 6-8 hours Prerequisites: Module 28
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”By the end of this module, you will:
- Understand how convolution operations extract features from images
- Know why CNNs revolutionized computer vision (and the stories behind it)
- Build custom CNN architectures from scratch
- Master pooling, striding, and receptive fields
- Navigate classic architectures (LeNet, AlexNet, VGG, ResNet)
- Apply transfer learning to get 95%+ accuracy with minimal training
- Understand when CNNs fail and why transformers are challenging them
The Computer Vision Revolution: From Impossible to Ubiquitous
Section titled “The Computer Vision Revolution: From Impossible to Ubiquitous”Let’s set the stage with a shocking fact: in 2010, computers were worse at recognizing objects than a four-year-old child.
The ImageNet challenge — a competition to classify images into 1,000 categories like “golden retriever,” “volcano,” and “espresso” — had error rates around 25%. That means one in four images was misclassified. For reference, humans make mistakes about 5% of the time.
Then, in 2012, everything changed.
Did You Know? At the 2012 ImageNet competition, a team from the University of Toronto submitted an entry called “AlexNet.” It won by a landslide — 15% error rate versus 26% for the runner-up. This wasn’t just winning; it was demolishing the competition by the largest margin in the contest’s history. The winning team? Alex Krizhevsky, Ilya Sutskever (later co-founder of OpenAI), and Geoffrey Hinton. This single result is often credited with starting the modern AI boom.
What made AlexNet so special? It was a convolutional neural network — a type of neural network specifically designed to process images. CNNs weren’t new in 2012 (they were invented in the 1980s), but AlexNet was the first to combine them with GPUs and massive datasets. The result changed the world.
Today, CNNs power:
- Your phone’s camera: Face detection, portrait mode, night sight
- Self-driving cars: Lane detection, obstacle recognition
- Medical imaging: Cancer detection, X-ray analysis
- Social media: Content moderation, auto-tagging
- Security systems: Facial recognition, anomaly detection
Let’s understand how they work.
The Biological Inspiration: How Vision Actually Works
Section titled “The Biological Inspiration: How Vision Actually Works”Before we dive into the math, let’s understand why CNNs look the way they do. They’re directly inspired by how the human visual cortex works.
The Nobel Prize Discovery
Section titled “The Nobel Prize Discovery”In 1959, David Hubel and Torsten Wiesel did something that sounds like mad science: they inserted electrodes into a cat’s brain and showed it different images. What they discovered won them the Nobel Prize in 1981.
Did You Know? Hubel and Wiesel discovered that neurons in the visual cortex respond to very specific patterns. Some neurons fire only when they see a vertical line. Others respond to horizontal lines. Some detect edges at 45-degree angles. The visual system doesn’t try to process the whole image at once — it builds understanding from simple features. This hierarchical processing is exactly what CNNs replicate.
They found two key types of cells:
- Simple cells: Respond to edges at specific orientations in specific locations
- Complex cells: Respond to the same features but are more tolerant of position
This is exactly what CNNs do:
- Convolutional layers: Detect simple features (like edge detectors)
- Pooling layers: Add position tolerance (like complex cells)
- Deeper layers: Combine simple features into complex ones
The analogy isn’t perfect, but it’s striking how well the architecture maps onto biology.
From Pixels to Features: The Convolution Operation
Section titled “From Pixels to Features: The Convolution Operation”Here’s the fundamental problem with using regular neural networks on images:
An image is 224×224×3 pixels (RGB). That’s 150,528 input values. If your first hidden layer has 1,000 neurons, you need 150 million weights — just for the first layer! This is:
- Computationally insane: Too many parameters to train
- Wasteful: Most of those connections don’t matter
- Missing the point: Images have spatial structure that fully-connected networks ignore
Convolutions solve all three problems elegantly.
What Is Convolution? (The Intuitive Version)
Section titled “What Is Convolution? (The Intuitive Version)”Imagine you’re looking for a specific pattern in an image — say, vertical edges. You have a small “stencil” that knows what vertical edges look like. You slide this stencil across every position in the image, and at each position, you compute how well the image matches the stencil.
The stencil is called a kernel or filter. The sliding operation is called convolution.
Here’s a concrete example. This 3×3 kernel detects vertical edges:
Vertical Edge Detector:[-1 0 1][-1 0 1][-1 0 1]When you slide this over an image:
- Where there’s a vertical edge (dark on left, light on right), you get a high positive value
- Where there’s no edge or a horizontal edge, you get near zero
- Where there’s a reverse edge (light on left, dark on right), you get a negative value
The brilliant insight is that the same kernel is applied everywhere in the image. This means:
- Parameter sharing: A 3×3 kernel has only 9 parameters, not millions
- Translation equivariance: A cat in the corner uses the same features as a cat in the center
- Hierarchy: Stack many layers to detect increasingly complex patterns
The Math: How Convolution Works
Section titled “The Math: How Convolution Works”Let’s make this concrete. For a single-channel image (grayscale), convolution computes:
Output[i,j] = Σ Σ Input[i+m, j+n] × Kernel[m, n] m nWhere the sums run over the kernel dimensions.
Worked Example: Let’s apply a 3×3 kernel to a 5×5 image:
Input (5×5): Kernel (3×3):[1 2 3 4 5] [1 0 -1][6 7 8 9 10] [1 0 -1][11 12 13 14 15] [1 0 -1][16 17 18 19 20][21 22 23 24 25]
Computing Output[0,0] (top-left corner of output):= 1×1 + 2×0 + 3×(-1) + 6×1 + 7×0 + 8×(-1) + 11×1 + 12×0 + 13×(-1)= 1 + 0 - 3 + 6 + 0 - 8 + 11 + 0 - 13= -6
Computing Output[0,1] (slide right by 1):= 2×1 + 3×0 + 4×(-1) + 7×1 + 8×0 + 9×(-1) + 12×1 + 13×0 + 14×(-1)= 2 + 0 - 4 + 7 + 0 - 9 + 12 + 0 - 14= -6Notice the output is consistent because the input has no edges (just a uniform gradient).
Output Size: The Formula You’ll Use Daily
Section titled “Output Size: The Formula You’ll Use Daily”The output size depends on several parameters:
Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1Let’s break this down:
- Input: The input dimension (e.g., 224)
- Kernel: The kernel size (e.g., 3)
- Padding: Extra zeros around the border (e.g., 1)
- Stride: How many pixels to skip between applications (e.g., 1)
Common configurations:
kernel=3, padding=1, stride=1: Output same size as input (most common)kernel=3, padding=0, stride=1: Output shrinks by 2 pixels each sidekernel=3, padding=1, stride=2: Output is half the input size (downsampling)
Worked Example:
Input: 224×224Conv2d(kernel=3, padding=1, stride=2)
Output = (224 - 3 + 2×1) / 2 + 1 = (224 - 3 + 2) / 2 + 1 = 223/2 + 1 = 111.5 + 1 = 112
Output: 112×112PyTorch Implementation
Section titled “PyTorch Implementation”Here’s how convolution looks in code. We’ll build intuition by comparing with manual computation:
import torchimport torch.nn as nn
# Create a simple 5x5 image (batch_size=1, channels=1)image = torch.tensor([ [1., 2., 3., 4., 5.], [6., 7., 8., 9., 10.], [11., 12., 13., 14., 15.], [16., 17., 18., 19., 20.], [21., 22., 23., 24., 25.]]).unsqueeze(0).unsqueeze(0) # Shape: (1, 1, 5, 5)
# Create a vertical edge detectorconv = nn.Conv2d( in_channels=1, out_channels=1, kernel_size=3, padding=0, # No padding, output will be 3x3 bias=False)
# Set the kernel manuallywith torch.no_grad(): conv.weight = nn.Parameter(torch.tensor([ [[[ 1., 0., -1.], [ 1., 0., -1.], [ 1., 0., -1.]]] ]))
output = conv(image)print(output.shape) # torch.Size([1, 1, 3, 3])print(output[0, 0]) # The 3x3 outputNotice how PyTorch handles all the sliding and summation for you. The key parameters are:
in_channels: Number of input channels (1 for grayscale, 3 for RGB)out_channels: Number of filters (each learns a different feature)kernel_size: Size of the filter (3×3 is most common)padding: Zeros to add around the borderstride: How many pixels to skip
Multiple Filters: Learning Different Features
Section titled “Multiple Filters: Learning Different Features”A single kernel detects one pattern. To detect many patterns, we use multiple filters:
# 32 filters, each learning a different featureconv_block = nn.Conv2d( in_channels=3, # RGB input out_channels=32, # 32 different filters kernel_size=3, padding=1)
rgb_image = torch.randn(1, 3, 224, 224) # Batch of 1 RGB imagefeatures = conv_block(rgb_image)print(features.shape) # torch.Size([1, 32, 224, 224])Each of those 32 output channels represents a different “feature map” — one might detect vertical edges, another horizontal, another corners, etc. The network learns what features to detect during training.
Did You Know? When researchers visualize what early convolutional layers learn, they consistently find edge detectors, color blob detectors, and gradient detectors — remarkably similar to what Hubel and Wiesel found in cats’ brains. The network “rediscovers” the visual features that evolution took millions of years to develop.
Pooling: Making Networks Translation-Invariant
Section titled “Pooling: Making Networks Translation-Invariant”You’ve detected that there’s a cat in the image. Do you care if it’s 3 pixels to the left? No! You want your network to be robust to small shifts in position.
This is what pooling provides.
Max Pooling: The Dominant Approach
Section titled “Max Pooling: The Dominant Approach”Max pooling is simple: divide the feature map into regions and keep only the maximum value in each region.
Input (4×4): Max Pool 2×2, stride 2:[1 3 2 4][5 6 7 8] → [6 8][9 1 3 4] [9 7][2 3 7 1]
Top-left output: max(1,3,5,6) = 6Top-right output: max(2,4,7,8) = 8Bottom-left output: max(9,1,2,3) = 9Bottom-right output: max(3,4,7,1) = 7Max pooling provides:
- Downsampling: Reduces spatial dimensions (2× with 2×2 pooling)
- Translation invariance: Small shifts don’t change the output
- Computation savings: Fewer values to process in later layers
- Feature selection: Keeps the strongest activation in each region
Average Pooling: The Alternative
Section titled “Average Pooling: The Alternative”Average pooling takes the mean instead of the maximum:
Same input with Average Pool 2×2:Top-left: (1+3+5+6)/4 = 3.75Top-right: (2+4+7+8)/4 = 5.25...Average pooling is smoother and doesn’t throw away information as aggressively. It’s often used at the very end of networks (Global Average Pooling) to reduce feature maps to a single value per channel.
PyTorch Pooling Layers
Section titled “PyTorch Pooling Layers”# Max pooling: keep strongest activationmax_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Average pooling: take the meanavg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling: reduce entire feature map to single value# Usually done with nn.AdaptiveAvgPool2d(1)gap = nn.AdaptiveAvgPool2d(1) # Output is 1×1 regardless of input size
features = torch.randn(1, 64, 14, 14) # Some feature mappooled = gap(features)print(pooled.shape) # torch.Size([1, 64, 1, 1])The Decline of Manual Pooling
Section titled “The Decline of Manual Pooling”Here’s an interesting trend: modern networks often skip explicit pooling layers and use strided convolutions instead.
Why? A strided convolution (stride=2) also reduces spatial dimensions, but unlike pooling, it’s learnable. The network can decide how to downsample rather than using fixed max/average rules.
# Traditional: Conv + Pooltraditional = nn.Sequential( nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2) # Fixed downsampling)
# Modern: Strided Convmodern = nn.Sequential( nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), # Learnable downsampling nn.ReLU())Both reduce spatial dimensions by 2×, but the strided convolution approach is more flexible.
Receptive Fields: How CNNs See the Big Picture
Section titled “Receptive Fields: How CNNs See the Big Picture”Here’s a crucial concept that trips up beginners: a neuron in a deep layer doesn’t just “see” a 3×3 patch — it sees a much larger region called its receptive field.
Building Intuition
Section titled “Building Intuition”Consider three 3×3 convolutions stacked:
- Layer 1: Each neuron sees a 3×3 region of the input
- Layer 2: Each neuron sees a 3×3 region of Layer 1’s output. But each of those Layer 1 neurons saw 3×3, so effectively Layer 2 sees 5×5 of the original input
- Layer 3: Following the same logic, sees 7×7 of the original input
The Receptive Field Formula
Section titled “The Receptive Field Formula”For convolutions with kernel size k and no stride or dilation:
Receptive Field = 1 + L × (k - 1)Where L is the number of layers.
Example: 5 layers of 3×3 convolutions:
RF = 1 + 5 × (3 - 1) = 1 + 5 × 2 = 11With stride and pooling, receptive fields grow much faster.
Dilated (Atrous) Convolutions: Expanding Receptive Fields Without Downsampling
Section titled “Dilated (Atrous) Convolutions: Expanding Receptive Fields Without Downsampling”Sometimes you need a large receptive field but can’t afford to downsample (common in semantic segmentation where you need full-resolution output). Dilated convolutions solve this by inserting gaps in the kernel:
Standard 3×3 kernel: Dilated 3×3 (dilation=2):[× × ×] [× × ×][× × ×] → 3×3 RF [ ][× × ×] [× × ×] → 5×5 RF [ ] [× × ×]A 3×3 kernel with dilation=2 has a 5×5 receptive field but still only 9 parameters. With dilation=4, you get a 9×9 receptive field.
# Dilated convolution in PyTorchdilated_conv = nn.Conv2d( 64, 64, kernel_size=3, padding=2, # Adjust padding to maintain size dilation=2 # Insert 1 gap between kernel elements)Dilated convolutions are crucial for:
- Semantic segmentation (DeepLab uses them extensively)
- WaveNet (audio generation)
- Any task needing large receptive fields at high resolution
Why This Matters
Section titled “Why This Matters”Receptive fields explain several design choices:
- Deep > Wide: To “see” a 224×224 image, you need enough layers for the receptive field to cover it
- Stacking 3×3: Two 3×3 convolutions = 5×5 receptive field with fewer parameters than one 5×5 kernel
- Global operations: Later layers can make global decisions because they see the whole image
Did You Know? The VGGNet paper (2014) demonstrated that stacking small 3×3 filters is more effective than using larger filters. Three 3×3 layers have the same receptive field as one 7×7 layer but with fewer parameters (3×3×3 = 27 vs 7×7 = 49) and more non-linearities (three ReLUs vs one). This insight influenced almost all subsequent architectures.
The Classic Architectures: A Journey Through History
Section titled “The Classic Architectures: A Journey Through History”Understanding classic architectures gives you the vocabulary to discuss modern ones. Each breakthrough solved a specific problem.
LeNet-5 (1998): The Original CNN
Section titled “LeNet-5 (1998): The Original CNN”Yann LeCun at Bell Labs created LeNet-5 for reading handwritten digits on checks. It’s the blueprint all modern CNNs follow.
Architecture:
Input (32×32 grayscale) ↓Conv1 (6 filters, 5×5) → 28×28×6 ↓Pool1 (2×2) → 14×14×6 ↓Conv2 (16 filters, 5×5) → 10×10×16 ↓Pool2 (2×2) → 5×5×16 ↓Flatten → 400 ↓FC1 → 120 ↓FC2 → 84 ↓Output → 10 (digits 0-9)Key innovations:
- Convolution + pooling pattern
- Local connectivity (not fully connected)
- Learned features instead of hand-crafted
Limitation: Only worked on simple, centered images. Real-world images were still too hard.
AlexNet (2012): The Deep Learning Big Bang
Section titled “AlexNet (2012): The Deep Learning Big Bang”AlexNet won ImageNet 2012 by a huge margin and kickstarted the deep learning revolution.
What made it work:
- Depth: 8 layers (considered “deep” at the time)
- ReLU: Replaced sigmoid/tanh, enabling faster training
- Dropout: Prevented overfitting on 60M parameters
- GPUs: Trained on two GTX 580s (3GB each!) in parallel
- Data augmentation: Translations, reflections, color jittering
- Huge data: 1.2 million training images
Architecture Sketch:
Input: 224×224×3Conv1 (96, 11×11, stride 4) → 55×55×96MaxPool → 27×27×96Conv2 (256, 5×5) → 27×27×256MaxPool → 13×13×256Conv3 (384, 3×3) → 13×13×384Conv4 (384, 3×3) → 13×13×384Conv5 (256, 3×3) → 13×13×256MaxPool → 6×6×256FC1 → 4096 (with dropout)FC2 → 4096 (with dropout)Output → 1000 classesDid You Know? AlexNet was trained for about 6 days on two GPUs. Today, the same model can be trained in under an hour on modern hardware. But here’s the crazy part: the original GPUs only had 3GB of VRAM, so the network was literally split across two GPUs with layers communicating between them. This “model parallelism” was a necessity, not a choice!
VGGNet (2014): The Power of Simplicity
Section titled “VGGNet (2014): The Power of Simplicity”Karen Simonyan and Andrew Zisserman at Oxford asked a simple question: what if we just go deeper with 3×3 convolutions?
The insight: Stack 3×3 convolutions instead of using larger kernels. Two 3×3 = 5×5 receptive field. Three 3×3 = 7×7. Same receptive field, fewer parameters, more non-linearities.
VGG-16 Architecture:
Input: 224×224×3
Block 1: 2× Conv3-64 + Pool → 112×112×64Block 2: 2× Conv3-128 + Pool → 56×56×128Block 3: 3× Conv3-256 + Pool → 28×28×256Block 4: 3× Conv3-512 + Pool → 14×14×512Block 5: 3× Conv3-512 + Pool → 7×7×512
FC1 → 4096FC2 → 4096Output → 1000
Total: 138 million parametersLegacy: VGG’s feature extractor is still widely used for style transfer, perceptual losses, and feature matching because its features are interpretable and hierarchical.
Limitation: 138M parameters is a lot. Most are in the fully-connected layers. Modern networks avoid this.
GoogLeNet/Inception (2014): Going Wider, Not Just Deeper
Section titled “GoogLeNet/Inception (2014): Going Wider, Not Just Deeper”While everyone was stacking layers, Google took a different approach: what if we use multiple filter sizes in parallel?
The Inception Module:
Input ┌───────┼───────┬───────┐ │ │ │ │ 1×1 3×3 5×5 Pool │ │ │ │ └───────┴───────┴───────┘ ConcatEach branch uses a different kernel size, then outputs are concatenated. The network can choose which scale of features to use.
The 1×1 convolution trick: Before the 3×3 and 5×5 convolutions, 1×1 convolutions reduce the channel count, massively reducing computation. A 1×1 convolution is basically a per-pixel fully-connected layer.
Impact: GoogLeNet achieved similar accuracy to VGG with only 4 million parameters (vs VGG’s 138 million). 22 layers deep.
Did You Know? The name “Inception” came from the 2010 movie where characters enter dreams within dreams. The Google team thought it fit perfectly — their module was networks within networks. The famous “we need to go deeper” meme from the movie became an inside joke in the deep learning community during 2014-2015.
ResNet (2015): The Depth Breakthrough
Section titled “ResNet (2015): The Depth Breakthrough”This is the most important architecture for understanding modern deep learning. Kaiming He and colleagues at Microsoft Research solved the degradation problem.
The Problem: When you make networks deeper, training accuracy gets worse, not better. This wasn’t overfitting — training accuracy degraded. Something was fundamentally wrong.
The Solution: Skip connections (residual connections).
Instead of learning H(x), learn F(x) = H(x) - x, then add x back:
Output = F(x) + xThis is called a residual block:
Input (x) │ ┌───┴───┐ │ │ Conv-BN │ │ │ ReLU │ │ │ Conv-BN │ │ │ └───┬───┘ + ←── skip connection │ ReLU │ OutputWhy it works: The skip connection provides a “gradient highway” that lets gradients flow directly through the network. Even if the learned transformation F(x) has vanishing gradients, the skip connection preserves the gradient from the loss.
Impact: ResNets can be 100+ layers deep without degradation. ResNet-152 won ImageNet 2015 with 3.6% error — better than humans (5%)!
Did You Know? ResNets were so influential that the original paper has over 200,000 citations. To put that in perspective, that’s more than any physics paper ever published, including Einstein’s foundational work. The skip connection is one of the most important ideas in deep learning.
The ResNet Family
Section titled “The ResNet Family”import torchimport torchvision.models as models
# Standard ResNets (increasingly deeper)resnet18 = models.resnet18(pretrained=True) # 11M paramsresnet34 = models.resnet34(pretrained=True) # 21M paramsresnet50 = models.resnet50(pretrained=True) # 25M paramsresnet101 = models.resnet101(pretrained=True) # 44M paramsresnet152 = models.resnet152(pretrained=True) # 60M paramsEfficientNet (2019): Optimal Scaling
Section titled “EfficientNet (2019): Optimal Scaling”Google’s EfficientNet asked: what’s the optimal way to scale a network? More layers? More channels? Higher resolution?
The insight: All three should scale together with a compound coefficient.
depth = α^φwidth = β^φresolution = γ^φWhere α, β, γ are found by neural architecture search and φ is the scaling coefficient.
Result: EfficientNet-B0 matches ResNet-50 accuracy with 5× fewer parameters. EfficientNet-B7 achieves state-of-the-art with reasonable compute.
Key technique: EfficientNet uses depthwise separable convolutions (from MobileNet). Instead of one convolution that mixes spatial and channel information together, it uses two steps:
- Depthwise: One filter per input channel (spatial only)
- Pointwise: 1×1 convolution (channel mixing only)
This reduces computation by ~8-9× with minimal accuracy loss. If you see “MBConv” in architecture diagrams, that’s a mobile inverted bottleneck block using this technique.
Building a CNN from Scratch
Section titled “Building a CNN from Scratch”Let’s build a modern CNN architecture incorporating everything we’ve learned:
import torchimport torch.nn as nn
class ConvBlock(nn.Module): """ A standard convolutional block: Conv -> BatchNorm -> ReLU. Optionally includes residual connection. """ def __init__( self, in_channels: int, out_channels: int, kernel_size: int = 3, stride: int = 1, residual: bool = False ): super().__init__() self.residual = residual
# Calculate padding to maintain spatial dimensions (when stride=1) padding = kernel_size // 2
self.conv = nn.Conv2d( in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=False # BatchNorm handles the bias ) self.bn = nn.BatchNorm2d(out_channels) self.relu = nn.ReLU(inplace=True)
# Shortcut for residual connection when dimensions change self.shortcut = nn.Identity() if residual and (in_channels != out_channels or stride != 1): self.shortcut = nn.Sequential( nn.Conv2d(in_channels, out_channels, 1, stride, bias=False), nn.BatchNorm2d(out_channels) )
def forward(self, x): identity = x
out = self.conv(x) out = self.bn(out)
if self.residual: out = out + self.shortcut(identity)
out = self.relu(out) return out
class SimpleCNN(nn.Module): """ A modern CNN incorporating best practices: - BatchNorm after every conv - Residual connections in deeper blocks - Global average pooling instead of flattening - Dropout for regularization """ def __init__(self, num_classes: int = 10, in_channels: int = 3): super().__init__()
# Initial convolution self.stem = ConvBlock(in_channels, 32, kernel_size=3)
# Feature extraction blocks self.block1 = nn.Sequential( ConvBlock(32, 64, stride=2), # Downsample ConvBlock(64, 64, residual=True) )
self.block2 = nn.Sequential( ConvBlock(64, 128, stride=2), # Downsample ConvBlock(128, 128, residual=True), ConvBlock(128, 128, residual=True) )
self.block3 = nn.Sequential( ConvBlock(128, 256, stride=2), # Downsample ConvBlock(256, 256, residual=True), ConvBlock(256, 256, residual=True) )
# Classification head self.global_pool = nn.AdaptiveAvgPool2d(1) self.dropout = nn.Dropout(0.5) self.fc = nn.Linear(256, num_classes)
# Initialize weights properly self._init_weights()
def _init_weights(self): for m in self.modules(): if isinstance(m, nn.Conv2d): nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu') elif isinstance(m, nn.BatchNorm2d): nn.init.constant_(m.weight, 1) nn.init.constant_(m.bias, 0) elif isinstance(m, nn.Linear): nn.init.normal_(m.weight, 0, 0.01) nn.init.constant_(m.bias, 0)
def forward(self, x): # Feature extraction x = self.stem(x) # 32 channels x = self.block1(x) # 64 channels, 1/2 spatial x = self.block2(x) # 128 channels, 1/4 spatial x = self.block3(x) # 256 channels, 1/8 spatial
# Classification x = self.global_pool(x) # 256×1×1 x = x.view(x.size(0), -1) # Flatten to 256 x = self.dropout(x) x = self.fc(x) return x
# Test the architecturemodel = SimpleCNN(num_classes=10, in_channels=3)dummy_input = torch.randn(2, 3, 32, 32) # CIFAR-10 sizeoutput = model(dummy_input)print(f"Input shape: {dummy_input.shape}")print(f"Output shape: {output.shape}")print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")Key design choices explained:
-
bias=Falsein Conv2d: BatchNorm has its own learnable bias (β), so the conv bias is redundant -
Strided convolutions for downsampling: More flexible than max pooling, and we save parameters
-
Global average pooling: Instead of flattening (which would give 256×4×4=4096 values for 32×32 input), we pool to 256 values. This:
- Dramatically reduces parameters
- Works with any input size
- Acts as regularization
-
Residual connections in later blocks: Help gradient flow in deeper parts
-
Kaiming initialization: Proper initialization for ReLU networks
A Note on BatchNorm Placement: There’s an ongoing debate about whether to place BatchNorm before or after ReLU:
# Original ResNet (BN after conv, before ReLU)Conv → BatchNorm → ReLU
# "Pre-activation" ResNet (BN before conv)BatchNorm → ReLU → ConvThe original placement is most common, but pre-activation can work better for very deep networks. In practice, both work well — the difference is usually small. Most pretrained models use the original placement, so stick with that unless you have a specific reason to change.
Transfer Learning: Standing on Giants’ Shoulders
Section titled “Transfer Learning: Standing on Giants’ Shoulders”Training a CNN from scratch requires:
- Millions of labeled images
- Days of GPU time
- Expertise in hyperparameter tuning
Transfer learning says: don’t start from scratch.
The Key Insight
Section titled “The Key Insight”Early layers of CNNs learn general features — edges, textures, patterns. These are useful for almost any image task! Later layers learn task-specific features.
Strategy:
- Take a network pretrained on ImageNet (1.2M images, 1000 classes)
- Remove the final classification layer
- Add your own classification layer
- Fine-tune (optionally freeze early layers)
Did You Know? Transfer learning is so effective that it’s now the default approach for almost all computer vision tasks. In a 2020 study, researchers found that pretrained ImageNet models transfer well to tasks as different as medical imaging, satellite imagery, and art classification — domains with completely different visual statistics than the natural photos in ImageNet. The learned features are surprisingly universal.
Transfer Learning in PyTorch
Section titled “Transfer Learning in PyTorch”import torchimport torch.nn as nnimport torchvision.models as modelsfrom torchvision import transforms
def create_transfer_model(num_classes: int, freeze_backbone: bool = True): """ Create a model using transfer learning from ResNet-18. """ # Load pretrained ResNet-18 model = models.resnet18(weights='IMAGENET1K_V1')
# Freeze backbone if requested (faster training, prevents overfitting) if freeze_backbone: for param in model.parameters(): param.requires_grad = False
# Replace the final fully connected layer # ResNet-18's fc layer is: Linear(512, 1000) # We replace it with our own num_features = model.fc.in_features model.fc = nn.Sequential( nn.Dropout(0.5), nn.Linear(num_features, num_classes) )
return model
# Example: Fine-tune for a 5-class flower classification taskmodel = create_transfer_model(num_classes=5, freeze_backbone=True)
# Check which layers will be trainedtrainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)total_params = sum(p.numel() for p in model.parameters())print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.1f}%)")# Output: Trainable: 2,565 / 11,179,077 (0.0%)With a frozen backbone, you’re only training 2,565 parameters instead of 11 million! This:
- Trains in minutes instead of hours
- Works with tiny datasets (even 100 images)
- Rarely overfits
The Transfer Learning Recipe
Section titled “The Transfer Learning Recipe”For small datasets (<1,000 images per class):
- Freeze the entire backbone
- Train only the new classification head
- Use data augmentation aggressively
For medium datasets (1,000-10,000 per class):
- Start with frozen backbone
- Train classification head until convergence
- Unfreeze last few layers
- Fine-tune with very low learning rate (10× lower than head)
For large datasets (>10,000 per class):
- Initialize with pretrained weights (still helps!)
- Train entire network
- Use lower learning rate for early layers (discriminative learning rates)
Data Augmentation for Transfer Learning
Section titled “Data Augmentation for Transfer Learning”Pretrained models expect ImageNet-like preprocessing. Always include:
# Standard ImageNet normalizationnormalize = transforms.Normalize( mean=[0.485, 0.456, 0.406], # ImageNet statistics std=[0.229, 0.224, 0.225])
# Training transforms (with augmentation)train_transform = transforms.Compose([ transforms.RandomResizedCrop(224), # Crop and resize to 224 transforms.RandomHorizontalFlip(), transforms.RandomRotation(15), transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2), transforms.ToTensor(), normalize])
# Validation transforms (no augmentation)val_transform = transforms.Compose([ transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(), normalize])Memory & Performance Considerations
Section titled “Memory & Performance Considerations”CNNs are memory-intensive. Understanding the costs helps you make practical decisions.
Memory Breakdown
Section titled “Memory Breakdown”For a single forward pass, memory is consumed by:
- Weights: ~4 bytes per parameter (float32)
- Activations: Output of every layer (needed for backprop)
- Gradients: Same size as weights (during training)
ResNet-50 example:
- Weights: 25M params × 4 bytes = 100MB
- Activations at 224×224, batch_size=32: ~2.5GB
- Gradients: ~100MB
Total: ~2.7GB minimum. Mixed precision (FP16) roughly halves this.
Batch Size vs. GPU Memory
Section titled “Batch Size vs. GPU Memory”Activation memory scales linearly with batch size. If batch_size=32 uses 4GB:
- batch_size=16 → ~2.5GB
- batch_size=64 → ~6.5GB
Quick estimation: Measure memory at batch_size=1, then scale linearly.
Common OOM Solutions
Section titled “Common OOM Solutions”- Reduce batch size: First thing to try
- Mixed precision training: Uses FP16 for activations
- Gradient checkpointing: Recompute activations during backward pass
- Smaller input size: 224→192 reduces memory ~25%
- Smaller model: ResNet-18 vs ResNet-50
# Mixed precision training (PyTorch 1.6+)from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for inputs, targets in dataloader: optimizer.zero_grad()
# Forward pass in FP16 with autocast(): outputs = model(inputs) loss = criterion(outputs, targets)
# Backward pass with gradient scaling scaler.scale(loss).backward() scaler.step(optimizer) scaler.update()Input Size Trade-offs
Section titled “Input Size Trade-offs”| Resolution | Memory | Speed | Accuracy |
|---|---|---|---|
| 224×224 | Baseline | Baseline | Baseline |
| 192×192 | 0.73× | 1.36× | -1% |
| 160×160 | 0.51× | 1.96× | -2% |
| 320×320 | 2.04× | 0.49× | +1% |
For prototyping, use smaller resolutions. For final training, use 224 or higher.
When CNNs Fail: Understanding Limitations
Section titled “When CNNs Fail: Understanding Limitations”CNNs aren’t magic. Understanding their failures helps you know when to use alternatives.
The Texture Bias Problem
Section titled “The Texture Bias Problem”Did You Know? In 2019, researchers discovered something disturbing: CNNs classify images primarily by texture, not shape. They showed that a cat with elephant skin texture gets classified as “elephant” by ImageNet-trained CNNs. Humans, by contrast, rely primarily on shape. This explains why CNNs are fooled by adversarial examples — small texture changes that are invisible to humans but completely fool the network.
Limited Receptive Field
Section titled “Limited Receptive Field”CNNs process images through local windows. Very large objects or global relationships can be missed. This is why Vision Transformers (ViT) are gaining popularity — they can attend to the entire image at once.
Lack of Rotation Invariance
Section titled “Lack of Rotation Invariance”CNNs are translation-invariant (a cat anywhere is still a cat) but not rotation-invariant. An upside-down cat looks different to a CNN. Data augmentation helps, but doesn’t fully solve this.
Poor Sample Efficiency
Section titled “Poor Sample Efficiency”CNNs need thousands of examples to learn. Humans can learn from one or two examples (“one-shot learning”). This is an active research area.
The Rise of Vision Transformers
Section titled “The Rise of Vision Transformers”Starting in 2020, Vision Transformers (ViT) began challenging CNN dominance. They:
- Use self-attention instead of convolution
- Can attend to global relationships
- Scale better with data and compute
Did You Know? The original Vision Transformer paper (Dosovitskiy et al., 2020) had a surprising result: ViT underperformed CNNs when trained on ImageNet alone. But when pretrained on larger datasets (JFT-300M with 300 million images), it dramatically outperformed every CNN. The conclusion? Transformers need more data to learn good visual features, but once they have enough, they scale better. This sparked a race to build larger vision datasets and hybrid CNN-transformer architectures like Swin Transformer.
But CNNs aren’t dead! For smaller datasets and edge deployment, CNNs still dominate. The future is likely hybrid architectures that combine CNN’s inductive biases (locality, translation equivariance) with the global attention of transformers.
Common Pitfalls
Section titled “Common Pitfalls”1. Forgetting to Normalize Inputs
Section titled “1. Forgetting to Normalize Inputs”Pretrained models expect ImageNet normalization. Without it:
# Wrong - raw pixel values [0, 255]outputs = model(raw_images) # Garbage predictions
# Wrong - [0, 1] but not normalizedoutputs = model(images / 255.0) # Still wrong
# Correct - ImageNet normalizationnormalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])outputs = model(normalize(images / 255.0)) # Correct!2. Wrong Input Dimensions
Section titled “2. Wrong Input Dimensions”PyTorch expects (batch, channels, height, width). Common mistakes:
# Wrong - missing batch dimensionimage = torch.randn(3, 224, 224)output = model(image) # Error!
# Correct - add batch dimensionoutput = model(image.unsqueeze(0)) # Shape: (1, 3, 224, 224)
# Wrong - channels last (common in numpy/PIL)image = torch.randn(1, 224, 224, 3) # HWC formatoutput = model(image) # Wrong results!
# Correct - convert to channels firstoutput = model(image.permute(0, 3, 1, 2)) # CHW format3. Forgetting model.eval() for Inference
Section titled “3. Forgetting model.eval() for Inference”BatchNorm and Dropout behave differently in training vs evaluation:
# Wrong - using training mode for inferencemodel.train() # Default modepredictions = model(test_images) # BatchNorm uses batch statistics, dropout active
# Correctmodel.eval()with torch.no_grad(): predictions = model(test_images)4. Using the Wrong Pretrained Weights
Section titled “4. Using the Wrong Pretrained Weights”# Wrong - ImageNet V2 weights have different preprocessingmodel = models.resnet50(weights='IMAGENET1K_V2') # Uses different transforms!
# Check what transforms your weights expectweights = models.ResNet50_Weights.IMAGENET1K_V1preprocess = weights.transforms() # Use this!5. Overfitting on Small Datasets
Section titled “5. Overfitting on Small Datasets”With transfer learning on small datasets:
# Wrong - training everything from the startmodel = models.resnet50(weights='IMAGENET1K_V1')optimizer = optim.Adam(model.parameters(), lr=0.001) # All params!
# Correct - freeze backbone, train head firstfor param in model.parameters(): param.requires_grad = Falsemodel.fc = nn.Linear(2048, num_classes)optimizer = optim.Adam(model.fc.parameters(), lr=0.001) # Only headQuiz: Test Your Understanding
Section titled “Quiz: Test Your Understanding”Q1: Why do we use 3×3 kernels instead of larger ones like 7×7?
Answer
Three stacked 3×3 kernels have the same receptive field as one 7×7 kernel, but with fewer parameters (3×9=27 vs 49) and more non-linearities (3 ReLUs vs 1). This makes the network more expressive and easier to train. The VGGNet paper demonstrated this principle in 2014.Q2: Calculate the output shape: Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1) on input (B, 64, 56, 56).
Answer
Output spatial size = floor((56 - 3 + 2×1) / 2) + 1 = floor(55/2) + 1 = 27 + 1 = 28Output shape: (B, 128, 28, 28)
The number of output channels becomes 128, and spatial dimensions are halved (rounded).
Q3: Why does ResNet’s skip connection help training?
Answer
Skip connections provide a "gradient highway" that allows gradients to flow directly from the loss to early layers without being multiplied through many weight matrices. Even if the learned function F(x) has vanishing gradients, the identity mapping x preserves gradient magnitude. Additionally, skip connections make the optimization landscape smoother, making it easier to find good minima.Q4: When should you freeze the backbone in transfer learning?
Answer
Freeze the backbone when: - You have a small dataset (<1000 images per class) - Your images are similar to ImageNet (natural photos) - You want fast training - You're doing initial experimentsUnfreeze gradually when:
- You have more data
- Your images differ significantly from ImageNet (medical images, satellite imagery)
- You’ve already trained the head and want better performance
Q5: Your CNN gets 95% accuracy on training data but only 60% on validation. What’s happening and how do you fix it?
Answer
This is classic **overfitting**. The model memorized training data instead of learning generalizable features.Fixes to try:
- Data augmentation: More variety in training data
- Dropout: Add or increase dropout rate
- Weight decay: Add L2 regularization
- Simpler model: Fewer layers or channels
- Early stopping: Stop before overfitting
- Transfer learning: Use pretrained weights (better features)
- More data: If possible, collect more training examples
Further Reading
Section titled “Further Reading”Papers
Section titled “Papers”- “ImageNet Classification with Deep Convolutional Neural Networks” (Krizhevsky et al., 2012) - The AlexNet paper that started it all
- “Very Deep Convolutional Networks for Large-Scale Image Recognition” (Simonyan & Zisserman, 2014) - VGGNet
- “Deep Residual Learning for Image Recognition” (He et al., 2016) - ResNet
- “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks” (Tan & Le, 2019)
- “ImageNet-trained CNNs are biased towards textures” (Geirhos et al., 2019) - Understanding CNN failures
Online Resources
Section titled “Online Resources”- PyTorch Vision Models - All pretrained models
- Papers With Code - Image Classification - State-of-the-art benchmarks
- CS231n: CNNs for Visual Recognition - Stanford’s famous course
Summary
Section titled “Summary”You’ve learned:
- Convolution: Sliding filters extract features efficiently with parameter sharing
- Pooling: Downsampling adds translation invariance
- Receptive fields: Deep stacks of small kernels see large regions
- Classic architectures: LeNet → AlexNet → VGG → GoogLeNet → ResNet
- Transfer learning: Pretrained features + new head = fast, accurate models
- Practical tips: Normalization, input dimensions, eval mode, memory management
CNNs are the backbone of computer vision. Even as transformers gain ground, understanding CNNs is essential — they’re still the right tool for many tasks and provide the foundation for understanding more advanced architectures.
Next Steps
Section titled “Next Steps”Move on to Module 30: Transformers & Attention where you’ll learn:
- Why “Attention Is All You Need” changed everything
- How self-attention works (and why it’s O(n²))
- Building a transformer from scratch
- Vision Transformers: CNNs’ newest challenger
This is where CNNs and transformers meet — and where you’ll understand why modern AI is what it is.
Last updated: 2025-11-27 Status: Complete