Skip to content

RNNs & Sequence Models

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Or: How AI Learned to See (And Why It Still Can’t Find Waldo)

Section titled “Or: How AI Learned to See (And Why It Still Can’t Find Waldo)”

Reading Time: 6-8 hours Prerequisites: Module 28


By the end of this module, you will:

  • Understand how convolution operations extract features from images
  • Know why CNNs revolutionized computer vision (and the stories behind it)
  • Build custom CNN architectures from scratch
  • Master pooling, striding, and receptive fields
  • Navigate classic architectures (LeNet, AlexNet, VGG, ResNet)
  • Apply transfer learning to get 95%+ accuracy with minimal training
  • Understand when CNNs fail and why transformers are challenging them

The Computer Vision Revolution: From Impossible to Ubiquitous

Section titled “The Computer Vision Revolution: From Impossible to Ubiquitous”

Let’s set the stage with a shocking fact: in 2010, computers were worse at recognizing objects than a four-year-old child.

The ImageNet challenge — a competition to classify images into 1,000 categories like “golden retriever,” “volcano,” and “espresso” — had error rates around 25%. That means one in four images was misclassified. For reference, humans make mistakes about 5% of the time.

Then, in 2012, everything changed.

Did You Know? At the 2012 ImageNet competition, a team from the University of Toronto submitted an entry called “AlexNet.” It won by a landslide — 15% error rate versus 26% for the runner-up. This wasn’t just winning; it was demolishing the competition by the largest margin in the contest’s history. The winning team? Alex Krizhevsky, Ilya Sutskever (later co-founder of OpenAI), and Geoffrey Hinton. This single result is often credited with starting the modern AI boom.

What made AlexNet so special? It was a convolutional neural network — a type of neural network specifically designed to process images. CNNs weren’t new in 2012 (they were invented in the 1980s), but AlexNet was the first to combine them with GPUs and massive datasets. The result changed the world.

Today, CNNs power:

  • Your phone’s camera: Face detection, portrait mode, night sight
  • Self-driving cars: Lane detection, obstacle recognition
  • Medical imaging: Cancer detection, X-ray analysis
  • Social media: Content moderation, auto-tagging
  • Security systems: Facial recognition, anomaly detection

Let’s understand how they work.


The Biological Inspiration: How Vision Actually Works

Section titled “The Biological Inspiration: How Vision Actually Works”

Before we dive into the math, let’s understand why CNNs look the way they do. They’re directly inspired by how the human visual cortex works.

In 1959, David Hubel and Torsten Wiesel did something that sounds like mad science: they inserted electrodes into a cat’s brain and showed it different images. What they discovered won them the Nobel Prize in 1981.

Did You Know? Hubel and Wiesel discovered that neurons in the visual cortex respond to very specific patterns. Some neurons fire only when they see a vertical line. Others respond to horizontal lines. Some detect edges at 45-degree angles. The visual system doesn’t try to process the whole image at once — it builds understanding from simple features. This hierarchical processing is exactly what CNNs replicate.

They found two key types of cells:

  1. Simple cells: Respond to edges at specific orientations in specific locations
  2. Complex cells: Respond to the same features but are more tolerant of position

This is exactly what CNNs do:

  • Convolutional layers: Detect simple features (like edge detectors)
  • Pooling layers: Add position tolerance (like complex cells)
  • Deeper layers: Combine simple features into complex ones

The analogy isn’t perfect, but it’s striking how well the architecture maps onto biology.


From Pixels to Features: The Convolution Operation

Section titled “From Pixels to Features: The Convolution Operation”

Here’s the fundamental problem with using regular neural networks on images:

An image is 224×224×3 pixels (RGB). That’s 150,528 input values. If your first hidden layer has 1,000 neurons, you need 150 million weights — just for the first layer! This is:

  1. Computationally insane: Too many parameters to train
  2. Wasteful: Most of those connections don’t matter
  3. Missing the point: Images have spatial structure that fully-connected networks ignore

Convolutions solve all three problems elegantly.

What Is Convolution? (The Intuitive Version)

Section titled “What Is Convolution? (The Intuitive Version)”

Imagine you’re looking for a specific pattern in an image — say, vertical edges. You have a small “stencil” that knows what vertical edges look like. You slide this stencil across every position in the image, and at each position, you compute how well the image matches the stencil.

The stencil is called a kernel or filter. The sliding operation is called convolution.

Here’s a concrete example. This 3×3 kernel detects vertical edges:

Vertical Edge Detector:
[-1 0 1]
[-1 0 1]
[-1 0 1]

When you slide this over an image:

  • Where there’s a vertical edge (dark on left, light on right), you get a high positive value
  • Where there’s no edge or a horizontal edge, you get near zero
  • Where there’s a reverse edge (light on left, dark on right), you get a negative value

The brilliant insight is that the same kernel is applied everywhere in the image. This means:

  1. Parameter sharing: A 3×3 kernel has only 9 parameters, not millions
  2. Translation equivariance: A cat in the corner uses the same features as a cat in the center
  3. Hierarchy: Stack many layers to detect increasingly complex patterns

Let’s make this concrete. For a single-channel image (grayscale), convolution computes:

Output[i,j] = Σ Σ Input[i+m, j+n] × Kernel[m, n]
m n

Where the sums run over the kernel dimensions.

Worked Example: Let’s apply a 3×3 kernel to a 5×5 image:

Input (5×5): Kernel (3×3):
[1 2 3 4 5] [1 0 -1]
[6 7 8 9 10] [1 0 -1]
[11 12 13 14 15] [1 0 -1]
[16 17 18 19 20]
[21 22 23 24 25]
Computing Output[0,0] (top-left corner of output):
= 1×1 + 2×0 + 3×(-1) + 6×1 + 7×0 + 8×(-1) + 11×1 + 12×0 + 13×(-1)
= 1 + 0 - 3 + 6 + 0 - 8 + 11 + 0 - 13
= -6
Computing Output[0,1] (slide right by 1):
= 2×1 + 3×0 + 4×(-1) + 7×1 + 8×0 + 9×(-1) + 12×1 + 13×0 + 14×(-1)
= 2 + 0 - 4 + 7 + 0 - 9 + 12 + 0 - 14
= -6

Notice the output is consistent because the input has no edges (just a uniform gradient).

Output Size: The Formula You’ll Use Daily

Section titled “Output Size: The Formula You’ll Use Daily”

The output size depends on several parameters:

Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1

Let’s break this down:

  • Input: The input dimension (e.g., 224)
  • Kernel: The kernel size (e.g., 3)
  • Padding: Extra zeros around the border (e.g., 1)
  • Stride: How many pixels to skip between applications (e.g., 1)

Common configurations:

  • kernel=3, padding=1, stride=1: Output same size as input (most common)
  • kernel=3, padding=0, stride=1: Output shrinks by 2 pixels each side
  • kernel=3, padding=1, stride=2: Output is half the input size (downsampling)

Worked Example:

Input: 224×224
Conv2d(kernel=3, padding=1, stride=2)
Output = (224 - 3 + 2×1) / 2 + 1 = (224 - 3 + 2) / 2 + 1 = 223/2 + 1 = 111.5 + 1 = 112
Output: 112×112

Here’s how convolution looks in code. We’ll build intuition by comparing with manual computation:

import torch
import torch.nn as nn
# Create a simple 5x5 image (batch_size=1, channels=1)
image = torch.tensor([
[1., 2., 3., 4., 5.],
[6., 7., 8., 9., 10.],
[11., 12., 13., 14., 15.],
[16., 17., 18., 19., 20.],
[21., 22., 23., 24., 25.]
]).unsqueeze(0).unsqueeze(0) # Shape: (1, 1, 5, 5)
# Create a vertical edge detector
conv = nn.Conv2d(
in_channels=1,
out_channels=1,
kernel_size=3,
padding=0, # No padding, output will be 3x3
bias=False
)
# Set the kernel manually
with torch.no_grad():
conv.weight = nn.Parameter(torch.tensor([
[[[ 1., 0., -1.],
[ 1., 0., -1.],
[ 1., 0., -1.]]]
]))
output = conv(image)
print(output.shape) # torch.Size([1, 1, 3, 3])
print(output[0, 0]) # The 3x3 output

Notice how PyTorch handles all the sliding and summation for you. The key parameters are:

  • in_channels: Number of input channels (1 for grayscale, 3 for RGB)
  • out_channels: Number of filters (each learns a different feature)
  • kernel_size: Size of the filter (3×3 is most common)
  • padding: Zeros to add around the border
  • stride: How many pixels to skip

Multiple Filters: Learning Different Features

Section titled “Multiple Filters: Learning Different Features”

A single kernel detects one pattern. To detect many patterns, we use multiple filters:

# 32 filters, each learning a different feature
conv_block = nn.Conv2d(
in_channels=3, # RGB input
out_channels=32, # 32 different filters
kernel_size=3,
padding=1
)
rgb_image = torch.randn(1, 3, 224, 224) # Batch of 1 RGB image
features = conv_block(rgb_image)
print(features.shape) # torch.Size([1, 32, 224, 224])

Each of those 32 output channels represents a different “feature map” — one might detect vertical edges, another horizontal, another corners, etc. The network learns what features to detect during training.

Did You Know? When researchers visualize what early convolutional layers learn, they consistently find edge detectors, color blob detectors, and gradient detectors — remarkably similar to what Hubel and Wiesel found in cats’ brains. The network “rediscovers” the visual features that evolution took millions of years to develop.


Pooling: Making Networks Translation-Invariant

Section titled “Pooling: Making Networks Translation-Invariant”

You’ve detected that there’s a cat in the image. Do you care if it’s 3 pixels to the left? No! You want your network to be robust to small shifts in position.

This is what pooling provides.

Max pooling is simple: divide the feature map into regions and keep only the maximum value in each region.

Input (4×4): Max Pool 2×2, stride 2:
[1 3 2 4]
[5 6 7 8] → [6 8]
[9 1 3 4] [9 7]
[2 3 7 1]
Top-left output: max(1,3,5,6) = 6
Top-right output: max(2,4,7,8) = 8
Bottom-left output: max(9,1,2,3) = 9
Bottom-right output: max(3,4,7,1) = 7

Max pooling provides:

  1. Downsampling: Reduces spatial dimensions (2× with 2×2 pooling)
  2. Translation invariance: Small shifts don’t change the output
  3. Computation savings: Fewer values to process in later layers
  4. Feature selection: Keeps the strongest activation in each region

Average pooling takes the mean instead of the maximum:

Same input with Average Pool 2×2:
Top-left: (1+3+5+6)/4 = 3.75
Top-right: (2+4+7+8)/4 = 5.25
...

Average pooling is smoother and doesn’t throw away information as aggressively. It’s often used at the very end of networks (Global Average Pooling) to reduce feature maps to a single value per channel.

# Max pooling: keep strongest activation
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Average pooling: take the mean
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling: reduce entire feature map to single value
# Usually done with nn.AdaptiveAvgPool2d(1)
gap = nn.AdaptiveAvgPool2d(1) # Output is 1×1 regardless of input size
features = torch.randn(1, 64, 14, 14) # Some feature map
pooled = gap(features)
print(pooled.shape) # torch.Size([1, 64, 1, 1])

Here’s an interesting trend: modern networks often skip explicit pooling layers and use strided convolutions instead.

Why? A strided convolution (stride=2) also reduces spatial dimensions, but unlike pooling, it’s learnable. The network can decide how to downsample rather than using fixed max/average rules.

# Traditional: Conv + Pool
traditional = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2) # Fixed downsampling
)
# Modern: Strided Conv
modern = nn.Sequential(
nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), # Learnable downsampling
nn.ReLU()
)

Both reduce spatial dimensions by 2×, but the strided convolution approach is more flexible.


Receptive Fields: How CNNs See the Big Picture

Section titled “Receptive Fields: How CNNs See the Big Picture”

Here’s a crucial concept that trips up beginners: a neuron in a deep layer doesn’t just “see” a 3×3 patch — it sees a much larger region called its receptive field.

Consider three 3×3 convolutions stacked:

  • Layer 1: Each neuron sees a 3×3 region of the input
  • Layer 2: Each neuron sees a 3×3 region of Layer 1’s output. But each of those Layer 1 neurons saw 3×3, so effectively Layer 2 sees 5×5 of the original input
  • Layer 3: Following the same logic, sees 7×7 of the original input

For convolutions with kernel size k and no stride or dilation:

Receptive Field = 1 + L × (k - 1)

Where L is the number of layers.

Example: 5 layers of 3×3 convolutions:

RF = 1 + 5 × (3 - 1) = 1 + 5 × 2 = 11

With stride and pooling, receptive fields grow much faster.

Dilated (Atrous) Convolutions: Expanding Receptive Fields Without Downsampling

Section titled “Dilated (Atrous) Convolutions: Expanding Receptive Fields Without Downsampling”

Sometimes you need a large receptive field but can’t afford to downsample (common in semantic segmentation where you need full-resolution output). Dilated convolutions solve this by inserting gaps in the kernel:

Standard 3×3 kernel: Dilated 3×3 (dilation=2):
[× × ×] [× × ×]
[× × ×] → 3×3 RF [ ]
[× × ×] [× × ×] → 5×5 RF
[ ]
[× × ×]

A 3×3 kernel with dilation=2 has a 5×5 receptive field but still only 9 parameters. With dilation=4, you get a 9×9 receptive field.

# Dilated convolution in PyTorch
dilated_conv = nn.Conv2d(
64, 64, kernel_size=3,
padding=2, # Adjust padding to maintain size
dilation=2 # Insert 1 gap between kernel elements
)

Dilated convolutions are crucial for:

  • Semantic segmentation (DeepLab uses them extensively)
  • WaveNet (audio generation)
  • Any task needing large receptive fields at high resolution

Receptive fields explain several design choices:

  1. Deep > Wide: To “see” a 224×224 image, you need enough layers for the receptive field to cover it
  2. Stacking 3×3: Two 3×3 convolutions = 5×5 receptive field with fewer parameters than one 5×5 kernel
  3. Global operations: Later layers can make global decisions because they see the whole image

Did You Know? The VGGNet paper (2014) demonstrated that stacking small 3×3 filters is more effective than using larger filters. Three 3×3 layers have the same receptive field as one 7×7 layer but with fewer parameters (3×3×3 = 27 vs 7×7 = 49) and more non-linearities (three ReLUs vs one). This insight influenced almost all subsequent architectures.


The Classic Architectures: A Journey Through History

Section titled “The Classic Architectures: A Journey Through History”

Understanding classic architectures gives you the vocabulary to discuss modern ones. Each breakthrough solved a specific problem.

Yann LeCun at Bell Labs created LeNet-5 for reading handwritten digits on checks. It’s the blueprint all modern CNNs follow.

Architecture:

Input (32×32 grayscale)
Conv1 (6 filters, 5×5) → 28×28×6
Pool1 (2×2) → 14×14×6
Conv2 (16 filters, 5×5) → 10×10×16
Pool2 (2×2) → 5×5×16
Flatten → 400
FC1 → 120
FC2 → 84
Output → 10 (digits 0-9)

Key innovations:

  • Convolution + pooling pattern
  • Local connectivity (not fully connected)
  • Learned features instead of hand-crafted

Limitation: Only worked on simple, centered images. Real-world images were still too hard.

AlexNet (2012): The Deep Learning Big Bang

Section titled “AlexNet (2012): The Deep Learning Big Bang”

AlexNet won ImageNet 2012 by a huge margin and kickstarted the deep learning revolution.

What made it work:

  1. Depth: 8 layers (considered “deep” at the time)
  2. ReLU: Replaced sigmoid/tanh, enabling faster training
  3. Dropout: Prevented overfitting on 60M parameters
  4. GPUs: Trained on two GTX 580s (3GB each!) in parallel
  5. Data augmentation: Translations, reflections, color jittering
  6. Huge data: 1.2 million training images

Architecture Sketch:

Input: 224×224×3
Conv1 (96, 11×11, stride 4) → 55×55×96
MaxPool → 27×27×96
Conv2 (256, 5×5) → 27×27×256
MaxPool → 13×13×256
Conv3 (384, 3×3) → 13×13×384
Conv4 (384, 3×3) → 13×13×384
Conv5 (256, 3×3) → 13×13×256
MaxPool → 6×6×256
FC1 → 4096 (with dropout)
FC2 → 4096 (with dropout)
Output → 1000 classes

Did You Know? AlexNet was trained for about 6 days on two GPUs. Today, the same model can be trained in under an hour on modern hardware. But here’s the crazy part: the original GPUs only had 3GB of VRAM, so the network was literally split across two GPUs with layers communicating between them. This “model parallelism” was a necessity, not a choice!

Karen Simonyan and Andrew Zisserman at Oxford asked a simple question: what if we just go deeper with 3×3 convolutions?

The insight: Stack 3×3 convolutions instead of using larger kernels. Two 3×3 = 5×5 receptive field. Three 3×3 = 7×7. Same receptive field, fewer parameters, more non-linearities.

VGG-16 Architecture:

Input: 224×224×3
Block 1: 2× Conv3-64 + Pool → 112×112×64
Block 2: 2× Conv3-128 + Pool → 56×56×128
Block 3: 3× Conv3-256 + Pool → 28×28×256
Block 4: 3× Conv3-512 + Pool → 14×14×512
Block 5: 3× Conv3-512 + Pool → 7×7×512
FC1 → 4096
FC2 → 4096
Output → 1000
Total: 138 million parameters

Legacy: VGG’s feature extractor is still widely used for style transfer, perceptual losses, and feature matching because its features are interpretable and hierarchical.

Limitation: 138M parameters is a lot. Most are in the fully-connected layers. Modern networks avoid this.

GoogLeNet/Inception (2014): Going Wider, Not Just Deeper

Section titled “GoogLeNet/Inception (2014): Going Wider, Not Just Deeper”

While everyone was stacking layers, Google took a different approach: what if we use multiple filter sizes in parallel?

The Inception Module:

Input
┌───────┼───────┬───────┐
│ │ │ │
1×1 3×3 5×5 Pool
│ │ │ │
└───────┴───────┴───────┘
Concat

Each branch uses a different kernel size, then outputs are concatenated. The network can choose which scale of features to use.

The 1×1 convolution trick: Before the 3×3 and 5×5 convolutions, 1×1 convolutions reduce the channel count, massively reducing computation. A 1×1 convolution is basically a per-pixel fully-connected layer.

Impact: GoogLeNet achieved similar accuracy to VGG with only 4 million parameters (vs VGG’s 138 million). 22 layers deep.

Did You Know? The name “Inception” came from the 2010 movie where characters enter dreams within dreams. The Google team thought it fit perfectly — their module was networks within networks. The famous “we need to go deeper” meme from the movie became an inside joke in the deep learning community during 2014-2015.

This is the most important architecture for understanding modern deep learning. Kaiming He and colleagues at Microsoft Research solved the degradation problem.

The Problem: When you make networks deeper, training accuracy gets worse, not better. This wasn’t overfitting — training accuracy degraded. Something was fundamentally wrong.

The Solution: Skip connections (residual connections).

Instead of learning H(x), learn F(x) = H(x) - x, then add x back:

Output = F(x) + x

This is called a residual block:

Input (x)
┌───┴───┐
│ │
Conv-BN │
│ │
ReLU │
│ │
Conv-BN │
│ │
└───┬───┘
+ ←── skip connection
ReLU
Output

Why it works: The skip connection provides a “gradient highway” that lets gradients flow directly through the network. Even if the learned transformation F(x) has vanishing gradients, the skip connection preserves the gradient from the loss.

Impact: ResNets can be 100+ layers deep without degradation. ResNet-152 won ImageNet 2015 with 3.6% error — better than humans (5%)!

Did You Know? ResNets were so influential that the original paper has over 200,000 citations. To put that in perspective, that’s more than any physics paper ever published, including Einstein’s foundational work. The skip connection is one of the most important ideas in deep learning.

import torch
import torchvision.models as models
# Standard ResNets (increasingly deeper)
resnet18 = models.resnet18(pretrained=True) # 11M params
resnet34 = models.resnet34(pretrained=True) # 21M params
resnet50 = models.resnet50(pretrained=True) # 25M params
resnet101 = models.resnet101(pretrained=True) # 44M params
resnet152 = models.resnet152(pretrained=True) # 60M params

Google’s EfficientNet asked: what’s the optimal way to scale a network? More layers? More channels? Higher resolution?

The insight: All three should scale together with a compound coefficient.

depth = α^φ
width = β^φ
resolution = γ^φ

Where α, β, γ are found by neural architecture search and φ is the scaling coefficient.

Result: EfficientNet-B0 matches ResNet-50 accuracy with 5× fewer parameters. EfficientNet-B7 achieves state-of-the-art with reasonable compute.

Key technique: EfficientNet uses depthwise separable convolutions (from MobileNet). Instead of one convolution that mixes spatial and channel information together, it uses two steps:

  1. Depthwise: One filter per input channel (spatial only)
  2. Pointwise: 1×1 convolution (channel mixing only)

This reduces computation by ~8-9× with minimal accuracy loss. If you see “MBConv” in architecture diagrams, that’s a mobile inverted bottleneck block using this technique.


Let’s build a modern CNN architecture incorporating everything we’ve learned:

import torch
import torch.nn as nn
class ConvBlock(nn.Module):
"""
A standard convolutional block: Conv -> BatchNorm -> ReLU.
Optionally includes residual connection.
"""
def __init__(
self,
in_channels: int,
out_channels: int,
kernel_size: int = 3,
stride: int = 1,
residual: bool = False
):
super().__init__()
self.residual = residual
# Calculate padding to maintain spatial dimensions (when stride=1)
padding = kernel_size // 2
self.conv = nn.Conv2d(
in_channels, out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
bias=False # BatchNorm handles the bias
)
self.bn = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU(inplace=True)
# Shortcut for residual connection when dimensions change
self.shortcut = nn.Identity()
if residual and (in_channels != out_channels or stride != 1):
self.shortcut = nn.Sequential(
nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
nn.BatchNorm2d(out_channels)
)
def forward(self, x):
identity = x
out = self.conv(x)
out = self.bn(out)
if self.residual:
out = out + self.shortcut(identity)
out = self.relu(out)
return out
class SimpleCNN(nn.Module):
"""
A modern CNN incorporating best practices:
- BatchNorm after every conv
- Residual connections in deeper blocks
- Global average pooling instead of flattening
- Dropout for regularization
"""
def __init__(self, num_classes: int = 10, in_channels: int = 3):
super().__init__()
# Initial convolution
self.stem = ConvBlock(in_channels, 32, kernel_size=3)
# Feature extraction blocks
self.block1 = nn.Sequential(
ConvBlock(32, 64, stride=2), # Downsample
ConvBlock(64, 64, residual=True)
)
self.block2 = nn.Sequential(
ConvBlock(64, 128, stride=2), # Downsample
ConvBlock(128, 128, residual=True),
ConvBlock(128, 128, residual=True)
)
self.block3 = nn.Sequential(
ConvBlock(128, 256, stride=2), # Downsample
ConvBlock(256, 256, residual=True),
ConvBlock(256, 256, residual=True)
)
# Classification head
self.global_pool = nn.AdaptiveAvgPool2d(1)
self.dropout = nn.Dropout(0.5)
self.fc = nn.Linear(256, num_classes)
# Initialize weights properly
self._init_weights()
def _init_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.normal_(m.weight, 0, 0.01)
nn.init.constant_(m.bias, 0)
def forward(self, x):
# Feature extraction
x = self.stem(x) # 32 channels
x = self.block1(x) # 64 channels, 1/2 spatial
x = self.block2(x) # 128 channels, 1/4 spatial
x = self.block3(x) # 256 channels, 1/8 spatial
# Classification
x = self.global_pool(x) # 256×1×1
x = x.view(x.size(0), -1) # Flatten to 256
x = self.dropout(x)
x = self.fc(x)
return x
# Test the architecture
model = SimpleCNN(num_classes=10, in_channels=3)
dummy_input = torch.randn(2, 3, 32, 32) # CIFAR-10 size
output = model(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

Key design choices explained:

  1. bias=False in Conv2d: BatchNorm has its own learnable bias (β), so the conv bias is redundant

  2. Strided convolutions for downsampling: More flexible than max pooling, and we save parameters

  3. Global average pooling: Instead of flattening (which would give 256×4×4=4096 values for 32×32 input), we pool to 256 values. This:

    • Dramatically reduces parameters
    • Works with any input size
    • Acts as regularization
  4. Residual connections in later blocks: Help gradient flow in deeper parts

  5. Kaiming initialization: Proper initialization for ReLU networks

A Note on BatchNorm Placement: There’s an ongoing debate about whether to place BatchNorm before or after ReLU:

# Original ResNet (BN after conv, before ReLU)
Conv → BatchNorm → ReLU
# "Pre-activation" ResNet (BN before conv)
BatchNorm → ReLU → Conv

The original placement is most common, but pre-activation can work better for very deep networks. In practice, both work well — the difference is usually small. Most pretrained models use the original placement, so stick with that unless you have a specific reason to change.


Transfer Learning: Standing on Giants’ Shoulders

Section titled “Transfer Learning: Standing on Giants’ Shoulders”

Training a CNN from scratch requires:

  • Millions of labeled images
  • Days of GPU time
  • Expertise in hyperparameter tuning

Transfer learning says: don’t start from scratch.

Early layers of CNNs learn general features — edges, textures, patterns. These are useful for almost any image task! Later layers learn task-specific features.

Strategy:

  1. Take a network pretrained on ImageNet (1.2M images, 1000 classes)
  2. Remove the final classification layer
  3. Add your own classification layer
  4. Fine-tune (optionally freeze early layers)

Did You Know? Transfer learning is so effective that it’s now the default approach for almost all computer vision tasks. In a 2020 study, researchers found that pretrained ImageNet models transfer well to tasks as different as medical imaging, satellite imagery, and art classification — domains with completely different visual statistics than the natural photos in ImageNet. The learned features are surprisingly universal.

import torch
import torch.nn as nn
import torchvision.models as models
from torchvision import transforms
def create_transfer_model(num_classes: int, freeze_backbone: bool = True):
"""
Create a model using transfer learning from ResNet-18.
"""
# Load pretrained ResNet-18
model = models.resnet18(weights='IMAGENET1K_V1')
# Freeze backbone if requested (faster training, prevents overfitting)
if freeze_backbone:
for param in model.parameters():
param.requires_grad = False
# Replace the final fully connected layer
# ResNet-18's fc layer is: Linear(512, 1000)
# We replace it with our own
num_features = model.fc.in_features
model.fc = nn.Sequential(
nn.Dropout(0.5),
nn.Linear(num_features, num_classes)
)
return model
# Example: Fine-tune for a 5-class flower classification task
model = create_transfer_model(num_classes=5, freeze_backbone=True)
# Check which layers will be trained
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.1f}%)")
# Output: Trainable: 2,565 / 11,179,077 (0.0%)

With a frozen backbone, you’re only training 2,565 parameters instead of 11 million! This:

  • Trains in minutes instead of hours
  • Works with tiny datasets (even 100 images)
  • Rarely overfits

For small datasets (<1,000 images per class):

  1. Freeze the entire backbone
  2. Train only the new classification head
  3. Use data augmentation aggressively

For medium datasets (1,000-10,000 per class):

  1. Start with frozen backbone
  2. Train classification head until convergence
  3. Unfreeze last few layers
  4. Fine-tune with very low learning rate (10× lower than head)

For large datasets (>10,000 per class):

  1. Initialize with pretrained weights (still helps!)
  2. Train entire network
  3. Use lower learning rate for early layers (discriminative learning rates)

Pretrained models expect ImageNet-like preprocessing. Always include:

# Standard ImageNet normalization
normalize = transforms.Normalize(
mean=[0.485, 0.456, 0.406], # ImageNet statistics
std=[0.229, 0.224, 0.225]
)
# Training transforms (with augmentation)
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224), # Crop and resize to 224
transforms.RandomHorizontalFlip(),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
transforms.ToTensor(),
normalize
])
# Validation transforms (no augmentation)
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
normalize
])

CNNs are memory-intensive. Understanding the costs helps you make practical decisions.

For a single forward pass, memory is consumed by:

  1. Weights: ~4 bytes per parameter (float32)
  2. Activations: Output of every layer (needed for backprop)
  3. Gradients: Same size as weights (during training)

ResNet-50 example:

  • Weights: 25M params × 4 bytes = 100MB
  • Activations at 224×224, batch_size=32: ~2.5GB
  • Gradients: ~100MB

Total: ~2.7GB minimum. Mixed precision (FP16) roughly halves this.

Activation memory scales linearly with batch size. If batch_size=32 uses 4GB:

  • batch_size=16 → ~2.5GB
  • batch_size=64 → ~6.5GB

Quick estimation: Measure memory at batch_size=1, then scale linearly.

  1. Reduce batch size: First thing to try
  2. Mixed precision training: Uses FP16 for activations
  3. Gradient checkpointing: Recompute activations during backward pass
  4. Smaller input size: 224→192 reduces memory ~25%
  5. Smaller model: ResNet-18 vs ResNet-50
# Mixed precision training (PyTorch 1.6+)
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for inputs, targets in dataloader:
optimizer.zero_grad()
# Forward pass in FP16
with autocast():
outputs = model(inputs)
loss = criterion(outputs, targets)
# Backward pass with gradient scaling
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
ResolutionMemorySpeedAccuracy
224×224BaselineBaselineBaseline
192×1920.73×1.36×-1%
160×1600.51×1.96×-2%
320×3202.04×0.49×+1%

For prototyping, use smaller resolutions. For final training, use 224 or higher.


CNNs aren’t magic. Understanding their failures helps you know when to use alternatives.

Did You Know? In 2019, researchers discovered something disturbing: CNNs classify images primarily by texture, not shape. They showed that a cat with elephant skin texture gets classified as “elephant” by ImageNet-trained CNNs. Humans, by contrast, rely primarily on shape. This explains why CNNs are fooled by adversarial examples — small texture changes that are invisible to humans but completely fool the network.

CNNs process images through local windows. Very large objects or global relationships can be missed. This is why Vision Transformers (ViT) are gaining popularity — they can attend to the entire image at once.

CNNs are translation-invariant (a cat anywhere is still a cat) but not rotation-invariant. An upside-down cat looks different to a CNN. Data augmentation helps, but doesn’t fully solve this.

CNNs need thousands of examples to learn. Humans can learn from one or two examples (“one-shot learning”). This is an active research area.

Starting in 2020, Vision Transformers (ViT) began challenging CNN dominance. They:

  • Use self-attention instead of convolution
  • Can attend to global relationships
  • Scale better with data and compute

Did You Know? The original Vision Transformer paper (Dosovitskiy et al., 2020) had a surprising result: ViT underperformed CNNs when trained on ImageNet alone. But when pretrained on larger datasets (JFT-300M with 300 million images), it dramatically outperformed every CNN. The conclusion? Transformers need more data to learn good visual features, but once they have enough, they scale better. This sparked a race to build larger vision datasets and hybrid CNN-transformer architectures like Swin Transformer.

But CNNs aren’t dead! For smaller datasets and edge deployment, CNNs still dominate. The future is likely hybrid architectures that combine CNN’s inductive biases (locality, translation equivariance) with the global attention of transformers.


Pretrained models expect ImageNet normalization. Without it:

# Wrong - raw pixel values [0, 255]
outputs = model(raw_images) # Garbage predictions
# Wrong - [0, 1] but not normalized
outputs = model(images / 255.0) # Still wrong
# Correct - ImageNet normalization
normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
outputs = model(normalize(images / 255.0)) # Correct!

PyTorch expects (batch, channels, height, width). Common mistakes:

# Wrong - missing batch dimension
image = torch.randn(3, 224, 224)
output = model(image) # Error!
# Correct - add batch dimension
output = model(image.unsqueeze(0)) # Shape: (1, 3, 224, 224)
# Wrong - channels last (common in numpy/PIL)
image = torch.randn(1, 224, 224, 3) # HWC format
output = model(image) # Wrong results!
# Correct - convert to channels first
output = model(image.permute(0, 3, 1, 2)) # CHW format

BatchNorm and Dropout behave differently in training vs evaluation:

# Wrong - using training mode for inference
model.train() # Default mode
predictions = model(test_images) # BatchNorm uses batch statistics, dropout active
# Correct
model.eval()
with torch.no_grad():
predictions = model(test_images)
# Wrong - ImageNet V2 weights have different preprocessing
model = models.resnet50(weights='IMAGENET1K_V2') # Uses different transforms!
# Check what transforms your weights expect
weights = models.ResNet50_Weights.IMAGENET1K_V1
preprocess = weights.transforms() # Use this!

With transfer learning on small datasets:

# Wrong - training everything from the start
model = models.resnet50(weights='IMAGENET1K_V1')
optimizer = optim.Adam(model.parameters(), lr=0.001) # All params!
# Correct - freeze backbone, train head first
for param in model.parameters():
param.requires_grad = False
model.fc = nn.Linear(2048, num_classes)
optimizer = optim.Adam(model.fc.parameters(), lr=0.001) # Only head

Q1: Why do we use 3×3 kernels instead of larger ones like 7×7?

Answer Three stacked 3×3 kernels have the same receptive field as one 7×7 kernel, but with fewer parameters (3×9=27 vs 49) and more non-linearities (3 ReLUs vs 1). This makes the network more expressive and easier to train. The VGGNet paper demonstrated this principle in 2014.

Q2: Calculate the output shape: Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1) on input (B, 64, 56, 56).

Answer Output spatial size = floor((56 - 3 + 2×1) / 2) + 1 = floor(55/2) + 1 = 27 + 1 = 28

Output shape: (B, 128, 28, 28)

The number of output channels becomes 128, and spatial dimensions are halved (rounded).

Q3: Why does ResNet’s skip connection help training?

Answer Skip connections provide a "gradient highway" that allows gradients to flow directly from the loss to early layers without being multiplied through many weight matrices. Even if the learned function F(x) has vanishing gradients, the identity mapping x preserves gradient magnitude. Additionally, skip connections make the optimization landscape smoother, making it easier to find good minima.

Q4: When should you freeze the backbone in transfer learning?

Answer Freeze the backbone when: - You have a small dataset (<1000 images per class) - Your images are similar to ImageNet (natural photos) - You want fast training - You're doing initial experiments

Unfreeze gradually when:

  • You have more data
  • Your images differ significantly from ImageNet (medical images, satellite imagery)
  • You’ve already trained the head and want better performance

Q5: Your CNN gets 95% accuracy on training data but only 60% on validation. What’s happening and how do you fix it?

Answer This is classic **overfitting**. The model memorized training data instead of learning generalizable features.

Fixes to try:

  1. Data augmentation: More variety in training data
  2. Dropout: Add or increase dropout rate
  3. Weight decay: Add L2 regularization
  4. Simpler model: Fewer layers or channels
  5. Early stopping: Stop before overfitting
  6. Transfer learning: Use pretrained weights (better features)
  7. More data: If possible, collect more training examples

  1. “ImageNet Classification with Deep Convolutional Neural Networks” (Krizhevsky et al., 2012) - The AlexNet paper that started it all
  2. “Very Deep Convolutional Networks for Large-Scale Image Recognition” (Simonyan & Zisserman, 2014) - VGGNet
  3. “Deep Residual Learning for Image Recognition” (He et al., 2016) - ResNet
  4. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks” (Tan & Le, 2019)
  5. “ImageNet-trained CNNs are biased towards textures” (Geirhos et al., 2019) - Understanding CNN failures

You’ve learned:

  1. Convolution: Sliding filters extract features efficiently with parameter sharing
  2. Pooling: Downsampling adds translation invariance
  3. Receptive fields: Deep stacks of small kernels see large regions
  4. Classic architectures: LeNet → AlexNet → VGG → GoogLeNet → ResNet
  5. Transfer learning: Pretrained features + new head = fast, accurate models
  6. Practical tips: Normalization, input dimensions, eval mode, memory management

CNNs are the backbone of computer vision. Even as transformers gain ground, understanding CNNs is essential — they’re still the right tool for many tasks and provide the foundation for understanding more advanced architectures.


Move on to Module 30: Transformers & Attention where you’ll learn:

  • Why “Attention Is All You Need” changed everything
  • How self-attention works (and why it’s O(n²))
  • Building a transformer from scratch
  • Vision Transformers: CNNs’ newest challenger

This is where CNNs and transformers meet — and where you’ll understand why modern AI is what it is.


Last updated: 2025-11-27 Status: Complete