RNNs & Sequence Models

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Or: How AI Learned to See (And Why It Still Can’t Find Waldo)

Reading Time: 6-8 hours Prerequisites: Module 28

What You’ll Be Able to Do

By the end of this module, you will:

Understand how convolution operations extract features from images
Know why CNNs revolutionized computer vision (and the stories behind it)
Build custom CNN architectures from scratch
Master pooling, striding, and receptive fields
Navigate classic architectures (LeNet, AlexNet, VGG, ResNet)
Apply transfer learning to get 95%+ accuracy with minimal training
Understand when CNNs fail and why transformers are challenging them

The Computer Vision Revolution: From Impossible to Ubiquitous

Let’s set the stage with a shocking fact: in 2010, computers were worse at recognizing objects than a four-year-old child.

The ImageNet challenge — a competition to classify images into 1,000 categories like “golden retriever,” “volcano,” and “espresso” — had error rates around 25%. That means one in four images was misclassified. For reference, humans make mistakes about 5% of the time.

Then, in 2012, everything changed.

Did You Know? At the 2012 ImageNet competition, a team from the University of Toronto submitted an entry called “AlexNet.” It won by a landslide — 15% error rate versus 26% for the runner-up. This wasn’t just winning; it was demolishing the competition by the largest margin in the contest’s history. The winning team? Alex Krizhevsky, Ilya Sutskever (later co-founder of OpenAI), and Geoffrey Hinton. This single result is often credited with starting the modern AI boom.

What made AlexNet so special? It was a convolutional neural network — a type of neural network specifically designed to process images. CNNs weren’t new in 2012 (they were invented in the 1980s), but AlexNet was the first to combine them with GPUs and massive datasets. The result changed the world.

Today, CNNs power:

Your phone’s camera: Face detection, portrait mode, night sight
Self-driving cars: Lane detection, obstacle recognition
Medical imaging: Cancer detection, X-ray analysis
Social media: Content moderation, auto-tagging
Security systems: Facial recognition, anomaly detection

Let’s understand how they work.

The Biological Inspiration: How Vision Actually Works

Before we dive into the math, let’s understand why CNNs look the way they do. They’re directly inspired by how the human visual cortex works.

The Nobel Prize Discovery

In 1959, David Hubel and Torsten Wiesel did something that sounds like mad science: they inserted electrodes into a cat’s brain and showed it different images. What they discovered won them the Nobel Prize in 1981.

Did You Know? Hubel and Wiesel discovered that neurons in the visual cortex respond to very specific patterns. Some neurons fire only when they see a vertical line. Others respond to horizontal lines. Some detect edges at 45-degree angles. The visual system doesn’t try to process the whole image at once — it builds understanding from simple features. This hierarchical processing is exactly what CNNs replicate.

They found two key types of cells:

Simple cells: Respond to edges at specific orientations in specific locations
Complex cells: Respond to the same features but are more tolerant of position

This is exactly what CNNs do:

Convolutional layers: Detect simple features (like edge detectors)
Pooling layers: Add position tolerance (like complex cells)
Deeper layers: Combine simple features into complex ones

The analogy isn’t perfect, but it’s striking how well the architecture maps onto biology.

From Pixels to Features: The Convolution Operation

Here’s the fundamental problem with using regular neural networks on images:

An image is 224×224×3 pixels (RGB). That’s 150,528 input values. If your first hidden layer has 1,000 neurons, you need 150 million weights — just for the first layer! This is:

Computationally insane: Too many parameters to train
Wasteful: Most of those connections don’t matter
Missing the point: Images have spatial structure that fully-connected networks ignore

Convolutions solve all three problems elegantly.

What Is Convolution? (The Intuitive Version)

Imagine you’re looking for a specific pattern in an image — say, vertical edges. You have a small “stencil” that knows what vertical edges look like. You slide this stencil across every position in the image, and at each position, you compute how well the image matches the stencil.

The stencil is called a kernel or filter. The sliding operation is called convolution.

Here’s a concrete example. This 3×3 kernel detects vertical edges:

Vertical Edge Detector:
[-1  0  1]
[-1  0  1]
[-1  0  1]

When you slide this over an image:

Where there’s a vertical edge (dark on left, light on right), you get a high positive value
Where there’s no edge or a horizontal edge, you get near zero
Where there’s a reverse edge (light on left, dark on right), you get a negative value

The brilliant insight is that the same kernel is applied everywhere in the image. This means:

Parameter sharing: A 3×3 kernel has only 9 parameters, not millions
Translation equivariance: A cat in the corner uses the same features as a cat in the center
Hierarchy: Stack many layers to detect increasingly complex patterns

The Math: How Convolution Works

Let’s make this concrete. For a single-channel image (grayscale), convolution computes:

Output[i,j] = Σ Σ Input[i+m, j+n] × Kernel[m, n]
             m n

Where the sums run over the kernel dimensions.

Worked Example: Let’s apply a 3×3 kernel to a 5×5 image:

Input (5×5):                    Kernel (3×3):
[1  2  3  4  5]                 [1  0 -1]
[6  7  8  9  10]                [1  0 -1]
[11 12 13 14 15]                [1  0 -1]
[16 17 18 19 20]
[21 22 23 24 25]

Computing Output[0,0] (top-left corner of output):
= 1×1 + 2×0 + 3×(-1) + 6×1 + 7×0 + 8×(-1) + 11×1 + 12×0 + 13×(-1)
= 1 + 0 - 3 + 6 + 0 - 8 + 11 + 0 - 13
= -6

Computing Output[0,1] (slide right by 1):
= 2×1 + 3×0 + 4×(-1) + 7×1 + 8×0 + 9×(-1) + 12×1 + 13×0 + 14×(-1)
= 2 + 0 - 4 + 7 + 0 - 9 + 12 + 0 - 14
= -6

Notice the output is consistent because the input has no edges (just a uniform gradient).

Output Size: The Formula You’ll Use Daily

The output size depends on several parameters:

Output Size = floor((Input - Kernel + 2×Padding) / Stride) + 1

Let’s break this down:

Input: The input dimension (e.g., 224)
Kernel: The kernel size (e.g., 3)
Padding: Extra zeros around the border (e.g., 1)
Stride: How many pixels to skip between applications (e.g., 1)

Common configurations:

kernel=3, padding=1, stride=1: Output same size as input (most common)
kernel=3, padding=0, stride=1: Output shrinks by 2 pixels each side
kernel=3, padding=1, stride=2: Output is half the input size (downsampling)

Worked Example:

Input: 224×224
Conv2d(kernel=3, padding=1, stride=2)

Output = (224 - 3 + 2×1) / 2 + 1 = (224 - 3 + 2) / 2 + 1 = 223/2 + 1 = 111.5 + 1 = 112

Output: 112×112

PyTorch Implementation

Here’s how convolution looks in code. We’ll build intuition by comparing with manual computation:

import torch
import torch.nn as nn

# Create a simple 5x5 image (batch_size=1, channels=1)
image = torch.tensor([
    [1., 2., 3., 4., 5.],
    [6., 7., 8., 9., 10.],
    [11., 12., 13., 14., 15.],
    [16., 17., 18., 19., 20.],
    [21., 22., 23., 24., 25.]
]).unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, 5, 5)

# Create a vertical edge detector
conv = nn.Conv2d(
    in_channels=1,
    out_channels=1,
    kernel_size=3,
    padding=0,  # No padding, output will be 3x3
    bias=False
)

# Set the kernel manually
with torch.no_grad():
    conv.weight = nn.Parameter(torch.tensor([
        [[[ 1.,  0., -1.],
          [ 1.,  0., -1.],
          [ 1.,  0., -1.]]]
    ]))

output = conv(image)
print(output.shape)  # torch.Size([1, 1, 3, 3])
print(output[0, 0])  # The 3x3 output

Notice how PyTorch handles all the sliding and summation for you. The key parameters are:

in_channels: Number of input channels (1 for grayscale, 3 for RGB)
out_channels: Number of filters (each learns a different feature)
kernel_size: Size of the filter (3×3 is most common)
padding: Zeros to add around the border
stride: How many pixels to skip

Multiple Filters: Learning Different Features

A single kernel detects one pattern. To detect many patterns, we use multiple filters:

# 32 filters, each learning a different feature
conv_block = nn.Conv2d(
    in_channels=3,     # RGB input
    out_channels=32,   # 32 different filters
    kernel_size=3,
    padding=1
)

rgb_image = torch.randn(1, 3, 224, 224)  # Batch of 1 RGB image
features = conv_block(rgb_image)
print(features.shape)  # torch.Size([1, 32, 224, 224])

Each of those 32 output channels represents a different “feature map” — one might detect vertical edges, another horizontal, another corners, etc. The network learns what features to detect during training.

Did You Know? When researchers visualize what early convolutional layers learn, they consistently find edge detectors, color blob detectors, and gradient detectors — remarkably similar to what Hubel and Wiesel found in cats’ brains. The network “rediscovers” the visual features that evolution took millions of years to develop.

Pooling: Making Networks Translation-Invariant

You’ve detected that there’s a cat in the image. Do you care if it’s 3 pixels to the left? No! You want your network to be robust to small shifts in position.

This is what pooling provides.

Max Pooling: The Dominant Approach

Max pooling is simple: divide the feature map into regions and keep only the maximum value in each region.

Input (4×4):              Max Pool 2×2, stride 2:
[1  3  2  4]
[5  6  7  8]     →        [6   8]
[9  1  3  4]              [9   7]
[2  3  7  1]

Top-left output: max(1,3,5,6) = 6
Top-right output: max(2,4,7,8) = 8
Bottom-left output: max(9,1,2,3) = 9
Bottom-right output: max(3,4,7,1) = 7

Max pooling provides:

Downsampling: Reduces spatial dimensions (2× with 2×2 pooling)
Translation invariance: Small shifts don’t change the output
Computation savings: Fewer values to process in later layers
Feature selection: Keeps the strongest activation in each region

Average Pooling: The Alternative

Average pooling takes the mean instead of the maximum:

Same input with Average Pool 2×2:
Top-left: (1+3+5+6)/4 = 3.75
Top-right: (2+4+7+8)/4 = 5.25
...

Average pooling is smoother and doesn’t throw away information as aggressively. It’s often used at the very end of networks (Global Average Pooling) to reduce feature maps to a single value per channel.

PyTorch Pooling Layers

# Max pooling: keep strongest activation
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling: take the mean
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling: reduce entire feature map to single value
# Usually done with nn.AdaptiveAvgPool2d(1)
gap = nn.AdaptiveAvgPool2d(1)  # Output is 1×1 regardless of input size

features = torch.randn(1, 64, 14, 14)  # Some feature map
pooled = gap(features)
print(pooled.shape)  # torch.Size([1, 64, 1, 1])

The Decline of Manual Pooling

Here’s an interesting trend: modern networks often skip explicit pooling layers and use strided convolutions instead.

Why? A strided convolution (stride=2) also reduces spatial dimensions, but unlike pooling, it’s learnable. The network can decide how to downsample rather than using fixed max/average rules.

# Traditional: Conv + Pool
traditional = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2)  # Fixed downsampling
)

# Modern: Strided Conv
modern = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # Learnable downsampling
    nn.ReLU()
)

Both reduce spatial dimensions by 2×, but the strided convolution approach is more flexible.

Receptive Fields: How CNNs See the Big Picture

Here’s a crucial concept that trips up beginners: a neuron in a deep layer doesn’t just “see” a 3×3 patch — it sees a much larger region called its receptive field.

Building Intuition

Consider three 3×3 convolutions stacked:

Layer 1: Each neuron sees a 3×3 region of the input
Layer 2: Each neuron sees a 3×3 region of Layer 1’s output. But each of those Layer 1 neurons saw 3×3, so effectively Layer 2 sees 5×5 of the original input
Layer 3: Following the same logic, sees 7×7 of the original input

The Receptive Field Formula

For convolutions with kernel size k and no stride or dilation:

Receptive Field = 1 + L × (k - 1)

Where L is the number of layers.

Example: 5 layers of 3×3 convolutions:

RF = 1 + 5 × (3 - 1) = 1 + 5 × 2 = 11

With stride and pooling, receptive fields grow much faster.

Dilated (Atrous) Convolutions: Expanding Receptive Fields Without Downsampling

Sometimes you need a large receptive field but can’t afford to downsample (common in semantic segmentation where you need full-resolution output). Dilated convolutions solve this by inserting gaps in the kernel:

Standard 3×3 kernel:        Dilated 3×3 (dilation=2):
[× × ×]                     [×   ×   ×]
[× × ×]  → 3×3 RF           [         ]
[× × ×]                     [×   ×   ×]  → 5×5 RF
                            [         ]
                            [×   ×   ×]

A 3×3 kernel with dilation=2 has a 5×5 receptive field but still only 9 parameters. With dilation=4, you get a 9×9 receptive field.

# Dilated convolution in PyTorch
dilated_conv = nn.Conv2d(
    64, 64, kernel_size=3,
    padding=2,      # Adjust padding to maintain size
    dilation=2      # Insert 1 gap between kernel elements
)

Dilated convolutions are crucial for:

Semantic segmentation (DeepLab uses them extensively)
WaveNet (audio generation)
Any task needing large receptive fields at high resolution

Why This Matters

Receptive fields explain several design choices:

Deep > Wide: To “see” a 224×224 image, you need enough layers for the receptive field to cover it
Stacking 3×3: Two 3×3 convolutions = 5×5 receptive field with fewer parameters than one 5×5 kernel
Global operations: Later layers can make global decisions because they see the whole image

Did You Know? The VGGNet paper (2014) demonstrated that stacking small 3×3 filters is more effective than using larger filters. Three 3×3 layers have the same receptive field as one 7×7 layer but with fewer parameters (3×3×3 = 27 vs 7×7 = 49) and more non-linearities (three ReLUs vs one). This insight influenced almost all subsequent architectures.

The Classic Architectures: A Journey Through History

Understanding classic architectures gives you the vocabulary to discuss modern ones. Each breakthrough solved a specific problem.

LeNet-5 (1998): The Original CNN

Yann LeCun at Bell Labs created LeNet-5 for reading handwritten digits on checks. It’s the blueprint all modern CNNs follow.

Architecture:

Input (32×32 grayscale)
    ↓
Conv1 (6 filters, 5×5) → 28×28×6
    ↓
Pool1 (2×2) → 14×14×6
    ↓
Conv2 (16 filters, 5×5) → 10×10×16
    ↓
Pool2 (2×2) → 5×5×16
    ↓
Flatten → 400
    ↓
FC1 → 120
    ↓
FC2 → 84
    ↓
Output → 10 (digits 0-9)

Key innovations:

Convolution + pooling pattern
Local connectivity (not fully connected)
Learned features instead of hand-crafted

Limitation: Only worked on simple, centered images. Real-world images were still too hard.

AlexNet (2012): The Deep Learning Big Bang

AlexNet won ImageNet 2012 by a huge margin and kickstarted the deep learning revolution.

What made it work:

Depth: 8 layers (considered “deep” at the time)
ReLU: Replaced sigmoid/tanh, enabling faster training
Dropout: Prevented overfitting on 60M parameters
GPUs: Trained on two GTX 580s (3GB each!) in parallel
Data augmentation: Translations, reflections, color jittering
Huge data: 1.2 million training images

Architecture Sketch:

Input: 224×224×3
Conv1 (96, 11×11, stride 4) → 55×55×96
MaxPool → 27×27×96
Conv2 (256, 5×5) → 27×27×256
MaxPool → 13×13×256
Conv3 (384, 3×3) → 13×13×384
Conv4 (384, 3×3) → 13×13×384
Conv5 (256, 3×3) → 13×13×256
MaxPool → 6×6×256
FC1 → 4096 (with dropout)
FC2 → 4096 (with dropout)
Output → 1000 classes

Did You Know? AlexNet was trained for about 6 days on two GPUs. Today, the same model can be trained in under an hour on modern hardware. But here’s the crazy part: the original GPUs only had 3GB of VRAM, so the network was literally split across two GPUs with layers communicating between them. This “model parallelism” was a necessity, not a choice!

VGGNet (2014): The Power of Simplicity

Karen Simonyan and Andrew Zisserman at Oxford asked a simple question: what if we just go deeper with 3×3 convolutions?

The insight: Stack 3×3 convolutions instead of using larger kernels. Two 3×3 = 5×5 receptive field. Three 3×3 = 7×7. Same receptive field, fewer parameters, more non-linearities.

VGG-16 Architecture:

Input: 224×224×3

Block 1: 2× Conv3-64 + Pool → 112×112×64
Block 2: 2× Conv3-128 + Pool → 56×56×128
Block 3: 3× Conv3-256 + Pool → 28×28×256
Block 4: 3× Conv3-512 + Pool → 14×14×512
Block 5: 3× Conv3-512 + Pool → 7×7×512

FC1 → 4096
FC2 → 4096
Output → 1000

Total: 138 million parameters

Legacy: VGG’s feature extractor is still widely used for style transfer, perceptual losses, and feature matching because its features are interpretable and hierarchical.

Limitation: 138M parameters is a lot. Most are in the fully-connected layers. Modern networks avoid this.

GoogLeNet/Inception (2014): Going Wider, Not Just Deeper

While everyone was stacking layers, Google took a different approach: what if we use multiple filter sizes in parallel?

The Inception Module:

            Input
     ┌───────┼───────┬───────┐
     │       │       │       │
   1×1     3×3     5×5    Pool
     │       │       │       │
     └───────┴───────┴───────┘
              Concat

Each branch uses a different kernel size, then outputs are concatenated. The network can choose which scale of features to use.

The 1×1 convolution trick: Before the 3×3 and 5×5 convolutions, 1×1 convolutions reduce the channel count, massively reducing computation. A 1×1 convolution is basically a per-pixel fully-connected layer.

Impact: GoogLeNet achieved similar accuracy to VGG with only 4 million parameters (vs VGG’s 138 million). 22 layers deep.

Did You Know? The name “Inception” came from the 2010 movie where characters enter dreams within dreams. The Google team thought it fit perfectly — their module was networks within networks. The famous “we need to go deeper” meme from the movie became an inside joke in the deep learning community during 2014-2015.

ResNet (2015): The Depth Breakthrough

This is the most important architecture for understanding modern deep learning. Kaiming He and colleagues at Microsoft Research solved the degradation problem.

The Problem: When you make networks deeper, training accuracy gets worse, not better. This wasn’t overfitting — training accuracy degraded. Something was fundamentally wrong.

The Solution: Skip connections (residual connections).

Instead of learning H(x), learn F(x) = H(x) - x, then add x back:

Output = F(x) + x

This is called a residual block:

     Input (x)
        │
    ┌───┴───┐
    │       │
  Conv-BN   │
    │       │
  ReLU      │
    │       │
  Conv-BN   │
    │       │
    └───┬───┘
        +  ←── skip connection
        │
      ReLU
        │
     Output

Why it works: The skip connection provides a “gradient highway” that lets gradients flow directly through the network. Even if the learned transformation F(x) has vanishing gradients, the skip connection preserves the gradient from the loss.

Impact: ResNets can be 100+ layers deep without degradation. ResNet-152 won ImageNet 2015 with 3.6% error — better than humans (5%)!

Did You Know? ResNets were so influential that the original paper has over 200,000 citations. To put that in perspective, that’s more than any physics paper ever published, including Einstein’s foundational work. The skip connection is one of the most important ideas in deep learning.

The ResNet Family

import torch
import torchvision.models as models

# Standard ResNets (increasingly deeper)
resnet18 = models.resnet18(pretrained=True)   # 11M params
resnet34 = models.resnet34(pretrained=True)   # 21M params
resnet50 = models.resnet50(pretrained=True)   # 25M params
resnet101 = models.resnet101(pretrained=True) # 44M params
resnet152 = models.resnet152(pretrained=True) # 60M params

EfficientNet (2019): Optimal Scaling

Google’s EfficientNet asked: what’s the optimal way to scale a network? More layers? More channels? Higher resolution?

The insight: All three should scale together with a compound coefficient.

depth = α^φ
width = β^φ
resolution = γ^φ

Where α, β, γ are found by neural architecture search and φ is the scaling coefficient.

Result: EfficientNet-B0 matches ResNet-50 accuracy with 5× fewer parameters. EfficientNet-B7 achieves state-of-the-art with reasonable compute.

Key technique: EfficientNet uses depthwise separable convolutions (from MobileNet). Instead of one convolution that mixes spatial and channel information together, it uses two steps:

Depthwise: One filter per input channel (spatial only)
Pointwise: 1×1 convolution (channel mixing only)

This reduces computation by ~8-9× with minimal accuracy loss. If you see “MBConv” in architecture diagrams, that’s a mobile inverted bottleneck block using this technique.

Building a CNN from Scratch

Let’s build a modern CNN architecture incorporating everything we’ve learned:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """
    A standard convolutional block: Conv -> BatchNorm -> ReLU.
    Optionally includes residual connection.
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        residual: bool = False
    ):
        super().__init__()
        self.residual = residual

        # Calculate padding to maintain spatial dimensions (when stride=1)
        padding = kernel_size // 2

        self.conv = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            bias=False  # BatchNorm handles the bias
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut for residual connection when dimensions change
        self.shortcut = nn.Identity()
        if residual and (in_channels != out_channels or stride != 1):
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = x

        out = self.conv(x)
        out = self.bn(out)

        if self.residual:
            out = out + self.shortcut(identity)

        out = self.relu(out)
        return out


class SimpleCNN(nn.Module):
    """
    A modern CNN incorporating best practices:
    - BatchNorm after every conv
    - Residual connections in deeper blocks
    - Global average pooling instead of flattening
    - Dropout for regularization
    """
    def __init__(self, num_classes: int = 10, in_channels: int = 3):
        super().__init__()

        # Initial convolution
        self.stem = ConvBlock(in_channels, 32, kernel_size=3)

        # Feature extraction blocks
        self.block1 = nn.Sequential(
            ConvBlock(32, 64, stride=2),  # Downsample
            ConvBlock(64, 64, residual=True)
        )

        self.block2 = nn.Sequential(
            ConvBlock(64, 128, stride=2),  # Downsample
            ConvBlock(128, 128, residual=True),
            ConvBlock(128, 128, residual=True)
        )

        self.block3 = nn.Sequential(
            ConvBlock(128, 256, stride=2),  # Downsample
            ConvBlock(256, 256, residual=True),
            ConvBlock(256, 256, residual=True)
        )

        # Classification head
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(256, num_classes)

        # Initialize weights properly
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Feature extraction
        x = self.stem(x)      # 32 channels
        x = self.block1(x)    # 64 channels, 1/2 spatial
        x = self.block2(x)    # 128 channels, 1/4 spatial
        x = self.block3(x)    # 256 channels, 1/8 spatial

        # Classification
        x = self.global_pool(x)  # 256×1×1
        x = x.view(x.size(0), -1)  # Flatten to 256
        x = self.dropout(x)
        x = self.fc(x)
        return x


# Test the architecture
model = SimpleCNN(num_classes=10, in_channels=3)
dummy_input = torch.randn(2, 3, 32, 32)  # CIFAR-10 size
output = model(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

Key design choices explained:

bias=False in Conv2d: BatchNorm has its own learnable bias (β), so the conv bias is redundant
Strided convolutions for downsampling: More flexible than max pooling, and we save parameters
Global average pooling: Instead of flattening (which would give 256×4×4=4096 values for 32×32 input), we pool to 256 values. This:
- Dramatically reduces parameters
- Works with any input size
- Acts as regularization
Residual connections in later blocks: Help gradient flow in deeper parts
Kaiming initialization: Proper initialization for ReLU networks

A Note on BatchNorm Placement: There’s an ongoing debate about whether to place BatchNorm before or after ReLU:

# Original ResNet (BN after conv, before ReLU)
Conv → BatchNorm → ReLU

# "Pre-activation" ResNet (BN before conv)
BatchNorm → ReLU → Conv

The original placement is most common, but pre-activation can work better for very deep networks. In practice, both work well — the difference is usually small. Most pretrained models use the original placement, so stick with that unless you have a specific reason to change.

Transfer Learning: Standing on Giants’ Shoulders

Training a CNN from scratch requires:

Millions of labeled images
Days of GPU time
Expertise in hyperparameter tuning

Transfer learning says: don’t start from scratch.

The Key Insight

Early layers of CNNs learn general features — edges, textures, patterns. These are useful for almost any image task! Later layers learn task-specific features.

Strategy:

Take a network pretrained on ImageNet (1.2M images, 1000 classes)
Remove the final classification layer
Add your own classification layer
Fine-tune (optionally freeze early layers)

Did You Know? Transfer learning is so effective that it’s now the default approach for almost all computer vision tasks. In a 2020 study, researchers found that pretrained ImageNet models transfer well to tasks as different as medical imaging, satellite imagery, and art classification — domains with completely different visual statistics than the natural photos in ImageNet. The learned features are surprisingly universal.

Transfer Learning in PyTorch

import torch
import torch.nn as nn
import torchvision.models as models
from torchvision import transforms

def create_transfer_model(num_classes: int, freeze_backbone: bool = True):
    """
    Create a model using transfer learning from ResNet-18.
    """
    # Load pretrained ResNet-18
    model = models.resnet18(weights='IMAGENET1K_V1')

    # Freeze backbone if requested (faster training, prevents overfitting)
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False

    # Replace the final fully connected layer
    # ResNet-18's fc layer is: Linear(512, 1000)
    # We replace it with our own
    num_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(num_features, num_classes)
    )

    return model


# Example: Fine-tune for a 5-class flower classification task
model = create_transfer_model(num_classes=5, freeze_backbone=True)

# Check which layers will be trained
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} ({100*trainable_params/total_params:.1f}%)")
# Output: Trainable: 2,565 / 11,179,077 (0.0%)

With a frozen backbone, you’re only training 2,565 parameters instead of 11 million! This:

Trains in minutes instead of hours
Works with tiny datasets (even 100 images)
Rarely overfits

The Transfer Learning Recipe

For small datasets (<1,000 images per class):

Freeze the entire backbone
Train only the new classification head
Use data augmentation aggressively

For medium datasets (1,000-10,000 per class):

Start with frozen backbone
Train classification head until convergence
Unfreeze last few layers
Fine-tune with very low learning rate (10× lower than head)

For large datasets (>10,000 per class):

Initialize with pretrained weights (still helps!)
Train entire network
Use lower learning rate for early layers (discriminative learning rates)

Data Augmentation for Transfer Learning

Pretrained models expect ImageNet-like preprocessing. Always include:

# Standard ImageNet normalization
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],  # ImageNet statistics
    std=[0.229, 0.224, 0.225]
)

# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # Crop and resize to 224
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    normalize
])

# Validation transforms (no augmentation)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])

Memory & Performance Considerations

CNNs are memory-intensive. Understanding the costs helps you make practical decisions.

Memory Breakdown

For a single forward pass, memory is consumed by:

Weights: ~4 bytes per parameter (float32)
Activations: Output of every layer (needed for backprop)
Gradients: Same size as weights (during training)

ResNet-50 example:

Weights: 25M params × 4 bytes = 100MB
Activations at 224×224, batch_size=32: ~2.5GB
Gradients: ~100MB

Total: ~2.7GB minimum. Mixed precision (FP16) roughly halves this.

Batch Size vs. GPU Memory

Activation memory scales linearly with batch size. If batch_size=32 uses 4GB:

batch_size=16 → ~2.5GB
batch_size=64 → ~6.5GB

Quick estimation: Measure memory at batch_size=1, then scale linearly.

Common OOM Solutions

Reduce batch size: First thing to try
Mixed precision training: Uses FP16 for activations
Gradient checkpointing: Recompute activations during backward pass
Smaller input size: 224→192 reduces memory ~25%
Smaller model: ResNet-18 vs ResNet-50

# Mixed precision training (PyTorch 1.6+)
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Input Size Trade-offs

Resolution	Memory	Speed	Accuracy
224×224	Baseline	Baseline	Baseline
192×192	0.73×	1.36×	-1%
160×160	0.51×	1.96×	-2%
320×320	2.04×	0.49×	+1%

For prototyping, use smaller resolutions. For final training, use 224 or higher.

When CNNs Fail: Understanding Limitations

CNNs aren’t magic. Understanding their failures helps you know when to use alternatives.

The Texture Bias Problem

Did You Know? In 2019, researchers discovered something disturbing: CNNs classify images primarily by texture, not shape. They showed that a cat with elephant skin texture gets classified as “elephant” by ImageNet-trained CNNs. Humans, by contrast, rely primarily on shape. This explains why CNNs are fooled by adversarial examples — small texture changes that are invisible to humans but completely fool the network.

Limited Receptive Field

CNNs process images through local windows. Very large objects or global relationships can be missed. This is why Vision Transformers (ViT) are gaining popularity — they can attend to the entire image at once.

Lack of Rotation Invariance

CNNs are translation-invariant (a cat anywhere is still a cat) but not rotation-invariant. An upside-down cat looks different to a CNN. Data augmentation helps, but doesn’t fully solve this.

Poor Sample Efficiency

CNNs need thousands of examples to learn. Humans can learn from one or two examples (“one-shot learning”). This is an active research area.

The Rise of Vision Transformers

Starting in 2020, Vision Transformers (ViT) began challenging CNN dominance. They:

Use self-attention instead of convolution
Can attend to global relationships
Scale better with data and compute

Did You Know? The original Vision Transformer paper (Dosovitskiy et al., 2020) had a surprising result: ViT underperformed CNNs when trained on ImageNet alone. But when pretrained on larger datasets (JFT-300M with 300 million images), it dramatically outperformed every CNN. The conclusion? Transformers need more data to learn good visual features, but once they have enough, they scale better. This sparked a race to build larger vision datasets and hybrid CNN-transformer architectures like Swin Transformer.

But CNNs aren’t dead! For smaller datasets and edge deployment, CNNs still dominate. The future is likely hybrid architectures that combine CNN’s inductive biases (locality, translation equivariance) with the global attention of transformers.

Common Pitfalls

1. Forgetting to Normalize Inputs

Pretrained models expect ImageNet normalization. Without it:

# Wrong - raw pixel values [0, 255]
outputs = model(raw_images)  # Garbage predictions

# Wrong - [0, 1] but not normalized
outputs = model(images / 255.0)  # Still wrong

# Correct - ImageNet normalization
normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
outputs = model(normalize(images / 255.0))  # Correct!

2. Wrong Input Dimensions

PyTorch expects (batch, channels, height, width). Common mistakes:

# Wrong - missing batch dimension
image = torch.randn(3, 224, 224)
output = model(image)  # Error!

# Correct - add batch dimension
output = model(image.unsqueeze(0))  # Shape: (1, 3, 224, 224)

# Wrong - channels last (common in numpy/PIL)
image = torch.randn(1, 224, 224, 3)  # HWC format
output = model(image)  # Wrong results!

# Correct - convert to channels first
output = model(image.permute(0, 3, 1, 2))  # CHW format

3. Forgetting model.eval() for Inference

BatchNorm and Dropout behave differently in training vs evaluation:

# Wrong - using training mode for inference
model.train()  # Default mode
predictions = model(test_images)  # BatchNorm uses batch statistics, dropout active

# Correct
model.eval()
with torch.no_grad():
    predictions = model(test_images)

4. Using the Wrong Pretrained Weights

# Wrong - ImageNet V2 weights have different preprocessing
model = models.resnet50(weights='IMAGENET1K_V2')  # Uses different transforms!

# Check what transforms your weights expect
weights = models.ResNet50_Weights.IMAGENET1K_V1
preprocess = weights.transforms()  # Use this!

5. Overfitting on Small Datasets

With transfer learning on small datasets:

# Wrong - training everything from the start
model = models.resnet50(weights='IMAGENET1K_V1')
optimizer = optim.Adam(model.parameters(), lr=0.001)  # All params!

# Correct - freeze backbone, train head first
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(2048, num_classes)
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)  # Only head

Quiz: Test Your Understanding

Q1: Why do we use 3×3 kernels instead of larger ones like 7×7?

Answer

Three stacked 3×3 kernels have the same receptive field as one 7×7 kernel, but with fewer parameters (3×9=27 vs 49) and more non-linearities (3 ReLUs vs 1). This makes the network more expressive and easier to train. The VGGNet paper demonstrated this principle in 2014.

Q2: Calculate the output shape: Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1) on input (B, 64, 56, 56).

Answer

Output spatial size = floor((56 - 3 + 2×1) / 2) + 1 = floor(55/2) + 1 = 27 + 1 = 28

Output shape: (B, 128, 28, 28)

The number of output channels becomes 128, and spatial dimensions are halved (rounded).

Q3: Why does ResNet’s skip connection help training?

Answer

Skip connections provide a "gradient highway" that allows gradients to flow directly from the loss to early layers without being multiplied through many weight matrices. Even if the learned function F(x) has vanishing gradients, the identity mapping x preserves gradient magnitude. Additionally, skip connections make the optimization landscape smoother, making it easier to find good minima.

Q4: When should you freeze the backbone in transfer learning?

Answer

Freeze the backbone when: - You have a small dataset (<1000 images per class) - Your images are similar to ImageNet (natural photos) - You want fast training - You're doing initial experiments

Unfreeze gradually when:

You have more data
Your images differ significantly from ImageNet (medical images, satellite imagery)
You’ve already trained the head and want better performance

Q5: Your CNN gets 95% accuracy on training data but only 60% on validation. What’s happening and how do you fix it?

Answer

This is classic **overfitting**. The model memorized training data instead of learning generalizable features.

Fixes to try:

Data augmentation: More variety in training data
Dropout: Add or increase dropout rate
Weight decay: Add L2 regularization
Simpler model: Fewer layers or channels
Early stopping: Stop before overfitting
Transfer learning: Use pretrained weights (better features)
More data: If possible, collect more training examples

Summary

You’ve learned:

Convolution: Sliding filters extract features efficiently with parameter sharing
Pooling: Downsampling adds translation invariance
Receptive fields: Deep stacks of small kernels see large regions
Classic architectures: LeNet → AlexNet → VGG → GoogLeNet → ResNet
Transfer learning: Pretrained features + new head = fast, accurate models
Practical tips: Normalization, input dimensions, eval mode, memory management

CNNs are the backbone of computer vision. Even as transformers gain ground, understanding CNNs is essential — they’re still the right tool for many tasks and provide the foundation for understanding more advanced architectures.

Next Steps

Move on to Module 30: Transformers & Attention where you’ll learn:

Why “Attention Is All You Need” changed everything
How self-attention works (and why it’s O(n²))
Building a transformer from scratch
Vision Transformers: CNNs’ newest challenger

This is where CNNs and transformers meet — and where you’ll understand why modern AI is what it is.

Last updated: 2025-11-27 Status: Complete

RNNs & Sequence Models

Or: How AI Learned to See (And Why It Still Can’t Find Waldo)

What You’ll Be Able to Do

The Computer Vision Revolution: From Impossible to Ubiquitous

The Biological Inspiration: How Vision Actually Works

The Nobel Prize Discovery

From Pixels to Features: The Convolution Operation

What Is Convolution? (The Intuitive Version)

The Math: How Convolution Works

Output Size: The Formula You’ll Use Daily

PyTorch Implementation

Multiple Filters: Learning Different Features

Pooling: Making Networks Translation-Invariant

Max Pooling: The Dominant Approach

Average Pooling: The Alternative

PyTorch Pooling Layers

The Decline of Manual Pooling

Receptive Fields: How CNNs See the Big Picture

Building Intuition

The Receptive Field Formula

Dilated (Atrous) Convolutions: Expanding Receptive Fields Without Downsampling

Why This Matters

The Classic Architectures: A Journey Through History

LeNet-5 (1998): The Original CNN

AlexNet (2012): The Deep Learning Big Bang

VGGNet (2014): The Power of Simplicity

GoogLeNet/Inception (2014): Going Wider, Not Just Deeper

ResNet (2015): The Depth Breakthrough

The ResNet Family

EfficientNet (2019): Optimal Scaling

Building a CNN from Scratch

Transfer Learning: Standing on Giants’ Shoulders

The Key Insight

Transfer Learning in PyTorch

The Transfer Learning Recipe

Data Augmentation for Transfer Learning

Memory & Performance Considerations

Memory Breakdown

Batch Size vs. GPU Memory

Common OOM Solutions

Input Size Trade-offs

When CNNs Fail: Understanding Limitations

The Texture Bias Problem

Limited Receptive Field

Lack of Rotation Invariance

Poor Sample Efficiency

The Rise of Vision Transformers

Common Pitfalls

1. Forgetting to Normalize Inputs

2. Wrong Input Dimensions

3. Forgetting model.eval() for Inference

4. Using the Wrong Pretrained Weights

5. Overfitting on Small Datasets

Quiz: Test Your Understanding

Further Reading

Papers

Online Resources

Summary

Next Steps