
The Transformer Architecture


AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6 hours

Or: The Architecture That Ate AI (And Why “Attention Is All You Need”)


Reading Time: 8-10 hours | Prerequisites: Module 29

This is a Heureka Moment module.


When eight researchers at Google discovered in 2017 that they could throw away two decades of sequence modeling wisdom and build something better, the AI community was skeptical. After all, recurrent neural networks had dominated for years. But within three years, their “Attention Is All You Need” paper would become the most cited work in AI history, and every major language model—GPT-5, Claude, Gemini—would be built on their foundation.


Mountain View, California. June 12, 2017. 2:34 AM.

Ashish Vaswani and his Google Brain colleagues huddled around a laptop in a cramped conference room. They had been training their experimental model for days, and the results were finally in.

“That can’t be right,” Jakob Uszkoreit muttered, staring at the screen.

They had just achieved state-of-the-art performance on machine translation—beating every existing system. But that wasn’t what shocked them. It was what they had removed to get there.

No recurrent connections. No convolutions. Just attention. “Attention is all you need,” Vaswani typed as the paper’s title, half-joking. But he wasn’t joking.

“We expected attention to be helpful alongside RNNs, not to replace them entirely. When we removed the recurrence and the model got better, we knew we had found something fundamental.” — Ashish Vaswani, Transformer co-inventor

That paper, with its eight co-authors and deceptively simple title, would go on to become the most cited paper in AI history. Every modern language model—GPT-5, Claude, Gemini, LLaMA—is a direct descendant of what they built that night.


By the end of this module, you will:

  • Understand why sequential models (RNNs/LSTMs) hit a wall
  • Master the self-attention mechanism from intuition to implementation
  • Build multi-head attention from scratch
  • Understand positional encoding and why it’s necessary
  • Implement a complete transformer encoder
  • Know why “Attention Is All You Need” changed everything
  • See how Vision Transformers apply attention to images
  • Grasp the O(n²) complexity and its implications

The Heureka Moment: What You’re About to Discover


Here’s the transformative insight you’ll gain from this module:

Before transformers, neural networks processed sequences one step at a time — like reading a book word by word, unable to look back or ahead without significant effort.

After transformers, neural networks can look at everything simultaneously and learn what to pay attention to — like being able to see an entire book at once and instantly knowing which parts are relevant to each other.

This single architectural change is why we have GPT-5, Claude, Gemini, and virtually every modern AI system. By the end of this module, you’ll understand exactly how it works.


Before 2017, if you wanted to process sequences (text, audio, time series), you used Recurrent Neural Networks (RNNs) or their improved variant, LSTMs.

An RNN processes sequences one element at a time, maintaining a “hidden state” that carries information forward:

Input: "The cat sat on the mat"
↓ ↓ ↓ ↓ ↓ ↓
h₀ → h₁ → h₂ → h₃ → h₄ → h₅ → h₆
Output

Each hidden state h_t depends on the previous state h_{t-1} and the current input. Information flows forward through time.

1. Sequential Processing (Slow)

RNNs must process tokens one at a time. You can’t parallelize because h₃ depends on h₂ which depends on h₁. Training on a 1000-token sequence takes 1000 sequential steps.

Modern GPUs have thousands of cores, but RNNs can only use one at a time
for each sequence position. It's like having a 1000-lane highway but
being forced to drive in a single lane.
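
To make the bottleneck concrete, here is a minimal sketch (illustrative sizes, assuming PyTorch’s nn.RNNCell, not code from the original module) of why the time dimension cannot be parallelized:

import torch
import torch.nn as nn

# Each hidden state depends on the previous one, so the loop below
# must run strictly in order, one timestep at a time.
rnn_cell = nn.RNNCell(input_size=32, hidden_size=64)

x = torch.randn(8, 100, 32)        # [batch, seq_len, input_size]
h = torch.zeros(8, 64)             # initial hidden state

for t in range(x.size(1)):         # 100 strictly sequential steps
    h = rnn_cell(x[:, t, :], h)    # h_t depends on h_{t-1}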

2. Vanishing/Exploding Gradients (Again)

Information must travel through many timesteps. For a 100-word sentence, the gradient for word 1 must backpropagate through 99 steps. Remember Module 28? The same multiplication problem kills gradients in RNNs.

Did You Know? LSTMs (Long Short-Term Memory) were invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997 specifically to address the vanishing gradient problem. Their “gating mechanism” allows information to flow unchanged through time — similar to ResNet’s skip connections. LSTMs dominated sequence modeling for 20 years until transformers arrived.

3. Long-Range Dependencies (Difficult)

For the sentence: “The cat, which had been sleeping on the warm sunny windowsill for most of the afternoon, finally woke up and stretched.”

What does “stretched” refer to? The cat! But there are 18 words between “cat” and “stretched.” RNNs struggle to maintain this connection over long distances.

LSTMs add “gates” that control information flow:

LSTM Cell:

forget_gate = σ(W_f · [h_{t-1}, x_t])   ← what to forget
input_gate  = σ(W_i · [h_{t-1}, x_t])   ← what to add
output_gate = σ(W_o · [h_{t-1}, x_t])   ← what to output

cell_state c_t = forget_gate * c_{t-1}
               + input_gate * tanh(W_c · [h_{t-1}, x_t])

h_t = output_gate * tanh(c_t)

LSTMs helped, but didn’t solve the fundamental problem: you still have to process sequentially.
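
As a rough sketch of the gate equations above (bias terms omitted, weight shapes assumed), a single LSTM step looks like this in PyTorch; note it still has to be called once per timestep:

import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    """One LSTM step following the gate equations above (biases omitted).
    Each W_* maps the concatenated [h_prev, x_t] vector to the hidden size."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    forget_gate = torch.sigmoid(hx @ W_f.T)   # what to forget from c_{t-1}
    input_gate = torch.sigmoid(hx @ W_i.T)    # what to add to the cell state
    output_gate = torch.sigmoid(hx @ W_o.T)   # what to expose as h_t
    c_t = forget_gate * c_prev + input_gate * torch.tanh(hx @ W_c.T)
    h_t = output_gate * torch.tanh(c_t)
    return h_t, c_t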


The Revolution: “Attention Is All You Need”


In June 2017, eight researchers at Google published a paper that would reshape AI: “Attention Is All You Need.”

Did You Know? The transformer paper has been cited over 130,000 times — more than almost any computer science paper ever written. The eight authors (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin) are now scattered across Google, Google DeepMind, and various AI startups. Noam Shazeer later co-founded Character.AI and then returned to Google. The paper’s title was intentionally provocative — they were claiming that attention alone, without any recurrence, could match or beat RNN/LSTM models.

The key insight was radical: throw away recurrence entirely. Don’t process sequences one step at a time. Instead, let every token look at every other token simultaneously and learn which connections matter.

Think of the difference like reading a book. An RNN is like reading with a tiny flashlight that only illuminates one word at a time—you have to remember what you read before. A transformer is like turning on all the lights at once and seeing the entire page, instantly noticing which words relate to each other. This parallel vision is what makes transformers so powerful.

This is the attention mechanism.


Think of attention as a soft database lookup.

In a traditional database, you have:

  • A query (what you’re looking for)
  • Keys (labels on stored items)
  • Values (the actual stored content)

You find exact matches: SELECT value WHERE key = query

Attention does the same thing, but with soft matching:

  • Instead of exact matches, compute similarity scores
  • Instead of returning one result, return a weighted combination of all values
  • The weights are learned, not hand-coded
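
Here is a tiny, self-contained sketch of that idea (toy numbers, assuming PyTorch): instead of returning the single exact match, we softmax the similarity scores and return a weighted blend of every value.

import torch
import torch.nn.functional as F

query = torch.tensor([1.0, 0.0])                       # what we're looking for
keys = torch.tensor([[0.9, 0.1],                       # labels on stored items
                     [0.1, 0.9],
                     [0.5, 0.5]])
values = torch.tensor([[10.0], [20.0], [30.0]])        # stored content

scores = keys @ query                                   # similarity of query to each key
weights = F.softmax(scores, dim=0)                     # soft match instead of exact match
result = (weights.unsqueeze(-1) * values).sum(dim=0)   # weighted combination of all values
print(weights, result)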

Imagine you’re reading the sentence: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal or the street?

As a human, you instantly know it’s “the animal” — streets don’t get tired. How did you know? You looked at “it,” then looked at all other words, and found that “animal” + “tired” makes semantic sense while “street” + “tired” doesn’t.

This is exactly what self-attention does. For each word:

  1. Create a “query” representing “what am I looking for?”
  2. Create “keys” for all words representing “what can I offer?”
  3. Compare query to all keys (similarity scores)
  4. Use scores to weight the “values” (actual content)

Given an input sequence of embeddings X (shape: [seq_len, d_model]):

Q = X @ W_Q # Queries: what each position is looking for
K = X @ W_K # Keys: what each position can be found by
V = X @ W_V # Values: what each position contains
# Compute attention scores
scores = Q @ K.T / sqrt(d_k) # [seq_len, seq_len]
# Convert to probabilities
attention_weights = softmax(scores) # Each row sums to 1
# Weighted sum of values
output = attention_weights @ V # [seq_len, d_model]

Let’s break this down:

Step 1: Project to Q, K, V

Each input embedding gets transformed into three different representations:

  • Query (Q): “When I need context, what should I look for?”
  • Key (K): “What kind of context can I provide to others?”
  • Value (V): “What information do I actually contain?”

These are separate because what you’re looking for (Q) might be different from what you offer (K). For example, a pronoun like “it” is looking for its antecedent, but offers very little information itself.

Step 2: Compute Similarity (Q @ K.T)

For each query, compute dot product with all keys. High dot product = high similarity.

For "it" (query) comparing to:
"animal" (key): score = 0.8 ← high similarity
"street" (key): score = 0.2 ← low similarity
"tired" (key): score = 0.1

Step 3: Scale by sqrt(d_k)

Why divide by sqrt(d_k)? Without it, dot products grow with dimension size, pushing softmax into saturated regions where gradients vanish.

d_k = 64 (typical)
sqrt(64) = 8
Without scaling: scores might be [-8, 12, -6]
softmax([-8, 12, -6]) ≈ [0.000, 1.000, 0.000] # Saturated: gradients ≈ 0
With scaling: scores become [-1.0, 1.5, -0.75]
softmax([-1.0, 1.5, -0.75]) ≈ [0.07, 0.84, 0.09] # Still peaked, but gradients survive
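
You can check this claim empirically. The short sketch below (assuming PyTorch, random unit-variance vectors) shows that the standard deviation of raw dot products grows roughly as sqrt(d_k), while the scaled scores stay near 1:

import torch

# Sketch: dot products of random unit-variance vectors have std ≈ sqrt(d_k);
# dividing by sqrt(d_k) keeps the std ≈ 1 regardless of dimension.
for d_k in [16, 64, 256]:
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.std().item(), (dots / d_k ** 0.5).std().item())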

Step 4: Softmax (Probabilities)

Convert scores to probabilities. Each position’s attention weights sum to 1.

Step 5: Weighted Sum of Values

The output for each position is a weighted average of all values, where weights come from attention scores.

Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V

This single formula is the heart of modern AI.

Think of this formula like a spotlight operator at a concert. The Query is what you’re trying to illuminate. The Keys are labels on different performers. The dot product tells you how relevant each performer is to what you’re looking for. Softmax makes sure your spotlight energy (attention) is distributed as probabilities. And Values are what you actually see when you shine the light there—the content you get back.

Let’s trace through with a tiny example. Sequence: [“I”, “saw”, “it”] with d_model = 4.

# Input embeddings (simplified)
X = [
[1.0, 0.0, 0.5, 0.2], # "I"
[0.3, 0.8, 0.1, 0.9], # "saw"
[0.5, 0.5, 0.5, 0.5], # "it"
]
# Weight matrices (learned during training)
W_Q = [[...]] # 4×4 matrix
W_K = [[...]] # 4×4 matrix
W_V = [[...]] # 4×4 matrix
# Project to Q, K, V
Q = X @ W_Q # [3, 4]
K = X @ W_K # [3, 4]
V = X @ W_V # [3, 4]
# Compute attention scores for position 2 ("it")
# Q[2] @ K.T gives scores for "it" attending to all positions
q_it = Q[2] # [4]
scores_it = q_it @ K.T # [3] - one score per position
# Let's say this gives: [0.1, 0.3, 0.6]
# Scale by sqrt(d_k)
d_k = 4
scaled_scores = scores_it / sqrt(4) # [0.05, 0.15, 0.3]
# Softmax
attention_weights = softmax([0.05, 0.15, 0.3]) # ≈ [0.29, 0.33, 0.38]
# Weighted sum of values
# "it" gets 29% of "I"'s value, 33% of "saw"'s value, 38% of its own value
output_it = 0.29 * V[0] + 0.33 * V[1] + 0.38 * V[2]
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """
    Single-head self-attention.
    This is the core building block of transformers.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]
            mask: Optional attention mask [batch, seq_len, seq_len]
        Returns:
            output: Attended values [batch, seq_len, d_model]
            attention_weights: Attention patterns [batch, seq_len, seq_len]
        """
        # Project to Q, K, V
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)

        # Compute attention scores
        # Q @ K.T: [batch, seq_len, d_model] @ [batch, d_model, seq_len]
        #        = [batch, seq_len, seq_len]
        scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(self.d_model)

        # Apply mask if provided (for causal/padding masking)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax over the last dimension (keys)
        attention_weights = F.softmax(scores, dim=-1)

        # Weighted sum of values
        output = torch.bmm(attention_weights, V)
        return output, attention_weights

# Test it
d_model = 64
seq_len = 10
batch_size = 2

attention = SelfAttention(d_model)
x = torch.randn(batch_size, seq_len, d_model)
output, weights = attention(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Weights sum per query: {weights.sum(dim=-1)}")  # Should be all 1s

Notice how the output shape matches the input shape — attention doesn’t change dimensions, it just mixes information between positions. The attention weights sum to 1 for each query position (thanks to softmax), making them interpretable as “how much attention goes to each key.”

The attention weights form a [seq_len × seq_len] matrix showing how much each position attends to each other position:

"The" "cat" "sat" "on" "it"
"The" [ 0.8 0.1 0.05 0.03 0.02 ]
"cat" [ 0.3 0.5 0.1 0.05 0.05 ]
"sat" [ 0.1 0.6 0.2 0.05 0.05 ]
"on" [ 0.1 0.2 0.3 0.3 0.1 ]
"it" [ 0.1 0.7 0.1 0.05 0.05 ] ← "it" attends strongly to "cat"

This is why attention is so interpretable — you can literally see what the model is “paying attention to.”


Multi-Head Attention: Multiple Perspectives


Single-head attention has a limitation: each position can only compute one weighted average of values. But language has many types of relationships!

Consider: “The cat that I saw yesterday chased the mouse.”

Different aspects we might want to capture:

  • Syntactic: “cat” → “chased” (subject-verb)
  • Semantic: “cat” → “mouse” (predator-prey)
  • Coreference: “that” → “cat” (relative clause)

Multi-head attention runs multiple attention operations in parallel, each learning different relationship types.

Input X
├──────┬──────┬──────┬──────► (split into heads)
│ │ │ │
Head₁ Head₂ Head₃ Head₄
│ │ │ │
└──────┴──────┴──────┘
Concat
Linear (W_O)
Output

Instead of one attention with d_model dimensions, use h heads each with d_k = d_model/h dimensions.

class MultiHeadAttention(nn.Module):
    """
    Multi-head attention mechanism.
    Each head learns different attention patterns, then
    results are concatenated and projected.
    """
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Projections for all heads at once (more efficient)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)

        # Reshape for multi-head: [batch, seq_len, d_model]
        # → [batch, num_heads, seq_len, d_k]
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention for all heads in parallel
        # [batch, heads, seq_len, d_k] @ [batch, heads, d_k, seq_len]
        # = [batch, heads, seq_len, seq_len]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)

        # Apply attention to values
        # [batch, heads, seq_len, seq_len] @ [batch, heads, seq_len, d_k]
        # = [batch, heads, seq_len, d_k]
        attended = torch.matmul(attention_weights, V)

        # Concatenate heads: [batch, heads, seq_len, d_k]
        # → [batch, seq_len, d_model]
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, seq_len, self.d_model)

        # Final linear projection
        output = self.W_o(attended)
        return output, attention_weights

# Test multi-head attention
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)
output, weights = mha(x)

print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
print(f"Attention weights: {weights.shape}")  # [batch, heads, seq_len, seq_len]

Notice how each head operates on d_model/num_heads dimensions (64/8 = 8 in this case). The heads run in parallel, each learning its own attention pattern, then the results are concatenated and projected back to d_model dimensions.

Did You Know? In the original transformer paper, the authors used 8 attention heads. When researchers later visualized what each head learned, they found remarkable specialization: some heads tracked syntactic dependencies (subject-verb), others tracked positional relationships (adjacent words), and some captured semantic similarity. The model “discovered” linguistics without being taught!


Positional Encoding: Teaching Order to an Orderless System


Here’s a subtle but critical problem with self-attention: it’s permutation invariant.

If you shuffle the input tokens, the attention computation doesn’t change (each token still attends to all others). But word order matters! “Dog bites man” ≠ “Man bites dog.”

The solution is positional encoding: add information about position to each token.

The original paper used sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This creates a unique pattern for each position that the model can learn to interpret.

Worked Example: Computing Positional Encodings


Let’s compute actual values for d_model = 8 (tiny, for illustration):

Position 0, dimensions 0-7:
dim 0: sin(0 / 10000^(0/8)) = sin(0) = 0.000
dim 1: cos(0 / 10000^(0/8)) = cos(0) = 1.000
dim 2: sin(0 / 10000^(2/8)) = sin(0) = 0.000
dim 3: cos(0 / 10000^(2/8)) = cos(0) = 1.000
...
PE[0] = [0.000, 1.000, 0.000, 1.000, 0.000, 1.000, 0.000, 1.000]
Position 1, dimensions 0-7:
dim 0: sin(1 / 10000^(0/8)) = sin(1.0) = 0.841
dim 1: cos(1 / 10000^(0/8)) = cos(1.0) = 0.540
dim 2: sin(1 / 10000^(2/8)) = sin(0.1) = 0.100
dim 3: cos(1 / 10000^(2/8)) = cos(0.1) = 0.995
...
PE[1] = [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]
Position 100:
PE[100] = [-0.506, 0.863, -0.544, -0.839, 0.842, 0.540, 0.100, 0.995]

Notice how:

  • Low dimensions (0, 1) oscillate rapidly — they change significantly between adjacent positions
  • High dimensions (6, 7) oscillate slowly — they’re almost constant for nearby positions
  • Position 0 is special — all sine terms are 0, all cosine terms are 1
  • Each position gets a unique fingerprint that the model learns to decode

Why sinusoids?

  1. Bounded values: Always between -1 and 1
  2. Unique patterns: Each position has a distinct encoding
  3. Relative positions: The model can learn to compute PE(pos+k) from PE(pos)
  4. Generalization: Works for sequences longer than training data
class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding from "Attention Is All You Need."
    Adds position information to token embeddings so the model
    knows word order.
    """
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()

        # Compute the divisor term
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            (-math.log(10000.0) / d_model)
        )

        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add batch dimension and register as buffer (not a parameter)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Token embeddings [batch, seq_len, d_model]
        Returns:
            x + positional encoding
        """
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Visualize positional encodings
pe = PositionalEncoding(d_model=64, max_len=100)
positions = pe.pe[0, :50, :].numpy()  # First 50 positions
# Each row is a position, each column is a dimension
# You'd see wave patterns at different frequencies

Notice how the encoding is computed once and stored as a buffer (not a parameter). This is because positional encodings are fixed — they don’t change during training. The register_buffer call ensures they’re saved with the model but not updated by the optimizer.

Modern models often use learned positional embeddings instead:

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        return x + self.position_embeddings(positions)

Notice how learned embeddings use nn.Embedding — the same mechanism used for token embeddings. The position indices (0, 1, 2, …) are looked up in a learned table. This is simpler than sinusoidal but limited to the maximum length seen during training.

BERT uses learned positions. GPT uses learned positions. The original transformer used sinusoidal. Both work well.

Did You Know? RoPE (Rotary Position Embedding), invented by Jianlin Su in 2021, is now used in most modern LLMs including LLaMA, Mistral, and many others. Instead of adding positional information, it rotates the query and key vectors based on position. This elegant approach allows models to better capture relative positions and generalize to longer sequences than seen during training.
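
As a rough illustration of the idea (a simplified sketch, not the exact formulation used in LLaMA or any production model), RoPE rotates consecutive pairs of query/key dimensions by angles that depend on position:

import math
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of dimensions by position-dependent angles (sketch).
    x: [seq_len, d] with d even; applied to queries and keys before the dot product."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len).float().unsqueeze(1)                        # [seq_len, 1]
    freqs = torch.exp(-math.log(base) * torch.arange(0, d, 2).float() / d)  # [d/2]
    angles = pos * freqs                                                    # [seq_len, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                         # even/odd pairs
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated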


Now let’s assemble the complete transformer encoder block. Each block contains:

  1. Multi-Head Self-Attention (with residual connection + layer norm)
  2. Feed-Forward Network (with residual connection + layer norm)
Input
├──────────────────────┐
│ │
Multi-Head │
Attention │
│ │
└──────► Add ◄─────────┘
LayerNorm
├──────────────────────┐
│ │
Feed-Forward │
Network │
│ │
└──────► Add ◄─────────┘
LayerNorm
Output

A simple two-layer MLP applied independently to each position:

class FeedForward(nn.Module):
    """
    Position-wise feed-forward network.
    Applied independently to each position (no interaction between positions).
    Typically expands to 4x d_model then contracts back.
    """
    def __init__(self, d_model: int, d_ff: int = None, dropout: float = 0.1):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # Default: 4x expansion
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # Modern choice; original used ReLU
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TransformerEncoderBlock(nn.Module):
    """
    A single transformer encoder block.
    Contains multi-head attention and feed-forward network,
    each with residual connections and layer normalization.
    """
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int = None,
        dropout: float = 0.1
    ):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        # Self-attention with residual connection
        attended, attention_weights = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attended))

        # Feed-forward with residual connection
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))

        return x, attention_weights

Stack multiple blocks:

class TransformerEncoder(nn.Module):
    """
    Complete transformer encoder.
    Stacks multiple encoder blocks with shared positional encoding.
    """
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        max_len: int = 5000,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model

        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding
        self.positional_encoding = PositionalEncoding(d_model, max_len, dropout)

        # Stack of encoder blocks
        self.layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Final layer norm
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        # Embed tokens and add positional encoding
        x = self.token_embedding(x) * math.sqrt(self.d_model)
        x = self.positional_encoding(x)

        # Pass through encoder blocks
        attention_weights = []
        for layer in self.layers:
            x, attn = layer(x, mask)
            attention_weights.append(attn)

        x = self.norm(x)
        return x, attention_weights

# Create a small transformer encoder
encoder = TransformerEncoder(
    vocab_size=10000,
    d_model=256,
    num_heads=8,
    num_layers=4,
    d_ff=1024
)

# Test with random token IDs
tokens = torch.randint(0, 10000, (2, 50))  # Batch of 2, length 50
output, attentions = encoder(tokens)

print(f"Input tokens: {tokens.shape}")
print(f"Output embeddings: {output.shape}")
print(f"Number of attention matrices: {len(attentions)}")
print(f"Each attention matrix: {attentions[0].shape}")

Notice how the encoder returns both the output embeddings and all attention weights. The attention weights are invaluable for interpretability — you can visualize exactly what each layer and head is “paying attention to.” This transparency is one of the reasons transformers became so popular in research.


Encoder vs. Decoder: Two Flavors of Transformer


The original transformer paper described an encoder-decoder architecture for translation. Let’s understand the difference:

Encoder-only (BERT-style):

  • Sees the entire input at once
  • Each position attends to all other positions
  • Used for: classification, named entity recognition, embeddings
Input: [CLS] The cat sat [SEP]
↓ ↓ ↓ ↓ ↓
←───────────────────→ (bidirectional attention)
↓ ↓ ↓ ↓ ↓
Output: h_cls h_1 h_2 h_3 h_sep
Decoder-only (GPT-style):

  • Generates tokens one at a time
  • Each position can only attend to previous positions (causal masking)
  • Used for: text generation, chat, code completion
Input: The cat sat
↓ ↓ ↓
←─ ←── ←─── (causal attention: can only look back)
↓ ↓ ↓
Output: cat sat on

To prevent the decoder from “cheating” by looking at future tokens:

def create_causal_mask(seq_len: int) -> torch.Tensor:
    """
    Create a causal mask for decoder self-attention.
    Lower triangular matrix: position i can attend to positions 0...i
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

# Example for seq_len=5
mask = create_causal_mask(5)
print(mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Position 0 can only see position 0
# Position 1 can see positions 0, 1
# Position 4 can see positions 0, 1, 2, 3, 4

Notice how torch.tril (lower triangular) gives us exactly what we need — each row allows attention only to that position and earlier ones. Where the mask is 0, we set the attention score to negative infinity before softmax, making those attention weights effectively zero.

Encoder-decoder (the original transformer):

  • Encoder processes source sequence (bidirectional)
  • Decoder generates target sequence (causal)
  • Decoder also attends to encoder outputs (cross-attention; see the sketch below)
  • Used for: translation, summarization
Source: "Le chat" → Encoder → encoded representations
Target: "The cat" → Decoder ← cross-attention
Output: "The cat"

Did You Know? GPT-3 and all subsequent GPT models are decoder-only. BERT is encoder-only. The original transformer was encoder-decoder. Surprisingly, decoder-only models turned out to be the best for general language understanding, despite being designed for generation. This was unexpected — many researchers thought bidirectional context (encoder-style) would always be superior.


Self-attention has a critical limitation: quadratic complexity in sequence length.

For a sequence of length n:

  • Attention scores matrix: n × n
  • Memory: O(n²)
  • Computation: O(n²)

This means:

  • 1000 tokens: 1 million attention computations
  • 10,000 tokens: 100 million attention computations
  • 100,000 tokens: 10 billion attention computations
Sequence Length   Memory (FP16)   Relative Cost
512               ~1 MB           1×
2,048             ~16 MB          16×
8,192             ~256 MB         256×
32,768            ~4 GB           4,096×
131,072           ~64 GB          65,536×

This is why early transformers were limited to 512-2048 tokens, and why research into efficient attention is so active.

Many techniques exist to reduce complexity:

  1. Sparse Attention (BigBird, Longformer): Only attend to subset of positions
  2. Linear Attention (Performer, Linear Transformer): Approximate attention with O(n)
  3. Flash Attention: Same math, but optimized memory access patterns (much faster!)
  4. Sliding Window: Only attend to local context (Mistral uses this)

Did You Know? Flash Attention, created by Tri Dao at Stanford in 2022, doesn’t change the attention math at all — it computes exactly the same result. But by optimizing how data moves between GPU memory levels (exploiting the memory hierarchy), it achieves 2-4× speedup and uses 5-20× less memory. This is why modern LLMs can handle 128K+ context windows. The insight was that attention is memory-bound, not compute-bound.
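
In practice you rarely write the attention math by hand anymore. PyTorch 2.x exposes a fused kernel, torch.nn.functional.scaled_dot_product_attention, which computes the same result and can dispatch to Flash-Attention-style implementations when the hardware supports it. A small sketch (shapes are illustrative):

import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 1024, 64)   # [batch, heads, seq_len, d_k]
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)

# Same math as softmax(QK^T / sqrt(d_k)) @ V, with an optional causal mask
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                  # torch.Size([2, 8, 1024, 64])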


In 2020, Google showed that transformers could match or beat CNNs on images: the Vision Transformer (ViT).

Instead of processing pixels, split the image into patches and treat each patch as a token:

224×224 image with 16×16 patches = 14×14 = 196 "tokens"
┌────┬────┬────┬────┐
│ P1 │ P2 │ P3 │... │
├────┼────┼────┼────┤
│ P15│ P16│ P17│... │
├────┼────┼────┼────┤
│... │... │... │... │
└────┴────┴────┴────┘
Each patch (16×16×3 = 768 values) becomes a token embedding
class PatchEmbedding(nn.Module):
    """
    Convert image into sequence of patch embeddings.
    """
    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,
        in_channels: int = 3,
        d_model: int = 768
    ):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2

        # Conv2d is an efficient way to extract and project patches
        self.projection = nn.Conv2d(
            in_channels, d_model,
            kernel_size=patch_size,
            stride=patch_size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, height, width]
        x = self.projection(x)   # [batch, d_model, h/p, w/p]
        x = x.flatten(2)         # [batch, d_model, num_patches]
        x = x.transpose(1, 2)    # [batch, num_patches, d_model]
        return x
Image → Patch Embedding → [CLS] + Patches → Position Embedding
      → Transformer Encoder (×12)
      → CLS token → Classification Head

The [CLS] token is a learnable embedding prepended to the patch sequence. After passing through the transformer, it contains information about the entire image.
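
A minimal sketch of that step (names and shapes are illustrative, not the original ViT code):

import torch
import torch.nn as nn

batch_size, num_patches, d_model = 8, 196, 768
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))      # learnable [CLS] embedding

patches = torch.randn(batch_size, num_patches, d_model)   # output of PatchEmbedding
cls = cls_token.expand(batch_size, -1, -1)                 # [batch, 1, d_model]
tokens = torch.cat([cls, patches], dim=1)                  # [batch, 197, d_model]
# After the encoder, tokens[:, 0] (the CLS position) feeds the classification head.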

Did You Know? The original ViT paper (Dosovitskiy et al., 2020) had a surprising finding: ViT underperformed CNNs when trained on ImageNet alone, but dramatically outperformed them when pretrained on larger datasets (JFT-300M with 300 million images). This suggested that transformers need more data to learn good visual features, but once they have enough data, they scale better than CNNs.


Let’s revisit the Heureka Moment. Why did transformers take over AI?

RNNs: Sequential (slow to train)
Transformers: Fully parallel (fast to train)

This meant transformers could be trained on much more data in the same time.

RNNs: Information degrades over distance
Transformers: Every token directly attends to every other token

A word 500 positions away is just as easy to attend to as an adjacent word.

The “scaling laws” discovered by OpenAI showed that transformers improve predictably with:

  • More parameters
  • More data
  • More compute

This predictability enabled massive investment in scaling (GPT-3, GPT-5, etc.).

Pretrained transformers transfer remarkably well:

  • BERT: Pretrain on text → fine-tune for any NLP task
  • GPT: Pretrain on text → few-shot learning on any task
  • ViT: Pretrain on images → transfer to any vision task

Attention weights are somewhat interpretable. You can visualize what the model is “looking at.” This isn’t possible with RNNs or CNNs in the same way.


# Wrong - gradients will vanish in softmax
scores = Q @ K.T
attention = softmax(scores)

# Correct - scale by sqrt(d_k)
scores = Q @ K.T / math.sqrt(d_k)
attention = softmax(scores)

# Wrong - mask should be [batch, 1, seq_len, seq_len] or broadcastable
mask = torch.ones(batch_size, seq_len, seq_len)  # Missing head dimension

# Correct for multi-head attention
mask = torch.ones(batch_size, 1, seq_len, seq_len)  # Broadcasts over heads

# Wrong - no position information!
x = token_embedding(tokens)
output = transformer(x)

# Correct
x = token_embedding(tokens)
x = positional_encoding(x)  # PositionalEncoding adds the position info internally
output = transformer(x)

# For batches with different sequence lengths, you need padding masks
# to prevent attending to padding tokens
def create_padding_mask(seq, pad_token=0):
    """Create mask where padding positions are 0."""
    return (seq != pad_token).unsqueeze(1).unsqueeze(2)

# Dangerous - might OOM
long_sequence = torch.randn(1, 10000, 512)
attention = MultiHeadAttention(512, 8)
output = attention(long_sequence)  # 10000×10000 attention matrix!

# Safer - use flash attention or limit sequence length
# Or use a model designed for long contexts (Longformer, etc.)

Q1: Why do we divide attention scores by sqrt(d_k)?

Answer: Without scaling, the variance of the dot products grows proportionally to d_k (their typical magnitude grows like sqrt(d_k)). Large dot products push softmax into saturation where gradients approach zero. Dividing by sqrt(d_k) keeps the variance of scores roughly constant regardless of dimension, maintaining healthy gradients.

Q2: What’s the difference between encoder and decoder transformers?

Answer: Encoder (BERT-style): Bidirectional attention — each position attends to all positions. Used for understanding/classification.

Decoder (GPT-style): Causal/autoregressive attention — each position only attends to previous positions. Used for generation.

The key difference is the attention mask: encoders use no mask (or just padding mask), while decoders use a causal mask that prevents looking ahead.

Q3: Why do transformers need positional encoding?

Answer: Self-attention is permutation invariant — if you shuffle the input, the computation doesn't change (ignoring position encoding). But word order matters: "dog bites man" ≠ "man bites dog."

Positional encoding adds position information to each token embedding, allowing the model to distinguish between “cat at position 1” and “cat at position 10.”

Q4: Why is attention O(n²) and why does it matter?

Answer: Each of n positions computes attention scores with all n positions, giving n×n scores. This means:

  • Memory grows quadratically with sequence length
  • Compute grows quadratically with sequence length

A 10× longer sequence needs 100× more memory/compute. This limits context windows and makes very long documents challenging. It’s why there’s active research on efficient attention (linear attention, sparse attention, Flash Attention).

Q5: What are the three weight matrices in attention (W_Q, W_K, W_V) for?

Answer:

  • W_Q (Query): Projects each token into "what am I looking for" space
  • W_K (Key): Projects each token into "what can I be found by" space
  • W_V (Value): Projects each token into "what information do I contain" space

Separating these allows different transformations for matching (Q·K) versus content retrieval (V). A pronoun like “it” needs to find its antecedent (high Q similarity with nouns) but contains little information itself (low V magnitude).


Transformers revolutionized AI but at significant computational cost:

Training Costs (Estimated):

Model      Parameters   Training Cost   Training Time
GPT-2      1.5B         ~$50K           ~1 week
GPT-3      175B         ~$4.6M          ~3 months
GPT-5      ~1.7T        ~$100M          ~6 months
Claude 3   Unknown      ~$50-100M       Unknown

Inference Costs per 1M Tokens:

Model Size     Input Cost     Output Cost
7B params      $0.10-0.50     $0.30-1.00
70B params     $0.50-2.00     $2.00-5.00
400B+ params   $2.00-10.00    $10.00-30.00

The O(n²) attention cost means context length has outsized impact:

Context Length   Relative Memory   Relative Compute
4K tokens        1×                1×
32K tokens       64×               64×
128K tokens      1,024×            1,024×

Flash Attention’s Impact: Reduces memory by 5-20× and speeds up by 2-4×, making 100K+ context practical.

Career impact data (from industry surveys):

  • Transformer expertise salary premium: +$20,000-40,000/year
  • AI/ML roles requiring transformer knowledge: 85%+
  • Time to proficiency: 2-4 weeks of dedicated study
  • Most in-demand sub-skills: attention visualization, efficient inference, fine-tuning

Q1: “Walk me through the self-attention mechanism.”

Strong Answer: “Self-attention allows each position in a sequence to gather information from all other positions. We project input embeddings into three representations: Query (what am I looking for), Key (what can I offer), and Value (what information do I contain). We compute attention scores by taking the dot product of Query with all Keys, scaled by sqrt(d_k) to prevent gradient vanishing. After softmax normalization, these scores weight the Values to produce an output that’s a learned combination of all positions. The key insight is that attention patterns are learned—the model discovers what relationships matter.”

Q2: “Why did transformers replace RNNs?”

Strong Answer: “Three fundamental reasons. First, parallelization: RNNs process sequentially (h_t depends on h_{t-1}), while transformers process all positions simultaneously, enabling massive GPU parallelism. Second, long-range dependencies: RNNs suffer from vanishing gradients over distance, but transformer attention gives each position direct access to every other position. Third, scalability: transformers follow predictable scaling laws—more parameters and data reliably improve performance—enabling systematic investment in larger models.”

Q3: “What’s the role of positional encoding?”

Strong Answer: “Self-attention is permutation invariant—shuffling inputs doesn’t change outputs. But word order matters: ‘dog bites man’ differs from ‘man bites dog.’ Positional encoding injects position information into embeddings. The original paper used sinusoidal functions: different frequencies across dimensions create unique fingerprints per position. Modern models often use learned embeddings instead, or RoPE (Rotary Position Embedding) which rotates vectors based on position for better relative position handling and length generalization.”

Q4: “Explain multi-head attention and why it’s useful.”

Strong Answer: “Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows capturing different relationship types simultaneously: one head might learn syntactic patterns (subject-verb), another semantic relationships (cause-effect), another positional patterns (adjacent words). Each head operates on d_model/num_heads dimensions, then outputs are concatenated and projected. It’s like having multiple experts each focusing on different aspects of the relationships in text.”

Q5: “How would you debug a transformer that’s not learning?”

Strong Answer: “Systematic approach: First, verify attention patterns—visualize attention weights to check if the model is learning meaningful patterns or just attending uniformly. Second, check gradient flow—attention scores should have healthy magnitude (not too large pre-softmax). Third, verify positional encoding is being added correctly. Fourth, check masking—decoder models need causal masks, and padding tokens should be masked. Fifth, start with a tiny dataset and overfit—if the model can’t memorize a few examples, there’s a fundamental bug. Sixth, compare against a known working implementation on the same data.”

Q: “Design a transformer-based document search system.”

Strong Answer Structure:

  1. Architecture Choice: “Use a bi-encoder approach with BERT-style encoder. Documents are embedded offline, queries are embedded at search time. This enables sub-second search over millions of documents.”

  2. Embedding Strategy: “Use mean pooling or CLS token for document representation. Consider chunking long documents (512 token limit) and taking max or mean over chunks. Fine-tune on in-domain data if available.”

  3. Index Structure: “Store embeddings in a vector database (Pinecone, Milvus, FAISS). Use approximate nearest neighbor search for scalability. Combine with keyword search (BM25) in a hybrid approach for best results.”

  4. Efficiency Considerations: “Quantize embeddings (float32→int8) to reduce storage 4×. Use dimensionality reduction if latency-critical. Consider caching frequent queries.”

  5. Quality Improvements: “Add cross-encoder reranking on top-K results for higher precision. Fine-tune on click data for relevance signals. A/B test embedding models.”


Exercise 1: Implement Attention from Scratch


Build your own attention mechanism without using PyTorch’s built-in functions:

import torch
import math

def manual_attention(Q, K, V, mask=None):
    """
    Implement scaled dot-product attention from scratch.
    Args:
        Q: Queries [batch, seq_len, d_k]
        K: Keys [batch, seq_len, d_k]
        V: Values [batch, seq_len, d_v]
        mask: Optional attention mask
    Returns:
        output: Attended values
        attention_weights: Attention patterns
    """
    # YOUR CODE HERE:
    # 1. Compute Q @ K.T
    # 2. Scale by sqrt(d_k)
    # 3. Apply mask if provided
    # 4. Softmax
    # 5. Multiply by V
    pass

# Test your implementation
Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
output, weights = manual_attention(Q, K, V)

# Verify: attention weights should sum to 1 per query
assert torch.allclose(weights.sum(dim=-1), torch.ones(2, 10))

Success Criteria: Attention weights sum to 1, output shape matches input.

Create attention visualizations for real text:

from transformers import AutoTokenizer, AutoModel
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(model, tokenizer, text):
    """
    Visualize attention patterns for a given text.
    1. Tokenize the input
    2. Run through model with output_attentions=True
    3. Extract attention weights from a specific layer/head
    4. Create heatmap visualization
    """
    # YOUR CODE HERE
    pass

# Test with sentences that have interesting attention patterns
sentences = [
    "The cat sat on the mat because it was tired.",
    "The trophy didn't fit in the suitcase because it was too big.",
    "The python ate the mouse quickly and then slept.",
]
# Visualize attention for pronouns resolving to their antecedents

Deliverable: Heatmaps showing attention patterns with clear pronoun resolution.

Implement a minimal GPT-style model:

class MiniGPT(nn.Module):
    """
    Minimal GPT-style decoder-only transformer.
    Components needed:
    1. Token embedding
    2. Positional encoding
    3. Causal masked multi-head attention
    4. Feed-forward network
    5. Output projection to vocabulary
    """
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_len):
        super().__init__()
        # YOUR CODE HERE
        pass

    def forward(self, x):
        # Remember to apply causal mask!
        pass

    def generate(self, prompt_tokens, max_new_tokens=50):
        """Autoregressive generation."""
        # YOUR CODE HERE
        pass

# Train on a small corpus (Shakespeare, code, etc.)
# Verify it can generate coherent text

Success Criteria: Model generates somewhat coherent text after training.

Benchmark standard attention vs optimized versions:

import time
import torch

def benchmark_attention(seq_lengths, d_model=512, num_heads=8, num_trials=10):
    """
    Measure attention time and memory for different sequence lengths.
    1. Standard attention
    2. torch.nn.functional.scaled_dot_product_attention (if available)
    3. Flash attention (if installed)
    Return timing and memory data for plotting.
    """
    results = []
    for seq_len in seq_lengths:
        # YOUR CODE HERE
        pass
    return results

# Test with seq_lengths = [512, 1024, 2048, 4096, 8192]
# Plot the O(n²) growth
# Compare with optimized implementations

Deliverable: Graph showing quadratic growth of attention and efficiency of optimizations.


In 2019, Richard Sutton wrote “The Bitter Lesson,” arguing that general methods leveraging computation (like search and learning) ultimately beat approaches encoding human knowledge.

Transformers proved him right spectacularly:

  • They don’t encode linguistic rules—they learn them
  • They don’t have hand-crafted features—just attention
  • They scale predictably with compute

The “bitter” part? Decades of NLP research on parsing, syntax trees, and linguistic features became largely obsolete overnight. The winning strategy was: simple architecture + massive scale.

“70 years of AI research, and the answer turns out to be: matrix multiplication and gradient descent, but lots of it.” — Anonymous ML researcher on Twitter


Original Transformer Authors:

  • Ashish Vaswani (@ashaborali) - Co-founder of Essential AI
  • Noam Shazeer - Co-founder of Character.AI, returned to Google
  • Jakob Uszkoreit - Co-founder of Inceptive (RNA design with transformers)

Modern Practitioners:

  • Andrej Karpathy (@karpathy) - Former Tesla AI Director, incredible educational content
  • Jay Alammar - Author of “The Illustrated Transformer”
  • Tri Dao - Flash Attention creator, now at Together AI
  • Sebastian Raschka (@rasbt) - Excellent books and papers on LLMs

Efficiency:

  • Mixture of Experts (MoE) - Use only a fraction of parameters per forward pass
  • State Space Models (Mamba) - Linear complexity alternative to attention
  • Speculative Decoding - Faster inference with draft models

Alignment & Capabilities:

  • Constitutional AI - Training models to follow principles
  • RLHF and DPO - Aligning models with human preferences
  • Multimodal - Single models for text, image, audio, video

Understanding:

  • Mechanistic Interpretability - Understanding what transformers actually learn internally
  • Emergent Abilities - Capabilities that appear suddenly at scale
  • In-Context Learning - How transformers learn from examples in the prompt

  1. “Attention Is All You Need” (Vaswani et al., 2017) — The original transformer paper
  2. “BERT: Pre-training of Deep Bidirectional Transformers” (Devlin et al., 2018) — Encoder-only
  3. “Language Models are Few-Shot Learners” (Brown et al., 2020) — GPT-3, scaling laws
  4. “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020) — Vision Transformer
  5. “FlashAttention” (Dao et al., 2022) — Efficient attention implementation

You’ve learned:

  1. Why RNNs failed: Sequential processing, vanishing gradients, long-range difficulties
  2. Self-attention: Q, K, V projections → scaled dot product → softmax → weighted sum
  3. Multi-head attention: Multiple attention patterns in parallel
  4. Positional encoding: Adding position information to an orderless system
  5. Transformer architecture: Attention + FFN + residuals + layer norm
  6. Encoder vs decoder: Bidirectional vs causal attention
  7. The O(n²) problem: Why long sequences are expensive
  8. Vision Transformers: Patches as tokens

The transformer architecture is the foundation of modern AI. GPT, BERT, Claude, Gemini, Stable Diffusion, DALL-E — they all build on the ideas in this module.


Attention is all you need — and now you understand why.

The insight is elegant: instead of processing sequences step by step, let every position look at every other position simultaneously. Let the model learn what to pay attention to through training.

This simple change:

  • Enabled massive parallelization (faster training)
  • Eliminated the information bottleneck of sequential processing
  • Made scaling predictable and efficient
  • Created a universal architecture for text, images, audio, video, and more

When you use ChatGPT, Claude, or any modern AI system, you’re using transformers. You now understand the core mechanism that makes them work.


Move on to Module 31: Backpropagation Deep Dive where you’ll learn:

  • How gradients actually flow through networks
  • The chain rule in detail
  • Implementing autograd from scratch
  • Debugging gradient issues

This completes the “how neural networks learn” trilogy: forward pass (Module 26), training techniques (Module 28), and now the math of learning itself.


Last updated: 2025-12-11 Status: Complete