Backpropagation Deep Dive
Цей контент ще не доступний вашою мовою.
AI/ML Engineering Track | Complexity:
[COMPLEX]| Time: 5-6
Or: The Architecture That Ate AI (And Why “Attention Is All You Need”)
Section titled “Or: The Architecture That Ate AI (And Why “Attention Is All You Need”)”Reading Time: 8-10 hours Prerequisites: Module 29
This is a Heureka Moment module.
When eight researchers at Google discovered in 2017 that they could throw away two decades of sequence modeling wisdom and build something better, the AI community was skeptical. After all, recurrent neural networks had dominated for years. But within three years, their “Attention Is All You Need” paper would become the most cited work in AI history, and every major language model—gpt-5, Claude, Gemini—would be built on their foundation.
The Paper That Changed Everything
Section titled “The Paper That Changed Everything”Mountain View, California. June 12, 2017. 2:34 AM.
Ashish Vaswani and his Google Brain colleagues huddled around a laptop in a cramped conference room. They had been training their experimental model for days, and the results were finally in.
“That can’t be right,” Jakob Uszkoreit muttered, staring at the screen.
They had just achieved state-of-the-art performance on machine translation—beating every existing system. But that wasn’t what shocked them. It was what they had removed to get there.
No recurrent connections. No convolutions. Just attention. “Attention is all you need,” Vaswani typed as the paper’s title, half-joking. But he wasn’t joking.
“We expected attention to be helpful alongside RNNs, not to replace them entirely. When we removed the recurrence and the model got better, we knew we had found something fundamental.” — Ashish Vaswani, Transformer co-inventor
That paper, with its eight co-authors and deceptively simple title, would go on to become the most cited paper in AI history. Every modern language model—gpt-5, Claude, Gemini, LLaMA—is a direct descendant of what they built that night.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”By the end of this module, you will:
- Understand why sequential models (RNNs/LSTMs) hit a wall
- Master the self-attention mechanism from intuition to implementation
- Build multi-head attention from scratch
- Understand positional encoding and why it’s necessary
- Implement a complete transformer encoder
- Know why “Attention Is All You Need” changed everything
- See how Vision Transformers apply attention to images
- Grasp the O(n²) complexity and its implications
The Heureka Moment: What You’re About to Discover
Section titled “The Heureka Moment: What You’re About to Discover”Here’s the transformative insight you’ll gain from this module:
Before transformers, neural networks processed sequences one step at a time — like reading a book word by word, unable to look back or ahead without significant effort.
After transformers, neural networks can look at everything simultaneously and learn what to pay attention to — like being able to see an entire book at once and instantly knowing which parts are relevant to each other.
This single architectural change is why we have gpt-5, Claude, Gemini, and virtually every modern AI system. By the end of this module, you’ll understand exactly how it works.
The Problem: Why RNNs Hit a Wall
Section titled “The Problem: Why RNNs Hit a Wall”Before 2017, if you wanted to process sequences (text, audio, time series), you used Recurrent Neural Networks (RNNs) or their improved variant, LSTMs.
How RNNs Work
Section titled “How RNNs Work”An RNN processes sequences one element at a time, maintaining a “hidden state” that carries information forward:
Input: "The cat sat on the mat" ↓ ↓ ↓ ↓ ↓ ↓ h₀ → h₁ → h₂ → h₃ → h₄ → h₅ → h₆ ↓ OutputEach hidden state h_t depends on the previous state h_{t-1} and the current input. Information flows forward through time.
The Three Fatal Flaws
Section titled “The Three Fatal Flaws”1. Sequential Processing (Slow)
RNNs must process tokens one at a time. You can’t parallelize because h₃ depends on h₂ which depends on h₁. Training on a 1000-token sequence takes 1000 sequential steps.
Modern GPUs have thousands of cores, but RNNs can only use one at a timefor each sequence position. It's like having a 1000-lane highway butbeing forced to drive in a single lane.2. Vanishing/Exploding Gradients (Again)
Information must travel through many timesteps. For a 100-word sentence, the gradient for word 1 must backpropagate through 99 steps. Remember Module 28? The same multiplication problem kills gradients in RNNs.
Did You Know? LSTMs (Long Short-Term Memory) were invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997 specifically to address the vanishing gradient problem. Their “gating mechanism” allows information to flow unchanged through time — similar to ResNet’s skip connections. LSTMs dominated sequence modeling for 20 years until transformers arrived.
3. Long-Range Dependencies (Difficult)
For the sentence: “The cat, which had been sleeping on the warm sunny windowsill for most of the afternoon, finally woke up and stretched.”
What does “stretched” refer to? The cat! But there are 16 words between “cat” and “stretched.” RNNs struggle to maintain this connection over long distances.
The LSTM Partial Solution
Section titled “The LSTM Partial Solution”LSTMs add “gates” that control information flow:
┌─────────────────────────────────────────────┐│ LSTM Cell ││ ││ forget_gate = σ(W_f · [h_{t-1}, x_t]) │ ← What to forget│ input_gate = σ(W_i · [h_{t-1}, x_t]) │ ← What to add│ output_gate = σ(W_o · [h_{t-1}, x_t]) │ ← What to output│ ││ cell_state = forget_gate * c_{t-1} ││ + input_gate * tanh(W_c · [h_{t-1}, x_t])│ ││ h_t = output_gate * tanh(cell_state) │└─────────────────────────────────────────────┘LSTMs helped, but didn’t solve the fundamental problem: you still have to process sequentially.
The Revolution: “Attention Is All You Need”
Section titled “The Revolution: “Attention Is All You Need””In June 2017, eight researchers at Google published a paper that would reshape AI: “Attention Is All You Need.”
Did You Know? The transformer paper has been cited over 130,000 times — more than almost any computer science paper ever written. The eight authors (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin) are now scattered across Google, Google DeepMind, and various AI startups. Noam Shazeer later co-founded Character.AI and then returned to Google. The paper’s title was intentionally provocative — they were claiming that attention alone, without any recurrence, could match or beat RNN/LSTM models.
The key insight was radical: throw away recurrence entirely. Don’t process sequences one step at a time. Instead, let every token look at every other token simultaneously and learn which connections matter.
Think of the difference like reading a book. An RNN is like reading with a tiny flashlight that only illuminates one word at a time—you have to remember what you read before. A transformer is like turning on all the lights at once and seeing the entire page, instantly noticing which words relate to each other. This parallel vision is what makes transformers so powerful.
This is the attention mechanism.
Self-Attention: The Core Innovation
Section titled “Self-Attention: The Core Innovation”The Intuition: A Different Kind of Search
Section titled “The Intuition: A Different Kind of Search”Think of attention as a soft database lookup.
In a traditional database, you have:
- A query (what you’re looking for)
- Keys (labels on stored items)
- Values (the actual stored content)
You find exact matches: SELECT value WHERE key = query
Attention does the same thing, but with soft matching:
- Instead of exact matches, compute similarity scores
- Instead of returning one result, return a weighted combination of all values
- The weights are learned, not hand-coded
The Analogy: Finding Relevant Context
Section titled “The Analogy: Finding Relevant Context”Imagine you’re reading the sentence: “The animal didn’t cross the street because it was too tired.”
What does “it” refer to? The animal or the street?
As a human, you instantly know it’s “the animal” — streets don’t get tired. How did you know? You looked at “it,” then looked at all other words, and found that “animal” + “tired” makes semantic sense while “street” + “tired” doesn’t.
This is exactly what self-attention does. For each word:
- Create a “query” representing “what am I looking for?”
- Create “keys” for all words representing “what can I offer?”
- Compare query to all keys (similarity scores)
- Use scores to weight the “values” (actual content)
The Math: Query, Key, Value
Section titled “The Math: Query, Key, Value”Given an input sequence of embeddings X (shape: [seq_len, d_model]):
Q = X @ W_Q # Queries: what each position is looking forK = X @ W_K # Keys: what each position can be found byV = X @ W_V # Values: what each position contains
# Compute attention scoresscores = Q @ K.T / sqrt(d_k) # [seq_len, seq_len]
# Convert to probabilitiesattention_weights = softmax(scores) # Each row sums to 1
# Weighted sum of valuesoutput = attention_weights @ V # [seq_len, d_model]Let’s break this down:
Step 1: Project to Q, K, V
Each input embedding gets transformed into three different representations:
- Query (Q): “When I need context, what should I look for?”
- Key (K): “What kind of context can I provide to others?”
- Value (V): “What information do I actually contain?”
These are separate because what you’re looking for (Q) might be different from what you offer (K). For example, a pronoun like “it” is looking for its antecedent, but offers very little information itself.
Step 2: Compute Similarity (Q @ K.T)
For each query, compute dot product with all keys. High dot product = high similarity.
For "it" (query) comparing to: "animal" (key): score = 0.8 ← high similarity "street" (key): score = 0.2 ← low similarity "tired" (key): score = 0.1Step 3: Scale by sqrt(d_k)
Why divide by sqrt(d_k)? Without it, dot products grow with dimension size, pushing softmax into saturated regions where gradients vanish.
d_k = 64 (typical)sqrt(64) = 8
Without scaling: scores might be [-100, 150, -80]softmax([-100, 150, -80]) ≈ [0, 1, 0] # One-hot, no gradient
With scaling: scores become [-12.5, 18.75, -10]softmax([-12.5, 18.75, -10]) ≈ [0.001, 0.998, 0.001] # Still peaked but gradient existsStep 4: Softmax (Probabilities)
Convert scores to probabilities. Each position’s attention weights sum to 1.
Step 5: Weighted Sum of Values
The output for each position is a weighted average of all values, where weights come from attention scores.
The Formula (Complete)
Section titled “The Formula (Complete)”Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ VThis single formula is the heart of modern AI.
Think of this formula like a spotlight operator at a concert. The Query is what you’re trying to illuminate. The Keys are labels on different performers. The dot product tells you how relevant each performer is to what you’re looking for. Softmax makes sure your spotlight energy (attention) is distributed as probabilities. And Values are what you actually see when you shine the light there—the content you get back.
Worked Example with Numbers
Section titled “Worked Example with Numbers”Let’s trace through with a tiny example. Sequence: [“I”, “saw”, “it”] with d_model = 4.
# Input embeddings (simplified)X = [ [1.0, 0.0, 0.5, 0.2], # "I" [0.3, 0.8, 0.1, 0.9], # "saw" [0.5, 0.5, 0.5, 0.5], # "it"]
# Weight matrices (learned during training)W_Q = [[...]] # 4×4 matrixW_K = [[...]] # 4×4 matrixW_V = [[...]] # 4×4 matrix
# Project to Q, K, VQ = X @ W_Q # [3, 4]K = X @ W_K # [3, 4]V = X @ W_V # [3, 4]
# Compute attention scores for position 2 ("it")# Q[2] @ K.T gives scores for "it" attending to all positionsq_it = Q[2] # [4]scores_it = q_it @ K.T # [3] - one score per position# Let's say this gives: [0.1, 0.3, 0.6]
# Scale by sqrt(d_k)d_k = 4scaled_scores = scores_it / sqrt(4) # [0.05, 0.15, 0.3]
# Softmaxattention_weights = softmax([0.05, 0.15, 0.3]) # ≈ [0.28, 0.33, 0.39]
# Weighted sum of values# "it" gets 28% of "I"'s value, 33% of "saw"'s value, 39% of its own valueoutput_it = 0.28 * V[0] + 0.33 * V[1] + 0.39 * V[2]PyTorch Implementation
Section titled “PyTorch Implementation”import torchimport torch.nn as nnimport torch.nn.functional as Fimport math
class SelfAttention(nn.Module): """ Single-head self-attention.
This is the core building block of transformers. """ def __init__(self, d_model: int): super().__init__() self.d_model = d_model
# Linear projections for Q, K, V self.W_q = nn.Linear(d_model, d_model, bias=False) self.W_k = nn.Linear(d_model, d_model, bias=False) self.W_v = nn.Linear(d_model, d_model, bias=False)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None): """ Args: x: Input tensor [batch, seq_len, d_model] mask: Optional attention mask [batch, seq_len, seq_len]
Returns: output: Attended values [batch, seq_len, d_model] attention_weights: Attention patterns [batch, seq_len, seq_len] """ # Project to Q, K, V Q = self.W_q(x) # [batch, seq_len, d_model] K = self.W_k(x) V = self.W_v(x)
# Compute attention scores # Q @ K.T: [batch, seq_len, d_model] @ [batch, d_model, seq_len] # = [batch, seq_len, seq_len] scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(self.d_model)
# Apply mask if provided (for causal/padding masking) if mask is not None: scores = scores.masked_fill(mask == 0, float('-inf'))
# Softmax over the last dimension (keys) attention_weights = F.softmax(scores, dim=-1)
# Weighted sum of values output = torch.bmm(attention_weights, V)
return output, attention_weights
# Test itd_model = 64seq_len = 10batch_size = 2
attention = SelfAttention(d_model)x = torch.randn(batch_size, seq_len, d_model)output, weights = attention(x)
print(f"Input shape: {x.shape}")print(f"Output shape: {output.shape}")print(f"Attention weights shape: {weights.shape}")print(f"Weights sum per query: {weights.sum(dim=-1)}") # Should be all 1sNotice how the output shape matches the input shape — attention doesn’t change dimensions, it just mixes information between positions. The attention weights sum to 1 for each query position (thanks to softmax), making them interpretable as “how much attention goes to each key.”
Visualizing Attention
Section titled “Visualizing Attention”The attention weights form a [seq_len × seq_len] matrix showing how much each position attends to each other position:
"The" "cat" "sat" "on" "it""The" [ 0.8 0.1 0.05 0.03 0.02 ]"cat" [ 0.3 0.5 0.1 0.05 0.05 ]"sat" [ 0.1 0.6 0.2 0.05 0.05 ]"on" [ 0.1 0.2 0.3 0.3 0.1 ]"it" [ 0.1 0.7 0.1 0.05 0.05 ] ← "it" attends strongly to "cat"This is why attention is so interpretable — you can literally see what the model is “paying attention to.”
Multi-Head Attention: Multiple Perspectives
Section titled “Multi-Head Attention: Multiple Perspectives”Single-head attention has a limitation: each position can only compute one weighted average of values. But language has many types of relationships!
Consider: “The cat that I saw yesterday chased the mouse.”
Different aspects we might want to capture:
- Syntactic: “cat” → “chased” (subject-verb)
- Semantic: “cat” → “mouse” (predator-prey)
- Coreference: “that” → “cat” (relative clause)
Multi-head attention runs multiple attention operations in parallel, each learning different relationship types.
The Architecture
Section titled “The Architecture”Input X │ ├──────┬──────┬──────┬──────► (split into heads) │ │ │ │ Head₁ Head₂ Head₃ Head₄ │ │ │ │ └──────┴──────┴──────┘ │ Concat │ Linear (W_O) │ OutputImplementation Details
Section titled “Implementation Details”Instead of one attention with d_model dimensions, use h heads each with d_k = d_model/h dimensions.
class MultiHeadAttention(nn.Module): """ Multi-head attention mechanism.
Each head learns different attention patterns, then results are concatenated and projected. """ def __init__(self, d_model: int, num_heads: int): super().__init__() assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.d_model = d_model self.num_heads = num_heads self.d_k = d_model // num_heads
# Projections for all heads at once (more efficient) self.W_q = nn.Linear(d_model, d_model, bias=False) self.W_k = nn.Linear(d_model, d_model, bias=False) self.W_v = nn.Linear(d_model, d_model, bias=False) self.W_o = nn.Linear(d_model, d_model, bias=False)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None): batch_size, seq_len, _ = x.shape
# Project to Q, K, V Q = self.W_q(x) # [batch, seq_len, d_model] K = self.W_k(x) V = self.W_v(x)
# Reshape for multi-head: [batch, seq_len, d_model] # → [batch, num_heads, seq_len, d_k] Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2) K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2) V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
# Compute attention for all heads in parallel # [batch, heads, seq_len, d_k] @ [batch, heads, d_k, seq_len] # = [batch, heads, seq_len, seq_len] scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None: scores = scores.masked_fill(mask == 0, float('-inf'))
attention_weights = F.softmax(scores, dim=-1)
# Apply attention to values # [batch, heads, seq_len, seq_len] @ [batch, heads, seq_len, d_k] # = [batch, heads, seq_len, d_k] attended = torch.matmul(attention_weights, V)
# Concatenate heads: [batch, heads, seq_len, d_k] # → [batch, seq_len, d_model] attended = attended.transpose(1, 2).contiguous() attended = attended.view(batch_size, seq_len, self.d_model)
# Final linear projection output = self.W_o(attended)
return output, attention_weights
# Test multi-head attentionmha = MultiHeadAttention(d_model=64, num_heads=8)x = torch.randn(2, 10, 64)output, weights = mha(x)
print(f"Input: {x.shape}")print(f"Output: {output.shape}")print(f"Attention weights: {weights.shape}") # [batch, heads, seq_len, seq_len]Notice how each head operates on d_model/num_heads dimensions (64/8 = 8 in this case). The heads run in parallel, each learning its own attention pattern, then the results are concatenated and projected back to d_model dimensions.
Did You Know? In the original transformer paper, the authors used 8 attention heads. When researchers later visualized what each head learned, they found remarkable specialization: some heads tracked syntactic dependencies (subject-verb), others tracked positional relationships (adjacent words), and some captured semantic similarity. The model “discovered” linguistics without being taught!
Positional Encoding: Teaching Order to an Orderless System
Section titled “Positional Encoding: Teaching Order to an Orderless System”Here’s a subtle but critical problem with self-attention: it’s permutation invariant.
If you shuffle the input tokens, the attention computation doesn’t change (each token still attends to all others). But word order matters! “Dog bites man” ≠ “Man bites dog.”
The solution is positional encoding: add information about position to each token.
The Sinusoidal Encoding
Section titled “The Sinusoidal Encoding”The original paper used sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))This creates a unique pattern for each position that the model can learn to interpret.
Worked Example: Computing Positional Encodings
Section titled “Worked Example: Computing Positional Encodings”Let’s compute actual values for d_model = 8 (tiny, for illustration):
Position 0, dimensions 0-7: dim 0: sin(0 / 10000^(0/8)) = sin(0) = 0.000 dim 1: cos(0 / 10000^(0/8)) = cos(0) = 1.000 dim 2: sin(0 / 10000^(2/8)) = sin(0) = 0.000 dim 3: cos(0 / 10000^(2/8)) = cos(0) = 1.000 ... PE[0] = [0.000, 1.000, 0.000, 1.000, 0.000, 1.000, 0.000, 1.000]
Position 1, dimensions 0-7: dim 0: sin(1 / 10000^(0/8)) = sin(1.0) = 0.841 dim 1: cos(1 / 10000^(0/8)) = cos(1.0) = 0.540 dim 2: sin(1 / 10000^(2/8)) = sin(0.1) = 0.100 dim 3: cos(1 / 10000^(2/8)) = cos(0.1) = 0.995 ... PE[1] = [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]
Position 100: PE[100] = [-0.506, 0.863, -0.999, 0.045, 0.842, 0.540, 0.100, 0.995]Notice how:
- Low dimensions (0, 1) oscillate rapidly — they change significantly between adjacent positions
- High dimensions (6, 7) oscillate slowly — they’re almost constant for nearby positions
- Position 0 is special — all sine terms are 0, all cosine terms are 1
- Each position gets a unique fingerprint that the model learns to decode
Why sinusoids?
- Bounded values: Always between -1 and 1
- Unique patterns: Each position has a distinct encoding
- Relative positions: The model can learn to compute PE(pos+k) from PE(pos)
- Generalization: Works for sequences longer than training data
class PositionalEncoding(nn.Module): """ Sinusoidal positional encoding from "Attention Is All You Need."
Adds position information to token embeddings so the model knows word order. """ def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1): super().__init__() self.dropout = nn.Dropout(p=dropout)
# Create positional encoding matrix pe = torch.zeros(max_len, d_model) position = torch.arange(0, max_len).unsqueeze(1).float()
# Compute the divisor term div_term = torch.exp( torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model) )
# Apply sin to even indices, cos to odd indices pe[:, 0::2] = torch.sin(position * div_term) pe[:, 1::2] = torch.cos(position * div_term)
# Add batch dimension and register as buffer (not a parameter) pe = pe.unsqueeze(0) # [1, max_len, d_model] self.register_buffer('pe', pe)
def forward(self, x: torch.Tensor) -> torch.Tensor: """ Args: x: Token embeddings [batch, seq_len, d_model]
Returns: x + positional encoding """ x = x + self.pe[:, :x.size(1)] return self.dropout(x)
# Visualize positional encodingspe = PositionalEncoding(d_model=64, max_len=100)positions = pe.pe[0, :50, :].numpy() # First 50 positions
# Each row is a position, each column is a dimension# You'd see wave patterns at different frequenciesNotice how the encoding is computed once and stored as a buffer (not a parameter). This is because positional encodings are fixed — they don’t change during training. The register_buffer call ensures they’re saved with the model but not updated by the optimizer.
Learned vs. Fixed Positional Encodings
Section titled “Learned vs. Fixed Positional Encodings”Modern models often use learned positional embeddings instead:
class LearnedPositionalEncoding(nn.Module): def __init__(self, d_model: int, max_len: int = 5000): super().__init__() self.position_embeddings = nn.Embedding(max_len, d_model)
def forward(self, x: torch.Tensor) -> torch.Tensor: seq_len = x.size(1) positions = torch.arange(seq_len, device=x.device) return x + self.position_embeddings(positions)Notice how learned embeddings use nn.Embedding — the same mechanism used for token embeddings. The position indices (0, 1, 2, …) are looked up in a learned table. This is simpler than sinusoidal but limited to the maximum length seen during training.
BERT uses learned positions. GPT uses learned positions. The original transformer used sinusoidal. Both work well.
Did You Know? RoPE (Rotary Position Embedding), invented by Jianlin Su in 2021, is now used in most modern LLMs including LLaMA, Mistral, and many others. Instead of adding positional information, it rotates the query and key vectors based on position. This elegant approach allows models to better capture relative positions and generalize to longer sequences than seen during training.
The Full Transformer Block
Section titled “The Full Transformer Block”Now let’s assemble the complete transformer encoder block. Each block contains:
- Multi-Head Self-Attention (with residual connection + layer norm)
- Feed-Forward Network (with residual connection + layer norm)
Input │ ├──────────────────────┐ │ │ Multi-Head │ Attention │ │ │ └──────► Add ◄─────────┘ │ LayerNorm │ ├──────────────────────┐ │ │ Feed-Forward │ Network │ │ │ └──────► Add ◄─────────┘ │ LayerNorm │ OutputThe Feed-Forward Network
Section titled “The Feed-Forward Network”A simple two-layer MLP applied independently to each position:
class FeedForward(nn.Module): """ Position-wise feed-forward network.
Applied independently to each position (no interaction between positions). Typically expands to 4x d_model then contracts back. """ def __init__(self, d_model: int, d_ff: int = None, dropout: float = 0.1): super().__init__() d_ff = d_ff or 4 * d_model # Default: 4x expansion
self.net = nn.Sequential( nn.Linear(d_model, d_ff), nn.GELU(), # Modern choice; original used ReLU nn.Dropout(dropout), nn.Linear(d_ff, d_model), nn.Dropout(dropout) )
def forward(self, x: torch.Tensor) -> torch.Tensor: return self.net(x)The Complete Encoder Block
Section titled “The Complete Encoder Block”class TransformerEncoderBlock(nn.Module): """ A single transformer encoder block.
Contains multi-head attention and feed-forward network, each with residual connections and layer normalization. """ def __init__( self, d_model: int, num_heads: int, d_ff: int = None, dropout: float = 0.1 ): super().__init__()
self.attention = MultiHeadAttention(d_model, num_heads) self.ff = FeedForward(d_model, d_ff, dropout)
self.norm1 = nn.LayerNorm(d_model) self.norm2 = nn.LayerNorm(d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None): # Self-attention with residual connection attended, attention_weights = self.attention(x, mask) x = self.norm1(x + self.dropout(attended))
# Feed-forward with residual connection ff_out = self.ff(x) x = self.norm2(x + self.dropout(ff_out))
return x, attention_weightsThe Full Transformer Encoder
Section titled “The Full Transformer Encoder”Stack multiple blocks:
class TransformerEncoder(nn.Module): """ Complete transformer encoder.
Stacks multiple encoder blocks with shared positional encoding. """ def __init__( self, vocab_size: int, d_model: int = 512, num_heads: int = 8, num_layers: int = 6, d_ff: int = 2048, max_len: int = 5000, dropout: float = 0.1 ): super().__init__()
self.d_model = d_model
# Token embedding self.token_embedding = nn.Embedding(vocab_size, d_model)
# Positional encoding self.positional_encoding = PositionalEncoding(d_model, max_len, dropout)
# Stack of encoder blocks self.layers = nn.ModuleList([ TransformerEncoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers) ])
# Final layer norm self.norm = nn.LayerNorm(d_model)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None): # Embed tokens and add positional encoding x = self.token_embedding(x) * math.sqrt(self.d_model) x = self.positional_encoding(x)
# Pass through encoder blocks attention_weights = [] for layer in self.layers: x, attn = layer(x, mask) attention_weights.append(attn)
x = self.norm(x)
return x, attention_weights
# Create a small transformer encoderencoder = TransformerEncoder( vocab_size=10000, d_model=256, num_heads=8, num_layers=4, d_ff=1024)
# Test with random token IDstokens = torch.randint(0, 10000, (2, 50)) # Batch of 2, length 50output, attentions = encoder(tokens)
print(f"Input tokens: {tokens.shape}")print(f"Output embeddings: {output.shape}")print(f"Number of attention matrices: {len(attentions)}")print(f"Each attention matrix: {attentions[0].shape}")Notice how the encoder returns both the output embeddings and all attention weights. The attention weights are invaluable for interpretability — you can visualize exactly what each layer and head is “paying attention to.” This transparency is one of the reasons transformers became so popular in research.
Encoder vs. Decoder: Two Flavors of Transformer
Section titled “Encoder vs. Decoder: Two Flavors of Transformer”The original transformer paper described an encoder-decoder architecture for translation. Let’s understand the difference:
Encoder-Only (BERT style)
Section titled “Encoder-Only (BERT style)”- Sees the entire input at once
- Each position attends to all other positions
- Used for: classification, named entity recognition, embeddings
Input: [CLS] The cat sat [SEP] ↓ ↓ ↓ ↓ ↓ ←───────────────────→ (bidirectional attention) ↓ ↓ ↓ ↓ ↓Output: h_cls h_1 h_2 h_3 h_sepDecoder-Only (GPT style)
Section titled “Decoder-Only (GPT style)”- Generates tokens one at a time
- Each position can only attend to previous positions (causal masking)
- Used for: text generation, chat, code completion
Input: The cat sat ↓ ↓ ↓ ←─ ←── ←─── (causal attention: can only look back) ↓ ↓ ↓Output: cat sat onCausal Masking
Section titled “Causal Masking”To prevent the decoder from “cheating” by looking at future tokens:
def create_causal_mask(seq_len: int) -> torch.Tensor: """ Create a causal mask for decoder self-attention.
Lower triangular matrix: position i can attend to positions 0...i """ mask = torch.tril(torch.ones(seq_len, seq_len)) return mask
# Example for seq_len=5mask = create_causal_mask(5)print(mask)# tensor([[1., 0., 0., 0., 0.],# [1., 1., 0., 0., 0.],# [1., 1., 1., 0., 0.],# [1., 1., 1., 1., 0.],# [1., 1., 1., 1., 1.]])
# Position 0 can only see position 0# Position 1 can see positions 0, 1# Position 4 can see positions 0, 1, 2, 3, 4Notice how torch.tril (lower triangular) gives us exactly what we need — each row allows attention only to that position and earlier ones. Where the mask is 0, we set the attention score to negative infinity before softmax, making those attention weights effectively zero.
Encoder-Decoder (Original Transformer)
Section titled “Encoder-Decoder (Original Transformer)”- Encoder processes source sequence (bidirectional)
- Decoder generates target sequence (causal)
- Decoder also attends to encoder outputs (cross-attention)
- Used for: translation, summarization
Source: "Le chat" → Encoder → encoded representations ↓Target: "The cat" → Decoder ← cross-attention ↓Output: "The cat"Did You Know? GPT-3 and all subsequent GPT models are decoder-only. BERT is encoder-only. The original transformer was encoder-decoder. Surprisingly, decoder-only models turned out to be the best for general language understanding, despite being designed for generation. This was unexpected — many researchers thought bidirectional context (encoder-style) would always be superior.
The Complexity Trade-off: O(n²)
Section titled “The Complexity Trade-off: O(n²)”Self-attention has a critical limitation: quadratic complexity in sequence length.
For a sequence of length n:
- Attention scores matrix: n × n
- Memory: O(n²)
- Computation: O(n²)
This means:
- 1000 tokens: 1 million attention computations
- 10,000 tokens: 100 million attention computations
- 100,000 tokens: 10 billion attention computations
Why This Matters
Section titled “Why This Matters”| Sequence Length | Memory (FP16) | Relative Cost |
|---|---|---|
| 512 | ~1 MB | 1× |
| 2,048 | ~16 MB | 16× |
| 8,192 | ~256 MB | 256× |
| 32,768 | ~4 GB | 4,096× |
| 131,072 | ~64 GB | 65,536× |
This is why early transformers were limited to 512-2048 tokens, and why research into efficient attention is so active.
Efficient Attention Variants
Section titled “Efficient Attention Variants”Many techniques exist to reduce complexity:
- Sparse Attention (BigBird, Longformer): Only attend to subset of positions
- Linear Attention (Performer, Linear Transformer): Approximate attention with O(n)
- Flash Attention: Same math, but optimized memory access patterns (much faster!)
- Sliding Window: Only attend to local context (Mistral uses this)
Did You Know? Flash Attention, created by Tri Dao at Stanford in 2022, doesn’t change the attention math at all — it computes exactly the same result. But by optimizing how data moves between GPU memory levels (exploiting the memory hierarchy), it achieves 2-4× speedup and uses 5-20× less memory. This is why modern LLMs can handle 128K+ context windows. The insight was that attention is memory-bound, not compute-bound.
Vision Transformers: Attention for Images
Section titled “Vision Transformers: Attention for Images”In 2020, Google showed that transformers could match or beat CNNs on images: the Vision Transformer (ViT).
The Key Idea: Patches as Tokens
Section titled “The Key Idea: Patches as Tokens”Instead of processing pixels, split the image into patches and treat each patch as a token:
224×224 image with 16×16 patches = 14×14 = 196 "tokens"
┌────┬────┬────┬────┐│ P1 │ P2 │ P3 │... │├────┼────┼────┼────┤│ P15│ P16│ P17│... │├────┼────┼────┼────┤│... │... │... │... │└────┴────┴────┴────┘
Each patch (16×16×3 = 768 values) becomes a token embeddingclass PatchEmbedding(nn.Module): """ Convert image into sequence of patch embeddings. """ def __init__( self, image_size: int = 224, patch_size: int = 16, in_channels: int = 3, d_model: int = 768 ): super().__init__() self.patch_size = patch_size self.num_patches = (image_size // patch_size) ** 2
# Conv2d is an efficient way to extract and project patches self.projection = nn.Conv2d( in_channels, d_model, kernel_size=patch_size, stride=patch_size )
def forward(self, x: torch.Tensor) -> torch.Tensor: # x: [batch, channels, height, width] x = self.projection(x) # [batch, d_model, h/p, w/p] x = x.flatten(2) # [batch, d_model, num_patches] x = x.transpose(1, 2) # [batch, num_patches, d_model] return xViT Architecture
Section titled “ViT Architecture”Image → Patch Embedding → [CLS] + Patches → Position Embedding ↓ Transformer Encoder (×12) ↓ CLS token ↓ Classification HeadThe [CLS] token is a learnable embedding prepended to the patch sequence. After passing through the transformer, it contains information about the entire image.
Did You Know? The original ViT paper (Dosovitskiy et al., 2020) had a surprising finding: ViT underperformed CNNs when trained on ImageNet alone, but dramatically outperformed them when pretrained on larger datasets (JFT-300M with 300 million images). This suggested that transformers need more data to learn good visual features, but once they have enough data, they scale better than CNNs.
Why Attention Changed Everything
Section titled “Why Attention Changed Everything”Let’s revisit the Heureka Moment. Why did transformers take over AI?
1. Parallelization
Section titled “1. Parallelization”RNNs: Sequential (slow to train) Transformers: Fully parallel (fast to train)
This meant transformers could be trained on much more data in the same time.
2. Long-Range Dependencies
Section titled “2. Long-Range Dependencies”RNNs: Information degrades over distance Transformers: Every token directly attends to every other token
A word 500 positions away is just as easy to attend to as an adjacent word.
3. Scalability
Section titled “3. Scalability”The “scaling laws” discovered by OpenAI showed that transformers improve predictably with:
- More parameters
- More data
- More compute
This predictability enabled massive investment in scaling (GPT-3, gpt-5, etc.).
4. Transfer Learning
Section titled “4. Transfer Learning”Pretrained transformers transfer remarkably well:
- BERT: Pretrain on text → fine-tune for any NLP task
- GPT: Pretrain on text → few-shot learning on any task
- ViT: Pretrain on images → transfer to any vision task
5. Interpretability (Sort of)
Section titled “5. Interpretability (Sort of)”Attention weights are somewhat interpretable. You can visualize what the model is “looking at.” This isn’t possible with RNNs or CNNs in the same way.
Common Pitfalls
Section titled “Common Pitfalls”1. Forgetting the Scale Factor
Section titled “1. Forgetting the Scale Factor”# Wrong - gradients will vanish in softmaxscores = Q @ K.Tattention = softmax(scores)
# Correct - scale by sqrt(d_k)scores = Q @ K.T / math.sqrt(d_k)attention = softmax(scores)2. Wrong Mask Shape
Section titled “2. Wrong Mask Shape”# Wrong - mask should be [batch, 1, seq_len, seq_len] or broadcastablemask = torch.ones(batch_size, seq_len, seq_len) # Missing head dimension
# Correct for multi-head attentionmask = torch.ones(batch_size, 1, seq_len, seq_len) # Broadcasts over heads3. Forgetting Positional Encoding
Section titled “3. Forgetting Positional Encoding”# Wrong - no position information!x = token_embedding(tokens)output = transformer(x)
# Correctx = token_embedding(tokens)x = x + positional_encoding(x) # Add position informationoutput = transformer(x)4. Not Handling Padding Properly
Section titled “4. Not Handling Padding Properly”# For batches with different sequence lengths, you need padding masks# to prevent attending to padding tokens
def create_padding_mask(seq, pad_token=0): """Create mask where padding positions are 0.""" return (seq != pad_token).unsqueeze(1).unsqueeze(2)5. Memory Explosion with Long Sequences
Section titled “5. Memory Explosion with Long Sequences”# Dangerous - might OOMlong_sequence = torch.randn(1, 10000, 512)attention = MultiHeadAttention(512, 8)output = attention(long_sequence) # 10000×10000 attention matrix!
# Safer - use flash attention or limit sequence length# Or use a model designed for long contexts (Longformer, etc.)Quiz: Test Your Understanding
Section titled “Quiz: Test Your Understanding”Q1: Why do we divide attention scores by sqrt(d_k)?
Answer
Without scaling, dot products grow with dimension size (approximately proportional to d_k). Large dot products push softmax into saturation where gradients approach zero. Dividing by sqrt(d_k) keeps the variance of scores roughly constant regardless of dimension, maintaining healthy gradients.Q2: What’s the difference between encoder and decoder transformers?
Answer
**Encoder** (BERT-style): Bidirectional attention — each position attends to all positions. Used for understanding/classification.Decoder (GPT-style): Causal/autoregressive attention — each position only attends to previous positions. Used for generation.
The key difference is the attention mask: encoders use no mask (or just padding mask), while decoders use a causal mask that prevents looking ahead.
Q3: Why do transformers need positional encoding?
Answer
Self-attention is permutation invariant — if you shuffle the input, the computation doesn't change (ignoring position encoding). But word order matters: "dog bites man" ≠ "man bites dog."Positional encoding adds position information to each token embedding, allowing the model to distinguish between “cat at position 1” and “cat at position 10.”
Q4: Why is attention O(n²) and why does it matter?
Answer
Each of n positions computes attention scores with all n positions, giving n×n scores. This means: - Memory grows quadratically with sequence length - Compute grows quadratically with sequence lengthA 10× longer sequence needs 100× more memory/compute. This limits context windows and makes very long documents challenging. It’s why there’s active research on efficient attention (linear attention, sparse attention, Flash Attention).
Q5: What are the three weight matrices in attention (W_Q, W_K, W_V) for?
Answer
- **W_Q (Query)**: Projects each token into "what am I looking for" space - **W_K (Key)**: Projects each token into "what can I be found by" space - **W_V (Value)**: Projects each token into "what information do I contain" spaceSeparating these allows different transformations for matching (Q·K) versus content retrieval (V). A pronoun like “it” needs to find its antecedent (high Q similarity with nouns) but contains little information itself (low V magnitude).
Economics of Transformers
Section titled “Economics of Transformers”Computational Cost Reality
Section titled “Computational Cost Reality”Transformers revolutionized AI but at significant computational cost:
Training Costs (Estimated):
| Model | Parameters | Training Cost | GPU Hours |
|---|---|---|---|
| GPT-2 | 1.5B | ~$50K | ~1 week |
| GPT-3 | 175B | ~$4.6M | ~3 months |
| gpt-5 | ~1.7T | ~$100M | ~6 months |
| Claude 3 | Unknown | ~$50-100M | Unknown |
Inference Costs per 1M Tokens:
| Model Size | Input Cost | Output Cost |
|---|---|---|
| 7B params | $0.10-0.50 | $0.30-1.00 |
| 70B params | $0.50-2.00 | $2.00-5.00 |
| 400B+ params | $2.00-10.00 | $10.00-30.00 |
The Context Window Economics
Section titled “The Context Window Economics”The O(n²) attention cost means context length has outsized impact:
| Context Length | Relative Memory | Relative Compute |
|---|---|---|
| 4K tokens | 1× | 1× |
| 32K tokens | 64× | 64× |
| 128K tokens | 1,024× | 1,024× |
Flash Attention’s Impact: Reduces memory by 5-20× and speeds up by 2-4×, making 100K+ context practical.
ROI of Understanding Transformers
Section titled “ROI of Understanding Transformers”Career impact data (from industry surveys):
- Transformer expertise salary premium: +$20,000-40,000/year
- AI/ML roles requiring transformer knowledge: 85%+
- Time to proficiency: 2-4 weeks of dedicated study
- Most in-demand sub-skills: attention visualization, efficient inference, fine-tuning
Interview Preparation: Transformers
Section titled “Interview Preparation: Transformers”Common Interview Questions
Section titled “Common Interview Questions”Q1: “Walk me through the self-attention mechanism.”
Strong Answer: “Self-attention allows each position in a sequence to gather information from all other positions. We project input embeddings into three representations: Query (what am I looking for), Key (what can I offer), and Value (what information do I contain). We compute attention scores by taking the dot product of Query with all Keys, scaled by sqrt(d_k) to prevent gradient vanishing. After softmax normalization, these scores weight the Values to produce an output that’s a learned combination of all positions. The key insight is that attention patterns are learned—the model discovers what relationships matter.”
Q2: “Why did transformers replace RNNs?”
Strong Answer: “Three fundamental reasons. First, parallelization: RNNs process sequentially (h_t depends on h_{t-1}), while transformers process all positions simultaneously, enabling massive GPU parallelism. Second, long-range dependencies: RNNs suffer from vanishing gradients over distance, but transformer attention gives each position direct access to every other position. Third, scalability: transformers follow predictable scaling laws—more parameters and data reliably improve performance—enabling systematic investment in larger models.”
Q3: “What’s the role of positional encoding?”
Strong Answer: “Self-attention is permutation invariant—shuffling inputs doesn’t change outputs. But word order matters: ‘dog bites man’ differs from ‘man bites dog.’ Positional encoding injects position information into embeddings. The original paper used sinusoidal functions: different frequencies across dimensions create unique fingerprints per position. Modern models often use learned embeddings instead, or RoPE (Rotary Position Embedding) which rotates vectors based on position for better relative position handling and length generalization.”
Q4: “Explain multi-head attention and why it’s useful.”
Strong Answer: “Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows capturing different relationship types simultaneously: one head might learn syntactic patterns (subject-verb), another semantic relationships (cause-effect), another positional patterns (adjacent words). Each head operates on d_model/num_heads dimensions, then outputs are concatenated and projected. It’s like having multiple experts each focusing on different aspects of the relationships in text.”
Q5: “How would you debug a transformer that’s not learning?”
Strong Answer: “Systematic approach: First, verify attention patterns—visualize attention weights to check if the model is learning meaningful patterns or just attending uniformly. Second, check gradient flow—attention scores should have healthy magnitude (not too large pre-softmax). Third, verify positional encoding is being added correctly. Fourth, check masking—decoder models need causal masks, and padding tokens should be masked. Fifth, start with a tiny dataset and overfit—if the model can’t memorize a few examples, there’s a fundamental bug. Sixth, compare against a known working implementation on the same data.”
System Design Question
Section titled “System Design Question”Q: “Design a transformer-based document search system.”
Strong Answer Structure:
-
Architecture Choice: “Use a bi-encoder approach with BERT-style encoder. Documents are embedded offline, queries are embedded at search time. This enables sub-second search over millions of documents.”
-
Embedding Strategy: “Use mean pooling or CLS token for document representation. Consider chunking long documents (512 token limit) and taking max or mean over chunks. Fine-tune on in-domain data if available.”
-
Index Structure: “Store embeddings in a vector database (Pinecone, Milvus, FAISS). Use approximate nearest neighbor search for scalability. Combine with keyword search (BM25) in a hybrid approach for best results.”
-
Efficiency Considerations: “Quantize embeddings (float32→int8) to reduce storage 4×. Use dimensionality reduction if latency-critical. Consider caching frequent queries.”
-
Quality Improvements: “Add cross-encoder reranking on top-K results for higher precision. Fine-tune on click data for relevance signals. A/B test embedding models.”
️ Hands-On Exercises
Section titled “️ Hands-On Exercises”Exercise 1: Implement Attention from Scratch
Section titled “Exercise 1: Implement Attention from Scratch”Build your own attention mechanism without using PyTorch’s built-in functions:
import torchimport math
def manual_attention(Q, K, V, mask=None): """ Implement scaled dot-product attention from scratch.
Args: Q: Queries [batch, seq_len, d_k] K: Keys [batch, seq_len, d_k] V: Values [batch, seq_len, d_v] mask: Optional attention mask
Returns: output: Attended values attention_weights: Attention patterns """ # YOUR CODE HERE: # 1. Compute Q @ K.T # 2. Scale by sqrt(d_k) # 3. Apply mask if provided # 4. Softmax # 5. Multiply by V pass
# Test your implementationQ = torch.randn(2, 10, 64)K = torch.randn(2, 10, 64)V = torch.randn(2, 10, 64)output, weights = manual_attention(Q, K, V)
# Verify: attention weights should sum to 1 per queryassert torch.allclose(weights.sum(dim=-1), torch.ones(2, 10))Success Criteria: Attention weights sum to 1, output shape matches input.
Exercise 2: Visualize Attention Patterns
Section titled “Exercise 2: Visualize Attention Patterns”Create attention visualizations for real text:
from transformers import AutoTokenizer, AutoModelimport matplotlib.pyplot as pltimport seaborn as sns
def visualize_attention(model, tokenizer, text): """ Visualize attention patterns for a given text.
1. Tokenize the input 2. Run through model with output_attentions=True 3. Extract attention weights from a specific layer/head 4. Create heatmap visualization """ # YOUR CODE HERE pass
# Test with sentences that have interesting attention patternssentences = [ "The cat sat on the mat because it was tired.", "The trophy didn't fit in the suitcase because it was too big.", "The python ate the mouse quickly and then slept.",]
# Visualize attention for pronouns resolving to their antecedentsDeliverable: Heatmaps showing attention patterns with clear pronoun resolution.
Exercise 3: Build a Causal Language Model
Section titled “Exercise 3: Build a Causal Language Model”Implement a minimal GPT-style model:
class MiniGPT(nn.Module): """ Minimal GPT-style decoder-only transformer.
Components needed: 1. Token embedding 2. Positional encoding 3. Causal masked multi-head attention 4. Feed-forward network 5. Output projection to vocabulary """ def __init__(self, vocab_size, d_model, num_heads, num_layers, max_len): super().__init__() # YOUR CODE HERE pass
def forward(self, x): # Remember to apply causal mask! pass
def generate(self, prompt_tokens, max_new_tokens=50): """Autoregressive generation.""" # YOUR CODE HERE pass
# Train on a small corpus (Shakespeare, code, etc.)# Verify it can generate coherent textSuccess Criteria: Model generates somewhat coherent text after training.
Exercise 4: Compare Attention Efficiency
Section titled “Exercise 4: Compare Attention Efficiency”Benchmark standard attention vs optimized versions:
import timeimport torch
def benchmark_attention(seq_lengths, d_model=512, num_heads=8, num_trials=10): """ Measure attention time and memory for different sequence lengths.
1. Standard attention 2. torch.nn.functional.scaled_dot_product_attention (if available) 3. Flash attention (if installed)
Return timing and memory data for plotting. """ results = [] for seq_len in seq_lengths: # YOUR CODE HERE pass return results
# Test with seq_lengths = [512, 1024, 2048, 4096, 8192]# Plot the O(n²) growth# Compare with optimized implementationsDeliverable: Graph showing quadratic growth of attention and efficiency of optimizations.
Did You Know? The Bitter Lesson Confirmed
Section titled “Did You Know? The Bitter Lesson Confirmed”In 2019, Richard Sutton wrote “The Bitter Lesson,” arguing that general methods leveraging computation (like search and learning) ultimately beat approaches encoding human knowledge.
Transformers proved him right spectacularly:
- They don’t encode linguistic rules—they learn them
- They don’t have hand-crafted features—just attention
- They scale predictably with compute
The “bitter” part? Decades of NLP research on parsing, syntax trees, and linguistic features became largely obsolete overnight. The winning strategy was: simple architecture + massive scale.
“70 years of AI research, and the answer turns out to be: matrix multiplication and gradient descent, but lots of it.” — Anonymous ML researcher on Twitter
Community and Resources
Section titled “Community and Resources”Key People to Follow
Section titled “Key People to Follow”Original Transformer Authors:
- Ashish Vaswani (@ashaborali) - Co-founder of Essential AI
- Noam Shazeer - Co-founder of Character.AI, returned to Google
- Jakob Uszkoreit - Co-founder of Inceptive (RNA design with transformers)
Modern Practitioners:
- Andrej Karpathy (@karpathy) - Former Tesla AI Director, incredible educational content
- Jay Alammar - Author of “The Illustrated Transformer”
- Tri Dao - Flash Attention creator, now at Together AI
- Sebastian Raschka (@rasbt) - Excellent books and papers on LLMs
Active Research Areas (2024-2025)
Section titled “Active Research Areas (2024-2025)”Efficiency:
- Mixture of Experts (MoE) - Use only a fraction of parameters per forward pass
- State Space Models (Mamba) - Linear complexity alternative to attention
- Speculative Decoding - Faster inference with draft models
Scale:
- Constitutional AI - Training models to follow principles
- RLHF and DPO - Aligning models with human preferences
- Multimodal - Single models for text, image, audio, video
Understanding:
- Mechanistic Interpretability - Understanding what transformers actually learn internally
- Emergent Abilities - Capabilities that appear suddenly at scale
- In-Context Learning - How transformers learn from examples in the prompt
Further Reading
Section titled “Further Reading”Essential Papers
Section titled “Essential Papers”- “Attention Is All You Need” (Vaswani et al., 2017) — The original transformer paper
- “BERT: Pre-training of Deep Bidirectional Transformers” (Devlin et al., 2018) — Encoder-only
- “Language Models are Few-Shot Learners” (Brown et al., 2020) — GPT-3, scaling laws
- “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020) — Vision Transformer
- “FlashAttention” (Dao et al., 2022) — Efficient attention implementation
Online Resources
Section titled “Online Resources”- The Illustrated Transformer — Best visual explanation
- Andrej Karpathy’s “Let’s build GPT” — Build from scratch
- The Annotated Transformer — Line-by-line PyTorch implementation
Summary
Section titled “Summary”You’ve learned:
- Why RNNs failed: Sequential processing, vanishing gradients, long-range difficulties
- Self-attention: Q, K, V projections → scaled dot product → softmax → weighted sum
- Multi-head attention: Multiple attention patterns in parallel
- Positional encoding: Adding position information to an orderless system
- Transformer architecture: Attention + FFN + residuals + layer norm
- Encoder vs decoder: Bidirectional vs causal attention
- The O(n²) problem: Why long sequences are expensive
- Vision Transformers: Patches as tokens
The transformer architecture is the foundation of modern AI. GPT, BERT, Claude, Gemini, Stable Diffusion, DALL-E — they all build on the ideas in this module.
The Heureka Moment (Revisited)
Section titled “The Heureka Moment (Revisited)”Attention is all you need — and now you understand why.
The insight is elegant: instead of processing sequences step by step, let every position look at every other position simultaneously. Let the model learn what to pay attention to through training.
This simple change:
- Enabled massive parallelization (faster training)
- Eliminated the information bottleneck of sequential processing
- Made scaling predictable and efficient
- Created a universal architecture for text, images, audio, video, and more
When you use ChatGPT, Claude, or any modern AI system, you’re using transformers. You now understand the core mechanism that makes them work.
Next Steps
Section titled “Next Steps”Move on to Module 31: Backpropagation Deep Dive where you’ll learn:
- How gradients actually flow through networks
- The chain rule in detail
- Implementing autograd from scratch
- Debugging gradient issues
This completes the “how neural networks learn” trilogy: forward pass (Module 26), training techniques (Module 28), and now the math of learning itself.
Last updated: 2025-12-11 Status: Complete