
The Transformer Architecture


AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6 hours

Or: The Architecture That Ate AI (And Why “Attention Is All You Need”)


Reading Time: 8-10 hours | Prerequisites: Module 29

This is a Heureka Moment module.


When eight researchers at Google discovered in 2017 that they could throw away two decades of sequence modeling wisdom and build something better, the AI community was skeptical. After all, recurrent neural networks had dominated for years. But within three years, their “Attention Is All You Need” paper would become the most cited work in AI history, and every major language model—GPT-5, Claude, Gemini—would be built on their foundation.


Mountain View, California. June 12, 2017. 2:34 AM.

Ashish Vaswani and his Google Brain colleagues huddled around a laptop in a cramped conference room. They had been training their experimental model for days, and the results were finally in.

“That can’t be right,” Jakob Uszkoreit muttered, staring at the screen.

They had just achieved state-of-the-art performance on machine translation—beating every existing system. But that wasn’t what shocked them. It was what they had removed to get there.

No recurrent connections. No convolutions. Just attention. “Attention is all you need,” Vaswani typed as the paper’s title, half-joking. But he wasn’t joking.

“We expected attention to be helpful alongside RNNs, not to replace them entirely. When we removed the recurrence and the model got better, we knew we had found something fundamental.” — Ashish Vaswani, Transformer co-inventor

That paper, with its eight co-authors and deceptively simple title, would go on to become the most cited paper in AI history. Every modern language model—GPT-5, Claude, Gemini, LLaMA—is a direct descendant of what they built that night.


By the end of this module, you will:

  • Understand why sequential models (RNNs/LSTMs) hit a wall
  • Master the self-attention mechanism from intuition to implementation
  • Build multi-head attention from scratch
  • Understand positional encoding and why it’s necessary
  • Implement a complete transformer encoder
  • Know why “Attention Is All You Need” changed everything
  • See how Vision Transformers apply attention to images
  • Grasp the O(n²) complexity and its implications

The Heureka Moment: What You’re About to Discover


Here’s the transformative insight you’ll gain from this module:

Before transformers, neural networks processed sequences one step at a time — like reading a book word by word, unable to look back or ahead without significant effort.

After transformers, neural networks can look at everything simultaneously and learn what to pay attention to — like being able to see an entire book at once and instantly knowing which parts are relevant to each other.

This single architectural change is why we have GPT-5, Claude, Gemini, and virtually every modern AI system. By the end of this module, you’ll understand exactly how it works.


Before 2017, if you wanted to process sequences (text, audio, time series), you used Recurrent Neural Networks (RNNs) or their improved variant, LSTMs.

An RNN processes sequences one element at a time, maintaining a “hidden state” that carries information forward:

Input: "The cat sat on the mat"
↓ ↓ ↓ ↓ ↓ ↓
h₀ → h₁ → h₂ → h₃ → h₄ → h₅ → h₆
Output

Each hidden state h_t depends on the previous state h_{t-1} and the current input. Information flows forward through time.

1. Sequential Processing (Slow)

RNNs must process tokens one at a time. You can’t parallelize because h₃ depends on h₂ which depends on h₁. Training on a 1000-token sequence takes 1000 sequential steps.

Modern GPUs have thousands of cores, but RNNs can only use one at a time
for each sequence position. It's like having a 1000-lane highway but
being forced to drive in a single lane.
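
To make the bottleneck concrete, here is a minimal sketch (illustrative sizes, assuming PyTorch’s nn.RNNCell, not code from the original module) of why the time dimension cannot be parallelized:

import torch
import torch.nn as nn

# Each hidden state depends on the previous one, so the loop below
# must run strictly in order, one timestep at a time.
rnn_cell = nn.RNNCell(input_size=32, hidden_size=64)

x = torch.randn(8, 100, 32)        # [batch, seq_len, input_size]
h = torch.zeros(8, 64)             # initial hidden state

for t in range(x.size(1)):         # 100 strictly sequential steps
    h = rnn_cell(x[:, t, :], h)    # h_t depends on h_{t-1}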

2. Vanishing/Exploding Gradients (Again)

Information must travel through many timesteps. For a 100-word sentence, the gradient for word 1 must backpropagate through 99 steps. Remember Module 28? The same multiplication problem kills gradients in RNNs.

Did You Know? LSTMs (Long Short-Term Memory) were invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997 specifically to address the vanishing gradient problem. Their “gating mechanism” allows information to flow unchanged through time — similar to ResNet’s skip connections. LSTMs dominated sequence modeling for 20 years until transformers arrived.

3. Long-Range Dependencies (Difficult)

For the sentence: “The cat, which had been sleeping on the warm sunny windowsill for most of the afternoon, finally woke up and stretched.”

What does “stretched” refer to? The cat! But there are 18 words between “cat” and “stretched.” RNNs struggle to maintain this connection over long distances.

LSTMs add “gates” that control information flow:

LSTM Cell:

forget_gate = σ(W_f · [h_{t-1}, x_t])   ← what to forget
input_gate  = σ(W_i · [h_{t-1}, x_t])   ← what to add
output_gate = σ(W_o · [h_{t-1}, x_t])   ← what to output

cell_state c_t = forget_gate * c_{t-1}
               + input_gate * tanh(W_c · [h_{t-1}, x_t])

h_t = output_gate * tanh(c_t)

LSTMs helped, but didn’t solve the fundamental problem: you still have to process sequentially.
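
As a rough sketch of the gate equations above (bias terms omitted, weight shapes assumed), a single LSTM step looks like this in PyTorch; note it still has to be called once per timestep:

import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    """One LSTM step following the gate equations above (biases omitted).
    Each W_* maps the concatenated [h_prev, x_t] vector to the hidden size."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    forget_gate = torch.sigmoid(hx @ W_f.T)   # what to forget from c_{t-1}
    input_gate = torch.sigmoid(hx @ W_i.T)    # what to add to the cell state
    output_gate = torch.sigmoid(hx @ W_o.T)   # what to expose as h_t
    c_t = forget_gate * c_prev + input_gate * torch.tanh(hx @ W_c.T)
    h_t = output_gate * torch.tanh(c_t)
    return h_t, c_t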


The Revolution: “Attention Is All You Need”


In June 2017, eight researchers at Google published a paper that would reshape AI: “Attention Is All You Need.”

Did You Know? The transformer paper has been cited over 130,000 times — more than almost any computer science paper ever written. The eight authors (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin) are now scattered across Google, Google DeepMind, and various AI startups. Noam Shazeer later co-founded Character.AI and then returned to Google. The paper’s title was intentionally provocative — they were claiming that attention alone, without any recurrence, could match or beat RNN/LSTM models.

The key insight was radical: throw away recurrence entirely. Don’t process sequences one step at a time. Instead, let every token look at every other token simultaneously and learn which connections matter.

Think of the difference like reading a book. An RNN is like reading with a tiny flashlight that only illuminates one word at a time—you have to remember what you read before. A transformer is like turning on all the lights at once and seeing the entire page, instantly noticing which words relate to each other. This parallel vision is what makes transformers so powerful.

This is the attention mechanism.


Think of attention as a soft database lookup.

In a traditional database, you have:

  • A query (what you’re looking for)
  • Keys (labels on stored items)
  • Values (the actual stored content)

You find exact matches: SELECT value WHERE key = query

Attention does the same thing, but with soft matching:

  • Instead of exact matches, compute similarity scores
  • Instead of returning one result, return a weighted combination of all values
  • The weights are learned, not hand-coded
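
Here is a tiny, self-contained sketch of that idea (toy numbers, assuming PyTorch): instead of returning the single exact match, we softmax the similarity scores and return a weighted blend of every value.

import torch
import torch.nn.functional as F

query = torch.tensor([1.0, 0.0])                       # what we're looking for
keys = torch.tensor([[0.9, 0.1],                       # labels on stored items
                     [0.1, 0.9],
                     [0.5, 0.5]])
values = torch.tensor([[10.0], [20.0], [30.0]])        # stored content

scores = keys @ query                                   # similarity of query to each key
weights = F.softmax(scores, dim=0)                     # soft match instead of exact match
result = (weights.unsqueeze(-1) * values).sum(dim=0)   # weighted combination of all values
print(weights, result)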

Imagine you’re reading the sentence: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal or the street?

As a human, you instantly know it’s “the animal” — streets don’t get tired. How did you know? You looked at “it,” then looked at all other words, and found that “animal” + “tired” makes semantic sense while “street” + “tired” doesn’t.

This is exactly what self-attention does. For each word:

  1. Create a “query” representing “what am I looking for?”
  2. Create “keys” for all words representing “what can I offer?”
  3. Compare query to all keys (similarity scores)
  4. Use scores to weight the “values” (actual content)

Given an input sequence of embeddings X (shape: [seq_len, d_model]):

Q = X @ W_Q # Queries: what each position is looking for
K = X @ W_K # Keys: what each position can be found by
V = X @ W_V # Values: what each position contains
# Compute attention scores
scores = Q @ K.T / sqrt(d_k) # [seq_len, seq_len]
# Convert to probabilities
attention_weights = softmax(scores) # Each row sums to 1
# Weighted sum of values
output = attention_weights @ V # [seq_len, d_model]

Let’s break this down:

Step 1: Project to Q, K, V

Each input embedding gets transformed into three different representations:

  • Query (Q): “When I need context, what should I look for?”
  • Key (K): “What kind of context can I provide to others?”
  • Value (V): “What information do I actually contain?”

These are separate because what you’re looking for (Q) might be different from what you offer (K). For example, a pronoun like “it” is looking for its antecedent, but offers very little information itself.

Step 2: Compute Similarity (Q @ K.T)

For each query, compute dot product with all keys. High dot product = high similarity.

For "it" (query) comparing to:
"animal" (key): score = 0.8 ← high similarity
"street" (key): score = 0.2 ← low similarity
"tired" (key): score = 0.1

Step 3: Scale by sqrt(d_k)

Why divide by sqrt(d_k)? Without it, dot products grow with dimension size, pushing softmax into saturated regions where gradients vanish.

d_k = 64 (typical)
sqrt(64) = 8
Without scaling: scores might be [-8, 12, -6]
softmax([-8, 12, -6]) ≈ [0.000, 1.000, 0.000] # Saturated: gradients ≈ 0
With scaling: scores become [-1.0, 1.5, -0.75]
softmax([-1.0, 1.5, -0.75]) ≈ [0.07, 0.84, 0.09] # Still peaked, but gradients survive
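
You can check this claim empirically. The short sketch below (assuming PyTorch, random unit-variance vectors) shows that the standard deviation of raw dot products grows roughly as sqrt(d_k), while the scaled scores stay near 1:

import torch

# Sketch: dot products of random unit-variance vectors have std ≈ sqrt(d_k);
# dividing by sqrt(d_k) keeps the std ≈ 1 regardless of dimension.
for d_k in [16, 64, 256]:
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.std().item(), (dots / d_k ** 0.5).std().item())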

Step 4: Softmax (Probabilities)

Convert scores to probabilities. Each position’s attention weights sum to 1.

Step 5: Weighted Sum of Values

The output for each position is a weighted average of all values, where weights come from attention scores.

Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V

This single formula is the heart of modern AI.

Think of this formula like a spotlight operator at a concert. The Query is what you’re trying to illuminate. The Keys are labels on different performers. The dot product tells you how relevant each performer is to what you’re looking for. Softmax makes sure your spotlight energy (attention) is distributed as probabilities. And Values are what you actually see when you shine the light there—the content you get back.

Let’s trace through with a tiny example. Sequence: [“I”, “saw”, “it”] with d_model = 4.

# Input embeddings (simplified)
X = [
[1.0, 0.0, 0.5, 0.2], # "I"
[0.3, 0.8, 0.1, 0.9], # "saw"
[0.5, 0.5, 0.5, 0.5], # "it"
]
# Weight matrices (learned during training)
W_Q = [[...]] # 4×4 matrix
W_K = [[...]] # 4×4 matrix
W_V = [[...]] # 4×4 matrix
# Project to Q, K, V
Q = X @ W_Q # [3, 4]
K = X @ W_K # [3, 4]
V = X @ W_V # [3, 4]
# Compute attention scores for position 2 ("it")
# Q[2] @ K.T gives scores for "it" attending to all positions
q_it = Q[2] # [4]
scores_it = q_it @ K.T # [3] - one score per position
# Let's say this gives: [0.1, 0.3, 0.6]
# Scale by sqrt(d_k)
d_k = 4
scaled_scores = scores_it / sqrt(4) # [0.05, 0.15, 0.3]
# Softmax
attention_weights = softmax([0.05, 0.15, 0.3]) # ≈ [0.29, 0.33, 0.38]
# Weighted sum of values
# "it" gets 29% of "I"'s value, 33% of "saw"'s value, 38% of its own value
output_it = 0.29 * V[0] + 0.33 * V[1] + 0.38 * V[2]
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """
    Single-head self-attention.
    This is the core building block of transformers.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        """
        Args:
            x: Input tensor [batch, seq_len, d_model]
            mask: Optional attention mask [batch, seq_len, seq_len]
        Returns:
            output: Attended values [batch, seq_len, d_model]
            attention_weights: Attention patterns [batch, seq_len, seq_len]
        """
        # Project to Q, K, V
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)

        # Compute attention scores
        # Q @ K.T: [batch, seq_len, d_model] @ [batch, d_model, seq_len]
        #        = [batch, seq_len, seq_len]
        scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(self.d_model)

        # Apply mask if provided (for causal/padding masking)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax over the last dimension (keys)
        attention_weights = F.softmax(scores, dim=-1)

        # Weighted sum of values
        output = torch.bmm(attention_weights, V)
        return output, attention_weights

# Test it
d_model = 64
seq_len = 10
batch_size = 2

attention = SelfAttention(d_model)
x = torch.randn(batch_size, seq_len, d_model)
output, weights = attention(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Weights sum per query: {weights.sum(dim=-1)}")  # Should be all 1s

Notice how the output shape matches the input shape — attention doesn’t change dimensions, it just mixes information between positions. The attention weights sum to 1 for each query position (thanks to softmax), making them interpretable as “how much attention goes to each key.”

The attention weights form a [seq_len × seq_len] matrix showing how much each position attends to each other position:

"The" "cat" "sat" "on" "it"
"The" [ 0.8 0.1 0.05 0.03 0.02 ]
"cat" [ 0.3 0.5 0.1 0.05 0.05 ]
"sat" [ 0.1 0.6 0.2 0.05 0.05 ]
"on" [ 0.1 0.2 0.3 0.3 0.1 ]
"it" [ 0.1 0.7 0.1 0.05 0.05 ] ← "it" attends strongly to "cat"

This is why attention is so interpretable — you can literally see what the model is “paying attention to.”


Multi-Head Attention: Multiple Perspectives


Single-head attention has a limitation: each position can only compute one weighted average of values. But language has many types of relationships!

Consider: “The cat that I saw yesterday chased the mouse.”

Different aspects we might want to capture:

  • Syntactic: “cat” → “chased” (subject-verb)
  • Semantic: “cat” → “mouse” (predator-prey)
  • Coreference: “that” → “cat” (relative clause)

Multi-head attention runs multiple attention operations in parallel, each learning different relationship types.

Input X
├──────┬──────┬──────┬──────► (split into heads)
│ │ │ │
Head₁ Head₂ Head₃ Head₄
│ │ │ │
└──────┴──────┴──────┘
Concat
Linear (W_O)
Output

Instead of one attention with d_model dimensions, use h heads each with d_k = d_model/h dimensions.

class MultiHeadAttention(nn.Module):
    """
    Multi-head attention mechanism.
    Each head learns different attention patterns, then
    results are concatenated and projected.
    """
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Projections for all heads at once (more efficient)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)

        # Reshape for multi-head: [batch, seq_len, d_model]
        # → [batch, num_heads, seq_len, d_k]
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention for all heads in parallel
        # [batch, heads, seq_len, d_k] @ [batch, heads, d_k, seq_len]
        # = [batch, heads, seq_len, seq_len]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = F.softmax(scores, dim=-1)

        # Apply attention to values
        # [batch, heads, seq_len, seq_len] @ [batch, heads, seq_len, d_k]
        # = [batch, heads, seq_len, d_k]
        attended = torch.matmul(attention_weights, V)

        # Concatenate heads: [batch, heads, seq_len, d_k]
        # → [batch, seq_len, d_model]
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, seq_len, self.d_model)

        # Final linear projection
        output = self.W_o(attended)
        return output, attention_weights

# Test multi-head attention
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)
output, weights = mha(x)

print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
print(f"Attention weights: {weights.shape}")  # [batch, heads, seq_len, seq_len]

Notice how each head operates on d_model/num_heads dimensions (64/8 = 8 in this case). The heads run in parallel, each learning its own attention pattern, then the results are concatenated and projected back to d_model dimensions.

Did You Know? In the original transformer paper, the authors used 8 attention heads. When researchers later visualized what each head learned, they found remarkable specialization: some heads tracked syntactic dependencies (subject-verb), others tracked positional relationships (adjacent words), and some captured semantic similarity. The model “discovered” linguistics without being taught!


Positional Encoding: Teaching Order to an Orderless System


Here’s a subtle but critical problem with self-attention: it’s permutation invariant.

If you shuffle the input tokens, the attention computation doesn’t change (each token still attends to all others). But word order matters! “Dog bites man” ≠ “Man bites dog.”

The solution is positional encoding: add information about position to each token.

The original paper used sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This creates a unique pattern for each position that the model can learn to interpret.

Worked Example: Computing Positional Encodings


Let’s compute actual values for d_model = 8 (tiny, for illustration):

Position 0, dimensions 0-7:
dim 0: sin(0 / 10000^(0/8)) = sin(0) = 0.000
dim 1: cos(0 / 10000^(0/8)) = cos(0) = 1.000
dim 2: sin(0 / 10000^(2/8)) = sin(0) = 0.000
dim 3: cos(0 / 10000^(2/8)) = cos(0) = 1.000
...
PE[0] = [0.000, 1.000, 0.000, 1.000, 0.000, 1.000, 0.000, 1.000]
Position 1, dimensions 0-7:
dim 0: sin(1 / 10000^(0/8)) = sin(1.0) = 0.841
dim 1: cos(1 / 10000^(0/8)) = cos(1.0) = 0.540
dim 2: sin(1 / 10000^(2/8)) = sin(0.1) = 0.100
dim 3: cos(1 / 10000^(2/8)) = cos(0.1) = 0.995
...
PE[1] = [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]
Position 100:
PE[100] = [-0.506, 0.863, -0.544, -0.839, 0.842, 0.540, 0.100, 0.995]

Notice how:

  • Low dimensions (0, 1) oscillate rapidly — they change significantly between adjacent positions
  • High dimensions (6, 7) oscillate slowly — they’re almost constant for nearby positions
  • Position 0 is special — all sine terms are 0, all cosine terms are 1
  • Each position gets a unique fingerprint that the model learns to decode

Why sinusoids?

  1. Bounded values: Always between -1 and 1
  2. Unique patterns: Each position has a distinct encoding
  3. Relative positions: The model can learn to compute PE(pos+k) from PE(pos)
  4. Generalization: Works for sequences longer than training data
class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding from "Attention Is All You Need."
    Adds position information to token embeddings so the model
    knows word order.
    """
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()

        # Compute the divisor term
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() *
            (-math.log(10000.0) / d_model)
        )

        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add batch dimension and register as buffer (not a parameter)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Token embeddings [batch, seq_len, d_model]
        Returns:
            x + positional encoding
        """
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Visualize positional encodings
pe = PositionalEncoding(d_model=64, max_len=100)
positions = pe.pe[0, :50, :].numpy()  # First 50 positions
# Each row is a position, each column is a dimension
# You'd see wave patterns at different frequencies

Notice how the encoding is computed once and stored as a buffer (not a parameter). This is because positional encodings are fixed — they don’t change during training. The register_buffer call ensures they’re saved with the model but not updated by the optimizer.

Modern models often use learned positional embeddings instead:

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        return x + self.position_embeddings(positions)

Notice how learned embeddings use nn.Embedding — the same mechanism used for token embeddings. The position indices (0, 1, 2, …) are looked up in a learned table. This is simpler than sinusoidal but limited to the maximum length seen during training.

BERT uses learned positions. GPT uses learned positions. The original transformer used sinusoidal. Both work well.

Did You Know? RoPE (Rotary Position Embedding), invented by Jianlin Su in 2021, is now used in most modern LLMs including LLaMA, Mistral, and many others. Instead of adding positional information, it rotates the query and key vectors based on position. This elegant approach allows models to better capture relative positions and generalize to longer sequences than seen during training.
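
As a rough illustration of the idea (a simplified sketch, not the exact formulation used in LLaMA or any production model), RoPE rotates consecutive pairs of query/key dimensions by angles that depend on position:

import math
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of dimensions by position-dependent angles (sketch).
    x: [seq_len, d] with d even; applied to queries and keys before the dot product."""
    seq_len, d = x.shape
    pos = torch.arange(seq_len).float().unsqueeze(1)                        # [seq_len, 1]
    freqs = torch.exp(-math.log(base) * torch.arange(0, d, 2).float() / d)  # [d/2]
    angles = pos * freqs                                                    # [seq_len, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                         # even/odd pairs
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated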


Now let’s assemble the complete transformer encoder block. Each block contains:

  1. Multi-Head Self-Attention (with residual connection + layer norm)
  2. Feed-Forward Network (with residual connection + layer norm)
Input
├──────────────────────┐
│ │
Multi-Head │
Attention │
│ │
└──────► Add ◄─────────┘
LayerNorm
├──────────────────────┐
│ │
Feed-Forward │
Network │
│ │
└──────► Add ◄─────────┘
LayerNorm
Output

A simple two-layer MLP applied independently to each position:

class FeedForward(nn.Module):
    """
    Position-wise feed-forward network.
    Applied independently to each position (no interaction between positions).
    Typically expands to 4x d_model then contracts back.
    """
    def __init__(self, d_model: int, d_ff: int = None, dropout: float = 0.1):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # Default: 4x expansion
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # Modern choice; original used ReLU
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TransformerEncoderBlock(nn.Module):
    """
    A single transformer encoder block.
    Contains multi-head attention and feed-forward network,
    each with residual connections and layer normalization.
    """
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int = None,
        dropout: float = 0.1
    ):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        # Self-attention with residual connection
        attended, attention_weights = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attended))

        # Feed-forward with residual connection
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))

        return x, attention_weights

Stack multiple blocks:

class TransformerEncoder(nn.Module):
    """
    Complete transformer encoder.
    Stacks multiple encoder blocks with shared positional encoding.
    """
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        max_len: int = 5000,
        dropout: float = 0.1
    ):
        super().__init__()
        self.d_model = d_model

        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)

        # Positional encoding
        self.positional_encoding = PositionalEncoding(d_model, max_len, dropout)

        # Stack of encoder blocks
        self.layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Final layer norm
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        # Embed tokens and add positional encoding
        x = self.token_embedding(x) * math.sqrt(self.d_model)
        x = self.positional_encoding(x)

        # Pass through encoder blocks
        attention_weights = []
        for layer in self.layers:
            x, attn = layer(x, mask)
            attention_weights.append(attn)

        x = self.norm(x)
        return x, attention_weights

# Create a small transformer encoder
encoder = TransformerEncoder(
    vocab_size=10000,
    d_model=256,
    num_heads=8,
    num_layers=4,
    d_ff=1024
)

# Test with random token IDs
tokens = torch.randint(0, 10000, (2, 50))  # Batch of 2, length 50
output, attentions = encoder(tokens)

print(f"Input tokens: {tokens.shape}")
print(f"Output embeddings: {output.shape}")
print(f"Number of attention matrices: {len(attentions)}")
print(f"Each attention matrix: {attentions[0].shape}")

Notice how the encoder returns both the output embeddings and all attention weights. The attention weights are invaluable for interpretability — you can visualize exactly what each layer and head is “paying attention to.” This transparency is one of the reasons transformers became so popular in research.


Encoder vs. Decoder: Two Flavors of Transformer


The original transformer paper described an encoder-decoder architecture for translation. Let’s understand the difference:

Encoder-only (BERT-style):

  • Sees the entire input at once
  • Each position attends to all other positions
  • Used for: classification, named entity recognition, embeddings
Input: [CLS] The cat sat [SEP]
↓ ↓ ↓ ↓ ↓
←───────────────────→ (bidirectional attention)
↓ ↓ ↓ ↓ ↓
Output: h_cls h_1 h_2 h_3 h_sep
Decoder-only (GPT-style):

  • Generates tokens one at a time
  • Each position can only attend to previous positions (causal masking)
  • Used for: text generation, chat, code completion
Input: The cat sat
↓ ↓ ↓
←─ ←── ←─── (causal attention: can only look back)
↓ ↓ ↓
Output: cat sat on

To prevent the decoder from “cheating” by looking at future tokens:

def create_causal_mask(seq_len: int) -> torch.Tensor:
    """
    Create a causal mask for decoder self-attention.
    Lower triangular matrix: position i can attend to positions 0...i
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

# Example for seq_len=5
mask = create_causal_mask(5)
print(mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Position 0 can only see position 0
# Position 1 can see positions 0, 1
# Position 4 can see positions 0, 1, 2, 3, 4

Notice how torch.tril (lower triangular) gives us exactly what we need — each row allows attention only to that position and earlier ones. Where the mask is 0, we set the attention score to negative infinity before softmax, making those attention weights effectively zero.

Encoder-decoder (the original transformer):

  • Encoder processes source sequence (bidirectional)
  • Decoder generates target sequence (causal)
  • Decoder also attends to encoder outputs (cross-attention; see the sketch below)
  • Used for: translation, summarization
Source: "Le chat" → Encoder → encoded representations
Target: "The cat" → Decoder ← cross-attention
Output: "The cat"

Did You Know? GPT-3 and all subsequent GPT models are decoder-only. BERT is encoder-only. The original transformer was encoder-decoder. Surprisingly, decoder-only models turned out to be the best for general language understanding, despite being designed for generation. This was unexpected — many researchers thought bidirectional context (encoder-style) would always be superior.


Self-attention has a critical limitation: quadratic complexity in sequence length.

For a sequence of length n:

  • Attention scores matrix: n × n
  • Memory: O(n²)
  • Computation: O(n²)

This means:

  • 1000 tokens: 1 million attention computations
  • 10,000 tokens: 100 million attention computations
  • 100,000 tokens: 10 billion attention computations
Sequence Length   Memory (FP16)   Relative Cost
512               ~1 MB           1×
2,048             ~16 MB          16×
8,192             ~256 MB         256×
32,768            ~4 GB           4,096×
131,072           ~64 GB          65,536×

This is why early transformers were limited to 512-2048 tokens, and why research into efficient attention is so active.

Many techniques exist to reduce complexity:

  1. Sparse Attention (BigBird, Longformer): Only attend to subset of positions
  2. Linear Attention (Performer, Linear Transformer): Approximate attention with O(n)
  3. Flash Attention: Same math, but optimized memory access patterns (much faster!)
  4. Sliding Window: Only attend to local context (Mistral uses this)

Did You Know? Flash Attention, created by Tri Dao at Stanford in 2022, doesn’t change the attention math at all — it computes exactly the same result. But by optimizing how data moves between GPU memory levels (exploiting the memory hierarchy), it achieves 2-4× speedup and uses 5-20× less memory. This is why modern LLMs can handle 128K+ context windows. The insight was that attention is memory-bound, not compute-bound.
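
In practice you rarely write the attention math by hand anymore. PyTorch 2.x exposes a fused kernel, torch.nn.functional.scaled_dot_product_attention, which computes the same result and can dispatch to Flash-Attention-style implementations when the hardware supports it. A small sketch (shapes are illustrative):

import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 1024, 64)   # [batch, heads, seq_len, d_k]
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)

# Same math as softmax(QK^T / sqrt(d_k)) @ V, with an optional causal mask
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                  # torch.Size([2, 8, 1024, 64])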


In 2020, Google showed that transformers could match or beat CNNs on images: the Vision Transformer (ViT).

Instead of processing pixels, split the image into patches and treat each patch as a token:

224×224 image with 16×16 patches = 14×14 = 196 "tokens"
┌────┬────┬────┬────┐
│ P1 │ P2 │ P3 │... │
├────┼────┼────┼────┤
│ P15│ P16│ P17│... │
├────┼────┼────┼────┤
│... │... │... │... │
└────┴────┴────┴────┘
Each patch (16×16×3 = 768 values) becomes a token embedding
class PatchEmbedding(nn.Module):
    """
    Convert image into sequence of patch embeddings.
    """
    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,
        in_channels: int = 3,
        d_model: int = 768
    ):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2

        # Conv2d is an efficient way to extract and project patches
        self.projection = nn.Conv2d(
            in_channels, d_model,
            kernel_size=patch_size,
            stride=patch_size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, height, width]
        x = self.projection(x)   # [batch, d_model, h/p, w/p]
        x = x.flatten(2)         # [batch, d_model, num_patches]
        x = x.transpose(1, 2)    # [batch, num_patches, d_model]
        return x
Image → Patch Embedding → [CLS] + Patches → Position Embedding
      → Transformer Encoder (×12)
      → CLS token → Classification Head

The [CLS] token is a learnable embedding prepended to the patch sequence. After passing through the transformer, it contains information about the entire image.
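
A minimal sketch of that step (names and shapes are illustrative, not the original ViT code):

import torch
import torch.nn as nn

batch_size, num_patches, d_model = 8, 196, 768
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))      # learnable [CLS] embedding

patches = torch.randn(batch_size, num_patches, d_model)   # output of PatchEmbedding
cls = cls_token.expand(batch_size, -1, -1)                 # [batch, 1, d_model]
tokens = torch.cat([cls, patches], dim=1)                  # [batch, 197, d_model]
# After the encoder, tokens[:, 0] (the CLS position) feeds the classification head.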

Did You Know? The original ViT paper (Dosovitskiy et al., 2020) had a surprising finding: ViT underperformed CNNs when trained on ImageNet alone, but dramatically outperformed them when pretrained on larger datasets (JFT-300M with 300 million images). This suggested that transformers need more data to learn good visual features, but once they have enough data, they scale better than CNNs.


Let’s revisit the Heureka Moment. Why did transformers take over AI?

RNNs: Sequential (slow to train)
Transformers: Fully parallel (fast to train)

This meant transformers could be trained on much more data in the same time.

RNNs: Information degrades over distance
Transformers: Every token directly attends to every other token

A word 500 positions away is just as easy to attend to as an adjacent word.

The “scaling laws” discovered by OpenAI showed that transformers improve predictably with:

  • More parameters
  • More data
  • More compute

This predictability enabled massive investment in scaling (GPT-3, GPT-5, etc.).

Pretrained transformers transfer remarkably well:

  • BERT: Pretrain on text → fine-tune for any NLP task
  • GPT: Pretrain on text → few-shot learning on any task
  • ViT: Pretrain on images → transfer to any vision task

Attention weights are somewhat interpretable. You can visualize what the model is “looking at.” This isn’t possible with RNNs or CNNs in the same way.


# Wrong - gradients will vanish in softmax
scores = Q @ K.T
attention = softmax(scores)

# Correct - scale by sqrt(d_k)
scores = Q @ K.T / math.sqrt(d_k)
attention = softmax(scores)

# Wrong - mask should be [batch, 1, seq_len, seq_len] or broadcastable
mask = torch.ones(batch_size, seq_len, seq_len)  # Missing head dimension

# Correct for multi-head attention
mask = torch.ones(batch_size, 1, seq_len, seq_len)  # Broadcasts over heads

# Wrong - no position information!
x = token_embedding(tokens)
output = transformer(x)

# Correct
x = token_embedding(tokens)
x = positional_encoding(x)  # PositionalEncoding adds the position info internally
output = transformer(x)

# For batches with different sequence lengths, you need padding masks
# to prevent attending to padding tokens
def create_padding_mask(seq, pad_token=0):
    """Create mask where padding positions are 0."""
    return (seq != pad_token).unsqueeze(1).unsqueeze(2)

# Dangerous - might OOM
long_sequence = torch.randn(1, 10000, 512)
attention = MultiHeadAttention(512, 8)
output = attention(long_sequence)  # 10000×10000 attention matrix!

# Safer - use flash attention or limit sequence length
# Or use a model designed for long contexts (Longformer, etc.)

Q1: Why do we divide attention scores by sqrt(d_k)?

Answer: Without scaling, the variance of the dot products grows proportionally to d_k (their typical magnitude grows like sqrt(d_k)). Large dot products push softmax into saturation where gradients approach zero. Dividing by sqrt(d_k) keeps the variance of scores roughly constant regardless of dimension, maintaining healthy gradients.

Q2: What’s the difference between encoder and decoder transformers?

Answer: Encoder (BERT-style): Bidirectional attention — each position attends to all positions. Used for understanding/classification.

Decoder (GPT-style): Causal/autoregressive attention — each position only attends to previous positions. Used for generation.

The key difference is the attention mask: encoders use no mask (or just padding mask), while decoders use a causal mask that prevents looking ahead.

Q3: Why do transformers need positional encoding?

Answer: Self-attention is permutation invariant — if you shuffle the input, the computation doesn't change (ignoring position encoding). But word order matters: "dog bites man" ≠ "man bites dog."

Positional encoding adds position information to each token embedding, allowing the model to distinguish between “cat at position 1” and “cat at position 10.”

Q4: Why is attention O(n²) and why does it matter?

Answer: Each of n positions computes attention scores with all n positions, giving n×n scores. This means:

  • Memory grows quadratically with sequence length
  • Compute grows quadratically with sequence length

A 10× longer sequence needs 100× more memory/compute. This limits context windows and makes very long documents challenging. It’s why there’s active research on efficient attention (linear attention, sparse attention, Flash Attention).

Q5: What are the three weight matrices in attention (W_Q, W_K, W_V) for?

Answer:

  • W_Q (Query): Projects each token into "what am I looking for" space
  • W_K (Key): Projects each token into "what can I be found by" space
  • W_V (Value): Projects each token into "what information do I contain" space

Separating these allows different transformations for matching (Q·K) versus content retrieval (V). A pronoun like “it” needs to find its antecedent (high Q similarity with nouns) but contains little information itself (low V magnitude).


Transformers revolutionized AI but at significant computational cost:

Training Costs (Estimated):

Model      Parameters   Training Cost   Training Time
GPT-2      1.5B         ~$50K           ~1 week
GPT-3      175B         ~$4.6M          ~3 months
GPT-5      ~1.7T        ~$100M          ~6 months
Claude 3   Unknown      ~$50-100M       Unknown

Inference Costs per 1M Tokens:

Model Size     Input Cost     Output Cost
7B params      $0.10-0.50     $0.30-1.00
70B params     $0.50-2.00     $2.00-5.00
400B+ params   $2.00-10.00    $10.00-30.00

The O(n²) attention cost means context length has outsized impact:

Context Length   Relative Memory   Relative Compute
4K tokens        1×                1×
32K tokens       64×               64×
128K tokens      1,024×            1,024×

Flash Attention’s Impact: Reduces memory by 5-20× and speeds up by 2-4×, making 100K+ context practical.

Career impact data (from industry surveys):

  • Transformer expertise salary premium: +$20,000-40,000/year
  • AI/ML roles requiring transformer knowledge: 85%+
  • Time to proficiency: 2-4 weeks of dedicated study
  • Most in-demand sub-skills: attention visualization, efficient inference, fine-tuning

Q1: “Walk me through the self-attention mechanism.”

Strong Answer: “Self-attention allows each position in a sequence to gather information from all other positions. We project input embeddings into three representations: Query (what am I looking for), Key (what can I offer), and Value (what information do I contain). We compute attention scores by taking the dot product of Query with all Keys, scaled by sqrt(d_k) to prevent gradient vanishing. After softmax normalization, these scores weight the Values to produce an output that’s a learned combination of all positions. The key insight is that attention patterns are learned—the model discovers what relationships matter.”

Q2: “Why did transformers replace RNNs?”

Strong Answer: “Three fundamental reasons. First, parallelization: RNNs process sequentially (h_t depends on h_{t-1}), while transformers process all positions simultaneously, enabling massive GPU parallelism. Second, long-range dependencies: RNNs suffer from vanishing gradients over distance, but transformer attention gives each position direct access to every other position. Third, scalability: transformers follow predictable scaling laws—more parameters and data reliably improve performance—enabling systematic investment in larger models.”

Q3: “What’s the role of positional encoding?”

Strong Answer: “Self-attention is permutation invariant—shuffling inputs doesn’t change outputs. But word order matters: ‘dog bites man’ differs from ‘man bites dog.’ Positional encoding injects position information into embeddings. The original paper used sinusoidal functions: different frequencies across dimensions create unique fingerprints per position. Modern models often use learned embeddings instead, or RoPE (Rotary Position Embedding) which rotates vectors based on position for better relative position handling and length generalization.”

Q4: “Explain multi-head attention and why it’s useful.”

Strong Answer: “Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows capturing different relationship types simultaneously: one head might learn syntactic patterns (subject-verb), another semantic relationships (cause-effect), another positional patterns (adjacent words). Each head operates on d_model/num_heads dimensions, then outputs are concatenated and projected. It’s like having multiple experts each focusing on different aspects of the relationships in text.”

Q5: “How would you debug a transformer that’s not learning?”

Strong Answer: “Systematic approach: First, verify attention patterns—visualize attention weights to check if the model is learning meaningful patterns or just attending uniformly. Second, check gradient flow—attention scores should have healthy magnitude (not too large pre-softmax). Third, verify positional encoding is being added correctly. Fourth, check masking—decoder models need causal masks, and padding tokens should be masked. Fifth, start with a tiny dataset and overfit—if the model can’t memorize a few examples, there’s a fundamental bug. Sixth, compare against a known working implementation on the same data.”

Q: “Design a transformer-based document search system.”

Strong Answer Structure:

  1. Architecture Choice: “Use a bi-encoder approach with BERT-style encoder. Documents are embedded offline, queries are embedded at search time. This enables sub-second search over millions of documents.”

  2. Embedding Strategy: “Use mean pooling or CLS token for document representation. Consider chunking long documents (512 token limit) and taking max or mean over chunks. Fine-tune on in-domain data if available.”

  3. Index Structure: “Store embeddings in a vector database (Pinecone, Milvus, FAISS). Use approximate nearest neighbor search for scalability. Combine with keyword search (BM25) in a hybrid approach for best results.”

  4. Efficiency Considerations: “Quantize embeddings (float32→int8) to reduce storage 4×. Use dimensionality reduction if latency-critical. Consider caching frequent queries.”

  5. Quality Improvements: “Add cross-encoder reranking on top-K results for higher precision. Fine-tune on click data for relevance signals. A/B test embedding models.”


Exercise 1: Implement Attention from Scratch


Build your own attention mechanism without using PyTorch’s built-in functions:

import torch
import math

def manual_attention(Q, K, V, mask=None):
    """
    Implement scaled dot-product attention from scratch.
    Args:
        Q: Queries [batch, seq_len, d_k]
        K: Keys [batch, seq_len, d_k]
        V: Values [batch, seq_len, d_v]
        mask: Optional attention mask
    Returns:
        output: Attended values
        attention_weights: Attention patterns
    """
    # YOUR CODE HERE:
    # 1. Compute Q @ K.T
    # 2. Scale by sqrt(d_k)
    # 3. Apply mask if provided
    # 4. Softmax
    # 5. Multiply by V
    pass

# Test your implementation
Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
output, weights = manual_attention(Q, K, V)

# Verify: attention weights should sum to 1 per query
assert torch.allclose(weights.sum(dim=-1), torch.ones(2, 10))

Success Criteria: Attention weights sum to 1, output shape matches input.

Create attention visualizations for real text:

from transformers import AutoTokenizer, AutoModel
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention(model, tokenizer, text):
    """
    Visualize attention patterns for a given text.
    1. Tokenize the input
    2. Run through model with output_attentions=True
    3. Extract attention weights from a specific layer/head
    4. Create heatmap visualization
    """
    # YOUR CODE HERE
    pass

# Test with sentences that have interesting attention patterns
sentences = [
    "The cat sat on the mat because it was tired.",
    "The trophy didn't fit in the suitcase because it was too big.",
    "The python ate the mouse quickly and then slept.",
]
# Visualize attention for pronouns resolving to their antecedents

Deliverable: Heatmaps showing attention patterns with clear pronoun resolution.

Implement a minimal GPT-style model:

class MiniGPT(nn.Module):
    """
    Minimal GPT-style decoder-only transformer.
    Components needed:
    1. Token embedding
    2. Positional encoding
    3. Causal masked multi-head attention
    4. Feed-forward network
    5. Output projection to vocabulary
    """
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_len):
        super().__init__()
        # YOUR CODE HERE
        pass

    def forward(self, x):
        # Remember to apply causal mask!
        pass

    def generate(self, prompt_tokens, max_new_tokens=50):
        """Autoregressive generation."""
        # YOUR CODE HERE
        pass

# Train on a small corpus (Shakespeare, code, etc.)
# Verify it can generate coherent text

Success Criteria: Model generates somewhat coherent text after training.

Benchmark standard attention vs optimized versions:

import time
import torch

def benchmark_attention(seq_lengths, d_model=512, num_heads=8, num_trials=10):
    """
    Measure attention time and memory for different sequence lengths.
    1. Standard attention
    2. torch.nn.functional.scaled_dot_product_attention (if available)
    3. Flash attention (if installed)
    Return timing and memory data for plotting.
    """
    results = []
    for seq_len in seq_lengths:
        # YOUR CODE HERE
        pass
    return results

# Test with seq_lengths = [512, 1024, 2048, 4096, 8192]
# Plot the O(n²) growth
# Compare with optimized implementations

Deliverable: Graph showing quadratic growth of attention and efficiency of optimizations.


In 2019, Richard Sutton wrote “The Bitter Lesson,” arguing that general methods leveraging computation (like search and learning) ultimately beat approaches encoding human knowledge.

Transformers proved him right spectacularly:

  • They don’t encode linguistic rules—they learn them
  • They don’t have hand-crafted features—just attention
  • They scale predictably with compute

The “bitter” part? Decades of NLP research on parsing, syntax trees, and linguistic features became largely obsolete overnight. The winning strategy was: simple architecture + massive scale.

“70 years of AI research, and the answer turns out to be: matrix multiplication and gradient descent, but lots of it.” — Anonymous ML researcher on Twitter


Original Transformer Authors:

  • Ashish Vaswani (@ashaborali) - Co-founder of Essential AI
  • Noam Shazeer - Co-founder of Character.AI, returned to Google
  • Jakob Uszkoreit - Co-founder of Inceptive (RNA design with transformers)

Modern Practitioners:

  • Andrej Karpathy (@karpathy) - Former Tesla AI Director, incredible educational content
  • Jay Alammar - Author of “The Illustrated Transformer”
  • Tri Dao - Flash Attention creator, now at Together AI
  • Sebastian Raschka (@rasbt) - Excellent books and papers on LLMs

Efficiency:

  • Mixture of Experts (MoE) - Use only a fraction of parameters per forward pass
  • State Space Models (Mamba) - Linear complexity alternative to attention
  • Speculative Decoding - Faster inference with draft models

Alignment & Capabilities:

  • Constitutional AI - Training models to follow principles
  • RLHF and DPO - Aligning models with human preferences
  • Multimodal - Single models for text, image, audio, video

Understanding:

  • Mechanistic Interpretability - Understanding what transformers actually learn internally
  • Emergent Abilities - Capabilities that appear suddenly at scale
  • In-Context Learning - How transformers learn from examples in the prompt

  1. “Attention Is All You Need” (Vaswani et al., 2017) — The original transformer paper
  2. “BERT: Pre-training of Deep Bidirectional Transformers” (Devlin et al., 2018) — Encoder-only
  3. “Language Models are Few-Shot Learners” (Brown et al., 2020) — GPT-3, scaling laws
  4. “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020) — Vision Transformer
  5. “FlashAttention” (Dao et al., 2022) — Efficient attention implementation

You’ve learned:

  1. Why RNNs failed: Sequential processing, vanishing gradients, long-range difficulties
  2. Self-attention: Q, K, V projections → scaled dot product → softmax → weighted sum
  3. Multi-head attention: Multiple attention patterns in parallel
  4. Positional encoding: Adding position information to an orderless system
  5. Transformer architecture: Attention + FFN + residuals + layer norm
  6. Encoder vs decoder: Bidirectional vs causal attention
  7. The O(n²) problem: Why long sequences are expensive
  8. Vision Transformers: Patches as tokens

The transformer architecture is the foundation of modern AI. GPT, BERT, Claude, Gemini, Stable Diffusion, DALL-E — they all build on the ideas in this module.


Attention is all you need — and now you understand why.

The insight is elegant: instead of processing sequences step by step, let every position look at every other position simultaneously. Let the model learn what to pay attention to through training.

This simple change:

  • Enabled massive parallelization (faster training)
  • Eliminated the information bottleneck of sequential processing
  • Made scaling predictable and efficient
  • Created a universal architecture for text, images, audio, video, and more

When you use ChatGPT, Claude, or any modern AI system, you’re using transformers. You now understand the core mechanism that makes them work.


Move on to Module 31: Backpropagation Deep Dive where you’ll learn:

  • How gradients actually flow through networks
  • The chain rule in detail
  • Implementing autograd from scratch
  • Debugging gradient issues

This completes the “how neural networks learn” trilogy: forward pass (Module 26), training techniques (Module 28), and now the math of learning itself.


Last updated: 2025-12-11 Status: Complete