Training Neural Networks
AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8
Or: The Framework That Made Deep Learning Accessible
Reading Time: 6-7 hours | Prerequisites: Module 26
When researcher Soumith Chintala found in September 2016 that debugging TensorFlow felt like repairing a car engine while it was running, he decided something had to change. After watching brilliant AI engineers lose days to incomprehensible error messages, he and colleague Adam Paszke set out to find a better way. That night, they started building PyTorch. Within five years, their creation would power everything from large GPT-style language models to Stable Diffusion, fundamentally changing how the world builds AI.
Did You Know? The Researcher’s Rebellion
Menlo Park. September 2016. 11:47 PM.
Soumith Chintala was done. For months, he had watched brilliant AI researchers at Facebook waste hours—sometimes days—fighting TensorFlow’s static graphs. Print statements didn’t work. Debuggers were useless. One typo meant recompiling everything.
“This is insane,” he muttered to his colleague Adam Paszke. “We’re supposed to be doing AI research, not fighting our tools.”
They decided to build something better. Not incrementally better—fundamentally different. A framework where Python code was just… Python code. Where you could debug neural networks like any other program. Where ideas could be tested in minutes, not days.
They called it PyTorch. Within three years, it would conquer academic AI. Within five, it would power everything from large GPT-style language models to Stable Diffusion.
“The best framework is the one that gets out of your way. TensorFlow made you think about graphs. PyTorch just let you think about math.” — Soumith Chintala, PyTorch creator (2020 interview)
What You’ll Be Able to Do
By the end of this module, you will:
- Understand PyTorch tensors and their relationship to NumPy arrays
- Master automatic differentiation with autograd
- Build neural networks using nn.Module
- Implement training loops with optimizers and loss functions
- Move computations between CPU and GPU
- Appreciate the elegance of PyTorch compared to manual implementations
Introduction: From Pain to Power
In Module 26, you built neural networks from scratch. You computed gradients by hand using the chain rule. You tracked intermediate values in caches. You debugged NaN explosions at 2am.
It was educational. It was also painful.
That pain was the point. You now understand what happens under the hood. But here’s the truth: nobody builds production neural networks from scratch. Think of it like learning to build a car engine from raw metal before being allowed to drive. Valuable knowledge, but not how you’d get to work every day.
PyTorch is the power tool that makes deep learning practical.
Think of it this way: In Module 26, you learned to chop down a tree with a hand axe. Now you get a chainsaw. The chainsaw doesn’t make the hand axe knowledge useless—understanding how to fell a tree helps you use the chainsaw safely and effectively.
Did You Know? The Birth of PyTorch
The Framework Wars
In 2015, Google released TensorFlow. It was powerful, backed by Google’s resources, and quickly became the dominant deep learning framework. But researchers had a problem: TensorFlow was painful to use.
TensorFlow 1.x used something called static graphs. You had to:
- Define your entire computation as a graph
- Compile the graph
- Create a “session” to run it
- Feed data through placeholders
It was like writing a recipe, translating it to assembly language, compiling it, and only then cooking - for every single meal.
```python
# TensorFlow 1.x - The pain was real
import tensorflow as tf

# Step 1: Define placeholders (not real data yet!)
x = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.float32, shape=[None, 10])

# Step 2: Build the graph (nothing runs!)
W = tf.Variable(tf.zeros([784, 10]))
logits = tf.matmul(x, W)

# Step 3: Create a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())     # Initialize
    result = sess.run(logits, feed_dict={x: data})  # Finally run! (data is your input batch)
```

Debugging was a nightmare. You couldn’t just print a variable - you had to evaluate it in a session. Error messages pointed to the graph construction, not where the actual problem was.
The Facebook Answer
Enter Soumith Chintala at Facebook AI Research. He and his team created PyTorch in 2016 with a radical philosophy: define-by-run.
Instead of building a graph and then running it, PyTorch builds the graph as you run Python code. You write normal Python. You can use print statements. You can use Python debuggers. If statements, for loops - they all just work.
```python
# PyTorch - The relief was immediate
import torch

x = torch.randn(32, 784)  # This creates actual data!
W = torch.randn(784, 10, requires_grad=True)

logits = x @ W       # This actually computes the result!
print(logits.shape)  # You can just print it!
```

The Research Takeover
By 2019, PyTorch had conquered academia:
- NeurIPS 2019: 75% of papers used PyTorch
- ICLR 2020: 80% PyTorch
- CVPR 2020: 70% PyTorch
Why? Researchers need to iterate fast. They try crazy ideas. Many don’t work. Static graphs meant recompiling for every experiment. Dynamic graphs meant instant feedback.
TensorFlow noticed. TensorFlow 2.0 (2019) adopted eager execution by default - essentially admitting PyTorch got it right.
The Name
Why “PyTorch”? It’s the Python version of Torch, a scientific computing framework written in Lua that was popular in academia during the early 2010s. The original Torch was named after the Olympic torch - a symbol of passing knowledge forward.
The PyTorch logo is a stylized flame - representing both the torch and the “fire” of GPU-accelerated computing.
Part 1: Tensors - The Universal Container
What is a Tensor, Really?
You’ve heard “tensor” thrown around. Let’s demystify it.
A tensor is just a multi-dimensional array. That’s it. No magic. But this simple concept is the foundation of everything in deep learning.
Think of tensors like containers of different dimensions:
| Dimensions | Math Name | Real Example | Shape |
|---|---|---|---|
| 0 | Scalar | The temperature right now: 72.5°F | [] |
| 1 | Vector | Today’s hourly temperatures: [68, 70, 72, 75, 73] | [5] |
| 2 | Matrix | A grayscale image with pixel values | [28, 28] |
| 3 | 3D Tensor | A color image (RGB channels × height × width) | [3, 224, 224] |
| 4 | 4D Tensor | A batch of color images | [32, 3, 224, 224] |
| 5 | 5D Tensor | A batch of video clips (batch × frames × channels × H × W) | [8, 16, 3, 224, 224] |
The key insight: Neural networks don’t care about what data means. They just see tensors of numbers. An image, a sentence, a stock price history - all become tensors.
Why This Module Matters
You might wonder: we already have NumPy. Why learn another array type?
Three killer features:
1. Automatic Differentiation
NumPy can do math on arrays. PyTorch can do math on arrays and track how to compute gradients. This is the magic that makes deep learning practical.
```python
# NumPy: Just computation
import numpy as np
x = np.array([2.0, 3.0])
y = x ** 2  # [4, 9]
# Now compute dy/dx manually? Good luck!

# PyTorch: Computation + gradient tracking
import torch
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # 13
y.backward()        # Compute gradients automatically
print(x.grad)       # tensor([4., 6.]) - that's dy/dx = 2x!
```

2. GPU Acceleration
Moving computation to a GPU in NumPy requires different libraries and painful code changes. In PyTorch, it’s one line:

```python
# CPU tensor
x = torch.randn(1000, 1000)

# GPU tensor - one line! (requires a CUDA-capable GPU)
x_gpu = x.cuda()  # or x.to('cuda')
```

3. Deep Learning Ecosystem
PyTorch tensors integrate seamlessly with neural network layers, optimizers, data loaders, and the entire deep learning workflow.
Creating Tensors
Let’s get hands-on. There are many ways to create tensors:
From Python Data
The most direct way - convert Python lists:
```python
import torch

# From a simple list
x = torch.tensor([1, 2, 3, 4, 5])
print(x)  # tensor([1, 2, 3, 4, 5])

# From nested lists (creates a matrix)
matrix = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)  # torch.Size([2, 3])
```

Common Initializations
In practice, you rarely type out values. You create tensors filled with specific patterns:
```python
# Zeros and ones - common for initialization
zeros = torch.zeros(3, 4)    # 3×4 matrix of zeros
ones = torch.ones(2, 3, 4)   # 2×3×4 tensor of ones

# Random values - essential for weight initialization
uniform = torch.rand(5, 5)   # Uniform between [0, 1)
normal = torch.randn(5, 5)   # Normal distribution (mean=0, std=1)

# Sequences - useful for indices and positions
sequence = torch.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Identity matrix - useful in linear algebra
identity = torch.eye(4)  # 4×4 identity matrix
```

Copying Shape from Another Tensor
Often you need a tensor the same shape as another:
```python
x = torch.randn(3, 4, 5)

# Create zeros/ones with the same shape, dtype, and device
zeros_like_x = torch.zeros_like(x)
ones_like_x = torch.ones_like(x)
random_like_x = torch.randn_like(x)
```

This is especially useful when you need to create tensors on the same device (CPU or GPU) as your model.
Tensor Properties
Every tensor has properties you’ll check constantly:

```python
t = torch.randn(3, 4, 5)

# Shape: The dimensions of the tensor
print(t.shape)   # torch.Size([3, 4, 5])
print(t.size())  # Same thing, method form

# Data type: What kind of numbers
print(t.dtype)   # torch.float32 (default for randn)

# Device: Where the tensor lives
print(t.device)  # cpu (or cuda:0, cuda:1, etc.)

# Number of dimensions
print(t.ndim)    # 3

# Total number of elements
print(t.numel()) # 60 (3 × 4 × 5)
```

Did You Know? Data Types Matter More Than You Think
PyTorch supports many data types, and choosing the right one affects both correctness and performance.
For Neural Networks (most common):
- torch.float32 (or torch.float) - The default. Good balance of precision and speed.
- torch.float16 (or torch.half) - Half precision. Up to about 2× faster on modern GPUs, but less precise and with a much smaller numeric range.
- torch.bfloat16 - “Brain float”. Often better for training than float16 because it has more exponent bits (8 instead of 5), at the cost of fewer mantissa bits.

For Indices and Counts:
- torch.int64 (or torch.long) - Expected by most PyTorch APIs that take indices or class labels. Most common integer type.
- torch.int32 - When you know values fit and want to save memory.

For Images:
- torch.uint8 - Unsigned 8-bit integers (0-255). Raw image format.
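To make the float16 versus bfloat16 trade-off concrete, here is a small standard-library sketch (no PyTorch needed). The helper names `to_float16` and `to_bfloat16` are hypothetical, and the bfloat16 conversion simply truncates a float32 to its top 16 bits, which is a simplification of the rounding real hardware performs.

```python
import struct

def to_float16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_bfloat16(value: float) -> float:
    """Approximate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Precision: float16 keeps more mantissa bits, so 0.1 is represented more closely.
print(to_float16(0.1), to_bfloat16(0.1))

# Range: float16 tops out at 65504, while bfloat16 shares float32's exponent range.
print(to_float16(65504.0))
print(to_bfloat16(1e30))  # still finite

try:
    to_float16(1e30)
except OverflowError:
    print("1e30 does not fit in float16")
```

The range difference is the reason float16 training usually needs loss scaling, while bfloat16 trades some precision for float32-like range.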
```python
# Creating tensors with specific types
weights = torch.randn(100, 100, dtype=torch.float32)
indices = torch.tensor([0, 5, 3, 7], dtype=torch.long)
image = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)

# Converting between types
weights_half = weights.half()        # to float16
weights_back = weights_half.float()  # back to float32
```

The fp16 Training Revolution
In 2017, researchers showed that you could train neural networks largely in half precision (float16) with almost no accuracy loss, while running up to about 2× faster and using roughly half the memory. This “mixed precision training” is now standard for large models.

```python
# Modern training uses automatic mixed precision
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(inputs)  # Automatically uses fp16 where safe
```

The NumPy Bridge
PyTorch and NumPy are best friends. For CPU tensors they can share memory, making conversion instant:
```python
import numpy as np

# NumPy → PyTorch (shared memory!)
numpy_array = np.array([1, 2, 3, 4, 5])
tensor = torch.from_numpy(numpy_array)

# They share memory - changes propagate!
numpy_array[0] = 100
print(tensor)  # tensor([100, 2, 3, 4, 5]) - changed too!

# If you want a copy instead:
tensor_copy = torch.tensor(numpy_array)  # Independent copy

# PyTorch → NumPy
tensor = torch.randn(3, 4)
numpy_array = tensor.numpy()  # Shared memory (if on CPU)

# Safe conversion (handles GPU tensors too)
numpy_array = tensor.detach().cpu().numpy()
```

Warning: The shared memory behavior is a feature, not a bug. It’s fast and memory-efficient. But it can surprise you if you modify one and expect the other unchanged!
Part 2: Autograd - The Magic Behind Deep Learning
This is where PyTorch becomes truly magical. Remember computing gradients by hand in Module 26? All those partial derivatives, the chain rule applied recursively, the careful tracking of intermediate values?
PyTorch does all of that automatically.
The Computational Graph
When you create a tensor with requires_grad=True, PyTorch starts recording every operation. It builds an invisible “computational graph” that tracks how to compute gradients.
Let’s see it in action:
```python
# Create a tensor that tracks gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Perform operations - PyTorch records them!
y = x ** 2   # y = [4, 9]
z = y.sum()  # z = 13

# Compute gradients
z.backward()

# Gradients are stored in .grad
print(x.grad)  # tensor([4., 6.])
```

What happened? Let’s trace through:
- z = x[0]² + x[1]²
- ∂z/∂x[0] = 2·x[0] = 2·2 = 4
- ∂z/∂x[1] = 2·x[1] = 2·3 = 6
PyTorch computed exactly the gradients we’d compute by hand - but automatically!
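To see what "recording every operation" might look like, here is a deliberately tiny scalar autograd sketch in plain Python. The `Value` class is a hypothetical teaching aid, not PyTorch's implementation: each result remembers its parents and a small function that pushes gradients back to them.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical teaching class)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes self.grad back to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Order the graph so each node comes after its parents,
        # then apply the chain rule from the output backwards.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()

x0, x1 = Value(2.0), Value(3.0)
z = x0 * x0 + x1 * x1  # same function as above: z = x[0]² + x[1]²
z.backward()
print(z.data, x0.grad, x1.grad)  # 13.0 4.0 6.0
```

Note that gradients are added with `+=` rather than assigned, which is the same design choice behind the accumulation behavior PyTorch exhibits.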
Why This Matters
In Module 26, you implemented backpropagation manually. For a simple network, it was manageable. But modern networks have:
- Millions of parameters
- Hundreds of layers
- Complex architectures (skip connections, attention, normalization)
Computing gradients by hand for a GPT-scale model? That would be tens of thousands of lines of gradient code. With PyTorch:

```python
loss.backward()  # That's it. Gradients for every parameter.
```

The Chain Rule in Action
The real power shows with complex computations:
```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Complex computation with multiple steps
y = x * 2        # y = [2, 4, 6]
z = y ** 2       # z = [4, 16, 36]
loss = z.mean()  # loss = 56/3 ≈ 18.67

# One backward call - all gradients computed!
loss.backward()

print(x.grad)  # tensor([2.6667, 5.3333, 8.0000])
```

Let’s verify. The chain rule says:
- ∂loss/∂x = ∂loss/∂z · ∂z/∂y · ∂y/∂x
- ∂loss/∂z = 1/3 (derivative of mean)
- ∂z/∂y = 2y = [4, 8, 12]
- ∂y/∂x = 2
- Result: [4·2/3, 8·2/3, 12·2/3] = [8/3, 16/3, 24/3]
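A common sanity check for any gradient, hand-derived or automatic, is to compare it with finite differences. The sketch below does this in plain Python for the same loss; `loss_fn` and `numerical_gradient` are hypothetical helper names.

```python
def loss_fn(x):
    """loss = mean((2 * x_i) ** 2), the same computation as above."""
    return sum((2.0 * xi) ** 2 for xi in x) / len(x)

def numerical_gradient(f, x, eps=1e-6):
    """Central differences: (f(x + eps) - f(x - eps)) / (2 * eps) per coordinate."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

x = [1.0, 2.0, 3.0]
analytic = [8.0 * xi / len(x) for xi in x]  # chain rule: (1/3) · 2y · 2 = 8x/3
numeric = numerical_gradient(loss_fn, x)

for a, n in zip(analytic, numeric):
    print(f"analytic={a:.4f}  numeric={n:.4f}")
```

Finite differences need two forward passes per input, which is why they are useful as a check but impractical as a training method.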
Did You Know? The Secret History of Automatic Differentiation
Automatic differentiation isn’t a deep learning invention. It was developed in the 1960s and 1970s for computational physics and engineering!
Reverse-mode automatic differentiation (what PyTorch uses) was published by Seppo Linnainmaa in 1970 for his master’s thesis. Backpropagation in neural networks was rediscovered independently in the 1980s - it’s the same algorithm applied to neural network computation graphs.
The key insight: forward-mode AD is efficient when you have few inputs and many outputs. Reverse-mode is efficient when you have many inputs and few outputs. Neural networks have millions of inputs (weights) and one output (loss) - reverse mode wins!
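The cost asymmetry is easier to see with a sketch of forward mode. One standard way to implement it is with dual numbers, shown here in plain Python; the `Dual` class and `forward_mode_gradient` helper are hypothetical names for illustration. Each pass yields the derivative with respect to a single input, so a full gradient needs one pass per input.

```python
class Dual:
    """A number value + deriv·ε with ε² = 0; deriv carries the derivative along."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

def f(x0, x1):
    return x0 * x0 + x1 * x1

def forward_mode_gradient(func, point):
    """One forward pass per input: seed that input's derivative with 1."""
    grad = []
    for i in range(len(point)):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        grad.append(func(*args).deriv)
    return grad

print(forward_mode_gradient(f, [2.0, 3.0]))  # [4.0, 6.0]
```

With millions of weights that loop would run millions of times, whereas reverse mode obtains every partial derivative of a scalar loss in a single backward sweep.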
Critical Gotcha: Gradients Accumulate!
This trips up every PyTorch beginner:
```python
x = torch.tensor([1.0], requires_grad=True)

y = x * 2
y.backward()
print(x.grad)  # tensor([2.])

y = x * 3
y.backward()
print(x.grad)  # tensor([5.]) - Not 3! It's 2 + 3!
```

By default, calling .backward() adds to existing gradients. This is actually useful for some advanced techniques (like gradient accumulation), but usually you want fresh gradients each time.
The fix: Zero gradients before each backward pass:

```python
x.grad.zero_()  # Zero out accumulated gradients
y = x * 4
y.backward()
print(x.grad)  # tensor([4.]) - Fresh gradient
```

In training loops, you’ll see optimizer.zero_grad() - this resets the gradients of every parameter the optimizer manages.
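Why would anyone want accumulation? A plain-Python sketch, under the assumption of a one-parameter model `w * x` with a mean-squared-error loss (the helper `mse_grad` is hypothetical): adding up the scaled gradients of equal-sized micro-batches reproduces the gradient of one large batch, which helps when the large batch does not fit in memory.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w*x - y)**2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient from the full batch of four samples
full_batch = mse_grad(w, xs, ys)

# Same gradient from two micro-batches of two samples each
accum_steps = 2
micro_size = len(xs) // accum_steps
grad = 0.0  # plays the role of w.grad right after zeroing
for start in range(0, len(xs), micro_size):
    micro_x = xs[start:start + micro_size]
    micro_y = ys[start:start + micro_size]
    # Dividing by accum_steps mirrors scaling each micro-batch loss before backward()
    grad += mse_grad(w, micro_x, micro_y) / accum_steps

print(full_batch, grad)  # -22.5 -22.5
```

In a PyTorch loop the same idea means calling backward() on several micro-batches and only then stepping the optimizer and zeroing the gradients.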
When to Detach from the Graph
Sometimes you want to use a computed value without tracking gradients:

```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach: Creates a new tensor, no gradient tracking
z = y.detach()
print(z.requires_grad)  # False

# Or use no_grad context
with torch.no_grad():
    z = y * 2
    print(z.requires_grad)  # False
```

When to use this:
- Computing metrics (accuracy, etc.) during training
- Using a frozen pretrained model
- Preventing gradients from flowing to certain parts of the network
Part 3: Building Neural Networks with nn.Module
Now we get to build actual neural networks! PyTorch provides torch.nn, a module specifically designed for deep learning.
The nn.Module Class
Every neural network in PyTorch inherits from nn.Module. This base class provides:
- Automatic parameter registration
- Easy device movement (CPU ↔ GPU)
- Training/evaluation mode switching
- Model saving and loading
Here’s the pattern you’ll use hundreds of times:
```python
import torch.nn as nn

class SimpleNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()  # Always call this first!

        # Define layers (registered automatically!)
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Define how data flows through the network
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create the network
model = SimpleNetwork(784, 128, 10)

# Forward pass - just call the model!
x = torch.randn(32, 784)  # Batch of 32 images
output = model(x)         # Calls forward() automatically
print(output.shape)       # torch.Size([32, 10])
```

Compare this to Module 26 where you manually created weight matrices, implemented forward propagation, and tracked everything yourself. The PyTorch version is almost self-documenting!
Did You Know? Why super().__init__()?
That super().__init__() call isn’t just Python formality. It initializes PyTorch’s internal machinery that:
- Creates a registry for parameters
- Enables recursive calls like .to(device) on all submodules
- Sets up hooks for saving/loading

Forget it, and PyTorch raises an AttributeError as soon as you assign the first layer, because the registries that hold submodules and parameters do not exist yet. Every PyTorch tutorial includes it, and now you know why!
Common Layers Explained
PyTorch provides layers for every architecture. Here are the ones you’ll use most:
Linear (Fully Connected) Layers
These are the basic building blocks - matrix multiplication plus bias:
```python
# Linear: y = xW^T + b
layer = nn.Linear(in_features=784, out_features=128)

# What it creates internally:
# - weight: [128, 784] matrix
# - bias: [128] vector (optional, bias=True by default)
```

Activation Functions
Activations introduce non-linearity (without them, a deep network is just one linear transformation):

```python
# As modules (use in __init__)
nn.ReLU()           # max(0, x) - most common
nn.LeakyReLU(0.01)  # Allows small negative values
nn.GELU()           # Used in transformers (smoother than ReLU)
nn.Sigmoid()        # Squashes to (0, 1)
nn.Tanh()           # Squashes to (-1, 1)

# As functions (use in forward)
import torch.nn.functional as F
output = F.relu(x)
output = F.gelu(x)
```

Normalization Layers
These stabilize training by normalizing intermediate values:

```python
nn.BatchNorm1d(num_features)    # Normalize across batch
nn.LayerNorm(normalized_shape)  # Normalize across features (transformers)
```

Dropout
Randomly zeroes elements during training to prevent overfitting:

```python
nn.Dropout(p=0.5)  # Each element is zeroed with probability 0.5 during training; the rest are scaled by 1/(1-p)
```

Sequential: The Quick Way
For simple architectures, you don’t need a custom class:

```python
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10)
)

# Works exactly like a custom nn.Module
output = model(x)  # x: a [batch, 784] tensor
```

Use Sequential for prototypes and simple models. Use custom classes when you need complex control flow (if statements, loops, skip connections).
Inspecting Your Model
PyTorch makes it easy to see what’s inside:

```python
model = SimpleNetwork(784, 128, 10)

# Print model structure
print(model)
# SimpleNetwork(
#   (fc1): Linear(in_features=784, out_features=128, bias=True)
#   (fc2): Linear(in_features=128, out_features=10, bias=True)
#   (relu): ReLU()
# )

# List all parameters with names
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
# fc1.weight: torch.Size([128, 784])
# fc1.bias: torch.Size([128])
# fc2.weight: torch.Size([10, 128])
# fc2.bias: torch.Size([10])

# Count total parameters
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")  # 101,770
```

Part 4: Training Neural Networks
Now for the payoff - actually training a network!
Loss Functions: Measuring Wrongness
A loss function measures how wrong your predictions are. Lower is better.
For Classification (choosing between categories):

```python
# Cross-entropy loss - the workhorse of classification
criterion = nn.CrossEntropyLoss()

# It expects:
# - Input: Raw logits (NOT softmaxed!)    Shape: [batch, num_classes]
# - Target: Class indices (NOT one-hot!)  Shape: [batch]

logits = torch.randn(32, 10)          # Raw network output
labels = torch.randint(0, 10, (32,))  # Class labels 0-9
loss = criterion(logits, labels)
```

Critical: CrossEntropyLoss applies softmax internally. Don’t softmax your outputs first - you’ll get wrong gradients and worse training!
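To show why the loss wants raw logits, here is a plain-Python sketch of the underlying math: a log-softmax followed by the negative log-likelihood of the target class, averaged over the batch. The function names are hypothetical and this is not PyTorch's implementation, only the same formula written out.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def cross_entropy(batch_logits, targets):
    """Mean negative log-likelihood of the target class indices."""
    losses = [-log_softmax(row)[t] for row, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(cross_entropy(logits, targets))

# Subtracting the max keeps huge logits from overflowing exp()
print(cross_entropy([[1000.0, 0.0]], [0]))  # 0.0
```

Working in log space directly from the logits is what keeps this stable; applying softmax first and then passing the result in would squash the inputs a second time.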
For Regression (predicting numbers):
```python
criterion = nn.MSELoss()  # Mean squared error
criterion = nn.L1Loss()   # Mean absolute error
```

Optimizers: Updating Weights
Optimizers implement gradient descent algorithms. They take gradients and update parameters:

```python
import torch.optim as optim

# SGD: Simple, but needs tuning
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: Usually works well out of the box
optimizer = optim.Adam(model.parameters(), lr=0.001)

# AdamW: Adam with proper weight decay (recommended for transformers)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```

Did You Know? The Adam Story
Adam (2014) combined ideas from two earlier optimizers:
- Momentum: Use exponentially weighted average of past gradients
- RMSprop: Adapt learning rate per-parameter based on gradient history
The name “Adam” comes from “adaptive moment estimation”. Within a year of publication, it became the default optimizer for most deep learning - it just works in most situations without careful tuning.
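But it’s not perfect! Researchers later found that implementing weight decay as an L2 penalty added to the gradient interacts badly with Adam’s per-parameter step sizes. AdamW (2017) fixed this by applying the decay directly to the weights, separately from the gradient update, and it’s now preferred for large models.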
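The two ingredients are easy to spot when the update is written out. The sketch below applies the standard Adam update to a single scalar parameter in plain Python; `adam_minimize` is a hypothetical helper, and the learning rate and step count are illustrative choices rather than recommendations.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # momentum: running average of gradients
        v = beta2 * v + (1 - beta2) * g * g  # RMSprop: running average of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize f(w) = (w - 3)², whose gradient is 2(w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # close to 3.0
```

Dividing by the square root of `v_hat` is what makes the step size adapt per parameter: parameters with consistently large gradients take proportionally smaller steps.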
The Training Loop
Here’s the standard PyTorch training pattern you’ll use forever:

```python
model = SimpleNetwork(784, 128, 10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Assumes num_epochs and train_loader are already defined,
# with train_loader yielding (data, labels) where data has shape [batch, 784]
for epoch in range(num_epochs):
    model.train()  # Enable training mode (dropout, batchnorm behave differently)

    for batch_idx, (data, labels) in enumerate(train_loader):
        # 1. Zero gradients from previous batch
        optimizer.zero_grad()

        # 2. Forward pass
        outputs = model(data)

        # 3. Compute loss
        loss = criterion(outputs, labels)

        # 4. Backward pass (compute gradients)
        loss.backward()

        # 5. Update weights
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
```

The five steps are always the same:
- Zero gradients
- Forward pass
- Compute loss
- Backward pass
- Update weights
This pattern works whether you’re training a 2-layer network on MNIST or a billion-parameter language model.
Evaluation Mode
When evaluating (not training), you need to:
- Switch to evaluation mode (changes dropout/batchnorm behavior)
- Disable gradient computation (faster, uses less memory)
```python
model.eval()  # Evaluation mode

correct = 0
total = 0

with torch.no_grad():  # Don't compute gradients
    for data, labels in test_loader:
        outputs = model(data)
        _, predicted = outputs.max(1)  # Get predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Accuracy: {accuracy:.2f}%")
```

Part 5: GPU Computing
GPUs can make training 10-100× faster. PyTorch makes GPU computing almost trivially easy.
Moving to GPU
```python
# Check if CUDA (GPU) is available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Count: {torch.cuda.device_count()}")
else:
    print("No GPU available, using CPU")

# The standard pattern: create a device variable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move tensors
x = torch.randn(1000, 1000)
x_gpu = x.to(device)

# Move models (moves all parameters)
model = SimpleNetwork(784, 128, 10).to(device)

# Create tensors directly on GPU
y = torch.randn(1000, 1000, device=device)
```

GPU Training Loop
The only change from CPU training: move data to the GPU each batch.
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = SimpleNetwork(784, 128, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    for data, labels in train_loader:
        # Move data to GPU
        data = data.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
```

Did You Know? GPU Memory Gotchas
GPU memory is precious and limited. Common mistakes:
Memory Leak #1: Storing Tensors in Python Lists
```python
# BAD - keeps entire computation graph!
losses = []
for data, labels in loader:
    loss = criterion(model(data), labels)
    losses.append(loss)  # Full tensor with gradient graph!

# GOOD - just keep the number
losses = []
for data, labels in loader:
    loss = criterion(model(data), labels)
    losses.append(loss.item())  # Just the Python float
```

Memory Leak #2: Not Using no_grad() During Evaluation

```python
# BAD - builds computation graph unnecessarily
accuracy = (model(data).argmax(1) == labels).float().mean()

# GOOD - no gradient tracking needed
with torch.no_grad():
    accuracy = (model(data).argmax(1) == labels).float().mean()
```

Checking Memory:

```python
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```

Part 6: Data Loading
PyTorch’s DataLoader handles batching, shuffling, and parallel loading.
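Conceptually, batching and shuffling amount to permuting sample indices and slicing them into fixed-size groups. The plain-Python sketch below illustrates just that idea; `iterate_batches` and its parameters are hypothetical, and it is not how DataLoader is implemented (there is no parallel loading or tensor collation here).

```python
import random

def iterate_batches(samples, batch_size, shuffle=True, seed=None):
    """Yield lists of samples, visiting every sample exactly once per pass."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

# 1000 toy (feature, label) pairs
dataset = [(float(i), i % 10) for i in range(1000)]
batches = list(iterate_batches(dataset, batch_size=32, seed=0))

print(len(batches), len(batches[0]), len(batches[-1]))  # 32 32 8
```

The last batch is smaller because 1000 is not a multiple of 32, which is worth remembering whenever code assumes a fixed batch size.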
Basic Usage
```python
from torch.utils.data import DataLoader, TensorDataset

# Create a dataset from tensors
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X, y)

# Create a data loader
loader = DataLoader(
    dataset,
    batch_size=32,   # Samples per batch
    shuffle=True,    # Shuffle each epoch (for training)
    num_workers=4,   # Parallel data loading processes
    pin_memory=True  # Faster GPU transfer
)

# Iterate
for batch_x, batch_y in loader:
    print(batch_x.shape, batch_y.shape)  # [32, 784], [32] (the final batch holds the remaining 8 samples)
```

Built-in Datasets
PyTorch provides standard datasets through torchvision:
```python
from torchvision import datasets, transforms

# Define preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),                      # PIL Image → Tensor
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean/std
])

# Load MNIST
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, download=True, transform=transform)

# Create loaders
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)
```

Part 7: Saving and Loading Models
Always save your trained models!
Recommended: Save State Dict
```python
# Save just the weights (recommended)
torch.save(model.state_dict(), 'model_weights.pth')

# Load
model = SimpleNetwork(784, 128, 10)  # Create architecture first
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval()  # Set to evaluation mode
```

Checkpointing for Training
For long training runs, save checkpoints to resume later:
```python
# Save checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),  # Store a plain float rather than a tensor tied to the graph
}
torch.save(checkpoint, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth', weights_only=True)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
```

Did You Know? The .pth Security Risk
PyTorch model files use Python’s pickle format. Pickle can execute arbitrary code when loading!

```python
# This could run malicious code!
model = torch.load('untrusted_model.pth', weights_only=False)  # Dangerous: full unpickling
```

Only load models from sources you trust. Passing weights_only=True to torch.load restricts unpickling to tensors and basic containers, and it is the default in recent PyTorch releases. For sharing models publicly, consider the safetensors format, which can’t execute code.
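The mechanism behind this risk is pickle's `__reduce__` hook, which lets an object name any callable to be run at load time. The standard-library sketch below uses a harmless arithmetic expression as a stand-in for a real payload; the `Payload` class is hypothetical.

```python
import pickle

class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A real attacker would name something far less friendly.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading is enough to trigger the call; no method of Payload is ever invoked by us.
result = pickle.loads(blob)
print(result)  # 42
```

What comes back is not a `Payload` at all but whatever the named callable returned, which is why a file that merely looks like model weights can do anything the Python process is allowed to do.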
Part 8: PyTorch vs From-Scratch Comparison
Let’s appreciate how far we’ve come. Here’s Module 26 versus PyTorch:
Module 26: Manual Backpropagation

```python
# Excerpt: methods of the Module 26 network class

# Forward pass - tracking everything manually
def forward(self, X):
    self.cache = {'A0': X}
    A = X
    for l in range(1, len(self.layer_dims)):
        Z = self.params[f'W{l}'] @ A + self.params[f'b{l}']
        A = self.activation_fn(Z)
        self.cache[f'A{l}'] = A
    return A

# Backward pass - chain rule by hand
def backward(self, Y):
    m = Y.shape[1]
    L = len(self.layer_dims) - 1  # Index of the output layer
    dA = -(Y / self.cache[f'A{L}'])

    for l in reversed(range(1, L + 1)):
        dZ = dA * self.activation_derivative(self.cache[f'A{l}'])
        self.grads[f'dW{l}'] = (1/m) * dZ @ self.cache[f'A{l-1}'].T
        self.grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
        dA = self.params[f'W{l}'].T @ dZ
```

PyTorch: Elegance

```python
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Training - all the complexity hidden
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()   # All gradients computed!
optimizer.step()  # All weights updated!
```

What PyTorch automates:
- Gradient computation for any architecture
- Cache management
- Numerical stability
- GPU support
- Optimizers
- Data loading
- Model serialization
Did You Know? The Future: torch.compile()
PyTorch 2.0 (announced in December 2022, released in March 2023) introduced something remarkable: torch.compile().

```python
model = MyModel()
model = torch.compile(model)  # That's it!
```

One line turns your dynamic, debuggable PyTorch model into an optimized, compiled version that can run 30-200% faster, depending on the model and hardware. The dynamic graph philosophy remains - you can still use Python control flow, print statements, debuggers - while getting much closer to static-graph performance.
This represents PyTorch’s philosophy: make the right thing easy. Build your model the simple way, debug it, make sure it works. Then compile for production.
Economics of PyTorch
Total Cost of Development
PyTorch doesn’t cost money to use, but the ecosystem has significant economic implications:
Development Time Comparison:
| Task | Manual NumPy | PyTorch | Savings |
|---|---|---|---|
| Simple MLP | 1 day | 1 hour | 87% |
| CNN for images | 3 days | 4 hours | 83% |
| LSTM/Transformer | 1 week | 1 day | 86% |
| Training loop | 2 hours | 15 min | 88% |
| GPU support | 1-2 days | 5 min | 99% |
At a $150/hour senior engineer rate (assuming 8-hour days):
- Manual implementation: $3,600 for the CNN (3 days = 24 hours)
- PyTorch implementation: $600 for the CNN (4 hours)
- Savings: $3,000 per model
The GPU Cost Reality
Training neural networks requires GPUs. The economics are stark:
| GPU | Purchase Cost | Cloud Cost (AWS) | Memory | Speed |
|---|---|---|---|---|
| RTX 3090 | $1,500 | - | 24GB | 1x |
| A100 40GB | $15,000 | $3.06/hr | 40GB | 3x |
| A100 80GB | $25,000 | $4.10/hr | 80GB | 3.5x |
| H100 | $30,000+ | $5.50/hr | 80GB | 5x |
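The crossover point: at the table’s prices, a $1,500 RTX 3090 costs as much as roughly 490 hours of A100 40GB rental ($1,500 / $3.06 per hour), or about 270 hours of H100 rental. Beyond that much usage, buying beats renting, before accounting for electricity and the speed difference between the cards.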
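The buy-versus-rent comparison can be sketched as a simple division of purchase price by hourly rental rate. This is a deliberately simplified model: it uses only the prices from the table above and ignores electricity, resale value, and the speed differences between cards. The helper name `break_even_hours` is hypothetical.

```python
def break_even_hours(purchase_cost: float, hourly_rate: float) -> float:
    """Hours of rental whose total cost equals the purchase price."""
    return purchase_cost / hourly_rate


# Purchase price of the RTX 3090 and cloud rates, taken from the table above.
RTX_3090_COST = 1500.0
cloud_rates = {"A100 40GB": 3.06, "A100 80GB": 4.10, "H100": 5.50}

for gpu, rate in cloud_rates.items():
    hours = break_even_hours(RTX_3090_COST, rate)
    print(f"{gpu}: {hours:.0f} rental hours cost as much as one RTX 3090")
```

A fuller comparison would also weight each rental hour by the relative speed column, since an hour on a faster card does more work.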
Industry Adoption Metrics (2024)
| Framework | GitHub Stars | PyPI Downloads/Month | Job Postings |
|---|---|---|---|
| PyTorch | 85,000+ | 25M+ | 65% |
| TensorFlow | 180,000+ | 15M+ | 30% |
| JAX | 30,000+ | 3M+ | 5% |
The trend: PyTorch dominates research (75%+ of papers) and is rapidly gaining in production. TensorFlow is still strong in production deployments but declining.
ROI of Learning PyTorch
Career impact data (from industry surveys):
- Average salary premium for PyTorch skills: +$15,000/year
- Time to become productive: 2-4 weeks
- ROI: 375% in first year (a $15k premium divided by an investment valued at about $4,000 for 4 weeks of study)
Interview Preparation: PyTorch
Common Interview Questions
Q1: “What is automatic differentiation and how does PyTorch implement it?”
Strong Answer: “Automatic differentiation computes gradients by recording operations on tensors and building a computational graph. PyTorch uses reverse-mode autodiff—when you call .backward() on a loss, it traverses the graph backwards applying the chain rule at each node. This is more efficient than numerical differentiation (which requires many forward passes) and less error-prone than symbolic differentiation. In PyTorch, tensors with requires_grad=True track their operations. Each operation creates a grad_fn that knows how to compute its gradient. The graph is dynamic—rebuilt each forward pass—which enables Python control flow like if statements and loops.”
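A minimal sketch of the answer above, assuming a toy three-element tensor: the Python `if` decides which operations are recorded, the result carries a `grad_fn`, and `.backward()` fills in `x.grad` by walking the recorded graph in reverse.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Ordinary Python control flow decides which operations get recorded.
if x.sum() > 0:
    y = (x ** 2).sum()
else:
    y = x.sum()

# Each non-leaf result carries a grad_fn describing how to backpropagate through it.
has_grad_fn = y.grad_fn is not None

y.backward()           # reverse-mode pass: d(sum(x^2))/dx = 2x
grad = x.grad.clone()
print(grad)            # tensor([2., 4., 6.])
```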
Q2: “Explain the difference between .detach(), .data, and torch.no_grad().”
Strong Answer: “.detach() creates a new tensor that shares storage but doesn’t track gradients—it’s a safe way to stop gradient flow. torch.no_grad() is a context manager that temporarily disables gradient computation for all operations—used during inference for speed and memory savings. .data is legacy and dangerous—it accesses the underlying tensor but can cause silent gradient errors. Modern code should use .detach() for new tensors and torch.no_grad() for inference blocks. In evaluation, always use model.eval() with torch.no_grad().”
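The difference between the two recommended tools can be illustrated with a small sketch (the tensor values are arbitrary): `.detach()` cuts one tensor out of the graph, while `torch.no_grad()` turns tracking off for everything computed inside the block.

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)

# .detach(): the result shares storage with its source but is cut out of the graph.
frozen = (w * 2).detach()
detached_tracks = frozen.requires_grad      # False

# Gradient flows only through the non-detached path, so frozen acts as a constant.
loss = (w * frozen).sum()
loss.backward()
grad = w.grad.clone()                       # equals frozen: [4., 6.]

# torch.no_grad(): nothing computed inside the block is tracked.
with torch.no_grad():
    out = w * 2
tracked_inside = out.requires_grad          # False
```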
Q3: “Why do we call optimizer.zero_grad() before backward()?”
Strong Answer: “PyTorch accumulates gradients by default—calling backward() adds to existing .grad values rather than replacing them. This is useful for gradient accumulation when you want to simulate larger batches than fit in memory. But usually, you want fresh gradients each step, so you zero them first. The typical training loop is: zero_grad → forward → loss → backward → step. Forgetting zero_grad leads to exploding gradients and incorrect updates. Some teams use model.zero_grad() instead, but optimizer.zero_grad() is preferred when using multiple optimizers or gradient accumulation.”
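The accumulation behavior is easy to see on a single scalar weight. In this sketch (the toy loss `3 * w` is an assumption chosen so the gradient is always 3), two backward passes without zeroing add up, while `optimizer.zero_grad()` gives a fresh gradient.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without zeroing: the second gradient is added to the first.
(3 * w).sum().backward()
(3 * w).sum().backward()
accumulated = w.grad.clone()   # 3 + 3

# Zeroing between steps gives a fresh gradient.
optimizer.zero_grad()
(3 * w).sum().backward()
fresh = w.grad.clone()         # 3
```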
Q4: “How would you debug a neural network that’s not converging?”
Strong Answer: “Systematic debugging approach: First, check the data—visualize inputs, verify labels are correct, ensure proper normalization. Second, check the loss—is it NaN or constant? NaN means gradient explosion (reduce learning rate, add gradient clipping). Constant means gradients aren’t flowing (check activation functions, initialization). Third, overfit on one batch—if you can’t memorize a single batch, the model architecture or training code is broken. Fourth, check gradient flow—print gradient norms per layer. Vanishing gradients suggest ReLU dying or bad initialization. Fifth, try a known-good hyperparameter set before experimenting. The debugging motto: start simple, verify each component, add complexity gradually.”
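Two of the checks above, overfitting a single batch and printing per-layer gradient norms, might look like the following sketch. The model size, learning rate, and random stand-in data are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch of random stand-in data: a healthy setup should memorize it.
inputs = torch.randn(16, 20)
labels = torch.randint(0, 3, (16,))

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Per-layer gradient norms from the last step: zeros or NaNs point at a broken layer.
grad_norms = {name: p.grad.norm().item() for name, p in model.named_parameters()}
for name, norm in grad_norms.items():
    print(f"{name}: {norm:.4e}")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

If the loss on a single batch refuses to fall, the problem is in the model or training code rather than in the data volume or hyperparameter search.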
Q5: “What’s the difference between nn.Module attributes and regular Python attributes?”
Strong Answer: “PyTorch’s nn.Module performs automatic registration. If you assign an nn.Module as an attribute (self.layer = nn.Linear(...)), it’s registered as a submodule: it appears in .parameters(), moves with .to(device), and saves with state_dict(). Regular Python attributes don’t get this treatment. There’s also nn.Parameter for custom trainable tensors and register_buffer() for non-trainable state (like batch norm running averages). A common bug: storing layers in a Python list instead of nn.ModuleList, so the layers aren’t registered and won’t train. Always use nn.ModuleList or nn.ModuleDict for dynamic layer collections.”
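The registration rules can be checked directly. In this sketch the `Scaler` class and its attribute names are hypothetical; it shows which attributes end up in `named_parameters()` and `state_dict()` and which are ignored.

```python
import torch
import torch.nn as nn


class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)                          # registered submodule
        self.scale = nn.Parameter(torch.ones(4))               # registered trainable tensor
        self.register_buffer("running_mean", torch.zeros(4))   # saved, but not trained
        self.note = torch.zeros(4)                             # plain attribute: ignored

    def forward(self, x):
        return self.linear(x - self.running_mean) * self.scale


model = Scaler()
param_names = sorted(name for name, _ in model.named_parameters())
state_keys = sorted(model.state_dict().keys())
print(param_names)  # ['linear.bias', 'linear.weight', 'scale']
print(state_keys)   # ['linear.bias', 'linear.weight', 'running_mean', 'scale']
```

The plain tensor `note` appears in neither list, so it would not move with `.to(device)` or survive a save and load.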
System Design Question
Q: “Design a PyTorch training pipeline for a large dataset that doesn’t fit in memory.”
Strong Answer Structure:
- DataLoader with num_workers: “Use multiple worker processes to load and preprocess data in parallel. A typical starting point is num_workers=4 to 8. Enable pin_memory=True for faster host-to-GPU transfer.”
- Memory-mapped datasets: “For huge files, use memory-mapped arrays (np.memmap) or streaming formats (WebDataset, TFDS). Load samples lazily on access.”
- Gradient accumulation: “For effective batch sizes larger than GPU memory allows, accumulate gradients over N steps before calling optimizer.step().”
- Mixed precision training: “Use torch.autocast(device_type='cuda') (formerly torch.cuda.amp.autocast()) to run operations in fp16 or bf16 where safe. This can roughly halve activation memory and often raises throughput substantially on modern GPUs.”
- Checkpointing: “Save model and optimizer state regularly so a crash costs minutes, not days. Separately, if activations don’t fit in memory, use torch.utils.checkpoint to trade compute for memory by recomputing activations during backward.”
- Distributed training: “For multiple GPUs, use DistributedDataParallel (DDP), not DataParallel. DDP is faster and scales better.”
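Two of the pieces above, lazy loading from a memory-mapped file and gradient accumulation, can be combined in a small CPU-only sketch. The `MemmapDataset` class, the file layout (float32 rows with the label in the last column), and all sizes are assumptions for illustration, not a prescribed format.

```python
import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset


class MemmapDataset(Dataset):
    """Reads rows lazily from a memory-mapped file instead of loading it all."""

    def __init__(self, path, num_rows, num_features):
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_rows, num_features + 1))

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        row = np.array(self.data[idx])  # copy a single row into RAM
        return torch.from_numpy(row[:-1]), torch.tensor(int(row[-1]))


# Write a small stand-in file: random features, last column holds the label.
path = os.path.join(tempfile.mkdtemp(), "train.bin")
rng = np.random.default_rng(0)
raw = rng.standard_normal((256, 9)).astype(np.float32)
raw[:, -1] = rng.integers(0, 2, size=256)
raw.tofile(path)

loader = DataLoader(MemmapDataset(path, 256, 8), batch_size=16, shuffle=True)
model = nn.Linear(8, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4  # effective batch size = 16 * 4 = 64
optimizer_steps = 0
optimizer.zero_grad()
for i, (features, labels) in enumerate(loader, start=1):
    # Divide so the accumulated gradient is an average over the micro-batches.
    loss = criterion(model(features), labels) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
    if i % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```

On a real dataset you would also pass `num_workers` and `pin_memory=True` to the `DataLoader`; they are left at their defaults here to keep the sketch portable.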
Did You Know? PyTorch in Production
The Production Journey
PyTorch started as a research framework but has matured for production:
Timeline of Production Features:
- 2018: TorchScript for model export (PyTorch 1.0)
- 2019: Mobile support (iOS, Android), initially experimental
- 2020: TorchServe for serving models
- 2022: torch.compile() announced (shipped in PyTorch 2.0, March 2023)
- 2023: ExecuTorch for edge devices
Who Uses PyTorch in Production?
- Tesla: Self-driving neural networks
- Meta: Instagram recommendations, content moderation
- Microsoft: Bing search ranking, Azure AI services
- OpenAI: GPT models (pre-training and fine-tuning)
- Stability AI: Stable Diffusion
The ONNX Escape Hatch
Models trained in PyTorch can run anywhere via ONNX (Open Neural Network Exchange):
```python
import onnxruntime as ort
import torch

# Export to ONNX, naming the input so it can be fed by name later
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"])

# Run in ONNX Runtime (optimized for production)
session = ort.InferenceSession("model.onnx")
numpy_input = dummy_input.numpy()  # ONNX Runtime takes NumPy arrays
output = session.run(None, {"input": numpy_input})
```
ONNX models can run on:
- C++ applications (no Python dependency)
- Mobile devices (iOS, Android)
- Web browsers (ONNX Runtime Web, the successor to ONNX.js)
- Hardware accelerators (custom chips)
Did You Know? The Million-Dollar Gradient Explosion
San Francisco. November 2021. 3:47 AM.
The Slack message woke up the entire ML team at a fintech startup. Their PyTorch model—which had been running perfectly for six months—was suddenly producing garbage predictions. Customer trades were being rejected. Losses were mounting.
The senior engineer’s first thought was a data pipeline bug. But the data looked fine. The model architecture hadn’t changed. The weights… wait. The weights were all NaN.
After four frantic hours, they found it: someone had “optimized” the training script by removing optimizer.zero_grad(). In production, they were running periodic retraining, and without zeroing gradients, they accumulated over 10,000 backward passes. The gradients exploded to infinity, then became NaN, and those NaNs propagated to the entire model.
The fix took one line. The outage cost $1.2M in lost trades and customer compensation.
“The most expensive bugs are the ones in code that seems too simple to be wrong.” — Their post-mortem document
The lesson: PyTorch’s gradient accumulation is a feature, not a bug. But forgetting that feature in production can be catastrophic. Always include optimizer.zero_grad() in your training loops, and add assertions that catch NaN values before they propagate.
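One way to add such a guard is sketched below. The helper `assert_finite` is hypothetical, not a PyTorch API; it raises before a non-finite loss or gradient can reach the weights, and the loop pairs it with gradient clipping.

```python
import torch
import torch.nn as nn


def assert_finite(model: nn.Module, loss: torch.Tensor) -> None:
    """Raise before a non-finite loss or gradient can reach the weights."""
    if not torch.isfinite(loss):
        raise FloatingPointError(f"non-finite loss: {loss.item()}")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise FloatingPointError(f"non-finite gradient in {name}")


model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()                        # fresh gradients every step
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
assert_finite(model, loss)                   # check before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# A poisoned gradient is caught instead of being applied.
model.weight.grad.fill_(float("nan"))
try:
    assert_finite(model, loss)
    caught = False
except FloatingPointError:
    caught = True
```

Raising an exception stops the run at the first bad step, which is usually cheaper than discovering NaN weights after they have been saved and deployed.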
Common Mistakes and How to Avoid Them
Mistake #1: Forgetting to Call model.eval()
```python
# WRONG - dropout and batchnorm are still in training mode!
model.load_state_dict(torch.load('model.pth'))
predictions = model(test_data)  # Results will be wrong!

# CORRECT - always switch to eval mode for inference
model.load_state_dict(torch.load('model.pth'))
model.eval()  # Critical!
with torch.no_grad():
    predictions = model(test_data)
```
Why it matters: Dropout randomly zeroes a fraction p of activations during training (p=0.5 by default). If you forget .eval(), you’re making predictions with part of your model randomly disabled. BatchNorm also behaves differently: it normalizes with batch statistics in training mode and with its stored running statistics in eval mode.
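The mode switch is easy to observe on a bare dropout layer. This sketch assumes the default `p=0.5` and an input of ones so the effect is visible in the raw values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # about half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)    # dropout is the identity in eval mode

zero_fraction = (train_out == 0).float().mean().item()
print(f"zeroed in train mode: {zero_fraction:.2f}")
print(f"eval output unchanged: {torch.equal(eval_out, x)}")
```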
Mistake #2: In-Place Operations Breaking Autograd
```python
# WRONG - in-place operations can break gradient computation
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.relu_()  # In-place operation on a leaf tensor (notice the underscore)
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
z = y.sum()    # Never reached
z.backward()

# CORRECT - use out-of-place operations
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.relu()  # Out-of-place (returns new tensor)
z = y.sum()
z.backward()  # Works!
print(x.grad)  # tensor([1., 1.])
```
The rule: Operations ending with _ modify tensors in place. Autograd rejects them on leaf tensors that require gradients, and on intermediate tensors they can overwrite values saved for the backward pass. Avoid them on tensors that need gradients.
Mistake #3: Wrong Loss Function for Task
```python
# WRONG - MSELoss for classification
criterion = nn.MSELoss()
loss = criterion(outputs, labels.float())  # Treats class indices as regression targets!

# CORRECT - CrossEntropyLoss for classification
criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels)  # Takes raw logits; applies log-softmax internally
```
Mistake #4: Sending Model and Data to Different Devices
```python
# WRONG - model on GPU, data on CPU
model = Model().cuda()
data = torch.randn(32, 784)  # On CPU by default!
output = model(data)  # RuntimeError: Expected all tensors to be on the same device

# CORRECT - ensure everything is on the same device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model().to(device)
data = torch.randn(32, 784).to(device)
output = model(data)  # Works!
```
Mistake #5: Using Python Lists Instead of ModuleList
```python
# WRONG - layers won't be registered as parameters!
class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(5)]  # Python list

# Check registered parameters:
model = BadModel()
print(list(model.parameters()))  # Empty! Layers aren't registered!

# CORRECT - use nn.ModuleList
class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(5)])

model = GoodModel()
print(len(list(model.parameters())))  # 10 (5 weights + 5 biases)
```
The Orchestra Conductor Analogy
Think of PyTorch like an orchestra conductor:
Without PyTorch (Manual Backprop): You’re not just conducting—you’re simultaneously playing every instrument. You have to track every note (forward pass), compute how each instrument should adjust (gradients), and remember the exact moment each note was played (caches). Exhausting and error-prone.
With PyTorch: You’re a conductor with magical sheet music. You just wave your baton (call loss.backward()), and every musician instantly knows exactly how to adjust. The sheet music (computational graph) records everything automatically. You focus on the music (model architecture), not the mechanics.
With torch.compile(): Now you have an AI assistant analyzing your conducting patterns and pre-positioning the musicians for optimal performance. Same music, 30-200% faster.
This is why PyTorch transformed deep learning research: researchers could finally focus on the science instead of the plumbing.
Hands-On Exercises
Exercise 1: Gradient Exploration
Build intuition for autograd by experimenting with different computational graphs:
```python
import torch

# Create tensors and compute gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Try different operations and predict gradients before running:
# 1. y = x.sum() - what's x.grad?
# 2. y = (x ** 2).sum() - what's x.grad?
# 3. y = x.mean() - what's x.grad?
# 4. y = x.max() - what's x.grad? (hint: sparse!)

# Verify your predictions with backward() and print x.grad
# (reset x.grad to None between experiments, since gradients accumulate)
```
Challenge: Implement a custom function using torch.autograd.Function that defines both the forward and backward passes.
Exercise 2: Build MNIST Classifier
Train a complete neural network on MNIST:
```python
from torchvision import datasets, transforms

# Requirements:
# - 2-3 hidden layers
# - Dropout for regularization
# - Adam optimizer
# - CrossEntropyLoss
# - Training and validation loop
# - Achieve >98% accuracy

# Starter code:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
# ... complete the implementation
```
Success Criteria: >98% test accuracy in under 10 epochs.
Exercise 3: GPU Benchmarking
Compare CPU vs GPU performance:
```python
import time

import torch

def benchmark(device, size=4096, iterations=100):
    x = torch.randn(size, size, device=device)
    y = torch.randn(size, size, device=device)

    # Warmup
    for _ in range(10):
        z = x @ y

    # CUDA kernels run asynchronously; wait for them before starting the clock
    if device.type == 'cuda':
        torch.cuda.synchronize()

    start = time.time()
    for _ in range(iterations):
        z = x @ y

    # Wait again so the timed kernels have actually finished
    if device.type == 'cuda':
        torch.cuda.synchronize()

    elapsed = time.time() - start
    return elapsed / iterations

# Compare and create a plot of speedup vs matrix size
```
Expected Result: roughly 10-50x speedup for large matrices on GPU, depending on hardware.
Exercise 4: Debugging Challenge
Fix the bugs in this broken training loop:
```python
# This code has 5 bugs. Find and fix them all!
class BuggyModel(nn.Module):
    def __init__(self):
        # Bug 1: Missing something here
        self.layers = [nn.Linear(784, 128), nn.Linear(128, 10)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = BuggyModel()
criterion = nn.MSELoss()  # Bug 2: Wrong loss for classification
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for data, labels in train_loader:
        # Bug 3: Missing something before forward pass
        outputs = model(data)
        loss = criterion(outputs, labels)  # Bug 4: labels need processing
        loss.backward()
        optimizer.step()

    # Bug 5: Evaluation without proper mode switching
    accuracy = (model(test_data).argmax(1) == test_labels).float().mean()
```
Deliverable: Fixed code that trains to >95% accuracy.
Summary
Key Takeaways
- PyTorch won the framework wars by prioritizing developer experience. Dynamic graphs and Pythonic design made research iteration dramatically faster than static-graph alternatives.
- Tensors are the universal container for all data in deep learning. Images, text, audio: everything becomes tensors of numbers.
- Autograd is magic that you understand. Having built backprop manually, you know what happens when you call loss.backward().
- nn.Module is the foundation for all PyTorch models. Always call super().__init__(), and use ModuleList/ModuleDict for dynamic layers.
- The training loop is always the same: zero_grad → forward → loss → backward → step. This works for any model, any scale.
- GPU computing is trivially easy: .to(device) moves anything. But watch for memory leaks: use .item() when logging scalars and no_grad() for inference.
- DataLoader handles the plumbing: batching, shuffling, parallel loading. Set num_workers and pin_memory for maximum throughput.
- Save checkpoints religiously. Training failures happen. Don’t lose hours of GPU time to a crash.
- torch.compile() is the future. One line for a typical 30-200% speedup, usually with no other code changes.
- PyTorch doesn’t replace understanding; it amplifies it. You know what happens inside loss.backward(). That knowledge makes you dangerous.
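Several of these takeaways fit into one short loop. The sketch below assumes synthetic data (two classes split by the sign of the first feature) and a hypothetical checkpoint file in a temporary directory; the model size and hyperparameters are arbitrary.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in data: the label is 1 when the first feature is positive.
X = torch.randn(512, 10)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

for epoch in range(10):
    model.train()
    running_loss = 0.0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()               # zero_grad
        outputs = model(inputs)             # forward
        loss = criterion(outputs, labels)   # loss
        loss.backward()                     # backward
        optimizer.step()                    # step
        running_loss += loss.item()         # .item() keeps a float, not the graph
    # Save model and optimizer state every epoch so a crash loses little work.
    torch.save({"epoch": epoch, "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, ckpt_path)

model.eval()
with torch.no_grad():
    accuracy = (model(X.to(device)).argmax(1) == y.to(device)).float().mean().item()
print(f"train accuracy: {accuracy:.2f}")
```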
The Key Insight
Having built neural networks from scratch in Module 26, you now deeply appreciate what PyTorch gives you:
```python
# Module 26: Dozens of lines of careful gradient computation
# PyTorch:
loss.backward()
```
PyTorch doesn’t replace understanding; it amplifies it. You know what happens inside that one line. You can debug it when things go wrong. You can extend it when needed.
Next Steps
In Module 28, you’ll learn Training Deep Networks:
- Why deep networks are hard to train
- Batch normalization
- Dropout and regularization
- Weight initialization strategies
- Learning rate scheduling
- Debugging training failures
The foundation is set. Now let’s learn the art of making networks actually converge!
Did You Know? The JAX Challenger
While PyTorch dominates, there’s a rising challenger: JAX, developed at Google.
JAX started as “NumPy on steroids” but has become a serious deep learning framework. Its philosophy is different: instead of dynamic graphs (PyTorch) or static graphs (TensorFlow), JAX uses functional transformations.
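```python
# JAX: Transform functions, not tensors
import jax
import jax.numpy as jnp

def predict(params, x):
    # Illustrative linear model (an assumption; the original leaves predict undefined)
    w, b = params
    return x @ w + b

def loss_fn(params, x, y):
    predictions = predict(params, x)
    return jnp.mean((predictions - y) ** 2)

# Get gradient function by transforming loss_fn
grad_fn = jax.grad(loss_fn)
params = (jnp.zeros(3), 0.0)          # example values for illustration
x, y = jnp.ones((4, 3)), jnp.ones(4)
gradients = grad_fn(params, x, y)     # same (w, b) structure as params
```
Who uses JAX?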
- Google DeepMind: AlphaFold, Gemini
- OpenAI: Some internal experiments
- Research teams: When they need maximum performance on TPUs
The PyTorch vs JAX trade-off:
| Aspect | PyTorch | JAX |
|---|---|---|
| Debugging | Python debugger works | Harder (functional transforms) |
| Ecosystem | Massive (HuggingFace, etc.) | Growing |
| TPU support | Exists but limited | Excellent (Google’s TPUs) |
| GPU support | Excellent | Good |
| Learning curve | Moderate | Steep |
| Production tooling | TorchServe, ONNX | Less mature |
Bottom line: PyTorch remains the default choice for most practitioners. JAX is worth exploring if you need extreme performance, work with TPUs, or do cutting-edge research in areas like neural ODEs or differentiable physics. For most production applications, the PyTorch ecosystem’s maturity, especially HuggingFace integration and extensive tooling support, makes it the safer choice, so start with PyTorch unless you have a specific reason to choose JAX.
Community and Resources
Essential Learning Resources
Books:
- Deep Learning with PyTorch (Eli Stevens, Luca Antiga, Thomas Viehmann) - Official PyTorch book, free online
- Programming PyTorch for Deep Learning (Ian Pointer) - O’Reilly practical guide
- PyTorch Pocket Reference (Joe Papa) - Quick reference for common patterns
Video Courses:
- Andrej Karpathy’s “Neural Networks: Zero to Hero” - Free YouTube series, builds intuition
- fast.ai - Practical deep learning, uses PyTorch, emphasizes getting things working
- NYU Deep Learning (Yann LeCun) - Graduate-level theory, available on YouTube
Interactive:
- PyTorch Lightning - Framework that reduces boilerplate
- Weights & Biases - Experiment tracking, integrates seamlessly
- Hugging Face Transformers - Pre-trained models, all PyTorch-native
Getting Help
Forums and Communities:
- PyTorch Forums (discuss.pytorch.org) - Official, active, helpful
- r/pytorch - Reddit community
- Stack Overflow [pytorch] - 50,000+ questions answered
- PyTorch Discord - Real-time help
When Debugging:
- Check PyTorch version compatibility
- Search the error message verbatim
- Minimal reproducible example helps others help you
- The forums are friendlier than Stack Overflow for beginners
Contributing to PyTorch
PyTorch is open source with over 3,000 contributors. If you find a bug or want to add a feature:
- File an issue on GitHub first
- Small PRs are more likely to be merged
- Documentation improvements are always welcome
- The contributing guide is thorough
Further Reading
- Official PyTorch Tutorials: https://pytorch.org/tutorials/
- Deep Learning with PyTorch (free book): https://pytorch.org/deep-learning-with-pytorch
- Andrej Karpathy’s micrograd: https://github.com/karpathy/micrograd - A tiny autograd engine for educational purposes
- PyTorch Internals: http://blog.ezyang.com/2019/05/pytorch-internals/ - How the magic works
- The Annotated Transformer: http://nlp.seas.harvard.edu/annotated-transformer - Transformer implementation in PyTorch with explanations
Last updated: 2025-12-11 Status: Complete