Перейти до вмісту

LoRA & Parameter-Efficient Fine-tuning

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Reading Time: 7-8 hours Prerequisites: Module 32


When Jason Allen discovered in August 2022 that his AI-generated artwork had won first place at the Colorado State Fair, beating human artists who had trained for decades, he realized he had sparked a revolution. Artists were furious. Twitter erupted. But while the art world debated ethics, engineers noticed something else: the technology behind Midjourney—diffusion models—was fundamentally different from anything that came before, and it was about to change everything from advertising to drug discovery.


By the end of this module, you will:

  • Understand how diffusion models generate images from noise
  • Master the forward and reverse diffusion processes
  • Learn the U-Net architecture for denoising
  • Understand text conditioning with CLIP
  • Implement classifier-free guidance
  • Know how Stable Diffusion works end-to-end
  • Apply LoRA to customize image generation

London. August 30, 2022. 2:30 PM.

Jason Allen was nervous. He had just won first place in the digital art category at the Colorado State Fair—beating human artists who had spent months on their entries. His piece, “Théâtre D’opéra Spatial,” depicted an elaborate operatic scene with ethereal lighting and impossible architecture.

The problem? Jason had created it with Midjourney, an AI image generator, in about 80 hours of prompt refinement.

When the news broke, artists were furious. “This is the death of artistry,” one competitor declared. “We’re watching the decay of legitimate artistic work.” Twitter erupted. News outlets covered it for weeks. A debate about creativity, authenticity, and the future of art consumed the internet.

What most people didn’t know: Midjourney was powered by diffusion models—the same technology driving Stable Diffusion, DALL-E 2, and a revolution in how images are created. And this was just the beginning.

“I’m not going to apologize for it. I won. I didn’t break any rules.” — Jason Allen, 2022

Within two years, diffusion models would be generating billions of images daily, disrupting stock photography, transforming advertising, and forcing every creative industry to reckon with AI-generated content.

This module teaches you how diffusion models work—from pure noise to photorealistic images, one denoising step at a time.


Imagine you’re watching a time-lapse of a photograph slowly dissolving into static noise on an old TV. Frame by frame, the image becomes less recognizable until it’s pure random fuzz.

Now imagine playing that video in reverse — starting from static and watching a photograph emerge from nothing.

That’s diffusion. We train a neural network to reverse the corruption process, to look at noisy images and predict what they looked like before the noise was added. Do this enough times, starting from pure noise, and you can generate entirely new images.

It’s like teaching someone to restore damaged photographs — but so well that they can “restore” photographs that never existed.

Before diffusion models, we had:

  • GANs (2014): Two networks fighting — often unstable, mode collapse
  • VAEs (2013): Encode-decode with latent space — often blurry
  • Autoregressive (2016): Generate pixel by pixel — slow, loses global coherence

Diffusion models combined the best properties:

  • Stable training (no adversarial games)
  • High quality (sharp, coherent images)
  • Flexible (easy to condition on text, class, etc.)
  • Theoretically grounded (solid probabilistic foundation)

Did You Know? Diffusion models were largely ignored for years after being introduced. The original paper by Sohl-Dickstein et al. (2015) “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” drew from statistical physics. It took until 2020 when Jonathan Ho’s DDPM paper showed they could match GANs in image quality, and 2022 when Stable Diffusion went viral, for the world to pay attention.


The Forward Process: Destroying Images Scientifically

Section titled “The Forward Process: Destroying Images Scientifically”

The forward process is simple: gradually add Gaussian noise to an image until it becomes pure noise.

At each timestep t, we add a small amount of noise:

x_t = √(1 - β_t) · x_{t-1} + √(β_t) · ε
Where:
- x_t is the noisy image at timestep t
- x_{t-1} is the image at the previous timestep
- β_t is the noise schedule (small value, e.g., 0.0001 to 0.02)
- ε ~ N(0, I) is random Gaussian noise

Worked Example:

Let’s trace a single pixel value through 4 timesteps:

Original pixel value: x_0 = 0.8
Noise schedule: β = [0.1, 0.2, 0.3, 0.4]
Step 1: β_1 = 0.1
x_1 = √0.9 · 0.8 + √0.1 · (-0.5) [random noise = -0.5]
x_1 = 0.949 · 0.8 + 0.316 · (-0.5)
x_1 = 0.759 - 0.158 = 0.601
Step 2: β_2 = 0.2
x_2 = √0.8 · 0.601 + √0.2 · (0.3) [random noise = 0.3]
x_2 = 0.894 · 0.601 + 0.447 · 0.3
x_2 = 0.537 + 0.134 = 0.671
Step 3: β_3 = 0.3
x_3 = √0.7 · 0.671 + √0.3 · (-0.8) [random noise = -0.8]
x_3 = 0.837 · 0.671 + 0.548 · (-0.8)
x_3 = 0.561 - 0.438 = 0.123
Step 4: β_4 = 0.4
x_4 = √0.6 · 0.123 + √0.4 · (0.9) [random noise = 0.9]
x_4 = 0.775 · 0.123 + 0.632 · 0.9
x_4 = 0.095 + 0.569 = 0.664

Notice how the pixel value drifts randomly as noise accumulates. After enough steps (~1000), the original value is completely lost.

We can skip directly to any timestep using cumulative products:

α_t = 1 - β_t
ᾱ_t = α_1 · α_2 · ... · α_t (cumulative product)
x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε

This lets us sample any noisy version directly without iterating through all steps!

def forward_diffusion(x_0, t, noise_schedule):
"""Add noise to image at timestep t."""
alpha_bar = torch.cumprod(1 - noise_schedule, dim=0)
alpha_bar_t = alpha_bar[t]
noise = torch.randn_like(x_0)
# Direct formula: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise
return x_t, noise

Notice how we return both the noisy image AND the noise we added — the model will learn to predict this noise.


The reverse process is where the magic happens. We train a neural network to predict the noise that was added, then subtract it.

The loss is surprisingly simple:

L = E[||ε - ε_θ(x_t, t)||²]
Where:
- ε is the actual noise we added
- ε_θ(x_t, t) is the model's prediction of that noise
- x_t is the noisy image
- t is the timestep (tells model how noisy the image is)

It’s just MSE between the true noise and predicted noise!

Think of it like this: We show the model a corrupted image and ask “What noise was added?” The model learns to recognize the noise pattern and predict it. Once we know the noise, we can subtract it to get a cleaner image.

def train_step(model, x_0, noise_schedule):
"""Single training step for diffusion model."""
batch_size = x_0.shape[0]
# 1. Sample random timesteps
t = torch.randint(0, len(noise_schedule), (batch_size,))
# 2. Add noise (forward process)
x_t, noise = forward_diffusion(x_0, t, noise_schedule)
# 3. Predict the noise
noise_pred = model(x_t, t)
# 4. Compute loss (simple MSE!)
loss = F.mse_loss(noise_pred, noise)
return loss

Notice how each training step samples a random timestep — the model learns to denoise at ALL noise levels, not just one.

Did You Know? The idea of predicting noise instead of the clean image was a key insight from Ho et al.’s DDPM paper. Earlier work tried to predict the clean image directly, which is much harder. Predicting noise is easier because the noise has a known distribution (Gaussian), while images have complex, varied distributions.


The workhorse of diffusion models is the U-Net — a convolutional architecture shaped like the letter U.

Denoising requires understanding both:

  • Global structure: Is this a face? A landscape? Where should the eyes be?
  • Local details: Exact pixel values, textures, edges

U-Net achieves this through:

  1. Encoder (downsampling): Captures global context
  2. Decoder (upsampling): Reconstructs details
  3. Skip connections: Preserve fine-grained information
Input (noisy image)
┌─────────┐
│ Conv │─────────────────────────────┐ (skip connection)
│ 64→128 │ │
└────┬────┘ │
│ downsample │
▼ │
┌─────────┐ │
│ Conv │──────────────────┐ │
│128→256 │ │ │
└────┬────┘ │ │
│ downsample │ │
▼ │ │
┌─────────┐ │ │
│ Bottleneck │ │
│256→256 │ │ │
└────┬────┘ │ │
│ upsample │ │
▼ ▼ │
┌─────────┐ ┌─────────┐ │
│ Conv │◄───concat──│ skip │ │
│256→128 │ └─────────┘ │
└────┬────┘ │
│ upsample │
▼ ▼
┌─────────┐ ┌─────────┐
│ Conv │◄──────────concat──────│ skip │
│128→64 │ └─────────┘
└────┬────┘
Output (predicted noise)

The model needs to know the timestep (noise level). We encode this as a sinusoidal embedding (like positional encoding in transformers):

def timestep_embedding(t, dim):
"""Create sinusoidal timestep embedding."""
half_dim = dim // 2
emb = math.log(10000) / (half_dim - 1)
emb = torch.exp(torch.arange(half_dim) * -emb)
emb = t[:, None] * emb[None, :]
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
return emb

This embedding is added to each layer of the U-Net, telling it how noisy the input is.

Modern U-Nets include self-attention layers (especially at lower resolutions) to capture long-range dependencies:

class AttentionBlock(nn.Module):
"""Self-attention for spatial features."""
def __init__(self, channels):
super().__init__()
self.norm = nn.GroupNorm(8, channels)
self.qkv = nn.Conv1d(channels, channels * 3, 1)
self.proj = nn.Conv1d(channels, channels, 1)
def forward(self, x):
b, c, h, w = x.shape
x_flat = x.view(b, c, h * w)
qkv = self.qkv(self.norm(x_flat))
q, k, v = qkv.chunk(3, dim=1)
# Scaled dot-product attention
attn = torch.softmax(q.transpose(-1, -2) @ k / math.sqrt(c), dim=-1)
out = (v @ attn.transpose(-1, -2)).view(b, c, h, w)
return x + self.proj(out.view(b, c, -1)).view(b, c, h, w)

Notice how attention lets distant pixels communicate — crucial for maintaining global coherence in generated images.

Did You Know? The U-Net architecture was originally invented by Olaf Ronneberger in 2015 for biomedical image segmentation (detecting cell boundaries in microscopy images). It became the standard for diffusion models because its skip connections perfectly preserve the fine details needed for high-quality image generation.


DDPM (Denoising Diffusion Probabilistic Models)

Section titled “DDPM (Denoising Diffusion Probabilistic Models)”

The original formulation requires many steps (~1000) because each step only removes a tiny bit of noise.

Sampling:

def ddpm_sample(model, shape, noise_schedule, num_steps=1000):
"""Sample using DDPM (slow but high quality)."""
x = torch.randn(shape) # Start from pure noise
for t in reversed(range(num_steps)):
# Predict noise
noise_pred = model(x, t)
# Compute coefficients
alpha = 1 - noise_schedule[t]
alpha_bar = torch.cumprod(1 - noise_schedule[:t+1], dim=0)[-1]
beta = noise_schedule[t]
# Denoise one step
mean = (1 / torch.sqrt(alpha)) * (
x - (beta / torch.sqrt(1 - alpha_bar)) * noise_pred
)
# Add noise (except at t=0)
if t > 0:
noise = torch.randn_like(x)
x = mean + torch.sqrt(beta) * noise
else:
x = mean
return x

Problem: 1000 forward passes through a huge U-Net = slow!

DDIM (Denoising Diffusion Implicit Models)

Section titled “DDIM (Denoising Diffusion Implicit Models)”

Song et al. (2020) discovered you can skip steps by making the process deterministic:

def ddim_sample(model, shape, noise_schedule, num_steps=50):
"""Sample using DDIM (fast, deterministic)."""
x = torch.randn(shape)
# Use only a subset of timesteps
timesteps = torch.linspace(999, 0, num_steps).long()
for i, t in enumerate(timesteps):
noise_pred = model(x, t)
alpha_bar_t = get_alpha_bar(t, noise_schedule)
if i < len(timesteps) - 1:
alpha_bar_prev = get_alpha_bar(timesteps[i+1], noise_schedule)
else:
alpha_bar_prev = 1.0
# DDIM update (no random noise!)
pred_x0 = (x - torch.sqrt(1 - alpha_bar_t) * noise_pred) / torch.sqrt(alpha_bar_t)
dir_xt = torch.sqrt(1 - alpha_bar_prev) * noise_pred
x = torch.sqrt(alpha_bar_prev) * pred_x0 + dir_xt
return x

DDIM advantages:

  • 20-50 steps instead of 1000 (20-50x faster!)
  • Deterministic (same noise → same image)
  • Allows interpolation in latent space

Notice how DDIM removes the random noise term — the process becomes deterministic, which is why the same starting noise always produces the same image.


How do we go from “a cat wearing a top hat” to an actual image?

CLIP (Contrastive Language-Image Pre-training) by OpenAI learns to align text and image representations:

"a photo of a cat" ──► Text Encoder ──► [0.2, -0.5, 0.8, ...]
│ should be similar!
[actual cat photo] ──► Image Encoder ──► [0.3, -0.4, 0.7, ...]

CLIP was trained on 400 million image-text pairs from the internet. It learns that “cat” and images of cats should have similar embeddings.

We inject text information into the U-Net using cross-attention:

class CrossAttention(nn.Module):
"""Attend to text embeddings."""
def __init__(self, query_dim, context_dim):
super().__init__()
self.to_q = nn.Linear(query_dim, query_dim)
self.to_k = nn.Linear(context_dim, query_dim)
self.to_v = nn.Linear(context_dim, query_dim)
self.to_out = nn.Linear(query_dim, query_dim)
def forward(self, x, context):
"""
x: image features [batch, seq, dim]
context: text embeddings [batch, text_len, context_dim]
"""
q = self.to_q(x)
k = self.to_k(context)
v = self.to_v(context)
# Attention: image queries attend to text keys/values
attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1]), dim=-1)
out = attn @ v
return self.to_out(out)

Notice how cross-attention lets every spatial location in the image “look at” the text tokens. When generating the cat’s location, those pixels attend strongly to the “cat” token.

Did You Know? CLIP was trained with a simple contrastive loss: given a batch of image-text pairs, maximize similarity between matching pairs and minimize similarity between non-matching pairs. This seemingly simple objective created representations so powerful they enabled zero-shot image classification, revolutionized image search, and became the backbone of text-to-image generation.


Classifier-Free Guidance: Steering Generation

Section titled “Classifier-Free Guidance: Steering Generation”

One of the most important techniques for high-quality generation is classifier-free guidance (CFG).

Text conditioning alone often produces images that vaguely match the prompt but lack detail or accuracy. We want stronger adherence to the prompt.

Train the model with conditional dropout — sometimes give it the text, sometimes don’t:

def train_with_cfg(model, x_0, text_embedding, noise_schedule, drop_prob=0.1):
"""Training with classifier-free guidance preparation."""
t = torch.randint(0, len(noise_schedule), (x_0.shape[0],))
x_t, noise = forward_diffusion(x_0, t, noise_schedule)
# Randomly drop text conditioning
if random.random() < drop_prob:
text_embedding = torch.zeros_like(text_embedding) # Unconditional
noise_pred = model(x_t, t, text_embedding)
loss = F.mse_loss(noise_pred, noise)
return loss

At inference, run the model twice and blend:

def cfg_sample(model, x_t, t, text_embedding, guidance_scale=7.5):
"""Sample with classifier-free guidance."""
# Unconditional prediction (no text)
noise_uncond = model(x_t, t, torch.zeros_like(text_embedding))
# Conditional prediction (with text)
noise_cond = model(x_t, t, text_embedding)
# Blend: move AWAY from unconditional, TOWARD conditional
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
return noise_pred

Guidance scale effects:

  • guidance_scale = 1.0: Pure conditional (often blurry)
  • guidance_scale = 7-8: Good balance (typical default)
  • guidance_scale > 15: Over-saturated, artifacts

Think of it like this: unconditional output is “generic image,” conditional is “image matching your prompt.” We extrapolate beyond conditional to get MORE of what makes it match the prompt.


Stable Diffusion combines everything into a complete text-to-image system:

┌─────────────────────────────────────────────────────────────┐
│ STABLE DIFFUSION │
├─────────────────────────────────────────────────────────────┤
│ │
│ "a cat wearing ┌──────────┐ │
│ a top hat" ───► │ CLIP │ ───► text embeddings │
│ │ Text │ [77, 768] │
│ │ Encoder │ │
│ └──────────┘ │ │
│ │ │
│ ▼ │
│ Random noise ┌─────────────────────────────────┐ │
│ [4, 64, 64] ───►│ U-Net │ │
│ │ (with cross-attention) │ │
│ │ │ │
│ timestep ──────►│ Predicts noise in latent │ │
│ │ space (not pixel space!) │ │
│ └─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ denoised latent │
│ [4, 64, 64] │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ VAE │ │
│ │ Decoder │ │
│ └──────────┘ │
│ │ │
│ ▼ │
│ Final Image │
│ [3, 512, 512] │
│ │
└─────────────────────────────────────────────────────────────┘

Instead of running diffusion on full 512×512×3 images (786K values), Stable Diffusion runs in a compressed latent space (64×64×4 = 16K values).

The VAE (Variational Autoencoder) compresses and decompresses:

  • Encoder: 512×512×3 → 64×64×4 (48× compression!)
  • Decoder: 64×64×4 → 512×512×3

This makes training and inference dramatically faster.

def stable_diffusion_inference(prompt, num_steps=50, guidance_scale=7.5):
"""Complete Stable Diffusion inference."""
# 1. Encode text
text_embeddings = clip_encoder(prompt)
# 2. Start from random latent noise
latents = torch.randn(1, 4, 64, 64)
# 3. Denoise in latent space
for t in tqdm(scheduler.timesteps):
# Expand latents for CFG (unconditional + conditional)
latent_input = torch.cat([latents] * 2)
# Predict noise
noise_pred = unet(latent_input, t, text_embeddings)
# Apply CFG
noise_uncond, noise_cond = noise_pred.chunk(2)
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
# Scheduler step (DDIM, etc.)
latents = scheduler.step(noise_pred, t, latents)
# 4. Decode latents to image
image = vae.decode(latents)
return image

Notice how the entire diffusion process happens in latent space — we only touch pixel space once at the very end.

Did You Know? Stable Diffusion was created by Stability AI in collaboration with researchers from LMU Munich and Runway. The key innovation of latent diffusion came from Robin Rombach’s PhD work. By open-sourcing the model weights, Stability AI sparked an explosion of creativity — thousands of fine-tuned models, LoRAs, and applications emerged within months.


Just like with LLMs, we can fine-tune Stable Diffusion with LoRA to create custom styles or characters.

Stable Diffusion’s U-Net has ~860M parameters. With LoRA, we typically target:

  • Cross-attention layers: Keys and values (text → image mapping)
  • Self-attention layers: Image coherence
  • Linear layers: General adaptation
from peft import LoraConfig, get_peft_model
# LoRA config for Stable Diffusion
lora_config = LoraConfig(
r=4, # Low rank works well for SD
lora_alpha=4,
target_modules=[
"to_k", "to_q", "to_v", # Cross-attention
"to_out.0", # Output projection
"proj_in", "proj_out", # Convolutions
],
lora_dropout=0.0,
)
# Apply to U-Net
unet = get_peft_model(unet, lora_config)

For LoRA fine-tuning, you typically need:

  • Style transfer: 10-50 images of the target style
  • Character/concept: 5-20 images of the subject
  • Captions: Descriptions of each image
AspectDreamboothLoRA
ParametersFull fine-tune0.1% of parameters
Data needed3-10 images5-50 images
Training time15-30 min10-20 min
Model sizeFull copy (~5GB)Adapter only (~10-100MB)
CombinabilityHardEasy (stack multiple)

LoRA’s killer feature: you can combine multiple LoRAs at inference time!

# Load and combine multiple LoRAs
base_model = load_stable_diffusion()
art_style_lora = load_lora("impressionist_style.safetensors")
character_lora = load_lora("my_character.safetensors")
# Apply both with different strengths
model = apply_lora(base_model, art_style_lora, strength=0.8)
model = apply_lora(model, character_lora, strength=0.6)
# Generate: character in impressionist style!
image = model("portrait of [character], impressionist painting")

Production War Stories: When Diffusion Models Go Wrong

Section titled “Production War Stories: When Diffusion Models Go Wrong”

The $2 Million Recall: Getty Images vs AI Art

Section titled “The $2 Million Recall: Getty Images vs AI Art”

San Francisco. January 2023. 9:30 AM.

A marketing director at a major consumer goods company received an urgent call from their legal team. Their Q1 campaign, featuring dozens of AI-generated product images, had just been flagged: several images contained subtle watermarks—remnants of Getty Images’ training data that had been memorized by the diffusion model.

The cost? $2.3 million in legal fees and settlements, plus another $800K to reshoot everything with traditional photography.

The Lesson: Stable Diffusion 1.x and many open models were trained on datasets containing copyrighted images. These models can—and do—regurgitate fragments of their training data, including watermarks, logos, and even recognizable faces.

The Fix:

# Always check for potential copyright issues
import clip
from PIL import Image
def check_image_similarity(generated_image, reference_images):
"""Compare generated image against known copyrighted references"""
# Use CLIP to check similarity
model, preprocess = clip.load("ViT-B/32")
gen_features = model.encode_image(preprocess(generated_image))
for ref in reference_images:
ref_features = model.encode_image(preprocess(ref))
similarity = (gen_features @ ref_features.T).item()
if similarity > 0.85: # High similarity threshold
return True, similarity
return False, 0

Post-Mortem: The company now uses models trained only on licensed data (Adobe Firefly, Shutterstock’s model) for commercial work and runs automated similarity checks against known copyrighted datasets.

The Support Ticket Avalanche: When 1000 Steps Met Production

Section titled “The Support Ticket Avalanche: When 1000 Steps Met Production”

New York. March 2023. 2:47 AM.

A startup’s image generation API had been running smoothly for weeks. Then a viral TikTok tutorial recommended their service, and traffic 100x’d overnight.

The problem? Their engineers had left num_inference_steps=1000 in production—the training default. Each image took 45 seconds to generate. With 1000 concurrent users, their GPU cluster melted.

By morning, they had 12,000 support tickets, a $47,000 cloud bill, and a crashed API.

The Lesson: Training defaults are NOT production defaults. Always benchmark and optimize inference settings before deployment.

The Fix:

# Production-optimized settings
PRODUCTION_SETTINGS = {
"num_inference_steps": 25, # Not 1000!
"scheduler": "DPMSolverMultistep", # Not DDPM!
"enable_attention_slicing": True,
"enable_vae_slicing": True,
"torch_dtype": torch.float16, # Not float32!
}
# Result: 45 seconds → 1.8 seconds per image
# Cost: $47K → $1.2K for same traffic

The NSFW Filter Failure: A Brand Crisis in 48 Hours

Section titled “The NSFW Filter Failure: A Brand Crisis in 48 Hours”

Los Angeles. July 2023.

A children’s educational app integrated AI image generation for creating “custom story illustrations.” Their safety filter? A simple NSFW classifier with 92% accuracy.

That 8% failure rate proved catastrophic. Within 48 hours, screenshots of inappropriate generated content were viral on social media. The app was pulled from both app stores. The company’s reputation—built over three years—was destroyed in two days.

The Lesson: For sensitive applications, a single classifier isn’t enough. You need defense in depth.

The Fix:

# Multi-layer safety system
def safe_generation_pipeline(prompt: str, user_id: str):
# Layer 1: Input prompt filtering
if contains_blocked_terms(prompt):
return None, "Blocked prompt"
# Layer 2: Prompt rewriting for safety
safe_prompt = llm_rewrite_prompt(prompt, "child-appropriate")
# Layer 3: Generate with safety model
image = generate_with_safety_model(safe_prompt) # SDXL-safe variant
# Layer 4: Post-generation NSFW check
nsfw_score = nsfw_classifier(image)
if nsfw_score > 0.05: # Very low threshold
return None, "Failed safety check"
# Layer 5: Human review queue for edge cases
if nsfw_score > 0.01:
queue_for_review(image, user_id)
return image, "Success"

Did You Know? The original Stable Diffusion release had no NSFW filter at all. Stability AI added one after public pressure, but the open-weights model means anyone can remove it. This is why platforms, not models, must enforce safety.


Causes:

  • Guidance scale too low
  • Too few denoising steps
  • Poor prompt engineering

Solutions:

  • Increase guidance scale (try 7-12)
  • Use at least 30-50 steps
  • Be specific in prompts

Causes:

  • Conflicting prompt elements
  • Weak words not emphasized
  • Model bias toward common concepts

Solutions:

  • Use parentheses for emphasis: (detailed hands:1.3)
  • Negative prompts: exclude unwanted elements
  • Reorder prompt (earlier = more important)

Causes:

  • Guidance scale too high
  • Incompatible model/LoRA combinations
  • Poor training data (for custom models)

Solutions:

  • Lower guidance scale
  • Check LoRA compatibility
  • Curate training data carefully

Causes:

  • No character consistency mechanism
  • Varied poses/angles in training data

Solutions:

  • Use reference images (IP-Adapter)
  • Train dedicated character LoRA
  • Use consistent seed for similar outputs

Causes:

  • Too many steps
  • Not using optimizations

Solutions:

  • Use DDIM or DPM++ schedulers (20-30 steps)
  • Enable xformers memory-efficient attention
  • Use FP16/BF16 precision
  • Consider LCM-LoRA for 4-8 step generation

Did You Know? The “hands problem” that plagued early diffusion models (generating extra fingers, distorted hands) happens because hands are underrepresented in training data compared to faces, and they have complex, variable geometry. Newer models like SDXL and SD 3.0 have improved significantly through better training data curation and architectural improvements.


2015: Diffusion Models (Sohl-Dickstein)
└── Theoretical foundation from thermodynamics
2020: DDPM (Ho et al.)
└── Practical implementation, matched GAN quality
└── 1000 steps, slow but stable
2020: DDIM (Song et al.)
└── Deterministic sampling
└── 50 steps, much faster
2021: Guided Diffusion (Dhariwal & Nichol)
└── Classifier guidance
└── Beat GANs on ImageNet
2021: GLIDE (OpenAI)
└── Text-to-image with CLIP
└── Classifier-free guidance
2022: DALL-E 2 (OpenAI)
└── Diffusion + CLIP prior
└── High-quality text-to-image
2022: Stable Diffusion (Stability AI)
└── Latent diffusion (efficient!)
└── Open source revolution
2023: SDXL (Stability AI)
└── 1024px, better prompts
└── Two U-Nets (base + refiner)
2024: SD 3.0 / Flux
└── Transformer-based (DiT)
└── Better text rendering

Q1: Why does Stable Diffusion run diffusion in latent space instead of pixel space?

Answer

Running in latent space is 48× more efficient:

  • Pixel space: 512×512×3 = 786,432 values
  • Latent space: 64×64×4 = 16,384 values

This makes training and inference dramatically faster while maintaining quality because:

  1. The VAE learns to compress to perceptually important features
  2. The U-Net can focus on semantic content, not pixel details
  3. Less memory, faster forward passes

Q2: What is classifier-free guidance and why does it improve image quality?

Answer

Classifier-free guidance (CFG) combines unconditional and conditional predictions:

noise_pred = noise_uncond + scale × (noise_cond - noise_uncond)

It improves quality by:

  1. Amplifying features that distinguish “this prompt” from “generic image”
  2. Suppressing generic features not specific to the prompt
  3. Creating a trade-off: higher scale = more prompt adherence but more artifacts

Typical scales: 7-8 for balance, higher for artistic effect.

Q3: A diffusion model is trained for 1000 timesteps. During inference, you want to generate an image in 50 steps. What technique allows this?

Answer

DDIM (Denoising Diffusion Implicit Models) allows skipping steps by:

  1. Making the sampling process deterministic (no random noise added)
  2. Using a non-Markovian process that can “skip” timesteps
  3. Interpolating directly between any two noise levels

DDPM requires sequential steps because each step adds random noise. DDIM removes this randomness, allowing larger jumps.

Q4: You’re training a LoRA on Stable Diffusion to create a specific art style. You have 30 training images. What modules should you target and why?

Answer

For style transfer, target:

  1. Cross-attention K/V (to_k, to_v): How text maps to image features
  2. Self-attention (to_q, to_k, to_v in self-attn): Image coherence and style
  3. Output projections (to_out): Final feature transformation

Why: Style is primarily about HOW features are rendered, which is controlled by attention patterns. Cross-attention controls text→image mapping (so “painting” triggers your style), while self-attention controls overall image coherence.

Low rank (r=4-8) is usually sufficient for style.

Q5: Explain why the forward diffusion process uses the formula x_t = √ᾱ_t · x_0 + √(1 - ᾱ_t) · ε instead of just x_t = x_0 + ε.

Answer

The formula maintains unit variance throughout the diffusion process:

Var(x_t) = (√ᾱ_t)² · Var(x_0) + (√(1-ᾱ_t))² · Var(ε)
= ᾱ_t · 1 + (1-ᾱ_t) · 1
= 1

If we just added noise (x_t = x_0 + ε), variance would grow unbounded, making training unstable.

The coefficients ensure:

  1. Signal preservation: √ᾱ_t controls how much original signal remains
  2. Noise calibration: √(1-ᾱ_t) controls noise magnitude
  3. Smooth transition: From pure signal (t=0) to pure noise (t=T)

This is also known as a variance-preserving diffusion process.


You’ve learned:

  1. Diffusion = noise and denoise: Forward adds noise, reverse removes it
  2. U-Net architecture: Encoder-decoder with skip connections for denoising
  3. DDPM vs DDIM: 1000 steps vs 50 steps, quality vs speed trade-off
  4. Text conditioning: CLIP embeddings + cross-attention
  5. Classifier-free guidance: Amplify prompt adherence by comparing conditional vs unconditional
  6. Stable Diffusion: Latent diffusion (VAE + U-Net + CLIP) for efficiency
  7. LoRA: Efficient fine-tuning for custom styles/characters

The key insight: Diffusion models learn to reverse corruption. Train on “what noise was added?” and you get a model that can generate images from pure noise.


The economics of image creation have been revolutionized:

Stock Photography (Pre-AI):

Use CaseCost per ImageTime to Find
Stock photo license$10-50030 min-2 hrs
Custom photoshoot$500-5,0001-4 weeks
Concept art (freelancer)$200-2,0002-7 days
Product rendering$500-3,0001-2 weeks

AI Generation (2024):

PlatformCost per ImageTime to Generate
Midjourney$0.03-0.1030 seconds
DALL-E 3$0.04-0.0820 seconds
Stable Diffusion (self-hosted)$0.002-0.015-30 seconds
Stable Diffusion (cloud API)$0.01-0.0510 seconds

Cost reduction: 95-99% for many use cases.

Running Stable Diffusion locally vs cloud:

SetupHardware CostPer-Image CostBreakeven
RTX 3090 (24GB)$1,500~$0.00115,000 images
RTX 4090 (24GB)$1,800~$0.000518,000 images
A100 40GB (cloud)$3/hr~$0.01N/A (rental)
Replicate API$0/setup$0.05/image0 images

ROI calculation: If generating >1,000 images/month, local hardware pays for itself within 6-12 months.

Stock photography: Shutterstock, Getty Images saw significant stock price drops after Stable Diffusion’s open-source release. Both companies now offer AI generation tools themselves.

Advertising: Creative agency Publicis reported 30-50% faster ad concepting when using AI image generation for initial ideation.

Game development: Indie studios report 10x faster concept art iteration, enabling smaller teams to produce more visual content.

Quality LevelToolCostUse Case
IdeationAny$0.01Brainstorming, moodboards
Social mediaSD/MJ$0.05Instagram, Twitter
MarketingDALL-E 3/MJ$0.10Ads, presentations
PrintCustom fine-tuned$0.50Magazines, packaging
Hero imagesProfessional + AI$50-500Final campaign assets

Q1: “Explain the difference between the forward and reverse diffusion processes.”

Strong Answer: “Forward diffusion is a fixed, defined process—we gradually add Gaussian noise to an image over many timesteps until it becomes pure noise. It’s not learned; it follows a predetermined schedule. Reverse diffusion is the learned process—we train a neural network to predict and remove the noise at each step. The key insight is that while forward diffusion destroys information deterministically, reverse diffusion must learn to reconstruct plausible images from that destruction. The model learns to denoise by predicting the noise that was added, then subtracting it.”

Q2: “Why does Stable Diffusion operate in latent space instead of pixel space?”

Strong Answer: “Computational efficiency. A 512×512 RGB image has 786,432 dimensions. The latent space is 64×64×4 = 16,384 dimensions—48× smaller. This makes attention operations (which are O(n²)) dramatically cheaper. The VAE learns to compress images to perceptually important features, so we lose minimal quality. The U-Net can focus on semantic content rather than pixel-level details. This insight from the Latent Diffusion paper enabled running on consumer GPUs and made Stable Diffusion accessible to millions.”

Q3: “What is classifier-free guidance and why is it important?”

Strong Answer: “CFG combines conditional and unconditional predictions during inference: we run the model twice, once with the text prompt and once without, then extrapolate in the direction of the prompt. Mathematically: noise_pred = noise_uncond + scale × (noise_cond - noise_uncond). The guidance scale controls how strongly we push toward prompt adherence. Values of 7-8 work well for most cases. It’s important because pure conditional generation often produces blurry, generic images. CFG amplifies the features that distinguish ‘this specific prompt’ from ‘any image.’”

Q4: “Compare DDPM and DDIM sampling. When would you use each?”

Strong Answer: “DDPM is the original formulation—stochastic sampling with 1000 steps. Each step adds noise, which provides diversity but requires many iterations. DDIM makes the process deterministic by removing the noise term, enabling 20-50 step sampling without quality loss. Use DDPM when: you need maximum diversity and quality isn’t time-critical. Use DDIM when: you need fast inference, reproducibility (same seed = same output), or latent space interpolation. Most production systems use DDIM or its variants (DPM++, Euler) for speed.”

Q5: “How would you fine-tune Stable Diffusion for a specific character or style?”

Strong Answer: “Two main approaches: Dreambooth and LoRA. Dreambooth fine-tunes the entire model with a unique identifier token (e.g., ‘sks person’) and regularization images. It’s effective but produces large model files. LoRA adds low-rank adapters to attention layers, training only 0.1% of parameters. It produces small files (10-100MB) that can be combined and switched at runtime. For 5-20 training images, I’d use LoRA targeting cross-attention K/V and self-attention layers, with r=4-8 and 500-1000 training steps. Monitor for overfitting by checking if generations become too similar to training data.”

Q: “Design an AI image generation service for a stock photography company.”

Strong Answer Structure:

  1. Architecture:

    • “Async queue-based architecture: user submits prompt → job queued → GPU workers process → results stored in S3 → user notified”
    • “Separate GPU pools for different quality tiers (fast/preview vs high-quality)”
  2. Model Selection:

    • “Base: Stable Diffusion XL for quality and prompt following”
    • “Fine-tuned LoRAs for specific use cases (people, products, landscapes)”
    • “Multiple checkpoint versions for A/B testing”
  3. Quality Control:

    • “NSFW filter on outputs (CLIP-based classifier)”
    • “Watermark detection to prevent copyright issues”
    • “Human review queue for high-value/flagged content”
  4. Optimization:

    • “Batched inference for throughput”
    • “Mixed precision (FP16) for speed”
    • “Flash attention for memory efficiency”
    • “Prompt caching for repeated text embeddings”
  5. Scaling:

    • “Kubernetes with GPU node autoscaling”
    • “Multi-region for global latency”
    • “CDN for result delivery”

Exercise 1: Visualize the Diffusion Process

Section titled “Exercise 1: Visualize the Diffusion Process”

Create a visualization of forward and reverse diffusion:

import torch
import matplotlib.pyplot as plt
from diffusers import StableDiffusionPipeline
def visualize_diffusion_steps(image, num_steps=10):
"""
Visualize the forward diffusion process:
1. Load an image
2. Apply increasing noise levels
3. Plot as a grid showing degradation
Then visualize reverse:
1. Start from noise
2. Generate with fewer steps each time
3. Show progressive denoising
"""
# YOUR CODE HERE
# Use the forward_diffusion function from the module
# Plot a grid of images at different noise levels
pass
# Test with a sample image
# Create a 2-row visualization: forward (left to right) and reverse (right to left)

Deliverable: Grid visualization showing image → noise → image transition.

Benchmark different schedulers:

from diffusers import (
DDPMScheduler,
DDIMScheduler,
PNDMScheduler,
EulerDiscreteScheduler,
DPMSolverMultistepScheduler,
)
def compare_schedulers(prompt, schedulers, step_counts=[10, 20, 30, 50]):
"""
Compare different schedulers on the same prompt:
1. Generate images with each scheduler at different step counts
2. Measure generation time
3. Calculate FID or CLIP score for quality
4. Create comparison grid
"""
results = {}
for scheduler_name, scheduler in schedulers.items():
for num_steps in step_counts:
# YOUR CODE HERE
# Time the generation
# Store the image and metrics
pass
return results
# Compare: DDPM, DDIM, Euler, DPM++
# Find the sweet spot: minimum steps for acceptable quality

Deliverable: Table showing scheduler × steps → quality/time trade-off.

Fine-tune Stable Diffusion with LoRA:

from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model
import torch
def train_style_lora(
base_model_id: str,
training_images: list,
training_captions: list,
output_dir: str,
num_epochs: int = 10,
):
"""
Train a LoRA for a specific art style:
1. Load base Stable Diffusion
2. Apply LoRA config to U-Net
3. Create training dataloader
4. Training loop with noise prediction loss
5. Save LoRA weights
Target: cross-attention layers (to_k, to_v, to_q)
"""
# YOUR CODE HERE
pass
# Train on 10-20 images of a specific style
# Test that the style transfers to new prompts

Deliverable: Working LoRA that applies a specific style.

Exercise 4: Implement Classifier-Free Guidance

Section titled “Exercise 4: Implement Classifier-Free Guidance”

Build CFG from scratch:

def classifier_free_guidance_sample(
model,
prompt_embedding,
negative_prompt_embedding,
scheduler,
num_steps: int = 30,
guidance_scale: float = 7.5,
):
"""
Implement CFG sampling:
1. Start from random noise
2. At each step:
- Run model with prompt (conditional)
- Run model without prompt (unconditional)
- Blend: uncond + scale * (cond - uncond)
3. Denoise using scheduler
Experiment with guidance_scale: 1, 3, 7, 12, 20
Document the quality vs artifacts trade-off
"""
# YOUR CODE HERE
pass
# Generate images at different guidance scales
# Create a comparison grid showing the effect

Deliverable: Grid showing guidance scale effect on same prompt.


Did You Know? The Thermodynamics Connection

Section titled “Did You Know? The Thermodynamics Connection”

The original diffusion models paper by Sohl-Dickstein et al. (2015) drew inspiration from non-equilibrium thermodynamics—the physics of systems evolving toward or away from thermal equilibrium.

The forward diffusion process is analogous to a physical system increasing entropy—a hot cup of coffee cooling to room temperature, order dissolving into disorder.

The reverse process is like decreasing entropy—which is thermodynamically impossible without adding energy/information. In diffusion models, the “energy” comes from the learned neural network that “knows” what images should look like.

This physics connection explains why diffusion works: we’re learning to reverse a natural process of decay. The math of Gaussian noise addition is well-understood from statistical mechanics, giving diffusion models a solid theoretical foundation.

“We’re not generating images from nothing—we’re learning to reverse the arrow of thermodynamic time, reconstructing order from chaos.” — Inspired by the original 2015 paper


Research Pioneers:

  • Jonathan Ho (@_jonathanho) - DDPM first author, now at Google
  • Robin Rombach - Latent Diffusion/Stable Diffusion creator
  • Yang Song - Score-based generative modeling
  • Prafulla Dhariwal - Guided diffusion, now at OpenAI

Practitioners:

  • Emad Mostaque (@EMostaque) - Stability AI founder
  • ComfyUI community - Advanced workflow builders
  • Civitai - Model and LoRA sharing platform

Architecture:

  • DiT (Diffusion Transformers): Replacing U-Net with transformers (SD 3.0, Flux)
  • Consistency Models: Single-step generation
  • Rectified Flow: Faster training and sampling

Control:

  • ControlNet: Conditioning on poses, edges, depth
  • IP-Adapter: Image prompt conditioning
  • Inpainting: Coherent editing of specific regions

Efficiency:

  • LCM-LoRA: 4-8 step high-quality generation
  • Distillation: Teacher-student for speed
  • Quantization: INT8/INT4 for deployment

  1. DDPM: “Denoising Diffusion Probabilistic Models” (Ho et al., 2020)

  2. DDIM: “Denoising Diffusion Implicit Models” (Song et al., 2020)

  3. Latent Diffusion: “High-Resolution Image Synthesis with Latent Diffusion Models” (Rombach et al., 2022)

  4. Classifier-Free Guidance: (Ho & Salimans, 2022)

  1. Hugging Face Diffusers: Official library

  2. The Annotated Diffusion Model: Step-by-step implementation

  3. Stable Diffusion Deep Dive: Comprehensive guide

  4. ComfyUI: Node-based diffusion workflow

  5. Civitai: Model and LoRA repository

    • https://civitai.com
    • Community hub for sharing fine-tuned models and LoRAs. Browse thousands of custom styles and characters with example images.

For those new to diffusion models, we recommend this progression through the resources above:

  1. Start with DDPM paper (Ho et al., 2020) - understand the foundation
  2. Follow the Annotated Diffusion tutorial - implement from scratch
  3. Learn Diffusers library - production-ready pipelines
  4. Explore ComfyUI - visual experimentation
  5. Study Latent Diffusion paper - understand Stable Diffusion’s architecture
  6. Experiment on Civitai - see what the community has built

This path takes you from theory to practice, building intuition at each stage before moving to the next level of complexity.


Mistake 1: Using DDPM Scheduler in Production

Section titled “Mistake 1: Using DDPM Scheduler in Production”
# WRONG: DDPM needs 1000 steps
from diffusers import DDPMScheduler
scheduler = DDPMScheduler(num_train_timesteps=1000)
# Takes 45+ seconds per image!
# CORRECT: Use DDIM or DPM++ for inference
from diffusers import DPMSolverMultistepScheduler
scheduler = DPMSolverMultistepScheduler(num_train_timesteps=1000)
# 20-25 steps = 2-3 seconds per image

Why: DDPM is the original training scheduler but terribly slow for inference. DPM++ and DDIM achieve nearly identical quality in 20-50x fewer steps.

Mistake 2: Ignoring Guidance Scale Trade-offs

Section titled “Mistake 2: Ignoring Guidance Scale Trade-offs”
# WRONG: Always using default guidance_scale=7.5
image = pipe(prompt, guidance_scale=7.5).images[0] # Fine for most cases
# WRONG: Cranking it to maximum for "better quality"
image = pipe(prompt, guidance_scale=20).images[0] # Oversaturated, artifacts!
# CORRECT: Tune based on use case
# Artistic/creative: 3-5
# Photorealistic: 7-9
# Strong adherence: 10-12
image = pipe(prompt, guidance_scale=8.0).images[0]

Why: Higher guidance doesn’t mean better. Beyond ~12, colors oversaturate and artifacts appear. Different models have different sweet spots.

# WRONG: Full FP32 (uses 2x VRAM)
pipe = StableDiffusionPipeline.from_pretrained("model", torch_dtype=torch.float32)
# CORRECT: FP16 for inference
pipe = StableDiffusionPipeline.from_pretrained("model", torch_dtype=torch.float16)
pipe.enable_attention_slicing() # Further reduce VRAM

Why: Diffusion models work perfectly fine in FP16 for inference. You’ll halve memory usage with negligible quality loss.

Mistake 4: Generating at Wrong Resolutions

Section titled “Mistake 4: Generating at Wrong Resolutions”
# WRONG: Arbitrary resolution
image = pipe(prompt, height=700, width=500).images[0] # Stretched, artifacts
# CORRECT: Use model's native resolution or multiples
# SD 1.5: 512x512, 512x768, 768x512
# SDXL: 1024x1024, 1024x768, 1152x896
image = pipe(prompt, height=1024, width=1024).images[0]

Why: Models are trained at specific resolutions. Non-standard sizes cause distortions and repeated patterns.

Mistake 5: Not Seeding for Reproducibility

Section titled “Mistake 5: Not Seeding for Reproducibility”
# WRONG: Random generation (can't reproduce good results)
image = pipe(prompt).images[0]
# CORRECT: Use explicit seeds
import torch
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
# Save the seed with your images!

Why: Found a perfect generation? Without the seed, you can never recreate it. Always log seeds in production.


  1. Diffusion = Reverse Denoising: Forward process adds noise, reverse process removes it—the neural network learns to denoise, enabling generation from pure noise

  2. Latent Space is Essential: Operating in compressed latent space (Stable Diffusion) reduces computation 64x while preserving quality—VAE encoding makes consumer GPUs viable

  3. U-Net Architecture: The workhorse of diffusion—takes noisy image and timestep, predicts noise to subtract; attention layers enable text conditioning

  4. CLIP Connects Text and Images: Text embeddings from CLIP guide the diffusion process—the model learns which noise patterns correspond to which concepts

  5. Classifier-Free Guidance: The secret sauce for controllable generation—interpolate between conditional and unconditional predictions to strengthen prompt adherence

  6. Scheduler Choice Matters: DDPM for training, DPM++/DDIM for inference—wrong scheduler = 50x slower generation with no quality benefit

  7. LoRA Enables Customization: Low-rank adaptation fine-tunes specific styles/concepts with 1000x fewer parameters than full fine-tuning

  8. Guidance Scale Trade-off: Higher guidance = stronger prompt adherence but more artifacts; 7-12 is the sweet spot for most use cases

  9. Resolution Constraints: Always generate at trained resolution or aspect ratios—non-standard sizes cause repetition and distortion artifacts

  10. Reproducibility Requires Seeds: Always save seeds with generations—without them, you cannot recreate or iterate on successful outputs


Move on to Module 34: Code Generation Models where you’ll learn:

  • How Codex, Copilot, and Code Llama work
  • Specialized training for code
  • Fill-in-the-middle and infilling
  • Evaluating code generation

Or explore the deliverable to:

  • Visualize the diffusion process step by step
  • Experiment with different schedulers
  • Generate images with Stable Diffusion
  • Create custom LoRAs

Last updated: 2025-11-27 Status: Complete