Fine-tuning LLMs
AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8
Or: Teaching Old Models New Tricks (Without Breaking the Bank)
Reading Time: 7-8 hours | Prerequisites: Module 31
The $15 Million Shortcut That Changed Everything
Palo Alto, California. Spring 2021. 9:45 PM.
Edward Hu was frustrated. His team at Microsoft Research had spent months fine-tuning GPT-3 for internal applications, and the numbers were brutal: $15 million in compute costs, hundreds of GPUs running for weeks, and every new use case required starting over.
“There has to be a better way,” he muttered, staring at the attention matrices. Then it hit him: what if most of the model’s knowledge was already right, and they only needed to nudge it in the right direction?
Six weeks later, Hu’s team published “LoRA: Low-Rank Adaptation of Large Language Models.” Instead of updating all 175 billion parameters, LoRA updated less than 0.1%—a few million trainable parameters injected into the attention layers. The cost dropped from millions of dollars to hundreds. The time dropped from weeks to hours.
“The insight was embarrassingly simple. During fine-tuning, the weight changes form a low-rank structure. So instead of updating the full matrix, we inject two small matrices whose product approximates the update. Same result, 10,000x cheaper.” — Edward Hu, LoRA inventor, Microsoft Research
Today, LoRA and its variants power virtually every fine-tuned open-source model in production.
What You’ll Be Able to Do
By the end of this module, you will:
- Understand when to fine-tune vs use RAG vs prompt engineering
- Master LoRA and QLoRA for efficient fine-tuning
- Fine-tune open-source models (Llama, Mistral) on custom datasets
- Prepare high-quality training datasets
- Evaluate fine-tuned models properly
- Deploy fine-tuned models to production
- Calculate and optimize fine-tuning costs
The Evolution of Fine-tuning: From Impossible to Accessible
The story of fine-tuning LLMs is a story of democratization: making the impossible possible, then making it affordable.
2018-2020: The Era of the Giants
When BERT emerged from Google in 2018, “fine-tuning” was already a thing, but it was manageable: BERT had 340 million parameters, fitting on a single GPU. Researchers fine-tuned BERT for everything: sentiment analysis, named entity recognition, question answering. The recipe was simple: take pre-trained weights, train on your task, done.
Then GPT-3 arrived in 2020 with 175 billion parameters. Suddenly, fine-tuning wasn’t simple anymore. You couldn’t just load the model—it wouldn’t fit. Training required hundreds of GPUs. Only OpenAI, Google, and a handful of well-funded labs could even attempt it.
Did You Know? Training GPT-3 cost an estimated $4.6 million in compute alone. Fine-tuning the full model on a custom dataset would cost another $100,000-500,000—far beyond the reach of most organizations. This created a two-tier system: companies that could afford fine-tuning and everyone else.
2021-2022: The PEFT Revolution
The breakthrough came from parameter-efficient methods: adapter layers (first proposed in 2019), followed in 2021 by Prefix Tuning and LoRA. These methods shared a radical insight: you don’t need to update all parameters. A tiny fraction, sometimes less than 0.1%, can achieve nearly identical results.
Think of it like this: if you want to teach a chess grandmaster to also play checkers, you don’t retrain their entire brain. You add a small “checkers module” that sits alongside their existing knowledge. That’s exactly what PEFT methods do.
2023-Present: The QLoRA Revolution
Tim Dettmers’s QLoRA paper in May 2023 was the final piece. By combining quantization (compressing the model to 4-bit precision) with LoRA adapters (trained at higher precision), he and his co-authors showed that anyone with a gaming GPU could fine-tune models that previously required data center hardware.
The numbers tell the story:
- 2020: Fine-tune GPT-3 → $500,000 and 100 GPUs
- 2023: Fine-tune Llama 65B with QLoRA → $50 and 1 GPU
That’s a 10,000× cost reduction in three years.
The Big Picture: Why Fine-tune?
Imagine you’ve hired a brilliant new employee: they graduated top of their class, speak eloquently, and have read millions of books. But they know nothing about your company. They don’t know your products, your jargon, or how you like things done.
You have three options:
- Give them a manual to consult (RAG) — they look things up as needed
- Coach them with examples (prompt engineering) — show them what you want each time
- Train them on the job (fine-tuning) — they internalize your company’s way
Fine-tuning is option 3. It modifies the model’s weights so the knowledge becomes part of the model, not something it retrieves or is reminded of each time.
When to Fine-tune
| Use Case | Best Approach |
|---|---|
| Custom knowledge (docs, FAQs) | RAG |
| New tasks with few examples | Few-shot prompting |
| Consistent style/format | Fine-tuning |
| Domain-specific language | Fine-tuning |
| New behaviors/capabilities | Fine-tuning |
| Cost optimization (repeated tasks) | Fine-tuning |
| Speed optimization | Fine-tuning |
The key insight: Fine-tuning changes how the model behaves, not what it knows. For adding knowledge, use RAG. For changing behavior, use fine-tuning.
Did You Know? OpenAI’s ChatGPT wasn’t just a larger GPT-3. The magic came from fine-tuning: first on human demonstrations (SFT), then on human preferences (RLHF). The base GPT-3 couldn’t hold a conversation — fine-tuning taught it to be helpful, harmless, and honest. This insight spawned an entire field of “alignment” research.
The Fine-tuning Spectrum
Not all fine-tuning is created equal. There’s a spectrum from full fine-tuning to minimal adaptation:
```
Full Fine-tuning ←──────────────────────────────────────────────→ No Fine-tuning

Full fine-tuning → LoRA → QLoRA → Prompt Tuning → Few-shot Learning
  [All params]            [4-bit]  [Soft prompts]    [In-context]
```
Full Fine-tuning
Update ALL model parameters. For a 7B model:
- 7 billion parameters × 4 bytes = 28 GB just for the model
- Plus optimizer states (Adam needs 2x): 84 GB total
- Plus activations, gradients: ~100+ GB
You need multiple A100s ($10-30/hour on cloud).
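The arithmetic above can be checked with a small calculator. This is a rough sketch under stated assumptions: FP32 weights (4 bytes each), one gradient value per weight, and two Adam moment buffers per weight. The function name `full_finetune_memory_gb` is made up for illustration, and activation memory is deliberately left out because it depends on batch size and sequence length.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    """Rough memory needed for full fine-tuning, in GB (1 GB = 1e9 bytes)."""
    weights = n_params * bytes_per_param / 1e9
    adam_states = 2 * weights   # Adam keeps two moving averages per weight
    gradients = weights         # one gradient value per weight
    return {
        "weights": weights,
        "weights_plus_adam": weights + adam_states,
        "total_before_activations": weights + adam_states + gradients,
    }


estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>26}: {gb:6.1f} GB")
```

The 28 GB and 84 GB figures from the list fall out directly; adding gradients pushes the total past 100 GB before any activations are counted.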
Parameter-Efficient Fine-tuning (PEFT)
The insight: We don’t need to update all parameters! We can:
- Freeze most of the model
- Add small trainable components
- Train only these new components
This reduces memory from 100GB to 10-15GB — fitting on consumer GPUs!
Think of it like this: Instead of rebuilding the entire house to add a room, you’re just adding an extension. The foundation (original weights) stays intact.
LoRA: The Game-Changer
LoRA (Low-Rank Adaptation) is the most important fine-tuning technique to understand. Published by Microsoft in 2021, it revolutionized how we customize LLMs.
The Core Idea
Neural network weight matrices are typically huge. For a 7B model, a single layer’s attention weights might be 4096×4096 = 16.7 million parameters.
LoRA’s insight: The update to these weights during fine-tuning is low-rank. We don’t need a full 4096×4096 update matrix — a much smaller decomposition works just as well.
Think of it like compression. Just as a high-resolution image can often be compressed to 10% of its size without visible quality loss, the fine-tuning update can be compressed dramatically.
The Math
Instead of updating weight W directly:
```
W_new = W + ΔW
```
LoRA decomposes the update into two smaller matrices:
```
W_new = W + B × A

Where:
- W is frozen (original weights): d × d
- A is trainable: r × d (r << d)
- B is trainable: d × r
- ΔW = B × A: d × d (reconstructed)
```
Worked Example:
Original layer: 4096 × 4096 = 16,777,216 parameters (17M)
With LoRA (rank r = 16):
- A: 16 × 4096 = 65,536 parameters
- B: 4096 × 16 = 65,536 parameters
- Total: 131,072 parameters (131K)
Compression ratio: 17M / 131K = 128× fewer trainable parameters!
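The same count can be reproduced for other ranks. A minimal sketch, assuming a single square d × d layer as in the worked example; `lora_param_count` is an illustrative helper, not part of any library.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


d = 4096
full = d * d
for r in (4, 8, 16, 64):
    lora = lora_param_count(d, d, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} full ({full // lora}x fewer)")
```

Because the count grows linearly with r, doubling the rank halves the compression ratio.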
Visual Representation
```
Original:          With LoRA:

┌─────────┐        ┌─────────┐
│         │        │    W    │  (frozen)
│    W    │        │         │
│  d × d  │   →    └────┬────┘
│         │             │
└─────────┘        ┌────┴────┐
                   │    +    │
                   └────┬────┘
                        │
                   ┌────┴────┐
                   │  B × A  │
                   │(d×r×r×d)│  (trainable)
                   └─────────┘
```
Why Low-Rank Works
During fine-tuning, the model doesn’t need to learn entirely new representations; it just needs to adapt existing ones to the new task. This adaptation lies in a low-dimensional subspace.
Think of a professional musician learning a new piece. They don’t relearn how to play their instrument—those skills are already deeply embedded. They just need to learn the specific fingerings and expressions for this new piece. LoRA works the same way: the model already “knows how to think” from pre-training, fine-tuning just teaches it the specific patterns for your task.
The original paper showed that with r=8 (just 8 dimensions!), LoRA could match full fine-tuning performance on many tasks. This is remarkable: 8 dimensions capturing the essence of a task-specific adaptation!
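To make the mechanics concrete, here is a toy sketch of how the frozen weight and the two adapter matrices combine. It uses plain Python lists and tiny dimensions so it runs anywhere; real implementations such as the `peft` library work on tensors and apply the adapter inside the layer's forward pass. The zero initialization of B, the small random initialization of A, and the α/r scaling are assumptions based on the usual LoRA recipe; all names are illustrative.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A); W itself is never modified."""
    delta = matmul(B, A)          # d x d, but rank at most r
    scale = alpha / r
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


random.seed(0)
d, r, alpha = 6, 2, 4             # toy sizes, chosen only for readability
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]       # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]    # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                    # d x r, zero init

start = lora_effective_weight(W, A, B, alpha, r)
print("adapter is a no-op at initialization:", start == W)

B[0][0] = 1.0                     # stand-in for a gradient update to B
updated = lora_effective_weight(W, A, B, alpha, r)
changed_rows = [i for i in range(d) if updated[i] != W[i]]
print("rows changed after one update to B:", changed_rows)
```

Since B starts at zero, the product B × A is zero and training begins from exactly the pre-trained behavior; only the adapter matrices ever receive updates.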
Did You Know? Edward Hu, the lead author of the LoRA paper, was a PhD student at Microsoft Research when he developed the technique. The paper was initially rejected from NeurIPS 2021 but went on to become one of the most cited and influential papers in LLM adaptation. It’s now the standard for fine-tuning open-source models, used by millions.
QLoRA: Fine-tuning for Everyone
QLoRA combines LoRA with quantization to make fine-tuning accessible on consumer GPUs.
What is Quantization?
Models are typically stored in FP32 (32-bit floating point) or FP16 (16-bit). Quantization reduces this to INT8 (8-bit) or even INT4 (4-bit):
```
FP32: 3.14159265...  (full precision)
FP16: 3.1416         (half precision)
INT8: 3              (8-bit integer + scale)
INT4: ~3             (4-bit, extreme compression)
```
Memory savings:
- FP32: 4 bytes per parameter
- FP16: 2 bytes per parameter
- INT8: 1 byte per parameter
- INT4: 0.5 bytes per parameter
For a 7B model:
- FP32: 28 GB
- FP16: 14 GB
- INT4: 3.5 GB ← Fits on consumer GPU!
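The "integer + scale" idea can be illustrated with simple absmax quantization. This is a sketch of the general principle only: it is not the NF4 data type that QLoRA uses, and the function names and sample weights are made up for the example.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:              # all-zero input: any scale works
        scale = 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in q_values]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
errors = {}
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    errors[bits] = max(abs(w - x) for w, x in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, max error={errors[bits]:.4f}")
```

Fewer bits mean a coarser grid and larger rounding error, which is the precision loss that the higher-precision LoRA adapters are meant to absorb.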
QLoRA’s Innovation
QLoRA (Tim Dettmers et al., 2023) introduced:
- 4-bit NormalFloat (NF4): A new 4-bit data type optimized for normally distributed weights
- Double Quantization: Quantize the quantization constants too
- Paged Optimizers: Use CPU RAM when GPU memory fills up
The result: Fine-tune a 65B model on a single 48GB GPU!
The Catch
Quantization introduces some precision loss. But QLoRA showed that a 4-bit base model plus LoRA adapters can match 16-bit full fine-tuning on most tasks. The LoRA adapters, which are kept in higher precision, compensate for quantization error.
```python
# QLoRA configuration
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Double quantization
)

# LoRA config
lora_config = LoraConfig(
    r=16,                # Rank
    lora_alpha=32,       # Scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```
Notice how the configuration specifies which modules to apply LoRA to. For transformers, the attention projections (Q, K, V, O) are the most common targets.
Did You Know? Tim Dettmers, the creator of QLoRA, is known for his work on making deep learning more accessible. He maintains bitsandbytes, the library that powers QLoRA’s quantization. His blog posts on GPU memory optimization are required reading for anyone training large models on limited hardware.
Practical Fine-tuning: Step by Step
Let’s walk through fine-tuning a model from start to finish.
Step 1: Choose Your Base Model
For most use cases, start with these open-source models:
| Model | Size | Good For |
|---|---|---|
| Llama 3.1 8B | 8B | General tasks, instruction following |
| Mistral 7B | 7B | Fast inference, general tasks |
| Phi-3 | 3.8B | Limited resources, mobile |
| Qwen 2 | 7B | Multilingual, coding |
| Gemma 2 | 9B | Google ecosystem |
Rule of thumb: Start with the smallest model that might work. Fine-tuning amplifies a model’s existing capabilities — it can’t add capabilities that aren’t there.
Step 2: Prepare Your Dataset
The quality of your fine-tuning depends entirely on your dataset quality. Garbage in, garbage out.
Dataset Format:
Most fine-tuning uses instruction format:
```json
{
  "instruction": "Summarize the following article",
  "input": "The article text here...",
  "output": "The summary here..."
}
```
Or conversation format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
```
Dataset Size Guidelines:
| Task Type | Minimum Samples | Recommended |
|---|---|---|
| Style transfer | 100-500 | 1,000+ |
| Domain adaptation | 1,000 | 5,000+ |
| New task learning | 5,000 | 10,000+ |
| Behavior modification | 500-2,000 | 5,000+ |
Quality > Quantity: 1,000 high-quality examples beat 100,000 noisy ones.
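If your data is in instruction format but your training setup expects conversations, the conversion is mechanical. The sketch below assumes the two layouts shown in the dataset format examples; `instruction_to_messages` and the default system prompt are illustrative choices, not a standard API.

```python
def instruction_to_messages(example: dict,
                            system_prompt: str = "You are a helpful assistant.") -> dict:
    """Convert one instruction-format record into conversation format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Append the optional input below the instruction
        user_content = f"{user_content}\n\n{example['input']}"
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }


record = {
    "instruction": "Summarize the following article",
    "input": "The article text here...",
    "output": "The summary here...",
}
converted = instruction_to_messages(record)
for message in converted["messages"]:
    print(message["role"], "->", repr(message["content"]))
```

Keeping one canonical format on disk and converting at load time makes it easier to switch base models later.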
Step 3: Configure Training
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",

    # Core training settings
    num_train_epochs=3,               # 3-5 epochs typically
    per_device_train_batch_size=4,    # Depends on GPU memory
    gradient_accumulation_steps=4,    # Effective batch = 4 * 4 = 16

    # Learning rate
    learning_rate=2e-4,               # LoRA can use higher LR
    lr_scheduler_type="cosine",       # Gradual decay
    warmup_ratio=0.03,                # Warm up for 3% of steps

    # Optimization
    optim="paged_adamw_8bit",         # Memory-efficient optimizer
    max_grad_norm=0.3,                # Gradient clipping

    # Logging
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",      # Called eval_strategy in newer transformers releases

    # Memory optimization
    fp16=True,                        # Mixed precision (consider bf16=True on GPUs that support it)
    gradient_checkpointing=True,      # Trade compute for memory
)
```
Notice how gradient_checkpointing is enabled: this recomputes activations during the backward pass instead of storing them, trading roughly 30% more compute for roughly 50% less memory.
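It helps to translate these settings into step counts before launching a run. The following is a back-of-the-envelope sketch: the helper name and the 10,000-example dataset are assumptions, and the trainer's own rounding may differ by a step or two.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum,
                      epochs, warmup_ratio, num_devices=1):
    """Estimate optimizer steps implied by batch, accumulation, and epoch settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = training_schedule(num_examples=10_000, per_device_batch=4,
                         grad_accum=4, epochs=3, warmup_ratio=0.03)
for key, value in plan.items():
    print(f"{key:>16}: {value}")
```

With `logging_steps=10`, a plan like this also tells you roughly how many log lines to expect and how soon the warmup phase ends.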
Step 4: Fine-tune
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

# Load model with quantization (bnb_config comes from the QLoRA configuration)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} ({100 * trainable / total:.2f}%)")

print_trainable_parameters(model)
# Expect a count in the millions, well under 1% of the total;
# the exact figure depends on the model and the LoRA config

# Train (train_dataset and eval_dataset are assumed to come from Step 2)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,      # Argument names vary across TRL versions
    args=training_args,
    max_seq_length=512,       # Newer TRL versions may expect this in the config object
)

trainer.train()
```
Notice how only a small fraction of the parameters is trainable. This is the power of LoRA.
Step 5: Save and Merge
After training, you have two options:
Option A: Keep adapters separate (recommended for testing)
```python
# Save just the LoRA adapters (small, ~50MB)
model.save_pretrained("./lora-adapters")
```
Option B: Merge into base model (for deployment)
```python
# Merge LoRA weights into base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```
Production War Stories: When Fine-tuning Goes Wrong
Learning from failures is often more valuable than studying successes. Here are real stories from the trenches of fine-tuning.
The $2 Million Medical Hallucination
Boston. August 2023. A healthcare startup fine-tuned Llama 2 on medical records to create a patient communication assistant. After three weeks and $8,000 in compute, the model looked great: it passed internal testing with 98% accuracy on sample queries.
Six weeks after deployment, a patient received a message suggesting they “increase their insulin dose significantly” based on their recent blood work. The patient did. They ended up in the ER with severe hypoglycemia.
What went wrong? The training data included notes from doctors who used imprecise language like “consider increasing dose” without specifying amounts. The model learned to give advice but not to be precise about medical dosing. Worse, the evaluation focused on fluency and format, not medical accuracy.
The fix took 4 months:
- Added a medical review pipeline with licensed physicians
- Re-fine-tuned with explicit dosing examples and refusal patterns
- Added a hard filter that blocked any message mentioning medication adjustments
- Implemented mandatory human review for all medical content
Total cost: $2.1 million (legal fees, settlements, re-development, and the PR nightmare). The lesson? Fine-tuning amplifies what’s in your data—including subtle errors and dangerous patterns.
The Bias That Nobody Caught
London. October 2023. A fintech company fine-tuned a model on historical loan approval decisions to automate “preliminary screening.” The model achieved 94% agreement with human underwriters, better than their target.
Three months in, a data scientist noticed something odd: applicants with names common in certain ethnic communities were being flagged for “additional review” at 3× the rate of others. The model had learned the biases embedded in decades of human decisions.
What went wrong? The training data reflected historical discrimination patterns. The model didn’t learn to assess creditworthiness—it learned to predict what human underwriters (with their biases) would decide.
The aftermath:
- Regulatory investigation
- $500,000 in fines
- Model rolled back entirely
- Six-month remediation program
Did You Know? This isn’t unique to fine-tuning. Amazon scrapped an AI recruiting tool in 2018 after discovering it had learned to penalize resumes containing the word “women’s” (as in “women’s chess club captain”). The model was trained on 10 years of hiring data—which reflected the tech industry’s gender imbalance. Fine-tuning on historical data inherits historical mistakes.
The Success Story: Bloomberg’s GPT
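Not all stories are cautionary. Bloomberg’s 50-billion-parameter BloombergGPT shows domain adaptation done right. Strictly speaking, it was trained from scratch on a mix of financial and general data rather than fine-tuned from an existing model, but the same data and evaluation discipline applies to fine-tuning.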
The approach:
- Curated training data: 363 billion tokens of financial documents, filings, news
- Mixed training: roughly half financial, half general-purpose data (which helped preserve general language ability)
- Domain evaluation: Created financial NLP benchmarks, not just general ones
- Conservative deployment: Started with internal research tools, not customer-facing products
Results:
- Outperformed similarly sized open models on financial NLP benchmarks
- Maintained general language abilities
- Now powers internal analyst tools
The key difference? Bloomberg treated fine-tuning as a careful engineering project, not a “train and deploy” experiment.
Dataset Preparation: The Most Important Step
Your fine-tuning is only as good as your data. Here’s how to prepare high-quality datasets.
Principles of Good Training Data
- Diverse but focused: Cover the range of tasks, but stay on-topic
- High quality: Every example should be something you’d want the model to output
- Consistent format: Use the same structure throughout
- Balanced: Don’t over-represent any single pattern
- No leakage: Ensure train/eval split is truly separate
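The "no leakage" principle is easy to violate when near-identical prompts end up on both sides of a random split. One way to guard against that, sketched here under the assumption that examples are keyed by their `instruction` text, is to assign the split from a hash of the normalized text so that duplicates always land together. The function name and the 10% evaluation fraction are illustrative.

```python
import hashlib


def assign_split(text: str, eval_fraction: float = 0.1) -> str:
    """Deterministically map a text to 'train' or 'eval' from its normalized hash."""
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


examples = [{"instruction": f"Question number {i}", "output": "..."} for i in range(200)]
examples.append({"instruction": "question   NUMBER 7", "output": "..."})  # near-duplicate

splits = {"train": [], "eval": []}
for ex in examples:
    splits[assign_split(ex["instruction"])].append(ex)

print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the text, re-running the pipeline after adding new data never moves an existing example from train to eval.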
Data Cleaning Pipeline
```python
import json
from typing import List, Dict

def clean_dataset(examples: List[Dict]) -> List[Dict]:
    """Clean and validate training examples."""
    cleaned = []

    for ex in examples:
        # Skip empty examples
        if not ex.get("instruction") or not ex.get("output"):
            continue

        # Skip very short outputs (likely low quality)
        if len(ex["output"]) < 50:
            continue

        # Skip duplicates (check instruction similarity)
        # is_duplicate is not defined in this snippet; supply your own helper
        if is_duplicate(ex, cleaned):
            continue

        # Normalize whitespace
        ex["instruction"] = " ".join(ex["instruction"].split())
        ex["output"] = " ".join(ex["output"].split())

        cleaned.append(ex)

    return cleaned

def format_for_training(example: Dict) -> str:
    """Format example as chat template."""
    return f"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{example['instruction']}<|im_end|>
<|im_start|>assistant
{example['output']}<|im_end|>"""
```
Notice how formatting uses special tokens (<|im_start|>, <|im_end|>). These must match your base model’s chat template, so check the tokenizer documentation.
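The cleaning pipeline calls `is_duplicate` without defining it. One possible implementation, offered as a hypothetical sketch rather than the original author's helper, compares normalized instruction text with the standard library's `difflib.SequenceMatcher`. The 0.9 threshold is an arbitrary starting point.

```python
from difflib import SequenceMatcher


def is_duplicate(example: dict, kept: list, threshold: float = 0.9) -> bool:
    """Return True if the example's instruction closely matches one already kept."""
    candidate = " ".join(example["instruction"].lower().split())
    for other in kept:
        existing = " ".join(other["instruction"].lower().split())
        if SequenceMatcher(None, candidate, existing).ratio() >= threshold:
            return True
    return False


kept = [{"instruction": "Summarize the following article"}]
print(is_duplicate({"instruction": "Summarize the  following article."}, kept))
print(is_duplicate({"instruction": "Translate this sentence to French"}, kept))
```

This pairwise comparison is quadratic in dataset size, so for large datasets a hashing approach is usually a better fit. It also looks only at the instruction, so datasets that reuse one instruction with many different inputs would need the input included in the comparison.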
Common Dataset Sources
- Curate from production logs: Real user queries are gold
- Generate with stronger models: Use gpt-5 to create examples
- Public datasets: Hugging Face Hub has thousands
- Manual creation: Expensive but highest quality
Did You Know? The Alpaca dataset, which kickstarted open-source instruction tuning, was generated by prompting OpenAI’s text-davinci-003 to create 52,000 instruction-following examples. It cost less than $500 to generate and enabled fine-tuning LLaMA 7B to follow instructions. This technique, using a stronger model to generate training data for a weaker one, is a form of distillation and is now standard practice.
Evaluation: How Do You Know It Worked?
Fine-tuning without evaluation is flying blind. Here’s how to know if your fine-tuning worked.
Quantitative Metrics
Perplexity (PPL): How surprised is the model by the test data?
```python
import math

import torch

def compute_perplexity(model, eval_dataset, tokenizer):
    model.eval()
    total_loss = 0
    total_tokens = 0

    # Assumes each batch is a dict of tensors that includes "labels"
    for batch in eval_dataset:
        with torch.no_grad():
            outputs = model(**batch)
        # outputs.loss is a per-token mean, so weight it by the batch's token count
        total_loss += outputs.loss.item() * batch["input_ids"].numel()
        total_tokens += batch["input_ids"].numel()

    perplexity = math.exp(total_loss / total_tokens)
    return perplexity
```
Task-specific metrics:
- Classification: Accuracy, F1
- Generation: BLEU, ROUGE
- Code: Pass@k
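To build intuition for what a perplexity number means, it can be computed by hand from per-token probabilities. This sketch assumes you already have the natural-log probability the model assigned to each correct token; the helper name and the toy inputs are illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that always spreads probability evenly over 4 choices
uniform_over_four = [math.log(1 / 4)] * 10
# A model that assigns 90% probability to every correct token
confident = [math.log(0.9)] * 10

print(f"uniform over 4 tokens: {perplexity(uniform_over_four):.3f}")
print(f"90% confident:         {perplexity(confident):.3f}")
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens, which is why lower is better.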
Qualitative Evaluation
Run the model on representative prompts and check:
- Does it follow instructions?
- Is the style correct?
- Does it hallucinate less/more?
- Is it safe?
```python
test_prompts = [
    "Explain quantum computing to a 5-year-old",
    "Write a formal email declining a meeting",
    "Debug this Python code: [code here]",
    # Add domain-specific prompts
]

# generate() stands for your own inference helper; it is not defined here
for prompt in test_prompts:
    response = generate(model, prompt)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("-" * 50)
```
The Comparison Test
Always compare:
1. Base model (no fine-tuning)
2. Fine-tuned model
3. Base model with few-shot examples

If #3 beats #2, your fine-tuning didn’t help, so just use few-shot prompting!
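A three-way comparison like this is straightforward to automate once each system is wrapped as a function from prompt to text. The harness below is a minimal sketch: the stand-in "models" are plain lambdas so the example runs without a GPU, and `compare_systems` and `exact_match` are illustrative names, not library functions.

```python
def exact_match(prediction: str, expected: str) -> float:
    """Score 1.0 when the prediction equals the reference, ignoring case and edges."""
    return 1.0 if prediction.strip().lower() == expected.strip().lower() else 0.0


def compare_systems(systems: dict, test_cases: list, score) -> dict:
    """Return the mean score of every system on the same test cases."""
    results = {}
    for name, generate in systems.items():
        scores = [score(generate(case["prompt"]), case["expected"]) for case in test_cases]
        results[name] = sum(scores) / len(scores)
    return results


# Stand-ins for real models, used only to make the sketch runnable
systems = {
    "base": lambda prompt: "unknown",
    "fine_tuned": lambda prompt: {"Hello": "Bonjour", "Goodbye": "Au revoir"}.get(prompt, "unknown"),
    "base_few_shot": lambda prompt: {"Hello": "Bonjour"}.get(prompt, "unknown"),
}
test_cases = [
    {"prompt": "Hello", "expected": "Bonjour"},
    {"prompt": "Goodbye", "expected": "Au revoir"},
]

results = compare_systems(systems, test_cases, exact_match)
for name, mean_score in results.items():
    print(f"{name:>14}: {mean_score:.2f}")
```

Swapping `exact_match` for a task-appropriate metric keeps the harness unchanged while making the comparison meaningful for generation tasks.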
Common Pitfalls and Solutions
1. Catastrophic Forgetting
Problem: Model forgets general knowledge after fine-tuning.
Solution:
- Use LoRA instead of full fine-tuning
- Mix in general data (10-20%) with your task data
- Use lower learning rate
2. Overfitting
Problem: Model memorizes training data, doesn’t generalize.
Signs:
- Training loss drops but eval loss increases
- Model outputs training examples verbatim
Solution:
- More diverse training data
- Higher LoRA dropout (0.1-0.2)
- Fewer epochs
- Regularization (weight decay)
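The first overfitting sign, eval loss rising while training loss falls, can be turned into a simple stopping rule. This is a hand-rolled sketch with illustrative names and loss values; the `transformers` library also ships an `EarlyStoppingCallback` that serves the same purpose inside a training run.

```python
def should_stop(eval_losses, patience=2, min_delta=0.0):
    """True when the last `patience` evaluations failed to beat the earlier best loss."""
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return all(loss >= best_before - min_delta for loss in eval_losses[-patience:])


history = []
for epoch, loss in enumerate([1.20, 0.90, 0.80, 0.85, 0.90], start=1):
    history.append(loss)
    if should_stop(history):
        print(f"stop after epoch {epoch}: best eval loss was {min(history):.2f}")
        break
    print(f"epoch {epoch}: eval loss {loss:.2f}, continuing")
```

When stopping this way, keep the checkpoint from the best epoch rather than the last one.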
3. Training Instability
Problem: Loss spikes or goes NaN.
Solution:
- Lower learning rate
- Gradient clipping (max_grad_norm=0.3)
- Warmup period
- Check for data issues (NaN, very long sequences)
4. Wrong Chat Template
Problem: Model outputs garbage or doesn’t follow instructions.
Solution:
- Use the correct chat template for your base model
- Check the tokenizer’s chat_template attribute
- Ensure consistent formatting between train and inference
```python
# Check the model's expected format
print(tokenizer.chat_template)

# Apply it correctly
# add_generation_prompt=True is for inference; use False when formatting
# complete training examples that already contain the assistant reply
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```
5. Insufficient Data
Problem: Model doesn’t learn the task.
Solution:
- More epochs (but watch for overfitting)
- Data augmentation (paraphrase, back-translate)
- Start with a model closer to your domain
- Consider RAG instead
Did You Know? The phenomenon of catastrophic forgetting has plagued neural networks since the 1980s. It’s why continual learning (learning new tasks without forgetting old ones) remains an open research problem. LoRA helps here: because the base weights stay frozen, the model cannot drift as far, which reduces forgetting (though it does not eliminate it), and removing the adapter restores the base model exactly.
Cost Analysis
Let’s do the math on fine-tuning costs.
Cloud GPU Costs
| GPU | VRAM | Cost/hour | Can Train |
|---|---|---|---|
| T4 | 16GB | $0.50 | 7B with QLoRA |
| A10G | 24GB | $1.00 | 7B with QLoRA |
| A100 40GB | 40GB | $4.00 | 13B with QLoRA |
| A100 80GB | 80GB | $8.00 | 70B with QLoRA |
Time Estimates
For fine-tuning Llama 3.1 8B on 10,000 examples:
| Setup | Time | Cost |
|---|---|---|
| 1x A10G | ~4 hours | $4 |
| 1x A100 | ~1.5 hours | $6 |
| 4x A100 | ~25 min | $13 |
Key insight: More GPUs = faster but not cheaper. Optimize for your constraints.
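The cost column is just hours × hourly rate × GPU count, which makes it easy to rerun with your own provider's prices. The sketch below uses the rates from the Cloud GPU Costs table; `training_cost` is an illustrative helper. The 4x A100 row appears to line up with the $8/hour 80GB rate, while the single-A100 row lines up with the $4/hour 40GB rate, so treat the table as approximate.

```python
def training_cost(hours: float, rate_per_gpu_hour: float, num_gpus: int = 1) -> float:
    """Total cost in dollars for a run of the given length."""
    return hours * rate_per_gpu_hour * num_gpus


runs = {
    "1x A10G": training_cost(4, 1.00),
    "1x A100 (40GB rate)": training_cost(1.5, 4.00),
    "4x A100 (80GB rate)": training_cost(25 / 60, 8.00, num_gpus=4),
}
for name, cost in runs.items():
    print(f"{name:>20}: ${cost:.2f}")
```

Scaling to more GPUs shortens the wall-clock time but multiplies the hourly bill, which is why the fastest option is also the most expensive.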
Comparison: Fine-tuning vs RAG vs API
For 10,000 queries/month:
| Approach | Setup Cost | Per-Query Cost | Monthly Cost |
|---|---|---|---|
| Fine-tuned local | $5-50 | ~$0 | ~$20 (hosting) |
| RAG with API | $0 | $0.01-0.05 | $100-500 |
| API few-shot | $0 | $0.02-0.10 | $200-1000 |
Fine-tuning wins when you have high volume and stable requirements.
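A quick way to test "high volume" against your own numbers is a break-even calculation. This sketch takes the hosting and per-query figures from the comparison table as assumptions; the helper names are made up, and it ignores engineering time and the one-off training cost.

```python
def monthly_cost(queries: int, per_query: float, fixed: float = 0.0) -> float:
    """Monthly spend: a fixed hosting fee plus a per-query charge."""
    return fixed + queries * per_query


def break_even_queries(fixed_monthly: float, api_per_query: float) -> float:
    """Monthly query volume at which a fixed-cost deployment matches a per-query API."""
    return fixed_monthly / api_per_query


queries = 10_000
self_hosted = monthly_cost(queries, per_query=0.0, fixed=20.0)
api_low = monthly_cost(queries, per_query=0.01)
api_high = monthly_cost(queries, per_query=0.05)
threshold = break_even_queries(fixed_monthly=20.0, api_per_query=0.01)

print(f"self-hosted: ${self_hosted:.0f}/month")
print(f"API:         ${api_low:.0f}-${api_high:.0f}/month")
print(f"break-even:  about {threshold:.0f} queries/month at $0.01 per query")
```

Below the break-even volume the API is cheaper; above it, the fixed hosting fee is spread over enough queries to win.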
Deployment Options
Option 1: Hugging Face Inference Endpoints
Easiest deployment: just upload your model.
```python
# Push to Hub
model.push_to_hub("your-username/my-finetuned-model")

# Deploy as endpoint (click in HF UI or use API)
```
Cost: $0.60-4.00/hour depending on GPU
Option 2: Self-hosted with vLLM
For cost optimization at scale:
```bash
# Install vLLM
pip install vllm

# Run server
python -m vllm.entrypoints.openai.api_server \
    --model your-model-path \
    --tensor-parallel-size 1
```
vLLM optimizations:
- PagedAttention: up to 24x higher throughput than Hugging Face Transformers, according to the vLLM authors
- Continuous batching: Efficient request handling
- OpenAI-compatible API: Drop-in replacement
Option 3: Ollama for Local Deployment
For personal/team use:
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./merged-model
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant fine-tuned for..."
EOF

# Create and run
ollama create my-model -f Modelfile
ollama run my-model
```
Hands-On Exercises: Learn by Doing
Theory is essential, but fine-tuning is a craft you learn by doing. Here are three progressively challenging exercises.
Exercise 1: Your First LoRA Fine-tune (Beginner)
Goal: Fine-tune a small model on a simple task to understand the end-to-end process.
Setup: You’ll need a Google Colab account (free tier works) or a machine with at least 8GB VRAM.
Try It Yourself:
```python
# Step 1: Install dependencies
!pip install transformers peft datasets accelerate bitsandbytes trl

# Step 2: Load a small model (TinyLlama 1.1B)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Step 3: Apply LoRA
lora_config = LoraConfig(
    r=8,  # Start small
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Step 4: Create a tiny dataset (just 10 examples to start)
train_data = [
    {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Translate to French", "input": "Goodbye", "output": "Au revoir"},
    # Add 8 more examples...
]

# Step 5: Train for just 100 steps (proof of concept)
# ... (full training code in deliverable)
```
What to observe:
- How many trainable parameters vs total parameters?
- How does training loss decrease?
- Can the model now translate words it saw in training? What about new words?
Success Criteria: Model loss decreases during training. Model can reproduce trained translations.
Exercise 2: Compare LoRA Configurations (Intermediate)
Goal: Understand how LoRA hyperparameters affect results.
Your Turn:
Create three different LoRA configurations and compare them:
| Config | Rank (r) | Alpha | Target Modules | Expected Effect |
|---|---|---|---|---|
| A | 4 | 8 | q_proj, v_proj | Fast, limited capacity |
| B | 16 | 32 | q_proj, k_proj, v_proj, o_proj | Balanced |
| C | 64 | 128 | All linear layers | Slow, high capacity |
Experiment:
- Fine-tune each configuration on the same dataset
- Record training time, final loss, and GPU memory usage
- Evaluate on held-out test examples
- Compare: Which gives best quality per training hour?
Questions to answer:
- Does higher rank always mean better quality?
- At what point do diminishing returns set in?
- Which target modules matter most?
Exercise 3: Production-Ready Fine-tuning (Advanced)
Goal: Execute a complete fine-tuning pipeline suitable for production deployment.
Hands-On Exercise:
Create a complete pipeline that includes:
1. Data preparation script that:
   - Loads raw data from JSON or CSV
   - Cleans and validates examples
   - Formats for chat template
   - Creates train/validation/test splits (80/10/10)
   - Saves processed dataset to disk
2. Training script with:
   - Configurable hyperparameters (via YAML or argparse)
   - Automatic checkpoint saving
   - WandB or MLflow logging
   - Early stopping based on validation loss
   - Gradient accumulation for larger effective batch sizes
3. Evaluation script that:
   - Compares base model vs fine-tuned model
   - Calculates perplexity on test set
   - Runs qualitative evaluation on sample prompts
   - Generates comparison report (Markdown)
4. Deployment script that:
   - Merges LoRA adapters into base model
   - Quantizes final model to INT4
   - Exports for vLLM or Ollama
   - Validates inference works correctly
Your deliverable: A complete, documented fine-tuning toolkit.
Time Estimate: 4-6 hours
The Psychology of Learning Rate Selection
One of the most common questions in fine-tuning is “what learning rate should I use?” The answer reveals something deep about how neural networks learn.
Think of learning rate like the volume knob on how much the model “listens” to each training example. Too high, and the model over-reacts to every example, becoming unstable. Too low, and the model barely changes, wasting compute on imperceptible updates.
For full fine-tuning, learning rates are typically tiny: 1e-5 to 1e-6. Why? Because the model already works well—you’re making surgical adjustments, not rebuilding it.
For LoRA, you can use much higher learning rates, typically 1e-4 to 2e-4. In the standard LoRA initialization, one adapter matrix (B) starts at zero and the other (A) starts with small random values, so their product is zero and the adapter has no effect at the start of training. This “blank slate” can absorb aggressive updates without destabilizing the frozen base model.
Did You Know? LoRA multiplies the adapter update by a scaling factor of α/r. With r=16 and α=32, the update is scaled by 2, which acts much like raising the learning rate for the adapter weights. The LoRA paper notes that, when optimizing with Adam, tuning α is roughly the same as tuning the learning rate. This is one reason you might see different optimal learning rates reported for different rank and alpha configurations.
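To make the initialization and the α/r scaling concrete, here is a minimal pure-Python sketch of a LoRA forward pass, h = W x + (α/r) · B(A x). The toy sizes and the helper names `matvec` and `lora_forward` are assumptions for illustration; real implementations use tensor libraries and apply this per target module.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

rng = random.Random(0)
d, r, alpha = 4, 2, 4  # toy sizes; alpha / r = 2
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                               # zero init
x = [1.0, 2.0, 3.0, 4.0]

# With B = 0 the adapter contributes nothing, so the output equals the base model's.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

Because the adapter's contribution is multiplied by α/r, doubling α doubles the size of the adapter's effect on the output for the same A and B.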
Quiz: Test Your Understanding
Q1: When should you use fine-tuning instead of RAG?
Answer
Use fine-tuning when you need to change the model’s behavior or style, not its knowledge:
- Consistent output format
- Domain-specific language/jargon
- New task types
- Speed optimization (no retrieval latency)
- Cost optimization at high volume
Use RAG when you need to add knowledge that changes frequently or is very large.
Q2: Why does LoRA work with such low rank (r=8 or r=16)?
Answer
The weight updates during fine-tuning lie in a low-dimensional subspace. The model doesn’t need to learn entirely new representations — it just needs to adapt existing ones. This adaptation is intrinsically low-rank because:
- The base model already has rich representations
- Fine-tuning tasks share structure with pretraining
- The manifold of “useful adaptations” is low-dimensional
Empirically, small ranks such as r=8-16 recover most of the fine-tuning benefit on many tasks, although the best rank is task-dependent and worth validating on your own data.
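A worked count for a single weight matrix shows why a low rank is so cheap. The sizes d = k = 4096 and r = 16 are assumed here for illustration only.

```latex
\Delta W = B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}

\text{full update: } d \cdot k = 4096 \cdot 4096 = 16{,}777{,}216 \text{ parameters}

\text{LoRA update: } r\,(d + k) = 16 \cdot 8192 = 131{,}072 \text{ parameters}

\frac{d \cdot k}{r\,(d + k)} = \frac{16{,}777{,}216}{131{,}072} = 128
```

Under these assumed sizes, the adapter for that matrix has 128× fewer trainable parameters than a full update.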
Q3: A 7B model has parameters stored in FP16. You apply QLoRA with 4-bit quantization. How much memory is saved?
Answer
Original (FP16): 7B × 2 bytes = 14 GB
QLoRA (4-bit): 7B × 0.5 bytes = 3.5 GB (for base model)
Plus LoRA adapters in FP16: ~50-100 MB
Total: ~3.6 GB vs 14 GB
Savings: 14 - 3.6 = 10.4 GB (~75% reduction)
This is what makes QLoRA trainable on consumer GPUs!
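The same arithmetic can be scripted so you can plug in other model sizes. This is a sketch for weight memory only: the adapter size is an assumed value from the upper end of the estimate in the answer, and activations, gradients, optimizer state, and quantization metadata are not counted.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 7e9
ADAPTER_GB = 0.1  # ASSUMPTION: upper end of the ~50-100 MB adapter estimate

fp16_gb = weight_memory_gb(NUM_PARAMS, 2.0)  # 16 bits = 2 bytes per parameter
int4_gb = weight_memory_gb(NUM_PARAMS, 0.5)  # 4 bits = 0.5 bytes per parameter
qlora_gb = int4_gb + ADAPTER_GB
saved_gb = fp16_gb - qlora_gb

print(f"FP16: {fp16_gb:.1f} GB, QLoRA: {qlora_gb:.1f} GB")
print(f"Saved: {saved_gb:.1f} GB ({saved_gb / fp16_gb:.0%})")
```

Actual training memory will be higher than the weight figure, so treat this as a lower bound when choosing a GPU.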
Q4: Your fine-tuned model achieves low training loss but outputs training examples verbatim during inference. What’s happening and how do you fix it?
Answer
This is overfitting — the model memorized training data instead of learning the underlying patterns.
Fixes:
- More diverse training data: Add variations, paraphrases
- Fewer epochs: Stop earlier (use validation loss)
- Higher dropout: Increase `lora_dropout` to 0.15-0.2
- Weight decay: Add `weight_decay=0.01` to training args
- Early stopping: Stop when eval loss starts increasing
- Regularization: Consider adding KL divergence from base model
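The early-stopping fix can be sketched in a few lines. The class below is a hypothetical, framework-free illustration of the idea (stop once validation loss has not improved for `patience` consecutive evaluations); training libraries typically ship their own callback for this, which you would use in practice.

```python
class EarlyStopper:
    """Signal a stop once validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss  # improvement: remember it and reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1        # no improvement at this evaluation
        return self.bad_evals >= self.patience

# Hypothetical validation losses: improving at first, then rising as overfitting sets in.
val_losses = [1.90, 1.42, 1.31, 1.35, 1.40, 1.52]
stopper = EarlyStopper(patience=2)
stopped_at = None
for step, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break
print(f"stopped at evaluation {stopped_at}, best loss {stopper.best_loss}")
```

In a real run you would also keep the checkpoint from the best evaluation, since the weights at the stopping point are already slightly overfit.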
Q5: You’re fine-tuning for a customer service chatbot. The model keeps forgetting general knowledge (like basic math). What went wrong?
Answer
This is catastrophic forgetting — the model lost general capabilities while learning the new task.
Solutions:
- Use LoRA instead of full fine-tuning: Keeps base weights frozen
- Mix in general data: Add 10-20% general instruction data to your training set
- Lower learning rate: Reduces how much weights change
- Fewer epochs: Less time to forget
- Larger model: Bigger models are more resistant to forgetting
LoRA naturally prevents most forgetting since only the small adapter weights are modified.
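Mixing in general data is straightforward to script. The sketch below assumes both datasets are plain Python lists of examples; `mix_datasets` and the placeholder records are hypothetical names used only for illustration.

```python
import random

def mix_datasets(task_examples, general_examples, general_frac=0.15, seed=0):
    """Return task data plus enough general data to make up `general_frac` of the total."""
    if not 0.0 <= general_frac < 1.0:
        raise ValueError("general_frac must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_frac for n_general.
    n_general = round(len(task_examples) * general_frac / (1.0 - general_frac))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed

# Hypothetical placeholder data.
task = [{"source": "support", "id": i} for i in range(850)]
general = [{"source": "general", "id": i} for i in range(5000)]
mixed = mix_datasets(task, general, general_frac=0.15)
share = sum(ex["source"] == "general" for ex in mixed) / len(mixed)
print(f"{len(mixed)} examples, {share:.0%} general")
```

The fraction is defined relative to the final mixed set, so 850 task examples at 15% general data yields 150 general examples, not 127.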
Interview Prep: What You’ll Be Asked
Fine-tuning questions come up frequently in ML engineering interviews. Here’s what to expect.
Common Interview Questions
Q: “Explain LoRA to a product manager.”
Strong Answer: “LoRA is like adding a sticky note to a textbook instead of rewriting the whole book. The textbook (the original model) stays intact—we just add small, focused notes (adapters) that modify how the model responds to certain topics. This makes customization 100× cheaper and lets us maintain multiple specialized versions easily.”
Q: “When would you NOT use fine-tuning?”
Strong Answer: “I’d avoid fine-tuning in three scenarios:
- When the knowledge changes frequently (use RAG instead)
- When I have less than 100 high-quality examples (use few-shot prompting)
- When the base model already does the task well (optimize prompts first)
Fine-tuning makes sense when you need consistent behavior changes, domain-specific language, or cost optimization at high volume.”
Q: “Your fine-tuned model is worse than the base model on general tasks. Why?”
Strong Answer: “This is likely catastrophic forgetting. The model over-specialized on the new task and lost general capabilities. Solutions include: using LoRA instead of full fine-tuning, mixing 10-20% general data into training, using a lower learning rate, or training for fewer epochs. LoRA naturally prevents most forgetting since it keeps the base weights frozen.”
Q: “How would you evaluate a fine-tuned model?”
Strong Answer: “I’d use a three-part evaluation:
- Automated metrics: Perplexity on held-out data, task-specific metrics (F1, BLEU, etc.)
- A/B comparison: Side-by-side evaluation of base vs fine-tuned on representative prompts
- Safety checks: Test for new failure modes, biases, and harmful outputs that might have emerged
I’d also compare against few-shot prompting—if that performs similarly, fine-tuning wasn’t worth the effort.”
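For the automated-metrics part of that answer, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from each model on the same held-out text; the numbers shown are made up for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all evaluated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical per-token natural-log probabilities from two models on the same text.
base_log_probs = [math.log(p) for p in (0.20, 0.10, 0.25, 0.05)]
tuned_log_probs = [math.log(p) for p in (0.50, 0.40, 0.60, 0.30)]

print(f"base perplexity:       {perplexity(base_log_probs):.2f}")
print(f"fine-tuned perplexity: {perplexity(tuned_log_probs):.2f}")
```

Lower is better, and the comparison is only meaningful when both models are scored on the same text with the same tokenizer.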
Red Flags in Interviews
Avoid these common mistakes:
- Saying “I always use rank 16” (should be task-dependent)
- Ignoring data quality in favor of data quantity
- Not mentioning evaluation before deployment
- Forgetting to discuss catastrophic forgetting risks
- Claiming fine-tuning “adds knowledge” (it changes behavior, RAG adds knowledge)
Summary
You’ve learned:
- When to fine-tune vs RAG vs prompting — behavior change needs fine-tuning
- LoRA decomposes weight updates into low-rank matrices (128× parameter reduction)
- QLoRA adds 4-bit quantization (75% memory reduction)
- Data quality is everything — clean, diverse, properly formatted
- Evaluation must compare fine-tuned vs base vs few-shot
- Common pitfalls: forgetting, overfitting, wrong templates
- Deployment options: HF endpoints, vLLM, Ollama
The key insight: Fine-tuning is now accessible to everyone. With QLoRA, you can fine-tune a 7B model on a single gaming GPU in a few hours for a few dollars. What once required million-dollar budgets and data center hardware is now within reach of individual developers and small teams. This democratization is transforming how we build AI applications—custom, specialized models are no longer the privilege of large tech companies.
Further Reading
Essential Resources
- LoRA Paper: “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., 2021)
- QLoRA Paper: “QLoRA: Efficient Finetuning of Quantized LLMs” (Dettmers et al., 2023)
- Hugging Face PEFT: Official documentation
- TRL Library: For SFT, RLHF, DPO
Advanced Topics
- DoRA: Weight-Decomposed Low-Rank Adaptation (2024), which separates magnitude and direction for better fine-tuning quality
- LongLoRA: Efficient fine-tuning for long contexts — enables training on 100K+ token sequences without quadratic attention cost
- NEFTune: Noisy embedding fine-tuning — adds noise to embeddings during training for surprisingly better generalization
- ORPO: Odds Ratio Preference Optimization — combines SFT and preference learning into a single training phase, simpler than DPO
Did You Know? The field of parameter-efficient fine-tuning is moving so fast that by the time you read this, new methods will have emerged. In 2024 alone, we saw DoRA, PiSSA, LoRA+, and dozens of variations. The fundamental insight remains constant: neural network adaptations lie in low-dimensional subspaces. But the optimal way to exploit this insight keeps evolving.
Next Steps
Move on to Module 33: Diffusion Models & Image Generation where you’ll learn:
- How Stable Diffusion works
- DDPM and DDIM schedulers
- LoRA for image models
- Text-to-image from scratch
Or explore the deliverable to:
- Fine-tune an open-source Llama model on a custom dataset
- Compare LoRA ranks and configurations
- Evaluate fine-tuned models
- Calculate cost/benefit
Last updated: 2025-11-27 Status: Complete