Single-GPU Local Fine-Tuning
Цей контент ще не доступний вашою мовою.
AI/ML Engineering Track | Complexity:
[MEDIUM]| Time: 2-3 hours
Reading Time: 2-3 hours
Prerequisites: Fine-tuning LLMs, LoRA and Parameter-Efficient Fine-Tuning, Home AI Workstation Fundamentals, and Reproducible Python, CUDA, and ROCm Environments
Learning Outcomes
Section titled “Learning Outcomes”By the end of this module, you will be able to:
- Evaluate whether a local single-GPU fine-tuning project is the right solution compared with prompting, retrieval, or data cleanup.
- Design a constrained LoRA or QLoRA experiment that fits real VRAM, storage, dataset, and evaluation limits instead of relying on guesswork.
- Debug common failure modes in local fine-tuning runs, including out-of-memory errors, unstable loss, inconsistent outputs, and untraceable checkpoints.
- Compare baseline and tuned model behavior using a small held-out evaluation set and a decision rule for keeping or discarding the adapter.
- Package the final training artifacts so another engineer can reproduce the model choice, data split, configuration, and measured outcome.
These outcomes deliberately focus on judgment and engineering practice, not on memorizing fine-tuning terminology. A learner who can run a command but cannot explain why the run should exist is not ready to own a fine-tuning workflow. A learner who can choose the right adaptation method, reserve an evaluation set, interpret failure, and preserve evidence is much closer to practitioner readiness.
Why This Module Matters
Section titled “Why This Module Matters”A machine learning engineer inherits a local workstation with one consumer GPU and a request that sounds simple: “Make the model follow our support format better.” The team has already tried prompt templates, but responses still drift under pressure. Someone proposes fine-tuning because it sounds more permanent than prompting, while someone else warns that local training is a distraction from building a better retrieval system.
The engineer cannot solve this by enthusiasm. A single GPU can support useful adaptation, but it also exposes every weak assumption in the workflow. If the base model is too large, the run fails before learning anything. If the dataset mixes unrelated tasks, the adapter learns noise. If evaluation is just a few remembered prompts, the team may ship a model that only feels better during a demo.
This module teaches single-GPU fine-tuning as a disciplined engineering loop. The point is not to turn a home workstation into a research cluster. The point is to build a narrow, evidence-driven experiment where every choice has a reason: model size, PEFT method, data format, validation split, checkpoint policy, and runtime evaluation.
The senior-level skill is knowing when to stop. Many failed tuning projects do not fail because LoRA is weak or the GPU is small. They fail because the team tried to use fine-tuning for changing facts, used data that did not demonstrate the desired behavior, or kept training after evaluation had already shown no useful gain.
1. Frame the Decision: Fine-Tune, Retrieve, or Prompt?
Section titled “1. Frame the Decision: Fine-Tune, Retrieve, or Prompt?”Single-GPU fine-tuning is best understood as behavior adaptation under resource limits. The model already has broad language ability, and your job is to adjust how it behaves on a narrow pattern. You are not teaching it a large new body of facts, and you are not replacing a production knowledge system with frozen weights.
The first decision is whether the problem belongs in model weights at all. Fine-tuning can improve response format, task consistency, tone, classification discipline, or the ability to imitate a repeated workflow. Retrieval is better when the model needs fresh or source-grounded information. Prompting is better when the behavior can be controlled reliably through instructions without changing model parameters.
A useful test is to ask what should happen when the source material changes. If the answer is “the model should immediately use the new document,” fine-tuning is probably the wrong first tool. If the answer is “the model should keep using the same response structure, judgment rubric, or style across many examples,” a local adapter may be worth testing.
+---------------------------+-------------------------------+-------------------------------+| Problem Signal | Better First Tool | Why It Fits |+---------------------------+-------------------------------+-------------------------------+| Fresh facts change often | Retrieval | Knowledge stays external || Prompt works most times | Prompt engineering | Lowest cost and fastest loop || Format drifts repeatedly | LoRA or QLoRA fine-tuning | Behavior pattern can be shown || Task is broad and vague | Task redesign | Training signal is unclear || Base model lacks ability | Smaller scope or new model | Adapter cannot add everything |+---------------------------+-------------------------------+-------------------------------+For example, a support team may want strict JSON priority labels. If the input is a ticket description and the output always follows one schema, fine-tuning can reinforce that behavior. If the team instead wants the model to answer every support policy question from a changing handbook, retrieval should carry the knowledge and the model should synthesize from retrieved context.
Stop and think: Your team asks for fine-tuning because “the model does not know our procedures.” Are the procedures stable behavior patterns, such as how to classify tickets, or changing facts, such as the current escalation roster? Write down which part belongs in weights and which part belongs outside the model.
The distinction matters because a fine-tuned model can become confidently stale. When you train on last quarter’s policy text, you create a model that may speak as if old policy is current. A retrieval system can swap documents without retraining, while a fine-tuned adapter requires another training and evaluation cycle before the change is trustworthy.
Prompting still deserves respect in this decision. A strong prompt with examples may solve the problem without any training run, especially when the task is simple and latency is acceptable. Local fine-tuning becomes more attractive when prompt examples are too long, behavior must stay consistent across many calls, or the model repeatedly ignores a compact instruction that appears clearly in the prompt.
Single-GPU constraints sharpen this decision. You have less room for waste, so the task must be narrow enough for a small model, small adapter, and small dataset to reveal signal. A project that cannot be stated in one sentence is usually not ready for local fine-tuning. The best first project sounds like “return incident severity as strict JSON from short ticket descriptions,” not “make a general company assistant.”
A senior engineer also considers ownership and lifecycle. A fine-tuned adapter becomes an artifact that someone must store, name, test, document, and eventually retire. If the organization cannot keep track of the dataset, base model, license, evaluation prompts, and deployment runtime, then the training run creates operational debt even if the demo looks impressive.
Use this module’s workflow as a gate. If you cannot define a stable target behavior, choose a fitting base model, inspect the data manually, and compare baseline versus tuned outputs on held-out prompts, do not start training yet. The disciplined pause is part of the engineering work, not a delay from it.
Fine-Tuning Fit Matrix
Section titled “Fine-Tuning Fit Matrix”| Scenario | Fine-Tune Locally? | Better Move | Decision Rationale |
|---|---|---|---|
| Strict JSON output from short support tickets | Usually yes | LoRA on a small instruct model | The behavior is narrow, repeated, and easy to evaluate |
| Monthly employee handbook questions | Usually no | Retrieval over current documents | The knowledge changes and needs source grounding |
| Legal summaries across many document types | Not as a first pass | Scope one format or build retrieval plus review | The task is broad and risk is high |
| Tone adaptation for internal release notes | Possibly yes | Compare prompting against LoRA | The style may be learnable if examples are consistent |
| Teaching a model an entire product catalog | Usually no | Retrieval or database-backed tools | Large factual inventory should not be frozen into weights |
| Classifying short alerts into fixed labels | Usually yes | LoRA or QLoRA with held-out checks | Labels, inputs, and evaluation can be tightly controlled |
This matrix is not a rulebook; it is a way to expose the real reason behind the request. If the reason is “we need current facts,” avoid local fine-tuning as the primary mechanism. If the reason is “we need the model to behave consistently on a repeated input-output pattern,” a small local experiment can be justified.
2. Size the Experiment Before Writing Code
Section titled “2. Size the Experiment Before Writing Code”The most common beginner mistake is choosing the base model before understanding the hardware. Single-GPU fine-tuning is not a moral contest about using the largest model. It is an optimization problem where VRAM, sequence length, batch size, quantization, optimizer state, and adapter settings interact.
VRAM pressure comes from more than the base model weights. Training also needs activations, gradients for trainable parameters, optimizer state, temporary buffers, and memory fragmentation headroom. PEFT reduces the trainable parameter count, but it does not make every model magically fit into every card.
A practical single-GPU plan starts with measurement. Check the GPU name, total memory, free memory, driver stack, and disk space before picking a model. Also confirm whether other desktop or notebook processes are using VRAM, because a local browser, display server, or previous Python process can be the difference between a stable run and an out-of-memory crash.
nvidia-smi --query-gpu=name,memory.total,memory.free,driver_version --format=csvdf -h .For AMD systems, the exact toolchain differs, and library support may vary by operating system and package versions. The engineering principle is the same: record the accelerator, memory budget, driver/runtime stack, and training framework before claiming the experiment is reproducible. The adapter is not the whole artifact; the environment is part of the result.
A conservative sizing rule is to start smaller than your ego wants. If the task can be tested on a sub-billion or low-billion parameter instruct model, start there and prove the workflow. Once the data, evaluation, and artifact policy are correct, you can consider a larger base model if the observed limitation is model capability rather than workflow quality.
+-------------------+ +----------------------+ +------------------+| Hardware Budget | ----> | Model + PEFT Choice | ----> | Training Config || GPU VRAM | | base model size | | sequence length || disk space | | LoRA or QLoRA | | batch strategy || driver/runtime | | quantization support | | checkpoint limit |+-------------------+ +----------------------+ +------------------+ | | | v v v+-------------------+ +----------------------+ +------------------+| Failure Signal | | Adjustment | | Evidence Kept || OOM, slow run | ----> | smaller model, QLoRA | ----> | config + logs || unstable outputs | | cleaner data, eval | | baseline/tuned |+-------------------+ +----------------------+ +------------------+LoRA and QLoRA are the default methods for this scale because they keep the base model mostly frozen and train small adapter matrices. LoRA stores low-rank updates in selected layers, commonly attention and projection modules. QLoRA combines LoRA-style adapter training with a quantized base model so the frozen weights require less memory during training.
The trade-off is not only memory. Quantization can change numerical behavior, kernel compatibility, training speed, and deployment assumptions. A model trained or evaluated through one quantized path may not behave identically when moved to a different runtime. That is why the evaluation runtime should match the intended use as closely as practical.
Stop and think: Suppose two configurations both fit in VRAM. One uses a larger quantized model with fragile runtime support, while the other uses a smaller full-precision or half-precision model with stable tooling. Which one would you choose for a first reproducible experiment, and what evidence would change your mind?
Sequence length is another hidden budget. A short classification prompt and output can use a small maximum sequence length, while long summarization examples may require far more memory. Beginners often copy a high sequence length from a tutorial and then blame the GPU when the real issue is that the task does not need that much context.
Batch size should be treated as an effective batch, not only a per-device value. If VRAM allows only per_device_train_batch_size: 1, gradient accumulation can approximate a larger effective batch while keeping memory lower. This slows wall-clock training, but it may be the difference between a controlled experiment and a failed launch.
Checkpointing policy also affects resource use. Saving every few steps is useful for debugging but quickly creates storage clutter. For a first local adapter, keep a small save_total_limit, record the final adapter, and preserve the config. You need enough artifacts to reproduce and compare, not a folder full of unlabeled intermediate states.
Constraint Planning Table
Section titled “Constraint Planning Table”| Constraint | What It Controls | First Adjustment | Senior-Level Check |
|---|---|---|---|
| VRAM | Base model size, quantization, batch, sequence length | Choose a smaller model or use QLoRA | Confirm the run has headroom, not just a lucky start |
| Dataset size | Signal strength, overfitting risk, inspection burden | Start small and manually review examples | Keep a held-out split with realistic edge cases |
| Sequence length | Activation memory and speed | Match task length instead of copying defaults | Measure prompt and output lengths from real examples |
| Disk space | Checkpoints, merged models, logs, cache | Limit checkpoints and name artifacts clearly | Preserve only artifacts tied to a known config |
| Runtime support | Compatibility for quantization and kernels | Use a proven stack for your GPU | Evaluate in the runtime you expect to deploy |
| Time budget | Iteration speed and debugging feedback | Run short experiments first | Stop when evidence says the design is wrong |
The table should influence the plan before any training command is written. If you cannot explain which constraint is tightest, you are not sizing the experiment; you are hoping. Hope is expensive when a run takes hours and produces artifacts nobody can interpret.
3. Prepare Data That Teaches One Behavior
Section titled “3. Prepare Data That Teaches One Behavior”Fine-tuning quality is usually data quality wearing a training costume. A local adapter learns from the examples you provide, including contradictions, inconsistent tone, malformed outputs, and hidden shortcuts. If the dataset is unclear, the model may still reduce training loss while becoming less useful in the real task.
The first dataset should be narrow enough that a human can inspect every example. This is not a permanent ceiling; it is a learning strategy. Manual inspection teaches you what the model will see, reveals inconsistent labels, and prevents the common mistake of treating a large JSONL file as automatically valuable.
A good example has three properties. The instruction is stable, the input resembles real use, and the output demonstrates exactly the behavior you want repeated. If the output schema varies from example to example, the model receives a mixed signal. If the instruction changes wording unnecessarily, the model may learn superficial phrasing instead of the task.
mkdir -p data evalcat > data/train.jsonl <<'EOF'{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Checkout fails for every customer after the latest payment deployment.","output":"{\"priority\":\"critical\",\"reason\":\"customer checkout is unavailable\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"One analyst reports that an internal dashboard widget loads slowly.","output":"{\"priority\":\"low\",\"reason\":\"limited internal impact\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Password reset emails are not being delivered to customers.","output":"{\"priority\":\"high\",\"reason\":\"account recovery is impaired\"}"}EOFThis example is intentionally small, but it shows the structure. Each row teaches the same behavior: read a short ticket and return a strict JSON object with priority and reason. It does not mix summarization, apology writing, retrieval, and classification in the same file.
The evaluation set must be separate from training. If you train and evaluate on the same examples, you measure memorization and formatting familiarity rather than generalization to similar cases. A small held-out set is enough for a first local experiment, but it must contain prompts that represent the behavior you actually care about.
cat > eval/check.jsonl <<'EOF'{"instruction":"Classify the support ticket priority and return strict JSON.","input":"The public status page is unreachable during an active incident.","expected":"{\"priority\":\"high\",\"reason\":\"incident communication is impaired\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"A typo appears in one paragraph of an internal help article.","expected":"{\"priority\":\"low\",\"reason\":\"minor documentation issue\"}"}EOFThe held-out examples should not be trick questions, but they should test the same judgment the production workflow needs. If the model only succeeds on obvious outage examples and fails on degraded communication, the evaluation should reveal that. Evaluation is not a celebration step after training; it is the mechanism that protects the team from believing a weak adapter.
Stop and think: If you add an example where the output is a friendly paragraph instead of strict JSON, what behavior might the model learn? Before reading further, write one failure mode you would expect to see in the tuned model’s responses.
A senior data preparation habit is to write a dataset contract. The contract states what every example must contain, what output shape is valid, which labels are allowed, and what examples are out of scope. This prevents silent drift when another teammate contributes new rows.
task: support_ticket_priority_jsoninstruction: "Classify the support ticket priority and return strict JSON."allowed_priorities: - critical - high - medium - lowrequired_output_fields: - priority - reasonout_of_scope: - policy questions requiring current documentation - long incident retrospectives - multi-turn conversations - examples without a clear business impact signalData cleaning should be concrete. Validate JSONL parsing, check duplicate inputs, verify label values, and sample outputs for schema consistency. You do not need a heavyweight data platform for the first run, but you do need enough checks to catch mistakes before they enter the training loop.
.venv/bin/python - <<'PY'import jsonfrom pathlib import Path
allowed = {"critical", "high", "medium", "low"}seen_inputs = set()
for path in [Path("data/train.jsonl"), Path("eval/check.jsonl")]: for line_number, line in enumerate(path.read_text().splitlines(), start=1): row = json.loads(line) for field in ["instruction", "input"]: if not row.get(field): raise ValueError(f"{path}:{line_number} missing {field}") output_field = "output" if path.name == "train.jsonl" else "expected" parsed_output = json.loads(row[output_field]) if parsed_output["priority"] not in allowed: raise ValueError(f"{path}:{line_number} invalid priority") if row["input"] in seen_inputs: raise ValueError(f"{path}:{line_number} duplicate input") seen_inputs.add(row["input"])
print("dataset contract checks passed")PYThe validation script is small, but the habit is important. It turns vague trust into a repeatable gate. When the model performs poorly, you can investigate data quality with evidence instead of wondering whether the JSONL file was malformed all along.
For local fine-tuning, small datasets can work when the target behavior is narrow and the base model is already capable. The adapter is nudging behavior, not teaching language from scratch. However, small does not mean careless. Ten excellent examples may be useful for a smoke test, but a production candidate needs broader coverage of normal cases, edge cases, and negative cases.
A negative case is an example that shows what not to overreact to. In the support priority task, not every customer-facing issue is critical, and not every internal complaint is low priority. Including realistic contrast helps the model learn the decision boundary instead of mapping dramatic words to the highest label.
The most valuable data review question is “Would two domain experts agree on this output?” If the answer is no, the model is being asked to learn ambiguous policy. Fix the rubric before training. Fine-tuning amplifies your examples; it does not adjudicate vague business rules.
4. Configure a Local LoRA or QLoRA Run
Section titled “4. Configure a Local LoRA or QLoRA Run”A training configuration is an engineering document, not just a set of knobs. It records the assumptions of the experiment in a form that can be reviewed, repeated, and compared. The same adapter without its configuration is like a binary without the source version that produced it.
The minimum useful configuration names the base model, PEFT method, LoRA rank, alpha, dropout, target modules, sequence length, batch settings, learning rate, epoch count, checkpoint policy, dataset paths, and output directory. You may not tune every value at first, but you should know which values exist and why the defaults are acceptable.
mkdir -p artifacts logscat > artifacts/config.yaml <<'EOF'base_model: HuggingFaceTB/SmolLM2-135M-Instructpeft_method: loraquantization: nonelora: r: 8 alpha: 16 dropout: 0.05 target_modules: - q_proj - v_projtraining: max_seq_length: 512 per_device_train_batch_size: 1 gradient_accumulation_steps: 8 learning_rate: 0.0002 num_train_epochs: 2 save_total_limit: 2paths: train_file: data/train.jsonl eval_file: eval/check.jsonl output_dir: artifacts/adapterEOFThis configuration chooses a very small instruct model for workflow demonstration, not because it is the best production choice. That distinction matters. A tiny model lets you test data formatting, adapter saving, evaluation comparison, and artifact hygiene quickly. If the workflow is broken on a tiny model, a larger model will only make the failure slower and more expensive.
LoRA rank controls the capacity of the adapter. A higher rank can express more change but costs more memory and may overfit small data. Alpha scales the adapter updates, and dropout can reduce overfitting pressure. The right values depend on model, data, and task, but the first run should favor stability and interpretability over aggressive tuning.
Target modules decide where adapters are inserted. Many transformer architectures use projection module names such as q_proj and v_proj, but names vary across model families. A practical engineer verifies module names instead of assuming a tutorial’s names match the selected base model.
.venv/bin/python - <<'PY'from transformers import AutoModelForCausalLM
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"model = AutoModelForCausalLM.from_pretrained(model_name)
matches = []for name, module in model.named_modules(): if name.endswith(("q_proj", "v_proj", "k_proj", "o_proj")): matches.append(name)
for name in matches[:40]: print(name)print(f"matched projection modules: {len(matches)}")PYIf this script shows no matching modules, do not blindly run training. Inspect the model architecture and adjust target_modules to real layer names. A configuration that names nonexistent modules may fail immediately, attach adapters somewhere unintended, or depend on library behavior you have not reviewed.
QLoRA enters when the base model would be too memory-heavy without quantization. The frozen base model is loaded in lower precision, while trainable LoRA adapters remain small. This can make a larger model practical on one GPU, but it adds compatibility questions around bitsandbytes, hardware support, kernels, and deployment runtime.
Use QLoRA when it solves a measured memory problem, not because it sounds more advanced. If a small model trains cleanly without quantization, a first workflow demonstration may be easier to debug without QLoRA. Once the process is sound, quantization becomes a controlled change rather than another variable mixed into an already uncertain run.
A training command should be captured exactly. The command, package versions, GPU state, dataset checksum, and config belong in the experiment notes. If you cannot reproduce the command later, you cannot responsibly compare adapters or explain why one run improved.
{ date nvidia-smi --query-gpu=name,memory.total,memory.free,driver_version --format=csv .venv/bin/python -m pip freeze | sed -n '1,120p' sha256sum data/train.jsonl eval/check.jsonl artifacts/config.yaml} > logs/environment.txtThis log is not glamorous, but it answers questions that matter during review. Did the dataset change? Did the package stack change? Was the GPU memory already crowded? These details often explain differences that otherwise look like model randomness.
Training loss should be interpreted carefully. A lower loss on a tiny dataset may mean the adapter memorized the examples. It does not automatically mean the model improved on held-out prompts. Treat loss as a debugging signal, then let baseline-versus-tuned evaluation answer the product question.
Out-of-memory errors should be handled by reducing the pressure source. Lower sequence length, reduce per-device batch size, use gradient accumulation, choose a smaller model, enable quantization if supported, or close other GPU processes. Do not respond by randomly changing many settings at once, because then you cannot tell which adjustment fixed the problem.
+------------------------------+| OOM During Local Fine-Tuning |+------------------------------+ | v+------------------------------+| Did the model ever load? |+------------------------------+ | yes | no v v+------------------+ +--------------------------+| Reduce sequence | | Use smaller base model || length or batch | | or supported quantization|+------------------+ +--------------------------+ | v+------------------------------+| Still failing after changes? |+------------------------------+ | yes | no v v+------------------+ +--------------------------+| Reduce model or | | Record final memory and || change runtime | | keep the config fixed |+------------------+ +--------------------------+The troubleshooting flow intentionally changes one class of constraint at a time. If you lower sequence length, change model size, alter LoRA rank, switch precision, and modify the dataset in one edit, the next result cannot teach you much. Engineering progress comes from controlled changes.
A local notebook can be useful for exploration, but the final experiment should be scriptable. Scripts make it easier to rerun, review, and automate. They also reduce hidden state from notebook cells executed out of order, which is a common source of misleading fine-tuning results.
For team settings, write the training config so a reviewer can read it without running anything. The config should answer “what is being adapted, with which data, under which resource assumptions, and where will the adapter land?” If those answers are scattered across shell history, notebook cells, and memory, the experiment is not reviewable.
5. Worked Example: Turning Messy Support Notes into a Single-GPU Plan
Section titled “5. Worked Example: Turning Messy Support Notes into a Single-GPU Plan”This worked example demonstrates the reasoning process before the hands-on exercise asks you to run a similar workflow. The input is deliberately messy because real requests rarely arrive as clean training specifications. The goal is to transform an ambiguous request into a narrow, testable local fine-tuning plan.
Input Scenario
Section titled “Input Scenario”A support operations lead says: “We want a local model that reads support tickets and writes the right response. The model should know our support priorities, sound friendly, summarize the issue, decide urgency, and include escalation instructions. We have one workstation GPU and a folder of old tickets.”
A beginner might turn the whole folder into training data and hope the model learns everything at once. A stronger engineer slows down and separates the request into tasks. Some parts are stable behavior, some parts may depend on current policy, and some parts are too broad for a first adapter.
Step 1: Split the Request by Learning Target
Section titled “Step 1: Split the Request by Learning Target”The request contains at least four different behaviors. It asks for prioritization, tone, summarization, and escalation guidance. Those behaviors can interfere with each other if they appear in one small dataset without a clear output contract.
| Requested Behavior | Local Fine-Tuning Fit | Reason |
|---|---|---|
| Decide priority from a short ticket | Good first candidate | Fixed labels and clear evaluation are possible |
| Sound friendly in response text | Possible later | Tone can be learned, but it complicates output scoring |
| Summarize long issue history | Risky first candidate | Longer context and subjective evaluation add complexity |
| Include escalation instructions | Usually retrieval or rules | Current procedures may change and need grounding |
The first local experiment should focus on ticket priority classification with strict JSON output. That task is narrow, has measurable outputs, and can use short examples. Escalation instructions should stay outside the adapter unless the policy is stable and deliberately included in the task contract.
Step 2: Define the Output Contract
Section titled “Step 2: Define the Output Contract”The output contract reduces ambiguity. Instead of “write the right response,” the model must return a JSON object with a priority and reason. This makes baseline comparison easier because the evaluation can check parseability, allowed labels, and judgment quality.
{ "priority": "critical | high | medium | low", "reason": "short business-impact reason"}The reason field should be short because the target behavior is classification discipline, not long-form support writing. If the team later wants a friendly customer response, that should be a second experiment or a downstream templating step. Mixing it into the first run would make the evaluation less clear.
Step 3: Choose the Base Model by Hardware Reality
Section titled “Step 3: Choose the Base Model by Hardware Reality”Assume the workstation has one consumer GPU with limited VRAM and no production training cluster. The worked example chooses a small instruct model for the first run to validate the loop. If the adapter improves format adherence but judgment remains weak because the base model lacks capability, the next experiment can test a larger quantized model.
The decision rule is simple: the first run should be small enough to finish, inspect, and discard without drama. A failed tiny run is cheap evidence. A failed oversized run often produces only frustration and unclear logs.
Step 4: Create Training and Evaluation Splits
Section titled “Step 4: Create Training and Evaluation Splits”The training split shows the target behavior. The evaluation split uses similar but unseen tickets. The key is not the number of rows in this example; the key is separation and consistency.
mkdir -p worked-example/data worked-example/evalcat > worked-example/data/train.jsonl <<'EOF'{"instruction":"Classify the support ticket priority and return strict JSON.","input":"All customers receive an error when submitting payment.","output":"{\"priority\":\"critical\",\"reason\":\"payment flow is unavailable\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"A single internal report has a stale chart title.","output":"{\"priority\":\"low\",\"reason\":\"minor internal reporting issue\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"New users cannot complete account verification.","output":"{\"priority\":\"high\",\"reason\":\"new customer onboarding is blocked\"}"}EOF
cat > worked-example/eval/check.jsonl <<'EOF'{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Some invoices are generated two hours late, but checkout still works.","expected":"{\"priority\":\"medium\",\"reason\":\"billing communication is delayed\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"The incident banner is missing from the public status page.","expected":"{\"priority\":\"high\",\"reason\":\"incident visibility is impaired\"}"}EOFNotice what the example does not include. It does not include a customer apology, a policy citation, a multi-turn conversation, or a long retrospective summary. Those may be useful tasks later, but adding them now would blur the training signal.
Step 5: Predict Baseline Weakness Before Training
Section titled “Step 5: Predict Baseline Weakness Before Training”Before tuning, the team runs the base model on the held-out prompts and records outputs. Suppose the base model often returns friendly paragraphs like “This seems important, so you should investigate soon.” That may be semantically reasonable, but it violates strict JSON and does not provide a stable label.
The prediction is that LoRA may improve schema adherence and label consistency if the base model already understands the ticket language. If the base model cannot reason about business impact at all, LoRA on a tiny dataset may not be enough. This prediction gives the team a fair way to interpret the result later.
Step 6: Apply the Training Plan
Section titled “Step 6: Apply the Training Plan”The training plan uses LoRA because the task is narrow and the GPU is constrained. It keeps sequence length modest because tickets are short. It limits checkpoints because the team needs a final adapter and enough logs to reproduce, not dozens of intermediate files.
base_model: HuggingFaceTB/SmolLM2-135M-Instructtask: support_ticket_priority_jsonpeft_method: loralora: r: 8 alpha: 16 dropout: 0.05training: max_seq_length: 512 per_device_train_batch_size: 1 gradient_accumulation_steps: 8 learning_rate: 0.0002 num_train_epochs: 2 save_total_limit: 2evaluation: compare: - json_parse_success - allowed_priority_label - business_impact_reasonThis is a plan, not proof. The adapter still has to be evaluated against the baseline. A disciplined engineer avoids declaring success until the same held-out prompts show better behavior under the same inference conditions.
Step 7: Decide Keep, Iterate, or Discard
Section titled “Step 7: Decide Keep, Iterate, or Discard”After training, the team compares baseline and tuned outputs. If the tuned adapter returns valid JSON with reasonable labels on held-out examples and the baseline did not, keeping the adapter for further testing is justified. If both models behave similarly, the adapter does not earn a place in the workflow.
If the tuned model overfits and labels nearly everything as critical, the next change should probably be data contrast, not more epochs. If the model follows JSON but chooses weak labels, the task rubric or base model may be the limiting factor. If the output improves only in a notebook but fails in the intended runtime, the runtime mismatch must be fixed before any deployment claim.
The lesson from the worked example is that fine-tuning is not one command. It is a chain of decisions that starts with scope, moves through data and constraints, and ends with evidence. The hands-on exercise later asks you to perform the same chain on a small local project.
6. Evaluate, Package, and Decide Whether to Keep It
Section titled “6. Evaluate, Package, and Decide Whether to Keep It”Evaluation is where local fine-tuning becomes engineering instead of folklore. The central question is not “did training finish?” The central question is “does the tuned adapter improve the target behavior on examples it did not train on, in the runtime where we expect to use it?”
Baseline comparison is mandatory. You need outputs from the original model and outputs from the tuned adapter on the same held-out prompts. Without that comparison, you cannot separate true improvement from the base model already being good, prompt variation, or selective memory of a few nice generations.
mkdir -p eval/resultscat > eval/prompts.txt <<'EOF'Classify the support ticket priority and return strict JSON.
Ticket: The public status page is unreachable during an active incident.EOFThe exact inference command depends on your chosen stack, but the output capture pattern should be stable. Save raw outputs, not just your interpretation. Later reviewers need to see what the model actually produced, including malformed JSON, extra commentary, or missing fields.
A small scoring rubric helps avoid vague impressions. For the support ticket example, score whether the output parses as JSON, whether the priority is one of the allowed labels, whether the reason references business impact, and whether the label is defensible. This still involves judgment, but it makes the judgment explicit.
| Evaluation Check | Passing Example | Failing Example | Why It Matters |
|---|---|---|---|
| JSON parseability | {"priority":"high","reason":"incident visibility is impaired"} | This is probably high priority. | Downstream systems need structured output |
| Allowed label | medium | urgent-ish | The task contract defines valid labels |
| Business-impact reason | checkout is unavailable | it sounds bad | The model must explain the classification basis |
| Held-out behavior | Works on unseen tickets | Only repeats training examples | Fine-tuning should generalize within scope |
| Runtime match | Evaluated in deployment runtime | Evaluated only in a different notebook path | Quantization and loading path can affect behavior |
Evaluation should include at least one borderline case. Borderline cases are where weak rubrics fail. For example, a delayed invoice email is not as severe as broken checkout, but it may still matter. If the model treats every customer-facing issue as critical, it is not learning the priority rubric; it is learning alarm words.
Stop and think: Your tuned adapter returns valid JSON on every held-out prompt, but the labels are worse than the base model’s labels. Do you keep the adapter because formatting improved, or discard it because the decision quality regressed? State the decision rule before you look at more examples.
Packaging comes after evaluation, not before it. A good final artifact directory contains the adapter, config, dataset snapshot or reference, evaluation outputs, environment log, and a short decision note. A bad artifact directory contains unlabeled checkpoints, merged models with unknown settings, and no baseline comparison.
find artifacts eval logs -maxdepth 3 -type f | sortdu -sh artifacts eval logsA decision note should be brief and specific. It should say whether the adapter is kept, why, what evidence supports the decision, and what limitation remains. If the adapter is discarded, the note should still be kept because it prevents the same failed path from being repeated later.
cat > logs/decision.md <<'EOF'# Decision
Adapter: artifacts/adapterBase model: HuggingFaceTB/SmolLM2-135M-InstructTask: support ticket priority as strict JSONDecision: keep for further testing or discard
Evidence:- Baseline JSON parse success:- Tuned JSON parse success:- Label quality notes:- Runtime used for comparison:
Remaining risks:- Dataset is small and needs broader edge-case coverage.- Priority rubric should be reviewed by support operations before production use.EOFStorage hygiene is not optional at home scale. Model caches, checkpoints, and merged outputs can fill disks quickly. Keep what is needed to reproduce and compare. Delete throwaway checkpoints only after you have confirmed that the final adapter, config, logs, and evaluation outputs are preserved.
A senior workflow also records licensing and data governance constraints. Before training on internal tickets, confirm that the data is allowed for the selected model license, storage location, and intended use. Remove secrets and personal data unless there is a documented reason and approved handling path. Local does not mean exempt from governance.
The final keep-or-discard decision should be boring. If evidence shows improvement on the target behavior without unacceptable regression, keep the adapter for the next review stage. If evidence is weak, discard or redesign. The point of a controlled local run is not to guarantee success; it is to make the next decision honest.
Did You Know?
Section titled “Did You Know?”- LoRA adapters are usually small compared with full model weights. That is why they are practical to store, move, compare, and discard during local experiments, but they still depend on the exact compatible base model.
- QLoRA reduces memory pressure by quantizing the frozen base model while training adapter weights. This can make larger local experiments possible, but it also makes runtime compatibility and evaluation path more important.
- A held-out evaluation set can be tiny and still valuable. Even a handful of unseen prompts can catch schema drift, label collapse, and overfitting that training loss alone will not reveal.
- A successful fine-tuning run may still be the wrong product decision. If retrieval, rules, or a better prompt solves the problem with less operational risk, the adapter should not become the default just because training completed.
Common Mistakes
Section titled “Common Mistakes”| Mistake | What Goes Wrong | Better Move |
|---|---|---|
| Using fine-tuning to store changing facts | The model becomes stale and may answer old policy as if it were current | Use retrieval for changing documents and reserve fine-tuning for stable behavior |
| Choosing the largest model before measuring VRAM | The run fails, becomes painfully slow, or requires extreme settings that hide other problems | Measure hardware first and choose a model that leaves memory headroom |
| Mixing unrelated tasks in one small dataset | The adapter receives contradictory signals and produces inconsistent outputs | Start with one narrow input-output contract and expand only after evaluation works |
| Evaluating only by intuition | The team keeps an adapter because a few outputs feel better during a demo | Compare baseline and tuned outputs on held-out prompts with explicit checks |
| Training on the same examples used for evaluation | The model appears better because it saw the answers during training | Reserve a separate check set before the first training command |
| Keeping chaotic checkpoint folders | Nobody can map artifacts back to a base model, dataset, or configuration | Limit checkpoints and preserve named configs, logs, and decision notes |
| Changing many settings after a failure | The team cannot identify which adjustment fixed or worsened the run | Change one major constraint at a time and record the reason |
| Evaluating in a different runtime than deployment | Quantization, loading, or inference differences hide practical failures | Compare baseline and tuned behavior in the runtime you intend to use |
Q1. Your team wants a local model to answer employee questions about a handbook that changes every month. A teammate suggests fine-tuning on the latest handbook because the model “needs to know company policy.” What should you recommend, and what evidence from the scenario drives your decision?
Answer
Recommend retrieval as the first design, not local fine-tuning. The key signal is that the handbook changes every month, so the knowledge needs to stay current and source-grounded. Fine-tuning would freeze policy into weights and require repeated retraining whenever the handbook changes. A fine-tuned adapter might still be useful later for response style or formatting, but the changing factual content should live outside the model.
Q2. A learner has one GPU with limited VRAM and starts with a model that barely loads. Training fails with out-of-memory errors after the first few steps. They propose increasing LoRA rank because “the adapter needs more capacity.” What should they debug first?
Answer
They should debug the resource fit before increasing adapter capacity. The first checks are GPU memory, model size, sequence length, per-device batch size, other GPU processes, and whether quantization or a smaller base model is needed. Increasing LoRA rank adds capacity and memory pressure, so it is unlikely to fix an out-of-memory problem. A controlled response would reduce sequence length or batch size, use gradient accumulation, choose a smaller model, or move to a supported QLoRA setup.
Q3. A support team creates a small dataset that mixes ticket priority labels, customer apology paragraphs, long summaries, and escalation instructions. After tuning, the model sometimes returns JSON, sometimes writes a friendly paragraph, and sometimes invents policy. How would you redesign the next experiment?
Answer
Redesign the experiment around one behavior with one output contract. For a first pass, ticket priority classification as strict JSON is a good candidate because labels and parseability can be evaluated. Customer tone, summarization, and escalation policy should be separated into later experiments or handled by templates, retrieval, or rules. The next dataset should use consistent instructions, realistic inputs, valid output labels, and a held-out evaluation split.
Q4. Your tuned adapter returns valid JSON on all held-out prompts, while the base model often adds prose around the JSON. However, the tuned adapter labels several medium incidents as critical. Should the team keep the adapter as a success?
Answer
Not automatically. The adapter improved schema adherence but regressed decision quality, so the team should apply the decision rule defined before evaluation. If correct prioritization is the primary business behavior, the adapter should be discarded or iterated with better contrast examples and a clearer rubric. If formatting is the only target, it may be a partial success, but the scenario says priority labels matter, so the regression is significant.
Q5. A training run completes successfully, and the final folder contains many checkpoints named test, new, final, and final-fixed. There is no saved config, no dataset checksum, and no baseline output. A month later, another engineer asks whether the adapter can be deployed. What is the correct engineering response?
Answer
The adapter should not be treated as deployable evidence. Without the base model reference, training config, dataset version, environment details, and baseline-versus-tuned evaluation outputs, the team cannot reproduce or review the result. The correct response is to rerun or reconstruct the experiment with artifact hygiene: named config, dataset snapshot or checksum, environment log, final adapter, raw evaluation outputs, and a decision note.
Q6. A colleague evaluates a QLoRA-trained adapter in a notebook using one loading path, then plans to deploy it through a different local runtime with different quantization behavior. The notebook outputs look better than baseline. What risk should you raise before approving the result?
Answer
Raise the risk that the evaluation runtime does not match the intended deployment runtime. Quantization and loading paths can affect compatibility, performance, and outputs. The team should compare baseline and tuned behavior in the runtime they plan to use, or at least document why the notebook path is equivalent. A successful notebook result is useful, but it is not enough to approve deployment when the runtime will change.
Q7. During evaluation, the tuned model performs well on obvious outage tickets but fails on borderline cases such as delayed invoices, degraded status-page communication, and limited internal impact. The training set mostly contains severe incidents. What should the next iteration change?
Answer
The next iteration should improve dataset coverage and contrast, not simply train longer. The model is likely learning a shallow association between incident language and high urgency because the training set lacks borderline and negative examples. Add well-reviewed examples for medium, low, and high-but-not-critical cases, clarify the priority rubric, and keep those categories represented in the held-out evaluation set. Training longer on the same skewed data may reinforce the failure.
Hands-On Exercise
Section titled “Hands-On Exercise”Goal: complete a reproducible single-GPU LoRA fine-tuning workflow on a narrow structured-output task, then decide whether the adapter deserves to be kept. This exercise uses a small model and small dataset so you can practice the engineering loop before attempting larger local runs.
The task is support ticket priority classification as strict JSON. That task is intentionally narrow because it lets you test the full workflow: scope, hardware measurement, data contract, baseline capture, adapter training, tuned evaluation, artifact packaging, and keep-or-discard decision.
- Create a clean workspace for one experiment and keep all artifacts under that workspace.
mkdir -p single-gpu-ft/{data,eval,artifacts,logs,scripts}cd single-gpu-ftpwd- Confirm that your virtual environment exists and install the packages for a small local LoRA run.
test -x .venv/bin/python || { echo "Create .venv before continuing"; exit 1; }.venv/bin/python -m pip install --upgrade pip.venv/bin/python -m pip install "transformers>=4.40" "datasets>=2.18" "peft>=0.10" "trl>=0.8" "accelerate>=0.28" torch pyyaml- Record hardware and environment details before choosing settings.
{ date nvidia-smi --query-gpu=name,memory.total,memory.free,driver_version --format=csv || true df -h . .venv/bin/python -m pip freeze | sed -n '1,160p'} > logs/environment.txtsed -n '1,80p' logs/environment.txt- Write the task contract so the dataset has one clear behavior.
cat > data/contract.yaml <<'EOF'task: support_ticket_priority_jsoninstruction: "Classify the support ticket priority and return strict JSON."allowed_priorities: - critical - high - medium - lowrequired_output_fields: - priority - reasonscope: include: short support tickets with business-impact clues exclude: current policy lookup, long-form summaries, customer apology draftingEOFsed -n '1,120p' data/contract.yaml- Create a narrow training dataset with consistent instructions and output shape.
cat > data/train.jsonl <<'EOF'{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Checkout fails for every customer after the latest payment deployment.","output":"{\"priority\":\"critical\",\"reason\":\"customer checkout is unavailable\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"One analyst reports that an internal dashboard widget loads slowly.","output":"{\"priority\":\"low\",\"reason\":\"limited internal impact\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Password reset emails are not being delivered to customers.","output":"{\"priority\":\"high\",\"reason\":\"account recovery is impaired\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Invoice emails are delayed, but customers can still complete purchases.","output":"{\"priority\":\"medium\",\"reason\":\"billing communication is delayed\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"The public API returns errors for all authentication requests.","output":"{\"priority\":\"critical\",\"reason\":\"authentication service is unavailable\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"A typo appears in one paragraph of an internal onboarding document.","output":"{\"priority\":\"low\",\"reason\":\"minor internal documentation issue\"}"}EOFwc -l data/train.jsonl- Create a held-out evaluation set and do not train on it.
cat > eval/check.jsonl <<'EOF'{"instruction":"Classify the support ticket priority and return strict JSON.","input":"The public status page is unreachable during an active incident.","expected":"{\"priority\":\"high\",\"reason\":\"incident communication is impaired\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"Some invoice PDFs are generated two hours late, but payment still works.","expected":"{\"priority\":\"medium\",\"reason\":\"billing artifact delivery is delayed\"}"}{"instruction":"Classify the support ticket priority and return strict JSON.","input":"A color is wrong on one internal dashboard chart.","expected":"{\"priority\":\"low\",\"reason\":\"minor internal presentation issue\"}"}EOFwc -l eval/check.jsonl- Validate the dataset contract before any training run.
cat > scripts/check_data.py <<'PY'import jsonfrom pathlib import Path
allowed = {"critical", "high", "medium", "low"}seen_inputs = set()
for path in [Path("data/train.jsonl"), Path("eval/check.jsonl")]: output_field = "output" if path.name == "train.jsonl" else "expected" for line_number, line in enumerate(path.read_text().splitlines(), start=1): row = json.loads(line) for field in ["instruction", "input", output_field]: if not row.get(field): raise ValueError(f"{path}:{line_number} missing {field}") parsed = json.loads(row[output_field]) if set(parsed) != {"priority", "reason"}: raise ValueError(f"{path}:{line_number} output fields are wrong") if parsed["priority"] not in allowed: raise ValueError(f"{path}:{line_number} invalid priority") if row["input"] in seen_inputs: raise ValueError(f"{path}:{line_number} duplicate input") seen_inputs.add(row["input"])
print("dataset checks passed")PY
.venv/bin/python scripts/check_data.py- Create a training configuration with an intentionally small model for workflow validation.
cat > artifacts/config.yaml <<'EOF'base_model: HuggingFaceTB/SmolLM2-135M-Instructpeft_method: loralora: r: 8 alpha: 16 dropout: 0.05 target_modules: - q_proj - v_projtraining: max_seq_length: 512 per_device_train_batch_size: 1 gradient_accumulation_steps: 8 learning_rate: 0.0002 num_train_epochs: 2 save_total_limit: 2paths: train_file: data/train.jsonl eval_file: eval/check.jsonl output_dir: artifacts/adapterEOFsed -n '1,120p' artifacts/config.yaml- Save checksums so later changes to data or config are visible.
sha256sum data/train.jsonl eval/check.jsonl data/contract.yaml artifacts/config.yaml > logs/checksums.txtcat logs/checksums.txt- Create a minimal training script that formats each example as an instruction, input, and expected assistant output.
cat > scripts/train_lora.py <<'PY'import jsonfrom pathlib import Path
import torchimport yamlfrom datasets import Datasetfrom peft import LoraConfigfrom transformers import AutoModelForCausalLM, AutoTokenizerfrom trl import SFTConfig, SFTTrainer
config = yaml.safe_load(Path("artifacts/config.yaml").read_text())base_model = config["base_model"]
def load_rows(path): rows = [] for line in Path(path).read_text().splitlines(): item = json.loads(line) text = ( f"Instruction: {item['instruction']}\n" f"Input: {item['input']}\n" f"Answer: {item['output']}" ) rows.append({"text": text}) return rows
tokenizer = AutoTokenizer.from_pretrained(base_model)if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained( base_model, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, device_map="auto" if torch.cuda.is_available() else None,)
lora = config["lora"]peft_config = LoraConfig( r=lora["r"], lora_alpha=lora["alpha"], lora_dropout=lora["dropout"], target_modules=lora["target_modules"], task_type="CAUSAL_LM",)
train_dataset = Dataset.from_list(load_rows(config["paths"]["train_file"]))
training = config["training"]args = SFTConfig( output_dir=config["paths"]["output_dir"], max_seq_length=training["max_seq_length"], per_device_train_batch_size=training["per_device_train_batch_size"], gradient_accumulation_steps=training["gradient_accumulation_steps"], learning_rate=training["learning_rate"], num_train_epochs=training["num_train_epochs"], save_total_limit=training["save_total_limit"], logging_steps=1, save_strategy="epoch", report_to=[], dataset_text_field="text",)
trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=train_dataset, peft_config=peft_config, args=args,)
trainer.train()trainer.model.save_pretrained(config["paths"]["output_dir"])tokenizer.save_pretrained(config["paths"]["output_dir"])print(f"saved adapter to {config['paths']['output_dir']}")PY- Create an evaluation script that can run the base model and the adapter on the same held-out prompts.
cat > scripts/evaluate.py <<'PY'import argparseimport jsonfrom pathlib import Path
import torchimport yamlfrom peft import PeftModelfrom transformers import AutoModelForCausalLM, AutoTokenizer
parser = argparse.ArgumentParser()parser.add_argument("--adapter", action="store_true")parser.add_argument("--output", required=True)args = parser.parse_args()
config = yaml.safe_load(Path("artifacts/config.yaml").read_text())base_model = config["base_model"]
tokenizer = AutoTokenizer.from_pretrained(base_model)if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained( base_model, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, device_map="auto" if torch.cuda.is_available() else None,)
if args.adapter: model = PeftModel.from_pretrained(model, config["paths"]["output_dir"])
model.eval()outputs = []
for line in Path(config["paths"]["eval_file"]).read_text().splitlines(): item = json.loads(line) prompt = ( f"Instruction: {item['instruction']}\n" f"Input: {item['input']}\n" "Answer:" ) encoded = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): generated = model.generate( **encoded, max_new_tokens=80, do_sample=False, pad_token_id=tokenizer.eos_token_id, ) decoded = tokenizer.decode(generated[0], skip_special_tokens=True) outputs.append({ "input": item["input"], "expected": item["expected"], "response": decoded, })
Path(args.output).write_text(json.dumps(outputs, indent=2) + "\n")print(f"wrote {args.output}")PY- Run the baseline evaluation before training and keep the raw outputs.
.venv/bin/python scripts/evaluate.py --output eval/baseline.jsonsed -n '1,200p' eval/baseline.json- Start the LoRA training run and record the exact command.
printf '%s\n' ".venv/bin/python scripts/train_lora.py" >> logs/run-command.txt.venv/bin/python scripts/train_lora.py 2>&1 | tee logs/train.log- Confirm that the adapter exists and that checkpoints are limited.
find artifacts -maxdepth 4 -type f | sortdu -sh artifacts- Run the tuned evaluation using the same held-out prompts.
.venv/bin/python scripts/evaluate.py --adapter --output eval/tuned.jsonsed -n '1,240p' eval/tuned.json- Compare baseline and tuned outputs without relying on memory.
diff -u eval/baseline.json eval/tuned.json || true- Write a decision note that explicitly keeps, iterates, or discards the adapter.
cat > logs/decision.md <<'EOF'# Decision
Task: support ticket priority as strict JSONBase model: HuggingFaceTB/SmolLM2-135M-InstructAdapter path: artifacts/adapterDecision: keep for more testing, iterate, or discard
Evidence:- Baseline JSON adherence:- Tuned JSON adherence:- Label quality on held-out prompts:- Runtime used for comparison:- Remaining failure modes:
Next action:EOFsed -n '1,120p' logs/decision.md- Keep only traceable artifacts and remove any throwaway outputs that are not tied to the saved config or evaluation.
find data eval artifacts logs scripts -maxdepth 3 -type f | sortSuccess criteria:
- You can explain why this task is a fine-tuning candidate instead of a retrieval problem.
-
data/train.jsonlandeval/check.jsonlare separate files with one consistent output contract. - The experiment records hardware, package versions, checksums, config, baseline outputs, tuned outputs, and a decision note.
- The adapter path is traceable to a base model, dataset, LoRA configuration, and evaluation set.
- The keep-or-discard decision is based on baseline-versus-tuned behavior, not on whether training merely completed.
- Any failure leads to a specific next change, such as cleaner labels, smaller model, lower sequence length, supported quantization, or a retrieval-based redesign.
Next Module
Section titled “Next Module”Sources
Section titled “Sources”- LoRA: Low-Rank Adaptation of Large Language Models — Original LoRA paper for claims about freezing base weights, training low-rank adapters, parameter-count reduction, memory savings, and PEFT trade-offs versus full fine-tuning.
- QLoRA: Efficient Finetuning of Quantized LLMs — Primary source for 4-bit fine-tuning, NF4, double quantization, paged optimizers, and realistic single-GPU fine-tuning claims under constrained VRAM.