Chapter 28. Fine-tune an Open Model

In Chapter 27, you built a Transformer from scratch and trained it on Shakespeare. That was training from zero: random weights, a tiny dataset, a model that knew nothing until you taught it. Fine-tuning is the opposite. You start with a model that already knows language, code, math, and world knowledge (learned from trillions of tokens of pretraining), and you teach it something new. This chapter walks you through the entire process: picking a base model, preparing a dataset, running QLoRA fine-tuning on a single GPU, evaluating the result, and deploying it locally.


Why Fine-tune Instead of Prompting?

Before spending time and compute on fine-tuning, you should understand when it is worth doing and when it is not. Chapter 25 introduced LoRA and QLoRA conceptually. This chapter puts them into practice.

Prompting (including few-shot examples and system instructions) is the right choice when:

  • You need the model to follow a specific format or persona, and a well-crafted system prompt achieves it.
  • Your task changes frequently, and you do not want to retrain for each variation.
  • You are using a closed API model (GPT-5.4, Claude Opus 4.6) where fine-tuning is not available or is expensive.

Fine-tuning is the right choice when:

  • You need the model to consistently produce output in a specific style, format, or domain vocabulary that prompting alone cannot reliably achieve.
  • You want to reduce token usage. A fine-tuned model that “knows” your format does not need lengthy system prompts or few-shot examples, saving input tokens on every call.
  • You need to run the model locally for privacy, latency, or cost reasons, and you want it specialized for your use case.
  • You have a domain-specific task (legal analysis, medical coding, financial extraction) where the base model’s general knowledge is not precise enough.
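
To put a rough number on the token-savings argument above, here is a minimal back-of-the-envelope sketch. The function name and every figure in it (prompt sizes, call volume) are hypothetical placeholders, not measurements; substitute your own.

```python
def prompt_token_savings(prompt_tokens_before, prompt_tokens_after, calls_per_day):
    """Input tokens saved per day by shrinking the fixed prompt overhead."""
    saved_per_call = prompt_tokens_before - prompt_tokens_after
    return saved_per_call * calls_per_day


# Hypothetical scenario: a 1,200-token system prompt with few-shot examples
# shrinks to an 80-token instruction once the model is fine-tuned.
daily_savings = prompt_token_savings(1200, 80, calls_per_day=50_000)
print(f"Input tokens saved per day: {daily_savings:,}")
```

Whether this saving outweighs the cost of training and hosting the fine-tuned model depends on your pricing and traffic, which this sketch does not model.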

The key insight: fine-tuning changes the model’s weights. Prompting changes the model’s input. Fine-tuning is permanent (until you retrain); prompting is temporary (resets every call). Fine-tuning requires compute and data; prompting requires only good instructions.

In practice, many production systems combine both: a fine-tuned base model with a carefully designed system prompt on top.


Step 1: Pick a Base Model

The base model is the pretrained model you will adapt. Your choice determines the quality ceiling (the fine-tuned model cannot exceed the base model’s fundamental capabilities), the hardware requirements, and the license terms.

For this chapter, we use Qwen3-8B, an 8.2-billion-parameter dense model released by Alibaba on April 29, 2025, under the Apache 2.0 license. Here is why:

  • Size: 8.2B parameters is large enough to be genuinely capable (it outperforms many 13B models from 2023-2024) but small enough to fine-tune on a single consumer GPU with QLoRA.
  • Architecture: Dense decoder-only Transformer with 36 layers, 4,096 hidden dimensions, 32 query attention heads, 8 key-value heads (Grouped Query Attention), SwiGLU FFN with SiLU activation (Chapter 9), RoPE positional embeddings (Chapter 6), and a vocabulary of 151,936 tokens. This is the same architecture described throughout this book.
  • Training data: Pretrained on 36 trillion tokens across 119 languages, with a three-stage training pipeline that includes general pretraining, reasoning-focused optimization, and long-context extension.
  • Context window: 32,768 tokens natively, extendable to 131,072 tokens via YaRN.
  • License: Apache 2.0, which permits commercial use, modification, and redistribution, subject to the license's attribution and notice requirements.

Source: Qwen3-8B: 8.2B parameters, 36 layers, hidden_size 4096, intermediate_size 12288, 32 query heads, 8 KV heads, vocab_size 151936, head_dim 128, released April 29, 2025 under Apache 2.0 (confirmed from huggingface.co/Qwen/Qwen3-8B/blob/main/config.json, apxml.com/models/qwen3-8b, purevpn.com, eprnews.com). Pretrained on 36 trillion tokens across 119 languages (confirmed from huggingface.co/Qwen/Qwen3-4B-Base, qwenlm.github.io/blog/qwen3, aman.ai/primers/ai/qwen3).

Other Good Choices

Qwen3-8B is not the only option. Here are other models well-suited for single-GPU fine-tuning in March 2026:

| Model | Parameters | License | Released | Notes |
|---|---|---|---|---|
| Qwen3-8B | 8.2B | Apache 2.0 | Apr 2025 | Dense, 36 layers, 119 languages |
| Qwen3.5-9B | 9B | Apache 2.0 | Mar 2026 | Hybrid (Gated DeltaNet + attention), 32 layers, 262K native context, 201 languages, multimodal |
| LLaMA 4 Scout | 109B total / 17B active | Llama License | Apr 2025 | MoE, 16 experts, 10M context |
| Gemma 3 4B | 4B | Gemma Terms of Use | Mar 2025 | Smaller, good for constrained hardware, 128K context |
| Mistral 7B | 7.3B | Apache 2.0 | Sep 2023 | Proven, well-supported ecosystem |
| Qwen3-4B | 4B | Apache 2.0 | Apr 2025 | Lighter alternative to 8B |

Qwen3.5-9B is the newest option in this list. Released on March 2, 2026, it uses a hybrid architecture that alternates Gated DeltaNet layers (a linear attention variant) with standard attention layers in a 3:1 ratio across 32 layers. It supports 262,144 tokens natively (extendable to over 1 million), handles text, images, and video natively through early fusion training, and covers 201 languages (up from 119 in Qwen3). Its vocabulary is larger than Qwen3’s: 248,320 tokens versus 151,936. If you are starting a new fine-tuning project in March 2026, Qwen3.5-9B is worth considering. However, we use Qwen3-8B in this chapter because it has a more mature ecosystem of tutorials, adapters, and community support, and the fine-tuning workflow is identical for both models.

For LLaMA 4 Scout, despite having 109B total parameters, only 17B are active per token (MoE architecture, Chapter 12). QLoRA fine-tuning of Scout is possible on a single 48 GB GPU, though it requires more memory than the 8B dense models.

Note on Gemma 3: unlike the other models in this table, Gemma 3 uses Google’s custom Gemma Terms of Use, not Apache 2.0. The license permits commercial use but includes specific restrictions. Read the terms at ai.google.dev/gemma/terms before using it in production.

Source: LLaMA 4 Scout: 109B total, 17B active, 16 experts, released April 2025 (confirmed from apxml.com/models/llama-4-scout, arxiv.org/html/2601.11659v1). Qwen3.5-9B: 9B parameters, 32 layers, hidden_size 4096, vocab_size 248,320, hybrid Gated DeltaNet + attention in 3:1 ratio, 262K native context, released March 2, 2026, Apache 2.0 (confirmed from apxml.com/models/qwen35-9b, huggingface.co/Qwen/Qwen3.5-9B, codersera.com, aicost.org). Qwen3.5 Small series supports 201 languages, early fusion multimodal (confirmed from thenextgentechinsider.com, indiatoday.in, huggingface.co/blog/mlabonne/qwen35). Gemma 3 4B: released March 12, 2025, Gemma Terms of Use license (confirmed from llm-stats.com, markaicode.com, ai.google.dev/gemma/terms).


Step 2: Prepare a Dataset

The dataset is the most important part of fine-tuning. A model fine-tuned on 500 high-quality examples will often outperform one fine-tuned on 10,000 noisy examples. Quality matters more than quantity.

What Format Does the Data Need?

For instruction fine-tuning (teaching the model to follow instructions in a specific way), each example is a conversation with one or more turns. The standard format is a list of messages, each with a role (“system”, “user”, or “assistant”) and content. This is the same chat format used by the APIs described in Chapter 23.

# One training example: a single conversation.
example = {
    "messages": [
        {"role": "system", "content": "You are a legal assistant that summarizes contracts."},
        {"role": "user", "content": "Summarize the key obligations in this lease agreement: [contract text]"},
        {"role": "assistant", "content": "Key obligations:\n1. Tenant must pay $2,500/month...\n2. ..."}
    ]
}

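The model learns to produce the assistant’s response given the system prompt and user message. In the usual setup, the loss is computed only on the assistant’s tokens (the system and user tokens are masked out), so the model learns what to say, not what the user said. This masking is a trainer setting rather than a property of the data format, so check that your training library has it enabled.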
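
Masking the loss to assistant tokens is usually implemented by overwriting the labels of every other position with an ignore value. The sketch below shows the idea on toy data. The token ids and the helper name build_labels are made up for illustration; the value -100 is the default ignore_index of PyTorch's CrossEntropyLoss, which is why training code commonly uses it.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(token_ids, is_assistant):
    """Copy token ids into labels, masking every non-assistant position."""
    if len(token_ids) != len(is_assistant):
        raise ValueError("token_ids and is_assistant must have the same length")
    return [
        tok if keep else IGNORE_INDEX
        for tok, keep in zip(token_ids, is_assistant)
    ]


# Toy sequence: three prompt tokens (system + user), then two assistant tokens.
token_ids = [101, 102, 103, 201, 202]
is_assistant = [False, False, False, True, True]

labels = build_labels(token_ids, is_assistant)
print(labels)  # [-100, -100, -100, 201, 202]
```

Positions labeled -100 contribute nothing to the loss, so gradients come only from the assistant's reply.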

How Many Examples Do You Need?

There is no universal answer, but here are practical guidelines based on the task:

| Task Type | Typical Dataset Size | Example |
|---|---|---|
| Format/style adaptation | 200-500 examples | “Always respond in bullet points” |
| Domain vocabulary | 500-2,000 examples | Medical terminology, legal jargon |
| Classification | 1,000-5,000 examples | Sentiment analysis, intent detection |
| Complex reasoning | 2,000-10,000 examples | Multi-step math, code generation |
| Full instruction following | 5,000-50,000 examples | General-purpose assistant |

The Stanford Alpaca project demonstrated in 2023 that fine-tuning LLaMA 7B on just 52,000 instruction-response pairs (generated by text-davinci-003, part of the GPT-3.5 family) produced a model that, in the authors' preliminary evaluation, followed instructions comparably to text-davinci-003 itself. That was a landmark result: it showed that a relatively small, high-quality dataset could dramatically change a model’s behavior.

Source: Stanford Alpaca: 52,000 instruction-response pairs generated by text-davinci-003, fine-tuned on LLaMA 7B (confirmed from crfm.stanford.edu/2023/03/13/alpaca.html, huggingface.co/datasets/tatsu-lab/alpaca).

Building Your Dataset

For this walkthrough, we will create a small dataset for a specific task: teaching the model to extract structured information from product reviews. This is a realistic use case: you have unstructured text (customer reviews) and you want structured output (sentiment, key topics, action items).

import json

# Create a small training dataset.
# In practice, you would have hundreds or thousands of these.
training_data = [
    {
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a product review analyzer. Given a customer review, "
                    "extract: 1) sentiment (positive/negative/mixed), "
                    "2) key topics mentioned, 3) any specific complaints, "
                    "4) purchase recommendation (yes/no/conditional). "
                    "Respond in JSON format."
                )
            },
            {
                "role": "user",
                "content": (
                    "Review: I bought this laptop three months ago and the "
                    "performance is incredible for the price. The battery "
                    "lasts about 6 hours which is decent. My only complaint "
                    "is the trackpad feels a bit cheap and the speakers are "
                    "tinny. Overall though, great value for a $600 machine."
                )
            },
            {
                "role": "assistant",
                "content": json.dumps({
                    "sentiment": "positive",
                    "key_topics": [
                        "performance", "battery life",
                        "trackpad quality", "speaker quality", "price"
                    ],
                    "complaints": [
                        "trackpad feels cheap",
                        "speakers are tinny"
                    ],
                    "recommendation": "yes"
                }, indent=2)
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a product review analyzer. Given a customer review, "
                    "extract: 1) sentiment (positive/negative/mixed), "
                    "2) key topics mentioned, 3) any specific complaints, "
                    "4) purchase recommendation (yes/no/conditional). "
                    "Respond in JSON format."
                )
            },
            {
                "role": "user",
                "content": (
                    "Review: Terrible experience. The shoes fell apart after "
                    "two weeks of normal use. The sole separated from the "
                    "upper on both shoes. Customer service was unhelpful and "
                    "refused a refund because I had worn them outside. "
                    "Do not buy these."
                )
            },
            {
                "role": "assistant",
                "content": json.dumps({
                    "sentiment": "negative",
                    "key_topics": [
                        "durability", "build quality",
                        "customer service", "refund policy"
                    ],
                    "complaints": [
                        "shoes fell apart after two weeks",
                        "sole separated from upper on both shoes",
                        "customer service refused refund"
                    ],
                    "recommendation": "no"
                }, indent=2)
            }
        ]
    },
]

# Save as JSONL (one JSON object per line).
with open("train_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

print(f"Saved {len(training_data)} training examples to train_data.jsonl")

In a real project, you would have 500-5,000 examples like these. You can create them by:

  1. Manual annotation: Write the examples yourself or hire annotators. This produces the highest quality data but is slow and expensive.
  2. Synthetic generation: Use a stronger model (GPT-5.4, Claude Opus 4.6) to generate training examples, then review and correct them. This is the approach Stanford Alpaca used, and it remains the most common method in 2026.
  3. Existing datasets: Use publicly available instruction datasets from Hugging Face Hub. As of March 2026, Hugging Face hosts over 500,000 public datasets.
  4. Production logs: If you already have a system using an API model, you can use the (user query, model response) pairs as training data, after filtering for quality.
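
Whichever source you use, it is worth checking the file mechanically before training. The following is a minimal validation sketch for the messages format described earlier. The specific checks (known roles, non-empty content, final turn from the assistant) are our own assumptions about what a well-formed record looks like, not requirements imposed by any particular library.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one training record (empty if OK)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_jsonl_lines(lines):
    """Validate JSONL text line by line; returns {line_number: [problems]}."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Skip blank lines.
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_record(record)
        if problems:
            report[lineno] = problems
    return report


# In-memory demo; in practice, pass an open file handle instead of a list.
good = json.dumps({"messages": [
    {"role": "user", "content": "Review: Great laptop."},
    {"role": "assistant", "content": "{\"sentiment\": \"positive\"}"},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "Review: Hm."}]})
report = validate_jsonl_lines([good, bad, "not json"])
for lineno, problems in report.items():
    print(lineno, problems)
```

Because an open file iterates line by line, validate_jsonl_lines(open("train_data.jsonl")) works on the file written above.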

Source: Hugging Face Hub: over 500,000 public datasets, 2 million+ public models, 13 million users as of 2025 (confirmed from huggingface.co/blog/huggingface/state-of-os-hf-spring-2026). The Spring 2026 report notes that users, models, and datasets all nearly doubled during 2025.

The Chat Template

Different model families use different chat templates to format conversations. Qwen3 uses a template based on special tokens. When you use libraries like Unsloth or Hugging Face TRL, the chat template is applied automatically. But it helps to understand what is happening under the hood.

For Qwen3, a conversation is formatted like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

The <|im_start|> and <|im_end|> markers are special tokens in the vocabulary (not literal text). The tokenizer handles this conversion automatically when you use the apply_chat_template method. Qwen3 also supports a “thinking” mode where the model can reason step-by-step inside <think>...</think> tags before responding, but for fine-tuning on structured extraction tasks, the non-thinking mode is typically sufficient.
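
To make the layout concrete, here is a simplified sketch that builds the same kind of string by hand. It is illustrative only: render_chatml is a hypothetical helper, not the real Qwen3 template. The real template also handles thinking tags and other cases, and the tokenizer maps each marker to a single special-token id rather than treating it as text. Always use apply_chat_template for actual training and inference.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in a simplified ChatML-style layout (illustrative only)."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
rendered = render_chatml(messages, add_generation_prompt=True)
print(rendered)
```

The add_generation_prompt flag mirrors the parameter of the same name on apply_chat_template: at inference time the prompt ends with an opened assistant turn, which the model then completes.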

Source: Qwen3 chat template uses <|im_start|> and <|im_end|> special tokens, supports thinking/non-thinking modes via enable_thinking flag (confirmed from huggingface.co/blog/qwen-3-chat-template-deep-dive).


Step 3: QLoRA Fine-tuning on a Single GPU

This is the core of the chapter. We will fine-tune Qwen3-8B using QLoRA: the base model is loaded in 4-bit precision (NF4 quantization), and small LoRA adapter matrices are trained in 16-bit precision on top. Chapter 25 explained the theory; here we put it into practice.

Hardware Requirements

With QLoRA, Qwen3-8B requires approximately 6-8 GB of VRAM for fine-tuning with a 2,048-token sequence length. This fits on:

  • NVIDIA RTX 4060 (8 GB) or RTX 4070 (12 GB): consumer GPUs, ~$300-600
  • NVIDIA RTX 4090 (24 GB): high-end consumer GPU, ~$1,600-2,000
  • NVIDIA A100 40 GB or H100 80 GB: datacenter GPUs, available via cloud
  • Google Colab free tier (T4 16 GB): free, sufficient for 8B models with QLoRA
  • Apple M-series Macs (16+ GB unified memory): supported via MPS backend

The 4-bit quantized Qwen3-8B base model occupies roughly 4-5 GB of VRAM. The LoRA adapters, optimizer states, and activations add another 2-3 GB. Total: 6-8 GB, depending on batch size and sequence length.
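
The base-model figure follows from simple arithmetic on the parameter count. The sketch below computes the memory for the weights alone. It deliberately ignores quantization constants and any layers kept in higher precision, which is presumably part of why the practical figure lands at 4-5 GB rather than exactly the 4-bit number.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8.2e9  # Qwen3-8B parameter count used throughout this chapter

fp16_gb = weight_memory_gb(N_PARAMS, 16)
four_bit_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {four_bit_gb:.1f} GB")
```

Activations, LoRA adapters, and optimizer states come on top of this and scale with batch size and sequence length, so treat the result as a lower bound.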

Source: Unsloth makes Qwen3-8B fine-tuning use 70% less VRAM, fits on free Colab T4 (confirmed from unsloth.ai/blog/qwen3, docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune).

The Code

We use Unsloth, the optimized fine-tuning library introduced in Chapter 25. Unsloth rewrites performance-critical operations as custom Triton kernels, which the project reports gives 2x faster training and 70% less VRAM usage compared to standard Hugging Face training, with no accuracy loss. It is compatible with the Hugging Face PEFT and TRL libraries (the script below uses TRL's SFTTrainer). As of March 2026, Unsloth supports fine-tuning and reinforcement learning for Qwen3, Qwen3.5 (including the new hybrid Gated DeltaNet architecture), LLaMA 4, DeepSeek R1/V3, OpenAI gpt-oss, Gemma, and many other model families.

Here is the complete, runnable fine-tuning script:

"""
Fine-tune Qwen3-8B with QLoRA using Unsloth.

Requirements:
    pip install unsloth

Hardware: Any GPU with 8+ GB VRAM (RTX 4060, T4, etc.)
Time: ~30-60 minutes for 1,000 examples on a T4.
"""
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ─── Step 1: Load the base model in 4-bit ───
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,   # QLoRA: 4-bit NF4 quantization
)

# ─── Step 2: Add LoRA adapters ───
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank: 16 is a good default
    lora_alpha=32,       # Scaling factor (typically 2x rank)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",      # FFN (SwiGLU)
    ],
    lora_dropout=0,      # No dropout (Unsloth recommendation)
    bias="none",         # Do not train bias terms
)

Let us break down what each parameter does:

load_in_4bit=True: This is the “Q” in QLoRA. The base model’s 8.2 billion parameters are loaded in 4-bit NormalFloat (NF4) precision instead of the standard 16-bit. This reduces the model’s memory footprint from ~16.4 GB (FP16) to ~4.1 GB (4-bit). The NF4 data type, introduced in the QLoRA paper (Dettmers et al., arXiv:2305.14314, NeurIPS 2023), is specifically designed for normally-distributed neural network weights and produces less quantization error than standard 4-bit integer quantization.

r=16: The LoRA rank. As explained in Chapter 25, LoRA decomposes weight updates into two small matrices of rank r. Higher rank means more trainable parameters and more expressive power, but also more memory. Rank 16 is a widely-used default that works well for most tasks. For simple format adaptation, rank 8 may suffice. For complex domain adaptation, rank 32 or 64 may help.

lora_alpha=32: The scaling factor for LoRA updates. The effective update is scaled by alpha / r. With alpha=32 and r=16, the scaling factor is 2.0. This controls how much the LoRA adapters influence the output relative to the frozen base weights.

target_modules: Which weight matrices get LoRA adapters. We target all the attention projections (Q, K, V, output) and all the FFN projections (gate, up, down in the SwiGLU architecture from Chapter 9). This is the most common configuration. You could target fewer modules to reduce trainable parameters, but targeting all of them generally produces better results.
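
The roles of r and lora_alpha are easiest to see in the forward pass of a single adapted layer. Below is a toy, dependency-free sketch of y = W x + (alpha / r) * B (A x). The matrix orientation follows the LoRA paper (A is r x d_in, B is d_out x r); all values are made up, and real implementations use tensors, not Python lists.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight (d_out x d_in)
A = [[0.5, 0.5, 0.5]]                   # r x d_in
B_zero = [[0.0], [0.0]]                 # d_out x r, zero at initialization
B_trained = [[1.0], [-1.0]]             # pretend values after some training
x = [2.0, 4.0, 6.0]

y_init = lora_forward(x, W, A, B_zero, alpha=2, r=1)
y_trained = lora_forward(x, W, A, B_trained, alpha=2, r=1)
print(y_init)     # [2.0, 4.0]
print(y_trained)  # [14.0, -8.0]
```

Because B starts at zero, the adapted layer initially reproduces the base model exactly; training moves the output away from that starting point by an amount scaled by alpha / r.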

How Many Parameters Are We Training?

Let us calculate. Qwen3-8B has 36 layers. Each layer has 7 target modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). Each module gets two LoRA matrices of rank 16.

def count_lora_params():
    """Calculate the number of trainable LoRA parameters."""
    hidden = 4096       # Qwen3-8B hidden_size
    intermediate = 12288  # Qwen3-8B intermediate_size (FFN)
    n_layers = 36
    rank = 16
    kv_dim = 8 * 128    # 8 KV heads * 128 head_dim = 1024

    # Attention projections: q, k, v, o
    # q_proj: (4096, 4096) -> LoRA A: (4096, 16) + B: (16, 4096)
    # k_proj: (4096, 1024) -> LoRA A: (4096, 16) + B: (16, 1024)
    # v_proj: (4096, 1024) -> LoRA A: (4096, 16) + B: (16, 1024)
    # o_proj: (4096, 4096) -> LoRA A: (4096, 16) + B: (16, 4096)
    attn_params_per_layer = (
        (hidden * rank + rank * hidden) +      # q_proj: 131,072
        (hidden * rank + rank * kv_dim) +       # k_proj:  81,920
        (hidden * rank + rank * kv_dim) +       # v_proj:  81,920
        (hidden * rank + rank * hidden)         # o_proj: 131,072
    )

    # FFN projections (SwiGLU): gate_proj, up_proj, down_proj
    # gate_proj: (4096, 12288) -> LoRA A: (4096, 16) + B: (16, 12288)
    # up_proj:   (4096, 12288) -> LoRA A: (4096, 16) + B: (16, 12288)
    # down_proj: (12288, 4096) -> LoRA A: (12288, 16) + B: (16, 4096)
    ffn_params_per_layer = (
        (hidden * rank + rank * intermediate) +     # gate_proj: 262,144
        (hidden * rank + rank * intermediate) +     # up_proj:   262,144
        (intermediate * rank + rank * hidden)       # down_proj: 262,144
    )

    total_per_layer = attn_params_per_layer + ffn_params_per_layer
    total = total_per_layer * n_layers

    print(f"LoRA Parameters per Layer")
    print(f"  Attention: {attn_params_per_layer:,}")
    print(f"  FFN:       {ffn_params_per_layer:,}")
    print(f"  Total:     {total_per_layer:,}")
    print(f"\nTotal LoRA parameters: {total:,}")
    print(f"Base model parameters: 8,200,000,000")
    print(f"Trainable fraction:    {total / 8_200_000_000 * 100:.2f}%")
    print(f"Parameter reduction:   {8_200_000_000 / total:.0f}x")

count_lora_params()

Output:

LoRA Parameters per Layer
  Attention: 425,984
  FFN:       786,432
  Total:     1,212,416

Total LoRA parameters: 43,646,976
Base model parameters: 8,200,000,000
Trainable fraction:    0.53%
Parameter reduction:   188x

We are training 43.6 million parameters out of 8.2 billion. That is 0.53% of the model, a 188x reduction. The other 99.47% of the model stays frozen at 4-bit precision. This is why QLoRA is so memory-efficient: you only need to store optimizer states (which track momentum and variance for gradient updates) for the 43.6 million trainable parameters, not for all 8.2 billion.

Source: Qwen3-8B config: hidden_size=4096, intermediate_size=12288, num_hidden_layers=36, num_attention_heads=32, num_key_value_heads=8, head_dim=128, vocab_size=151936 (confirmed from huggingface.co/Qwen/Qwen3-8B/blob/main/config.json).

Configure Training

# ─── Step 3: Load and format the dataset ───
# For this example, we use a local JSONL file.
# In practice, you can also load from Hugging Face Hub:
#   dataset = load_dataset("your-username/your-dataset")
dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

# ─── Step 4: Configure the trainer ───
from unsloth import is_bfloat16_supported

# Use bf16 where the GPU supports it (Ampere and newer); otherwise fall
# back to fp16 (for example on a T4).
use_bf16 = is_bfloat16_supported()

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./qwen3-review-analyzer",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size: 2 * 4 = 8
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        logging_steps=10,
        save_strategy="epoch",
        fp16=not use_bf16,
        bf16=use_bf16,
        optim="adamw_8bit",  # 8-bit AdamW saves memory
        seed=42,
    ),
)

Key training parameters explained:

per_device_train_batch_size=2: How many examples are processed per GPU per step. With 8 GB of VRAM, batch size 2 with 2,048-token sequences is a safe starting point. If you have more VRAM (24 GB on an RTX 4090), you can increase this to 4 or 8.

gradient_accumulation_steps=4: The gradients from 4 consecutive mini-batches are accumulated before updating the weights. This gives an effective batch size of 2 x 4 = 8 without requiring the memory to hold 8 examples simultaneously. Larger effective batch sizes generally produce more stable training.

learning_rate=2e-4: The learning rate for the LoRA adapters. This is higher than typical full fine-tuning learning rates (which are usually 1e-5 to 5e-5) because we are only updating a small number of parameters. The 2e-4 value is a widely-used default for LoRA fine-tuning.

lr_scheduler_type="cosine": The learning rate follows a cosine decay schedule. With warmup_ratio=0.1, it first ramps up linearly to 2e-4 over the first 10% of training steps, then gradually decreases to near zero. Cosine decay is also the scheduler used in most pretraining runs (Chapter 14).

optim="adamw_8bit": Uses 8-bit AdamW optimizer, which stores optimizer states in 8-bit precision instead of 32-bit. This cuts the memory used by optimizer states to about a quarter with negligible impact on training quality. For our 43.6 million trainable parameters, this saves about 262 MB of VRAM. The math: AdamW stores two states per parameter (momentum and variance). At 32-bit, that is 43.6M x 2 x 4 bytes = ~349 MB. At 8-bit, that is 43.6M x 2 x 1 byte = ~87 MB. The difference is ~262 MB.
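
To see how the batch-size and schedule settings above interact, the sketch below counts optimizer steps for a 1,000-example dataset and evaluates a warmup-plus-cosine learning rate at a few points. It is an approximation written for illustration: the rounding of steps per epoch and of the warmup length may differ slightly from what the Hugging Face trainer does internally.

```python
import math


def total_optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Weight updates over the whole run, assuming a single GPU."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch * epochs


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_optimizer_steps(
    n_examples=1000, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"Optimizer steps: {steps}")
for s in (0, steps // 10, steps // 2, steps):
    print(f"step {s:>4}: lr = {lr_at_step(s, steps):.2e}")
```

With an effective batch size of 8, each epoch over 1,000 examples takes 125 updates, so three epochs give 375. The learning rate peaks at the end of warmup and is back near zero by the final step.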

Run Training

# ─── Step 5: Train ───
trainer.train()

# Training output will look something like:
# Step 10: loss = 1.823
# Step 20: loss = 1.245
# Step 30: loss = 0.891
# Step 40: loss = 0.654
# ...
# Training complete. Total time: ~35 minutes on T4.

The loss should decrease steadily during training. For a well-prepared dataset:

  • Starting loss (step 0): Typically 1.5-3.0, depending on how different your target format is from the base model’s default behavior.
  • Final loss (after training): Typically 0.3-0.8 for format adaptation tasks, 0.5-1.2 for more complex domain tasks.
  • If the loss does not decrease: Your data format may be incorrect, the learning rate may be too low, or the dataset may be too small.
  • If the loss drops to near zero: You are likely overfitting. Reduce the number of epochs or add more training data.

Save the Adapter

# ─── Step 6: Save the LoRA adapter ───
import os

model.save_pretrained("qwen3-review-adapter")
tokenizer.save_pretrained("qwen3-review-adapter")

print("Adapter saved. Contents:")
for f in os.listdir("qwen3-review-adapter"):
    size = os.path.getsize(os.path.join("qwen3-review-adapter", f))
    print(f"  {f}: {size / 1e6:.1f} MB")

The saved adapter is small: typically 50-150 MB for rank 16 on an 8B model. This is just the LoRA matrices (the A and B matrices from Chapter 25), not the full model. To use the adapter, you load the base model and then load the adapter on top of it. You can also have multiple adapters for different tasks, all sharing the same base model.
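
You can sanity-check the adapter size from the trainable parameter count computed earlier. The sketch below is a rough estimate only: the precision in which the adapter is written to disk depends on the library and its settings, so both 16-bit and 32-bit cases are shown, and tokenizer files add a few more megabytes.

```python
def adapter_size_mb(n_trainable_params, bytes_per_param):
    """Approximate on-disk size of the LoRA matrices, in MB (1 MB = 1e6 bytes)."""
    return n_trainable_params * bytes_per_param / 1e6


LORA_PARAMS = 43_646_976  # rank-16 LoRA parameter count from this chapter

size_16bit = adapter_size_mb(LORA_PARAMS, 2)
size_32bit = adapter_size_mb(LORA_PARAMS, 4)
print(f"16-bit adapter: ~{size_16bit:.0f} MB")
print(f"32-bit adapter: ~{size_32bit:.0f} MB")
```

If the files listed by the save script are far larger than these estimates, the full model was probably saved instead of the adapter alone.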

Source: Unsloth achieves 2x faster training and 70% less VRAM, supports Qwen3, Qwen3.5, LLaMA 4, DeepSeek, gpt-oss, and Gemma models, compatible with Hugging Face PEFT and TRL (confirmed from github.com/unslothai/unsloth, huggingface.co/blog/unsloth-trl, unsloth.ai/blog/qwen3, unsloth.ai/docs/models/qwen3.5/fine-tune).


Step 4: Evaluate the Fine-tuned Model

Training loss tells you the model is learning, but it does not tell you whether the model is learning the right things. Evaluation requires testing the model on examples it has never seen during training.

Quick Inference Test

The simplest evaluation: give the model a new review and check whether the output matches your expected format.

"""
Test the fine-tuned model on a new review.
"""
from unsloth import FastLanguageModel

# Load the base model + adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="qwen3-review-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable fast inference mode.

# Create a test prompt.
messages = [
    {
        "role": "system",
        "content": (
            "You are a product review analyzer. Given a customer review, "
            "extract: 1) sentiment (positive/negative/mixed), "
            "2) key topics mentioned, 3) any specific complaints, "
            "4) purchase recommendation (yes/no/conditional). "
            "Respond in JSON format."
        )
    },
    {
        "role": "user",
        "content": (
            "Review: The noise-canceling headphones are good but not great. "
            "Sound quality is excellent for music, but the ANC struggles "
            "with low-frequency rumble on airplanes. Comfortable for about "
            "2 hours, then the ear cups get warm. Battery life is amazing "
            "at 35 hours. For $250, I expected better ANC performance."
        )
    },
]

# Apply the chat template and generate.
# enable_thinking=False disables Qwen3's chain-of-thought mode,
# which would add <think>...</think> tags and break JSON parsing.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.1,  # Low temperature for structured output.
    do_sample=True,
)

# Decode and print the response.
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Expected output (after successful fine-tuning):

{
  "sentiment": "mixed",
  "key_topics": [
    "sound quality",
    "active noise cancellation",
    "comfort",
    "battery life",
    "price"
  ],
  "complaints": [
    "ANC struggles with low-frequency rumble on airplanes",
    "ear cups get warm after 2 hours",
    "ANC performance not worth $250"
  ],
  "recommendation": "conditional"
}

If the output is valid JSON with the correct fields, the fine-tuning worked. If the output is garbled, missing fields, or in the wrong format, you need to check your training data, increase the number of examples, or train for more epochs.
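
Checking the output by eye does not scale, so it helps to turn "valid JSON with the correct fields" into a function. The following is a minimal sketch; parse_review_json is a hypothetical helper, and tolerating a surrounding Markdown code fence is our own assumption about a common failure mode, not something the fine-tuned model is guaranteed to need.

```python
import json

REQUIRED_FIELDS = {"sentiment", "key_topics", "complaints", "recommendation"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_review_json(response):
    """Return (parsed_dict, error); exactly one of the two is None."""
    text = response.strip()
    # Tolerate a Markdown code fence wrapped around the JSON.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "top-level value is not a JSON object"
    missing = REQUIRED_FIELDS - set(parsed)
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    return parsed, None


sample = json.dumps({
    "sentiment": "mixed",
    "key_topics": ["sound quality", "battery life"],
    "complaints": ["ear cups get warm after 2 hours"],
    "recommendation": "conditional",
})
parsed, error = parse_review_json(sample)
print(parsed["sentiment"], error)
```

Returning an error string instead of raising makes it easy to log why each failed response was rejected when you run this over many examples.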

Systematic Evaluation

For a production deployment, you need more than a spot check. Here is a systematic evaluation approach:

import json

def evaluate_model(model, tokenizer, test_examples):
    """
    Evaluate the fine-tuned model on a held-out test set.
    Prints and returns counts for: JSON parse rate, field completeness,
    and sentiment accuracy. Each example needs an "expected_sentiment" key.
    """
    results = {
        "total": len(test_examples),
        "valid_json": 0,
        "correct_fields": 0,
        "correct_sentiment": 0,
    }

    required_fields = {"sentiment", "key_topics", "complaints", "recommendation"}

    for example in test_examples:
        # Generate response (same as above, omitted for brevity).
        response = generate_response(model, tokenizer, example)

        # Check 1: Is the output a valid JSON object?
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            continue
        if not isinstance(parsed, dict):
            continue  # Valid JSON, but not an object (e.g. a bare list).
        results["valid_json"] += 1

        # Check 2: Does it have all required fields?
        if required_fields.issubset(parsed.keys()):
            results["correct_fields"] += 1

        # Check 3: Is the sentiment correct?
        if parsed.get("sentiment") == example["expected_sentiment"]:
            results["correct_sentiment"] += 1

    # Compute rates.
    n = results["total"]
    print(f"JSON parse rate:     {results['valid_json']}/{n} "
          f"({results['valid_json']/n*100:.1f}%)")
    print(f"Field completeness:  {results['correct_fields']}/{n} "
          f"({results['correct_fields']/n*100:.1f}%)")
    print(f"Sentiment accuracy:  {results['correct_sentiment']}/{n} "
          f"({results['correct_sentiment']/n*100:.1f}%)")

    return results
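
The generate_response helper called in the loop is not defined in the listing. One possible version, adapted from the quick inference test earlier in this step, is sketched below. It assumes (hypothetically) that each test example is a dict with a "messages" list holding only the system and user turns; adjust it to however your test set is stored.

```python
def generate_response(model, tokenizer, example, max_new_tokens=512):
    """Generate the assistant reply for one test example.

    Assumption: example["messages"] holds the system and user turns only,
    without the reference answer.
    """
    inputs = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=True,
        add_generation_prompt=True,
        enable_thinking=False,  # Keep <think> blocks out of the JSON output.
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # Greedy decoding keeps evaluation runs repeatable.
    )

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = outputs[0][inputs.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Greedy decoding is a design choice here: it differs from the low-temperature sampling in the quick test, but it makes the metrics identical from run to run, which simplifies comparing checkpoints.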

The three metrics above capture different aspects of quality:

  1. JSON parse rate: Can the model produce valid JSON at all? A well-fine-tuned model should achieve 95-100% on this metric. If it is below 90%, the model has not learned the output format reliably.

  2. Field completeness: Does the JSON contain all required fields? This catches cases where the model produces valid JSON but omits fields.

  3. Task accuracy: Is the extracted information correct? The code above checks only the sentiment label. This is the hardest metric and requires ground-truth labels in your test set.

Common Evaluation Pitfalls

Evaluating on training data: Never evaluate on examples the model saw during training. Always hold out 10-20% of your dataset for evaluation. The model may have memorized training examples, giving you an inflated sense of quality.

Ignoring edge cases: Test with unusual inputs: very short reviews, reviews in other languages (if your model should handle them), reviews with sarcasm, reviews that are ambiguous. These are where fine-tuned models most often fail.

Not comparing to the base model: Always test the base model (without fine-tuning) on the same evaluation set. This tells you how much the fine-tuning actually helped. If the base model already achieves 90% accuracy with a good prompt, fine-tuning may not be worth the effort.
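
One way to make the base-model comparison concrete is to run the same evaluation on both models and look at the per-metric difference. The sketch below assumes two result dictionaries in the shape returned by evaluate_model above. The helper name compare_results and the sample numbers are hypothetical and only illustrate the call shape.

```python
def compare_results(base, tuned):
    """Return per-metric rate differences (tuned minus base), in percentage points."""
    if base["total"] != tuned["total"] or base["total"] == 0:
        raise ValueError("Both runs must use the same non-empty test set.")
    n = base["total"]
    deltas = {}
    for metric in ("valid_json", "correct_fields", "correct_sentiment"):
        base_rate = base[metric] / n * 100
        tuned_rate = tuned[metric] / n * 100
        deltas[metric] = tuned_rate - base_rate
        print(f"{metric:18s} base {base_rate:5.1f}%  tuned {tuned_rate:5.1f}%  "
              f"delta {deltas[metric]:+5.1f} pts")
    return deltas

# Made-up numbers, purely to show how the helper is called.
base_results = {"total": 50, "valid_json": 40,
                "correct_fields": 35, "correct_sentiment": 30}
tuned_results = {"total": 50, "valid_json": 49,
                 "correct_fields": 48, "correct_sentiment": 44}
deltas = compare_results(base_results, tuned_results)
```

If the deltas are small, that is a signal to revisit the prompt or the data before investing in more training.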


Step 5: Deploy the Fine-tuned Model

You have a trained adapter. Now you need to serve it. There are three main deployment paths, each suited to different use cases.

Option A: Merge and Export to GGUF (Local Deployment)

The most common deployment path for fine-tuned open models is to merge the LoRA adapter into the base model and export it as a GGUF file. GGUF is the format used by llama.cpp and Ollama for efficient local inference. Chapter 25 covered these tools; here we use them with our fine-tuned model.

"""
Merge LoRA adapter into base model and export as GGUF.
"""
from unsloth import FastLanguageModel

# Load the base model + adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="qwen3-review-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Merge LoRA weights into the base model.
# After merging, the model behaves identically but no longer
# needs the separate adapter files.
model.save_pretrained_merged(
    "qwen3-review-merged",
    tokenizer,
    save_method="merged_16bit",  # Merge at 16-bit precision.
)

# Export to GGUF with Q4_K_M quantization.
# Q4_K_M is a good balance of quality and size (Chapter 25).
model.save_pretrained_gguf(
    "qwen3-review-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Result: a single GGUF file (~4.5 GB) that you can run with Ollama.
print("GGUF export complete.")
print("To run with Ollama:")
print("  1. Create a Modelfile:")
print('     echo "FROM ./qwen3-review-gguf/unsloth.Q4_K_M.gguf" > Modelfile')
print("  2. Import into Ollama:")
print("     ollama create review-analyzer -f Modelfile")
print("  3. Run:")
print("     ollama run review-analyzer")

The GGUF file is self-contained: it includes the model weights, the tokenizer configuration, and metadata. You can copy this single file to any machine with Ollama installed and run it immediately, with no Python environment, no GPU drivers, and no internet connection required.
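
Once the model is imported, you can also call it from code through Ollama's local HTTP API rather than the interactive prompt. The following is a minimal sketch, assuming an Ollama server running on its default port (11434) and a model created under the name review-analyzer as in the steps above. The helper names and the "Review: ..." message wording are illustrative assumptions; match the wording to the prompt format you trained on.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address.

def build_chat_payload(review_text, model_name="review-analyzer"):
    """Build a non-streaming chat request for the imported model."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": f"Review: {review_text}"}],
        "stream": False,  # Return one JSON object instead of a token stream.
        "options": {"temperature": 0.1},
    }

def analyze_review(review_text):
    """Send the request to a running Ollama server and return the reply text."""
    data = json.dumps(build_chat_payload(review_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["message"]["content"]

# Requires a running Ollama server with the model already created:
# print(analyze_review("Great product, fast shipping."))
```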

Source: GGUF format used by llama.cpp and Ollama for efficient local inference, supports quantization from 2-bit to 8-bit (confirmed from mbrenndoerfer.com/writing/gguf-format-quantized-llm-storage-inference, dasroot.net/posts/2026/01/local-llm-deployment-ollama-llama.cpp). Ollama supports importing GGUF files and LoRA adapters directly (confirmed from docs.ollama.com/import).

Option B: Serve with vLLM (Production API)

For production deployments that need to handle multiple concurrent requests, vLLM (Chapter 24) is the standard choice. vLLM supports LoRA adapters natively, so you can serve the base model once and swap adapters per request.

# Start vLLM with LoRA adapter support.
# Run this from the command line:
#
# python -m vllm.entrypoints.openai.api_server \
#     --model Qwen/Qwen3-8B \
#     --enable-lora \
#     --lora-modules review-analyzer=./qwen3-review-adapter \
#     --max-lora-rank 16 \
#     --quantization bitsandbytes \
#     --load-format bitsandbytes \
#     --port 8000

# Then call it like any OpenAI-compatible API:
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "review-analyzer",  # Use the LoRA adapter name.
        "messages": [
            {"role": "system", "content": "You are a product review analyzer..."},
            {"role": "user", "content": "Review: Great product, fast shipping..."},
        ],
        "temperature": 0.1,
        "max_tokens": 512,
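    },
    timeout=60,  # Fail instead of hanging if the server is unreachable.
)
response.raise_for_status()  # Surface HTTP errors before parsing the body.
print(response.json()["choices"][0]["message"]["content"])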

The advantage of vLLM’s LoRA support is that you can serve multiple adapters from a single base model. One base Qwen3-8B instance can serve a review analyzer adapter, a legal summarizer adapter, and a code reviewer adapter, all sharing the same GPU memory for the base model weights. Each request specifies which adapter to use. In practice, this means five customers who each use only 10% of a dedicated GPU can be served from a single GPU with multi-LoRA, turning five underutilized GPUs into one efficiently shared GPU.

Option C: Push to Hugging Face Hub (Sharing)

If you want to share your fine-tuned model with others, push the adapter to Hugging Face Hub:

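# Push the LoRA adapter and the tokenizer to Hugging Face Hub.
# (Requires a Hugging Face token with write access.)
model.push_to_hub("your-username/qwen3-review-analyzer")
tokenizer.push_to_hub("your-username/qwen3-review-analyzer")

# Others can then use it:
# model, tokenizer = FastLanguageModel.from_pretrained("your-username/qwen3-review-analyzer")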

The adapter is small (50-150 MB), so uploading and downloading is fast. Anyone who downloads it loads the base Qwen3-8B model and applies your adapter on top.


Understanding What Fine-tuning Actually Changes

Fine-tuning with LoRA does not rewrite the model’s knowledge. It adds a thin layer of learned adjustments on top of the frozen base weights. To understand what this means in practice, consider the math.

At each target layer, the forward pass computes:

output = W_frozen @ input + (B @ A) @ input * (alpha / rank)

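Where W_frozen is the original weight matrix (shape: output_dim x input_dim, frozen at 4-bit), A is the down-projection (shape: rank x input_dim), B is the up-projection (shape: output_dim x rank), and alpha / rank is the scaling factor. With these shapes, B @ A has the same shape as W_frozen, so the two terms can be added.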

The LoRA update B @ A is a low-rank matrix. With rank 16, it can represent at most 16 independent “directions” of change in the weight space. This is a tiny fraction of the full weight matrix’s capacity (which has rank up to 4,096 for the attention projections). But it turns out that the changes needed for most fine-tuning tasks are low-rank: they involve shifting the model’s behavior in a small number of consistent directions, not rewriting its entire knowledge base.

This is why LoRA works so well: fine-tuning for a specific task does not require changing most of the model’s weights. It requires nudging a small number of weight directions to make the model prefer your desired output format, style, or domain vocabulary.
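
A toy example makes the formula concrete. The sketch below uses plain Python lists with made-up sizes (input_dim 3, output_dim 2, rank 1, alpha 2), where A has shape rank x input_dim and B has shape output_dim x rank. It checks that running the frozen path and the LoRA path separately gives the same output as folding the update into one matrix, which is what merging an adapter does. All values are illustrative assumptions, not taken from a real model.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

alpha, rank = 2.0, 1
W_frozen = [[1.0, 0.0, 2.0],
            [0.0, 1.0, -1.0]]   # output_dim x input_dim
A = [[0.5, -0.5, 1.0]]          # rank x input_dim (down-projection)
B = [[2.0],
     [1.0]]                     # output_dim x rank (up-projection)
x = [1.0, 2.0, 3.0]
scale = alpha / rank

# Two-path forward pass: frozen weights plus the scaled low-rank update.
base_out = matvec(W_frozen, x)
lora_out = matvec(B, matvec(A, x))
two_path = [b + scale * l for b, l in zip(base_out, lora_out)]

# Merged forward pass: fold the update into a single weight matrix.
delta_W = matmul(B, A)          # Same shape as W_frozen, rank at most 1 here.
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_frozen, delta_W)]
merged = matvec(W_merged, x)

print(two_path, merged)
```

Note that the LoRA path never materializes the full delta_W during training. It multiplies by A first and then by B, which is why the extra compute and memory stay small.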

What LoRA Cannot Do

LoRA has limitations:

  • It cannot add fundamentally new knowledge. If the base model does not know a fact, LoRA fine-tuning on a few hundred examples will not reliably teach it that fact. For knowledge injection, you need either a much larger dataset or retrieval augmentation (RAG, Chapter 19).

  • It cannot change the model’s architecture. The context window, vocabulary, and attention mechanism remain the same. A standard LoRA run like the one in this chapter will not turn a 32K-context model into one that handles 128K contexts.

  • It can degrade general capabilities. If your training data is narrow, the model may become worse at tasks outside your domain. This is called catastrophic forgetting. The frozen base weights mitigate this (the model retains most of its general knowledge), but it is not eliminated entirely.

  • Rank matters for complex tasks. Rank 16 is sufficient for format adaptation and simple domain tasks. For complex tasks that require the model to learn new reasoning patterns, rank 32 or 64 may be necessary. Higher rank means more trainable parameters, more memory, and longer training time.


Hyperparameter Tuning Guide

If your first fine-tuning run does not produce good results, here are the most impactful hyperparameters to adjust, in order of importance:

1. Dataset Quality (Most Important)

Before changing any hyperparameters, check your data:

  • Are the assistant responses in your training data actually correct and high-quality?
  • Is the format consistent across all examples?
  • Are there contradictory examples (same input, different expected outputs)?
  • Is the system prompt identical across all examples?

Bad data cannot be fixed by better hyperparameters.
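
Several of these checks can be automated. The sketch below is a hypothetical helper (audit_dataset is not part of any library) that assumes single-turn examples in the messages format used in this chapter, with JSON assistant responses. It counts empty responses, responses that do not parse as JSON, distinct system prompts, and user inputs that map to more than one answer. It cannot tell you whether the answers are correct, which still requires human review.

```python
import json
from collections import defaultdict

def audit_dataset(examples):
    """Report common data problems in a list of {"messages": [...]} examples."""
    issues = {"empty_response": 0, "invalid_json": 0,
              "system_prompts": set(), "contradictions": 0}
    answers_by_input = defaultdict(set)

    for ex in examples:
        # Assumes one message per role (single-turn conversations).
        by_role = {m["role"]: m["content"] for m in ex["messages"]}
        issues["system_prompts"].add(by_role.get("system", ""))
        answer = by_role.get("assistant", "").strip()
        if not answer:
            issues["empty_response"] += 1
            continue
        try:
            json.loads(answer)
        except json.JSONDecodeError:
            issues["invalid_json"] += 1
        answers_by_input[by_role.get("user", "")].add(answer)

    # The same user input mapped to more than one distinct answer.
    issues["contradictions"] = sum(
        1 for answers in answers_by_input.values() if len(answers) > 1
    )
    return issues

def make_example(user, assistant, system="Respond in JSON."):
    """Build one example in the chat format (illustrative data only)."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

sample = [
    make_example("Great phone", '{"sentiment": "positive"}'),
    make_example("Great phone", '{"sentiment": "neutral"}'),  # Contradiction.
    make_example("Broke in a week", ""),                      # Empty response.
]
report = audit_dataset(sample)
print(report)
```

More than one entry in system_prompts means the system prompt is not identical across examples.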

2. Number of Epochs

Symptom | Diagnosis | Fix
Loss barely decreases | Underfitting | Increase epochs (3 to 5-10)
Loss drops then rises | Overfitting | Decrease epochs (3 to 1-2)
Train loss low, eval loss high | Overfitting | Decrease epochs, add more data
Both losses plateau | Learning rate too low | Increase learning rate

For small datasets (under 1,000 examples), 1-3 epochs is usually sufficient. For larger datasets (5,000+), 1 epoch may be enough.

3. LoRA Rank

Rank | Trainable Params (8B model) | Best For
8 | ~21.8M (0.27%) | Simple format changes
16 | ~43.6M (0.53%) | General-purpose fine-tuning (default)
32 | ~87.3M (1.06%) | Complex domain adaptation
64 | ~174.6M (2.13%) | Maximum expressiveness

Higher rank is not always better. If your task is simple (e.g., “always respond in JSON”), rank 8 is sufficient and trains faster. If your task requires the model to learn complex new patterns, rank 32 or 64 may help.
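
The parameter counts in the table follow from simple arithmetic: each adapted projection adds rank x (input_dim + output_dim) trainable values. The sketch below reproduces the figures under assumed Qwen3-8B dimensions (hidden size 4096, MLP width 12288, 36 layers, 32 query heads and 8 key/value heads of size 128) with all seven projection modules targeted. Treat these dimensions as assumptions and check them against the model's config.json.

```python
# Assumed Qwen3-8B dimensions (verify against the model's config.json).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 12288   # intermediate_size (MLP width)
LAYERS = 36            # num_hidden_layers
Q_OUT = 32 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim

# (input_dim, output_dim) for each targeted projection in one layer.
MODULES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(rank):
    """Each module adds rank * (in + out): A is rank x in, B is out x rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULES.values())
    return per_layer * LAYERS

for r in (8, 16, 32, 64):
    n = lora_trainable_params(r)
    print(f"rank {r:2d}: {n / 1e6:6.1f}M trainable ({n / 8.2e9 * 100:.2f}% of 8.2B)")
```

Because the count is linear in the rank, doubling the rank doubles the adapter size and the optimizer state that goes with it.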

4. Learning Rate

After data quality, the learning rate is the setting that training is most sensitive to. Too high and the model diverges (loss spikes or oscillates). Too low and the model barely learns.

Learning Rate | Typical Use
1e-4 | Conservative, good for large datasets
2e-4 | Default for most LoRA fine-tuning
5e-4 | Aggressive, good for small datasets with few epochs
1e-3 | Very aggressive, risk of instability

If training is unstable (loss spikes), reduce the learning rate by 2-5x. If training is too slow (loss barely moves), increase it by 2-5x.

5. Batch Size

Larger effective batch sizes produce more stable gradients but require more memory. If you cannot increase per_device_train_batch_size due to VRAM limits, increase gradient_accumulation_steps instead. An effective batch size of 8-16 works well for most fine-tuning tasks.
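
The effective batch size is the product of the per-device batch size, the number of gradient accumulation steps, and the number of GPUs. A small sketch of the arithmetic follows. The helper name accumulation_steps_for is hypothetical and only illustrates how to pick gradient_accumulation_steps for a target.

```python
import math

def accumulation_steps_for(target_effective_batch, per_device_batch, num_gpus=1):
    """Smallest gradient_accumulation_steps that reaches the target effective batch."""
    return math.ceil(target_effective_batch / (per_device_batch * num_gpus))

# Settings used in this chapter's walkthrough: batch size 2, 4 accumulation steps, 1 GPU.
effective = 2 * 4 * 1
print(effective)

# If VRAM only allows a per-device batch of 1, reach 16 through accumulation instead.
print(accumulation_steps_for(16, per_device_batch=1))
```

Accumulation trades wall-clock time for memory: the optimizer step sees gradients averaged over the larger batch, but each micro-batch is still processed one after another.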


Beyond LoRA: Other Fine-tuning Methods

LoRA is the dominant fine-tuning method in 2026, but it is not the only option. Here are the alternatives and when to consider them:

DoRA: Weight-Decomposed Low-Rank Adaptation

DoRA (Liu et al., arXiv:2402.09353, ICML 2024) improves on LoRA by decomposing weight updates into magnitude and direction components. Standard LoRA updates both magnitude and direction simultaneously through the low-rank matrices. DoRA separates them: it trains a magnitude vector (one scalar per output dimension) and uses LoRA for the directional update only.

The result: DoRA consistently outperforms LoRA across multiple benchmarks and model families (LLaMA, LLaVA, Stable Diffusion XL) with minimal additional parameters (one magnitude vector per adapted layer). It is supported in Hugging Face PEFT and Unsloth.

# Using DoRA instead of LoRA (in Unsloth or PEFT):
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    use_dora=True,  # Enable DoRA decomposition.
)
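
To see what the decomposition means numerically, here is a toy sketch in plain Python. It follows the description above: one magnitude value per output dimension (per row of the weight matrix), with the low-rank update only changing direction. The matrices are made up, delta stands in for the scaled B @ A product, and real implementations such as the one in PEFT differ in details, so treat this as an illustration rather than the library's code.

```python
import math

def row_norms(W):
    """L2 norm of each row, i.e., one value per output dimension."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]

def dora_weight(W0, delta, magnitude):
    """Rescale each row of (W0 + delta) to unit length, then apply its magnitude."""
    combined = [[w + d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(W0, delta)]
    norms = row_norms(combined)
    return [[m * v / n for v in row]
            for row, m, n in zip(combined, magnitude, norms)]

W0 = [[3.0, 4.0],
      [0.0, 2.0]]
magnitude = row_norms(W0)  # Initialized from the pretrained weights.
zero_delta = [[0.0, 0.0], [0.0, 0.0]]

# With a zero update, the decomposition reproduces the pretrained weights.
W_init = dora_weight(W0, zero_delta, magnitude)
print(W_init)

# A directional update changes where each row points, but not its length.
delta = [[1.0, -1.0], [2.0, 0.0]]
W_new = dora_weight(W0, delta, magnitude)
print(row_norms(W_new))
```

The row lengths change only when the magnitude vector itself is trained, which is the separation of magnitude and direction that DoRA introduces.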

Source: DoRA (arXiv:2402.09353, ICML 2024) decomposes weight updates into magnitude and direction, consistently outperforms LoRA with minimal overhead (confirmed from arxiv.org/html/2402.09353v6, developer.nvidia.com/blog/introducing-dora, huggingface.co/papers/2402.09353).

Full Fine-tuning

Full fine-tuning updates every parameter in the model. For Qwen3-8B, that means updating all 8.2 billion parameters. This requires:

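  • Roughly 130 GB of GPU memory for model state alone under standard mixed-precision AdamW (FP16 weights and gradients plus FP32 master weights and optimizer states, about 16 bytes per parameter), before counting activations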
  • Multiple GPUs for 8B+ models (2-4x A100 80 GB or equivalent)
  • Significantly longer training time
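
The memory requirement can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a standard mixed-precision AdamW setup, in which each parameter carries a 16-bit weight, a 16-bit gradient, a 32-bit master copy, and two 32-bit optimizer moments. This is one common configuration, not the only one: 8-bit optimizers, pure BF16 training, or sharding across GPUs change the numbers, and activations come on top.

```python
PARAMS = 8.2e9  # Parameter count quoted above for Qwen3-8B.

# Bytes per parameter under mixed-precision AdamW (an assumed, common setup).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_gb(num_params, bytes_per_param):
    """Memory for weights, gradients, and optimizer state, excluding activations."""
    return num_params * sum(bytes_per_param.values()) / 1e9

estimate = training_state_gb(PARAMS, BYTES_PER_PARAM)
print(f"{estimate:.0f} GB")
```

This is why full fine-tuning of an 8B model is usually spread across several 80 GB GPUs, while QLoRA fits on one consumer card.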

Full fine-tuning produces the best results when you have a large, high-quality dataset (10,000+ examples) and sufficient compute. But for most practical use cases, QLoRA with rank 16-32 achieves 95-99% of full fine-tuning quality at a fraction of the cost.

GRPO: Reinforcement Learning for Reasoning

The biggest shift in fine-tuning since LoRA is GRPO (Group Relative Policy Optimization), the reinforcement learning algorithm DeepSeek developed to train their R1 reasoning models (Chapter 15). Instead of supervised fine-tuning on input-output pairs, GRPO trains the model to reason through problems by rewarding correct final answers, without requiring a separate critic model (unlike PPO, which needs a value function that roughly doubles memory usage).

The key idea: generate multiple candidate responses for each prompt, score them with a verifiable reward function (e.g., “did the math answer match the ground truth?”), and update the model to make higher-scoring responses more likely. The model learns to allocate more “thinking time” to harder problems, producing the chain-of-thought reasoning behavior seen in DeepSeek-R1 and similar models.
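
The "group relative" part can be shown in a few lines. In the sketch below, each candidate's reward is compared with the other candidates for the same prompt: the group mean is subtracted and the result is divided by the group's standard deviation. The function name, the epsilon value, and the use of the population standard deviation are assumptions of this illustration. Library implementations differ in such details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other candidates for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # Population std; an assumption of this sketch.
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidate answers to one prompt; two matched the ground truth.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in advantages])

# If every candidate gets the same reward, the advantages are all zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)
```

Candidates that beat their group average get a positive advantage and become more likely; the rest get a negative one. Because the baseline comes from the group itself, no separate critic model is needed. The second call also shows why prompts that are too easy or too hard contribute no learning signal.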

Unsloth supports GRPO with the same memory efficiency as SFT. You can train a reasoning model on as little as 5 GB of VRAM with Qwen3-1.7B using FP8 precision, or run GRPO on Qwen3-8B with a consumer GPU. As of March 2026, Unsloth also supports FP8-precision GRPO training on consumer GPUs (RTX 40 and 50 series), and long-context GRPO reaching 110K tokens on an 80 GB H100 with Qwen3-8B via vLLM and QLoRA. The training loop uses TRL’s GRPOTrainer:

from trl import GRPOTrainer, GRPOConfig
import re

# GRPO requires a reward function, not labeled outputs.
# The function receives prompts and completions as keyword arguments,
# plus any extra dataset columns. Use **kwargs for forward compatibility.
def reward_fn(completions, ground_truth, **kwargs):
    """Score each completion by checking if the boxed answer matches."""
    matches = [re.search(r"\\boxed\{(.*?)\}", c) for c in completions]
    contents = [m.group(1) if m else "" for m in matches]
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,
    args=GRPOConfig(
        output_dir="./grpo-output",
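        # Generate 4 candidates per prompt. TRL requires the effective batch
        # size to be divisible by num_generations (exact rule varies by version).
        num_generations=4,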
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,  # Must have "prompt" and "ground_truth" columns.
)
trainer.train()

The reward function above follows the TRL convention: it accepts completions and any dataset columns (here, ground_truth) as keyword arguments, and returns a list of floats. The **kwargs catch-all ensures forward compatibility as TRL adds new arguments. You can pass a single reward function or a list of them; when using multiple functions, the rewards are summed (optionally weighted through reward_weights in GRPOConfig).

GRPO is the right choice when you want the model to develop reasoning capabilities, not just follow a format. For structured extraction tasks like the review analyzer in this chapter, SFT with LoRA is simpler and more appropriate. For math, code, and multi-step reasoning tasks, GRPO can produce dramatically better results. Note that GRPO datasets use a different format than SFT: each example needs a prompt column (the question) and optionally columns like ground_truth for the reward function, rather than the messages format used in SFT.

Source: GRPO developed by DeepSeek for R1 reasoning models, eliminates value function to reduce memory by ~50% vs PPO (confirmed from arxiv.org/abs/2501.12948, huggingface.co/docs/trl/main/en/grpo_trainer, huggingface.co/blog/Weyaxi/engineering-handbook-grpo-lora-with-verl). Unsloth GRPO: 5 GB VRAM for Qwen3-1.7B FP8, 80% less VRAM than Hugging Face + FA2, FP8 GRPO on consumer GPUs, 110K context on H100 with Qwen3-8B (confirmed from unsloth.ai/blog/r1-reasoning, unsloth.ai/blog/grpo, docs.unsloth.ai/new/fp8-reinforcement-learning, docs.unsloth.ai/new/grpo-long-context).

When to Use Each Method

Method | VRAM (8B model) | Quality | Speed | Best For
QLoRA SFT (rank 16) | ~6-8 GB | Very good | Fast | Format adaptation, domain tasks
LoRA SFT (FP16 base) | ~18-20 GB | Very good | Medium | When you have more VRAM
DoRA (rank 16) | ~7-9 GB | Excellent | Fast | When LoRA quality is not enough
GRPO (QLoRA) | ~8-15 GB | Excellent for reasoning | Slower | Math, code, multi-step reasoning
Full fine-tuning | ~130+ GB (multi-GPU) | Best | Slow | Large datasets, maximum quality

A Complete Walkthrough: From Raw Data to Deployed Model

Let us tie everything together with a realistic end-to-end example. Suppose you work at a company that receives thousands of customer support tickets daily, and you want a model that classifies tickets into categories and extracts key information.

Step 1: Collect and Prepare Data

import json

# In practice, you would export these from your ticketing system.
# Here we show the format with a few examples.
tickets = [
    {
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a support ticket classifier. Given a customer "
                    "message, extract: category (billing/technical/shipping/"
                    "account/other), urgency (low/medium/high/critical), "
                    "and a one-sentence summary. Respond in JSON."
                )
            },
            {
                "role": "user",
                "content": (
                    "I was charged twice for my subscription this month. "
                    "The amounts are $29.99 each, posted on March 3 and "
                    "March 5. Please refund the duplicate charge."
                )
            },
            {
                "role": "assistant",
                "content": json.dumps({
                    "category": "billing",
                    "urgency": "high",
                    "summary": "Customer charged twice for subscription, requesting refund of duplicate $29.99 charge."
                }, indent=2)
            }
        ]
    },
    {
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a support ticket classifier. Given a customer "
                    "message, extract: category (billing/technical/shipping/"
                    "account/other), urgency (low/medium/high/critical), "
                    "and a one-sentence summary. Respond in JSON."
                )
            },
            {
                "role": "user",
                "content": (
                    "My package was supposed to arrive yesterday but the "
                    "tracking still shows it in transit. Order number is "
                    "ORD-2026-88421. Not urgent, just want an update."
                )
            },
            {
                "role": "assistant",
                "content": json.dumps({
                    "category": "shipping",
                    "urgency": "low",
                    "summary": "Package delayed, customer requesting tracking update for order ORD-2026-88421."
                }, indent=2)
            }
        ]
    },
]

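# Shuffle first so the eval set is not biased by export order,
# then split into train (80%) and eval (20%).
import random

random.seed(42)
random.shuffle(tickets)
split_idx = int(len(tickets) * 0.8)
train_set = tickets[:split_idx]
eval_set = tickets[split_idx:]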

# Save.
with open("tickets_train.jsonl", "w") as f:
    for t in train_set:
        f.write(json.dumps(t) + "\n")

with open("tickets_eval.jsonl", "w") as f:
    for t in eval_set:
        f.write(json.dumps(t) + "\n")

print(f"Train: {len(train_set)} examples, Eval: {len(eval_set)} examples")

Step 2: Fine-tune

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
)

dataset = load_dataset("json", data_files="tickets_train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./ticket-classifier",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        optim="adamw_8bit",
        seed=42,
    ),
)

trainer.train()
model.save_pretrained("ticket-classifier-adapter")
tokenizer.save_pretrained("ticket-classifier-adapter")

Step 3: Evaluate

import json
from datasets import load_dataset

eval_data = load_dataset("json", data_files="tickets_eval.jsonl", split="train")

correct = 0
parseable = 0
total = len(eval_data)

for example in eval_data:
    response = generate_response(model, tokenizer, example)
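    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        continue
    parseable += 1
    expected = json.loads(example["messages"][-1]["content"])
    # Guard against valid JSON that is not an object (e.g., a bare string).
    if isinstance(parsed, dict) and parsed.get("category") == expected.get("category"):
        correct += 1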

print(f"Parse rate:    {parseable}/{total} ({parseable/total*100:.1f}%)")
print(f"Category acc:  {correct}/{total} ({correct/total*100:.1f}%)")

Step 4: Deploy

# Export to GGUF for Ollama deployment.
model.save_pretrained_gguf(
    "ticket-classifier-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Create Ollama Modelfile.
modelfile_content = '''FROM ./ticket-classifier-gguf/unsloth.Q4_K_M.gguf
SYSTEM "You are a support ticket classifier. Given a customer message, extract: category (billing/technical/shipping/account/other), urgency (low/medium/high/critical), and a one-sentence summary. Respond in JSON."
PARAMETER temperature 0.1
'''

with open("Modelfile", "w") as f:
    f.write(modelfile_content)

print("Deploy with:")
print("  ollama create ticket-classifier -f Modelfile")
print("  ollama run ticket-classifier")

The entire pipeline, from raw data to a deployed local model, takes about 1-2 hours of hands-on work plus 30-60 minutes of training time on a single GPU.


Troubleshooting Common Issues

“CUDA out of memory”

This is the most common error. Solutions, in order of effectiveness:

  1. Reduce per_device_train_batch_size to 1.
  2. Reduce max_seq_length to 1024 or 512.
  3. Use gradient_checkpointing=True in TrainingArguments (trades compute for memory).
  4. Reduce LoRA rank from 16 to 8.
  5. Use a smaller model (Qwen3-4B instead of 8B).

Loss does not decrease

  1. Check your data format. The most common cause is incorrect chat template formatting. Print a few tokenized examples and verify they look correct.
  2. Increase the learning rate. Try 5e-4 instead of 2e-4.
  3. Check for data issues. Are there empty responses? Duplicate examples? Contradictory labels?

Model outputs garbage after fine-tuning

  1. You may have trained too long. Reduce epochs from 3 to 1.
  2. The learning rate may be too high. Reduce to 1e-4.
  3. The adapter may not be loading correctly. Verify the adapter path and that the base model matches.

Model ignores the fine-tuning and behaves like the base model

  1. The LoRA alpha may be too low. Increase lora_alpha (try 64 instead of 32).
  2. You may not be targeting enough modules. Add more target_modules.
  3. The dataset may be too small. Add more training examples.

Key Takeaways

  • Fine-tuning changes the model’s weights; prompting changes the model’s input. Fine-tuning is permanent and requires compute. Prompting is temporary and requires only good instructions. Use fine-tuning when prompting alone cannot reliably achieve your desired behavior.

  • QLoRA makes fine-tuning accessible on consumer hardware. By loading the base model in 4-bit precision and training small LoRA adapters in 16-bit, you can fine-tune an 8.2-billion-parameter model on a single GPU with 8 GB of VRAM. The trainable parameters are about 0.5% of the total model.

  • Data quality matters more than data quantity. 500 high-quality, consistent examples will outperform 10,000 noisy ones. Invest time in curating your training data before tuning hyperparameters.

  • The LoRA adapter is small and portable. A rank-16 adapter for an 8B model is 50-150 MB. You can store multiple adapters for different tasks, all sharing the same base model. vLLM can serve multiple adapters from a single base model instance.

  • Evaluation requires held-out data and task-specific metrics. Training loss alone does not tell you whether the model is useful. Test on examples the model has never seen, and measure the metrics that matter for your use case (parse rate, accuracy, format compliance).

  • Deployment is straightforward. Merge the adapter into the base model, export to GGUF, and run with Ollama. The result is a single file that runs on any machine with Ollama installed, with no Python environment or internet connection required.

  • DoRA improves on LoRA by decomposing weight updates into magnitude and direction components. It consistently outperforms standard LoRA with minimal additional overhead and is supported in Hugging Face PEFT and Unsloth.

  • GRPO enables reasoning fine-tuning. For tasks that require multi-step reasoning (math, code, logic), GRPO trains the model with reinforcement learning instead of supervised examples. It is the method behind DeepSeek-R1’s reasoning capabilities and runs on consumer hardware with Unsloth (as little as 5 GB VRAM for Qwen3-1.7B with FP8 precision).

  • Start simple, iterate. Begin with rank 16, learning rate 2e-4, 3 epochs, and a clean dataset. Evaluate. If the results are not good enough, adjust one hyperparameter at a time. Most fine-tuning failures are data problems, not hyperparameter problems.

  • The model ecosystem moves fast. This chapter uses Qwen3-8B (April 2025), but Qwen3.5-9B (March 2026) already offers a hybrid architecture with 262K native context and multimodal support under the same Apache 2.0 license. The fine-tuning workflow is identical across model families. Pick the best model available when you start your project, not the one in the tutorial.


What’s Next

You have fine-tuned a pretrained model on your own data and deployed it locally. In Chapter 29, you will build an agent with tool use: a system that can call external APIs, read files, search the web, and chain multiple steps together to answer complex questions, using function calling and the Model Context Protocol (MCP) described in Chapter 23.