
Unsloth: Fine-Tune LLMs 12x Faster on a Single GPU

If you have a model that keeps getting the same domain-specific task wrong and prompting isn’t fixing it, fine-tuning is the next step. The problem is that most fine-tuning tutorials assume a cluster of A100s. Unsloth fine-tuning changes that math. It rewrites the heavy parts of the training loop so you can fine-tune an 8B model on a single consumer GPU, in Colab’s free tier, or even on a laptop with a modest card.

This guide is for backend and ML engineers who want practical, production-relevant fine-tuning without renting a multi-GPU node. You will learn how Unsloth achieves its speedups, how to run a real QLoRA fine-tune end to end, and how to export the result to GGUF or vLLM for serving. By the end, you will have a working training script you can adapt to your own dataset.

What Is Unsloth?

Unsloth is an open-source library that makes LLM fine-tuning roughly 2x faster while using up to 80% less VRAM on a single GPU. It achieves this by manually rewriting the model’s forward and backward passes as optimized Triton kernels, rather than relying on the default PyTorch autograd path. The result is the same model quality with dramatically lower hardware requirements.

The headline numbers vary by configuration. The open-source single-GPU version typically delivers around 2x faster training and 60–80% memory savings versus a standard Hugging Face plus PEFT setup. Higher multipliers, including the “12x faster” figure you may see quoted, come from Unsloth’s multi-GPU and enterprise benchmarks. For most readers fine-tuning one model on one card, the realistic win is “this now fits on the GPU I already own.”

Unsloth supports a broad range of architectures: Llama 3.x, Mistral, Gemma, Qwen, Phi, and many vision and multimodal models on Hugging Face. It also covers reinforcement learning methods like GRPO, DPO, and ORPO, so it is not limited to plain supervised fine-tuning.

How Unsloth Achieves Its Speedup

The speedup is not a trick or an approximation. Unsloth keeps the math identical and optimizes the implementation. Three things drive the gains.

First, custom Triton kernels replace generic PyTorch operations for attention, RMSNorm, RoPE, and the cross-entropy loss. These kernels fuse multiple steps into one GPU pass, which cuts memory reads and writes.

Second, manual gradient computation avoids the overhead of PyTorch’s general-purpose autograd. Unsloth knows exactly which gradients a LoRA fine-tune needs and computes only those.

Third, dynamic 4-bit quantization for QLoRA recovers accuracy that naive 4-bit quantization loses. Instead of quantizing every layer the same way, Unsloth skips quantizing the layers most sensitive to precision loss, which keeps quality high while still saving memory.
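The idea of leaving precision-sensitive layers unquantized can be shown with a toy sketch. This is not Unsloth's actual algorithm: the absmax 4-bit rounding, the `error_threshold` value, the helper names, and the two tiny "layers" are all assumptions chosen purely for illustration.

```python
# Toy illustration of selective quantization (NOT Unsloth's real method).
# Each "layer" is a flat list of weights; names and threshold are made up.

def quantize_absmax_4bit(weights):
    """Round weights onto 15 signed levels (-7..7), then map back to floats."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(original, restored):
    """Average absolute difference between original and round-tripped weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

def choose_layers_to_quantize(layers, error_threshold):
    """Map each layer name to True (quantize) or False (keep in higher precision)."""
    decisions = {}
    for name, weights in layers.items():
        error = mean_abs_error(weights, quantize_absmax_4bit(weights))
        decisions[name] = error <= error_threshold
    return decisions

layers = {
    # Values spread across the range survive coarse rounding well
    "mlp.up_proj": [-0.7, -0.31, 0.02, 0.39, 0.7],
    # One outlier stretches the scale, so the small weights collapse to zero
    "attn.o_proj": [-0.2, 0.1, 0.3, -0.1, 8.0],
}
decisions = choose_layers_to_quantize(layers, error_threshold=0.05)
print(decisions)  # {'mlp.up_proj': True, 'attn.o_proj': False}
```

The outlier-heavy layer loses most of its small weights after the round trip, so the sketch keeps it in higher precision and quantizes the well-behaved one.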

Because the optimizations are implementation-level, you do not trade accuracy for speed. The loss curves match a standard run; you just reach them faster and on smaller hardware.

Prerequisites

Before you start, make sure you have the following in place.

  • An NVIDIA GPU with at least 3GB VRAM for small models, or 8GB+ for comfortable 8B QLoRA training. Unsloth supports CUDA capability 7.0 and up (Tesla T4, RTX 20-series, and newer).
  • Python 3.10 or later on Linux, WSL2, or Windows.
  • A Hugging Face account and access token if you plan to use gated models like Llama 3.1.
  • A dataset in a chat or instruction format. This guide uses a small instruction dataset, but the pattern is identical for your own data.

If you do not have a local GPU, the official Unsloth notebooks on Google Colab run on the free T4 tier. The code below works unchanged in that environment.
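If you want to check a machine against the CUDA capability floor before installing anything heavy, a small preflight script along these lines may help. It is a sketch: the helper names are hypothetical and not part of Unsloth, and it only assumes that PyTorch, if present, exposes `torch.cuda.get_device_capability`.

```python
# Preflight sketch: compare the GPU against the CUDA capability floor of 7.0
# listed in the prerequisites. Helper names are illustrative, not an Unsloth API.

MIN_CAPABILITY = (7, 0)

def capability_supported(major, minor, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return (major, minor) >= minimum

def detect_capability():
    """Return the first GPU's (major, minor), or None if it cannot be read."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)

if __name__ == "__main__":
    found = detect_capability()
    if found is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        status = "supported" if capability_supported(*found) else "too old"
        print(f"Compute capability {found[0]}.{found[1]}: {status}")
```

A T4 reports capability 7.5 and passes; an older Pascal card such as a GTX 1080 reports 6.1 and does not.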

Installing Unsloth

For a local Linux or WSL setup, install via pip. Unsloth pins compatible versions of its dependencies, so installing it alone pulls the correct stack.

# Create an isolated environment first
python -m venv .venv
source .venv/bin/activate

# Install Unsloth (pulls a compatible torch, transformers, trl, peft)
pip install unsloth

# Verify the install and your GPU is visible
python -c "import unsloth; import torch; print('CUDA available:', torch.cuda.is_available())"

Unsloth may print a startup banner when it is imported; the final line should confirm the GPU is detected:

CUDA available: True

On Windows without WSL, install a CUDA-enabled PyTorch build first, then add Unsloth. The maintained Docker image is the most reliable path if dependency conflicts appear, since it ships a known-good combination of CUDA, PyTorch, and Triton.

Loading a Model With FastLanguageModel

The core of Unsloth is the FastLanguageModel class. It loads a base model already wrapped with the optimized kernels and 4-bit quantization. You then attach LoRA adapters to it.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Unsloth handles RoPE scaling automatically
dtype = None           # None = auto-detect (bf16 on Ampere+, else fp16)
load_in_4bit = True    # 4-bit QLoRA: the big VRAM saver

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

The unsloth/...-bnb-4bit model names are pre-quantized uploads that download faster and skip the local quantization step. You can also pass a standard Hugging Face model ID; Unsloth quantizes it on load when load_in_4bit=True.

Why this matters: loading in 4-bit is what brings an 8B model’s weights from roughly 16GB at 16-bit precision down to under 6GB, which is the difference between fitting on a free T4 and not fitting at all.
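The arithmetic behind those figures is easy to reproduce. The sketch below assumes a rounded count of 8 billion parameters and counts raw weight storage only; the function name is made up for this example.

```python
# Back-of-the-envelope weight memory for an 8B model at different precisions.
# Counts raw weight storage only; PARAMS_8B is a rounded assumption.

def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS_8B = 8_000_000_000  # rounded; the real parameter count is slightly higher

fp16_gb = weight_memory_gb(PARAMS_8B, 16)
int4_gb = weight_memory_gb(PARAMS_8B, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# 16-bit: 16.0 GB, 4-bit: 4.0 GB
```

The loaded 4-bit model takes somewhat more than the raw 4 GB because quantization constants and any layers kept in higher precision add overhead, and activations, gradients, and optimizer state come on top of the weights during training.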

Adding LoRA Adapters

Full fine-tuning updates every weight, which needs far more memory. Instead, you train small LoRA adapters and freeze the base model. Unsloth configures this with get_peft_model.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank: higher = more capacity, more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,           # Scaling; a common default is alpha = r
    lora_dropout=0,          # 0 is optimized and fine for most runs
    bias="none",             # "none" is optimized
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient variant
    random_state=3407,
)

The target_modules list covers attention and MLP projections, which is the standard choice for instruction tuning. The use_gradient_checkpointing="unsloth" setting is important: it uses Unsloth’s own implementation that trades a little compute for a large drop in activation memory, enabling longer context lengths on the same card.

Why rank 16: for most domain adaptation and style tuning, ranks of 8–32 are enough. Higher ranks help when you teach genuinely new knowledge, but they also raise memory use and overfitting risk on small datasets.
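To see how rank translates into trainable parameters, you can count the adapter weights directly. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, key/value width 1024, 32 layers) and the seven target modules listed above; the constant and function names are illustrative.

```python
# Count LoRA trainable parameters for a given rank.
# Assumed Llama 3.1 8B shapes: hidden 4096, MLP 14336, KV width 1024, 32 layers.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (input features, output features) of each targeted projection
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(rank):
    """LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out)."""
    per_layer = sum(
        rank * (fan_in + fan_out) for fan_in, fan_out in MODULE_SHAPES.values()
    )
    return per_layer * NUM_LAYERS

for rank in (8, 16, 32):
    print(f"r={rank}: {lora_trainable_params(rank):,} trainable parameters")
# r=8: 20,971,520 trainable parameters
# r=16: 41,943,040 trainable parameters
# r=32: 83,886,080 trainable parameters
```

Under these assumptions, rank 16 trains roughly 42 million parameters, about half a percent of the base model, and the count scales linearly with rank.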

Preparing Your Dataset

Your data needs to match a chat template the model understands. Unsloth ships helpers to apply the correct template for each model family. The example below uses a small instruction dataset, but you swap in your own by loading any Hugging Face dataset or local JSON.

from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",  # Match the base model's template
)

def format_prompts(examples):
    # Render each conversation into one training string using the chat template
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
        for c in convos
    ]
    return {"text": texts}

# Replace this with your own dataset path or HF repo
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:2000]")
# FineTome stores ShareGPT-style {"from", "value"} turns; convert them to
# {"role", "content"} messages so apply_chat_template can read them
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(format_prompts, batched=True)

Each conversation should be a list of role/content messages, for example [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]. Applying the chat template here, before training, guarantees the model sees the exact format it will see at inference time. A mismatch here is the single most common cause of a fine-tune that “trained fine” but behaves oddly in production.
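Datasets do not always arrive in role/content form; some use ShareGPT-style turns with "from" and "value" keys instead. The plain-Python sketch below shows one way to convert such turns and run a basic sanity check before training. The role mapping and both function names are assumptions for illustration, not a library API.

```python
# Sketch: convert ShareGPT-style turns to role/content messages and validate them.
# ROLE_MAP and both function names are illustrative assumptions.

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}
VALID_ROLES = set(ROLE_MAP.values())

def sharegpt_to_messages(turns):
    """Map [{"from": ..., "value": ...}] to [{"role": ..., "content": ...}]."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in turns]

def validate_conversation(messages):
    """Return a list of problems; an empty list means the conversation looks usable."""
    problems = []
    if not messages:
        problems.append("conversation is empty")
    for index, message in enumerate(messages):
        if set(message) != {"role", "content"}:
            problems.append(f"message {index}: expected exactly 'role' and 'content' keys")
            continue
        if message["role"] not in VALID_ROLES:
            problems.append(f"message {index}: unknown role {message['role']!r}")
        if not isinstance(message["content"], str) or not message["content"].strip():
            problems.append(f"message {index}: empty content")
    if messages and messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems

raw = [
    {"from": "human", "value": "What is a connection pool?"},
    {"from": "gpt", "value": "A cache of reusable database connections."},
]
messages = sharegpt_to_messages(raw)
print(messages)
print(validate_conversation(messages))  # an empty list means no problems found
```

Running a check like this over the whole dataset before training is cheap and surfaces malformed rows that would otherwise fail, or silently train badly, inside the chat template.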

Running the Fine-Tune

Unsloth integrates with TRL’s SFTTrainer, so the training loop itself is standard. Unsloth’s optimizations apply transparently underneath.

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # Effective batch size = 8
        warmup_steps=5,
        max_steps=60,                    # Use num_train_epochs for full runs
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",              # 8-bit optimizer saves more VRAM
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()

A few choices deserve explanation. The gradient_accumulation_steps setting lets you simulate a larger batch size without the memory cost of actually loading more samples at once. The adamw_8bit optimizer stores optimizer state in 8-bit instead of 32-bit. With LoRA, only the adapter weights carry optimizer state, so at rank 16 the saving is a few hundred megabytes rather than gigabytes, but it costs nothing and helps on cards where memory is tight. The max_steps=60 value is for a quick test run; for real training, switch to num_train_epochs=1 or higher and remove max_steps.
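The relationship between batch size, accumulation, and step count is simple arithmetic, sketched below with the values from the configuration above and the 2,000-example dataset slice. The function names are illustrative, and the steps-per-epoch figure is an approximation of what the trainer computes.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples that contribute to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

def optimizer_steps_per_epoch(num_examples, per_device_batch, grad_accum_steps, num_devices=1):
    """Approximate optimizer steps needed to see every example once."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / batch)

batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
steps = optimizer_steps_per_epoch(num_examples=2000, per_device_batch=2, grad_accum_steps=4)
seen_in_test_run = 60 * batch  # examples touched by the max_steps=60 smoke test

print(f"effective batch size: {batch}")               # effective batch size: 8
print(f"steps per epoch: {steps}")                    # steps per epoch: 250
print(f"examples seen in 60 steps: {seen_in_test_run}")  # examples seen in 60 steps: 480
```

In other words, the 60-step test run covers under a quarter of the 2,000 examples, which is enough to confirm the pipeline works but not to judge final quality.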

During training you will see the loss logged each step. On a free Colab T4, a 60-step run on an 8B model finishes in a few minutes, which makes iteration on hyperparameters genuinely fast.

Testing the Fine-Tuned Model

Before exporting, sanity-check that the model behaves the way you wanted. Unsloth’s for_inference call switches the model into its 2x-faster inference path.

FastLanguageModel.for_inference(model)  # Enable optimized inference

messages = [{"role": "user", "content": "Explain connection pooling in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# temperature only takes effect when sampling is enabled
outputs = model.generate(input_ids=inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Run a handful of representative prompts that reflect your real use case. If the answers regress on general tasks, you likely trained too long or with too high a learning rate. If they barely changed, your rank or dataset may be too small for the behavior you want.

Exporting for Production

A trained model is only useful once it is deployed. Unsloth gives you three export paths depending on where the model will run.

For local inference with Ollama or LM Studio, export to GGUF with a quantization level:

# Save a quantized GGUF for llama.cpp-based runtimes
model.save_pretrained_gguf("model-gguf", tokenizer, quantization_method="q4_k_m")

For high-throughput serving, merge the LoRA adapters into 16-bit weights that vLLM can load directly:

# Merge adapters into full 16-bit weights for vLLM / TGI
model.save_pretrained_merged("model-merged", tokenizer, save_method="merged_16bit")

To keep the adapter small and portable, save just the LoRA weights. The adapter is often around 100MB, versus gigabytes for the merged model, and you load it on top of the base model at serve time. If you want to understand the quantization formats these exports use, our guide on LLM quantization with GGUF, AWQ, and GPTQ breaks down the trade-offs.
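For the adapter-only path, a minimal sketch looks like the following. It assumes `model` is the PEFT-wrapped model returned by get_peft_model, in which case `save_pretrained` writes the adapter weights and config rather than the full base model. The wrapper function and the directory name are hypothetical.

```python
def save_lora_adapter(model, tokenizer, output_dir="lora-adapter"):
    """Save only the LoRA adapter and tokenizer files (hypothetical helper).

    Assumes `model` is a PEFT-wrapped model, so save_pretrained writes the
    adapter weights and config instead of the full base model.
    """
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir

# Usage after training, with the model and tokenizer from the earlier steps:
# save_lora_adapter(model, tokenizer, "lora-adapter")
```

Saving the tokenizer alongside the adapter keeps the chat template with the weights, so whoever loads the adapter later gets the same prompt format used in training.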

A Realistic Fine-Tuning Scenario

Consider a small team building an internal support assistant. Their base model answers general questions well, but it keeps formatting API responses inconsistently and ignores the company’s specific terminology. Prompting helped a little, but the instructions kept getting diluted in long conversations.

They assembled a few thousand example conversations from past support tickets, each showing the desired tone and format. Using Unsloth QLoRA on a single RTX 4090, a one-epoch run took under an hour. The resulting adapter was small enough to version in their model registry alongside the base weights.

The key trade-off they hit was data quality over quantity. Their first run used 500 lightly-edited tickets and the model picked up sloppy formatting from the raw data. After cleaning and standardizing the examples, a similar-sized dataset produced markedly better results. Fine-tuning amplifies whatever patterns live in your data, including the bad ones, so the curation effort matters more than raw example count.

When to Use Unsloth Fine-Tuning

  • You need a model to consistently follow a specific format, tone, or domain vocabulary that prompting cannot reliably enforce.
  • You want to fine-tune on a single consumer or free-tier GPU rather than renting a multi-GPU node.
  • You are iterating quickly and need fast training runs to test datasets and hyperparameters.
  • You plan to deploy locally via GGUF or self-host with vLLM and want a clean export path.

When NOT to Use Unsloth Fine-Tuning

  • Your problem is really about retrieving up-to-date facts. Reach for retrieval first; see fine-tuning vs RAG to decide which one fits.
  • You need full fine-tuning across many GPUs for pretraining-scale work, where a framework built for distributed training fits better.
  • You have only a handful of examples. Few-shot prompting will usually beat a fine-tune trained on too little data.
  • You need multi-GPU scaling in the free open-source tier, which is limited compared to the paid offering.

Common Mistakes With Unsloth Fine-Tuning

  • Using the wrong chat template, so the model trains on a format it never sees at inference. Always apply the template that matches your base model.
  • Setting the learning rate too high, which causes the model to forget general capabilities. Start around 2e-4 for LoRA and lower it if quality regresses.
  • Training for too many epochs on a small dataset, leading to overfitting and repetitive outputs.
  • Skipping a held-out evaluation. Without testing on prompts the model never trained on, you cannot tell adaptation from memorization.
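Setting aside a held-out slice takes only a few lines. The sketch below is one way to do it in plain Python, assuming your examples fit in a list; the function name, the 10% default, and the seed are arbitrary choices for illustration.

```python
import random

def split_holdout(examples, holdout_fraction=0.1, seed=3407):
    """Shuffle deterministically, then reserve a fraction for evaluation only."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    holdout_size = max(1, int(len(examples) * holdout_fraction))
    holdout = [examples[i] for i in indices[:holdout_size]]
    train = [examples[i] for i in indices[holdout_size:]]
    return train, holdout

examples = [f"conversation-{n}" for n in range(100)]
train, holdout = split_holdout(examples)
print(len(train), len(holdout))  # 90 10
```

Train only on the first list and keep the second for the representative-prompt checks. A fixed seed makes the split reproducible, so results from different runs stay comparable.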

Where Unsloth Fits in a Local LLM Stack

Fine-tuning is one piece of a larger self-hosting workflow. Once you export to GGUF, you serve the model with a local runtime; our walkthrough on running local LLMs with Ollama shows that side. For higher-throughput production serving of the merged weights, vLLM self-hosted serving is the natural next step. And if you are pushing larger models on limited hardware, the techniques in running 70B models on a Mac Mini pair well with quantized fine-tunes.

Conclusion

Unsloth fine-tuning makes adapting an LLM practical on hardware you already have. By rewriting the training kernels and using dynamic 4-bit QLoRA, it delivers roughly 2x faster runs and up to 80% less VRAM without sacrificing model quality. The workflow is straightforward: load with FastLanguageModel, attach LoRA adapters, train with SFTTrainer, test, and export to GGUF or merged weights.

Start with a small, clean dataset and a short training run to validate your pipeline before scaling up. Then decide where the model will live. If you are still weighing whether to fine-tune at all, read fine-tuning vs RAG first; for many use cases, the right answer is a combination of both.
