
Fine-Tune Qwen3-30B MoE on One GPU With Unsloth

A year ago, the idea of training a 30-billion-parameter model on hardware you can rent for a few dollars an hour sounded absurd. Today it is routine. If you want to fine-tune Qwen3-30B, a mixture-of-experts model that activates only 3B parameters per token, you can do it on a single 24GB GPU thanks to Unsloth’s memory optimizations and 4-bit QLoRA. This tutorial walks through the full pipeline: environment setup, dataset formatting, training configuration tuned for MoE, and exporting to GGUF for local inference.

This guide is for backend and ML engineers who already understand the basics of supervised fine-tuning and want a practical, reproducible recipe rather than a research paper. By the end, you will have a trained adapter, a merged model, and a clear sense of when this approach beats the alternatives.

What Makes Qwen3-30B-A3B Different

Qwen3-30B-A3B is a mixture-of-experts (MoE) model with roughly 30.5B total parameters but only about 3.3B active per forward pass. It routes each token through 8 of its 128 experts, which means inference and training touch a fraction of the weights at any moment. As a result, you get the quality of a large model with the compute profile of a much smaller one.
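To make the routing step concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It illustrates the idea (softmax over router scores, keep the best 8 of 128, renormalize) and is not Qwen3's actual implementation; the helper name `route_token` and the random scores are made up for the example.

```python
import math
import random

NUM_EXPERTS = 128   # total experts, as stated above
TOP_K = 8           # experts activated per token, as stated above

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts.

    Hypothetical helper: softmax over all logits, keep the top_k
    probabilities, then renormalize them so the weights sum to 1.
    """
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k highest-probability experts
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

random.seed(0)
# Stand-in for a real router output; a trained router produces these scores
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

print(f"experts used: {len(selected)} of {NUM_EXPERTS}")
print(f"weights sum to: {sum(w for _, w in selected):.3f}")
```

Only the selected experts run for that token, which is why compute per token tracks the active parameter count rather than the total.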

This architecture matters for fine-tuning because memory pressure comes from two places: the static weights and the dynamic activations. The MoE design keeps activation costs low since only a slice of experts fire per token. However, the full weight set still has to live in memory, which is exactly the problem QLoRA solves by loading those weights in 4-bit precision.

The model ships under Apache 2.0, supports a 32K native context (extendable to 128K via YaRN), and includes a hybrid reasoning mode you can toggle. For most fine-tuning tasks, you will train the non-reasoning instruction-following behavior, though the same recipe adapts to reasoning data.

Why Unsloth for MoE Fine-Tuning

Unsloth rewrites the hot paths of transformer training with custom Triton kernels and a manual autograd implementation. Consequently, it cuts VRAM use significantly and speeds up training compared to a stock Hugging Face stack, with no loss in accuracy. For MoE models specifically, Unsloth added dedicated support so the expert routing layers train correctly under 4-bit quantization.

The practical payoff is concrete. Fine-tuning Qwen3-30B with vanilla transformers and peft would push you toward an 80GB A100 or multi-GPU sharding. With Unsloth’s dynamic 4-bit quantization and aggressive gradient checkpointing, the same job fits comfortably on a 24GB card such as an RTX 4090 or an A5000, and runs with headroom on a 48GB A6000.

If you are new to the library, our Unsloth fine-tuning guide for single-GPU LLMs covers the fundamentals this post builds on. For a broader view of the ecosystem, see Axolotl vs Unsloth vs TorchTune, which explains where each tool fits.

Prerequisites and Environment Setup

You need a CUDA-capable GPU with at least 24GB of VRAM, a recent NVIDIA driver, and Python 3.10 or newer. Cloud options such as a single RTX 4090 or L40S work well, and the recipe is identical on a local workstation.

Install Unsloth along with its dependencies. The library pins compatible versions of torch, transformers, trl, and bitsandbytes, so installing it through the official extra avoids version conflicts.

# Create an isolated environment first
python -m venv .venv && source .venv/bin/activate

# Install Unsloth with CUDA support; this also pulls in
# transformers, trl, peft, accelerate, and bitsandbytes
pip install "unsloth[cu124-torch260] @ git+https://github.com/unslothai/unsloth.git"

# Verify the GPU is visible before training
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.is_available())"

If the verification command prints your GPU name and True, you are ready. Should it fail, the usual culprit is a driver-toolkit mismatch rather than the package itself.
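If you want to see which package versions actually landed in the environment, a small standard-library check along these lines can help. The package list is an assumption based on the install comment above, and the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Packages the install step above is expected to pull in (assumed list)
REQUIRED = ["unsloth", "torch", "transformers", "trl", "peft", "bitsandbytes"]

def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Recording this output alongside each run makes it easier to reproduce a working setup later.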

Loading Qwen3-30B in 4-Bit

Unsloth exposes a FastLanguageModel loader that handles quantization, device placement, and the patching of attention kernels in one call. Load the pre-quantized 4-bit MoE checkpoint that Unsloth maintains, which downloads faster and avoids quantizing on the fly.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # raise to 4096+ if your data needs it and VRAM allows

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-30B-A3B-unsloth-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,        # 4-bit QLoRA base weights
    full_finetuning=False,    # we attach LoRA adapters, not full weights
)

The load_in_4bit flag is what keeps the 30B weight set under ~18GB. Importantly, you never update those frozen 4-bit weights directly. Instead, you train small low-rank adapters layered on top, which is the core idea behind QLoRA.
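The arithmetic behind that figure is easy to sketch. The snippet below estimates nominal weight storage at different precisions, assuming the roughly 30.5B total parameters quoted earlier; it ignores quantization constants and any layers kept in higher precision, so treat the 4-bit number as a lower bound.

```python
TOTAL_PARAMS = 30.5e9  # approximate total parameter count quoted earlier

def weight_gb(num_params, bits_per_weight):
    """Nominal storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_gb(TOTAL_PARAMS, bits):5.1f} GB")
```

At 16-bit the weights alone come to about 61 GB, which is why the unquantized model does not fit on a single 24GB card, while the nominal 4-bit figure is about 15 GB before overhead.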

Attaching LoRA Adapters

Next, wrap the model with LoRA adapters. The target modules include both the attention projections and the MoE expert layers, so the adapter can actually influence routing-dependent behavior.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank; 16 is a solid default
    lora_alpha=32,            # scaling, typically 2x the rank
    lora_dropout=0,           # 0 is optimized in Unsloth
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # MoE-aware checkpointing
    random_state=3407,
)

The use_gradient_checkpointing="unsloth" setting is not optional for a model this size. It trades a small amount of compute for a large reduction in activation memory by recomputing intermediate tensors during the backward pass. Without it, even a 48GB card will run out of memory on longer sequences.
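Returning to the rank and alpha settings in the adapter config, this short sketch shows how few parameters LoRA adds to a single linear layer and what the alpha-to-rank ratio works out to. The layer shape is an illustrative assumption, not a value read from the model's config.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns A (rank x d_in) and B (d_out x rank) instead of
    updating the full weight matrix.
    """
    return rank * (d_in + d_out)

RANK, ALPHA = 16, 32        # values from the adapter config above
D_IN, D_OUT = 2048, 4096    # illustrative layer shape (assumption)

added = lora_params(D_IN, D_OUT, RANK)
full = D_IN * D_OUT
print(f"LoRA params for this layer: {added:,} ({added / full:.2%} of the full matrix)")
print(f"update scale alpha / r:     {ALPHA / RANK}")
```

Doubling the rank doubles the adapter's parameter count, which is why trimming the rank is one of the levers for reducing memory.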

Preparing Your Dataset

The most common mistake in fine-tuning is feeding the model text in a format that does not match its chat template. Qwen3 uses a specific structure for system, user, and assistant turns, and the tokenizer already knows it. Use apply_chat_template rather than hand-building prompt strings.

Format your data as a list of message dictionaries. A small instruction dataset works for demonstration; in production, aim for at least a few thousand high-quality examples that reflect your actual task.

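from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

# Example: an instruction dataset in ShareGPT-style "conversations" format
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:5000]")

# ShareGPT rows use "from"/"value" keys; convert them to the
# "role"/"content" messages that apply_chat_template expects
dataset = standardize_sharegpt(dataset)

def format_chat(examples):
    # Render each conversation into one training string
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(format_chat, batched=True)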

The add_generation_prompt=False flag matters during training because each example already contains the complete assistant turn, and that turn is what the model learns from. Setting it to True appends an empty assistant header to the end of the text, which is only useful at inference time, when you want the model to write the next reply.
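To see what the flag changes, here is a deliberately simplified ChatML-style renderer. It is an approximation for illustration only: the real Qwen3 template also handles system prompts, reasoning tags, and tool calls, so keep using tokenizer.apply_chat_template in practice. The function name `render_chatml` is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML-style rendering, for illustration only."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn and leave it for the model to complete
        text += "<|im_start|>assistant\n"
    return text

convo = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A low-rank adapter method."},
]

# Training: both turns are present and closed
train_text = render_chatml(convo)
# Inference: only the user turn, followed by an open assistant header
infer_text = render_chatml(convo[:1], add_generation_prompt=True)

print(train_text)
print(infer_text)
```

The training string ends with a closed assistant turn, while the inference string ends with an open assistant header waiting for generated tokens.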

Quality beats quantity here. A focused dataset of 2,000 to 5,000 examples that match your domain usually outperforms 100,000 generic ones. If you are deciding whether fine-tuning is even the right tool, our guide on fine-tuning vs RAG breaks down which problems each approach actually solves.

Configuring the Trainer

Unsloth works with the SFTTrainer from TRL. The configuration below targets a single 24GB GPU. The effective batch size is the per-device batch times the gradient accumulation steps, so adjust those two numbers together to control memory.

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        per_device_train_batch_size=1,   # keep small for MoE on 24GB
        gradient_accumulation_steps=8,   # effective batch size of 8
        warmup_steps=10,
        num_train_epochs=1,              # 1-3 epochs is typical
        learning_rate=2e-4,              # standard for LoRA
        logging_steps=5,
        optim="adamw_8bit",              # 8-bit optimizer saves VRAM
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",                # set to "wandb" to track runs
    ),
)
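As a quick sanity check on these numbers, the sketch below works out the effective batch size and an approximate optimizer step count for the 5,000-example slice loaded earlier. It assumes one GPU and no sequence packing; the exact step count a trainer reports can differ slightly depending on how it handles a final partial batch.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Return (effective_batch_size, approximate_optimizer_steps) for a run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

batch, steps = training_steps(num_examples=5000, per_device_batch=1, grad_accum=8)
print(f"effective batch size: {batch}, optimizer steps: {steps}")
```

Knowing the step count up front helps you judge whether warmup_steps=10 and logging_steps=5 are sensible for the length of the run.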

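The adamw_8bit optimizer is a deliberate choice. AdamW keeps two state values (the first and second moments) for every trainable parameter, normally in 32-bit precision, so optimizer state costs about twice the memory of the 32-bit parameters it tracks. The 8-bit variant stores each of those states in 8 bits, which cuts that overhead to roughly a quarter. With QLoRA, only the LoRA adapters carry optimizer state anyway, which keeps the footprint small.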
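The optimizer-state arithmetic can be sketched in a few lines. This assumes 4 bytes per state value for standard AdamW and 1 byte per state value for the 8-bit variant, and it ignores the small block-wise quantization constants the 8-bit optimizer also stores. The trainable parameter count is a hypothetical placeholder; substitute the count reported for your own run.

```python
def adam_state_mb(trainable_params, bytes_per_state):
    """Memory for Adam's two per-parameter states (first and second moment)."""
    return trainable_params * 2 * bytes_per_state / 1e6

TRAINABLE = 100_000_000  # hypothetical adapter size, not measured

fp32_mb = adam_state_mb(TRAINABLE, bytes_per_state=4)
int8_mb = adam_state_mb(TRAINABLE, bytes_per_state=1)

print(f"32-bit AdamW states: {fp32_mb:.0f} MB")
print(f" 8-bit AdamW states: {int8_mb:.0f} MB ({int8_mb / fp32_mb:.0%} of 32-bit)")
```

Because the frozen 4-bit base weights have no optimizer state at all, this cost scales with the adapter size rather than with the 30B base model.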

Training Only on Responses

For instruction tuning, you usually want the loss computed only on the assistant’s responses, not the user’s prompts. Otherwise the model wastes capacity learning to reproduce questions. Unsloth provides a helper for this.

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)

These markers correspond to Qwen3’s chat template tokens. As a result, the trainer masks everything except the assistant turns when computing gradients, which produces noticeably better instruction-following.
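Conceptually, the masking works by replacing the labels of every non-response token with -100, the value PyTorch's cross-entropy loss ignores by default. The toy sketch below shows that idea on a hand-written token list; it treats each marker as a single token and is not Unsloth's implementation. The helper name `response_labels` and the `<user>` / `<assistant>` markers are hypothetical stand-ins.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def response_labels(tokens, token_ids, instruction_marker, response_marker):
    """Keep token ids only for tokens inside an assistant response."""
    labels, in_response = [], False
    for token, token_id in zip(tokens, token_ids):
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)   # the marker itself is not a target
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id if in_response else IGNORE_INDEX)
    return labels

tokens = ["<user>", "Hi", "there", "<assistant>", "Hello", "!",
          "<user>", "Bye", "<assistant>", "Goodbye"]
token_ids = list(range(len(tokens)))  # fake ids for the example

labels = response_labels(tokens, token_ids, "<user>", "<assistant>")
print(labels)
```

Only the positions that keep a real id contribute to the loss, so gradients come entirely from the assistant's words.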

Running the Fine-Tune

With everything wired up, launch training. Unsloth prints a memory summary at startup so you can confirm you have headroom before committing to a long run.

import torch

start_mem = torch.cuda.max_memory_reserved() / 1024**3
print(f"Reserved before training: {start_mem:.2f} GB")

trainer_stats = trainer.train()

peak_mem = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved during training: {peak_mem:.2f} GB")
print(f"Runtime: {trainer_stats.metrics['train_runtime']:.0f}s")

On a single RTX 4090, expect peak memory in the low 20s of gigabytes for a 2048-token sequence length with these settings. Should you hit an out-of-memory error, the first levers to pull are reducing max_seq_length, then lowering per_device_train_batch_size (already at 1 here), and finally trimming the LoRA rank.
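One way to keep that order of operations explicit is to write it down as a fallback ladder. The sketch below is a hypothetical helper, not part of Unsloth or TRL: it only generates candidate settings, and the floors for sequence length and rank are arbitrary assumptions. Applying a candidate still means reloading the model or rebuilding the adapter with the new values.

```python
def oom_fallbacks(config, min_seq_length=512, min_rank=4):
    """Yield progressively lighter configs in the order described above."""
    cfg = dict(config)  # work on a copy so the caller's dict is untouched
    while cfg["max_seq_length"] > min_seq_length:       # 1. shorten sequences
        cfg["max_seq_length"] //= 2
        yield dict(cfg)
    if cfg["per_device_train_batch_size"] > 1:           # 2. shrink the batch
        cfg["per_device_train_batch_size"] = 1
        yield dict(cfg)
    while cfg["lora_rank"] > min_rank:                   # 3. trim the LoRA rank
        cfg["lora_rank"] //= 2
        yield dict(cfg)

start = {"max_seq_length": 2048, "per_device_train_batch_size": 1, "lora_rank": 16}
for attempt in oom_fallbacks(start):
    print(attempt)
```

Trying the candidates in this order preserves adapter capacity for as long as possible, since sequence length usually has the largest effect on activation memory.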

Throughput depends heavily on your data length distribution, but the MoE architecture keeps step times reasonable because only the active experts participate in each forward pass. This is a genuine advantage over a dense 30B model, which would activate every parameter on every token.

Testing the Fine-Tuned Model

Before exporting anything, verify the adapter actually changed behavior. Switch the model to inference mode, which Unsloth optimizes separately, and run a prompt through the chat template.

FastLanguageModel.for_inference(model)  # ~2x faster generation

messages = [{"role": "user", "content": "Explain MoE routing in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,   # True at inference time
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Run several prompts that resemble your training data. If the outputs reflect your dataset’s style and knowledge, the fine-tune worked. If they look identical to the base model, your learning rate may be too low or your dataset too small to move the needle.

Saving and Exporting to GGUF

You have three export paths, depending on where the model will run. Each serves a different deployment target.

# 1. Save just the LoRA adapter (small, ~100-300MB)
model.save_pretrained("qwen3-30b-lora")
tokenizer.save_pretrained("qwen3-30b-lora")

# 2. Merge adapter into base weights in 16-bit for vLLM / Transformers
model.save_pretrained_merged(
    "qwen3-30b-merged",
    tokenizer,
    save_method="merged_16bit",
)

# 3. Export to GGUF for llama.cpp and Ollama (quantized for local use)
model.save_pretrained_gguf(
    "qwen3-30b-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # good size-to-quality balance
)

The GGUF export is the path most readers want for local inference. The q4_k_m quantization keeps quality high while shrinking the file dramatically. To understand which quantization format fits your needs, our LLM quantization guide covering GGUF, AWQ, and GPTQ explains the trade-offs in depth.

Once you have a GGUF file, you can serve it locally with llama.cpp or Ollama. For production serving of the merged 16-bit model, vLLM handles high-throughput inference far better than a naive generation loop; our guide to vLLM for self-hosted LLM serving covers the setup.

A Realistic Use Case

Consider a small team building an internal support assistant for a niche developer tool. The base Qwen3-30B model knows general programming well, but it has no knowledge of the team’s proprietary SDK, error codes, or naming conventions. RAG helps with documentation lookup, yet the model still phrases answers in a generic voice and misuses domain terms.

Over a few days, the team curates around 3,000 question-answer pairs from support tickets and internal docs. They fine-tune on a single rented L40S, iterating on learning rate and epoch count across a handful of short runs. The trade-off they accept is real: the fine-tune bakes in knowledge as of the training date, so they pair it with RAG for anything that changes frequently. The result is a model that speaks the product’s language natively while RAG keeps the facts current. This hybrid pattern, rather than fine-tuning alone, is what most production teams actually ship.

When to Fine-Tune Qwen3-30B

  • You need consistent tone, format, or domain vocabulary that prompting alone cannot enforce reliably
  • You have at least a few thousand high-quality, task-specific examples
  • Latency or cost rules out a frontier API and you want to self-host
  • The MoE efficiency matters because you want 30B-class quality at lower active compute

When NOT to Fine-Tune Qwen3-30B

  • Your knowledge changes often; reach for retrieval instead, since fine-tuning freezes information at training time
  • You have fewer than a few hundred examples, where few-shot prompting usually wins
  • A smaller dense model already meets your quality bar at lower operational cost
  • You only need the model occasionally, making a hosted API cheaper than maintaining infrastructure

Common Mistakes With Qwen3-30B Fine-Tuning

  • Skipping apply_chat_template and hand-rolling prompts, which silently breaks the format the model expects
  • Forgetting train_on_responses_only, so the model wastes capacity learning to echo prompts
  • Setting the learning rate too high, which causes the adapter to overfit or collapse on small datasets
  • Omitting gradient checkpointing and then blaming the GPU when training runs out of memory
  • Exporting to merged 16-bit when you actually wanted GGUF for local inference, roughly tripling your disk usage and download time compared with a q4_k_m file

Conclusion

The ability to fine-tune Qwen3-30B on a single consumer-grade GPU collapses a barrier that used to require a cluster. Unsloth’s 4-bit QLoRA, MoE-aware gradient checkpointing, and 8-bit optimizer make the 30B mixture-of-experts model trainable in roughly 24GB of VRAM, and the GGUF export path takes you straight to local inference. Start with a small, focused dataset, verify the adapter changes behavior, and only then scale up your data and epochs.

For your next step, compare your QLoRA results against a full pipeline in Axolotl vs Unsloth vs TorchTune, or explore GRPO fine-tuning for reasoning models if your task rewards step-by-step thinking rather than imitation.
