
If you have watched a model like DeepSeek R1 work through a math problem step by step and wondered how it learned to do that, the answer is reinforcement learning. Specifically, it is a technique called GRPO fine-tuning. This guide is for engineers who already know basic supervised fine-tuning and want to push further into teaching a model how to reason, not just what to answer.
By the end, you will understand what GRPO actually does, why it produces reasoning behavior that supervised fine-tuning struggles to create, and how to run a real training loop on a single consumer GPU. We will use Unsloth and the TRL library, write custom reward functions, and train a small model to show its work on math problems. The code is production-shaped, not a toy demo.
What Is GRPO Fine-Tuning?
GRPO fine-tuning is a reinforcement learning method that trains a language model to maximize a reward signal by comparing several answers to the same prompt and favoring the better ones. GRPO stands for Group Relative Policy Optimization. DeepSeek introduced it to train the R1 reasoning models, and it has since become the standard way to teach models to think before answering.
The key word is relative. Traditional reinforcement learning from human feedback (RLHF) uses an algorithm called PPO, which needs a separate value model to estimate how good each token is. That value model roughly doubles your memory footprint. GRPO throws it out. Instead, it samples a group of completions for one prompt, scores each with a reward function, and treats the group’s average score as the baseline. Completions above average get reinforced; those below get suppressed.
Because the baseline comes from the group itself, GRPO needs no critic network. As a result, it fits on hardware where PPO would not. That single design choice is why hobbyists can now train reasoning behavior that, until recently, required a research lab.
How GRPO Differs From PPO and Supervised Fine-Tuning
Supervised fine-tuning (SFT) teaches a model to imitate example outputs. You hand it prompt-response pairs, and it learns to reproduce them. This works well when you have the exact answers you want. However, it cannot teach a model to discover a good reasoning path, because you would have to write every correct chain of thought by hand.
Reinforcement learning flips the approach. Rather than showing the model what to say, you reward it for outcomes you like and let it find its own path there. The table below summarizes where each method fits.
| Aspect | Supervised Fine-Tuning | PPO (classic RLHF) | GRPO |
|---|---|---|---|
| Needs labeled outputs | Yes | No (uses reward model) | No (uses reward functions) |
| Separate value/critic model | No | Yes | No |
| Memory cost | Low | High | Moderate |
| Teaches reasoning paths | Weak | Strong | Strong |
| Reward signal | None | Learned reward model | Programmable functions |
Notably, GRPO lets you encode rewards as plain Python functions. You do not need to train a separate reward model first, which removes an entire stage of the pipeline. If you want to reward correct math answers, you write a function that checks the answer. That directness is a big part of why GRPO is approachable.
For a broader view of when training a model beats retrieval, see our guide on fine-tuning vs RAG. GRPO is a fine-tuning technique, so the same decision framework applies before you start.
Why Reasoning Models Need Reinforcement Learning
A reasoning model produces a visible chain of thought, then a final answer. DeepSeek R1-Zero demonstrated something surprising: a model trained purely with GRPO and a correctness reward learned to reason on its own. Nobody wrote the reasoning steps. The model discovered that thinking longer led to higher rewards, so it started thinking longer.
This emergent behavior is hard to get from SFT alone. When you fine-tune on human-written reasoning, the model copies the style of reasoning without necessarily learning why a step helps. With GRPO, the reward ties directly to whether the final answer is correct, so the model optimizes for actual problem-solving rather than surface imitation.
That said, pure RL from a base model is unstable and slow. In practice, most teams do a small SFT “cold start” first, then apply GRPO. We will focus on the GRPO stage here, since that is the part that creates reasoning behavior. If you want to explore how a frontier model exposes its thinking, our breakdown of Claude’s extended thinking shows the same idea from the API consumer’s side.
Prerequisites and Environment Setup
You will need a GPU with at least 16GB of VRAM for a small model like Qwen2.5-3B. A free Google Colab T4, a Kaggle notebook, or any RTX 3090/4090 will work. GRPO is memory-hungry during generation because it samples multiple completions per prompt, so smaller models are the sensible starting point.
We use Unsloth because it patches the model for faster training and lower memory use, and it ships first-class GRPO support. If you have not used it before, start with our Unsloth fine-tuning tutorial for the basics of loading and patching models.
Install the dependencies first:
# Install Unsloth, TRL (provides GRPOTrainer), and vLLM for fast sampling
pip install unsloth vllm
pip install --upgrade "trl>=0.14.0" "datasets>=3.0.0"
# Expected: pip resolves a CUDA-enabled torch build automatically.
# If you see a torch/CUDA mismatch, install torch for your CUDA version first.
The vllm package matters here. GRPO spends most of its time generating sample completions, and vLLM makes that generation several times faster than vanilla Hugging Face generation. Unsloth wires the two together for you.
Loading the Model With Unsloth
Load a small instruction-tuned model and attach LoRA adapters. We train only the adapters, which keeps memory low and means you can fine-tune a 3B model on a single 16GB card. The fast_inference=True flag turns on the vLLM generation path.
from unsloth import FastLanguageModel
import torch
max_seq_length = 1024 # Room for the reasoning chain plus the answer
lora_rank = 32 # Higher rank = more capacity, more memory
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen2.5-3B-Instruct",
max_seq_length=max_seq_length,
load_in_4bit=True, # 4-bit quantization to fit on a small GPU
fast_inference=True, # Use vLLM for fast sampling during GRPO
max_lora_rank=lora_rank,
gpu_memory_utilization=0.6, # Leave headroom for generation buffers
)
model = FastLanguageModel.get_peft_model(
model,
r=lora_rank,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=lora_rank,
use_gradient_checkpointing="unsloth", # Big memory savings on long sequences
random_state=3407,
)
The gpu_memory_utilization setting deserves attention. GRPO needs memory for both training and generation. If you set this too high, vLLM has no room to batch completions and you hit out-of-memory errors. On a 16GB card, 0.6 is a reasonable starting point. The 4-bit loading uses the same quantization ideas covered in our LLM quantization guide.
Preparing the Dataset
GRPO needs prompts, not full answers. The model generates its own completions; you only supply the question and, for the reward function, the ground-truth answer to check against. The GSM8K grade-school math dataset is the standard choice because answers are easy to verify automatically.
We format each example with a system prompt that asks for a reasoning section and an answer section. Clear structure makes rewards easier to compute.
from datasets import load_dataset
import re
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
Work through the problem step by step.
</reasoning>
<answer>
The final numeric answer only.
</answer>"""
def extract_gsm8k_answer(text: str) -> str:
# GSM8K stores the answer after a "####" marker
return text.split("####")[-1].strip().replace(",", "")
def build_dataset(split="train"):
data = load_dataset("openai/gsm8k", "main")[split]
return data.map(lambda ex: {
"prompt": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": ex["question"]},
],
"answer": extract_gsm8k_answer(ex["answer"]),
})
dataset = build_dataset("train")
Each row now has a chat-formatted prompt and a clean answer string. The trainer will pass the prompt to the model, sample several completions, and hand each completion to our reward functions alongside the known answer.
Writing Reward Functions
Reward functions are the heart of GRPO fine-tuning. They define what “better” means. A reward function receives the generated completions and returns a list of float scores, one per completion. You can combine several functions, and their rewards add up.
We will use two complementary signals. The first checks whether the final answer is correct, which is the outcome we actually care about. The second rewards following the required format, which stabilizes early training before the model produces correct answers reliably.
def extract_xml_answer(text: str) -> str:
# Pull the content between <answer> tags
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
return match.group(1).strip() if match else ""
def correctness_reward(prompts, completions, answer, **kwargs):
# Reward 2.0 for a correct final answer, 0.0 otherwise
responses = [c[0]["content"] for c in completions]
extracted = [extract_xml_answer(r) for r in responses]
return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
def format_reward(completions, **kwargs):
# Reward 0.5 when both tags are present and ordered correctly
pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
responses = [c[0]["content"] for c in completions]
return [
0.5 if re.search(pattern, r, re.DOTALL) else 0.0
for r in responses
]
Why split correctness and format into separate functions? Early in training, the model rarely gets the math right, so a correctness-only reward gives almost no signal. The format reward provides a gentle gradient toward well-structured output, which makes the answer extractable. Once structure is reliable, the correctness reward takes over as the dominant driver. This layering is a common pattern, and it prevents the training from stalling at zero reward.
Keep reward magnitudes sensible. If one reward dwarfs the others, the model optimizes only for that one. Here, correctness at 2.0 outweighs format at 0.5, which is intentional: format is a means, correctness is the goal.
Configuring and Running the GRPO Trainer
Now we wire everything into TRL’s GRPOTrainer. The configuration controls how many completions to sample per prompt (num_generations), how long they can be, and the usual learning-rate and batch settings. More generations per prompt give a more reliable group baseline but cost more memory and time.
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
learning_rate=5e-6,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
optim="adamw_8bit",
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
num_generations=8, # Group size: 8 samples per prompt
max_prompt_length=256,
max_completion_length=512, # Room for the reasoning chain
max_steps=500,
save_steps=100,
use_vllm=True, # Fast sampling via vLLM
output_dir="grpo_qwen_outputs",
report_to="none",
)
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[format_reward, correctness_reward],
args=training_args,
train_dataset=dataset,
)
trainer.train()
A few settings drive most of the behavior. The num_generations value sets the group size; eight is a good balance on a single GPU. The learning rate is deliberately tiny (5e-6) because RL updates are noisy and a large rate destabilizes the policy. Also note gradient_accumulation_steps: with a per-device batch of 1, you effectively train on 4 prompts per step, which smooths the gradient without raising peak memory.
Watch the reward in the logs as training runs. You should see the format reward climb quickly within the first 50 to 100 steps, followed by a slower rise in correctness reward. That ordering confirms the layered rewards are working as intended.
How to Read the Training Logs
GRPO logs more than loss. The numbers that matter are the per-function reward means and the completion length. When things go well, average reward trends upward and completion length grows as the model learns that longer reasoning earns more correct answers. This length increase is the same “thinking longer” behavior DeepSeek reported with R1-Zero.
If reward flatlines at zero, your reward function is probably never firing. Print a few completions and check that your regex matches the model’s actual output. If reward spikes then collapses, your learning rate is likely too high, or num_generations is too small to give a stable baseline. Lower the rate or raise the group size.
Saving and Using the Trained Adapters
After training, save the LoRA adapters and run inference. Because we trained adapters rather than the full model, the saved artifact is small, often only a few hundred megabytes.
# Save LoRA adapters
model.save_lora("grpo_qwen_lora")
# Run inference with the trained reasoning behavior
from vllm import SamplingParams
text = tokenizer.apply_chat_template([
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": "Natalia sold 48 clips in April and half as many in May. How many total?"},
], tokenize=False, add_generation_prompt=True)
output = model.fast_generate(
[text],
sampling_params=SamplingParams(temperature=0.7, max_tokens=512),
lora_request=model.load_lora("grpo_qwen_lora"),
)[0].outputs[0].text
print(output)
# Expect a <reasoning> block working through 48 + 24, then <answer>72</answer>
To merge the adapters into a standalone model for deployment, Unsloth can export to GGUF for local serving. Our walkthrough on running models on a Mac Mini covers the GGUF inference side once you have a merged model.
When to Use GRPO Fine-Tuning
- Your task has a verifiable outcome you can score with code, such as math, unit-test-passing code, or structured extraction
- You want the model to develop its own reasoning chains rather than copy human-written ones
- You have already tried prompt engineering and supervised fine-tuning and need a further accuracy gain
- You can express “good” as a reward function more easily than you can write thousands of labeled examples
When NOT to Use GRPO Fine-Tuning
- Your reward is purely subjective with no programmatic check (a learned reward model or SFT may fit better)
- You only have a small GPU and a large target model that will not fit even with 4-bit and LoRA
- A retrieval system would solve the problem, since adding knowledge is RAG’s job, not fine-tuning’s
- You need results this week, because RL runs are slower and noisier to tune than supervised passes
Common Mistakes With GRPO Fine-Tuning
- Writing a reward function whose regex never matches the model’s output, so reward stays at zero and nothing learns
- Setting
gpu_memory_utilizationtoo high, leaving vLLM no room to sample and triggering out-of-memory crashes - Using a learning rate borrowed from supervised fine-tuning; RL needs a much smaller rate to stay stable
- Choosing too small a group size, which makes the relative baseline noisy and training erratic
- Skipping the format reward and relying on correctness alone, which starves early training of any signal
A Realistic Training Scenario
Consider a small team adding a math-tutoring feature to an education app. Their base 3B model answers correctly only part of the time and rarely shows usable steps. Rather than hand-writing thousands of worked solutions for supervised fine-tuning, they apply GRPO with a correctness reward against a public math dataset, plus a format reward to enforce a clean reasoning block.
Over a few hundred steps on a single rented GPU, the team typically observes the format reward stabilize first, then a gradual climb in answer accuracy as completions grow longer. The trade-off is real: tuning the learning rate and group size takes several runs, and generation dominates the wall-clock time. Still, the payoff is a model that reasons in a verifiable way without anyone authoring the reasoning by hand. For a side-by-side of the frameworks that can run this kind of job, see our comparison of Axolotl vs Unsloth vs TorchTune.
Conclusion
GRPO fine-tuning gives you a practical path to reasoning behavior by rewarding outcomes instead of imitating answers. The method drops PPO’s value model, samples a group of completions, and uses their average as the baseline, which is exactly why it fits on a single GPU. With Unsloth, TRL, and a couple of well-shaped reward functions, you can reproduce the core training loop behind DeepSeek R1 on accessible hardware.
Start small: train Qwen2.5-3B on GSM8K with the two reward functions above, watch the reward curves, and iterate on your learning rate and group size before scaling up. From there, design reward functions for your own verifiable task. To decide whether GRPO fine-tuning is even the right tool versus retrieval, revisit our guide on fine-tuning vs RAG before committing GPU time.