A nine-step tutorial on agent reinforcement fine-tuning (RFT): how to build the dataset, stand up tool servers, write the grader, configure KL and group size, size GPUs, and pick between TRL, verl, OpenRLHF, Unsloth, and NeMo RL — with code and example outputs at every step.
The RL math, simplified. Where prompts come from, what a training row looks like, and which open-source framework to run it on your own GPUs.

A step-by-step companion to the conceptual primer: follow nine steps from a folder of prompts to a fine-tuned tool-using agent. Code, expected outputs, and the watch-outs at every step.
By the end of this tutorial you will have trained a tool-using agent end-to-end via reinforcement fine-tuning. The running example: a research agent that answers numerical questions about a small corpus of company filings, with three tools — search (semantic search), list (browse paths), and read (fetch a document). The same nine steps work for any agentic task; only the dataset and the tools change.
This is a hands-on tutorial. Every step has runnable code, an expected output to check against, and a flagged gotcha. The framework choice is deferred to Step 9 — the early steps work the same way whether you train against a managed RFT API or self-host on TRL, verl, OpenRLHF, Unsloth, or NeMo RL.
Python packages used in the code below: requests, fastapi, uvicorn, and datasets.
Goal: decide what success looks like, then prove the base model can sometimes hit it. If the base model fails on every rollout, RFT can’t hill-climb a flat landscape — fix the prior first.
Take a small batch of representative prompts (50 is enough), run the base model 5 times each through your tools, and look at the per-sample success-rate distribution. The shape of that distribution decides whether you have a viable RFT task.
# step1_prior_check.py
import json, requests, random
from collections import Counter
BASE_MODEL_URL = "http://localhost:8001/v1/chat/completions" # vLLM endpoint
N_ROLLOUTS = 5
def run_agent(prompt, tools_spec):
    """Send the prompt to the base model with tools; return (success: bool, n_tool_calls: int)."""
    # ... your model + tool-calling loop here ...
    # Returns whether the final answer matched ground truth.
    pass

samples = [json.loads(line) for line in open("data/research_val.jsonl")]
results = []
for s in samples[:50]:
    successes = sum(run_agent(s["prompt"], s["tools"])[0] for _ in range(N_ROLLOUTS))
    rate = successes / N_ROLLOUTS
    results.append((s["id"], rate))

# Bucket each sample as never / sometimes / always
buckets = Counter(
    "always" if r == 1.0 else
    "sometimes" if 0 < r < 1 else
    "never"
    for _, r in results
)
print(buckets)

Counter({'sometimes': 19, 'never': 18, 'always': 13})
# 19/50 (38%) of samples are in the signal band — well over the 15-30% rule of thumb.
# RFT has something to learn here. Move on.

Goal: produce two JSONL files — research_train.jsonl and research_val.jsonl — where each row is a self-contained training example.
Three sourcing strategies (in order of production-realism): real production logs, synthetic generation from a stronger model, public benchmarks adapted into your tool environment. Aim for 200–1,000 prompts in train, 10–20% held out for val. Stratify the split by task type and difficulty so val isn’t accidentally easier than train.
Each row has this shape:
{
  "id": "row-2842",
  "prompt": "What was Henderson Industries' Q3 2025 operating margin?",
  "reference_answer": "13.4%",
  "expected": {
    "tolerance_rel": 0.02,
    "evidence_path": "filings/henderson-10q-2025-q3.html",
    "must_use_tools": ["search", "read"]
  },
  "tools": [
    {"type": "function", "function": {"name": "search", "parameters": {...}}},
    {"type": "function", "function": {"name": "list", "parameters": {...}}},
    {"type": "function", "function": {"name": "read", "parameters": {...}}}
  ],
  "metadata": {"task_type": "numerical_qa", "difficulty": "medium"}
}

Three things to keep in mind. The id field will save you the moment you start diffing per-sample reward across runs. The reference_answer gets read by the grader (we’ll wire that in Step 4) — keep it structured, not free-form. The tools field is optional in some platforms, required in others; including it makes the row self-describing and is the right default for tasks that span multiple toolsets.
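One way to do the stratified split mentioned above, as a minimal sketch: group rows by (task_type, difficulty) and hold out a fixed fraction of each stratum. The combined input file name and the 15% val fraction here are illustrative, not part of the tutorial's pipeline.
# sketch_stratified_split.py — illustrative, not one of the numbered steps
import json, random
from collections import defaultdict

rows = [json.loads(line) for line in open("data/research_all.jsonl")]  # hypothetical combined file
random.seed(0)

strata = defaultdict(list)
for r in rows:
    strata[(r["metadata"]["task_type"], r["metadata"]["difficulty"])].append(r)

train, val = [], []
for bucket in strata.values():
    random.shuffle(bucket)
    n_val = max(1, int(0.15 * len(bucket)))   # ~15% of every stratum held out
    val.extend(bucket[:n_val])
    train.extend(bucket[n_val:])

with open("data/research_train.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in train)
with open("data/research_val.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in val)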
A small validator before you submit anything:
# step2_validate_dataset.py
import json
from collections import Counter

REQUIRED = {"id", "prompt", "reference_answer"}
ROWS = []
seen_ids = set()
for path in ["data/research_train.jsonl", "data/research_val.jsonl"]:
    with open(path) as f:
        for ln, line in enumerate(f, 1):
            row = json.loads(line)
            missing = REQUIRED - row.keys()
            assert not missing, f"{path}:{ln} missing {missing}"
            assert row["id"] not in seen_ids, f"dup id {row['id']}"
            seen_ids.add(row["id"])
            ROWS.append(row)

print(f"Total rows: {len(ROWS)}")
print(f"Train/val: {sum(1 for r in ROWS if r['id'].startswith('train'))} / "
      f"{sum(1 for r in ROWS if r['id'].startswith('val'))}")
print(f"Difficulty distribution: "
      f"{dict(Counter(r['metadata']['difficulty'] for r in ROWS))}")

Total rows: 612
Train/val: 489 / 123
Difficulty distribution: {'easy': 184, 'medium': 287, 'hard': 141}

Goal: expose each tool as a bearer-authenticated HTTP endpoint that the training platform can call from its rollout workers.
Tools are HTTP services, not in-process functions. The platform runs hundreds of rollouts in parallel and POSTs to your endpoints; the same servers should handle production calls later, possibly with different config. A minimal FastAPI skeleton:
# step3_tool_server.py
import os
from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.security import HTTPBearer

app = FastAPI()
security = HTTPBearer()

def auth(creds=Depends(security)):
    if creds.credentials != os.environ["TOOL_BEARER"]:
        raise HTTPException(401, "bad token")

@app.post("/search")
async def search(req: Request, _=Depends(auth)):
    body = await req.json()
    args = body.get("arguments", {})
    call_id = body.get("call_id")
    query = args.get("query", "").strip()
    if not query:
        # Structured error, not 500 — see Watch For below
        return {"output": "Error: query is required", "call_id": call_id}
    hits = embed_and_search(query, top_k=3)  # your retrieval logic
    return {"output": format_hits(hits), "call_id": call_id}

@app.post("/list")
async def list_paths(req: Request, _=Depends(auth)):
    body = await req.json()
    prefix = body.get("arguments", {}).get("prefix", "")
    paths = corpus_index.list(prefix)
    return {"output": "\n".join(paths), "call_id": body.get("call_id")}

@app.post("/read")
async def read(req: Request, _=Depends(auth)):
    body = await req.json()
    path = body.get("arguments", {}).get("path")
    return {"output": corpus_index.read(path)[:8000],  # token budget!
            "call_id": body.get("call_id")}

Smoke-test before submitting any training job:
$ curl -s -X POST https://tools.example.com/search \
    -H "Authorization: Bearer $TOOL_BEARER" \
    -H "Content-Type: application/json" \
    -d '{"arguments": {"query": "Henderson Q3 operating margin"},
         "call_id": "test-001"}' | jq

{
  "output": "filings/henderson-10q-2025-q3.html (score 0.87)\n  Q3 operating margin: 13.4% on revenue of $842M ...",
  "call_id": "test-001"
}

Goal: turn each rollout’s trajectory into a scalar reward in [0, 1]. The grader is the entire training signal — invest disproportionately here.
Three implementations, picked by complexity:
For the research agent, a model-as-judge with a numerical-grader rubric:
# step4_grader.py
GRADER = {
    "type": "score_model",
    "name": "numerical_grader",
    "model": "gpt-4.1",
    "range": [0, 1],
    "pass_threshold": 0.75,
    "sampling_params": {"temperature": 0},
    "input": [
        {"role": "system", "content": """\
You will be given a Reference Answer and a Model Answer.
Score the Model Answer on a 0-1 scale.
Return 1 if BOTH:
- the Model Answer contains only the final numeric answer
- the value matches the Reference Answer up to format
Format variations that still count as correct:
- currency symbols ($, USD, €)
- magnitude suffixes (M, million, K, B)
- percent forms (7% vs 0.07)
- commas and whitespace
Return 0.5 if off by a tenth of a percent or less, or a true rounding error.
Return 0 otherwise. Reply with ONLY the numeric score."""},
        {"role": "user", "content":
            "- Reference Answer: {{item.reference_answer}}\n"
            "- Model Answer: {{sample.output_text}}"}
    ],
}

Test the grader standalone before training — four quick cases:
# step4_grader_test.py
# run_grader(item, sample) wraps a call to the judge model using the GRADER config above.
cases = [
    ({"reference_answer": "13.4%"}, {"output_text": "13.4%"}),   # exact
    ({"reference_answer": "13.4%"}, {"output_text": "0.134"}),   # format variant
    ({"reference_answer": "13.4%"}, {"output_text": "13.41%"}),  # rounding
    ({"reference_answer": "13.4%"}, {"output_text": "12.0%"}),   # wrong
]
for item, sample in cases:
    print(f"ref={item['reference_answer']:6} out={sample['output_text']:8} "
          f"reward={run_grader(item, sample):.2f}")

ref=13.4%  out=13.4%    reward=1.00
ref=13.4%  out=0.134    reward=1.00
ref=13.4%  out=13.41%   reward=0.50
ref=13.4%  out=12.0%    reward=0.00
Goal: pick a starting hyperparameter set you can defend. Reasonable defaults to start from, ranked by how much they’ll move your run:
# step5_config.yaml
algorithm: grpo # group-relative policy optimization
# The four numbers that matter most
group_size: 8 # rollouts per prompt; below 4 the baseline is too noisy
kl_coef: 0.01 # β: KL penalty against the frozen reference. Tune carefully.
learning_rate: 1.0e-5 # 10-100× lower than SFT
epochs: 2 # agent RFT overfits fast on small datasets
# Mostly-fine-as-is
batch_size: 16 # max within memory after rollout buffer
max_trajectory_tokens: 4096
max_tool_calls: 12 # rollout tool-call budget
warmup_steps: 5
seed: 42
# Inputs
train_path: data/research_train.jsonl
val_path: data/research_val.jsonl
base_model: Qwen/Qwen2.5-7B-Instruct
lora:
  rank: 32
  alpha: 64
  target_modules: [q_proj, k_proj, v_proj, o_proj]

KL coefficient β is the single biggest knob. Too low and the policy drifts away from the reference model (capability collapse — the model forgets formats, refuses unrelated tasks, breaks tool-call syntax). Too high and the model can’t learn anything new. Watch the KL curve during training; if it spikes, tighten β. Reasonable range: 0.001 to 0.05.
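One cheap way to watch that curve between dashboard checks: read the trainer's logged metrics out of a checkpoint's trainer_state.json and flag spikes. A sketch only — the metric key "kl" and the 0.1 spike threshold are assumptions; check what your trainer actually logs and what a healthy run looks like for you.
# sketch_kl_watchdog.py — illustrative; "kl" key and 0.1 threshold are assumptions
import json

state = json.load(open("runs/research-rft-v1/checkpoint-150/trainer_state.json"))
kl_points = [(e["step"], e["kl"]) for e in state["log_history"] if "kl" in e]

for step, kl in kl_points:
    flag = "  <-- spike, consider raising beta" if kl > 0.1 else ""
    print(f"step {step:4d}  kl={kl:.3f}{flag}")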
Group size controls baseline variance. 8–16 is the sweet spot. Below 4, group-mean advantages get noisy enough that gradients stop pointing in a useful direction.
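A quick way to see why: simulate a prompt the policy solves 40% of the time and measure how much the group-mean baseline jitters at different group sizes. Pure illustration — random 0/1 rewards, no model involved.
# sketch_group_baseline_noise.py — illustrative simulation, no model involved
import random
random.seed(0)

P_SUCCESS = 0.4  # assume the policy solves this prompt 40% of the time
for group_size in (2, 4, 8, 16):
    # Baseline = mean reward of the group; measure its spread across 10k simulated groups
    means = [sum(random.random() < P_SUCCESS for _ in range(group_size)) / group_size
             for _ in range(10_000)]
    mu = sum(means) / len(means)
    sd = (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5
    print(f"group_size={group_size:2d}  baseline std ≈ {sd:.2f}")
With a 0/1 reward the baseline’s standard deviation is roughly √(p(1−p)/G), so at group size 2 it swings by about ±0.35 around the true 0.4 — the sign of the advantage is close to a coin flip, which is exactly the noise the prose above warns about.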
Goal: kick off the job, monitor the right metrics in real time, and abort if things look wrong before you burn through your budget.
Training step, schematically:
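Here is a self-contained toy of what one GRPO update computes. It is a sketch of the algorithm, not any framework's API: the rollout, grading, and log-prob pieces are stubbed with dummy values, and per-token credit assignment and ratio clipping are omitted. Only the group-relative advantage and the KL-penalized objective are "real".
# grpo_step_schematic.py — toy sketch; rollout/grading/log-prob stubs return dummy values
import random
random.seed(0)

GROUP_SIZE, KL_COEF = 8, 0.01

def rollout_and_grade(prompt):
    """Stub: run the agent through its tools (Step 3) and grade the trajectory (Step 4)."""
    return random.choice([0.0, 0.5, 1.0])

def logprob_ratio_and_kl(trajectory_idx):
    """Stub: policy/old-policy log-prob ratio for one trajectory, plus its KL estimate to the frozen reference."""
    return random.uniform(0.9, 1.1), random.uniform(0.0, 0.05)

for prompt in ["What was Henderson Industries' Q3 2025 operating margin?"]:
    rewards = [rollout_and_grade(prompt) for _ in range(GROUP_SIZE)]

    # Group-relative advantage: each rollout is scored against its siblings — no value network.
    mean_r = sum(rewards) / len(rewards)
    std_r = max((sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-4)
    advantages = [(r - mean_r) / std_r for r in rewards]

    # Objective: push up above-average rollouts, with a KL leash back to the reference model.
    objective = 0.0
    for i, adv in enumerate(advantages):
        ratio, kl = logprob_ratio_and_kl(i)
        objective += adv * ratio - KL_COEF * kl
    objective /= GROUP_SIZE
    print(f"rewards={rewards}")
    print(f"advantages={[round(a, 2) for a in advantages]}  objective={objective:.3f}")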
A minimal launch script using TRL’s GRPO trainer:
# step6_train.py
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
from datasets import load_dataset
from utils import build_grader, build_tools
from step4_grader import GRADER

config = GRPOConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    num_generations=8,            # group size
    beta=0.01,                    # KL coefficient
    max_completion_length=4096,
    bf16=True,
    logging_steps=1,
    output_dir="./runs/research-rft-v1",
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[build_grader(GRADER)],  # from Step 4
    train_dataset=load_dataset("json", data_files="data/research_train.jsonl")["train"],
    args=config,
    peft_config=LoraConfig(r=32, lora_alpha=64,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()

What healthy training output looks like (first 30 of 150 steps shown):
step  val_reward  tool_calls  kl_to_ref   loss
   0      0.59        9.1       0.000    -0.18
   2      0.61        8.4       0.003    -0.21
   5      0.64        7.2       0.008    -0.27
  10      0.69        5.8       0.014    -0.31
  15      0.72        4.9       0.019    -0.29
  20      0.74        4.5       0.024    -0.27
  25      0.76        4.2       0.028    -0.26
  30      0.77        4.1       0.031    -0.25   ← phase 1 plateau
 ...
 100      0.81        4.0       0.043    -0.18
 150      0.82        4.0       0.048    -0.14
Goal: compare the fine-tuned checkpoint against the base model on the held-out validation set, per sample. The aggregate number is rarely the most informative view.
# step7_evaluate.py
import json, statistics
# run_agent / run_grader: the same rollout and grading helpers from Steps 1 and 4;
# here run_agent returns the full sample dict (output text + n_tool_calls).

def evaluate(model_id, val_path):
    out = []
    for row in (json.loads(l) for l in open(val_path)):
        sample = run_agent(model_id, row["prompt"], row["tools"])
        reward = run_grader(row, sample)
        out.append({"id": row["id"],
                    "reward": reward,
                    "tool_calls": sample["n_tool_calls"]})
    return out

base = evaluate("Qwen/Qwen2.5-7B-Instruct", "data/research_val.jsonl")
tuned = evaluate("./runs/research-rft-v1/checkpoint-150", "data/research_val.jsonl")

# Per-sample diff
diffs = []
for b_, t_ in zip(base, tuned):
    diffs.append({
        "id": b_["id"],
        "d_reward": t_["reward"] - b_["reward"],
        "d_tools": t_["tool_calls"] - b_["tool_calls"],
    })

print(f"mean Δreward: {statistics.mean(d['d_reward'] for d in diffs):+.3f}")
print(f"mean Δtools:  {statistics.mean(d['d_tools'] for d in diffs):+.2f}")
print(f"improved:  {sum(1 for d in diffs if d['d_reward'] > 0)}/{len(diffs)}")
print(f"regressed: {sum(1 for d in diffs if d['d_reward'] < 0)}/{len(diffs)}")
print(f"unchanged: {sum(1 for d in diffs if d['d_reward'] == 0)}/{len(diffs)}")

mean Δreward: +0.218
mean Δtools:  -4.91
improved:  71/123
regressed: 12/123
unchanged: 40/123
Three things to check before declaring victory. First, the regressed bucket — 12 samples got worse. Is there a pattern (a specific task type, a specific tool, a specific document)? Second, the unchanged bucket — 40 samples didn’t move. Are they all already-solved (good) or all unsolvable (bad)? Third, per-slice reward — slice the diffs by metadata.task_type and metadata.difficulty. Aggregate gains can hide a regression on a slice that matters.
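For that third check, a minimal per-slice view over the same diffs list — a sketch that assumes every validation row carries the metadata fields from Step 2:
# sketch_slice_rewards.py — illustrative; reuses diffs from step7_evaluate.py
import json, statistics
from collections import defaultdict

val_rows = {r["id"]: r for r in (json.loads(l) for l in open("data/research_val.jsonl"))}

slices = defaultdict(list)
for d in diffs:
    meta = val_rows[d["id"]]["metadata"]
    slices[(meta["task_type"], meta["difficulty"])].append(d["d_reward"])

for (task_type, difficulty), deltas in sorted(slices.items()):
    print(f"{task_type:15} {difficulty:7} n={len(deltas):3d}  mean Δreward={statistics.mean(deltas):+.3f}")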
Goal: look at the rollouts the model got wrong (or suspiciously right) and decide whether to retrain, fix the grader, or fix the dataset.
# step8_inspect.py
import json

# diffs comes from step7_evaluate.py; load_trace is your own helper for fetching
# a stored rollout trace (tool calls, tool results, messages) for one sample.
val_rows = [json.loads(l) for l in open("data/research_val.jsonl")]

# Sort regressions worst-first
regressions = sorted([d for d in diffs if d["d_reward"] < 0],
                     key=lambda d: d["d_reward"])
for d in regressions[:5]:
    print(f"\n=== {d['id']}  Δreward={d['d_reward']:+.2f}  Δtools={d['d_tools']:+}")
    row = next(r for r in val_rows if r["id"] == d["id"])
    print(f"PROMPT:    {row['prompt']}")
    print(f"REFERENCE: {row['reference_answer']}")
    trace = load_trace(model="tuned", row_id=d["id"])
    for turn in trace["turns"]:
        if turn["type"] == "tool_call":
            print(f"  → {turn['name']}({turn['args']})")
        elif turn["type"] == "tool_result":
            print(f"  ← {turn['output'][:80]}")
        elif turn["type"] == "message":
            print(f"  [model] {turn['content'][:120]}")

=== val-0042  Δreward=-0.50  Δtools=-3
PROMPT: What was Atlas Corp's gross margin in fiscal 2024?
REFERENCE: 28.7%
→ search({"query": "Atlas Corp fiscal 2024 gross margin"})
← filings/atlas-10k-2024.html (score 0.91)\n Gross profit margin: 28.7% ...
→ read({"path": "filings/atlas-10k-2024.html"})
← Atlas Corp ... gross profit of $1.43B on revenue of $4.98B ...
[model] 28.7%
=== val-0117 Δreward=-1.00 Δtools=+2
PROMPT: What was Henderson's revenue in Q3 2024?
REFERENCE: $842M
→ search({"query": "Henderson Q3 2024"})
← filings/henderson-10q-2024-q3.html (score 0.84)\n Revenue: $842M
[model] $842 million dollars   ← grader penalized "dollars" suffix

That second case — "$842 million dollars" scored zero — is a grader bug, not a model bug. The system prompt explicitly listed magnitude suffixes as acceptable but the implementation is brittle. Fix the grader, re-evaluate, don’t retrain. This is the most common kind of regression: the model is actually right, the grader is wrong.
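One way to take this class of failure off the table is to normalize both answers to a bare number before (or instead of) the judge call. A sketch, not the score_model grader configured in Step 4; the suffix table and tolerance are illustrative and could be wired to each row's expected.tolerance_rel:
# sketch_normalize_numeric.py — illustrative pre-processing, not the Step 4 grader
import re

SUFFIXES = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6, "b": 1e9, "billion": 1e9}

def to_number(text):
    """'$842 million dollars' -> 842000000.0, '13.4%' -> 0.134, '0.134' -> 0.134."""
    t = text.lower().replace(",", "").replace("$", "").replace("usd", "").replace("dollars", "").strip()
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*(%|k|m|b|thousand|million|billion)?", t)
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    if suffix == "%":
        return value / 100
    return value * SUFFIXES.get(suffix, 1)

def numeric_match(reference, answer, tol_rel=0.02):  # tolerance could come from expected.tolerance_rel
    ref, ans = to_number(reference), to_number(answer)
    return ref is not None and ans is not None and abs(ans - ref) <= tol_rel * abs(ref)

print(numeric_match("$842M", "$842 million dollars"))  # True
print(numeric_match("13.4%", "0.134"))                 # True
print(numeric_match("13.4%", "12.0%"))                 # False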
Goal: decide where to run subsequent iterations. The first run can be on whatever was easiest; iteration speed depends on choosing well.
All five do PPO and GRPO. Differences show up at scale, on multi-turn rollouts, and in single-GPU friendliness:
And the GPU memory math you actually have to fit:
Two memory items teams underestimate. The rollout buffer (N parallel trajectories worth of activations + KV cache) is its own budget category — for 7B with group size 16 and 4K-token rollouts, that’s on the order of 10–20 GB. The frozen reference model used for the KL term doubles weight memory if it lives on the same GPU; offload it to CPU and KL becomes a network bottleneck. Most production setups run the reference on a dedicated GPU.
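A back-of-envelope version of that math, with the assumptions spelled out in the comments: bf16 weights, LoRA (so optimizer state for the full weights is negligible), and a rough per-token KV-cache cost that varies a lot with GQA and attention implementation. Treat the numbers as order-of-magnitude only.
# sketch_memory_budget.py — rough estimate; real usage depends on framework and attention impl
PARAMS_B = 7                   # model size in billions of parameters
BYTES_PER_PARAM = 2            # bf16
GROUP_SIZE = 16
MAX_TOKENS = 4096              # per rollout
KV_BYTES_PER_TOKEN = 0.2e6     # ~0.2 MB/token assumed for a 7B model; much lower with aggressive GQA

policy_weights = PARAMS_B * 1e9 * BYTES_PER_PARAM
reference_model = policy_weights                       # frozen ref for the KL term, if co-located
rollout_buffer = GROUP_SIZE * MAX_TOKENS * KV_BYTES_PER_TOKEN

gb = lambda x: x / 1e9
print(f"policy weights:   {gb(policy_weights):5.1f} GB")
print(f"reference model:  {gb(reference_model):5.1f} GB  (0 if offloaded or on its own GPU)")
print(f"rollout buffer:   {gb(rollout_buffer):5.1f} GB  (KV cache for {GROUP_SIZE} x {MAX_TOKENS}-token rollouts)")
print(f"total (same GPU): {gb(policy_weights + reference_model + rollout_buffer):5.1f} GB + activations and optimizer state")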
Goal: get the fine-tuned model into production without dragging training-time assumptions along.
Three substitutions to make on the way out:
This last point catches teams off guard: a fine-tuned model that scored well on val can still produce surprising production behavior, and your training-time tool calls don’t help debug it. Training-time and production-time observability are two systems, not one. Build both.