A nine-step tutorial on agent reinforcement fine-tuning (RFT): how to build the dataset, stand up tool servers, write the grader, configure KL and group size, size GPUs, and pick between TRL, verl, OpenRLHF, Unsloth, and NeMo RL — with code and example outputs at every step.
The RL math, simplified. Where prompts come from, what a training row looks like, and which open-source framework to run it on your own GPUs.

A step-by-step companion to the conceptual primer: follow nine steps from a folder of prompts to a fine-tuned tool-using agent. Code, expected outputs, and the watch-outs at every step.
By the end of this tutorial you will have trained a tool-using agent end-to-end via reinforcement fine-tuning. The running example: a research agent that answers numerical questions about a small corpus of company filings, with three tools — search (semantic search), list (browse paths), and read (fetch a document). The same nine steps work for any agentic task; only the dataset and the tools change.
This is a hands-on tutorial. Every step has runnable code, an expected output to check against, and a flagged gotcha. The framework choice is deferred to Step 9 — the early steps work the same way whether you train against a managed RFT API or self-host on TRL, verl, OpenRLHF, Unsloth, or NeMo RL.
Python packages used in the code below: requests, fastapi, uvicorn, and datasets.
Goal: decide what success looks like, then prove the base model can sometimes hit it. If the base model fails on every rollout, RFT can’t hill-climb a flat landscape — fix the prior first.
Take a small batch of representative prompts (50 is enough), run the base model 5 times each through your tools, and look at the per-sample success-rate distribution. The shape of that distribution decides whether you have a viable RFT task.
# step1_prior_check.py
import json, requests, random
from collections import Counter
BASE_MODEL_URL = "http://localhost:8001/v1/chat/completions" # vLLM endpoint
N_ROLLOUTS = 5
def run_agent(prompt, tools_spec):
    """Send the prompt to the base model with tools; return (success: bool, n_tool_calls: int)."""
    # ... your model + tool-calling loop here ...
    # Returns whether the final answer matched ground truth.
    pass

samples = [json.loads(line) for line in open("data/research_val.jsonl")]
results = []
for s in samples[:50]:
    successes = sum(run_agent(s["prompt"], s["tools"])[0] for _ in range(N_ROLLOUTS))
    rate = successes / N_ROLLOUTS
    results.append((s["id"], rate))

# Bucket each sample as never / sometimes / always
buckets = Counter(
    "always" if r == 1.0 else
    "sometimes" if 0 < r < 1 else
    "never"
    for _, r in results
)
print(buckets)

Counter({'sometimes': 19, 'never': 18, 'always': 13})
# 19/50 (38%) of samples are in the signal band — well over the 15-30% rule of thumb.
# RFT has something to learn here. Move on.

Goal: produce two JSONL files — research_train.jsonl and research_val.jsonl — where each row is a self-contained training example.
Three sourcing strategies (in order of production-realism): real production logs, synthetic generation from a stronger model, public benchmarks adapted into your tool environment. Aim for 200–1,000 prompts in train, 10–20% held out for val. Stratify the split by task type and difficulty so val isn’t accidentally easier than train.
Each row has this shape:
{
  "id": "row-2842",
  "prompt": "What was Henderson Industries' Q3 2025 operating margin?",
  "reference_answer": "13.4%",
  "expected": {
    "tolerance_rel": 0.02,
    "evidence_path": "filings/henderson-10q-2025-q3.html",
    "must_use_tools": ["search", "read"]
  },
  "tools": [
    {"type": "function", "function": {"name": "search", "parameters": {...}}},
    {"type": "function", "function": {"name": "list", "parameters": {...}}},
    {"type": "function", "function": {"name": "read", "parameters": {...}}}
  ],
  "metadata": {"task_type": "numerical_qa", "difficulty": "medium"}
}

Three things to keep in mind. The id field will save you the moment you start diffing per-sample reward across runs. The reference_answer gets read by the grader (we’ll wire that in Step 4) — keep it structured, not free-form. The tools field is optional in some platforms, required in others; including it makes the row self-describing and is the right default for tasks that span multiple toolsets.
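One way to do the stratified split mentioned above, as a minimal sketch: group rows by (task_type, difficulty) and hold out a fixed fraction of each stratum. The combined input file name and the 15% val fraction here are illustrative, not part of the tutorial's pipeline.
# sketch_stratified_split.py — illustrative, not one of the numbered steps
import json, random
from collections import defaultdict

rows = [json.loads(line) for line in open("data/research_all.jsonl")]  # hypothetical combined file
random.seed(0)

strata = defaultdict(list)
for r in rows:
    strata[(r["metadata"]["task_type"], r["metadata"]["difficulty"])].append(r)

train, val = [], []
for bucket in strata.values():
    random.shuffle(bucket)
    n_val = max(1, int(0.15 * len(bucket)))   # ~15% of every stratum held out
    val.extend(bucket[:n_val])
    train.extend(bucket[n_val:])

with open("data/research_train.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in train)
with open("data/research_val.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in val)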
A small validator before you submit anything:
# step2_validate_dataset.py
import json
from collections import Counter

REQUIRED = {"id", "prompt", "reference_answer"}
ROWS = []
seen_ids = set()
for path in ["data/research_train.jsonl", "data/research_val.jsonl"]:
    with open(path) as f:
        for ln, line in enumerate(f, 1):
            row = json.loads(line)
            missing = REQUIRED - row.keys()
            assert not missing, f"{path}:{ln} missing {missing}"
            assert row["id"] not in seen_ids, f"dup id {row['id']}"
            seen_ids.add(row["id"])
            ROWS.append(row)

print(f"Total rows: {len(ROWS)}")
print(f"Train/val: {sum(1 for r in ROWS if r['id'].startswith('train'))} / "
      f"{sum(1 for r in ROWS if r['id'].startswith('val'))}")
print(f"Difficulty distribution: "
      f"{dict(Counter(r['metadata']['difficulty'] for r in ROWS))}")

Total rows: 612
Train/val: 489 / 123
Difficulty distribution: {'easy': 184, 'medium': 287, 'hard': 141}

Goal: expose each tool as a bearer-authenticated HTTP endpoint that the training platform can call from its rollout workers.
Tools are HTTP services, not in-process functions. The platform runs hundreds of rollouts in parallel and POSTs to your endpoints; the same servers should handle production calls later, possibly with different config. A minimal FastAPI skeleton:
# step3_tool_server.py
import os
from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.security import HTTPBearer

app = FastAPI()
security = HTTPBearer()

def auth(creds=Depends(security)):
    if creds.credentials != os.environ["TOOL_BEARER"]:
        raise HTTPException(401, "bad token")

@app.post("/search")
async def search(req: Request, _=Depends(auth)):
    body = await req.json()
    args = body.get("arguments", {})
    call_id = body.get("call_id")
    query = args.get("query", "").strip()
    if not query:
        # Structured error, not 500 — see Watch For below
        return {"output": "Error: query is required", "call_id": call_id}
    hits = embed_and_search(query, top_k=3)  # your retrieval logic
    return {"output": format_hits(hits), "call_id": call_id}

@app.post("/list")
async def list_paths(req: Request, _=Depends(auth)):
    body = await req.json()
    prefix = body.get("arguments", {}).get("prefix", "")
    paths = corpus_index.list(prefix)
    return {"output": "\n".join(paths), "call_id": body.get("call_id")}

@app.post("/read")
async def read(req: Request, _=Depends(auth)):
    body = await req.json()
    path = body.get("arguments", {}).get("path")
    return {"output": corpus_index.read(path)[:8000],  # token budget!
            "call_id": body.get("call_id")}

Smoke-test before submitting any training job:
$ curl -s -X POST https://tools.example.com/search \
    -H "Authorization: Bearer $TOOL_BEARER" \
    -H "Content-Type: application/json" \
    -d '{"arguments": {"query": "Henderson Q3 operating margin"},
         "call_id": "test-001"}' | jq

{
  "output": "filings/henderson-10q-2025-q3.html (score 0.87)\n  Q3 operating margin: 13.4% on revenue of $842M ...",
  "call_id": "test-001"
}

Goal: turn each rollout’s trajectory into a scalar reward in [0, 1]. The grader is the entire training signal — invest disproportionately here.
Three implementations, picked by complexity:
For the research agent, a model-as-judge with a numerical-grader rubric:
# step4_grader.py
GRADER = {
    "type": "score_model",
    "name": "numerical_grader",
    "model": "gpt-4.1",
    "range": [0, 1],
    "pass_threshold": 0.75,
    "sampling_params": {"temperature": 0},
    "input": [
        {"role": "system", "content": """\
You will be given a Reference Answer and a Model Answer.
Score the Model Answer on a 0-1 scale.
Return 1 if BOTH:
- the Model Answer contains only the final numeric answer
- the value matches the Reference Answer up to format
Format variations that still count as correct:
- currency symbols ($, USD, €)
- magnitude suffixes (M, million, K, B)
- percent forms (7% vs 0.07)
- commas and whitespace
Return 0.5 if off by a tenth of a percent or less, or a true rounding error.
Return 0 otherwise. Reply with ONLY the numeric score."""},
        {"role": "user", "content":
            "- Reference Answer: {{item.reference_answer}}\n"
            "- Model Answer: {{sample.output_text}}"}
    ],
}

Test the grader standalone before training — four quick cases:
# step4_grader_test.py
# run_grader(item, sample) wraps a call to the judge model using the GRADER config above.
cases = [
    ({"reference_answer": "13.4%"}, {"output_text": "13.4%"}),   # exact
    ({"reference_answer": "13.4%"}, {"output_text": "0.134"}),   # format variant
    ({"reference_answer": "13.4%"}, {"output_text": "13.41%"}),  # rounding
    ({"reference_answer": "13.4%"}, {"output_text": "12.0%"}),   # wrong
]
for item, sample in cases:
    print(f"ref={item['reference_answer']:6} out={sample['output_text']:8} "
          f"reward={run_grader(item, sample):.2f}")

ref=13.4%  out=13.4%    reward=1.00
ref=13.4%  out=0.134    reward=1.00
ref=13.4%  out=13.41%   reward=0.50
ref=13.4%  out=12.0%    reward=0.00
Goal: pick a starting hyperparameter set you can defend. Reasonable defaults to start from, ranked by how much they’ll move your run:
# step5_config.yaml
algorithm: grpo # group-relative policy optimization
# The four numbers that matter most
group_size: 8 # rollouts per prompt; below 4 the baseline is too noisy
kl_coef: 0.01 # β: KL penalty against the frozen reference. Tune carefully.
learning_rate: 1.0e-5 # 10-100× lower than SFT
epochs: 2 # agent RFT overfits fast on small datasets
# Mostly-fine-as-is
batch_size: 16 # max within memory after rollout buffer
max_trajectory_tokens: 4096
max_tool_calls: 12 # rollout tool-call budget
warmup_steps: 5
seed: 42
# Inputs
train_path: data/research_train.jsonl
val_path: data/research_val.jsonl
base_model: Qwen/Qwen2.5-7B-Instruct
lora:
  rank: 32
  alpha: 64
  target_modules: [q_proj, k_proj, v_proj, o_proj]

KL coefficient β is the single biggest knob. Too low and the policy drifts away from the reference model (capability collapse — the model forgets formats, refuses unrelated tasks, breaks tool-call syntax). Too high and the model can’t learn anything new. Watch the KL curve during training; if it spikes, tighten β. Reasonable range: 0.001 to 0.05.
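One cheap way to watch that curve between dashboard checks: read the trainer's logged metrics out of a checkpoint's trainer_state.json and flag spikes. A sketch only — the metric key "kl" and the 0.1 spike threshold are assumptions; check what your trainer actually logs and what a healthy run looks like for you.
# sketch_kl_watchdog.py — illustrative; "kl" key and 0.1 threshold are assumptions
import json

state = json.load(open("runs/research-rft-v1/checkpoint-150/trainer_state.json"))
kl_points = [(e["step"], e["kl"]) for e in state["log_history"] if "kl" in e]

for step, kl in kl_points:
    flag = "  <-- spike, consider raising beta" if kl > 0.1 else ""
    print(f"step {step:4d}  kl={kl:.3f}{flag}")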
Group size controls baseline variance. 8–16 is the sweet spot. Below 4, group-mean advantages get noisy enough that gradients stop pointing in a useful direction.
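A quick way to see why: simulate a prompt the policy solves 40% of the time and measure how much the group-mean baseline jitters at different group sizes. Pure illustration — random 0/1 rewards, no model involved.
# sketch_group_baseline_noise.py — illustrative simulation, no model involved
import random
random.seed(0)

P_SUCCESS = 0.4  # assume the policy solves this prompt 40% of the time
for group_size in (2, 4, 8, 16):
    # Baseline = mean reward of the group; measure its spread across 10k simulated groups
    means = [sum(random.random() < P_SUCCESS for _ in range(group_size)) / group_size
             for _ in range(10_000)]
    mu = sum(means) / len(means)
    sd = (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5
    print(f"group_size={group_size:2d}  baseline std ≈ {sd:.2f}")
With a 0/1 reward the baseline’s standard deviation is roughly √(p(1−p)/G), so at group size 2 it swings by about ±0.35 around the true 0.4 — the sign of the advantage is close to a coin flip, which is exactly the noise the prose above warns about.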
Goal: kick off the job, monitor the right metrics in real time, and abort if things look wrong before you burn through your budget.
Training step, schematically:
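Here is a self-contained toy of what one GRPO update computes. It is a sketch of the algorithm, not any framework's API: the rollout, grading, and log-prob pieces are stubbed with dummy values, and per-token credit assignment and ratio clipping are omitted. Only the group-relative advantage and the KL-penalized objective are "real".
# grpo_step_schematic.py — toy sketch; rollout/grading/log-prob stubs return dummy values
import random
random.seed(0)

GROUP_SIZE, KL_COEF = 8, 0.01

def rollout_and_grade(prompt):
    """Stub: run the agent through its tools (Step 3) and grade the trajectory (Step 4)."""
    return random.choice([0.0, 0.5, 1.0])

def logprob_ratio_and_kl(trajectory_idx):
    """Stub: policy/old-policy log-prob ratio for one trajectory, plus its KL estimate to the frozen reference."""
    return random.uniform(0.9, 1.1), random.uniform(0.0, 0.05)

for prompt in ["What was Henderson Industries' Q3 2025 operating margin?"]:
    rewards = [rollout_and_grade(prompt) for _ in range(GROUP_SIZE)]

    # Group-relative advantage: each rollout is scored against its siblings — no value network.
    mean_r = sum(rewards) / len(rewards)
    std_r = max((sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-4)
    advantages = [(r - mean_r) / std_r for r in rewards]

    # Objective: push up above-average rollouts, with a KL leash back to the reference model.
    objective = 0.0
    for i, adv in enumerate(advantages):
        ratio, kl = logprob_ratio_and_kl(i)
        objective += adv * ratio - KL_COEF * kl
    objective /= GROUP_SIZE
    print(f"rewards={rewards}")
    print(f"advantages={[round(a, 2) for a in advantages]}  objective={objective:.3f}")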
A minimal launch script using TRL’s GRPO trainer:
# step6_train.py
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
from datasets import load_dataset
from utils import build_grader, build_tools
from step4_grader import GRADER

config = GRPOConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    num_generations=8,            # group size
    beta=0.01,                    # KL coefficient
    max_completion_length=4096,
    bf16=True,
    logging_steps=1,
    output_dir="./runs/research-rft-v1",
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[build_grader(GRADER)],  # from Step 4
    train_dataset=load_dataset("json", data_files="data/research_train.jsonl")["train"],
    args=config,
    peft_config=LoraConfig(r=32, lora_alpha=64,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()

What healthy training output looks like (first 30 of 150 steps shown):
step  val_reward  tool_calls  kl_to_ref   loss
   0      0.59        9.1       0.000    -0.18
   2      0.61        8.4       0.003    -0.21
   5      0.64        7.2       0.008    -0.27
  10      0.69        5.8       0.014    -0.31
  15      0.72        4.9       0.019    -0.29
  20      0.74        4.5       0.024    -0.27
  25      0.76        4.2       0.028    -0.26
  30      0.77        4.1       0.031    -0.25   ← phase 1 plateau
 ...
 100      0.81        4.0       0.043    -0.18
 150      0.82        4.0       0.048    -0.14
Goal: compare the fine-tuned checkpoint against the base model on the held-out validation set, per sample. The aggregate number is rarely the most informative view.
# step7_evaluate.py
import json, statistics
# run_agent / run_grader: the same rollout and grading helpers from Steps 1 and 4;
# here run_agent returns the full sample dict (output text + n_tool_calls).

def evaluate(model_id, val_path):
    out = []
    for row in (json.loads(l) for l in open(val_path)):
        sample = run_agent(model_id, row["prompt"], row["tools"])
        reward = run_grader(row, sample)
        out.append({"id": row["id"],
                    "reward": reward,
                    "tool_calls": sample["n_tool_calls"]})
    return out

base = evaluate("Qwen/Qwen2.5-7B-Instruct", "data/research_val.jsonl")
tuned = evaluate("./runs/research-rft-v1/checkpoint-150", "data/research_val.jsonl")

# Per-sample diff
diffs = []
for b_, t_ in zip(base, tuned):
    diffs.append({
        "id": b_["id"],
        "d_reward": t_["reward"] - b_["reward"],
        "d_tools": t_["tool_calls"] - b_["tool_calls"],
    })

print(f"mean Δreward: {statistics.mean(d['d_reward'] for d in diffs):+.3f}")
print(f"mean Δtools:  {statistics.mean(d['d_tools'] for d in diffs):+.2f}")
print(f"improved:  {sum(1 for d in diffs if d['d_reward'] > 0)}/{len(diffs)}")
print(f"regressed: {sum(1 for d in diffs if d['d_reward'] < 0)}/{len(diffs)}")
print(f"unchanged: {sum(1 for d in diffs if d['d_reward'] == 0)}/{len(diffs)}")

mean Δreward: +0.218
mean Δtools:  -4.91
improved:  71/123
regressed: 12/123
unchanged: 40/123
Three things to check before declaring victory. First, the regressed bucket — 12 samples got worse. Is there a pattern (a specific task type, a specific tool, a specific document)? Second, the unchanged bucket — 40 samples didn’t move. Are they all already-solved (good) or all unsolvable (bad)? Third, per-slice reward — slice the diffs by metadata.task_type and metadata.difficulty. Aggregate gains can hide a regression on a slice that matters.
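For that third check, a minimal per-slice view over the same diffs list — a sketch that assumes every validation row carries the metadata fields from Step 2:
# sketch_slice_rewards.py — illustrative; reuses diffs from step7_evaluate.py
import json, statistics
from collections import defaultdict

val_rows = {r["id"]: r for r in (json.loads(l) for l in open("data/research_val.jsonl"))}

slices = defaultdict(list)
for d in diffs:
    meta = val_rows[d["id"]]["metadata"]
    slices[(meta["task_type"], meta["difficulty"])].append(d["d_reward"])

for (task_type, difficulty), deltas in sorted(slices.items()):
    print(f"{task_type:15} {difficulty:7} n={len(deltas):3d}  mean Δreward={statistics.mean(deltas):+.3f}")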
Goal: look at the rollouts the model got wrong (or suspiciously right) and decide whether to retrain, fix the grader, or fix the dataset.
# step8_inspect.py
import json

# diffs comes from step7_evaluate.py; load_trace is your own helper for fetching
# a stored rollout trace (tool calls, tool results, messages) for one sample.
val_rows = [json.loads(l) for l in open("data/research_val.jsonl")]

# Sort regressions worst-first
regressions = sorted([d for d in diffs if d["d_reward"] < 0],
                     key=lambda d: d["d_reward"])
for d in regressions[:5]:
    print(f"\n=== {d['id']}  Δreward={d['d_reward']:+.2f}  Δtools={d['d_tools']:+}")
    row = next(r for r in val_rows if r["id"] == d["id"])
    print(f"PROMPT:    {row['prompt']}")
    print(f"REFERENCE: {row['reference_answer']}")
    trace = load_trace(model="tuned", row_id=d["id"])
    for turn in trace["turns"]:
        if turn["type"] == "tool_call":
            print(f"  → {turn['name']}({turn['args']})")
        elif turn["type"] == "tool_result":
            print(f"  ← {turn['output'][:80]}")
        elif turn["type"] == "message":
            print(f"  [model] {turn['content'][:120]}")

=== val-0042  Δreward=-0.50  Δtools=-3
PROMPT: What was Atlas Corp's gross margin in fiscal 2024?
REFERENCE: 28.7%
→ search({"query": "Atlas Corp fiscal 2024 gross margin"})
← filings/atlas-10k-2024.html (score 0.91)\n Gross profit margin: 28.7% ...
→ read({"path": "filings/atlas-10k-2024.html"})
← Atlas Corp ... gross profit of $1.43B on revenue of $4.98B ...
[model] 28.7%
=== val-0117 Δreward=-1.00 Δtools=+2
PROMPT: What was Henderson's revenue in Q3 2024?
REFERENCE: $842M
→ search({"query": "Henderson Q3 2024"})
← filings/henderson-10q-2024-q3.html (score 0.84)\n Revenue: $842M
[model] $842 million dollars   ← grader penalized "dollars" suffix

That second case — "$842 million dollars" scored zero — is a grader bug, not a model bug. The system prompt explicitly listed magnitude suffixes as acceptable but the implementation is brittle. Fix the grader, re-evaluate, don’t retrain. This is the most common kind of regression: the model is actually right, the grader is wrong.
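One way to take this class of failure off the table is to normalize both answers to a bare number before (or instead of) the judge call. A sketch, not the score_model grader configured in Step 4; the suffix table and tolerance are illustrative and could be wired to each row's expected.tolerance_rel:
# sketch_normalize_numeric.py — illustrative pre-processing, not the Step 4 grader
import re

SUFFIXES = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6, "b": 1e9, "billion": 1e9}

def to_number(text):
    """'$842 million dollars' -> 842000000.0, '13.4%' -> 0.134, '0.134' -> 0.134."""
    t = text.lower().replace(",", "").replace("$", "").replace("usd", "").replace("dollars", "").strip()
    m = re.search(r"(-?\d+(?:\.\d+)?)\s*(%|k|m|b|thousand|million|billion)?", t)
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    if suffix == "%":
        return value / 100
    return value * SUFFIXES.get(suffix, 1)

def numeric_match(reference, answer, tol_rel=0.02):  # tolerance could come from expected.tolerance_rel
    ref, ans = to_number(reference), to_number(answer)
    return ref is not None and ans is not None and abs(ans - ref) <= tol_rel * abs(ref)

print(numeric_match("$842M", "$842 million dollars"))  # True
print(numeric_match("13.4%", "0.134"))                 # True
print(numeric_match("13.4%", "12.0%"))                 # False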
Goal: decide where to run subsequent iterations. The first run can be on whatever was easiest; iteration speed depends on choosing well.
All five do PPO and GRPO. Differences show up at scale, on multi-turn rollouts, and in single-GPU friendliness:
And the GPU memory math you actually have to fit:
Two memory items teams underestimate. The rollout buffer (N parallel trajectories worth of activations + KV cache) is its own budget category — for 7B with group size 16 and 4K-token rollouts, that’s on the order of 10–20 GB. The frozen reference model used for the KL term doubles weight memory if it lives on the same GPU; offload it to CPU and KL becomes a network bottleneck. Most production setups run the reference on a dedicated GPU.
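A back-of-envelope version of that math, with the assumptions spelled out in the comments: bf16 weights, LoRA (so optimizer state for the full weights is negligible), and a rough per-token KV-cache cost that varies a lot with GQA and attention implementation. Treat the numbers as order-of-magnitude only.
# sketch_memory_budget.py — rough estimate; real usage depends on framework and attention impl
PARAMS_B = 7                   # model size in billions of parameters
BYTES_PER_PARAM = 2            # bf16
GROUP_SIZE = 16
MAX_TOKENS = 4096              # per rollout
KV_BYTES_PER_TOKEN = 0.2e6     # ~0.2 MB/token assumed for a 7B model; much lower with aggressive GQA

policy_weights = PARAMS_B * 1e9 * BYTES_PER_PARAM
reference_model = policy_weights                       # frozen ref for the KL term, if co-located
rollout_buffer = GROUP_SIZE * MAX_TOKENS * KV_BYTES_PER_TOKEN

gb = lambda x: x / 1e9
print(f"policy weights:   {gb(policy_weights):5.1f} GB")
print(f"reference model:  {gb(reference_model):5.1f} GB  (0 if offloaded or on its own GPU)")
print(f"rollout buffer:   {gb(rollout_buffer):5.1f} GB  (KV cache for {GROUP_SIZE} x {MAX_TOKENS}-token rollouts)")
print(f"total (same GPU): {gb(policy_weights + reference_model + rollout_buffer):5.1f} GB + activations and optimizer state")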
Goal: get the fine-tuned model into production without dragging training-time assumptions along.
Three substitutions to make on the way out:
This last point catches teams off guard: a fine-tuned model that scored well on val can still produce surprising production behavior, and your training-time tool calls don’t help debug it. Training-time and production-time observability are two systems, not one. Build both.