
Agent Reinforcement Fine-Tuning

When prompt engineering stops being enough — a practitioner’s guide to agent reinforcement fine-tuning (RFT) for tool-using agents. What it changes, when to reach for it, and what early production results actually show.

Gad Benram · May 6, 2026 · 12 min read · 2,544 words · Filed under Agents


Agent reinforcement fine-tuning was introduced by OpenAI in late 2025 — see the AI Engineer Code Summit talk by Will Hang and Cathy Zhou and the OpenAI RFT documentation; references collected at the end of this post.

The wall every agent team hits

If you’ve shipped an agent into production, the curve probably looks familiar. The first prompt you write gets you to 60% on your eval. A week of prompt engineering, better tool descriptions, and tightening the task harness pushes you into the low 70s. Then progress flattens. You add a planner, you split tools, you rewrite system messages — and the needle barely moves. You’ve squeezed the prior dry.

This is the gap that agent reinforcement fine-tuning (agent RFT) is built to close. It’s a meaningfully different technique from supervised fine-tuning or even single-turn RFT, and it’s the first practical way to train a frontier reasoning model end-to-end on a multi-step, tool-using task — using your tools, your environment, and your definition of “good.”

This piece is for CTOs and engineers who are already past the prompt-engineering plateau and trying to decide whether agent RFT is the next investment. We’ll cover what it actually does under the hood, why it’s sample-efficient, what the training loop looks like in practice, and where it has and hasn’t worked in early production deployments.

[Figure: The prompt-engineering plateau and where agent RFT picks up. A learning curve climbs from 60% accuracy (first prompt) into the low 70s with prompt engineering, hits the 73% wall where teams plateau, then resumes toward ~85% once agent RFT begins on 100–1000 examples. Prompt engineering moves the agent within the prior; RFT changes the prior. RFT amplifies a working prior — it doesn't create one.]

What makes an agent different (and why training it is harder)

A model becomes an agent the moment it can act on the world without going through you for every step. That capability is mediated by tools: a coding agent has a terminal and a code interpreter; a customer service agent has a CRM and a refund API; a research agent has a browser and a file system. Each tool call writes its result back into the agent’s context window, and the agent reasons over that, calls another tool, and repeats — until it produces a final answer.
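
In code, the loop is small; everything interesting happens inside the model and the tools. A minimal sketch, where `llm`, its response shape, and the `tools` dict are hypothetical stand-ins rather than any particular SDK:

```python
# Minimal agent loop: think, call a tool, read the result, repeat.
# `llm` and its response shape are hypothetical stand-ins, not a real SDK.
import json

def run_agent(llm, tools: dict, task: str, max_steps: int = 10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = llm.chat(messages=messages, tools=list(tools))
        if response.tool_call is None:
            return response.content  # final answer ends the loop
        args = json.loads(response.tool_call.arguments)
        result = tools[response.tool_call.name](**args)  # act on the world
        messages.append({"role": "assistant", "tool_call": response.tool_call})
        messages.append({"role": "tool", "content": str(result)})  # result re-enters context
    return None  # tool-call budget exhausted without a final answer
```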

That feedback loop is exactly what makes agents useful, and exactly what makes them hard to train. The base model was post-trained on tools that look nothing like yours. Your tools have your naming conventions, your schemas, your latency characteristics, your edge cases. The agent will use them — but often inefficiently. It will call five tools when one would do. It will repeat the same search with slightly different parameters. It will reason over irrelevant outputs. None of that shows up cleanly as a “wrong answer” you can correct with a prompt; it shows up as a slow, expensive trajectory that happens to land on the right answer most of the time.

The gap between how the base model uses generic tools and how your agent should use your tools is what teams call distribution shift, and closing it is the core promise of agent RFT.

Before you fine-tune, exhaust the cheaper levers

Fine-tuning is the right answer to a specific problem, not the first answer to every problem. Before reaching for it, the standard playbook still applies, and it gets surprisingly far:

  • Prompt engineering. Steer the model with clearer instructions, better few-shot examples, and explicit guardrails.
  • Task simplification. Break a complex task into subtasks. An agent that has to plan, search, reason, and write is harder to train than one that only has to do the search step well.
  • Tool surgery. Rename tools so their semantics are obvious. Rewrite descriptions. Merge tools that should be one. Split tools that are doing too much. Teams routinely see double-digit accuracy gains from this alone — the model is more sensitive to tool naming than most engineers expect, because tool names are how it forms its prior over what each tool does.
  • Better tool implementations. A semantic search tool that returns cleaner results is worth more than any prompt tweak.

Only after these are exhausted does fine-tuning earn its place. The framing that holds up well in practice: prompt engineering and task design move the agent within the basin of behavior the base model already has; fine-tuning changes the basin itself.

What agent RFT actually does

Reinforcement fine-tuning, in its original form, was a single-turn technique. You gave the model a prompt, it produced an answer, a grader scored the answer, and the model’s weights were nudged to make high-scoring answers more likely. Useful, but agents don’t work in single turns.

Agent RFT extends the loop. During training, the model isn’t just generating answers — it’s generating full rollouts. A rollout is one complete trajectory: the model thinks, calls a tool, sees the result, thinks again, calls another tool, and eventually emits a final answer. Two things make this work:

  1. The model calls your tool endpoints during training. The training infrastructure makes real HTTP calls to endpoints you host, gets real responses back, and feeds them into the model’s context. The model is exploring the search space of trajectories in the environment it will eventually be deployed into.
  2. The model is graded by your grader. You can use a model-based grader, a string match, or — most powerfully — an HTTP endpoint you control, which receives the full trajectory (every tool call, every output, the final answer) and returns a scalar reward.

A unique rollout ID is attached to every tool call from the same trajectory, which means your tool servers and your grader can correlate calls, maintain per-rollout state, and grade based on the full trajectory rather than just the final answer. This matters more than it sounds — it’s what lets you reward how the agent got to the answer, not just whether it got there.
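
A sketch of what that looks like on your side of the HTTP boundary. The field names and payload shapes below (`rollout_id`, `tool`, `arguments`, `final_answer`) are assumptions for illustration; the real contract is in the OpenAI RFT documentation:

```python
# Hypothetical tool and grader endpoints, keyed on the rollout ID so the
# grader can score the full trajectory. Payload field names are assumptions.
from collections import defaultdict
from fastapi import FastAPI, Request

app = FastAPI()
trajectories: dict[str, list] = defaultdict(list)  # rollout_id -> tool calls (use a real store in production)

def dispatch(tool_name: str, arguments: dict):
    """Stand-in for your real tool logic (search, list, cat, ...)."""
    ...

def score(final_answer: str, calls: list) -> float:
    """Stand-in for your rubric: answer quality plus trajectory efficiency."""
    ...

@app.post("/tool")
async def tool(request: Request):
    body = await request.json()
    result = dispatch(body["tool"], body["arguments"])
    trajectories[body["rollout_id"]].append(
        {"tool": body["tool"], "args": body["arguments"], "result": result}
    )
    return {"result": result}

@app.post("/grade")
async def grade(request: Request):
    body = await request.json()
    calls = trajectories.pop(body["rollout_id"], [])  # the whole trajectory, not just the answer
    return {"reward": score(body["final_answer"], calls)}  # scalar reward
```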

The training loop, then, looks like this: the platform issues many parallel rollout requests, the agent explores different ways of using your tools, the grader scores each trajectory, and the model’s weights are updated to make high-reward trajectories more likely. Over time, the model converges on a policy that uses your specific tools efficiently and reasons well over their outputs.
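
The platform runs this loop for you, and the actual update rule isn't public; as a schematic, though, one training step has this shape:

```python
# Schematic of one agent-RFT training step, not the platform's implementation.
# `rollout` generates one trajectory; `grader` maps a trajectory to a scalar.
def training_step(model, prompts, rollout, grader, compute_multiplier: int):
    trajectories, rewards = [], []
    for prompt in prompts:                      # one batch of training samples
        for _ in range(compute_multiplier):     # many rollouts per sample
            traj = rollout(model, prompt)       # think -> tool -> ... -> final answer
            trajectories.append(traj)
            rewards.append(grader(traj))
    model.update(trajectories, rewards)         # amplify high-reward trajectories
```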

[Figure: The agent RFT training loop. One trajectory, end-to-end, in your environment: the model (think → act → think) emits tool calls over real HTTP to your endpoints, each call carrying a rollout_id; the full trajectory rolls up to your grader (model, rubric, or endpoint), which returns a scalar reward; the weights update to amplify high-reward trajectories. This repeats across many parallel rollouts per training step. Real HTTP, real responses, real grader: trained where it will run.]

Why it’s sample-efficient (and when it isn’t)

One of the more counterintuitive properties of agent RFT is that it works with surprisingly little data. Production runs have succeeded with as few as 100–150 training examples, and several have done well with 1,000. Compare that to the tens of thousands typically needed for supervised fine-tuning.

The reason is structural. The model is generating its own training data through exploration. Each prompt in your training set isn’t one example — it’s a seed for many trajectories, each scored by your grader, each contributing gradient signal. The compute multiplier (a hyperparameter that controls how many rollouts per sample) lets you trade compute for exploration directly: more rollouts means more chances to stumble onto a good trajectory on a hard sample.

This also tells you when the technique won’t work. Agent RFT depends on the base model occasionally getting the answer right. If your model scores zero across every rollout on every sample, there’s no signal to amplify — you’re trying to nudge a flat landscape. Two diagnostic plots tell you whether you’re in good shape (both computations are sketched after the list):

  • Per-sample variance. Run the base model 3+ times on each validation sample and plot the spread. You want a meaningful fraction of samples (15–30% is typically enough) where the model sometimes succeeds and sometimes fails. Those are the samples that will drive learning.
  • Pass-at-k. If you take the best of k rollouts per sample, what’s the average score? This is roughly the ceiling RFT can hill-climb toward — it’s saying “this is what the model already knows how to do; training will teach it to do this consistently.”
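
Both diagnostics fall out of the same run: n rollouts of the base agent per validation sample. A minimal sketch, where `rollout_reward` is a placeholder that runs the base agent once on a sample and returns the grader's scalar:

```python
# Pre-flight diagnostics: per-sample variance and a best-of-k ceiling.
import statistics

def diagnose(samples: list, rollout_reward, n: int = 8, k: int = 4) -> None:
    rewards = [[rollout_reward(s) for _ in range(n)] for s in samples]

    # Signal band: samples the model sometimes solves and sometimes doesn't.
    # (0.5 as the success threshold is arbitrary; match it to your grader.)
    mixed = sum(1 for r in rewards if min(r) < 0.5 < max(r))
    print(f"signal band: {mixed / len(samples):.0%} of samples (15-30% is typically enough)")

    # Best-of-k average: the rough ceiling RFT can hill-climb toward.
    best_of_k = statistics.mean(max(r[:k]) for r in rewards)
    print(f"pass@{k} ceiling: {best_of_k:.2f}")
```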

If both numbers are near zero, the task is too hard for the base model and RFT won’t rescue it. The fix is upstream: simpler tools, better prompts, or a stronger base model. RFT amplifies a working prior; it doesn’t create one.

[Figure: Per-sample variance signal map. 64 validation samples, each run 3+ times on the base model and sorted by mean reward, split into three bands: already solved (consistent success, no headroom for RFT), the signal band (the 15–30% of samples with mixed outcomes that drives learning), and too hard (RFT can't hill-climb a flat landscape). The pass@k ceiling is marked above the curve.]

A worked example: financial QA, made hard on purpose

It’s worth grounding all of this in a concrete task. The FinQA benchmark gives a model a financial report and asks numerical-reasoning questions about it. In the standard setup, the relevant report is included in the prompt — the task is purely reasoning over given context.

A more realistic agentic version: strip out the report, and force the model to find it. Give it a corpus of 2,800 financial documents and three tools:

  • search — semantic search over the corpus
  • list — directory and file listing
  • cat — read a document by path

Now the agent has to figure out which report is relevant, locate it, extract the right numbers, and reason over them — within a budget of 10 tool calls. The grader is model-based: full credit for matches, partial credit for near-misses (rounding errors, formatting differences like “$32” vs “32 dollars”), zero for wrong answers. Strict string matching would punish the model for trivial formatting mistakes; an unconstrained grader would let it game the rubric. The middle ground is a careful rubric the grader follows.
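
For the numeric part of such a rubric, the normalize-then-compare shape looks like this; the tolerances are illustrative, not the rubric from the actual run:

```python
import re

def normalize(answer: str) -> float | None:
    """Extract a number, ignoring currency symbols, units, and separators:
    '$32' and '32 dollars' both normalize to 32.0."""
    match = re.search(r"-?\d+\.?\d*", answer.replace(",", ""))
    return float(match.group()) if match else None

def grade_numeric(predicted: str, gold: str) -> float:
    p, g = normalize(predicted), normalize(gold)
    if p is None or g is None:
        return 0.0
    rel_err = abs(p - g) / max(abs(g), 1e-9)
    if rel_err < 1e-4:
        return 1.0  # match, up to formatting
    if rel_err < 0.01:
        return 0.5  # near-miss (rounding): partial credit
    return 0.0      # wrong
```

The partial-credit tier is the point: it is the slope the model climbs.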

Running agent RFT on this task — 1,000 training samples, batch size 16, three epochs, GPT-5 with medium reasoning effort — produces results that are characteristic of what to expect:

  • Validation reward climbs from 0.59 to 0.63 in the first 10 steps (a four-point absolute gain), then continues to improve more gradually.
  • Tool calls per rollout drop sharply in those same 10 steps, from ~9 to ~4.
  • Average reasoning tokens fall from roughly 2,500 to 1,500.
  • End-to-end latency drops by ~10%.

The interpretation matters more than the numbers. The big early gains come from the model learning to use the specific tools efficiently — fewer redundant calls, less wasted reasoning over irrelevant outputs. The slower late-stage gains come from the model exploring genuinely better strategies for hard cases. Plotting per-sample reward delta against per-sample tool-call delta is one of the more useful diagnostics: you want most samples to land in the “higher reward, fewer tool calls” quadrant, and you want zero samples in the “lower reward, more tool calls” quadrant. That second condition is what tells you the new policy is a strict improvement, not a tradeoff.
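
A sketch of that quadrant check, given per-sample means before and after training:

```python
def quadrant_counts(before: dict, after: dict) -> dict:
    """before/after map sample_id -> {'reward': mean reward, 'calls': mean tool calls}.
    Ship check: most samples in strict_improvement, zero in do_not_ship."""
    counts = dict.fromkeys(
        ["strict_improvement", "accuracy_at_cost", "speed_regression", "do_not_ship"], 0
    )
    for sid in before:
        better = after[sid]["reward"] >= before[sid]["reward"]
        leaner = after[sid]["calls"] <= before[sid]["calls"]
        if better and leaner:
            counts["strict_improvement"] += 1  # higher reward, fewer tool calls
        elif better:
            counts["accuracy_at_cost"] += 1    # more accurate, but more calls
        elif leaner:
            counts["speed_regression"] += 1    # leaner, but less accurate
        else:
            counts["do_not_ship"] += 1         # slower and worse: must be zero
    return counts
```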

[Figure: Per-sample Δreward vs. Δtool-calls, as quadrants. Most points should cluster in the upper-left (higher reward, fewer tool calls: strict improvement, not a tradeoff). Upper-right (accuracy up at the cost of more calls) is acceptable in moderation; lower-left (faster but less accurate) warrants a re-check; lower-right (slower and worse: do not ship) must be empty.]

Patterns from production: where agent RFT pays off

The shapes that recur across early production deployments are more informative than any single headline number. Five patterns show up again and again.

Planning-stage compression. Many coding and document-editing agents have a pre-action planning phase — decide which files or sections to touch before doing anything. Latency on that phase is what users feel as “time to first useful output,” so it is a natural fine-tuning target. Restrict the planner to a small toolset (read, search, list), define the reward as F1 between predicted-and-actual touched files, and train on a few hundred to a thousand diverse examples. The recurring result: planning round-trips drop by roughly half, time-to-first-output drops in proportion, and accuracy improves rather than regresses. Keeping the train and eval splits disjoint matters, because the deployed model will see code and content it has never seen.
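
That reward is a few lines. A sketch, with predicted and actually-touched files as sets of paths:

```python
def plan_f1(predicted: set[str], actual: set[str]) -> float:
    """Planning-stage reward: F1 between the files the planner predicts it
    will touch and the files the completed task actually touched."""
    if not predicted or not actual:
        return 0.0
    tp = len(predicted & actual)  # true positives: correctly predicted files
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(actual)
    return 2 * precision * recall / (precision + recall)
```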

Modest gains against a hard human ceiling. Some agentic tasks are large-taxonomy classification problems — clinical coding, customs codes, internal product taxonomies — where the absolute best a model can do is constrained by genuine human disagreement among expert annotators. The base model lands at a moderate F1; RFT pushes it a few points higher, and the headline number understates the result because the practical ceiling is well below 1.0. The headroom that can be captured is captured. Latency typically drops 15–20% as the agent learns to reason more efficiently over the taxonomy.

Multi-tool generation with a multi-criteria rubric. Agents that produce structured artifacts — slide decks, briefs, reports — through a chain of tools, with a final harmonization step for coherence, benefit disproportionately from a graded rubric that scores content quality and structural fit as separate sub-criteria. The visible pattern in training: large improvements concentrated on previously-failing edge cases rather than uniform gains across the distribution. The investment goes into the rubric, not the dataset.
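
Structurally, such a rubric is a weighted sum of graded sub-scores. The criteria names and weights below are illustrative, not a canonical rubric:

```python
def rubric_reward(sub_scores: dict[str, float]) -> float:
    """Combine graded sub-criteria (each in [0, 1]) into one scalar reward."""
    weights = {"content_quality": 0.5, "structural_fit": 0.3, "coherence": 0.2}
    return sum(w * sub_scores.get(criterion, 0.0) for criterion, w in weights.items())
```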

Code generation in a data-starved domain. When the target is a custom DSL, an unusual hardware platform, or an internal framework with no public corpus of correct examples, supervised fine-tuning has nothing to chew on. Agent RFT can work with as few as a hundred prompts and a strong correctness-and-performance grader: run the produced code, check that it executes correctly, measure whatever quantitative metric matters. The fine-tuned model can surpass prior approaches with no example outputs in the training data. The grader is carrying the entire signal.
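
The shape of such a grader, assuming the artifact is a Python file and `test_cmd` is your correctness harness; the correctness/performance blend is illustrative, and real deployments run untrusted code inside proper isolation:

```python
# Run-and-measure grader for generated code. This sketch skips sandboxing;
# execute untrusted code only inside a real VM or container.
import subprocess, tempfile, time

def grade_code(source: str, test_cmd: list[str], timeout_s: float = 30.0) -> float:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    start = time.monotonic()
    try:
        proc = subprocess.run(test_cmd + [path], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0                   # never finished: no credit
    if proc.returncode != 0:
        return 0.0                   # incorrect: correctness gates everything
    elapsed = time.monotonic() - start
    return 0.7 + 0.3 * max(0.0, 1.0 - elapsed / timeout_s)  # correct, plus a speed bonus
```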

Research and extraction on long documents. Agents that read filings, regulations, contracts, or research papers and produce structured insights for a human reviewer benefit from a custom LLM-as-judge grader exposed via endpoint, scoring factual accuracy, reasoning, completeness, and source attribution as separate criteria. Reported gains are typically double-digit on core accuracy, with substantial reductions in hallucination and citation-omission rates.

The common thread isn’t the domain — it’s that each of these patterns shares three properties: (a) a base model with non-trivial baseline performance, (b) a task with genuine reasoning content, and (c) an investment in the grader. None would have been a good fit for supervised fine-tuning, because none has a clean dataset of “ideal trajectories.” What they have instead is a reliable way to score trajectories, and that turns out to be enough.

The grader is the product

If there’s one piece of advice that comes up in every successful deployment, it’s this: invest disproportionately in your grader.

[Figure: Binary vs. graded reward — what the model can climb. A binary grader yields a 0/1 spike (~90% of rollouts at 0, ~10% at 1) with no signal between the extremes; a graded rubric distributes reward across the interval, giving the model a gradient to climb. Same agent, same task; only the grader changed. Pitfalls flagged: brittle string matching ("$32" ≠ "32 dollars"), reward hacking (100% reward above the human ceiling means the rubric got gamed), and under-specified tasks (if experts disagree on the answer, the grader can't be more consistent).]

Your grader is the entire training signal. A binary grader gives the model almost no gradient — most rollouts get zero, the rare success gets one, and there’s no way to distinguish “almost right” from “completely wrong.” A graded rubric gives partial credit for partial progress: reading the right file, extracting the right number, getting the reasoning chain right even if the final answer is off. That partial credit is what lets the model hill-climb.

Three failure modes show up repeatedly:

  1. Brittle string matching. If your grader penalizes “$32” vs “32 dollars,” the model learns to optimize formatting instead of correctness. Use a model-based grader or a normalized comparison.
  2. Reward hacking. The model is smart and will find any seam in your grader. Early training runs sometimes hit 100% validation reward — far above the human-achievable ceiling — which was the tell that the model had found a way to game the rubric. Hardening the grader (more criteria, harder-to-spoof checks, occasional adversarial spot-checks) is iterative work, not a one-shot.
  3. Under-specified tasks. If two human experts disagree on the right answer, your grader can’t be more consistent than they are, and the model gets contradictory signal. Tasks need a single defensible answer or a rubric that domain experts converge on.

The endpoint grader is a lever worth using. It lets you put arbitrary logic in your scoring — calling out to other models, querying databases, running code, comparing structured outputs. The grader runs in your environment, on your infrastructure, with your data. Whatever you can express programmatically, you can use as a reward signal.

Infrastructure realities

Hosting tool endpoints and a grader for training looks similar to hosting them for production, with two differences that matter operationally.

Bursty load. Training is parallel. At the start of each step, the platform may issue hundreds of concurrent rollout requests — hundreds of tool servers spinning up at once. Your endpoints need to handle this burst pattern, which is unlike production traffic. Teams have used isolated VMs (one per rollout, important if your tools include anything destructive like shell access), containers, or shared services with aggressive rate-limit handling.

Failure attribution. If a tool endpoint fails — even transiently — the model gets zero reward for that rollout. The model didn’t do anything wrong, but the gradient signal says it did. Repeated infrastructure failures can collapse training: the model learns to avoid behaviors that correlate with infrastructure failures, even when those behaviors are correct. Heavy monitoring on tool failures, distinguishing model errors from infrastructure errors, is essential. So is having retry logic that’s transparent to the model.

A few smaller operational notes: keep tool outputs lean (every redundant token costs context, latency, and money during training); set token budgets on tool outputs; and treat the compute multiplier as a primary tuning knob — higher multipliers mean more exploration but also more load on your endpoints.
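
Two of those notes reduce to small wrappers around each tool handler: retries the model never sees, and a hard cap on output size. The retry counts and budget below are illustrative:

```python
import time

class TransientError(Exception):
    """Raised by tool handlers on infrastructure failures (not model errors)."""

def with_retries(tool_fn, attempts: int = 3, backoff_s: float = 0.5):
    """Retry transient failures before the model sees them, so a flaky
    endpoint doesn't grade a correct behavior as zero reward."""
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return tool_fn(*args, **kwargs)
            except TransientError:
                if attempt == attempts - 1:
                    raise  # surface only after retries are exhausted
                time.sleep(backoff_s * 2 ** attempt)
    return wrapped

def truncate(output: str, budget_chars: int = 4000) -> str:
    """Cap tool output: every redundant token costs context, latency, and money."""
    return output if len(output) <= budget_chars else output[:budget_chars] + "\n[...truncated]"
```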

When to reach for agent RFT, and when not to

Putting it together, the decision framework looks something like this:

Reach for agent RFT when:

  • You’ve exhausted prompt engineering, tool design, and task decomposition.
  • Your agent uses tools that are meaningfully out-of-distribution for the base model — your custom APIs, your internal data, your domain-specific schemas.
  • You have non-zero baseline performance: the base model sometimes succeeds.
  • You can express your definition of “good” as a graded reward signal, not just a binary.
  • Latency or tool-call efficiency matters as much as accuracy. The compression of trajectories that comes naturally with RFT is often the headline benefit.
  • You have, or can build, the infrastructure to host tool endpoints and a grader at training-time burst loads.

Don’t reach for it when:

  • The base model is at zero across the board. Fix the prior first.
  • Your task doesn’t have a defensible ground truth — even experts disagree on the right answer.
  • Your grader is binary and you can’t make it graded. The signal is too sparse.
  • You haven’t yet tried the cheaper interventions. RFT amplifies whatever is working; if nothing is working, there’s nothing to amplify.

[Figure: Decision flow — four gates before agent RFT. Gate 1: cheaper levers exhausted (prompts, tools, task splitting)? Gate 2: does the base model sometimes win (non-zero variance)? Gate 3: can you express a graded reward, not just a binary one? Gate 4: is your tool and grader infrastructure ready for hundreds of parallel calls? Any "no" routes to a fix-first branch (try cheaper levers; fix the prior with simpler tools or a stronger base; build the rubric with partial credit; build burst-ready infra with isolated VMs and retry logic). Four yeses, and RFT is the right next move. RFT amplifies whatever is working; if nothing is working, there's nothing to amplify.]

What this changes

The deeper shift agent RFT represents is that the unit of optimization has moved. For most of the last few years, “improving an LLM application” meant improving the prompt — the model was a fixed asset, and engineering effort was concentrated in the orchestration around it. With RFT, and especially with multi-step agent RFT, the model itself is now a tunable component of your stack. You can specialize a frontier model to your tools, your data, and your definition of quality, end-to-end.

That’s a different posture for an engineering team. It means the grader is an artifact you maintain, version, and harden over time. It means your tool endpoints have a training mode as well as a production mode. It means evaluation is no longer a one-shot benchmark but a continuous signal feeding back into the model. The teams getting the most out of this aren’t the ones with the most data — they’re the ones who treat their grader and their tool environment as first-class engineering surfaces.

The plateau most agent teams hit isn’t a property of the base model. It’s a property of how much of the problem the prompt can express. Agent RFT moves that boundary.

References and further reading

This post leans heavily on the technical material OpenAI has published about agent reinforcement fine-tuning. The primary sources:

  • Agent Reinforcement Fine Tuning — Will Hang & Cathy Zhou, OpenAI. AI Engineer Code Summit, November 2025. The canonical technical talk on multi-step, tool-using agent RFT — covers rollouts, graders, the rollout_id mechanic, and the principles behind well-specified tasks and unhackable reward functions.
  • Reinforcement fine-tuning · OpenAI API documentation. The product-side reference for hyperparameters (compute multiplier, batch size, epochs), grader configuration (model graders, string graders, endpoint graders), and tool-server integration.
  • Will Hang, on the significance of training-time tool calls: “These two additions actually mark the first time that we at OpenAI have allowed models to interact with the outside world during the training process.” (AI Engineer Code Summit, 2025.)
  • AI Engineer Code Summit — the conference where the agent RFT talk was given, and a useful pointer to adjacent talks on agent training, evals, and tool design.