When prompt engineering stops being enough — a practitioner’s guide to agent reinforcement fine-tuning (RFT) for tool-using agents. What it changes, when to reach for it, and what early production results actually show.
Prompt engineering hits a wall. Agent reinforcement fine-tuning trains a frontier model end-to-end on your tools, your environment, your definition of good.

Agent reinforcement fine-tuning was introduced by OpenAI in late 2025 — see the AI Engineer Code Summit talk by Will Hang and Cathy Zhou and the OpenAI RFT documentation; references collected at the end of this post.
If you’ve shipped an agent into production, the curve probably looks familiar. The first prompt you write gets you to 60% on your eval. A week of prompt engineering, better tool descriptions, and tightening the task harness pushes you into the low 70s. Then progress flattens. You add a planner, you split tools, you rewrite system messages — and the needle barely moves. You’ve squeezed the prior dry.
This is the gap that agent reinforcement fine-tuning (agent RFT) is built to close. It’s a meaningfully different technique from supervised fine-tuning or even single-turn RFT, and it’s the first practical way to train a frontier reasoning model end-to-end on a multi-step, tool-using task — using your tools, your environment, and your definition of “good.”
This piece is for CTOs and engineers who are already past the prompt-engineering plateau and trying to decide whether agent RFT is the next investment. We’ll cover what it actually does under the hood, why it’s sample-efficient, what the training loop looks like in practice, and where it has and hasn’t worked in early production deployments.
A model becomes an agent the moment it can act on the world without going through you for every step. That capability is mediated by tools: a coding agent has a terminal and a code interpreter; a customer service agent has a CRM and a refund API; a research agent has a browser and a file system. Each tool call writes its result back into the agent’s context window, and the agent reasons over that, calls another tool, and repeats — until it produces a final answer.
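The loop described above can be sketched in a few lines. Here `call_model` and the tool callables are hypothetical stand-ins for whatever model API and tools your agent uses, not a real interface:

```python
def run_agent(call_model, task, tools, max_steps=10):
    """One trajectory: reason, call a tool, fold the result back into
    context, repeat until a final answer or the budget runs out."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(context)  # model reasons over the full context
        if step["type"] == "final_answer":
            return step["content"], context
        result = tools[step["tool"]](**step.get("args", {}))
        # each tool result is written back into the context window
        context.append({"role": "tool", "name": step["tool"], "content": result})
    return None, context  # tool-call budget exhausted
```

The essential property is that every tool result re-enters the context before the next reasoning step; that is what makes the trajectory, not just the final answer, a product of the model's policy.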
That feedback loop is exactly what makes agents useful, and exactly what makes them hard to train. The base model was post-trained on tools that look nothing like yours. Your tools have your naming conventions, your schemas, your latency characteristics, your edge cases. The agent will use them — but often inefficiently. It will call five tools when one would do. It will repeat the same search with slightly different parameters. It will reason over irrelevant outputs. None of that shows up cleanly as a “wrong answer” you can correct with a prompt; it shows up as a slow, expensive trajectory that happens to land on the right answer most of the time.
The gap between how the base model uses generic tools and how your agent should use your tools is what teams call distribution shift, and closing it is the core promise of agent RFT.
Fine-tuning is the right answer to a specific problem, not the first answer to every problem. Before reaching for it, the standard playbook still applies, and it gets surprisingly far: sharper prompts and system messages, better tool descriptions, a tighter task harness, and sensible tool granularity.
Only after these are exhausted does fine-tuning earn its place. The framing that holds up well in practice: prompt engineering and task design move the agent within the basin of behavior the base model already has; fine-tuning changes the basin itself.
Reinforcement fine-tuning, in its original form, was a single-turn technique. You gave the model a prompt, it produced an answer, a grader scored the answer, and the model’s weights were nudged to make high-scoring answers more likely. Useful, but agents don’t work in single turns.
Agent RFT extends the loop. During training, the model isn't just generating answers; it's generating full rollouts. A rollout is one complete trajectory: the model thinks, calls a tool, sees the result, thinks again, calls another tool, and eventually emits a final answer. Two things make this work: your tools and grader are hosted as endpoints the training platform calls during every rollout, and every tool call is tagged so it can be traced back to its trajectory.
A unique rollout ID is attached to every tool call from the same trajectory, which means your tool servers and your grader can correlate calls, maintain per-rollout state, and grade based on the full trajectory rather than just the final answer. This matters more than it sounds — it’s what lets you reward how the agent got to the answer, not just whether it got there.
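A minimal sketch of what per-rollout state on a tool server might look like, assuming a request shape with `rollout_id`, `tool`, and `args` fields (illustrative, not a documented schema):

```python
from collections import defaultdict

class ToolServer:
    """Keys all bookkeeping on the rollout ID attached to each tool call,
    so the grader can score the full trajectory, not just the answer."""
    def __init__(self, tools):
        self.tools = tools
        self.trace = defaultdict(list)  # rollout_id -> ordered call log

    def handle(self, request):
        rid = request["rollout_id"]
        self.trace[rid].append((request["tool"], request["args"]))
        return self.tools[request["tool"]](**request["args"])

    def trajectory(self, rid):
        # exposed to the grader: the complete call sequence for one rollout
        return self.trace[rid]
```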
The training loop, then, looks like this: the platform issues many parallel rollout requests, the agent explores different ways of using your tools, the grader scores each trajectory, and the model’s weights are updated to make high-reward trajectories more likely. Over time, the model converges on a policy that uses your specific tools efficiently and reasons well over their outputs.
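Schematically, one training step might look like the sketch below, where `rollout`, `grade`, and `update_weights` are placeholders for platform internals and `n` plays the role of the compute multiplier:

```python
def training_step(model, prompts, rollout, grade, update_weights, n=8):
    """One outer-loop step: many rollouts per prompt, grade each
    trajectory, upweight the high-reward ones."""
    batch = []
    for prompt in prompts:
        trajectories = [rollout(model, prompt) for _ in range(n)]
        rewards = [grade(t) for t in trajectories]
        baseline = sum(rewards) / len(rewards)  # center rewards per prompt
        batch += [(t, r - baseline) for t, r in zip(trajectories, rewards)]
    update_weights(model, batch)  # push probability toward positive-advantage trajectories
```

The per-prompt baseline is the detail to notice: a trajectory is reinforced only to the extent it beats its siblings on the same prompt, which is why multiple rollouts per sample are essential.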
One of the more counterintuitive properties of agent RFT is that it works with surprisingly little data. Production runs have succeeded with as few as 100–150 training examples, and several have done well with around 1,000. Compare that to the tens of thousands typically needed for supervised fine-tuning.
The reason is structural. The model is generating its own training data through exploration. Each prompt in your training set isn’t one example — it’s a seed for many trajectories, each scored by your grader, each contributing gradient signal. The compute multiplier (a hyperparameter that controls how many rollouts per sample) lets you trade compute for exploration directly: more rollouts means more chances to stumble onto a good trajectory on a hard sample.
This also tells you when the technique won't work. Agent RFT depends on the base model occasionally getting the answer right. If your model scores zero across every rollout on every sample, there's no signal to amplify; you're trying to nudge a flat landscape. Two diagnostic plots tell you whether you're in good shape: the mean reward across all rollouts, and the fraction of samples where at least one rollout scores above zero.
If both numbers are near zero, the task is too hard for the base model and RFT won’t rescue it. The fix is upstream: simpler tools, better prompts, or a stronger base model. RFT amplifies a working prior; it doesn’t create one.
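These two diagnostics are cheap to compute before committing to a run, assuming you have per-rollout rewards grouped by sample:

```python
def baseline_diagnostics(rewards_per_sample):
    """rewards_per_sample: one list of per-rollout rewards per prompt.
    Returns (fraction of samples with any nonzero-reward rollout,
    mean reward over all rollouts). Near-zero on both means RFT has
    no prior to amplify."""
    any_success = sum(1 for rs in rewards_per_sample if max(rs) > 0)
    total = sum(len(rs) for rs in rewards_per_sample)
    mean_reward = sum(sum(rs) for rs in rewards_per_sample) / total
    return any_success / len(rewards_per_sample), mean_reward
```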
It’s worth grounding all of this in a concrete task. The FinQA benchmark gives a model a financial report and asks numerical-reasoning questions about it. In the standard setup, the relevant report is included in the prompt — the task is purely reasoning over given context.
A more realistic agentic version: strip out the report, and force the model to find it. Give it a corpus of 2,800 financial documents and three tools:
- `search` — semantic search over the corpus
- `list` — directory and file listing
- `cat` — read a document by path

Now the agent has to figure out which report is relevant, locate it, extract the right numbers, and reason over them — within a budget of 10 tool calls. The grader is model-based: full credit for matches, partial credit for near-misses (rounding errors, formatting differences like "$32" vs "32 dollars"), and zero for wrong answers. Strict string matching would punish the model for trivial formatting mistakes; an unconstrained grader would let it game the rubric. The middle ground is a careful rubric the grader follows.
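As a rough programmatic stand-in for that lenient rubric (the actual grader is model-based), a partial-credit numeric comparison might look like:

```python
import re

def normalize(ans):
    """Strip currency symbols, words, and commas so '$32' and
    '32 dollars' compare equal. Deliberately simple for illustration."""
    m = re.search(r"-?\d+\.?\d*", ans.replace(",", ""))
    return float(m.group()) if m else None

def grade(predicted, gold, partial_tol=0.01):
    p, g = normalize(predicted), normalize(gold)
    if p is None or g is None:
        return 0.0
    if p == g:
        return 1.0  # full credit: exact numeric match
    if abs(p - g) <= partial_tol * max(abs(g), 1e-9):
        return 0.5  # partial credit: rounding-level near miss
    return 0.0
```

The point of the partial-credit branch is gradient: a rollout that found the right report and nearly computed the right figure should score above one that never got close.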
Running agent RFT on this task — 1,000 training samples, batch size 16, three epochs, GPT-5 with medium reasoning effort — produces a learning curve characteristic of the technique: large gains early in training, slower gains late.
The interpretation matters more than the numbers. The big early gains come from the model learning to use the specific tools efficiently — fewer redundant calls, less wasted reasoning over irrelevant outputs. The slower late-stage gains come from the model exploring genuinely better strategies for hard cases. Plotting per-sample reward delta against per-sample tool-call delta is one of the more useful diagnostics: you want most samples to land in the “higher reward, fewer tool calls” quadrant, and you want zero samples in the “lower reward, more tool calls” quadrant. That second condition is what tells you the new policy is a strict improvement, not a tradeoff.
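The quadrant diagnostic is easy to compute from per-sample deltas (new policy minus base policy):

```python
def quadrants(samples):
    """samples: (reward_delta, tool_call_delta) pairs, one per eval
    sample. Healthy runs concentrate in 'better_cheaper' and leave
    'worse_costlier' empty."""
    counts = {"better_cheaper": 0, "better_costlier": 0,
              "worse_cheaper": 0, "worse_costlier": 0}
    for dr, dc in samples:
        key = ("better" if dr >= 0 else "worse") + \
              ("_cheaper" if dc <= 0 else "_costlier")
        counts[key] += 1
    return counts
```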
The shapes that recur across early production deployments are more informative than any single headline number. Five patterns show up again and again.
Planning-stage compression. Many coding and document-editing agents have a pre-action planning phase — decide which files or sections to touch before doing anything. Latency on that phase is what users feel as “time to first useful output,” so it is a natural fine-tuning target. Restrict the planner to a small toolset (read, search, list), define the reward as F1 between predicted and actual touched files, and train on a few hundred to a thousand diverse examples. The recurring result: planning round-trips drop by roughly half, time-to-first-output drops in proportion, and accuracy improves rather than regresses. Keeping the train and eval splits disjoint matters, because the deployed model will see code and content it has never seen.
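The F1 reward described here is straightforward to implement over file sets:

```python
def file_f1(predicted, actual):
    """F1 between the set of files the planner said it would touch
    and the set actually touched -- the planning reward above."""
    p, a = set(predicted), set(actual)
    if not p or not a:
        return 0.0
    tp = len(p & a)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(a)
    return 2 * precision * recall / (precision + recall)
```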
Modest gains against a hard human ceiling. Some agentic tasks are large-taxonomy classification problems — clinical coding, customs codes, internal product taxonomies — where the absolute best a model can do is constrained by genuine human disagreement among expert annotators. The base model lands at a moderate F1, RFT pushes it a few points higher, and the headline number understates the result because the practical ceiling is well below 1.0. The headroom that can be captured is captured. Latency typically drops 15–20% as the agent learns to reason more efficiently over the taxonomy.
Multi-tool generation with a multi-criteria rubric. Agents that produce structured artifacts — slide decks, briefs, reports — through a chain of tools, with a final harmonization step for coherence, benefit disproportionately from a graded rubric that scores content quality and structural fit as separate sub-criteria. The visible pattern in training: large improvements concentrated on previously-failing edge cases rather than uniform gains across the distribution. The investment goes into the rubric, not the dataset.
Code generation in a data-starved domain. When the target is a custom DSL, an unusual hardware platform, or an internal framework with no public corpus of correct examples, supervised fine-tuning has nothing to chew on. Agent RFT can work with as few as a hundred prompts and a strong correctness-and-performance grader: run the produced code, check that it executes correctly, measure whatever quantitative metric matters. The fine-tuned model can surpass prior approaches with no example outputs in the training data. The grader is carrying the entire signal.
Research and extraction on long documents. Agents that read filings, regulations, contracts, or research papers and produce structured insights for a human reviewer benefit from a custom LLM-as-judge grader exposed via endpoint, scoring factual accuracy, reasoning, completeness, and source attribution as separate criteria. Reported gains are typically double-digit on core accuracy, with substantial reductions in hallucination and citation-omission rates.
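One way to fold per-criterion judge scores into a single reward is a weighted sum. The criteria names mirror those in the text, but the weights below are illustrative, not reported values:

```python
# Illustrative weights -- a real deployment would tune these.
CRITERIA = {"factual_accuracy": 0.4, "reasoning": 0.25,
            "completeness": 0.2, "source_attribution": 0.15}

def combined_reward(scores):
    """scores: per-criterion 0-1 values from an LLM judge."""
    assert abs(sum(CRITERIA.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(CRITERIA[c] * scores.get(c, 0.0) for c in CRITERIA)
```

Scoring the criteria separately, then combining, tends to make the judge easier to audit than asking for one holistic number.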
The common thread isn’t the domain — it’s that each of these patterns shares three properties: (a) a base model with non-trivial baseline performance, (b) a task with genuine reasoning content, and (c) an investment in the grader. None would have been a good fit for supervised fine-tuning, because none has a clean dataset of “ideal trajectories.” What they have instead is a reliable way to score trajectories, and that turns out to be enough.
If there’s one piece of advice that comes up in every successful deployment, it’s this: invest disproportionately in your grader.
Your grader is the entire training signal. A binary grader gives the model almost no gradient — most rollouts get zero, the rare success gets one, and there’s no way to distinguish “almost right” from “completely wrong.” A graded rubric gives partial credit for partial progress: reading the right file, extracting the right number, getting the reasoning chain right even if the final answer is off. That partial credit is what lets the model hill-climb.
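A toy version of such a graded rubric, using the milestones named above with weights chosen purely for illustration:

```python
def partial_credit(trajectory, final_correct):
    """trajectory: dict of milestone flags (illustrative names).
    Partial progress scores above zero even when the answer is wrong,
    which is what gives the model a gradient to hill-climb."""
    score = 0.0
    if trajectory.get("opened_right_file"):
        score += 0.2
    if trajectory.get("extracted_right_number"):
        score += 0.3
    if final_correct:
        score += 0.5
    return score
```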
Three failure modes show up repeatedly: graders that are too strict, punishing trivially different formatting as if it were a wrong answer; graders that are too loose, which the model learns to game rather than satisfy; and binary graders that offer no partial credit and therefore almost no gradient.
The endpoint grader is a lever worth using. It lets you put arbitrary logic in your scoring — calling out to other models, querying databases, running code, comparing structured outputs. The grader runs in your environment, on your infrastructure, with your data. Whatever you can express programmatically, you can use as a reward signal.
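The request/response shape of an endpoint grader might look like the following sketch; the field names (`final_answer`, `trajectory`, `reference`) are assumptions for illustration, not a documented schema:

```python
import json

def grade_endpoint(request_body):
    """Receives the full rollout as JSON, returns a scalar reward.
    Arbitrary logic can live here: DB lookups, code execution,
    calls to other models."""
    payload = json.loads(request_body)
    reward = 1.0 if payload["final_answer"] == payload["reference"] else 0.0
    # small efficiency penalty per tool call beyond a budget of 5
    reward -= 0.01 * max(0, len(payload["trajectory"]) - 5)
    return json.dumps({"reward": max(reward, 0.0)})
```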
Hosting tool endpoints and a grader for training looks similar to hosting them for production, with two differences that matter operationally.
Bursty load. Training is parallel. At the start of each step, the platform may issue hundreds of concurrent rollout requests — hundreds of tool servers spinning up at once. Your endpoints need to handle this burst pattern, which is unlike production traffic. Teams have used isolated VMs (one per rollout, important if your tools include anything destructive like shell access), containers, or shared services with aggressive rate-limit handling.
Failure attribution. If a tool endpoint fails — even transiently — the model gets zero reward for that rollout. The model didn’t do anything wrong, but the gradient signal says it did. Repeated infrastructure failures can collapse training: the model learns to avoid behaviors that correlated with infrastructure failures, even if those behaviors were correct. Heavy monitoring on tool failures, distinguishing model errors from infrastructure errors, is essential. So is having retry logic that’s transparent to the model.
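A retry wrapper that distinguishes transient infrastructure failures from genuine model errors might look like this sketch:

```python
import time

class InfraError(Exception):
    """Transient infrastructure failure -- must not be attributed
    to the model as a zero-reward trajectory."""

def call_tool_with_retry(tool, args, retries=3, backoff=0.5):
    """Retries transient failures transparently to the model; only
    raises (so the rollout can be excluded from grading) when the
    endpoint is genuinely unavailable."""
    for attempt in range(retries):
        try:
            return tool(**args)
        except InfraError:
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
    raise InfraError("tool unavailable; exclude rollout from grading")
```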
A few smaller operational notes: keep tool outputs lean (every redundant token costs context, latency, and money during training); set token budgets on tool outputs; and treat the compute multiplier as a primary tuning knob — higher multipliers mean more exploration but also more load on your endpoints.
Putting it together, the decision framework looks something like this:
Reach for agent RFT when:

- the base model already shows non-trivial baseline performance on the task;
- the task has genuine multi-step reasoning and tool-use content;
- you can score trajectories reliably, even without a dataset of ideal outputs;
- the prompt-engineering playbook is genuinely exhausted.
Don’t reach for it when:

- the standard playbook of prompts, tool descriptions, and task design hasn’t been exhausted;
- the base model scores zero across every rollout — there is no prior to amplify;
- you have a clean dataset of ideal trajectories, in which case supervised fine-tuning is simpler and cheaper.
The deeper shift agent RFT represents is that the unit of optimization has moved. For most of the last few years, “improving an LLM application” meant improving the prompt — the model was a fixed asset, and engineering effort was concentrated in the orchestration around it. With RFT, and especially with multi-step agent RFT, the model itself is now a tunable component of your stack. You can specialize a frontier model to your tools, your data, and your definition of quality, end-to-end.
That’s a different posture for an engineering team. It means the grader is an artifact you maintain, version, and harden over time. It means your tool endpoints have a training mode as well as a production mode. It means evaluation is no longer a one-shot benchmark but a continuous signal feeding back into the model. The teams getting the most out of this aren’t the ones with the most data — they’re the ones who treat their grader and their tool environment as first-class engineering surfaces.
The plateau most agent teams hit isn’t a property of the base model. It’s a property of how much of the problem the prompt can express. Agent RFT moves that boundary.
This post leans heavily on the technical material OpenAI has published about agent reinforcement fine-tuning. The primary sources: