Agents · Engineering · Spring 2026

A Practical Guide to Agent Reinforcement Fine-Tuning

A ten-step tutorial on agent reinforcement fine-tuning (RFT): how to build the dataset, stand up tool servers, write the grader, configure KL and group size, size GPUs, and choose among TRL, verl, OpenRLHF, Unsloth, and NeMo RL — with code and example outputs at every step.

Tiago Santos · May 7, 2026 · 5 min read · 1,151 words · Filed under Agents

A step-by-step companion to the conceptual primer: follow ten steps from a folder of prompts to a fine-tuned tool-using agent. Code, expected outputs, and the watch-outs at every step.

What we’re going to build

By the end of this tutorial you will have trained a tool-using agent end-to-end via reinforcement fine-tuning. The running example: a research agent that answers numerical questions about a small corpus of company filings, with three tools — search (semantic search), list (browse paths), and read (fetch a document). The same ten steps work for any agentic task; only the dataset and the tools change.

This is a hands-on tutorial. Every step has runnable code, an expected output to check against, and a flagged gotcha. The framework choice is deferred to Step 9 — the early steps work the same way whether you train against a managed RFT API or self-host on TRL, verl, OpenRLHF, Unsloth, or NeMo RL.

Tutorial roadmap (figure) — ten steps from dataset to deployment: 1 define and sanity-check the prior (does the base model sometimes win?) · 2 build the training dataset (JSONL: prompt + ground truth) · 3 stand up tool servers (HTTP endpoints, bearer auth) · 4 write the grader (rule-based, endpoint, or model-judge) · 5 configure hyperparameters (group size, KL, learning rate, epochs) · 6 launch training (watch reward, tool calls, KL) · 7 evaluate base vs fine-tuned (per-sample diff scoreboard) · 8 inspect failures and reward hacking (rank by Δreward, eyeball traces) · 9 pick a framework and budget hardware · 10 deploy (split the training server from prod). Each step has code, an expected output, and one watch-out. If the prior fails (Step 1), you don't move on — RFT can't hill-climb a flat landscape.

Prerequisites

  • Python 3.10+ with requests, fastapi, uvicorn, and datasets.
  • A base reasoning model — open-weights (Qwen-2.5-7B-Instruct, Llama-3.1-8B-Instruct, or similar) for self-hosted; or API access if you’re using a managed RFT service.
  • One GPU with ≥24 GB for 7B-class LoRA, ≥80 GB (H100/A100) for full-parameter or larger models. None needed if managed.
  • Somewhere to host HTTP services — Modal, Fly, Cloud Run, or a plain VM. Tools and grader live behind URLs.
  • ~50–500 prompts to start. We’ll show how to grow this in Step 2.

Step 1: Define the task and sanity-check the prior

Goal: decide what success looks like, then prove the base model can sometimes hit it. If the base model fails on every rollout, RFT can’t hill-climb a flat landscape — fix the prior first.

Take a small batch of representative prompts (50 is enough), run the base model 5 times each through your tools, and look at the per-sample success-rate distribution. The shape of that distribution decides whether you have a viable RFT task.

# step1_prior_check.py
import json, requests
from collections import Counter

BASE_MODEL_URL = "http://localhost:8001/v1/chat/completions"   # vLLM endpoint
N_ROLLOUTS = 5

def run_agent(prompt, tools_spec):
    """Send the prompt to the base model with tools; return (success: bool, n_tool_calls: int)."""
    # ... your model + tool-calling loop here ...
    # Returns whether the final answer matched ground truth.
    raise NotImplementedError  # plug in your agent loop before running

samples = [json.loads(line) for line in open("data/research_val.jsonl")]
results = []
for s in samples[:50]:
    successes = sum(run_agent(s["prompt"], s["tools"])[0] for _ in range(N_ROLLOUTS))
    rate = successes / N_ROLLOUTS
    results.append((s["id"], rate))

# Bucket each sample as never / sometimes / always
buckets = Counter(
    "always"    if r == 1.0 else
    "sometimes" if 0 < r < 1 else
    "never"
    for _, r in results
)
print(buckets)
EXPECTED OUTPUT
Counter({'sometimes': 19, 'never': 18, 'always': 13})

# 19/50 (38%) of samples are in the signal band — well over the 15-30% rule of thumb.
# RFT has something to learn here. Move on.

Step 2: Build the training dataset

Goal: produce two JSONL files — research_train.jsonl and research_val.jsonl — where each row is a self-contained training example.

Three sourcing strategies (in order of production-realism): real production logs, synthetic generation from a stronger model, public benchmarks adapted into your tool environment. Aim for 200–1,000 prompts in train, 10–20% held out for val. Stratify the split by task type and difficulty so val isn’t accidentally easier than train.
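If you're splitting by hand, the stratified split is a few lines of standard-library Python. A minimal sketch — the combined input file data/research_all.jsonl and the 15% validation fraction are assumptions, not part of the tutorial's fixed layout:

# step2_stratified_split.py — stratified train/val split (sketch; input path and 15% fraction assumed)
import json, random
from collections import defaultdict

random.seed(42)
rows = [json.loads(line) for line in open("data/research_all.jsonl")]

# Group by (task_type, difficulty) so val mirrors train's mix
groups = defaultdict(list)
for r in rows:
    key = (r["metadata"]["task_type"], r["metadata"]["difficulty"])
    groups[key].append(r)

train, val = [], []
for key, bucket in groups.items():
    random.shuffle(bucket)
    n_val = max(1, round(0.15 * len(bucket)))   # ~15% of each stratum held out
    val.extend(bucket[:n_val])
    train.extend(bucket[n_val:])

with open("data/research_train.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in train)
with open("data/research_val.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in val)
print(f"train={len(train)}  val={len(val)}")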

Each row has this shape:

{
  "id": "row-2842",
  "prompt": "What was Henderson Industries' Q3 2025 operating margin?",
  "reference_answer": "13.4%",
  "expected": {
    "tolerance_rel": 0.02,
    "evidence_path": "filings/henderson-10q-2025-q3.html",
    "must_use_tools": ["search", "read"]
  },
  "tools": [
    {"type": "function", "function": {"name": "search", "parameters": {...}}},
    {"type": "function", "function": {"name": "list",   "parameters": {...}}},
    {"type": "function", "function": {"name": "read",   "parameters": {...}}}
  ],
  "metadata": {"task_type": "numerical_qa", "difficulty": "medium"}
}

Three things to keep in mind. The id field will save you the moment you start diffing per-sample reward across runs. The reference_answer gets read by the grader (we’ll wire that in Step 4) — keep it structured, not free-form. The tools field is optional in some platforms, required in others; including it makes the row self-describing and is the right default for tasks that span multiple toolsets.

A small validator before you submit anything:

# step2_validate_dataset.py
import json
from collections import Counter

REQUIRED = {"id", "prompt", "reference_answer"}
ROWS = []
for path in ["data/research_train.jsonl", "data/research_val.jsonl"]:
    with open(path) as f:
        for ln, line in enumerate(f, 1):
            row = json.loads(line)
            missing = REQUIRED - row.keys()
            assert not missing, f"{path}:{ln} missing {missing}"
            assert row["id"] not in {r["id"] for r in ROWS}, f"dup id {row['id']}"
            ROWS.append(row)

print(f"Total rows: {len(ROWS)}")
print(f"Train/val: {sum(1 for r in ROWS if r['id'].startswith('train'))} / "
      f"{sum(1 for r in ROWS if r['id'].startswith('val'))}")
print(f"Difficulty distribution: "
      f"{dict(Counter(r['metadata']['difficulty'] for r in ROWS))}")
EXPECTED OUTPUT
Total rows: 612
Train/val: 489 / 123
Difficulty distribution: {'easy': 184, 'medium': 287, 'hard': 141}

Step 3: Stand up your tool servers

Goal: expose each tool as a bearer-authenticated HTTP endpoint that the training platform can call from its rollout workers.

Tools are HTTP services, not in-process functions. The platform runs hundreds of rollouts in parallel and POSTs to your endpoints; the same servers should handle production calls later, possibly with different config. A minimal FastAPI skeleton:

# step3_tool_server.py
import os

from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.security import HTTPBearer

app = FastAPI()
security = HTTPBearer()

def auth(creds=Depends(security)):
    if creds.credentials != os.environ["TOOL_BEARER"]:
        raise HTTPException(401, "bad token")

@app.post("/search")
async def search(req: Request, _=Depends(auth)):
    body = await req.json()
    args = body.get("arguments", {})
    call_id = body.get("call_id")
    query = args.get("query", "").strip()
    if not query:
        # Structured error, not 500 — see Watch For below
        return {"output": "Error: query is required", "call_id": call_id}
    hits = embed_and_search(query, top_k=3)         # your retrieval logic
    return {"output": format_hits(hits), "call_id": call_id}

@app.post("/list")
async def list_paths(req: Request, _=Depends(auth)):
    body = await req.json()
    prefix = body.get("arguments", {}).get("prefix", "")
    paths = corpus_index.list(prefix)
    return {"output": "\n".join(paths), "call_id": body.get("call_id")}

@app.post("/read")
async def read(req: Request, _=Depends(auth)):
    body = await req.json()
    path = body.get("arguments", {}).get("path")
    return {"output": corpus_index.read(path)[:8000],   # token budget!
            "call_id": body.get("call_id")}

Smoke-test before submitting any training job:

$ curl -s -X POST https://tools.example.com/search \
    -H "Authorization: Bearer $TOOL_BEARER" \
    -H "Content-Type: application/json" \
    -d '{"arguments": {"query": "Henderson Q3 operating margin"},
          "call_id": "test-001"}' | jq
EXPECTED OUTPUT
{
  "output": "filings/henderson-10q-2025-q3.html (score 0.87)\n  Q3 operating margin: 13.4% on revenue of $842M ...",
  "call_id": "test-001"
}

Step 4: Write your grader

Goal: turn each rollout’s trajectory into a scalar reward in [0, 1]. The grader is the entire training signal — invest disproportionately here.

Three implementations, picked by complexity:

  • Rule-based / Python. A small function: regex match, normalized numerical comparison with tolerance, schema check. Cheap, deterministic, brittle if not normalized — a minimal sketch follows this list.
  • HTTP endpoint. A service you host. Right when scoring needs database lookups, calls to other models, or logic too big for a function.
  • Model-as-judge. A structured prompt to a strong model with a numeric rubric and a JSON-schema response. The most common production choice.
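Here is what the rule-based option looks like for this numerical task — a sketch only; the normalization rules below are assumptions, and they are exactly the part that tends to turn brittle:

# step4_rule_grader.py — rule-based alternative to the model judge (sketch; normalization rules are assumptions)
import re

def rule_grader(item: dict, sample: dict, tol_rel: float = 0.02) -> float:
    """Return 1.0 if the answer matches the reference within relative tolerance, else 0.0."""
    def to_number(text: str) -> float | None:
        t = text.lower().replace(",", "").replace("$", "").strip()
        m = re.search(r"-?\d+(?:\.\d+)?", t)
        if m is None:
            return None
        value = float(m.group())
        rest = t[m.end():].strip()
        if "%" in t:
            value /= 100                        # 13.4% -> 0.134
        elif rest.startswith(("billion", "b")):
            value *= 1e9
        elif rest.startswith(("million", "m")):
            value *= 1e6
        elif rest.startswith(("thousand", "k")):
            value *= 1e3
        return value

    ref = to_number(item["reference_answer"])
    got = to_number(sample["output_text"])
    if ref is None or got is None:
        return 0.0
    return 1.0 if abs(got - ref) <= tol_rel * abs(ref) else 0.0

The model-judge below handles the same format variation without hand-written rules, at the cost of one judge call per rollout.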

For the research agent, a model-as-judge with a numerical-grader rubric:

# step4_grader.py
GRADER = {
    "type": "score_model",
    "name": "numerical_grader",
    "model": "gpt-4.1",
    "range": [0, 1],
    "pass_threshold": 0.75,
    "sampling_params": {"temperature": 0},
    "input": [
        {"role": "system", "content": """\
You will be given a Reference Answer and a Model Answer.
Score the Model Answer on a 0-1 scale.

Return 1 if BOTH:
- the Model Answer contains only the final numeric answer
- the value matches the Reference Answer up to format

Format variations that still count as correct:
- currency symbols ($, USD, €)
- magnitude suffixes (M, million, K, B)
- percent forms (7% vs 0.07)
- commas and whitespace

Return 0.5 if off by a tenth of a percent or less, or a true rounding error.
Return 0 otherwise. Reply with ONLY the numeric score."""},
        {"role": "user", "content":
            "- Reference Answer: {{item.reference_answer}}\n"
            "- Model Answer: {{sample.output_text}}"}
    ],
}

Test the grader standalone before training — four quick cases:

# step4_grader_test.py
# run_grader(item, sample): your helper that fills the GRADER template and calls the judge model
cases = [
    ({"reference_answer": "13.4%"}, {"output_text": "13.4%"}),       # exact
    ({"reference_answer": "13.4%"}, {"output_text": "0.134"}),       # format variant
    ({"reference_answer": "13.4%"}, {"output_text": "13.41%"}),      # rounding
    ({"reference_answer": "13.4%"}, {"output_text": "12.0%"}),       # wrong
]
for item, sample in cases:
    print(f"ref={item['reference_answer']:6}  out={sample['output_text']:8}  "
          f"reward={run_grader(item, sample):.2f}")
EXPECTED OUTPUT
ref=13.4%   out=13.4%     reward=1.00
ref=13.4%   out=0.134     reward=1.00
ref=13.4%   out=13.41%    reward=0.50
ref=13.4%   out=12.0%     reward=0.00

Step 5: Configure hyperparameters

Goal: pick a starting hyperparameter set you can defend. Reasonable defaults to start from, ranked by how much they’ll move your run:

# step5_config.yaml
algorithm: grpo            # group-relative policy optimization

# The four numbers that matter most
group_size: 8              # rollouts per prompt; below 4 the baseline is too noisy
kl_coef: 0.01              # β: KL penalty against the frozen reference. Tune carefully.
learning_rate: 1.0e-5      # 10-100× lower than SFT
epochs: 2                  # agent RFT overfits fast on small datasets

# Mostly-fine-as-is
batch_size: 16             # max within memory after rollout buffer
max_trajectory_tokens: 4096
max_tool_calls: 12         # rollout tool-call budget
warmup_steps: 5
seed: 42

# Inputs
train_path: data/research_train.jsonl
val_path:   data/research_val.jsonl
base_model: Qwen/Qwen2.5-7B-Instruct
lora:
  rank: 32
  alpha: 64
  target_modules: [q_proj, k_proj, v_proj, o_proj]

KL coefficient β is the single biggest knob. Too low and the policy drifts away from the reference model (capability collapse — the model forgets formats, refuses unrelated tasks, breaks tool-call syntax). Too high and the model can’t learn anything new. Watch the KL curve during training; if it spikes, tighten β. Reasonable range: 0.001 to 0.05.

Group size controls baseline variance. 8–16 is the sweet spot. Below 4, group-mean advantages get noisy enough that gradients stop pointing in a useful direction.
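A quick way to convince yourself of the group-size claim: simulate the group-mean baseline for a prompt the model solves about 40% of the time and watch its noise shrink roughly as 1/√N. Illustrative only — the 40% success rate is an assumption:

# group_size_noise.py — how noisy the group-mean baseline is at different group sizes (illustrative)
import random, statistics

random.seed(0)
P_SUCCESS = 0.4            # assumed per-rollout success rate on one prompt
TRIALS = 10_000            # simulated groups per setting

for group_size in (2, 4, 8, 16):
    group_means = [
        statistics.mean(random.random() < P_SUCCESS for _ in range(group_size))
        for _ in range(TRIALS)
    ]
    print(f"group_size={group_size:2d}  baseline_std={statistics.pstdev(group_means):.3f}")

At group size 2 the baseline wobbles by roughly ±0.35 on a 0–1 reward scale, which swamps most genuine advantage signal; by 8–16 it settles under ±0.2.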

Step 6: Launch training and watch the curves

Goal: kick off the job, monitor the right metrics in real time, and abort if things look wrong before you burn through your budget.

Training step, schematically:

One training step — sample, score, advantage, update (figure): a single prompt fans out into N parallel rollouts sampled independently by the policy π; the grader maps each trajectory to a scalar reward in [0, 1]; group-relative advantages A_i = r_i − r̄_group are computed; the policy update maximizes E[log π(a|s) · A_i] − β · KL(π ‖ π_ref), where π_ref is the frozen base-model checkpoint that leashes drift; the loop repeats for the next prompt.
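Stripped of batching and clipping details, the heart of that update is a few lines. A schematic sketch of the math in the diagram, not any framework's internals:

# grpo_step_schematic.py — one GRPO update for a single prompt, schematically
import torch

def grpo_loss(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
              rewards: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """logprobs / ref_logprobs: summed token log-probs of each rollout under the policy
    and the frozen reference, shape (N,). rewards: grader scores, shape (N,). One group = one prompt."""
    advantages = rewards - rewards.mean()                 # A_i = r_i - r̄_group
    pg_loss = -(logprobs * advantages.detach()).mean()    # amplify high-advantage rollouts
    kl = (logprobs - ref_logprobs).mean()                 # simple Monte Carlo estimate of KL(π ‖ π_ref)
    return pg_loss + beta * kl

Production implementations add importance-ratio clipping and token-level accounting; the group-relative baseline and the KL leash are the parts that matter conceptually.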

A minimal launch script using TRL’s GRPO trainer:

# step6_train.py
from trl import GRPOTrainer, GRPOConfig
from peft import LoraConfig
from datasets import load_dataset
from utils import build_grader            # wraps a grader config as a TRL reward function
from step4_grader import GRADER           # the grader config from Step 4

config = GRPOConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    num_generations=8,                  # group size
    beta=0.01,                           # KL coefficient
    max_completion_length=4096,
    bf16=True,
    logging_steps=1,
    output_dir="./runs/research-rft-v1",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[build_grader(GRADER)],   # from Step 4
    train_dataset=load_dataset("json", data_files="data/research_train.jsonl")["train"],
    args=config,
    peft_config=LoraConfig(r=32, lora_alpha=64,
                           target_modules=["q_proj","k_proj","v_proj","o_proj"]),
)
trainer.train()

What healthy training output looks like (first 30 of 150 steps shown):

EXPECTED OUTPUT
step  val_reward  tool_calls  kl_to_ref  loss
   0       0.59        9.1      0.000  -0.18
   2       0.61        8.4      0.003  -0.21
   5       0.64        7.2      0.008  -0.27
  10       0.69        5.8      0.014  -0.31
  15       0.72        4.9      0.019  -0.29
  20       0.74        4.5      0.024  -0.27
  25       0.76        4.2      0.028  -0.26
  30       0.77        4.1      0.031  -0.25  ← phase 1 plateau
 ...
 100       0.81        4.0      0.043  -0.18
 150       0.82        4.0      0.048  -0.14
What good training looks like (figure) — reward up, tool calls down: validation reward rises from 0.59 to 0.82 with rapid early gains, tool calls per rollout drop from about 9 to about 4 over the same steps, and KL to the reference grows slowly. Phase 1 is tool compression; phase 2 is strategy improvement.

Step 7: Evaluate base vs fine-tuned

Goal: compare the fine-tuned checkpoint against the base model on the held-out validation set, per sample. The aggregate number is rarely the most informative view.

# step7_evaluate.py
# run_agent here returns the full sample dict ({"output_text": ..., "n_tool_calls": ...});
# run_grader is the Step 4 grader wrapper.
import json, statistics

def evaluate(model_id, val_path):
    out = []
    for row in (json.loads(l) for l in open(val_path)):
        sample = run_agent(model_id, row["prompt"], row["tools"])
        reward = run_grader(row, sample)
        out.append({"id": row["id"],
                    "reward": reward,
                    "tool_calls": sample["n_tool_calls"]})
    return out

base = evaluate("Qwen/Qwen2.5-7B-Instruct", "data/research_val.jsonl")
tuned = evaluate("./runs/research-rft-v1/checkpoint-150", "data/research_val.jsonl")

# Per-sample diff
diffs = []
for b_, t_ in zip(base, tuned):
    diffs.append({
        "id": b_["id"],
        "d_reward": t_["reward"] - b_["reward"],
        "d_tools":  t_["tool_calls"] - b_["tool_calls"],
    })

print(f"mean Δreward: {statistics.mean(d['d_reward'] for d in diffs):+.3f}")
print(f"mean Δtools:  {statistics.mean(d['d_tools']  for d in diffs):+.2f}")
print(f"improved:    {sum(1 for d in diffs if d['d_reward'] > 0)}/{len(diffs)}")
print(f"regressed:   {sum(1 for d in diffs if d['d_reward'] < 0)}/{len(diffs)}")
print(f"unchanged:   {sum(1 for d in diffs if d['d_reward'] == 0)}/{len(diffs)}")
EXPECTED OUTPUT
mean Δreward: +0.218
mean Δtools:  -4.91
improved:    71/123
regressed:   12/123
unchanged:   40/123

Three things to check before declaring victory. First, the regressed bucket — 12 samples got worse. Is there a pattern (a specific task type, a specific tool, a specific document)? Second, the unchanged bucket — 40 samples didn’t move. Are they all already-solved (good) or all unsolvable (bad)? Third, per-slice reward — slice the diffs by metadata.task_type and metadata.difficulty. Aggregate gains can hide a regression on a slice that matters.
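The per-slice check is a short groupby over the same diffs. A sketch — it assumes val_rows is the parsed validation file and reuses diffs from the script above:

# step7_slices.py — mean Δreward per metadata slice (sketch; reuses `diffs` from step7_evaluate.py)
import json, statistics
from collections import defaultdict

val_rows = [json.loads(l) for l in open("data/research_val.jsonl")]
meta = {r["id"]: r["metadata"] for r in val_rows}

by_slice = defaultdict(list)
for d in diffs:
    m = meta[d["id"]]
    by_slice[(m["task_type"], m["difficulty"])].append(d["d_reward"])

for (task_type, difficulty), deltas in sorted(by_slice.items()):
    print(f"{task_type:14s} {difficulty:7s}  n={len(deltas):3d}  mean Δreward={statistics.mean(deltas):+.3f}")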

Step 8: Inspect failures and reward hacking

Goal: look at the rollouts the model got wrong (or suspiciously right) and decide whether to retrain, fix the grader, or fix the dataset.

# step8_inspect.py
# diffs comes from step7_evaluate.py; val_rows = [json.loads(l) for l in open("data/research_val.jsonl")];
# load_trace(model, row_id) fetches the stored rollout trace for that row.
import json

# Sort regressions worst-first
regressions = sorted([d for d in diffs if d["d_reward"] < 0],
                     key=lambda d: d["d_reward"])

for d in regressions[:5]:
    print(f"\n=== {d['id']}  Δreward={d['d_reward']:+.2f}  Δtools={d['d_tools']:+}")
    row = next(r for r in val_rows if r["id"] == d["id"])
    print(f"PROMPT:    {row['prompt']}")
    print(f"REFERENCE: {row['reference_answer']}")

    trace = load_trace(model="tuned", row_id=d["id"])
    for turn in trace["turns"]:
        if turn["type"] == "tool_call":
            print(f"  → {turn['name']}({turn['args']})")
        elif turn["type"] == "tool_result":
            print(f"  ← {turn['output'][:80]}")
        elif turn["type"] == "message":
            print(f"  [model] {turn['content'][:120]}")
EXPECTED OUTPUT
=== val-0042  Δreward=-0.50  Δtools=-3
PROMPT:    What was Atlas Corp's gross margin in fiscal 2024?
REFERENCE: 28.7%
  → search({"query": "Atlas Corp fiscal 2024 gross margin"})
  ← filings/atlas-10k-2024.html (score 0.91)\n  Gross profit margin: 28.7% ...
  → read({"path": "filings/atlas-10k-2024.html"})
  ← Atlas Corp ... gross profit of $1.43B on revenue of $4.98B ...
  [model] 28.7%

=== val-0117  Δreward=-1.00  Δtools=+2
PROMPT:    What was Henderson's revenue in Q3 2024?
REFERENCE: $842M
  → search({"query": "Henderson Q3 2024"})
  ← filings/henderson-10q-2024-q3.html (score 0.84)\n  Revenue: $842M
  [model] $842 million dollars     ← grader penalized "dollars" suffix

That second case — "$842 million dollars" scored zero — is a grader bug, not a model bug. The rubric lists magnitude suffixes and currency symbols as acceptable, but the spelled-out "dollars" slips through and trips the "only the final numeric answer" clause. Fix the grader, re-evaluate, don't retrain. This is the most common kind of regression: the model is actually right and the grader is wrong.
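Re-evaluating after a grader fix does not require new rollouts if you cached the Step 7 samples. A sketch — run_grader_v2 (the fixed grader wrapper) and the tuned_samples cache are assumptions:

# step8_regrade.py — re-score cached rollouts under the fixed grader; no retraining (sketch)
import statistics

# Assumed: tuned_samples[id] = the sample dict saved during Step 7 evaluation of the tuned model,
#          run_grader_v2 = the wrapper around the fixed grader config.
new_rewards = {row["id"]: run_grader_v2(row, tuned_samples[row["id"]]) for row in val_rows}
old_rewards = {t["id"]: t["reward"] for t in tuned}

changed = [rid for rid, r in new_rewards.items() if r != old_rewards[rid]]
print(f"rows whose reward changed under the fixed grader: {len(changed)}")
print(f"tuned mean reward, regraded: {statistics.mean(new_rewards.values()):.3f}")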

Step 9: Pick a framework and budget your hardware

Goal: decide where to run subsequent iterations. The first run can be on whatever was easiest; iteration speed depends on choosing well.

Open-source RL fine-tuning frameworks at a glance (comparison matrix): TRL, verl, OpenRLHF, Unsloth, and NeMo RL rated on algorithm coverage (PPO, GRPO, DPO, RLOO, KTO, REMAX, SteerLM), native multi-turn agent rollouts, single-GPU viability with LoRA, and scale ceiling.

All five do PPO and GRPO. Differences show up at scale, on multi-turn rollouts, and in single-GPU friendliness:

  • TRL (Hugging Face) — easiest first run, best ecosystem fit. PEFT/LoRA built in. Multi-turn requires custom rollout code.
  • verl (ByteDance) — strongest agent / multi-turn support. Ray + Megatron under the hood; heavier setup.
  • OpenRLHF — production-RLHF on Ray + DeepSpeed; lighter than verl, heavier than TRL.
  • Unsloth — single-GPU efficiency. 7B-class with GRPO on a 24 GB consumer card; LoRA only.
  • NeMo RL (NVIDIA) — production scale at 70B+, NeMo-native, the heaviest setup.

And the GPU memory math you actually have to fit:

GPU memory budget — what fits where (figure): stacked bf16 memory estimates for LoRA (r=32) runs, broken into weights, optimizer state, activations, the frozen reference model, and the rollout buffer. Roughly 49 GB for 7B (fits one H100 80GB), 92 GB for 14B (tight on one card, safe on two), 180 GB for 32B (4× H100 with FSDP), and 190 GB for 70B sharded across 8× H100. Numbers are approximate; full-parameter training is roughly 3–4× larger.

Two memory items teams underestimate. The rollout buffer (N parallel trajectories' worth of activations + KV cache) is its own budget category — for 7B with group size 16 and 4K-token rollouts, that's on the order of 10–20 GB. The frozen reference model used for the KL term doubles weight memory if it lives on the same GPU; offload it to CPU and computing the KL term becomes a transfer bottleneck instead. Most production setups run the reference on a dedicated GPU.
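The arithmetic behind those bars fits in a short script. A back-of-envelope sketch — every constant here is a rough planning number, not a measurement:

# step9_memory_envelope.py — rough bf16 VRAM estimate for a LoRA RFT run (planning numbers only)
def estimate_gb(params_b: float, group_size: int = 16, max_tokens: int = 4096) -> dict:
    kv_per_1k_tokens = 0.15 * params_b / 7         # GB of KV cache per 1K tokens, scaled from a 7B-class model
    parts = {
        "weights":     2.0 * params_b,             # bf16 ≈ 2 bytes per parameter
        "reference":   2.0 * params_b,             # frozen copy for the KL term, if co-located
        "optimizer":   0.2 * params_b,             # LoRA adapters plus their AdamW state (rough)
        "activations": 1.0 * params_b,             # forward/backward activations (batch-dependent)
        "rollouts":    group_size * (max_tokens / 1000) * kv_per_1k_tokens,
    }
    parts["total"] = sum(parts.values())
    return parts

for p in (7, 14):
    est = estimate_gb(p)
    print(f"{p}B LoRA: " + "  ".join(f"{k}={v:.0f}GB" for k, v in est.items()))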

Step 10: Deploy with separate train and prod tool servers

Goal: get the fine-tuned model into production without dragging training-time assumptions along.

Three substitutions to make on the way out:

  • Tool servers split into modes. Training-time servers are bursty, latency-tolerant, single-bearer-auth. Production servers carry user traffic with per-tenant auth and SLOs. Same code, different config — most teams maintain two deployments (a config sketch closes this step).
  • The grader becomes an evaluator. Run the trained-against grader on a slice of live traffic to monitor for drift, regressions, or new failure modes. The grader code outlives the model.
  • Add an observability layer. Training rollouts are excellent retrospective audit logs but say nothing in production. Layer a real agent-observability system (the Job Card pattern) on the deployed model so support analysts and compliance reviewers can audit production rollouts the same way you audited training.

This last point catches teams off guard: a fine-tuned model that scored well on val can still produce surprising production behavior, and your training-time tool calls don’t help debug it. Training-time and production-time observability are two systems, not one. Build both.
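To keep the "same code, different config" promise from the first substitution, an environment-driven settings object is usually enough. A hypothetical sketch — the mode names, the TENANT_TOKENS variable, and the timeouts are all assumptions:

# tool_server_config.py — one tool-server codebase, two deployments (hypothetical sketch)
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolServerConfig:
    mode: str                 # "train" or "prod"
    bearer_tokens: set[str]   # single shared token in training, per-tenant tokens in prod
    request_timeout_s: float  # training rollouts tolerate slow tools; prod has SLOs
    log_traces: bool          # full request/response logging for training-time audits

def load_config() -> ToolServerConfig:
    mode = os.environ.get("TOOL_SERVER_MODE", "train")
    if mode == "train":
        return ToolServerConfig(mode, {os.environ["TOOL_BEARER"]}, 30.0, True)
    return ToolServerConfig(mode, set(os.environ["TENANT_TOKENS"].split(",")), 5.0, False)

The TOOL_BEARER check from Step 3 becomes one entry in the training-mode token set; everything else about the endpoints stays identical.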

Common pitfalls, condensed

  • Reward hacking — the model exploits a grader loophole instead of solving the task; the tell is validation reward climbing past the human ceiling. Catch it by reading traces (Step 8).
  • KL collapse / capability drift — policy walks too far from the reference, breaks general behaviors. Increase β.
  • Group-mean baseline noise — group size below 4 makes advantages too noisy to learn from.
  • Distribution mismatch — train and val don’t reflect production traffic. The reward curve looks great, the deployment disappoints.
  • Tool flakiness corrupts signal — transient failures get blamed on the model. Monitor failure rate; halt training above ~1% (a minimal monitor is sketched after this list).
  • Premature convergence — single-mode policy that does one thing well and refuses to explore alternatives.
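For the tool-flakiness pitfall, a watchdog over the rollout tool-call log is enough to catch it early. A sketch — the log path and its record format are assumptions:

# monitor_tool_failures.py — halt training if the tool-call failure rate crosses ~1% (sketch; log format assumed)
import json

WINDOW = 2000      # most recent tool calls to consider
failures = calls = 0
for line in open("logs/tool_calls.jsonl").readlines()[-WINDOW:]:
    event = json.loads(line)
    calls += 1
    failures += event["status"] != "ok"

rate = failures / max(calls, 1)
print(f"tool failure rate over last {calls} calls: {rate:.2%}")
if rate > 0.01:
    raise SystemExit("tool failure rate above 1% — pause training and fix the tools")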
