Technology · Spring 2026

How to train a custom LLM.

The four-stage pipeline — domain-adaptive continued pre-training, LoRA SFT, online DPO, and rejection-sampled SFT — for turning a strong open base model into a domain-specialized LLM, with the hyperparameters and trade-offs that actually matter.

Gad Benram · May 1, 2026 · 9 min read · 2,039 words · Filed under Technology
Frontispiece · Spring 2026 · TensorOps Blog

Generic APIs answer everyone. Custom models answer you. From scratch is rarely the move; continued pre-training, LoRA, DPO, and rejection sampling are.

Inside this dispatch · 7 sections · 9 minutes
  1. From scratch vs. continued pre-training: when does each make sense?
  2. Domain-adaptive continued pre-training (DACPT)
  3. Supervised fine-tuning with LoRA: teaching task structure
  4. Preference alignment with online DPO
  5. Rejection-sampled SFT for multi-turn coherence
  6. End-to-end pipeline
  7. What you actually own at the end
How to train a custom LLM in 2026, four-stage pipeline (CPT, LoRA SFT, online DPO, rejection-sampled SFT)

The recipe most teams converge on for a domain-specialized LLM is a four-stage pipeline: domain-adaptive continued pre-training (DACPT) on a strong open base, supervised fine-tuning with parameter-efficient adapters (LoRA/QLoRA), preference alignment via direct preference optimization (DPO), and a final rejection-sampled SFT pass on synthetic multi-turn data. Each stage fixes a specific failure mode the previous one leaves behind. Skip a stage and it shows up as a measurable regression on your eval suite.

A recent end-to-end demonstration of this stack appeared in the Talkie project — a 13B model pre-trained from scratch on 260B tokens published before 1931 to produce a “1930-only” assistant. They trained from zero because the goal was the knowledge cutoff and they wanted no leakage of post-1930 priors. For almost every other use case, you want continued pre-training on a 70B-class open base instead. The rest of this article walks the four stages and the hyperparameters that actually move the needle.

This is written for ML engineers and applied researchers shipping the model, not for the budget owner approving it. Where I give numbers — learning rates, LoRA ranks, replay ratios, KL betas — they're starting points calibrated for 7B-to-70B dense decoders, not load-bearing claims. Validate against your own eval harness.

From scratch vs. continued pre-training: when does each make sense?

Train from scratch when one of three conditions holds: (1) you need a hard knowledge cutoff that contamination from a public model would violate; (2) your tokenizer needs vocabulary the base model can't represent — protein sequences, raw genomic data, novel scripts, custom DSLs; (3) you have trillions of in-distribution tokens and the compute to match. Otherwise continued pre-training on a strong open base is strictly Pareto-better — you inherit the language modeling, instruction following, and reasoning that the original lab spent millions of GPU-hours producing, then specialize from there.

Concrete defaults for the base: Llama 3.1 70B, Qwen2.5-72B, or Mistral Large 2 if the license terms work for your deployment. Smaller bases (7B–13B) are the right pick when latency-sensitive serving or single-GPU inference is a hard constraint; larger ones (70B+) win on reasoning-heavy tasks and tolerate weaker SFT data. Don't pick the base on benchmark scores alone — pick on the perplexity it gets on a held-out slice of your corpus before any training.
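A minimal sketch of that comparison, assuming you have already collected per-token negative log-likelihoods (in nats) from each candidate on the same held-out text. The base names and numbers are hypothetical. One caveat the sketch makes explicit: candidates with different tokenizers split the same text into different token counts, so per-token perplexity is not directly comparable across them; normalizing by bytes is one common workaround.

```python
import math

def perplexity(token_nlls):
    """Per-token perplexity: exp of the mean negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def bits_per_byte(token_nlls, text):
    """Total NLL normalized by UTF-8 byte count, so tokenizers don't skew it."""
    n_bytes = len(text.encode("utf-8"))
    return sum(token_nlls) / (n_bytes * math.log(2))

def rank_bases(nlls_by_base, text):
    """Candidate names sorted best (lowest bits per byte) first."""
    scored = {name: bits_per_byte(nlls, text) for name, nlls in nlls_by_base.items()}
    return sorted(scored, key=scored.get)

# Hypothetical held-out snippet and made-up per-token NLLs for two candidates.
sample_text = "Quarterly covenant compliance report"
candidate_nlls = {
    "base_a": [2.1, 1.9, 2.4, 2.0, 2.2],  # 5 tokens under tokenizer A
    "base_b": [2.6, 2.5, 2.9],            # 3 tokens under tokenizer B
}
ranking = rank_bases(candidate_nlls, sample_text)
```

In this toy input, base_a has the lower per-token perplexity but base_b spends fewer total nats on the same bytes, which is why the ranking uses the byte-normalized score.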

The DACPT → SFT → DPO → rejection-sampled SFT pipeline closes most of the gap to a hypothetical from-scratch run on the same data, at one to two orders of magnitude less compute. “Most” can't be pinned down in absolute terms, since no team outside the frontier labs has the budget to actually run the comparison, but the public ablations from the Llama, DeepSeek, and Qwen technical reports all point the same direction.

Fig. 01 · Two paths to a domain-specific LLM: from-scratch pre-training vs. continued pre-training + targeted fine-tuning
  • From scratch (e.g. fixed-cutoff 13B model · 260B tokens): $80M+ total compute & data spend · 9–18 mo to a usable checkpoint · 100% of weights trained
  • Continued pre-train + targeted FT (strong open base · LoRA · DPO): $80–300K production-ready pilot · 6–12 wks to a usable checkpoint · ~1% of weights trained (LoRA)
95% of businesses should not train from scratch · the right base + the right adaptation gets 80–90% of the lift

Domain-adaptive continued pre-training (DACPT)

This is the foundation step most teams under-invest in.

DACPT is the same loss as the original training (causal LM, next-token cross-entropy) run on a mixture of your private corpus and a replay slice of general-purpose data, typically 5–15% of the training mix, to mitigate catastrophic forgetting. Without the replay mix you'll see general-capability regression on benchmarks like MMLU and GSM8K within a few thousand steps. Pack documents to the model's max sequence length (8K–32K depending on base) with document-boundary attention masking so predictions for one document are never conditioned on unrelated text packed before it.
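Packing with boundary masking is easier to see in code than in prose. The sketch below is a toy, framework-free version: the function names, the greedy packing policy, and the EOS id of 0 are my assumptions, and padding of short rows is left out. Real training stacks usually express the same idea through segment ids or variable-length attention kernels.

```python
def pack_documents(docs, max_len, eos_id=0):
    """Greedily pack tokenized docs (each followed by EOS) into rows of at most
    max_len tokens, recording a segment id for every token."""
    rows, row_tokens, row_segments, segment = [], [], [], 0
    for doc in docs:
        tokens = list(doc) + [eos_id]
        for start in range(0, len(tokens), max_len):  # split over-long docs
            chunk = tokens[start:start + max_len]
            if len(row_tokens) + len(chunk) > max_len:
                rows.append((row_tokens, row_segments))  # row is full: flush it
                row_tokens, row_segments = [], []
            row_tokens = row_tokens + chunk
            row_segments = row_segments + [segment] * len(chunk)
            segment += 1
    if row_tokens:
        rows.append((row_tokens, row_segments))
    return rows

def boundary_causal_mask(segments):
    """mask[i][j] is True when token i may attend to token j:
    j is not in the future AND both tokens come from the same document."""
    n = len(segments)
    return [[j <= i and segments[i] == segments[j] for j in range(n)]
            for i in range(n)]

# Three toy "documents" packed into rows of at most 8 tokens.
rows = pack_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]], max_len=8)
mask = boundary_causal_mask(rows[0][1])
```

The first row holds two documents back to back; the mask is block-diagonal within the causal triangle, so the first token of the second document sees only itself.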

Inputs are your private data lake — internal reports, compliance documents, technical specs, earnings transcripts, anonymized customer logs. Optimizer settings that work as a starting point: AdamW, learning rate 1e-5 to 3e-5 (10–100× lower than original pre-training), cosine schedule with a short linear warmup (a few hundred steps), weight decay 0.1, gradient clipping at 1.0. Token budget is the lever that matters most: 5B–50B is the usual range. Below 1B tokens you're effectively running extended SFT — call it that and stop pretending.
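For reference, the schedule described above can be written as a plain function of the step index. This is a sketch under stated assumptions: the peak of 2e-5 and the 300 warmup steps sit inside the ranges quoted above, while the floor of 10% of peak (`min_lr_ratio`) is my own placeholder, not a recommendation from this article.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_steps=300, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to peak_lr * min_lr_ratio."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold the floor if training runs long
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)

# Sample the curve at a few points of a hypothetical 10,000-step run.
schedule = [lr_at(s, total_steps=10_000) for s in (0, 150, 299, 5_000, 10_000)]
```

Logging a handful of sampled values like this before launching is a cheap way to catch a schedule that peaks at the wrong step or decays to zero too early.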

Data filtering is the dominant predictor of how this stage turns out:

  • OCR and layout-aware extraction for scanned PDFs (Marker, Nougat, or commercial equivalents). Quality here is a step function, not a slope — an extra 5% of pages parsed correctly often beats a 2× increase in raw token count.
  • Near-duplicate removal with MinHash/SimHash at the document level, exact dedup at the chunk level. Repeated tokens silently degrade loss and waste budget.
  • PII redaction with a model-based classifier, not regex alone. Regex catches SSNs and emails; it misses paraphrased identifiers and the long tail of internal codes.
  • Quality filtering: train a fastText classifier on a few thousand hand-labeled in-domain vs. out-of-domain documents and discard the bottom quartile. The FineWeb-Edu paper is the public reference for why classifier-based filtering works.
  • Domain mixture weights tuned on a held-out perplexity eval, not assumed. Uniform mixing across sub-domains is rarely optimal; DoReMi is the public reference for learned mixture weights.
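To make the near-duplicate step from the list above concrete, here is a small standard-library MinHash sketch. It is illustrative only: the 5-word shingles, 64 hash functions, and salted BLAKE2b hashes are my choices, and at corpus scale you would pair signatures with locality-sensitive hashing (or use a dedicated library) instead of comparing documents pairwise.

```python
import hashlib

def shingles(text, k=5):
    """Set of overlapping k-word shingles (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_perm=64, k=5):
    """One minimum per salted hash function; each salt stands in for a permutation."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(item, digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical documents: two near-duplicates and one unrelated sentence.
doc_a = ("the borrower shall maintain a minimum liquidity ratio of one point "
         "five at the end of each fiscal quarter")
doc_b = doc_a.replace("quarter", "year")
doc_c = ("support tickets are triaged by severity before assignment to the "
         "on call engineer")

sim_ab = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
sim_ac = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
```

A pair would then be flagged as near-duplicate when the estimate clears a threshold you tune on a labeled sample; the right cutoff depends on how templated your corpus is.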

End state to aim for: in-domain validation perplexity drops 30–60% with ≤2% regression on a general-capability suite. If you see >5% MMLU drop, your replay ratio is too low or your LR is too high. Always run the general-capability eval — DACPT failures usually look fine on in-domain loss.
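One way to keep yourself honest about that end state is to encode it as a gate in the eval harness. The sketch below assumes the 2% regression budget is relative to the pre-DACPT score (it could equally be read as absolute points) and uses made-up benchmark numbers; the function name and thresholds are placeholders to adapt.

```python
def dacpt_gate(ppl_before, ppl_after, general_before, general_after,
               min_ppl_drop=0.30, max_regression=0.02):
    """Check the two DACPT exit criteria: in-domain perplexity fell enough,
    and no general benchmark regressed by more than the allowed fraction."""
    ppl_drop = (ppl_before - ppl_after) / ppl_before
    # Relative regression per benchmark (positive means the score got worse).
    regressions = {
        name: (general_before[name] - general_after[name]) / general_before[name]
        for name in general_before
    }
    worst = max(regressions, key=regressions.get)
    return {
        "ppl_drop": ppl_drop,
        "worst_benchmark": worst,
        "worst_regression": regressions[worst],
        "passed": ppl_drop >= min_ppl_drop and regressions[worst] <= max_regression,
    }

# Made-up numbers: perplexity improves 40%, MMLU slips about 1% relative.
report = dacpt_gate(
    ppl_before=12.0, ppl_after=7.2,
    general_before={"mmlu": 0.70, "gsm8k": 0.60},
    general_after={"mmlu": 0.693, "gsm8k": 0.60},
)
```

Gating on the worst benchmark, not the average, is deliberate: a single large regression is easy to hide inside an aggregate.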

Fig. 02 · The 2026 custom-LLM pipeline: four stages from open base model to defensible domain expert
  • 01 Domain CPT: internal reports, specs, transcripts · teach vocabulary
  • 02 Targeted SFT (LoRA): policy manuals, Q&A logs, templates · teach format
  • 03 Online DPO: stronger model ranks responses · teach the rules
  • 04 Rejection-sampled SFT: keep only the best multi-turn dialogues · teach coherence
Each stage compounds the previous one; skipping stage 03 is the most common reason a custom model still hallucinates

Supervised fine-tuning with LoRA: teaching task structure

DACPT teaches the model what your domain talks about. SFT teaches it how to respond — the input/output structure of the tasks you actually care about.

Build the instruction set from your real artifacts: policy manuals → Q&A pairs, audit responses → cited summaries, support tickets → resolutions, RFPs → drafts. The LIMA result holds in 2026: 1K–10K high-quality, hand-vetted examples typically beat 100K mediocre ones. Synthetic augmentation with a stronger teacher model is fine for scaling, but the seed set must be human-reviewed or it amplifies whatever the teacher's biases are.

  • Internal policy manuals and SOPs reformatted as instruction → answer with citation spans.
  • Compliance playbooks turned into (scenario, required action, justification) triples.
  • Historical Q&A logs from support, sales, and legal — filtered for resolved and correct, not just recent.
  • Structured report and email templates as (brief → output) pairs in the exact schema you'll deploy.

By the end of SFT the model produces outputs in your house format — same field order, same citation style, same hedging register — without prompt engineering at inference. That's the right success criterion, not stylometric vibes.

Use LoRA or QLoRA. Trainable parameters drop to 0.5–1.5% of the model: rank 16–64 on the q/k/v/o projections plus the MLP, alpha = 2×rank as a sane default, dropout 0.05. On a 70B base quantized to 4-bit (QLoRA), an SFT run fits on 2–4×H100, where full-parameter fine-tuning would need 64+. Training: 2–3 epochs over the curated set, LR 1e-4 to 3e-4 (much higher than DACPT, since only the adapter is moving), cosine decay, batch size tuned so each global step sees ≥2K tokens. Merge the adapter back at the end if you want a single deployable checkpoint, or keep it separate to swap behaviors per tenant.
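The trainable-parameter fraction is simple arithmetic: each adapted weight matrix of shape d_in × d_out gains two low-rank factors with rank × (d_in + d_out) parameters in total. The sketch below works that out under assumed layer shapes for a Llama-3.1-70B-style decoder (hidden size 8192, grouped-query K/V width 1024, MLP width 28672, 80 layers, roughly 70.6B total parameters); treat those constants as assumptions and read the real ones from your base model's config.

```python
def lora_params(rank, shapes, num_layers):
    """Total LoRA parameters: each adapted (d_in, d_out) matrix gets
    A (d_in x rank) and B (rank x d_out), i.e. rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed shapes for a 70B-class dense decoder; verify against your config.
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 8192, 1024, 28672, 80
BASE_PARAMS = 70.6e9
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

# Trainable fraction for a few ranks.
fractions = {
    rank: lora_params(rank, SHAPES, LAYERS) / BASE_PARAMS
    for rank in (16, 32, 64)
}
```

Under these assumed shapes, rank 32 comes to about 414M adapter parameters (roughly 0.6% of the base) and rank 64 to roughly 1.2%. Most of that sits in the three MLP projections, so dropping the MLP from the target list shrinks the adapter far more than lowering the rank on attention alone.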

Fig. 03 · Why LoRA changes the economics: a 70B base model where only the low-rank adapters move during fine-tuning
  • Full fine-tuning: all 70B parameters update
  • LoRA / QLoRA: ~0.5–1.5% of parameters update · 90%+ less compute
Same final-task quality · single-GPU runs replace multi-node clusters · checkpoints in megabytes, not terabytes

Preference alignment with online DPO

Following instructions is easy. Following your rules — citation format, refusal policies, escalation thresholds, tone — needs a preference signal.

DPO replaces the PPO loop in classic RLHF with a closed-form contrastive loss on (chosen, rejected) preference pairs. The online variant generates fresh pairs from the current policy each round and judges them with a stronger model (or a panel of them), instead of training once on a static human-labeled set that goes stale the moment the policy moves. The KL coefficient β controls how far the policy can drift from the SFT reference, and a higher β holds it closer: start at 0.1, drop to 0.01–0.05 if the model stays pinned to the reference and is under-responsive to the preferences, raise to 0.3 if it drifts too far from the reference and output quality starts to degrade.
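For a single preference pair, the DPO loss is the negative log-sigmoid of β times the difference between the policy-versus-reference log-ratios of the chosen and rejected responses. The sketch below computes it from four summed sequence log-probabilities; the argument names are mine, and in practice a library such as TRL handles batching, masking, and the reference-model forward pass.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))),
    where each ratio is log pi_policy(y|x) - log pi_ref(y|x)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss falls.
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The first call is a useful sanity check on a real run too: at step zero the policy equals the reference, so the batch loss should start near log 2 ≈ 0.693.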

Why DPO is the current default over PPO:

  • No separate reward model to train, monitor, or have hacked. The reward is implicit in the log-probability ratio between policy and reference.
  • One forward pass through the policy and the frozen reference per pair, gradient straight back into the policy. Roughly 5–10× cheaper wall-clock than equivalent PPO at the same data scale.
  • Stable under standard SGD/AdamW dynamics — no KL blow-ups, no advantage normalization, no value-head debugging.
  • Empirically competitive with PPO on AlpacaEval, MT-Bench, and most internal preference evals. Where PPO still wins is reasoning tasks with verifiable rewards — and there you should be using GRPO or RLVR-style methods, not vanilla PPO.

Legal note: using outputs from a closed model (OpenAI, Anthropic, Google) as training signal almost always violates terms of service and may forfeit your right to deploy commercially. For the judge, use an open-weights model you've licensed — Llama 3.1 70B, Qwen2.5-72B, DeepSeek-V3 — or a stronger checkpoint of your own.

Fig. 04 · Online DPO with a stronger judge: no reward model · no RL loop · just preference pairs from a more capable teacher
  • Your custom LLM (post-CPT + SFT, the student) generates candidate responses A and B
  • Stronger judge (Claude Opus / Sonnet) ranks A vs. B
  • Preference pair → DPO update
Check the teacher model's terms before using its outputs as training signal · open judges (Grok, Llama-3.1) sidestep the licensing question

Rejection-sampled SFT for multi-turn coherence

DPO fixes single-turn alignment. Multi-turn conversations — where the model must remember a constraint set from turn 1 while answering turn 7 — degrade differently and need their own pass.

The technique: generate N synthetic multi-turn dialogues per seed prompt (N = 8–32), score each completed conversation with a judge against your rubric (correctness, citation, refusal calibration, on-policy formatting), and run another SFT pass on the top 20–30%. This is the same family of methods as STaR, ReST, and Constitutional AI's rejection-sampling stage — keep the trajectories that succeeded, discard the rest, fine-tune on the survivors. The cost is dominated by inference for the synthetic generation, which is embarrassingly parallel and runs cheap on spot instances.
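The filtering step itself is a few lines once the judge scores exist. The sketch below assumes scores are already attached to each dialogue and keeps the top fraction per seed prompt, not globally; that per-seed choice is mine, made so that no single easy seed dominates the SFT set. The optional `min_score` floor is also an assumption, there to drop seeds where every sample failed.

```python
def select_top_fraction(scored, keep_fraction=0.25, min_score=None):
    """scored: iterable of (seed_id, dialogue, score) tuples.
    Keep the best keep_fraction of dialogues for each seed (at least one),
    optionally discarding anything below min_score."""
    by_seed = {}
    for seed_id, dialogue, score in scored:
        by_seed.setdefault(seed_id, []).append((score, dialogue))

    kept = []
    for seed_id, candidates in by_seed.items():
        candidates.sort(key=lambda c: c[0], reverse=True)  # best first
        n_keep = max(1, int(len(candidates) * keep_fraction))
        for score, dialogue in candidates[:n_keep]:
            if min_score is None or score >= min_score:
                kept.append((seed_id, dialogue, score))
    return kept

# Hypothetical judge scores: 8 samples for seed "s1", 4 for seed "s2".
scored = [("s1", f"s1-dialogue-{i}", float(i)) for i in range(8)]
scored += [("s2", f"s2-dialogue-{i}", float(i)) for i in range(4)]
sft_set = select_top_fraction(scored, keep_fraction=0.25)
```

With N = 8 and a 25% keep rate this retains two dialogues for the first seed and one for the second; the survivors become the training set for the extra SFT pass.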

The behavior change is concrete and measurable: held-out multi-turn eval scores typically jump 10–25 points on internal rubrics, and the failure mode where the model goes robotic or drops constraints by turn 6 disappears. The risk to watch is mode collapse: if the rubric is too narrow, top-k filtering will train the model on a single response template. Diversify the seed prompts and judge with a panel where you can.

End-to-end pipeline

In the order you'd actually run it:

  1. Pick 3–5 concrete tasks with measurable success criteria. “Better support” doesn't count; “reduce time-to-first-resolution on tier-1 tickets by 30% with ≥95% citation accuracy” does. Build the eval harness first.
  2. Curate the corpus. Targets: 5–50B tokens for DACPT, 1K–10K SFT pairs, 5K–50K preference pairs, 1K–10K seed prompts for rejection sampling.
  3. Pick the base. 70B if you can serve it, 7–13B if you can't. Confirm by running pre-training perplexity on a held-out in-domain slice across a few candidates.
  4. DACPT for 5–50B tokens with a 5–15% general-data replay mix. Validate on held-out perplexity and a general-capability suite (MMLU, GSM8K, ARC) before unfreezing the next stage.
  5. LoRA SFT on the curated instruction set. Rank 32, 2–3 epochs, LR 1e-4 with cosine decay.
  6. Online DPO with an open-weights judge. β = 0.1 to start. 1–3 epochs over 5–50K preference pairs.
  7. Rejection-sampled SFT on multi-turn synthetics. Keep the top 20–30% of trajectories.
  8. Eval suite end-to-end: in-domain QA accuracy, citation precision/recall, refusal calibration, multi-turn coherence, latency under realistic load. Run before and after every stage to catch regressions early.
  9. Serve behind your own inference stack — vLLM, TGI, or SGLang — on-prem or in a private VPC. Quantize to INT8 or AWQ if latency demands it; benchmark before, not after.

Fig. 05 · Hyperparameters at a glance: starting points per stage for a 70B-class open base · validate against your eval harness
  • 01 DACPT (domain language): Data 5–50B tokens · Replay 5–15% general · Trainable 100% of weights · LR 1–3e-5, cosine · Watch MMLU Δ ≤ 2%
  • 02 LoRA SFT (task structure): Data 1K–10K pairs · Adapter rank 32, α 64 · Trainable 0.5–1.5% · LR 1–3e-4, cosine · Watch format adherence
  • 03 Online DPO (rule alignment): Data 5K–50K pairs · Judge open-weights 70B · β (KL) 0.1, tune within 0.01–0.3 · Epochs 1–3 · Watch reference collapse
  • 04 RS-SFT (multi-turn coherence): Seeds 1K–10K prompts · Samples N = 8–32 per seed · Keep top 20–30% · Pass 1 SFT epoch · Watch mode collapse
Wall-clock 6–12 wks for 2–4 engineers · GPU spend in the low-to-mid five figures · data curation dominates both

Wall-clock for the full pipeline on a 70B base is typically 6–12 weeks for a team of 2–4 engineers, dominated by data curation rather than GPU time. Compute on rented H100s for the model side runs in the low-to-mid five figures end-to-end; data engineering and eval-harness work are the cost centers most pilots underestimate by an order of magnitude.

What you actually own at the end

Done well, the resulting checkpoint is a defensible artifact: the weights encode your domain, the eval suite proves it, and neither can be reproduced from an API call. The failure mode to watch is the opposite — a checkpoint that's fine-tuned but not evaluated, where regressions on rare-but-important behaviors hide behind aggregate accuracy numbers.

Practical wins, in roughly the order they show up: outputs match your house format without prompt scaffolding; private terminology is handled correctly; refusal and citation policies are baked into weights instead of fragile system prompts; latency and cost are decoupled from a vendor's pricing curve and roadmap.

Build it incrementally. Land DACPT and SFT first against a real eval suite, then add DPO once you have preference data, then add rejection-sampled SFT for multi-turn polish. Each stage is independently shippable and independently measurable — and that's the only way you'll know which one is paying for itself.
