The four-stage pipeline — domain-adaptive continued pre-training, LoRA SFT, online DPO, and rejection-sampled SFT — for turning a strong open base model into a domain-specialized LLM, with the hyperparameters and trade-offs that actually matter.
Generic APIs answer everyone. Custom models answer you. From scratch is rarely the move; continued pre-training, LoRA, DPO, and rejection sampling are.

The recipe most teams converge on for a domain-specialized LLM is a four-stage pipeline: domain-adaptive continued pre-training (DACPT) on a strong open base, supervised fine-tuning with parameter-efficient adapters (LoRA/QLoRA), preference alignment via direct preference optimization (DPO), and a final rejection-sampled SFT pass on synthetic multi-turn data. Each stage fixes a specific failure mode the previous one leaves behind. Skip a stage and it shows up as a measurable regression on your eval suite.
A recent end-to-end demonstration of this stack appeared in the Talkie project: a 13B model pre-trained from scratch on 260B tokens published before 1931 to produce a “1930-only” assistant. Its authors trained from zero because the knowledge cutoff was the goal and they wanted no leakage of post-1930 priors. For almost every other use case, you want continued pre-training on a 70B-class open base instead. The rest of this article walks through the four stages and the hyperparameters that actually move the needle.
This is written for ML engineers and applied researchers shipping the model, not for the budget owner approving it. Where I give numbers — learning rates, LoRA ranks, replay ratios, KL betas — they're starting points calibrated for 7B-to-70B dense decoders, not load-bearing claims. Validate against your own eval harness.
Train from scratch when one of three conditions holds: (1) you need a hard knowledge cutoff that contamination from a public model would violate; (2) your tokenizer needs vocabulary the base model can't represent — protein sequences, raw genomic data, novel scripts, custom DSLs; (3) you have trillions of in-distribution tokens and the compute to match. Otherwise continued pre-training on a strong open base is strictly Pareto-better — you inherit the language modeling, instruction following, and reasoning that the original lab spent millions of GPU-hours producing, then specialize from there.
Concrete defaults for the base: Llama 3.1 70B, Qwen2.5-72B, or Mistral Large 2 if the license terms work for your deployment. Smaller bases (7B–13B) are the right pick when latency-sensitive serving or single-GPU inference is a hard constraint; larger ones (70B+) win on reasoning-heavy tasks and tolerate weaker SFT data. Don't pick the base on benchmark scores alone — pick on the perplexity it gets on a held-out slice of your corpus before any training.
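A minimal sketch of the "pick on held-out perplexity" comparison, assuming you have already collected per-token natural-log probabilities from each candidate base on the same held-out text. Per-token perplexity is not directly comparable across models with different tokenizers, so the sketch ranks by bits per byte of the underlying text instead. The function names and the input layout are illustrative assumptions, not part of any library.

```python
import math

def perplexity(token_logprobs):
    """Per-token perplexity: exp of the mean negative log-probability (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_byte(token_logprobs, num_bytes):
    """Total negative log-likelihood in bits, divided by the UTF-8 size of the text.

    Normalizing by bytes rather than tokens makes models with different
    tokenizers comparable on the same held-out slice.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return -sum(token_logprobs) / (num_bytes * math.log(2))

def rank_bases(results):
    """results maps a base-model name to (token_logprobs, num_bytes).

    Returns (name, bits_per_byte) pairs sorted from best (lowest) to worst.
    """
    scored = [(name, bits_per_byte(lps, nbytes)) for name, (lps, nbytes) in results.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

A base with a finer-grained tokenizer can show lower per-token perplexity and still compress the same text worse, which is why the ranking uses the byte-normalized figure.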
The DACPT → SFT → DPO → rejection-sampled SFT pipeline closes most of the gap to a hypothetical from-scratch run on the same data, at one to two orders of magnitude less compute. “Most” is unfalsifiable in absolute terms — no team outside the frontier labs has the budget to actually run the comparison — but the public ablations from the Llama, DeepSeek, and Qwen technical reports all point the same direction.
This is the foundation step most teams under-invest in.
DACPT uses the same loss as the original training (causal LM, next-token cross-entropy), run on a mixture of your private corpus and a replay slice of general data, typically 5–15% of the training mix, to mitigate catastrophic forgetting. Without the replay mix you'll see general-capability regression on benchmarks like MMLU and GSM8K within a few thousand steps. Pack documents to the model's max sequence length (8K–32K depending on base) with document-boundary attention masking so tokens don't attend across unrelated documents packed into the same sequence.
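The packing step can be sketched in plain Python, assuming documents are already tokenized into lists of integer ids. `pack_documents` and `boundary_mask` are hypothetical helper names; a real training stack would usually express the same mask through its attention kernel's variable-length or block-diagonal interface rather than a dense boolean matrix.

```python
def pack_documents(docs, seq_len):
    """Greedily concatenate tokenized documents into sequences of at most seq_len tokens.

    Returns a list of (tokens, doc_ids) pairs. doc_ids[i] records which source
    document tokens[i] came from, which is all the mask needs.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    sequences = []
    tokens, doc_ids = [], []
    for doc_index, doc in enumerate(docs):
        pos = 0
        while pos < len(doc):
            room = seq_len - len(tokens)
            chunk = doc[pos:pos + room]
            tokens.extend(chunk)
            doc_ids.extend([doc_index] * len(chunk))
            pos += len(chunk)
            if len(tokens) == seq_len:
                sequences.append((tokens, doc_ids))
                tokens, doc_ids = [], []
    if tokens:
        # Final partial sequence; padding is left to the data loader.
        sequences.append((tokens, doc_ids))
    return sequences

def boundary_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j.

    Combines the causal constraint (j <= i) with the document boundary
    (both positions belong to the same source document).
    """
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]
```

For example, packing three short documents into length-4 sequences puts the first token of the second document at the end of the first sequence, and the mask row for that position allows attention only to itself.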
Inputs are your private data lake — internal reports, compliance documents, technical specs, earnings transcripts, anonymized customer logs. Optimizer settings that work as a starting point: AdamW, learning rate 1e-5 to 3e-5 (10–100× lower than original pre-training), cosine schedule with a short linear warmup (a few hundred steps), weight decay 0.1, gradient clipping at 1.0. Token budget is the lever that matters most: 5B–50B is the usual range. Below 1B tokens you're effectively running extended SFT — call it that and stop pretending.
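The schedule described above (short linear warmup, then cosine decay) can be written as a small pure function. This is a sketch under stated assumptions: the peak rate of 2e-5 and the 300 warmup steps are picked from the ranges in the text, and the `min_lr_ratio` floor is an added assumption that the text does not specify.

```python
import math

# Starting-point optimizer settings from the text, kept as plain data.
DACPT_OPTIM = {"optimizer": "adamw", "weight_decay": 0.1, "grad_clip_norm": 1.0}

def lr_at(step, total_steps, peak_lr=2e-5, warmup_steps=300, min_lr_ratio=0.1):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay.

    min_lr_ratio is an assumption: the schedule decays to peak_lr * min_lr_ratio
    rather than to zero.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```

Calling `lr_at(step, total_steps)` once per optimizer step and writing the result into the optimizer's parameter groups is enough to reproduce the schedule in most training loops.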
Data filtering is the dominant predictor of how this stage turns out:
End state to aim for: in-domain validation perplexity drops 30–60% with ≤2% regression on a general-capability suite. If you see >5% MMLU drop, your replay ratio is too low or your LR is too high. Always run the general-capability eval — DACPT failures usually look fine on in-domain loss.
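One way to make that end state a hard gate in CI is sketched below. It assumes the percentages in the text are relative changes (not absolute points), that perplexity is lower-is-better, and that the general-capability score is higher-is-better; the function and key names are illustrative.

```python
def relative_change(before, after):
    """Signed relative change from before to after."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before

def dacpt_gate(ppl_before, ppl_after, general_before, general_after,
               min_ppl_drop=0.30, max_regression=0.02):
    """Pass only if in-domain perplexity dropped enough AND general capability held.

    Thresholds default to the targets quoted in the text, read as relative changes.
    """
    ppl_drop = -relative_change(ppl_before, ppl_after)
    regression = -relative_change(general_before, general_after)
    return {
        "ppl_drop": ppl_drop,
        "general_regression": regression,
        "passed": ppl_drop >= min_ppl_drop and regression <= max_regression,
    }
```

Checking both conditions in one place guards against the failure described above, where in-domain loss looks healthy while general capability has quietly regressed.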
DACPT teaches the model what your domain talks about. SFT teaches it how to respond — the input/output structure of the tasks you actually care about.
Build the instruction set from your real artifacts: policy manuals → Q&A pairs, audit responses → cited summaries, support tickets → resolutions, RFPs → drafts. The LIMA result holds in 2026: 1K–10K high-quality, hand-vetted examples typically beat 100K mediocre ones. Synthetic augmentation with a stronger teacher model is fine for scaling, but the seed set must be human-reviewed or it amplifies whatever the teacher's biases are.
By the end of SFT the model produces outputs in your house format — same field order, same citation style, same hedging register — without prompt engineering at inference. That's the right success criterion, not stylometric vibes.
Use LoRA or QLoRA. Trainable parameters drop to roughly 0.5–1.5% of the model: rank 16–64 on the q/k/v/o projections plus the MLP, alpha = 2×rank as a sane default, dropout 0.05. With a 70B base quantized to 4-bit (QLoRA), the whole SFT stage runs on 2–4×H100 instead of the 64+ that full-parameter fine-tuning needs. Training: 2–3 epochs over the curated set, LR 1e-4 to 3e-4 (much higher than DACPT, since only the adapter is moving), cosine decay, batch size tuned so each global step sees ≥2K tokens. Merge the adapter back at the end if you want a single deployable checkpoint, or keep it separate to swap behaviors per tenant.
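The trainable-parameter figure can be sanity-checked with simple arithmetic: LoRA adds two low-rank matrices per adapted linear layer, for rank × (d_in + d_out) extra parameters. The sketch below assumes layer shapes in the style of a Llama-3.1-70B decoder (hidden size 8192, MLP width 28672, grouped-query key/value width 1024, 80 layers); swap in your own base's dimensions.

```python
def lora_params(rank, shapes, num_layers):
    """Count LoRA parameters: A is (rank x d_in), B is (d_out x rank) per adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed dimensions for a 70B-class dense decoder; not taken from the article.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 8192, 28672, 1024, 80

SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def trainable_fraction(rank, base_params=70e9):
    """Adapter parameters as a fraction of the (approximate) base parameter count."""
    return lora_params(rank, SHAPES, LAYERS) / base_params
```

Under these assumed dimensions, rank 16 comes to about 0.2B adapter parameters and rank 64 to about 0.8B, roughly 0.3% and 1.2% of a 70B base, which is the same ballpark as the range quoted above.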
Following instructions is easy. Following your rules — citation format, refusal policies, escalation thresholds, tone — needs a preference signal.
DPO replaces the PPO loop in classic RLHF with a closed-form contrastive loss on (chosen, rejected) preference pairs. The online variant generates fresh pairs from the current policy each round and judges them with a stronger model (or a panel of them), instead of training once on a static human-labeled set that goes stale the moment the policy moves. The KL coefficient β controls how far the policy can drift from the SFT reference, and a larger β holds it closer: start at 0.1, drop to 0.01–0.05 if the model is under-responsive to the preferences, raise to 0.3 if it drifts too far from the reference and output quality degrades (length blow-up, repetition, loss of fluency).
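For a single preference pair, the DPO objective is the negative log-sigmoid of β times the difference between the policy's and the reference's log-probability margins. The sketch below assumes the four inputs are summed sequence log-probabilities of the chosen and rejected responses under the current policy and the frozen SFT reference; the function name is illustrative, and a real trainer would compute this batched on tensors.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from summed sequence log-probs.

    margin > 0 means the policy prefers the chosen response more strongly
    than the reference does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form is numerically stable.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls toward zero as the policy widens the gap between chosen and rejected relative to the reference, with β scaling how much each unit of margin counts.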
Why DPO is the current default over PPO:
Legal note: using outputs from a closed model (OpenAI, Anthropic, Google) as training signal is typically prohibited by the provider's terms of service and may forfeit your right to deploy commercially. For the judge, use an open-weights model whose license permits this use (Llama 3.1 70B, Qwen2.5-72B, DeepSeek-V3, subject to their current terms) or a stronger checkpoint of your own.
DPO fixes single-turn alignment. Multi-turn conversations — where the model must remember a constraint set from turn 1 while answering turn 7 — degrade differently and need their own pass.
The technique: generate N synthetic multi-turn dialogues per seed prompt (N = 8–32), score each completed conversation with a judge against your rubric (correctness, citation, refusal calibration, on-policy formatting), and run another SFT pass on the top 20–30%. This is the same family of methods as STaR, ReST, and Constitutional AI's rejection-sampling stage — keep the trajectories that succeeded, discard the rest, fine-tune on the survivors. The cost is dominated by inference for the synthetic generation, which is embarrassingly parallel and runs cheap on spot instances.
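The generate, score, filter loop described above can be sketched as follows. `generate` and `judge` are hypothetical callables standing in for your sampling endpoint and your rubric-scoring judge, and the sketch assumes the top fraction is taken per seed prompt rather than globally, which the text leaves open.

```python
def rejection_sample(seed_prompts, generate, judge, n=8, keep_fraction=0.25):
    """Sample n dialogues per seed prompt, score each, keep the top fraction.

    generate(prompt) -> dialogue and judge(prompt, dialogue) -> numeric score
    are supplied by the caller. Returns records ready for the follow-up SFT pass.
    """
    if n < 1 or not 0.0 < keep_fraction <= 1.0:
        raise ValueError("need n >= 1 and 0 < keep_fraction <= 1")
    keep = max(1, round(n * keep_fraction))
    kept = []
    for prompt in seed_prompts:
        candidates = [generate(prompt) for _ in range(n)]
        scored = [(judge(prompt, dialogue), dialogue) for dialogue in candidates]
        # Highest rubric score first; ties keep generation order (sort is stable).
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for score, dialogue in scored[:keep]:
            kept.append({"prompt": prompt, "dialogue": dialogue, "score": score})
    return kept
```

Per-prompt selection keeps every seed represented in the SFT set; adding an absolute score floor on top would let prompts where all samples fail contribute nothing, at the cost of a smaller dataset.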
The behavior change is concrete and measurable: hold-out multi-turn eval scores typically jump 10–25 points on internal rubrics, and the failure mode where the model goes robotic or drops constraints by turn 6 disappears. The risk to watch is mode collapse — if the rubric is too narrow, top-k filtering will train the model on a single response template. Diversify the seed prompts and judge with a panel where you can.
In the order you'd actually run it:
Wall-clock for the full pipeline on a 70B base is typically 6–12 weeks for a team of 2–4 engineers, dominated by data curation rather than GPU time. Compute on rented H100s for the model side runs in the low-to-mid five figures end-to-end; data engineering and eval-harness work are the cost centers most pilots underestimate by an order of magnitude.
Done well, the resulting checkpoint is a defensible artifact: the weights encode your domain, the eval suite proves it, and neither can be reproduced from an API call. The failure mode to watch is the opposite — a checkpoint that's fine-tuned but not evaluated, where regressions on rare-but-important behaviors hide behind aggregate accuracy numbers.
Practical wins, in roughly the order they show up: outputs match your house format without prompt scaffolding; private terminology is handled correctly; refusal and citation policies are baked into weights instead of fragile system prompts; latency and cost are decoupled from a vendor's pricing curve and roadmap.
Build it incrementally. Land DACPT and SFT first against a real eval suite, then add DPO once you have preference data, then add rejection-sampled SFT for multi-turn polish. Each stage is independently shippable and independently measurable — and that's the only way you'll know which one is paying for itself.