Domain‑tuned models on your data — three pieces, not one. An adapter that learns the domain cheaply, an evaluation harness that gates every release, and a retraining loop that catches drift before users do.
Open with the thesis: a fine-tuned model isn't a checkpoint, it's a small engineered system. Three pieces — adapter (LoRA/QLoRA, parameter-efficient deltas on the base), evaluation harness (a held-out suite plus CI gates that block regressions), and retraining loop (scheduled cadence triggered by drift signals). Without all three you ship a one-off that decays the moment the world moves on. Audience is AI pros — frame this as the difference between "we trained a model" and "we operate a model."
LoRA / QLoRA fits a small ΔW on top of a frozen base. Cheap, fast, easy to swap or roll back.
A held-out eval suite + regression gates wired into CI. No green, no ship.
Retrains fire on a set cadence and on drift signals. The model gets refreshed, not heroically rescued.
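The adapter idea above reduces to simple linear algebra: the frozen base weight W never changes, and training only fits a low-rank delta. A minimal numpy sketch, with illustrative shapes and a hypothetical `forward` helper (zero-initialising B so the adapter starts as a no-op, as in the original LoRA recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 64, 64, 4           # r << d is the low-rank bottleneck
W = rng.normal(size=(d_out, d_in))   # frozen base weight, never updated

# LoRA trains only A and B; effective weight is W + (alpha / r) * B @ A
A = rng.normal(scale=0.01, size=(r, d_in))  # trained
B = np.zeros((d_out, r))                    # trained, zero-init so dW starts at 0
alpha = 8.0

def forward(x, enable_adapter=True):
    """Base forward pass plus an optional low-rank delta."""
    y = W @ x
    if enable_adapter:
        y = y + (alpha / r) * (B @ (A @ x))
    return y

x = rng.normal(size=d_in)
# With B zero-initialised, the adapter contributes nothing at step 0:
assert np.allclose(forward(x, True), forward(x, False))
```

Rollback is just dropping A and B; per layer you store r·(d_in + d_out) adapter parameters instead of a full d_out·d_in weight, which is what makes swap and rollback cheap.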
The pipeline that makes a fine-tune durable. Curate labelled, versioned data with a real holdout. Train a parameter-efficient adapter (LoRA/QLoRA) — you're learning a small ΔW on top of a frozen base, which is what makes swap/rollback cheap. Eval against the harness — domain tasks, regression suite, behavioural probes. Ship with a canary or shadow before 100%. Monitor drift, quality and cost in production — and crucially the arrow goes back to curate. Without that return arrow you've built a deploy, not a system. The model is replaceable; the loop is the asset.
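One trip around that loop can be sketched as plain orchestration code. All stage names and stub return values here are hypothetical, chosen only to show the control flow and the return arrow:

```python
def curate():
    # versioned, labelled data with a real holdout (illustrative stub)
    return {"train": ["..."], "holdout": ["..."]}

def train(data):
    # fit a parameter-efficient adapter on a frozen base (stub)
    return "adapter-v2"

def evaluate(adapter):
    # harness verdict: domain tasks, frozen regression suite, probes (stub scores)
    return {"task_acc": 0.88, "regressions": 0}

def canary(adapter):
    # shadow or partial traffic before 100% (stub: passes here)
    return True

def pipeline_pass(drift_seen: bool):
    """One trip around the loop: (adapter, next step), or None on stop-ship."""
    data = curate()
    adapter = train(data)
    report = evaluate(adapter)
    if report["regressions"] > 0 or report["task_acc"] < 0.70:
        return None                       # red harness is a stop-ship signal
    if not canary(adapter):
        return None
    # the return arrow: drift in production queues the next curate() pass
    return adapter, ("curate" if drift_seen else "monitor")
```

The point of writing it this way is that the function, not the adapter string it returns, is the durable artifact: rerunning `pipeline_pass` reproduces a fresh release on demand.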
Suite versioned in the repo. Every PR runs it. No green, no merge.
Task accuracy · regression suite (frozen) · behavioural probes (safety, refusal, format).
A pretty leaderboard nobody enforces. If it doesn't block a release, it's vibes.
Reproducible · versioned · gated. A red harness is a stop‑ship signal — and that's the point.
The eval harness is what turns "we trained a model" into something defensible. Numbers shown are illustrative — typical lift you see when you actually fine-tune on the right domain data: support classification ~58 → 91; ticket triage ~54 → 88; clinical NER ~49 → 84; code review ~61 → 79; contract Q&A ~47 → 82. Three layers: task accuracy on the domain tasks you actually care about; a regression suite (frozen forever — catches "we improved X but broke Y"); and behavioural probes for safety, refusal patterns, output format. Wire it into CI: PR opens, harness runs, gate at 70% (or whatever your domain demands) — below the line, the merge button stays disabled. A pretty leaderboard nobody enforces is just vibes.
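The gate itself can be a few lines of script that CI runs on every PR. A minimal sketch; the 70% gate, the baseline scores, and the 0.02 regression tolerance are all illustrative values to tune per domain:

```python
import json
import sys

GATE = 0.70   # absolute floor; below the line, the merge stays blocked
BASELINE = {"support_cls": 0.91, "ticket_triage": 0.88}  # frozen reference scores

def gate(results: dict) -> bool:
    """Return True only if every task clears the gate and its frozen baseline."""
    ok = True
    for task, score in results.items():
        if score < GATE:
            print(f"FAIL {task}: {score:.2f} < gate {GATE:.2f}")
            ok = False
        elif score < BASELINE[task] - 0.02:  # small tolerance for eval noise
            print(f"REGRESSION {task}: {score:.2f} < baseline {BASELINE[task]:.2f}")
            ok = False
    return ok

if __name__ == "__main__" and len(sys.argv) > 1:
    # CI wiring: read the harness output, exit nonzero to block the merge
    results = json.load(open(sys.argv[1]))
    sys.exit(0 if gate(results) else 1)
```

The nonzero exit code is the whole mechanism: mark the job as a required check and "no green, no merge" stops being a slogan and becomes branch protection.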
Distribution shift in inputs · new product surfaces · new labels · seasonality. The model didn't change — the world did.
By cadence (e.g. every 6 weeks) and by signal — a drop in eval score, a rise in disagreement, or a labelled feedback delta.
Not a model — a repo that produces a fresh model on demand, with the harness as its referee.
Why "a one-off checkpoint that drifts the week after it ships" is the failure mode worth naming. The faint dashed line is the one-off — same model in production for 24 weeks; accuracy decays from 90% at launch to ~66% by week 24 — and crucially, it crosses the regression gate around week 14. From that point on, production is silently below your stop-ship threshold. The accent line is the loop: scheduled retrains at week 6, 12, 18 (vertical accent markers, "v2/v3/v4" releases) hold the model around 87%. Drift is real — distribution shift, new surfaces, new labels, seasonality. Trigger retrain by cadence and by signal (eval drop, disagreement rate, labelled feedback delta). The deliverable isn't a model — it's a repo that produces a fresh model on demand, with the harness as its referee. That's what we mean by fine-tuned models, engineered.
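The cadence-or-signal trigger is small enough to write down. A sketch with assumed thresholds (the six-week cadence matches the slide; the eval-drop and disagreement cutoffs are placeholders to calibrate against your own drift history):

```python
from datetime import date, timedelta

CADENCE = timedelta(weeks=6)   # scheduled retrain interval
EVAL_DROP = 0.03               # absolute drop vs launch eval that forces a retrain
DISAGREE_RATE = 0.15           # model-vs-label disagreement on fresh feedback

def should_retrain(last_retrain: date, today: date,
                   launch_eval: float, current_eval: float,
                   disagreement: float) -> bool:
    """Retrain by cadence OR by signal, whichever fires first."""
    by_cadence = today - last_retrain >= CADENCE
    by_signal = (launch_eval - current_eval >= EVAL_DROP
                 or disagreement >= DISAGREE_RATE)
    return by_cadence or by_signal

# cadence fires even when metrics still look healthy
assert should_retrain(date(2024, 1, 1), date(2024, 3, 1), 0.90, 0.90, 0.05)
# signal fires mid-cadence when the eval decays
assert should_retrain(date(2024, 1, 1), date(2024, 1, 15), 0.90, 0.84, 0.05)
```

Running this check on a schedule is what turns the week-14 silent crossing on the chart into an explicit retrain ticket at week 6 or the moment the signal trips, whichever comes first.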