Fine-tuned models · the engineering view

Fine-tuned models, engineered to keep working.

Domain‑tuned models on your data — three pieces, not one. An adapter that learns the domain cheaply, an evaluation harness that gates every release, and a retraining loop that catches drift before users do.

LoRA / QLoRA adapters · held-out eval suite · regression gates · scheduled retraining
Not a one‑off checkpoint that drifts the week after it ships.

Open with the thesis: a fine-tuned model isn't a checkpoint, it's a small engineered system. Three pieces — adapter (LoRA/QLoRA, parameter-efficient deltas on the base), evaluation harness (a held-out suite plus CI gates that block regressions), and retraining loop (scheduled cadence triggered by drift signals). Without all three you ship a one-off that decays the moment the world moves on. Audience is AI pros — frame this as the difference between "we trained a model" and "we operate a model."

02 — The system around the model

The model is replaceable. The loop around it is the asset.

CURATE → TRAIN → EVAL → SHIP → MONITOR  ·  a continuous, closed loop
① CURATE: labels · versions · holdout
② TRAIN ADAPTER: LoRA · QLoRA · small ΔW
③ EVAL: harness · gates · CI
④ SHIP: canary · shadow · 100%
⑤ MONITOR: drift · quality · cost
THE LOOP: an asset, not an event
The model is the centre. The structure around it — curate · train · eval · ship · monitor — is what stays.
Adapter, not full retrain

LoRA / QLoRA fits a small ΔW on top of a frozen base. Cheap, fast, easy to swap or roll back.

 Harness gates every release

A held-out eval suite + regression gates wired into CI. No green, no ship.

 Retrain on a schedule

Drift signals trigger it; the cadence is set. The model gets refreshed — not heroically rescued.


The pipeline that makes a fine-tune durable. Curate labelled, versioned data with a real holdout. Train a parameter-efficient adapter (LoRA/QLoRA) — you're learning a small ΔW on top of a frozen base, which is what makes swap/rollback cheap. Eval against the harness — domain tasks, regression suite, behavioural probes. Ship with a canary or shadow before 100%. Monitor drift, quality and cost in production — and crucially the arrow goes back to curate. Without that return arrow you've built a deploy, not a system. The model is replaceable; the loop is the asset.
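To make the adapter step concrete: a minimal sketch using the Hugging Face peft library, where the model name, rank and target modules are illustrative choices, not recommendations. LoRA's trick is that the learned update is a low-rank product, ΔW = B·A with B of shape d×r and A of shape r×k for small r; QLoRA is the same idea with the frozen base loaded in 4-bit.

```python
# Minimal LoRA adapter sketch (peft). Model name and hyperparameters
# are illustrative; pick them for your base model and domain.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # stays frozen

config = LoraConfig(
    r=16,                                 # rank of the update ΔW = B @ A
    lora_alpha=32,                        # scaling on the update
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)  # only the small A and B matrices train
model.print_trainable_parameters()    # typically well under 1% of the base

# After training, save just the adapter, not the base:
#   model.save_pretrained("adapters/support-v17")
# Rollback is then a file swap; the base weights never change.
```

Because the base never changes, swapping or rolling back an adapter is an artifact operation rather than a retrain, which is what keeps the ship/monitor/curate loop cheap to run.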

03 — The evaluation harness

The harness decides what ships.

DOMAIN ACCURACY  ·  base vs. fine-tuned, with the regression gate on top
[Bar chart] Base model (zero-shot) vs. fine-tuned (LoRA adapter), regression gate at 70%:
support classification 58 → 91 · ticket triage 54 → 88 · clinical NER 49 → 84 · code review 61 → 79 · contract Q&A 47 → 82
held-out suite · v17 · n=2,400
Same prompt, same holdout, every release. The harness isn't an A/B post-mortem — it's the gate before ship.
Build it like CI

Suite versioned in the repo. Every PR runs it. No green, no merge.

Three layers

Task accuracy · regression suite (frozen) · behavioural probes (safety, refusal, format).

The trap

A pretty leaderboard nobody enforces. If it doesn't block a release, it's vibes.

What good looks like

Reproducible · versioned · gated. A red harness is a stop‑ship signal — and that's the point.


The eval harness is what turns "we trained a model" into something defensible. Numbers shown are illustrative — typical lift you see when you actually fine-tune on the right domain data: support classification ~58 → 91; ticket triage ~54 → 88; clinical NER ~49 → 84; code review ~61 → 79; contract Q&A ~47 → 82. Three layers: task accuracy on the domain tasks you actually care about; a regression suite (frozen forever — catches "we improved X but broke Y"); and behavioural probes for safety, refusal patterns, output format. Wire it into CI: PR opens, harness runs, gate at 70% (or whatever your domain demands) — below the line, the merge button stays disabled. A pretty leaderboard nobody enforces is just vibes.
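What "wire it into CI" looks like as a plain-Python sketch. The file paths, the predict() hook and the JSONL schema are assumptions for illustration, and exact-match accuracy stands in for whatever metric each layer really needs; the load-bearing part is the exit code, because that is what the CI gate enforces.

```python
# Hypothetical harness-as-gate sketch: three suites, one threshold,
# non-zero exit on failure so CI blocks the merge.
import json
import sys

GATE = 0.70  # regression gate; tune to what the domain demands

def predict(prompt: str) -> str:
    raise NotImplementedError("wire this to the candidate model under test")

def score(suite_path: str) -> float:
    """Exact-match accuracy over a frozen, versioned JSONL holdout."""
    with open(suite_path) as f:
        examples = [json.loads(line) for line in f]
    hits = sum(predict(ex["input"]) == ex["expected"] for ex in examples)
    return hits / len(examples)

if __name__ == "__main__":
    suites = {
        "task accuracy":       "eval/holdout_v17.jsonl",
        "regression (frozen)": "eval/regression.jsonl",
        "behavioural probes":  "eval/probes.jsonl",
    }
    results = {name: score(path) for name, path in suites.items()}
    for name, s in results.items():
        print(f"{name}: {s:.1%} {'PASS' if s >= GATE else 'FAIL'}")
    sys.exit(0 if all(s >= GATE for s in results.values()) else 1)  # no green, no merge
```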

04 — Drift & the retraining cadence

A one-off checkpoint decays. A loop holds the line.

PRODUCTION ACCURACY  vs.  RETRAINING CADENCE  ·  24 weeks post-launch
[Line chart] Launch at 90%. The one-off checkpoint decays to ≈66% and falls below the regression gate (the stop-ship line) around week 14; the loop, retrained every 6 weeks (v2 · v3 · v4), holds ≈87% across 24 weeks in production.
Real systems don't plateau — usage changes, schemas change, language changes. Retrain on cadence, or watch quality fall through the gate.
What drift looks like

Distribution shift in inputs · new product surfaces · new labels · seasonality. The model didn't change — the world did.

Trigger the retrain

By cadence (e.g. every 6 weeks) and by signal — a drop in eval score, a rise in disagreement, or a labelled feedback delta.

The deliverable

Not a model — a repo that produces a fresh model on demand, with the harness as its referee.
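One shape that repo can take; a hypothetical layout, not a prescription:

```
finetune-repo/
├── data/          # curated, versioned, with a real holdout
├── eval/          # frozen suites and behavioural probes (the harness)
├── adapters/      # versioned ΔW artifacts, cheap to swap or roll back
├── train.py       # LoRA/QLoRA adapter training, config-driven
└── gate.py        # runs the harness; exits non-zero below the gate
```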

Start with one process → Book a working session with Jentrix

Why "a one-off checkpoint that drifts the week after it ships" is the failure mode worth naming. The faint dashed line is the one-off — same model in production for 24 weeks; accuracy decays from 90% at launch to ~66% by week 24 — and crucially, it crosses the regression gate around week 14. From that point on, production is silently below your stop-ship threshold. The accent line is the loop: scheduled retrains at week 6, 12, 18 (vertical accent markers, "v2/v3/v4" releases) hold the model around 87%. Drift is real — distribution shift, new surfaces, new labels, seasonality. Trigger retrain by cadence and by signal (eval drop, disagreement rate, labelled feedback delta). The deliverable isn't a model — it's a repo that produces a fresh model on demand, with the harness as its referee. That's what we mean by fine-tuned models, engineered.
