Domain‑tuned models on your data — three pieces, not one. An adapter that learns the domain cheaply, an evaluation harness that gates every release, and a retraining loop that catches drift before users do.
Open with the thesis: a fine-tuned model isn't a checkpoint, it's a small engineered system. Three pieces — adapter (LoRA/QLoRA, parameter-efficient deltas on the base), evaluation harness (a held-out suite plus CI gates that block regressions), and retraining loop (scheduled cadence triggered by drift signals). Without all three you ship a one-off that decays the moment the world moves on. Audience is AI pros — frame this as the difference between "we trained a model" and "we operate a model."
LoRA / QLoRA fits a small ΔW on top of a frozen base. Cheap, fast, easy to swap or roll back.
A held-out eval suite + regression gates wired into CI. No green, no ship.
Retrains fire on a set cadence and on drift signals. The model gets refreshed, not heroically rescued.
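The adapter idea above reduces to simple linear algebra: the frozen base weight W never changes, and training only fits a low-rank delta. A minimal numpy sketch, with illustrative shapes and a hypothetical `forward` helper (zero-initialising B so the adapter starts as a no-op, as in the original LoRA recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 64, 64, 4           # r << d is the low-rank bottleneck
W = rng.normal(size=(d_out, d_in))   # frozen base weight, never updated

# LoRA trains only A and B; effective weight is W + (alpha / r) * B @ A
A = rng.normal(scale=0.01, size=(r, d_in))  # trained
B = np.zeros((d_out, r))                    # trained, zero-init so dW starts at 0
alpha = 8.0

def forward(x, enable_adapter=True):
    """Base forward pass plus an optional low-rank delta."""
    y = W @ x
    if enable_adapter:
        y = y + (alpha / r) * (B @ (A @ x))
    return y

x = rng.normal(size=d_in)
# With B zero-initialised, the adapter contributes nothing at step 0:
assert np.allclose(forward(x, True), forward(x, False))
```

Rollback is just dropping A and B; per layer you store r·(d_in + d_out) adapter parameters instead of a full d_out·d_in weight, which is what makes swap and rollback cheap.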
The pipeline that makes a fine-tune durable. Curate labelled, versioned data with a real holdout. Train a parameter-efficient adapter (LoRA/QLoRA) — you're learning a small ΔW on top of a frozen base, which is what makes swap/rollback cheap. Eval against the harness — domain tasks, regression suite, behavioural probes. Ship with a canary or shadow before 100%. Monitor drift, quality and cost in production — and crucially the arrow goes back to curate. Without that return arrow you've built a deploy, not a system. The model is replaceable; the loop is the asset.
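One trip around that loop can be sketched as plain orchestration code. All stage names and stub return values here are hypothetical, chosen only to show the control flow and the return arrow:

```python
def curate():
    # versioned, labelled data with a real holdout (illustrative stub)
    return {"train": ["..."], "holdout": ["..."]}

def train(data):
    # fit a parameter-efficient adapter on a frozen base (stub)
    return "adapter-v2"

def evaluate(adapter):
    # harness verdict: domain tasks, frozen regression suite, probes (stub scores)
    return {"task_acc": 0.88, "regressions": 0}

def canary(adapter):
    # shadow or partial traffic before 100% (stub: passes here)
    return True

def pipeline_pass(drift_seen: bool):
    """One trip around the loop: (adapter, next step), or None on stop-ship."""
    data = curate()
    adapter = train(data)
    report = evaluate(adapter)
    if report["regressions"] > 0 or report["task_acc"] < 0.70:
        return None                       # red harness is a stop-ship signal
    if not canary(adapter):
        return None
    # the return arrow: drift in production queues the next curate() pass
    return adapter, ("curate" if drift_seen else "monitor")
```

The point of writing it this way is that the function, not the adapter string it returns, is the durable artifact: rerunning `pipeline_pass` reproduces a fresh release on demand.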
Suite versioned in the repo. Every PR runs it. No green, no merge.
Task accuracy · regression suite (frozen) · behavioural probes (safety, refusal, format).
A pretty leaderboard nobody enforces. If it doesn't block a release, it's vibes.
Reproducible · versioned · gated. A red harness is a stop‑ship signal — and that's the point.
The eval harness is what turns "we trained a model" into something defensible. Numbers shown are illustrative — typical lift you see when you actually fine-tune on the right domain data: support classification ~58 → 91; ticket triage ~54 → 88; clinical NER ~49 → 84; code review ~61 → 79; contract Q&A ~47 → 82. Three layers: task accuracy on the domain tasks you actually care about; a regression suite (frozen forever — catches "we improved X but broke Y"); and behavioural probes for safety, refusal patterns, output format. Wire it into CI: PR opens, harness runs, gate at 70% (or whatever your domain demands) — below the line, the merge button stays disabled. A pretty leaderboard nobody enforces is just vibes.
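The gate itself can be a few lines of script that CI runs on every PR. A minimal sketch; the 70% gate, the baseline scores, and the 0.02 regression tolerance are all illustrative values to tune per domain:

```python
import json
import sys

GATE = 0.70   # absolute floor; below the line, the merge stays blocked
BASELINE = {"support_cls": 0.91, "ticket_triage": 0.88}  # frozen reference scores

def gate(results: dict) -> bool:
    """Return True only if every task clears the gate and its frozen baseline."""
    ok = True
    for task, score in results.items():
        if score < GATE:
            print(f"FAIL {task}: {score:.2f} < gate {GATE:.2f}")
            ok = False
        elif score < BASELINE[task] - 0.02:  # small tolerance for eval noise
            print(f"REGRESSION {task}: {score:.2f} < baseline {BASELINE[task]:.2f}")
            ok = False
    return ok

if __name__ == "__main__" and len(sys.argv) > 1:
    # CI wiring: read the harness output, exit nonzero to block the merge
    results = json.load(open(sys.argv[1]))
    sys.exit(0 if gate(results) else 1)
```

The nonzero exit code is the whole mechanism: mark the job as a required check and "no green, no merge" stops being a slogan and becomes branch protection.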
Distribution shift in inputs · new product surfaces · new labels · seasonality. The model didn't change — the world did.
By cadence (e.g. every 6 weeks) and by signal — a drop in eval score, a rise in disagreement, or a labelled feedback delta.
Not a model — a repo that produces a fresh model on demand, with the harness as its referee.
Why "a one-off checkpoint that drifts the week after it ships" is the failure mode worth naming. The faint dashed line is the one-off — same model in production for 24 weeks; accuracy decays from 90% at launch to ~66% by week 24 — and crucially, it crosses the regression gate around week 14. From that point on, production is silently below your stop-ship threshold. The accent line is the loop: scheduled retrains at week 6, 12, 18 (vertical accent markers, "v2/v3/v4" releases) hold the model around 87%. Drift is real — distribution shift, new surfaces, new labels, seasonality. Trigger retrain by cadence and by signal (eval drop, disagreement rate, labelled feedback delta). The deliverable isn't a model — it's a repo that produces a fresh model on demand, with the harness as its referee. That's what we mean by fine-tuned models, engineered.
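The cadence-or-signal trigger is small enough to write down. A sketch with assumed thresholds (the six-week cadence matches the slide; the eval-drop and disagreement cutoffs are placeholders to calibrate against your own drift history):

```python
from datetime import date, timedelta

CADENCE = timedelta(weeks=6)   # scheduled retrain interval
EVAL_DROP = 0.03               # absolute drop vs launch eval that forces a retrain
DISAGREE_RATE = 0.15           # model-vs-label disagreement on fresh feedback

def should_retrain(last_retrain: date, today: date,
                   launch_eval: float, current_eval: float,
                   disagreement: float) -> bool:
    """Retrain by cadence OR by signal, whichever fires first."""
    by_cadence = today - last_retrain >= CADENCE
    by_signal = (launch_eval - current_eval >= EVAL_DROP
                 or disagreement >= DISAGREE_RATE)
    return by_cadence or by_signal

# cadence fires even when metrics still look healthy
assert should_retrain(date(2024, 1, 1), date(2024, 3, 1), 0.90, 0.90, 0.05)
# signal fires mid-cadence when the eval decays
assert should_retrain(date(2024, 1, 1), date(2024, 1, 15), 0.90, 0.84, 0.05)
```

Running this check on a schedule is what turns the week-14 silent crossing on the chart into an explicit retrain ticket at week 6 or the moment the signal trips, whichever comes first.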