Jack Pan

Phase 2: don't retrain on every export

· 3 min read

You’ve harvested ground truth from a Label Studio export and have a fresh training slice. The natural next thought is “retrain now”. Don’t — not on every export. Here’s why, and what to do instead.

The temptation

The pipeline runs episodes through Phase 1, labelers review them in Label Studio, you export. Each export is a “win” — more corrected frames than yesterday. Why wouldn’t you retrain?

Because retraining has three costs that aren’t visible from the export side:

  1. Fine-tunes have variance. A single fine-tune run on N samples ≠ the limit of “perfect training on N samples”. Two consecutive fine-tunes on the same data, different seeds, will differ in F1 by 1-2 points. Single-batch fine-tunes show up as noise on the eval curve.
  2. Each fine-tune invalidates downstream caches. Phase-1 outputs (predictions JSON, review frames, LS task JSONs) were generated by model v_N. A new model means re-running Phase 1 on whatever episodes you want to re-pre-annotate, plus the review frames they produce. That’s hours per episode.
  3. GPU time. A meaningful fine-tune on a hand-keypoint model + a small YOLO + a segmenter is ~2 hours of GPU per cycle. Triggering this for every 10-episode batch of corrections is roughly $50-100/cycle for a result that probably doesn’t move the eval needle.

So: don’t retrain after every export. The question is when.

What “enough corrections to matter” looks like

A reasonable trigger is “the cumulative corrected episodes since the last training cross some fraction of the training corpus”. Concretely:

if (new_corrected_episodes >= 0.10 * cumulative_training_corpus):
    trigger_retrain()

Why 10%:

  • Below ~5% new data, fine-tunes are dominated by noise. Improvement signal is washed out by run-to-run variance.
  • Above ~20%, you’ve left improvements on the table — predictions on the new episodes were generated by an older model and were less helpful to labelers than they could’ve been.

10% is the rough middle. It’s a starting point, not a tuned hyperparameter.

What else should gate a retrain

Two more conditions worth checking before the GPU bill arrives:

  1. The new corrections have to span tasks. If 80% of the new batch is from the same task_subdir, you’re going to overfit to that task’s distribution. Hold the retrain until corrections are spread.
  2. The eval set hasn’t drifted. If you’ve updated the eval set since the last retrain (added new clean-from-scratch frames), the previous model’s eval numbers aren’t directly comparable. Decide whether you’re measuring “model improvement” or “new eval distribution”; don’t mix.

What I’d build first

A should_retrain.py check that runs after every aggregate step:

def should_retrain(state) -> tuple[bool, str]:
    delta = state.corrected_episodes_since_last_train
    corpus = state.cumulative_training_corpus
    if delta < 0.10 * corpus:
        return False, f"need 10% delta, have {delta/corpus:.1%}"
    if state.task_coverage_gini > 0.5:
        return False, "corrections concentrated in too few tasks"
    if state.eval_set_changed_since_last_train:
        return False, "eval set drifted, pin it before retraining"
    return True, "ready"

Just stdout. Don’t auto-trigger; print the decision and let the human kick off the fine-tune. Auto-triggering on quantitative thresholds is a recipe for late-night GPU bills.

The orthogonal case: prediction quality has actually degraded

If labelers report that pre-annotations are worse this cycle than last, ignore the cadence rule and retrain immediately. That’s a regression signal, not an improvement opportunity, and the cadence math doesn’t apply. Track this separately — per-release corrections-per-episode — and treat sustained regressions as drop-everything events.

Recap

  • Don’t retrain on every export. Variance dominates; caches invalidate; GPU bills add up.
  • A 10%-of-corpus delta is a reasonable trigger. Tune later.
  • Add gates: corrections spread across tasks, eval set pinned since last train.
  • Print the decision; let a human pull the trigger.
  • If pre-annotation quality regresses, ignore cadence and retrain.