Phase 1: Four signals for review-frame sampling
Deep dive on the review-frame sampling strategy from the Phase-1 overview. The pitch in the overview: don’t just uniform-sample at N fps — pick frames that are worth looking at. This post is the strategy.
What review-frame sampling is for
Project B is “human verifies model’s hand keypoints and operated-object bbox on N selected ego frames per episode”. The whole point of sampling is to spend labeler time on frames where humans add value, not where the model was already correct.
A uniform 1 fps downsample of a 2-minute clip is 120 frames. Project B’s default density is 30–50, so the sampler drops well over half of them. That’s fine if you’re dropping the right ones.
The four signals
For Project B (hand + object on ego), the default sampler combines four signals:
- Segment boundaries — every frame where the action segmenter starts or ends a segment. These are the highest-information frames per second: the model is most likely to fuzz the start/end of an action by a few frames, and a single correction propagates to the whole segment.
- Per-segment uniform — 2 frames from inside each segment, uniformly spaced. Covers the mid-action state. Catches drift in keypoint tracking during the segment.
- Low-confidence frames — any frame where MediaPipe’s hand-presence confidence drops below 0.7. Model uncertainty is a near-perfect proxy for “this is where you’ll find errors”.
- Bbox-area jumps — any frame where the operated-object bbox changes area by >50% from the previous frame. Usually means the detector flickered or jumped to a different object. Cheap to flag.
For a 2-minute clip the union of these lands at 30–50 frames per Project-B task. Project C (body pose on exo) uses a plain uniform 12 frames per clip, because body pose drifts more smoothly and the four-signal payoff is smaller.
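To make the union concrete, here is a minimal Python sketch of how the four signals could be combined. It is not the project's actual sampler: the function name select_review_frames, the Segment type, and the per-frame input lists hand_conf and bbox_area are assumptions made for illustration. Only the defaults (2 frames per segment, 0.7 confidence, 50% area change) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int  # first frame index of the segment (inclusive)
    end: int    # last frame index of the segment (inclusive)


def select_review_frames(segments, hand_conf, bbox_area,
                         per_segment=2, conf_threshold=0.7, area_jump=0.5):
    """Hypothetical sampler: returns {frame_index: set of signal names}.

    hand_conf[i] and bbox_area[i] are assumed per-frame model outputs.
    """
    picked = {}

    def flag(i, signal):
        picked.setdefault(i, set()).add(signal)

    for seg in segments:
        # Signal 1: segment boundaries.
        flag(seg.start, "boundary")
        flag(seg.end, "boundary")
        # Signal 2: evenly spaced frames strictly inside the segment.
        span = seg.end - seg.start
        for k in range(1, per_segment + 1):
            i = seg.start + round(k * span / (per_segment + 1))
            if seg.start < i < seg.end:
                flag(i, "segment_uniform")

    # Signal 3: hand-presence confidence below the threshold.
    for i, conf in enumerate(hand_conf):
        if conf < conf_threshold:
            flag(i, "low_confidence")

    # Signal 4: bbox area changes by more than area_jump relative to
    # the previous frame (frames with no previous area are skipped).
    for i in range(1, len(bbox_area)):
        prev, cur = bbox_area[i - 1], bbox_area[i]
        if prev > 0 and abs(cur - prev) / prev > area_jump:
            flag(i, "bbox_jump")

    return picked


# Toy 20-frame clip: two segments, one confidence dip, one bbox flicker.
segments = [Segment(0, 9), Segment(10, 19)]
hand_conf = [0.9] * 20
hand_conf[5] = 0.4
bbox_area = [100.0] * 20
bbox_area[12] = 200.0

picked = select_review_frames(segments, hand_conf, bbox_area)
print(sorted(picked))  # [0, 3, 5, 6, 9, 10, 12, 13, 16, 19]
```

Returning the signal names per frame, rather than a bare list of indices, keeps it possible to attribute later labeler corrections to whichever signal selected the frame.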
Why uniform-N-fps fails
The intuition that “1 fps = 120 frames, density looks reasonable” is wrong in two ways:
- Redundancy. Most of those 120 frames are mid-action, model-confident, near-identical to the frame before. Labelers click through with no corrections; you paid for 120 frames of UI to get 5 frames of actual signal.
- Missing rare events. A grasp that lasts 200 ms gets sampled at most once at 1 fps, and usually not at all. When it’s missed, the model never gets corrected on it.
Uniform sampling treats every second of video as equally interesting. It isn’t.
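The rare-event point can be put in numbers. The sketch below assumes the event starts at a uniformly random offset relative to the sampling grid, which is a simplification; hit_probability is a name made up for this example.

```python
def hit_probability(event_seconds, sample_fps):
    """Chance that at least one uniformly spaced sample lands inside an
    event, assuming the event's start time is uniformly random relative
    to the sampling grid."""
    return min(1.0, event_seconds * sample_fps)


# A 200 ms grasp under 1 fps sampling versus 5 fps sampling.
print(hit_probability(0.2, 1.0))
print(hit_probability(0.2, 5.0))
```

Under that assumption a 200 ms grasp is caught about one time in five at 1 fps, and uniform sampling would need roughly 5 fps (about 600 frames for a 2-minute clip) to catch it reliably.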
Tunables
The strategy has knobs:
- --review-hand-per-segment N: frames per segment (default 2). Tighten for longer or more variable actions.
- Confidence threshold for the low-conf signal (default 0.7). Lower → fewer flagged frames → smaller review set, but you’ll miss borderline cases.
- Bbox-jump threshold (default 50%). Lower → more sensitive → more false positives from camera motion.
I’ve left these at defaults for a year. Tune only if the review-frame distribution starts looking pathological.
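If you do decide to tune, one cheap check is to replay a past batch's confidences against a few candidate thresholds before changing anything. A minimal sketch, assuming you have the per-frame hand-presence confidences as a list; flagged_counts is a hypothetical helper, not part of the tool:

```python
def flagged_counts(hand_conf, thresholds=(0.5, 0.6, 0.7, 0.8)):
    """For each candidate threshold, count the frames whose confidence
    falls below it, i.e. how many the low-confidence signal would flag."""
    return {t: sum(c < t for c in hand_conf) for t in thresholds}


# Made-up confidences for six frames.
counts = flagged_counts([0.95, 0.65, 0.72, 0.55, 0.90, 0.45])
print(counts)  # {0.5: 1, 0.6: 2, 0.7: 3, 0.8: 4}
```

The same replay idea would apply to the bbox-jump threshold, with relative area change in place of confidence.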
When to override with uniform fps
Two scenarios where --review-hand-fps N (forcing uniform sampling) beats the four-signal default:
- Spot-checking a new dataset. You don’t trust the segmenter or the detector yet. Uniform sampling gives you a known density to compare against.
- Smoke test for a new task vocabulary. Same reason. Uniform is predictable; the four-signal sampler’s output depends on model behavior.
For production batches, leave the default on. The point of the strategy isn’t optimality — it’s giving labelers frames that are worth looking at.
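For comparison, the uniform override is simple enough to write down in a few lines. This is a sketch of what fixed-rate sampling amounts to, not the code behind --review-hand-fps; the function name and arguments are assumptions.

```python
import math


def uniform_frame_indices(num_frames, video_fps, review_fps):
    """Pick frame indices at a fixed review rate, starting at frame 0."""
    if video_fps <= 0 or review_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never step by less than one frame, even if review_fps > video_fps.
    step = max(1.0, video_fps / review_fps)
    count = math.ceil(num_frames / step)
    return [min(int(k * step), num_frames - 1) for k in range(count)]


# A 2-minute clip at 30 fps, reviewed at 1 fps.
indices = uniform_frame_indices(3600, 30, 1)
print(len(indices))  # 120
```

The output depends only on clip length and frame rates, which is what makes it a stable baseline when the segmenter or detector is not yet trusted.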
What I’d monitor
- Average frames per Project-B task. If it climbs from 35 to 80, something upstream is producing more low-confidence frames or more bbox jumps. Could be a model regression, could be a hard new task type, could be footage from a different camera.
- Labeler correction rate per signal. If “low-confidence” frames are corrected 90% of the time but “per-segment uniform” frames are corrected 5% of the time, the per-segment density is too high. Use that to tune.
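The per-signal correction rate is straightforward to compute if each reviewed frame records which signals selected it and whether the labeler changed anything. A minimal sketch, assuming review records arrive as (signals, corrected) pairs; that record shape and the function name are made up for this example:

```python
from collections import defaultdict


def correction_rate_by_signal(reviews):
    """reviews: iterable of (signals, corrected) pairs, where signals is
    the set of signal names that selected the frame and corrected is True
    if the labeler changed anything. Returns {signal: correction rate}."""
    shown = defaultdict(int)
    fixed = defaultdict(int)
    for signals, corrected in reviews:
        for name in signals:
            shown[name] += 1
            if corrected:
                fixed[name] += 1
    return {name: fixed[name] / shown[name] for name in shown}


reviews = [
    ({"low_confidence"}, True),
    ({"low_confidence", "bbox_jump"}, True),
    ({"segment_uniform"}, False),
    ({"segment_uniform"}, False),
    ({"boundary"}, True),
    ({"boundary"}, False),
]
rates = correction_rate_by_signal(reviews)
print(rates["low_confidence"], rates["segment_uniform"], rates["boundary"])
# 1.0 0.0 0.5
```

A frame selected by several signals is credited to each of them here, so the rates are not independent; treat them as a rough guide to which knob to turn rather than a precise attribution.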
Recap
- Uniform N-fps sampling spends labeler time on frames the model already nailed.
- The four signals (segment boundaries, per-segment uniform, low-confidence, bbox-jump) pick frames where humans actually correct things.
- Project B uses all four; Project C is plain uniform because body pose doesn’t have the same failure modes.
- Knobs exist for tuning but the defaults have held up.
- Monitor frame count distributions and per-signal correction rates if you want to keep the sampler honest.