Jack Pan

#data pipeline

7 posts

Phase 1: Canonical filenames buy zero-config batching

A short naming convention like `NN_NNN_ego.mp4` encodes enough metadata to run the entire batch with one CLI flag. Why this is cheap to add, and what it forces you to *not* build.

data pipeline, cli

Phase 1: Don't use Label Studio Source Storage for local files

A short warning. LS's "Cloud Storage → Source Storage" feature looks like exactly what you want for local data. Use it and you get tens of thousands of phantom tasks that collide with the ones you actually imported.

label studio, data pipeline

Phase 1: Four signals for review-frame sampling

A uniform "every Nth frame" sampler wastes labeler time on near-identical frames the model already nailed. Four signals do better: segment boundaries, per-segment uniform sampling, low confidence, and bbox jumps.

computer vision, data pipeline, label studio

Phase 1: One module owns every path on disk

Why a video pre-annotation pipeline ends up with one `layout` module that knows where everything lives, and what breaks when six different parts of the codebase each compute paths their own way.

data pipeline, python, architecture

Phase 1: Why one episode becomes three Label Studio projects

A deep dive on the multi-project pattern for video pre-annotation: what forces the split, how one episode fans out, and when not to fight Label Studio's data model.

computer vision, label studio, data pipeline

Phase 1: Two fps knobs in a video pre-annotation pipeline

Inference frame rate and review-frame sampling look like one thing and aren't. What each knob actually buys, and what breaks if you treat them as the same.

computer vision, data pipeline, mediapipe

Phase 1: Notes from building a video pre-annotation pipeline

A Phase-1 pipeline for embodied-robot video data — MediaPipe + YOLO inference, action segmentation, Label Studio import — plus the boring path-abstraction decision that kept it from collapsing.

computer vision, data pipeline, label studio