Multi-agent experiment loop for vision model finetuning. Adapted from multiautoresearch: the same disciplined single-change methodology, applied to vision tasks, with YAML configs as the experiment surface instead of direct edits to training scripts.
Agents propose a hypothesis, change one config knob, run a finetune, and auto-promote when the metric beats the current master. Works locally on consumer GPUs or on HF Jobs.

| Task | Script | Default Model | Dataset | Metric |
|---|---|---|---|---|
| Classify | train_classify.py | google/vit-base-patch16-224 | food101 | accuracy |
| Detect | train_detect.py | ustc-community/dfine-small-coco | cppe-5 | mAP |
| Segment | train_segment.py | facebook/sam2.1-hiera-small | (none) | IoU |

All metrics are higher-is-better.

Local run:

    uv sync
    uv run scripts/refresh_master.py
    # edit configs/base_classify.yaml: one knob change
    uv run prepare.py --dataset food101 --task classify --split train
    CUDA_VISIBLE_DEVICES=0 uv run scripts/run_local.py --task classify --config configs/base_classify.yaml
    uv run scripts/submit_patch.py --comment "classify: lr 1e-4"

HF Jobs run:

    uv run scripts/refresh_master.py
    uv run scripts/hf_job.py launch --task classify --config configs/base_classify.yaml
    uv run scripts/hf_job.py logs <JOB_ID> --follow --output /tmp/vision-run.log
    uv run scripts/parse_metric.py /tmp/vision-run.log
    uv run scripts/submit_patch.py --comment "classify: lr 1e-4"

Send a single prompt and the agent handles refresh, config edit, run, parse, and submit:

    CUDA_VISIBLE_DEVICES=0 opencode run "
    Finetune google/vit-base-patch16-224 on food101 using the classify task.
    Read AGENTS.md for repo conventions. Follow the standard workflow end-to-end.
    "

For parallel experiments on multi-GPU:

    CUDA_VISIBLE_DEVICES=0 uv run scripts/opencode_worker.py run exp-01 &
    CUDA_VISIBLE_DEVICES=1 uv run scripts/opencode_worker.py run exp-02 &

Each worker runs in an isolated git worktree under `.runtime/worktrees/`.

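Isolation of this kind can be built on `git worktree`. The sketch below only constructs the command a worker might run. The function name, the `exp/` branch prefix, and the one-directory-per-experiment layout are assumptions for illustration, not the actual behavior of `opencode_worker.py`.

```python
from pathlib import Path

# Root directory named in this README; the layout beneath it is an assumption.
WORKTREE_ROOT = Path(".runtime/worktrees")


def worktree_command(experiment_id: str, base_ref: str = "HEAD") -> list[str]:
    """Build a `git worktree add` command for one isolated experiment checkout."""
    if not experiment_id or "/" in experiment_id:
        raise ValueError(f"invalid experiment id: {experiment_id!r}")
    path = WORKTREE_ROOT / experiment_id
    # -b creates a new branch for the worktree; the "exp/" prefix is hypothetical.
    return [
        "git", "worktree", "add",
        "-b", f"exp/{experiment_id}",
        path.as_posix(),
        base_ref,
    ]
```

Passing the returned list to `subprocess.run(..., check=True)` would create the checkout. Giving each experiment its own branch and directory keeps concurrent config edits from colliding.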
- Config YAML is the experiment surface. Edit `configs/*.yaml`, never training scripts. Configs are parsed natively via `HfArgumentParser.parse_yaml_file()`; keys map 1:1 to `TrainingArguments` / dataclass fields.
- One hypothesis per run. Change exactly one config knob per experiment.
- Refresh before each experiment. `refresh_master.py` restores configs to the current promoted master.
- Auto-promotion. `submit_patch.py` appends to `research/results.tsv` and promotes the config if the metric beats master. `research/live/master.json` is the source of truth.

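The append-then-promote behavior described above can be sketched as follows. This is a minimal illustration, not the code in `submit_patch.py`: the ledger columns, the `metric` key in the master record, the strict greater-than tie rule, and the function name are all assumptions. Only the two file roles (an append-only TSV ledger and `master.json` as the source of truth) and the higher-is-better rule come from this README.

```python
import csv
import json
from pathlib import Path

FIELDS = ["comment", "metric", "config"]  # assumed ledger columns


def record_and_maybe_promote(ledger: Path, master: Path, run: dict) -> bool:
    """Append `run` to the TSV ledger, then promote it if it beats master.

    `run` is assumed to look like {"comment": str, "metric": float, "config": str}.
    Returns True when the run became the new master.
    """
    write_header = not ledger.exists()
    ledger.parent.mkdir(parents=True, exist_ok=True)
    with ledger.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        if write_header:
            writer.writeheader()
        writer.writerow({k: run[k] for k in FIELDS})

    # With no master on disk, any finished run becomes the first master.
    best = json.loads(master.read_text())["metric"] if master.exists() else float("-inf")
    if run["metric"] > best:  # strict comparison: a tie does not replace master
        master.parent.mkdir(parents=True, exist_ok=True)
        master.write_text(json.dumps({k: run[k] for k in FIELDS}, indent=2))
        return True
    return False
```

Every run is written to the ledger before the comparison, so failed experiments stay on record even when they are not promoted.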

Config knobs to vary:

- `learning_rate`, `weight_decay`, `warmup_steps`, `lr_scheduler_type`
- `per_device_train_batch_size`, `gradient_accumulation_steps`
- `num_train_epochs`, `fp16`
- `freeze_backbone`, `image_square_size` (detect)
- `use_albumentations` (detect), `prompt_type` (segment)

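The one-knob-per-run rule can be checked mechanically by diffing a candidate config against the promoted master. The helpers below are a hypothetical sketch that operates on already-loaded config dicts; they are not part of the repo, and the example values are made up for illustration.

```python
def changed_knobs(master_cfg: dict, candidate_cfg: dict) -> list[str]:
    """Return the sorted keys whose values differ, including added or removed keys."""
    missing = object()  # sentinel so an explicit None still compares correctly
    keys = set(master_cfg) | set(candidate_cfg)
    return sorted(
        k for k in keys
        if master_cfg.get(k, missing) != candidate_cfg.get(k, missing)
    )


def assert_single_change(master_cfg: dict, candidate_cfg: dict) -> str:
    """Raise unless exactly one knob differs; return that knob's name."""
    diff = changed_knobs(master_cfg, candidate_cfg)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed knob, found {len(diff)}: {diff}")
    return diff[0]


# Illustrative values only; the real defaults live in configs/*.yaml.
master = {"learning_rate": 5e-5, "weight_decay": 0.0, "num_train_epochs": 3}
candidate = {**master, "learning_rate": 1e-4}
print(assert_single_change(master, candidate))  # learning_rate
```

A check like this could run before a finetune is launched, so a run that changes two knobs at once is rejected before it spends GPU time.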

    ├── configs/
    │   ├── base_classify.yaml
    │   ├── base_detect.yaml
    │   └── base_segment.yaml
    ├── train_classify.py           # stable; do not edit
    ├── train_detect.py             # stable; do not edit
    ├── train_segment.py            # stable; do not edit
    ├── prepare.py                  # dataset validation CLI (vision_lab.dataset_validation)
    ├── scripts/
    │   ├── refresh_master.py       # restore config from promoted master
    │   ├── run_local.py            # local GPU execution
    │   ├── hf_job.py               # HF Jobs launcher
    │   ├── parse_metric.py         # extract metrics from logs
    │   ├── submit_patch.py         # record run + auto-promote
    │   ├── opencode_worker.py      # agent worktree isolation
    │   ├── worker_common.py        # shared worker utilities
    │   ├── trackio_reporter.py     # experiment monitoring
    │   ├── dataset_inspector.py    # dataset format validation
    │   ├── estimate_cost.py        # cost estimation
    │   └── local_results.py        # results ledger management
    ├── research/
    │   ├── results.tsv             # append-only run ledger
    │   ├── notes.md                # experiment notebook
    │   ├── do-not-repeat.md        # failed experiment guidance
    │   ├── paper-ideas.md          # literature-derived hypotheses
    │   ├── live/                   # promoted master + DAG
    │   ├── reference/              # seed master snapshots
    │   ├── campaigns/              # active campaign docs
    │   ├── experiments/            # per-experiment docs
    │   └── templates/              # campaign/experiment templates
    ├── AGENTS.md                   # agent roles + rules
    ├── program.md                  # benchmark entrypoint
    └── pyproject.toml

Based on multiautoresearch by @burtenshaw.