WIP: nightly_best_launch — multi-agent nightly derived from oignons2 by eugenevinitsky · Pull Request #467 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-06-01T22:33:22Z

Summary

Adds scripts/cluster_configs/nightly_best_launch.yaml and scripts/launch_nightly_best.sh — a multi-agent nightly mirroring the single_agent_speed_run.yaml / launch_single_agent.sh pair but with the oignons2 policy + reward + perturbation config (1600 agents across 8 CARLA towns, 3×1024 split network with gigaflow encoder, reward conditioning + randomization + partner blindness + phantom braking on, route-based [20, 60m] goals).
Default 3 seeds, date-stamped wandb run names (<date>_seed{0,1,2}), wandb project nightly-multi-agent, --mem 192gb, 30h wall.
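The date-stamped run naming above can be sketched in a few lines. This is a minimal illustration, not the launch script itself: `nightly_run_names` is a hypothetical helper, and the only thing taken from the PR is the `<date>_seed{0,1,2}` pattern and the default of 3 seeds.

```python
from datetime import date


def nightly_run_names(day: date, num_seeds: int = 3) -> list[str]:
    """Build one run name per seed in the <date>_seed<k> form (hypothetical helper)."""
    stamp = day.isoformat()  # ISO date, e.g. "2026-06-01"
    return [f"{stamp}_seed{k}" for k in range(num_seeds)]


# Names for the date used in the test plan of this PR.
print(nightly_run_names(date(2026, 6, 1)))
```

For 2026-06-01 this yields the three names the test plan expects to see in the nightly-multi-agent project.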

What was derived from oignons2

env: max_agents_per_env=150, num_agents=1600, dynamics_model=jerk, target_type=dijkstra, scenario_length=3840, resample_frequency=38400, full perturbation + reward conditioning/randomization, oignons2 obs ranges (200m partner / 200m road_front), oignons2 obs dropouts.
policy: input_size=256, backbone=3×1024, actor=critic=1024, split_network=true, encoder_gigaflow=true.
train: total_timesteps=125B, minibatch=153600, update_epochs=3, bptt=128, compile=true, precision=bfloat16, normalize_rewards=false, checkpoint_interval=500.
adapted: map_dir → local pufferlib/resources/drive/binaries/carla, num_maps=8 (vs. oignons2's 16 — we ship 8 CARLA towns).

Evals

validation_gigaflow inline (egl render), interval 250 to keep eval ~5% of wall-clock instead of ~85%.
validation_replay disabled (nuPlan bins not on cluster path).
All behaviors_* disabled.

Test plan

Verify 3 seeds (9997175, 9997176, 9997177) train past first eval cycle on h200_tandon without OOM/crash
Confirm wandb runs land in nightly-multi-agent project with run names 2026-06-01_seed{0,1,2}
Verify policy architecture in saved config.yaml matches oignons2 (3×1024 split network)

🤖 Generated with Claude Code

Adds scripts/cluster_configs/nightly_best_launch.yaml and scripts/launch_nightly_best.sh — a multi-agent nightly mirroring the single_agent_speed_run/launch_single_agent pair but with the oignons2 (emerge/temp_training/weights/oignons2/config.yaml) policy + reward + perturbation config: 1600 agents across 8 CARLA towns, 3x1024 split network with gigaflow encoder, reward conditioning + randomization + partner blindness + phantom braking on, route-based [20, 60m] goals. Defaults wandb_project=nightly-multi-agent, 3 seeds, --mem 192gb. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…jkstra) Drive.__init__ on this branch only accepts target_type in {static,dynamic}; oignons2's "dijkstra" raised at first env construction (3 seeds died at 30s). Goals still follow the route via goal_on_lane=True (default), so dropping to "static" preserves the chained-route semantics. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Direct `puffer train puffer_drive` equivalent of nightly_best_launch.yaml for running without SLURM/submit_cluster. Booleans are passed as Python literals (True/False) since pufferl parses values via ast.literal_eval and the C binding rejects lowercase yaml-style true/false. NUM_AGENTS is parameterized because batch_size is auto (num_agents * bptt_horizon), so the on-GPU obs buffer scales with agent count. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
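The two points in this commit message (Python-literal booleans and the auto batch size) can be illustrated with a short sketch. It only demonstrates how `ast.literal_eval` itself treats `True` versus `true`; how pufferl handles a failed parse before the value reaches the C binding is not shown here. `parse_override` and `auto_batch_size` are hypothetical names, and the agent counts are example inputs.

```python
import ast


def parse_override(raw: str):
    """Parse a CLI value as a Python literal (hypothetical helper around ast.literal_eval)."""
    return ast.literal_eval(raw)


def auto_batch_size(num_agents: int, bptt_horizon: int) -> int:
    """Auto batch size as described in the commit: num_agents * bptt_horizon."""
    return num_agents * bptt_horizon


# "True" is a valid Python literal and parses to a bool.
print(type(parse_override("True")).__name__)

# "true" is a bare name, not a literal, so literal_eval raises ValueError.
try:
    parse_override("true")
except ValueError:
    print("lowercase 'true' is not a Python literal")

# The on-GPU obs buffer grows linearly with the agent count (bptt=128 as in this PR).
for agents in (1600, 2048):
    print(agents, auto_batch_size(agents, 128))
```

Because the batch size is a product rather than a fixed number, halving NUM_AGENTS halves the rollout buffer, which is the motivation for exposing it as a parameter.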

The headless EGL render path is compile-gated on <EGL/egl.h> (DRIVE_HAS_EGL in drive.h) and that header isn't installed here, so the binding was built without it and render_backend=egl would fall back to Xvfb/software. Set validation_gigaflow.render=False so the gigaflow eval still runs and logs metrics while skipping the render pass (base.py: _render_pass is gated on self.render). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- First positional arg is now NUM_GPUS: >1 launches DDP via `torchrun --standalone --nproc-per-node N -m pufferlib.pufferl train`, else single-GPU `puffer train` on device 0. Under DDP num_agents is per-rank, so `4 2048` = 2048 agents/GPU = 8192 effective.
- Flags moved into a bash array shared by both launch paths.
- cd to repo root via $0 so relative config paths (env.map_dir) resolve regardless of invocation cwd.
- Enable wandb (project nightly-multi-agent, group Nightly_MultiAgent) with a dated RUN_TAG, matching the cluster nightly config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
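The launch-path selection described in this commit can be sketched as follows. The actual script is bash; this Python version is only an illustration of the branching, with `build_launch_command` and `effective_agents` as hypothetical names. The `puffer_drive` environment argument comes from the earlier commit message, and any further flag overrides are left out because their exact spelling is not given here.

```python
def build_launch_command(num_gpus: int, shared_flags: list[str]) -> list[str]:
    """Pick the DDP or single-GPU launch path; both reuse the same flag list."""
    if num_gpus > 1:
        # Multi-GPU: one process per GPU via torchrun, as quoted in the commit.
        return [
            "torchrun", "--standalone", "--nproc-per-node", str(num_gpus),
            "-m", "pufferlib.pufferl", "train", *shared_flags,
        ]
    # Single GPU: plain `puffer train`.
    return ["puffer", "train", *shared_flags]


def effective_agents(num_gpus: int, agents_per_rank: int) -> int:
    """Under DDP num_agents is per rank, so the aggregate scales with the GPU count."""
    return max(num_gpus, 1) * agents_per_rank


shared_flags = ["puffer_drive"]  # real overrides would be appended here (assumption)
print(" ".join(build_launch_command(4, shared_flags)))
print(" ".join(build_launch_command(1, shared_flags)))
print(effective_agents(4, 2048))
```

Keeping the flags in one list mirrors the shared bash array: both paths receive identical overrides, so the only difference between them is the launcher.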

pufferl divides train.total_timesteps by world size when LOCAL_RANK is set, so a fixed 10B total became 2.5B/rank on 4 GPUs. Scale the total by NUM_GPUS (PER_RANK_TIMESTEPS * NUM_GPUS) so each rank always targets 10B: 4 GPUs -> 40B total -> 10B/rank (40B aggregate env-steps), 1 GPU -> 10B (no division). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
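The scaling arithmetic in this commit can be checked with a small sketch. It assumes only what the message states: the total is divided by the world size under DDP and left untouched on a single GPU. The function names are hypothetical; `PER_RANK_TIMESTEPS` is the name used in the commit message.

```python
PER_RANK_TIMESTEPS = 10_000_000_000  # 10B per rank, as in the commit message


def scaled_total(num_gpus: int, per_rank: int = PER_RANK_TIMESTEPS) -> int:
    """Total passed as train.total_timesteps after the fix: per-rank target times GPU count."""
    return per_rank * num_gpus


def per_rank_target(total: int, world_size: int) -> int:
    """What each rank ends up with: the total divided by world size under DDP."""
    return total // world_size if world_size > 1 else total


# Before the fix: a fixed 10B total shrinks to 2.5B per rank on 4 GPUs.
print(per_rank_target(10_000_000_000, 4))

# After the fix: 4 GPUs -> 40B total -> 10B per rank; 1 GPU -> 10B, no division.
for gpus in (4, 1):
    total = scaled_total(gpus)
    print(gpus, total, per_rank_target(total, gpus))
```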

obs_dropout structurally shrinks the road-obs width via compute_effective_road_obs_count (kept = n * (1 - dropout)), so a policy trained with dropout (e.g. lane 40 / boundary 48) cannot slice the clean-eval obs (dropout forced to 0 by CLEAN_EVAL_OVERRIDES -> lane 80 / boundary 80). The misaligned slide_idx walk read garbage into traffic_control_type/state, and F.one_hot's scatter asserted "index out of bounds" (device-side assert), killing eval at epoch 250.

Branch the encoder slicing on self.training: use *_kept while training, full *_n during eval (policy.eval() at benchmark base.py). The per-slot encoders + max-pool are slot-count-agnostic, so the same weights consume both widths.

Net diff of PR #453 (restore obs_slots_boundary_kept for obs dim), torch.py only; applied directly to skip the branch's unrelated merge/render commits. Source commits: e7c5031, c613020, e4e8d02, a62f186.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
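The misalignment described in this commit can be reproduced in a simplified, pure-Python sketch. Several things here are assumptions for illustration: each slot is a single number, the observation ends with one traffic-control class id, the dropout values 0.5 and 0.4 are chosen only because they reproduce the lane 40 / boundary 48 example, and the rounding in `kept_count` may differ from compute_effective_road_obs_count. None of the function names below are the ones in torch.py.

```python
def kept_count(n: int, dropout: float) -> int:
    """kept = n * (1 - dropout), rounded to an integer slot count (rounding is an assumption)."""
    return int(round(n * (1.0 - dropout)))


def segment_widths(training: bool, lane_n: int, boundary_n: int,
                   lane_dropout: float, boundary_dropout: float) -> tuple[int, int]:
    """Mirror of the fix: kept widths while training, full widths during eval."""
    if training:
        return kept_count(lane_n, lane_dropout), kept_count(boundary_n, boundary_dropout)
    return lane_n, boundary_n


def read_traffic_control(obs: list[float], lane_w: int, boundary_w: int) -> float:
    """Walk past the lane and boundary segments and read the next field."""
    return obs[lane_w + boundary_w]


# Clean-eval observation: 80 lane slots, 80 boundary slots, then class id 2.
eval_obs = [0.1] * 80 + [0.5] * 80 + [2.0]

# Slicing eval obs with training widths lands inside the boundary segment.
train_w = segment_widths(True, 80, 80, 0.5, 0.4)
print(train_w, read_traffic_control(eval_obs, *train_w))

# Branching on the mode uses the full widths and reads the real class id.
eval_w = segment_widths(False, 80, 80, 0.5, 0.4)
print(eval_w, read_traffic_control(eval_obs, *eval_w))
```

With the training widths the read returns a boundary feature instead of a class id, which is the kind of value that would trip a one-hot encoding downstream; with the eval widths the walk lines up again.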

eugenevinitsky force-pushed the nightly_best_launch branch from 0693de9 to b232b7e Compare June 1, 2026 22:36

Eugene Vinitsky and others added 8 commits June 1, 2026 19:49

nightly_best_launch: disable resample_frequency

129966a

Sub-envs keep their initial map for the full run instead of cycling every 38400 steps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

nightly_best_launch: bump num_agents to 720k, shorter dt + scenario_l…

74c97fa

…ength Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

eugenevinitsky wants to merge 9 commits into emerge/temp_training from nightly_best_launch
