WIP: eval: per-episode metric sort UI on gallery index by eugenevinitsky · Pull Request #461 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-05-30T19:51:16Z

Adds a per-episode metric sort dropdown to the triage_html / obs_html gallery index. Pick offroad_rate, collision_rate, dnf_rate, etc., and tiles reorder client-side. No re-render needed.

Smoke-tested against model_puffer_drive_000382 from a single-agent run — 16 scenarios rendered, index.html populated with the sort dropdown.

Branch carries some unrelated yaml + test-fix commits that could be factored out for a cleaner diff.

…yaml smoke test

- drive.py: if map_dir points at a .bin file, use it directly as the only map; otherwise scan the directory as before. Lets the single-agent launcher pin to one CARLA bin (Town10HD) without a separate starting_map knob and without breaking multi_scenario's render-time random injection.
- drive.ini: declare enabled = true in [eval.validation_replay] so pufferl's argparser exposes --eval.validation-replay.enabled (otherwise the launcher yaml's enabled: 0 override died as an unrecognized argument).
- single_agent_speed_run.yaml: drop env.starting_map; point env.map_dir straight at opendrive__Town10HD.bin.
- tests/test_single_agent_yaml.py: parse the launcher yaml into pufferl's CLI flags and assert load_config accepts every key and that the values land in the expected nested slots. Catches future regressions where the yaml references a flag that isn't wired into drive.ini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
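A minimal sketch of the map_dir behaviour described for drive.py, assuming a hypothetical helper named `resolve_map_files` (the real function name and its handling of `None` are not shown in this PR text). It only illustrates the "single .bin file, else scan the directory" branch.

```python
from pathlib import Path
from typing import List, Optional


def resolve_map_files(map_dir: Optional[str]) -> List[Path]:
    """Hypothetical helper: return the map files implied by map_dir."""
    if map_dir is None:
        # Assumption for this sketch only: an unset map_dir is an error.
        raise FileNotFoundError("map_dir is not set")
    path = Path(map_dir)
    # A path that points straight at a .bin file is used as the only map.
    if path.is_file() and path.suffix == ".bin":
        return [path]
    # Otherwise treat it as a directory and collect every .bin inside it.
    return sorted(p for p in path.iterdir() if p.suffix == ".bin")
```

With this shape, a launcher can pin a run to one map by pointing map_dir at the file itself, while directory-based configs keep working unchanged.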

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously fell into FileNotFoundError via map_dir=None and silently skipped in CI even when the carla binaries were present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The eight-way 1080p ffmpeg encode + EGL render burst at the validation_gigaflow eval epoch was aborting long single-agent runs with SIGABRT under VRAM/RAM pressure. The single-agent yaml already disables every other eval evaluator, so disabling this one too removes the eval signal entirely but keeps the run alive past epoch 250. Also declares enabled in [eval.validation_gigaflow] so the argparse flag exists (same pattern as the earlier validation_replay fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
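The "declare the key so the argparse flag exists" pattern can be sketched as follows. This is an illustration only: it assumes flags are generated from ini sections as `--section.key` with underscores turned into hyphens, which matches the flag spelling quoted above but is not a copy of pufferl's parser. `build_parser` and `INI_TEXT` are hypothetical.

```python
import argparse
import configparser

# Hypothetical ini content: only validation_replay declares `enabled`.
INI_TEXT = """
[eval.validation_replay]
enabled = true

[eval.validation_gigaflow]
render_backend = egl
"""


def build_parser(ini_text: str) -> argparse.ArgumentParser:
    """Expose every declared ini key as a --section.key flag."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    parser = argparse.ArgumentParser()
    for section in config.sections():
        for key, default in config[section].items():
            flag = f"--{section}.{key}".replace("_", "-")
            parser.add_argument(flag, default=default)
    return parser


parser = build_parser(INI_TEXT)
# The second override targets a key the ini never declared.
known, unknown = parser.parse_known_args(
    [
        "--eval.validation-replay.enabled", "0",
        "--eval.validation-gigaflow.enabled", "0",
    ]
)
```

In this sketch the declared flag is parsed normally, while the undeclared one ends up in `unknown`, which is the situation a strict parser would report as an unrecognized argument.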

Brings in #443 (Improve html viz). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single-map (Town10HD) gigaflow eval with the triage_html render backend, for inspecting which rollouts DNF when running an existing checkpoint via `puffer eval puffer_drive --evaluator dnf_triage --load-model-path <ckpt>`. Disabled by default so it does not run on every training cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds render_backend and env.num_agents to [eval.validation_gigaflow] so the launcher yaml can flip render backend / agent counts without per-run drive.ini edits. The single_agent_speed_run.yaml leaves validation_gigaflow enabled and reconfigures it inline:

- render_backend: triage_html (CPU-only HTML, no EGL/ffmpeg spike that aborted older runs at eval epochs)
- env.map_dir: opendrive__Town10HD.bin (matches training)
- env.num_maps: 1, env.num_agents: 64, single-agent-per-env

Each eval cycle drops a gallery of single-agent rollouts you can browse to see which DNF'd. No standalone evaluator section needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…riage_html eval enabled) # Conflicts: # scripts/cluster_configs/single_agent_speed_run.yaml

In gigaflow mode the C side fills info['scenario_id'] with the map's short name, which is identical across every rollout on the same map. The previous stem `{map_name}_{scenario_id}{step_suffix}` collided for every episode, so each render_compact_replay_html call overwrote the prior file and only the last rollout survived on disk. Append the monotonic scenarios_done counter so each episode lands in its own .html file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
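The collision and its fix can be reproduced with plain string formatting. The sketch below uses the two stem formats quoted in this PR; the function names `old_stem` / `new_stem` and the sample values (`Town10HD`, `_step0`) are illustrative assumptions.

```python
def old_stem(map_name: str, scenario_id: str, step_suffix: str = "") -> str:
    # Previous format: identical for every episode on the same map.
    return f"{map_name}_{scenario_id}{step_suffix}"


def new_stem(
    map_name: str, scenario_id: str, scenarios_done: int, step_suffix: str = ""
) -> str:
    # Fixed format: the zero-padded counter makes each episode unique.
    return f"{map_name}_{scenario_id}_{scenarios_done:04d}{step_suffix}"


# Three episodes on one map; in gigaflow mode scenario_id equals the map name.
episodes = [("Town10HD", "Town10HD")] * 3

old = {old_stem(m, s, "_step0") for m, s in episodes}
new = [new_stem(m, s, i, "_step0") for i, (m, s) in enumerate(episodes)]
```

Here `old` collapses to a single stem, so each write would overwrite the previous file, whereas `new` holds three distinct stems.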

_render_pass_obs ends with viz.build_gallery_index(out_dir) so the gallery's index.html lands next to the per-episode files. _render_pass_html had the loop and tqdm wiring but no equivalent build call, so the triage_html render dir just held the raw N htmls with no entry point. Mirror the obs_html call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dnf_rate already exists in the aggregate Log struct (drive.h:2577 increments it whenever an agent finishes with no infractions and zero waypoints reached) but was missing from the per-episode CompletedEpisodeSummary that backs the gallery's info dict and the validation_gigaflow per-episode CSV. Without this, the obs_html / triage_html gallery sort UI has no way to access the DNF signal except by re-deriving it Python-side from the other fields, which silently drifts the moment anyone touches the C-side predicate. Single source of truth: snapshot the same predicate inside the per-episode loop alongside score, then emit through my_completed_episode_to_dict. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
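Purely to illustrate the predicate as described above ("finishes with no infractions and zero waypoints reached"), here is a small Python sketch. The real logic lives in drive.h; the `EpisodeSummary` class and its field names are hypothetical stand-ins, and the point of the commit is that consumers should read the C-side value rather than re-derive it like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeSummary:
    """Hypothetical stand-in for the per-episode summary fields."""

    total_infractions: int
    waypoints_reached: int


def is_dnf(ep: EpisodeSummary) -> bool:
    # DNF as described in the commit text: no infractions, no progress.
    return ep.total_infractions == 0 and ep.waypoints_reached == 0


def dnf_rate(episodes: List[EpisodeSummary]) -> float:
    # Fraction of episodes that satisfy the predicate.
    if not episodes:
        return 0.0
    return sum(is_dnf(ep) for ep in episodes) / len(episodes)
```

Any change to the C-side predicate would silently invalidate a copy like this one, which is why the commit snapshots the value inside the per-episode loop instead.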

build_gallery_index now accepts an optional `file_metrics` dict (basename → metric dict) and emits a sort dropdown + a direction toggle when provided. Backward-compatible: callers passing nothing keep the previous behaviour.

The two render passes (_render_pass_obs and _render_pass_html) build that dict by pairing rendered files with rows from the per-episode CSV the metric pass already writes: i-th rendered file ↔ i-th first-episode row in the CSV sorted by env_slot. Heuristic but cheap; on any failure we silently fall back to the no-metrics gallery (collect helper returns None).

Default sort: `score` ascending so failure-mode rollouts bubble to the top. The dropdown also exposes dnf_rate (desc), episode_return (asc), num_goals_reached (asc), collision_rate (desc), offroad_rate (desc), red_light_violation_rate (desc), total_infractions (desc), total_distance_travelled (asc), episode_length (asc). Flipping the key auto-flips the direction to whatever makes triage-useful rollouts surface first; the user can override via the direction toggle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
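The ordering rules above can be sketched in Python, even though the actual sort runs client-side in the generated index.html. The default directions are taken from the list in this commit message; the names `DEFAULT_DESCENDING` and `sort_tiles` are hypothetical, and the `file_metrics` shape (basename mapped to a metric dict) follows the description above.

```python
from typing import Dict, List, Optional

# Default direction per metric, as listed in the commit message.
DEFAULT_DESCENDING: Dict[str, bool] = {
    "score": False,
    "dnf_rate": True,
    "episode_return": False,
    "num_goals_reached": False,
    "collision_rate": True,
    "offroad_rate": True,
    "red_light_violation_rate": True,
    "total_infractions": True,
    "total_distance_travelled": False,
    "episode_length": False,
}


def sort_tiles(
    file_metrics: Dict[str, Dict[str, float]],
    key: str = "score",
    descending: Optional[bool] = None,
) -> List[str]:
    """Return basenames ordered by one metric.

    descending=None means "use the triage-friendly default for this key";
    passing True or False mimics the user's direction toggle.
    """
    if descending is None:
        descending = DEFAULT_DESCENDING[key]
    return sorted(
        file_metrics, key=lambda name: file_metrics[name][key], reverse=descending
    )
```

For example, with three tiles scored 0.9, 0.1 and 0.5, the default call surfaces the 0.1 tile first, while sorting by `offroad_rate` surfaces the highest rate first.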

The triage_html render path called viz.build_gallery_index without importing viz in that scope -- NameError at the very end of a successful rollout. viz was already imported locally inside _render_pass_obs at line 605; mirror the same lazy local import here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds gallery sorting support for eval HTML renders by plumbing per-episode metrics into generated gallery indexes and adding dnf_rate to completed episode summaries.

Changes:

Adds client-side metric sort controls to viz.build_gallery_index.
Collects per-file metric mappings from episode CSVs for triage_html and obs_html render outputs.
Extends Drive completed episode summaries with dnf_rate and updates eval configs for HTML triage workflows.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


File	Description
`scripts/cluster_configs/single_agent_speed_run.yaml`	Reconfigures validation_gigaflow eval rendering for single-agent Town10HD runs.
`pufferlib/viz.py`	Adds metric-aware labels and sort dropdown JavaScript to generated gallery indexes.
`pufferlib/ocean/drive/drive.h`	Adds and populates per-episode `dnf_rate` in completed episode summaries.
`pufferlib/ocean/drive/binding.c`	Exposes `dnf_rate` in completed episode summary dictionaries.
`pufferlib/ocean/benchmark/evaluators/base.py`	Wires gallery metric collection into `triage_html` and `obs_html` render passes.
`pufferlib/config/ocean/drive.ini`	Updates validation_gigaflow render defaults and agent count.


+                / f"{self.name}{step_suffix}.csv"
+            )
+            if not csv_path.exists():


+                    # gigaflow mode (the C side fills it with the map's short
+                    # name), so append a monotonic counter to make every
+                    # rendered episode land in its own file.
+                    stem = f"{map_name}_{scenario_id}_{scenarios_done:04d}{step_suffix}"


+# Disable nuPlan-based evaluators. validation_gigaflow stays on but is
+# reconfigured to triage_html (CPU-only scene playback + per-episode metrics
+# including dnf_rate), pinned to the same Town10HD single-agent setup the
+# training uses, so each eval cycle drops a gallery of rollouts you can browse
+# for DNFs. The eight-way ffmpeg/EGL spike of the default egl backend is gone.


The previous regex `(.+)_(\d+)\.html` required the filename to end in `_<digits>.html`. The triage_html stem ends in `_step{N}.html` (so the last underscore-delimited token is `step0` etc, not bare digits) and re.fullmatch rejected every file, leaving the gallery empty with "No matching .html files found in this directory." Filter on `.html` suffix instead and sort the filenames lexicographically -- the zero-padded scenarios_done embedded in the stem dominates ordering within a map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
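A short check of why the old pattern rejected every triage_html file, and what the suffix filter plus lexicographic sort yields instead. The filenames are illustrative, and excluding `index.html` from the listing is an assumption of this sketch, not something stated in the commit.

```python
import re

# The pattern quoted above: the name must end in _<digits>.html.
OLD_PATTERN = re.compile(r"(.+)_(\d+)\.html")

names = [
    "Town10HD_Town10HD_0010_step0.html",
    "Town10HD_Town10HD_0002_step0.html",
    "index.html",
    "notes.txt",
]

# The last underscore-delimited token is "step0", so fullmatch fails.
old_matches = [n for n in names if OLD_PATTERN.fullmatch(n)]

# Suffix filter + lexicographic sort; zero padding keeps 0002 before 0010.
new_matches = sorted(
    n for n in names if n.endswith(".html") and n != "index.html"
)
```

`old_matches` comes back empty, which corresponds to the "No matching .html files found" gallery, while `new_matches` lists both episodes in counter order.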

…agent

[eval.validation_defaults] and [eval.validation_gigaflow] both had env.num_agents = 512, which silently halved the agent population the validation eval rolls out compared to the top-level [env] num_agents = 1024. Pin both to 1024 so the eval matches default training.

[eval.dnf_triage] was 40 agents/env, but the section is meant for triaging single-agent rollouts (the trained policies we care about right now are single-agent gigaflow). Drop to min/max_agents_per_env = 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…un's metrics

rollout() was calling _maybe_export_episodes after _render_pass, so the gallery's FILE_METRICS came from the previous run's CSV (or was empty on a first-ever run to a fresh output dir). Move the CSV write between the metric and render passes: _episode_rows is already fully populated by _run_rollout_loop, so _render_pass / _collect_file_metrics_from_csv now sees the just-written CSV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In pool / both modes the boundary slots are now drawn in red instead of the shared teal that lanes use, so road edges that the policy pooled stand out from lane centerlines at a glance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When num_scenarios > the env-slot count, each env runs multiple episodes back-to-back. The renderer writes one file per completed scenario in (episode_index, env_slot) order; the prior pairing filtered to episode_index==0 only, so anything past the first batch had no metric in the gallery FILE_METRICS and didn't participate in the sort. Sort the CSV by (episode_index, env_slot) to match render order and pair positionally over every rendered file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
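The positional pairing described here can be sketched with the standard csv module. The column names `episode_index` and `env_slot` come from the commit text; the CSV contents, the `score` column values, and the rendered filenames are made up for illustration.

```python
import csv
import io

# Hypothetical per-episode CSV: 2 env slots, 2 episodes each, unsorted.
CSV_TEXT = """episode_index,env_slot,score
1,0,0.2
0,1,0.9
0,0,0.4
1,1,0.7
"""

# Rendered files, already in (episode_index, env_slot) render order.
rendered = ["ep_0000.html", "ep_0001.html", "ep_0002.html", "ep_0003.html"]

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
# Sort rows to match render order before pairing by position.
rows.sort(key=lambda r: (int(r["episode_index"]), int(r["env_slot"])))

file_metrics = {
    name: {"score": float(row["score"])} for name, row in zip(rendered, rows)
}
```

Because every row participates, files from the second batch of episodes also receive metrics, rather than only those with episode_index 0.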

… CSV

The render and metric passes use different vec env instances (and different seeds in obs_html's num_envs=1 vs the metric pass's full population), so pairing rendered HTMLs against the metric-pass CSV row-by-row attached metrics from a different simulation. Sorting the gallery by offroad_rate then surfaced tiles whose actual rollout had no offroad in it.

Read each rendered scenario's metrics directly from the completed_episode summary in info: triage_html's loop already iterates infos for the bundle; obs_html now enables emit_completed_episodes on its render env and snapshots the latest summary per batch. _collect_file_metrics_from_csv is dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Eugene Vinitsky and others added 17 commits May 28, 2026 10:46

test_single_agent_yaml: ruff format

676a24f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test_single_agent_yaml: drop history-of-change prose from docstrings

e362d18

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test_single_agent_yaml: drop value-spot-checks; keep argparse smoke

4f02f15

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drive: guard isfile() so map_dir=None falls through to listdir path

2c2271a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test_drive_scenario_length: pass real map_dir so the test actually runs

62de125

Previously fell into FileNotFoundError via map_dir=None and silently skipped in CI even when the carla binaries were present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge origin/emerge/temp_training into ev/single_agent_runs

1f021ba

Brings in #443 (Improve html viz). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge emerge/temp_training into ev/dnf-triage (resolve yaml to keep t…

170cbf3

…riage_html eval enabled) # Conflicts: # scripts/cluster_configs/single_agent_speed_run.yaml

yaml: switch single-agent eval render to obs_html

984ab64

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eval: fix gallery-sort helper using Path.name instead of unimported os

34af7c4

Copilot AI review requested due to automatic review settings May 30, 2026 19:51

Copilot started reviewing on behalf of eugenevinitsky May 30, 2026 19:51 View session

Eugene Vinitsky and others added 2 commits May 30, 2026 15:53

eval: hoist viz import to module-level

0bd3ac7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI reviewed May 30, 2026


eugenevinitsky changed the title ~~eval: per-episode metric sort UI on the gallery index + triage_html plumbing~~ WIP: eval: per-episode metric sort UI on gallery index May 30, 2026

vcharraut approved these changes May 30, 2026


Eugene Vinitsky and others added 5 commits May 30, 2026 16:14

viz: ruff-format build_gallery_index

4a68d1b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eugenevinitsky merged commit 474e28e into emerge/temp_training May 30, 2026
13 checks passed

eugenevinitsky deleted the ev/dnf-triage branch May 30, 2026 23:06

eugenevinitsky merged 26 commits into emerge/temp_training from ev/dnf-triage

3 participants
