WIP: eval: per-episode metric sort UI on gallery index #461
Merged
…yaml smoke test

- drive.py: if map_dir points at a .bin file, use it directly as the only map; otherwise scan the directory as before. Lets the single-agent launcher pin to one CARLA bin (Town10HD) without a separate starting_map knob and without breaking multi_scenario's render-time random injection.
- drive.ini: declare enabled = true in [eval.validation_replay] so pufferl's argparser exposes --eval.validation-replay.enabled (otherwise the launcher yaml's enabled: 0 override died as an unrecognized argument).
- single_agent_speed_run.yaml: drop env.starting_map; point env.map_dir straight at opendrive__Town10HD.bin.
- tests/test_single_agent_yaml.py: parse the launcher yaml into pufferl's CLI flags and assert load_config accepts every key and that the values land in the expected nested slots. Catches future regressions where the yaml references a flag that isn't wired into drive.ini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
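The map_dir behaviour described in the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in drive.py: `resolve_map_files` is a hypothetical helper name, and the assumption that the directory scan picks up every `*.bin` file in sorted order is mine.

```python
from pathlib import Path


def resolve_map_files(map_dir):
    """Return the map files selected by map_dir (hypothetical helper).

    If map_dir is itself a .bin file, it is the only map. Otherwise the
    directory is scanned for .bin files (assumed pattern and ordering).
    """
    path = Path(map_dir)
    if path.suffix == ".bin" and path.is_file():
        return [path]
    return sorted(path.glob("*.bin"))
```

With this shape the launcher yaml only needs env.map_dir; no second knob is required to pin a single map.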
Previously fell into FileNotFoundError via map_dir=None and silently skipped in CI even when the carla binaries were present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eight-way 1080p ffmpeg encode + EGL render burst at the validation_gigaflow eval epoch was aborting long single-agent runs with SIGABRT under VRAM/RAM pressure. The single-agent yaml already disables every other eval evaluator, so disabling this one too removes the eval signal entirely but keeps the run alive past epoch 250. Also declares enabled in [eval.validation_gigaflow] so the argparse flag exists (same pattern as the earlier validation_replay fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
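Why a key has to be declared in drive.ini before a yaml override can reach it can be illustrated with a small stand-in parser. This is a sketch under assumptions: `build_parser` and `INI_TEXT` are hypothetical, the underscore-to-hyphen flag naming is inferred from the `--eval.validation-replay.enabled` flag quoted earlier in this PR, and pufferl's real argparser may be built differently.

```python
import argparse
import configparser

# Illustrative ini fragment: validation_gigaflow deliberately lacks `enabled`.
INI_TEXT = """
[eval.validation_replay]
enabled = true

[eval.validation_gigaflow]
render_backend = egl
"""


def build_parser(ini_text):
    """Create one CLI flag per declared ini key (assumed naming scheme)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    parser = argparse.ArgumentParser()
    for section in config.sections():
        for key, default in config[section].items():
            flag = f"--{section}.{key}".replace("_", "-")
            parser.add_argument(flag, default=default)
    return parser


parser = build_parser(INI_TEXT)
# parse_known_args collects flags that have no declaration instead of exiting.
namespace, unknown = parser.parse_known_args(
    ["--eval.validation-replay.enabled=0", "--eval.validation-gigaflow.enabled=0"]
)
```

Here `unknown` ends up holding the validation_gigaflow flag, because nothing in the ini declared it. A strict `parse_args` call would abort on it instead.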
Brings in #443 (Improve html viz). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-map (Town10HD) gigaflow eval with the triage_html render backend, for inspecting which rollouts DNF when running an existing checkpoint via `puffer eval puffer_drive --evaluator dnf_triage --load-model-path <ckpt>`. Disabled by default so it does not run on every training cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds render_backend and env.num_agents to [eval.validation_gigaflow] so the
launcher yaml can flip render backend / agent counts without per-run drive.ini
edits. The single_agent_speed_run.yaml leaves validation_gigaflow enabled and
reconfigures it inline:
- render_backend: triage_html (CPU-only HTML, no EGL/ffmpeg spike that
aborted older runs at eval epochs)
- env.map_dir: opendrive__Town10HD.bin (matches training)
- env.num_maps: 1, env.num_agents: 64, single-agent-per-env
Each eval cycle drops a gallery of single-agent rollouts you can browse to
see which DNF'd. No standalone evaluator section needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…riage_html eval enabled) # Conflicts: # scripts/cluster_configs/single_agent_speed_run.yaml
In gigaflow mode the C side fills info['scenario_id'] with the map's short
name, which is identical across every rollout on the same map. The previous
stem `{map_name}_{scenario_id}{step_suffix}` collided for every episode, so
each render_compact_replay_html call overwrote the prior file and only the
last rollout survived on disk.
Append the monotonic scenarios_done counter so each episode lands in its
own .html file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
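A small sketch of the collision and the fix, using the stem formats quoted in this commit. The function names `old_stem` and `new_stem` and the sample values are illustrative only.

```python
def old_stem(map_name, scenario_id, step_suffix=""):
    """Previous stem: identical for every episode on the same map."""
    return f"{map_name}_{scenario_id}{step_suffix}"


def new_stem(map_name, scenario_id, scenarios_done, step_suffix=""):
    """New stem: the zero-padded monotonic counter makes each episode unique."""
    return f"{map_name}_{scenario_id}_{scenarios_done:04d}{step_suffix}"


# In gigaflow mode scenario_id equals the map's short name for every rollout.
episodes = [("Town10HD", "Town10HD")] * 3
old_names = {old_stem(m, s, "_step0") for m, s in episodes}
new_names = {new_stem(m, s, i, "_step0") for i, (m, s) in enumerate(episodes)}
```

`old_names` collapses to a single entry, which is why only the last rollout survived on disk, while `new_names` keeps one entry per episode.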
_render_pass_obs ends with viz.build_gallery_index(out_dir) so the gallery's index.html lands next to the per-episode files. _render_pass_html had the loop and tqdm wiring but no equivalent build call, so the triage_html render dir just held the raw N htmls with no entry point. Mirror the obs_html call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dnf_rate already exists in the aggregate Log struct (drive.h:2577 increments it whenever an agent finishes with no infractions and zero waypoints reached) but was missing from the per-episode CompletedEpisodeSummary that backs the gallery's info dict and the validation_gigaflow per-episode CSV. Without this, the obs_html / triage_html gallery sort UI has no way to access the DNF signal except by re-deriving it Python-side from the other fields, which silently drifts the moment anyone touches the C-side predicate. Single source of truth: snapshot the same predicate inside the per-episode loop alongside score, then emit through my_completed_episode_to_dict. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build_gallery_index now accepts an optional `file_metrics` dict (basename → metric dict) and emits a sort dropdown + a direction toggle when provided. Backward-compatible: callers passing nothing keep the previous behaviour.

The two render passes (_render_pass_obs and _render_pass_html) build that dict by pairing rendered files with rows from the per-episode CSV the metric pass already writes: i-th rendered file ↔ i-th first-episode row in the CSV sorted by env_slot. Heuristic but cheap; on any failure we silently fall back to the no-metrics gallery (collect helper returns None).

Default sort: `score` ascending so failure-mode rollouts bubble to the top. The dropdown also exposes dnf_rate (desc), episode_return (asc), num_goals_reached (asc), collision_rate (desc), offroad_rate (desc), red_light_violation_rate (desc), total_infractions (desc), total_distance_travelled (asc), episode_length (asc). Flipping the key auto-flips the direction to whatever makes triage-useful rollouts surface first; the user can override via the direction toggle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
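The per-key default direction can be mirrored in a few lines of Python. The real sort runs client-side in the generated index.html, so this is only a sketch of the ordering logic: `sort_gallery` is a hypothetical name, and placing files without a metric at the end is my assumption.

```python
# True means descending by default, following the list in this commit message.
DEFAULT_DESCENDING = {
    "score": False,
    "dnf_rate": True,
    "episode_return": False,
    "num_goals_reached": False,
    "collision_rate": True,
    "offroad_rate": True,
    "red_light_violation_rate": True,
    "total_infractions": True,
    "total_distance_travelled": False,
    "episode_length": False,
}


def sort_gallery(files, file_metrics, key="score", descending=None):
    """Order gallery files by one metric; None picks the key's default direction."""
    if descending is None:
        descending = DEFAULT_DESCENDING[key]

    def metric(name):
        return file_metrics.get(name, {}).get(key)

    ranked = sorted((f for f in files if metric(f) is not None),
                    key=metric, reverse=descending)
    unranked = [f for f in files if metric(f) is None]  # assumed: listed last
    return ranked + unranked
```

Passing `descending` explicitly corresponds to the user overriding the direction toggle.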
The triage_html render path called viz.build_gallery_index without importing viz in that scope -- NameError at the very end of a successful rollout. viz was already imported locally inside _render_pass_obs at line 605; mirror the same lazy local import here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds gallery sorting support for eval HTML renders by plumbing per-episode metrics into generated gallery indexes and adding dnf_rate to completed episode summaries.
Changes:
- Adds client-side metric sort controls to `viz.build_gallery_index`.
- Collects per-file metric mappings from episode CSVs for `triage_html` and `obs_html` render outputs.
- Extends Drive completed episode summaries with `dnf_rate` and updates eval configs for HTML triage workflows.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/cluster_configs/single_agent_speed_run.yaml | Reconfigures validation_gigaflow eval rendering for single-agent Town10HD runs. |
| pufferlib/viz.py | Adds metric-aware labels and sort dropdown JavaScript to generated gallery indexes. |
| pufferlib/ocean/drive/drive.h | Adds and populates per-episode dnf_rate in completed episode summaries. |
| pufferlib/ocean/drive/binding.c | Exposes dnf_rate in completed episode summary dictionaries. |
| pufferlib/ocean/benchmark/evaluators/base.py | Wires gallery metric collection into triage_html and obs_html render passes. |
| pufferlib/config/ocean/drive.ini | Updates validation_gigaflow render defaults and agent count. |
Comment on lines +560 to +562

    / f"{self.name}{step_suffix}.csv"
    )
    if not csv_path.exists():

    # gigaflow mode (the C side fills it with the map's short
    # name), so append a monotonic counter to make every
    # rendered episode land in its own file.
    stem = f"{map_name}_{scenario_id}_{scenarios_done:04d}{step_suffix}"
Comment on lines +38 to +42

    # Disable nuPlan-based evaluators. validation_gigaflow stays on but is
    # reconfigured to triage_html (CPU-only scene playback + per-episode metrics
    # including dnf_rate), pinned to the same Town10HD single-agent setup the
    # training uses, so each eval cycle drops a gallery of rollouts you can browse
    # for DNFs. The eight-way ffmpeg/EGL spike of the default egl backend is gone.
The previous regex `(.+)_(\d+)\.html` required the filename to end in
`_<digits>.html`. The triage_html stem ends in `_step{N}.html` (so the last
underscore-delimited token is `step0` etc, not bare digits) and re.fullmatch
rejected every file, leaving the gallery empty with "No matching .html files
found in this directory." Filter on `.html` suffix instead and sort the
filenames lexicographically -- the zero-padded scenarios_done embedded in
the stem dominates ordering within a map.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
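The mismatch is easy to reproduce in isolation. The sketch below applies the old regex quoted in this commit and a plain suffix filter to illustrative filenames; `old_filter` and `new_filter` are hypothetical names, not functions in viz.py.

```python
import re

OLD_PATTERN = re.compile(r"(.+)_(\d+)\.html")


def old_filter(names):
    """Previous behaviour: the filename must end in _<digits>.html."""
    return sorted(n for n in names if OLD_PATTERN.fullmatch(n))


def new_filter(names):
    """New behaviour: keep any .html file, ordered lexicographically."""
    return sorted(n for n in names if n.endswith(".html"))


names = [
    "Town10HD_Town10HD_0001_step0.html",
    "Town10HD_Town10HD_0000_step0.html",
    "validation_gigaflow_step0.csv",
]
```

`old_filter(names)` is empty because the last underscore-delimited token is `step0`, while `new_filter(names)` returns both episodes with 0000 ahead of 0001.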
vcharraut approved these changes on May 30, 2026
…agent

[eval.validation_defaults] and [eval.validation_gigaflow] both had env.num_agents = 512, which silently halved the agent population the validation eval rolls out compared to the top-level [env] num_agents = 1024. Pin both to 1024 so the eval matches default training.

[eval.dnf_triage] was 40 agents/env, but the section is meant for triaging single-agent rollouts (the trained policies we care about right now are single-agent gigaflow). Drop to min/max_agents_per_env = 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…un's metrics

rollout() was calling _maybe_export_episodes after _render_pass, so the gallery's FILE_METRICS came from the previous run's CSV (or was empty on a first-ever run to a fresh output dir). Move the CSV write between the metric and render passes: _episode_rows is already fully populated by _run_rollout_loop, so _render_pass / _collect_file_metrics_from_csv now sees the just-written CSV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In pool / both modes the boundary slots are now drawn in red instead of the shared teal that lanes use, so road edges that the policy pooled stand out from lane centerlines at a glance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When num_scenarios > the env-slot count, each env runs multiple episodes back-to-back. The renderer writes one file per completed scenario in (episode_index, env_slot) order; the prior pairing filtered to episode_index==0 only, so anything past the first batch had no metric in the gallery FILE_METRICS and didn't participate in the sort. Sort the CSV by (episode_index, env_slot) to match render order and pair positionally over every rendered file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
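The positional pairing described in this commit might look roughly like the sketch below. The column names episode_index and env_slot come from the commit message; `pair_files_with_rows`, the sample CSV, and the filenames are hypothetical. A later commit in this PR drops CSV pairing altogether, so this only illustrates the intermediate step.

```python
import csv
import io


def pair_files_with_rows(rendered_files, csv_text):
    """Pair the i-th rendered file with the i-th CSV row in render order."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Render order is (episode_index, env_slot), so sort the rows the same way.
    rows.sort(key=lambda r: (int(r["episode_index"]), int(r["env_slot"])))
    # zip stops at the shorter input, so surplus files simply get no metrics.
    return dict(zip(rendered_files, rows))


CSV_TEXT = (
    "env_slot,episode_index,score\n"
    "1,0,0.5\n"
    "0,1,0.2\n"
    "0,0,0.9\n"
    "1,1,0.1\n"
)
rendered = ["ep_0000.html", "ep_0001.html", "ep_0002.html", "ep_0003.html"]
paired = pair_files_with_rows(rendered, CSV_TEXT)
```

Every rendered file now receives a row, including those from the second batch (episode_index 1) that the earlier episode_index==0 filter left out.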
… CSV

The render and metric passes use different vec env instances (and different seeds in obs_html's num_envs=1 vs the metric pass's full population), so pairing rendered HTMLs against the metric-pass CSV row-by-row attached metrics from a different simulation. Sorting the gallery by offroad_rate then surfaced tiles whose actual rollout had no offroad in it.

Read each rendered scenario's metrics directly from the completed_episode summary in info: triage_html's loop already iterates infos for the bundle; obs_html now enables emit_completed_episodes on its render env and snapshots the latest summary per batch. _collect_file_metrics_from_csv is dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-episode metric sort dropdown to the triage_html / obs_html gallery index. Pick `offroad_rate`, `collision_rate`, `dnf_rate`, etc., and tiles reorder client-side. No re-render needed.

Smoke-tested against `model_puffer_drive_000382` from a single-agent run: 16 scenarios rendered, index.html populated with the sort dropdown.

Branch carries some unrelated yaml + test-fix commits that could be factored out for a cleaner diff.