WIP: eval: per-episode metric sort UI on gallery index #461

Merged
eugenevinitsky merged 26 commits into emerge/temp_training from ev/dnf-triage
May 30, 2026

Conversation


@eugenevinitsky eugenevinitsky commented May 30, 2026

Adds a per-episode metric sort dropdown to the triage_html / obs_html gallery index. Pick offroad_rate, collision_rate, dnf_rate, etc., and tiles reorder client-side. No re-render needed.

Smoke-tested against model_puffer_drive_000382 from a single-agent run — 16 scenarios rendered, index.html populated with the sort dropdown.

Branch carries some unrelated yaml + test-fix commits that could be factored out for a cleaner diff.

Eugene Vinitsky and others added 17 commits May 28, 2026 10:46
…yaml smoke test

- drive.py: if map_dir points at a .bin file, use it directly as the only
  map; otherwise scan the directory as before. Lets the single-agent
  launcher pin to one CARLA bin (Town10HD) without a separate starting_map
  knob and without breaking multi_scenario's render-time random injection.
- drive.ini: declare enabled = true in [eval.validation_replay] so pufferl's
  argparser exposes --eval.validation-replay.enabled (otherwise the launcher
  yaml's enabled: 0 override died as an unrecognized argument).
- single_agent_speed_run.yaml: drop env.starting_map; point env.map_dir
  straight at opendrive__Town10HD.bin.
- tests/test_single_agent_yaml.py: parse the launcher yaml into pufferl's
  CLI flags and assert load_config accepts every key and that the values
  land in the expected nested slots. Catches future regressions where the
  yaml references a flag that isn't wired into drive.ini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
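
The `map_dir` rule from the drive.py bullet above can be sketched as follows. This is a minimal illustration, not the code in drive.py: the helper name `resolve_map_files` and the `*.bin` glob are assumptions made for the example.

```python
from pathlib import Path


def resolve_map_files(map_dir: str) -> list[Path]:
    """Hypothetical helper: decide which map binaries to load from map_dir."""
    path = Path(map_dir)
    if path.is_file() and path.suffix == ".bin":
        # map_dir names a single binary: use it as the only map.
        return [path]
    # Otherwise treat map_dir as a directory and scan it (assumed *.bin glob).
    return sorted(path.glob("*.bin"))
```

Under these assumptions, pointing `map_dir` at one file pins the run to that map, while a directory path keeps the multi-map behaviour.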
Previously fell into FileNotFoundError via map_dir=None and silently skipped
in CI even when the carla binaries were present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eight-way 1080p ffmpeg encode + EGL render burst at the validation_gigaflow
eval epoch was aborting long single-agent runs with SIGABRT under VRAM/RAM
pressure. The single-agent yaml already disables every other eval evaluator,
so disabling this one too removes the eval signal entirely but keeps the run
alive past epoch 250. Also declares enabled in [eval.validation_gigaflow] so
the argparse flag exists (same pattern as the earlier validation_replay fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in #443 (Improve html viz).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-map (Town10HD) gigaflow eval with the triage_html render backend, for
inspecting which rollouts DNF when running an existing checkpoint via
`puffer eval puffer_drive --evaluator dnf_triage --load-model-path <ckpt>`.
Disabled by default so it does not run on every training cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds render_backend and env.num_agents to [eval.validation_gigaflow] so the
launcher yaml can flip render backend / agent counts without per-run drive.ini
edits. The single_agent_speed_run.yaml leaves validation_gigaflow enabled and
reconfigures it inline:
  - render_backend: triage_html (CPU-only HTML, no EGL/ffmpeg spike that
    aborted older runs at eval epochs)
  - env.map_dir: opendrive__Town10HD.bin (matches training)
  - env.num_maps: 1, env.num_agents: 64, single-agent-per-env

Each eval cycle drops a gallery of single-agent rollouts you can browse to
see which DNF'd. No standalone evaluator section needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…riage_html eval enabled)

# Conflicts:
#	scripts/cluster_configs/single_agent_speed_run.yaml
In gigaflow mode the C side fills info['scenario_id'] with the map's short
name, which is identical across every rollout on the same map. The previous
stem `{map_name}_{scenario_id}{step_suffix}` collided for every episode, so
each render_compact_replay_html call overwrote the prior file and only the
last rollout survived on disk.

Append the monotonic scenarios_done counter so each episode lands in its
own .html file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
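
The filename collision described in the commit above can be reproduced in a few lines. The two format strings follow the commit message and the reviewed diff; the function names and the sample values (`Town10HD`, `_step0`) are illustrative assumptions.

```python
def old_stem(map_name: str, scenario_id: str, step_suffix: str) -> str:
    # Previous stem: identical for every rollout on the same map in gigaflow mode.
    return f"{map_name}_{scenario_id}{step_suffix}"


def new_stem(map_name: str, scenario_id: str, scenarios_done: int, step_suffix: str) -> str:
    # New stem: the zero-padded monotonic counter makes each episode unique.
    return f"{map_name}_{scenario_id}_{scenarios_done:04d}{step_suffix}"


# Three rollouts on one map, where scenario_id is just the map's short name.
old_names = {old_stem("Town10HD", "Town10HD", "_step0") for _ in range(3)}
new_names = {new_stem("Town10HD", "Town10HD", i, "_step0") for i in range(3)}
```

Here `old_names` collapses to a single entry, so only one file would survive on disk, while `new_names` keeps three distinct stems.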
_render_pass_obs ends with viz.build_gallery_index(out_dir) so the gallery's
index.html lands next to the per-episode files. _render_pass_html had the
loop and tqdm wiring but no equivalent build call, so the triage_html
render dir just held the raw N htmls with no entry point. Mirror the
obs_html call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dnf_rate already exists in the aggregate Log struct (drive.h:2577 increments
it whenever an agent finishes with no infractions and zero waypoints reached)
but was missing from the per-episode CompletedEpisodeSummary that backs the
gallery's info dict and the validation_gigaflow per-episode CSV. Without
this, the obs_html / triage_html gallery sort UI has no way to access the
DNF signal except by re-deriving it Python-side from the other fields,
which silently drifts the moment anyone touches the C-side predicate.

Single source of truth: snapshot the same predicate inside the per-episode
loop alongside score, then emit through my_completed_episode_to_dict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build_gallery_index now accepts an optional `file_metrics` dict (basename →
metric dict) and emits a sort dropdown + a direction toggle when provided.
Backward-compatible: callers passing nothing keep the previous behaviour.

The two render passes (_render_pass_obs and _render_pass_html) build that
dict by pairing rendered files with rows from the per-episode CSV the metric
pass already writes: i-th rendered file ↔ i-th first-episode row in the CSV
sorted by env_slot. Heuristic but cheap; on any failure we silently fall
back to the no-metrics gallery (collect helper returns None).

Default sort: `score` ascending so failure-mode rollouts bubble to the top.
The dropdown also exposes dnf_rate (desc), episode_return (asc),
num_goals_reached (asc), collision_rate (desc), offroad_rate (desc),
red_light_violation_rate (desc), total_infractions (desc),
total_distance_travelled (asc), episode_length (asc). Flipping the key
auto-flips the direction to whatever makes triage-useful rollouts surface
first; the user can override via the direction toggle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
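
The per-key default directions listed in the commit above can be modelled as a small lookup table. The real sort runs client-side in the generated index.html; the sketch below is a Python model of the same ordering rule, and `DEFAULT_DESCENDING` and `sort_tiles` are hypothetical names.

```python
# True means "largest first"; directions copied from the commit message above.
DEFAULT_DESCENDING = {
    "score": False,
    "dnf_rate": True,
    "episode_return": False,
    "num_goals_reached": False,
    "collision_rate": True,
    "offroad_rate": True,
    "red_light_violation_rate": True,
    "total_infractions": True,
    "total_distance_travelled": False,
    "episode_length": False,
}


def sort_tiles(file_metrics, key, descending=None):
    """Order gallery basenames by one metric; None picks the key's default direction."""
    if descending is None:
        descending = DEFAULT_DESCENDING[key]
    return sorted(file_metrics, key=lambda name: file_metrics[name][key], reverse=descending)


file_metrics = {
    "a.html": {"score": 0.9, "dnf_rate": 0.0},
    "b.html": {"score": 0.1, "dnf_rate": 1.0},
    "c.html": {"score": 0.5, "dnf_rate": 0.0},
}
worst_first = sort_tiles(file_metrics, "score")
```

Passing `descending` explicitly plays the role of the direction toggle, overriding the default for the chosen key.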
Copilot AI review requested due to automatic review settings May 30, 2026 19:51
Eugene Vinitsky and others added 2 commits May 30, 2026 15:53
The triage_html render path called viz.build_gallery_index without importing
viz in that scope -- NameError at the very end of a successful rollout. viz
was already imported locally inside _render_pass_obs at line 605; mirror the
same lazy local import here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds gallery sorting support for eval HTML renders by plumbing per-episode metrics into generated gallery indexes and adding dnf_rate to completed episode summaries.

Changes:

  • Adds client-side metric sort controls to viz.build_gallery_index.
  • Collects per-file metric mappings from episode CSVs for triage_html and obs_html render outputs.
  • Extends Drive completed episode summaries with dnf_rate and updates eval configs for HTML triage workflows.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/cluster_configs/single_agent_speed_run.yaml | Reconfigures validation_gigaflow eval rendering for single-agent Town10HD runs. |
| pufferlib/viz.py | Adds metric-aware labels and sort dropdown JavaScript to generated gallery indexes. |
| pufferlib/ocean/drive/drive.h | Adds and populates per-episode dnf_rate in completed episode summaries. |
| pufferlib/ocean/drive/binding.c | Exposes dnf_rate in completed episode summary dictionaries. |
| pufferlib/ocean/benchmark/evaluators/base.py | Wires gallery metric collection into triage_html and obs_html render passes. |
| pufferlib/config/ocean/drive.ini | Updates validation_gigaflow render defaults and agent count. |


Comment on lines +560 to +562
/ f"{self.name}{step_suffix}.csv"
)
if not csv_path.exists():
# gigaflow mode (the C side fills it with the map's short
# name), so append a monotonic counter to make every
# rendered episode land in its own file.
stem = f"{map_name}_{scenario_id}_{scenarios_done:04d}{step_suffix}"
Comment on lines +38 to +42
# Disable nuPlan-based evaluators. validation_gigaflow stays on but is
# reconfigured to triage_html (CPU-only scene playback + per-episode metrics
# including dnf_rate), pinned to the same Town10HD single-agent setup the
# training uses, so each eval cycle drops a gallery of rollouts you can browse
# for DNFs. The eight-way ffmpeg/EGL spike of the default egl backend is gone.
The previous regex `(.+)_(\d+)\.html` required the filename to end in
`_<digits>.html`. The triage_html stem ends in `_step{N}.html` (so the last
underscore-delimited token is `step0` etc, not bare digits) and re.fullmatch
rejected every file, leaving the gallery empty with "No matching .html files
found in this directory." Filter on `.html` suffix instead and sort the
filenames lexicographically -- the zero-padded scenarios_done embedded in
the stem dominates ordering within a map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
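
The regex failure in the commit above is easy to check directly. The pattern is the one quoted in the commit message; the helper names and sample filenames are assumptions for illustration, and the real filter in viz.py may apply further conditions not shown here.

```python
import re

OLD_PATTERN = re.compile(r"(.+)_(\d+)\.html")


def old_filter(names):
    # Previous behaviour: the name must end in _<digits>.html.
    return [n for n in names if OLD_PATTERN.fullmatch(n)]


def new_filter(names):
    # New behaviour: keep any .html file, ordered lexicographically.
    return sorted(n for n in names if n.endswith(".html"))


names = [
    "Town10HD_Town10HD_0001_step0.html",
    "Town10HD_Town10HD_0000_step0.html",
    "notes.txt",
]
rejected_everything = old_filter(names) == []
ordered = new_filter(names)
```

Because the counter is zero-padded, plain string sorting puts `0000` before `0001` without parsing the number out of the name.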
@eugenevinitsky eugenevinitsky changed the title from "eval: per-episode metric sort UI on the gallery index + triage_html plumbing" to "WIP: eval: per-episode metric sort UI on gallery index" May 30, 2026
Eugene Vinitsky and others added 5 commits May 30, 2026 16:14
…agent

[eval.validation_defaults] and [eval.validation_gigaflow] both had
env.num_agents = 512, which silently halved the agent population the
validation eval rolls out compared to the top-level [env] num_agents = 1024.
Pin both to 1024 so the eval matches default training.

[eval.dnf_triage] was 40 agents/env, but the section is meant for triaging
single-agent rollouts (the trained policies we care about right now are
single-agent gigaflow). Drop to min/max_agents_per_env = 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…un's metrics

rollout() was calling _maybe_export_episodes after _render_pass, so the
gallery's FILE_METRICS came from the previous run's CSV (or was empty on a
first-ever run to a fresh output dir). Move the CSV write between the metric
and render passes: _episode_rows is already fully populated by
_run_rollout_loop, so _render_pass / _collect_file_metrics_from_csv now sees
the just-written CSV.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In pool / both modes the boundary slots are now drawn in red instead of the
shared teal that lanes use, so road edges that the policy pooled stand out
from lane centerlines at a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When num_scenarios > the env-slot count, each env runs multiple episodes
back-to-back. The renderer writes one file per completed scenario in
(episode_index, env_slot) order; the prior pairing filtered to
episode_index==0 only, so anything past the first batch had no metric in
the gallery FILE_METRICS and didn't participate in the sort. Sort the CSV
by (episode_index, env_slot) to match render order and pair positionally
over every rendered file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
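
The positional pairing described in the commit above can be sketched as follows. The column names `episode_index` and `env_slot` come from the commit message; the function name and the row layout (one dict per CSV row) are assumptions.

```python
def pair_files_with_rows(rendered_files, csv_rows):
    """Pair the i-th rendered file with the i-th row in (episode_index, env_slot) order."""
    ordered = sorted(
        csv_rows,
        key=lambda row: (int(row["episode_index"]), int(row["env_slot"])),
    )
    # zip stops at the shorter input, so surplus files or rows are left unpaired.
    return dict(zip(rendered_files, ordered))


# Two env slots running two episodes each, with rows in arbitrary CSV order.
rows = [
    {"episode_index": "1", "env_slot": "0", "offroad_rate": "0.0"},
    {"episode_index": "0", "env_slot": "1", "offroad_rate": "1.0"},
    {"episode_index": "0", "env_slot": "0", "offroad_rate": "0.5"},
    {"episode_index": "1", "env_slot": "1", "offroad_rate": "0.2"},
]
files = ["ep_0000.html", "ep_0001.html", "ep_0002.html", "ep_0003.html"]
paired = pair_files_with_rows(files, rows)
```

The pairing is only as good as the assumption that render order and CSV order describe the same rollouts.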
… CSV

The render and metric passes use different vec env instances (and different
seeds in obs_html's num_envs=1 vs the metric pass's full population), so
pairing rendered HTMLs against the metric-pass CSV row-by-row attached
metrics from a different simulation. Sorting the gallery by offroad_rate
then surfaced tiles whose actual rollout had no offroad in it.

Read each rendered scenario's metrics directly from the completed_episode
summary in info: triage_html's loop already iterates infos for the bundle;
obs_html now enables emit_completed_episodes on its render env and snapshots
the latest summary per batch. _collect_file_metrics_from_csv is dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
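
Reading metrics from the render pass itself, as the commit above describes, might look like the sketch below. The `completed_episode` key and the shape of `infos` (a list of dicts per rendered file) are assumptions for illustration; the actual structure in base.py is not shown in this conversation.

```python
def collect_file_metrics(rendered):
    """Build basename -> metric dict from the infos seen while rendering each file.

    `rendered` is an iterable of (html_basename, infos) pairs, where each info
    is a dict that may carry a per-episode summary under a hypothetical
    "completed_episode" key.
    """
    file_metrics = {}
    for basename, infos in rendered:
        for info in infos:
            summary = info.get("completed_episode")
            if summary is not None:
                # Keep the latest summary observed for this rendered file.
                file_metrics[basename] = dict(summary)
    return file_metrics


rendered = [
    ("ep_0000.html", [{}, {"completed_episode": {"score": 0.2, "dnf_rate": 1.0}}]),
    ("ep_0001.html", [{"completed_episode": {"score": 0.9, "dnf_rate": 0.0}}]),
    ("ep_0002.html", [{}]),
]
metrics = collect_file_metrics(rendered)
```

Files that never saw a summary simply have no entry, so each tile's metrics come from the same simulation that produced its HTML.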
@eugenevinitsky eugenevinitsky merged commit 474e28e into emerge/temp_training May 30, 2026
13 checks passed
@eugenevinitsky eugenevinitsky deleted the ev/dnf-triage branch May 30, 2026 23:06