vec_log per-agent population mean - average over agents instead of over completed episodes#459
vec_log per-agent population mean - average over agents instead of over completed episodes#459eugenevinitsky wants to merge 12 commits into
Conversation
Replaces the existing 'sum-over-completed-episodes / completed-episode-count' aggregation in vec_log with a two-stage per-agent mean: each agent slot contributes one window-mean (mean over its completions since last emit), then those means are averaged across agents. An agent that completes 10 short infraction-terminated episodes now has the same weight in the metric as an agent that completes one full clean episode. drive.h: - New per-env state: per_agent_log_sum[], per_agent_log_count[], per_agent_log_capacity. Grow-only realloc handles REPLAY scenarios where active_agent_count can vary across c_reset. - add_log now stamps the completing agent's slot instead of env->log, and bumps per_agent_log_count[i] instead of env->log.n. - New prepare_log: drains each agent's slot to (sum / count), sums across agents into env->log, sets env->log.n = #agents-with-data. Resets the per-agent buffers in place. - env_static_car_count and static_car_count are folded into each agent's slot so the population mean recovers the env's per-scenario value. drive/binding.c: - New vec_prepare_log binding (MY_METHODS) that calls prepare_log on every env in a VecEnv. drive/drive.py: - Calls binding.vec_prepare_log right before binding.vec_log every report_interval steps. env_binding.h: - Relaxes the vec_log gate from `aggregate.n < num_agents` (wait until enough completed episodes) to `aggregate.n < 1` (emit if anyone has data). For Drive this means "emit if any env has at least one agent with a window-mean"; for other ocean envs sharing this header it means smaller batches emitted more often, which the Python-side mean_and_log in pufferl.py already amortizes via its 0.25s rate-limit. - num_agents arg kept on the C signature for caller-API compatibility, but is now unused. Validated locally: `python setup.py build_ext --inplace --force` clean, existing scenario-length and single-agent-yaml tests pass, and a 64-agent smoke run emits one log dict per scenario_length window with n=num_agents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR changes how Drive environment metrics are aggregated in vec_log to remove completion-rate bias by switching from “mean over completed episodes” to a two-stage “per-agent window mean, then population mean across agent slots” approach. It also adds a Drive-specific vec_prepare_log step to drain per-agent accumulators into env->log immediately before vec_log.
Changes:
- Add per-agent log accumulators in Drive and a
prepare_log()drain step to produce per-agent population means. - Add a new C binding
vec_prepare_logand call it fromdrive.pybeforevec_logat eachreport_interval. - Relax
vec_log’s emission gate inenv_binding.hfromaggregate.n < num_agentstoaggregate.n < 1(API arg retained but unused).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pufferlib/ocean/env_binding.h | Changes vec_log emission gating to emit with any data (ignores num_agents arg). |
| pufferlib/ocean/drive/drive.py | Calls new binding.vec_prepare_log() prior to binding.vec_log(). |
| pufferlib/ocean/drive/drive.h | Adds per-agent log sum/count buffers; introduces ensure_per_agent_log_capacity() and prepare_log(); rewrites add_log() to accumulate per-agent. |
| pufferlib/ocean/drive/binding.c | Exposes vec_prepare_log Python binding to call prepare_log() across VecEnv envs. |
Comments suppressed due to low confidence (1)
pufferlib/ocean/drive/binding.c:55
- vec_prepare_log_py doesn’t validate its argument count. Most other vec_* entrypoints in env_binding.h explicitly check PyTuple_Size and raise a clear TypeError; without this, accidental extra args will be silently ignored and missing args will lead to confusing errors inside unpack_vecenv.
prepare_log((Drive *) vec->envs[i]);
}
Py_RETURN_NONE;
}
static int my_put(Env *env, PyObject *args, PyObject *kwargs) {
PyObject *obs = PyDict_GetItemString(kwargs, "observations");
if (!PyObject_TypeCheck(obs, &PyArray_Type)) {
PyErr_SetString(PyExc_TypeError, "Observations must be a NumPy array");
return 1;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static void ensure_per_agent_log_capacity(Drive *env, int needed) { | ||
| if (env->per_agent_log_capacity >= needed) { | ||
| return; | ||
| } | ||
| int old_cap = env->per_agent_log_capacity; | ||
| env->per_agent_log_sum = (Log *) realloc(env->per_agent_log_sum, needed * sizeof(Log)); | ||
| env->per_agent_log_count = (int *) realloc(env->per_agent_log_count, needed * sizeof(int)); | ||
| memset(&env->per_agent_log_sum[old_cap], 0, (needed - old_cap) * sizeof(Log)); | ||
| memset(&env->per_agent_log_count[old_cap], 0, (needed - old_cap) * sizeof(int)); | ||
| env->per_agent_log_capacity = needed; | ||
| } |
| // Emit whenever any env has data. With Drive's per-agent prepare_log | ||
| // path, aggregate.n is the cross-env count of agents that contributed | ||
| // a window-mean (not completed-episode count), so the meaningful gate | ||
| // is "at least one contribution." Other ocean envs that don't run | ||
| // prepare_log retain completed-episode-count semantics and now emit | ||
| // smaller batches more often — the Python-side mean_and_log | ||
| // (pufferl.py) re-averages across emissions in its rate-limit window. | ||
| if (aggregate.n < 1) { | ||
| return dict; |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the window-mean per-agent state (sum + count, reset on every emit) with a per-agent EMA that persists across emissions. Each agent slot tracks slot = alpha*slot + (1 - alpha)*new_episode_log on every completion, with the first completion seeding the slot directly. prepare_log then emits the population mean across agents flagged has_data, never resetting the EMAs. Removes the residual completion-rate bias that the previous PR design still carried: fast-completers no longer dominate the multi-emission average that pufferl.py's mean_and_log produces, because every agent now contributes a single smoothed value to every emission rather than appearing in some windows and not others. New env config knob log_ema_alpha (default 0.95, half-life ~14 episodes per agent) controls smoothing. alpha=0 collapses to "most recent episode only", alpha->1 freezes the slot at its first observation. Files: - drive.h: rename per_agent_log_sum to per_agent_log_ema, add per_agent_has_data flag, add log_ema_alpha field. add_log builds a fresh episode_log snapshot, then seeds the slot or EMA-blends. prepare_log skips no-data slots and sums (no division) into env->log; vec_log divides by env->log.n = num_with_data downstream. - binding.c: unpack log_ema_alpha kwarg. - drive.py: __init__ takes log_ema_alpha=0.95, plumbs through env_kwargs. - drive.ini: log_ema_alpha = 0.95 under [env]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T1: gate skips emissions until the first agent completes an episode; first log dict appears exactly when the scenario truncates. T2: once every agent has completed at least one episode, dict["n"] equals num_agents (population fully represented in the cross-agent mean). T3: prepare_log preserves per-agent EMA state across emissions -- two consecutive emissions with no intervening completions produce identical metric values, locking in that the EMA buffers are not reset on emit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
active_agent_count is set once by set_active_agents (called only from init() and never from c_reset), so the grow-only ensure_per_agent_log_capacity helper only ever did real work on its first call and the comment justifying it as defense against varying populations was inaccurate. Allocate the three per-agent arrays in init() right after set_active_agents and drop the helper, the per-step ensure call in add_log, and the now-unused per_agent_log_capacity field. prepare_log iterates active_agent_count directly. Also trim the struct-field, prepare_log, episode_log-snapshot, and env-composition comments that were either describing wiring history or referencing downstream consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval branch of vec_log only checked env[0]'s log.n before dividing every env's log fields by their own n, so envs other than env[0] with no data would divide by zero. The empty-data short-circuit also returned the unused outer dict instead of the half-built list, leaking it to the GC. Rewrite the eval branch to per-env-check inside the loop: skip the divide-and-emit for envs with n=0 and leave their list slot as an empty dict, so the list shape stays vec->num_envs regardless of which envs have contributed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_drive_train called load_config which reads sys.argv; under pytest that contains pytest's own flags and the argparser bailed with SystemExit(2). Patch sys.argv to a single-element list inside the test so pytest discovery works (the existing train-ci.yml bare-script invocation is unaffected). Add tests/test_drive_per_agent_logging.py and tests/test_drive_train.py to the utest pytest pipeline. test_drive_train runs in its own pytest step because the test calls os._exit(0) on success to dodge worker-thread hangs, which would otherwise mask any earlier-test failures with exit 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returning vec->num_envs dicts with empty entries for no-data envs poisons downstream eval aggregation: pufferlib/ocean/benchmark/evaluators/base.py treats a missing "n" key as 1, so each empty dict inflates total_n and the weighted-mean denominator without contributing any numerator -- per-metric means get diluted. Worse, multi_scenario._should_stop counts emissions against num_scenarios, so cold-start steps with all-empty lists could terminate the eval loop early with garbage data. Return a list of only the populated dicts (length <= vec->num_envs). When no env has data the list is empty, drive.py's `if log: info.append(log)` guard skips it, and downstream counters keep ticking on real emissions only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
log_ema_alpha = 0.707 (was 0.95) so the per-agent EMA's effective half-life is ~2 episodes -- responsive enough to track training progress within a few emissions per agent. Drive.__init__ default tracks drive.ini. Also join a split f-string in test_drive_per_agent_logging.py to make ruff-format happy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vcharraut
left a comment
There was a problem hiding this comment.
Do you have an example of the same seed with and without this PR ?
|
great question, will launch two runs and shre |
|
@copilot resolve the merge conflicts in this pull request |
Conflicts are resolved in commit |
Previously we were logging anytime N agents finished, which biased the metrics towards agents that finished frequently. Now we report an average over all the agent slots.
Summary
We use per-agent EMA-based aggregation in `vec_log` so the cross-agent mean reported in wandb has one weight per agent regardless of completion frequency — no per-emission bias and no multi-emission residual bias.
For each agent slot the env keeps a smoothed `Log` state:
```
slot ← α · slot + (1 − α) · episode_log # on every completion (after first)
slot ← episode_log # on the slot's first completion
```
`prepare_log` then sums slots flagged `has_data`, sets `env->log.n = num_with_data`, and does not reset. `vec_log`'s sum-across-envs / divide-by-`aggregate.n` step then produces a population mean: every agent that has ever completed contributes one term, every emission.
Why EMA?
We need some way of providing a good estimate of the average performance of the agents. We don't want to accumulate forever, as that'd be slow to update, and we could use a fixed size number of logs per agent, but this is a nice compromise.