vec_log per-agent population mean - average over agents instead of over completed episodes by eugenevinitsky · Pull Request #459 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-05-30T17:47:19Z

Previously we were logging anytime N agents finished, which biased the metrics towards agents that finished frequently. Now we report an average over all the agent slots.

Summary

We use per-agent EMA-based aggregation in `vec_log` so the cross-agent mean reported in wandb has one weight per agent regardless of completion frequency — no per-emission bias and no multi-emission residual bias.

For each agent slot the env keeps a smoothed `Log` state:

```
slot ← α · slot + (1 − α) · episode_log # on every completion (after first)
slot ← episode_log # on the slot's first completion
```

`prepare_log` then sums slots flagged `has_data`, sets `env->log.n = num_with_data`, and does not reset. `vec_log`'s sum-across-envs / divide-by-`aggregate.n` step then produces a population mean: every agent that has ever completed contributes one term, every emission.

Why EMA?

We need some way of providing a good estimate of the average performance of the agents. We don't want to accumulate forever, as that'd be slow to update, and we could use a fixed size number of logs per agent, but this is a nice compromise.

Replaces the existing 'sum-over-completed-episodes / completed-episode-count' aggregation in vec_log with a two-stage per-agent mean: each agent slot contributes one window-mean (mean over its completions since last emit), then those means are averaged across agents. An agent that completes 10 short infraction-terminated episodes now has the same weight in the metric as an agent that completes one full clean episode. drive.h: - New per-env state: per_agent_log_sum[], per_agent_log_count[], per_agent_log_capacity. Grow-only realloc handles REPLAY scenarios where active_agent_count can vary across c_reset. - add_log now stamps the completing agent's slot instead of env->log, and bumps per_agent_log_count[i] instead of env->log.n. - New prepare_log: drains each agent's slot to (sum / count), sums across agents into env->log, sets env->log.n = #agents-with-data. Resets the per-agent buffers in place. - env_static_car_count and static_car_count are folded into each agent's slot so the population mean recovers the env's per-scenario value. drive/binding.c: - New vec_prepare_log binding (MY_METHODS) that calls prepare_log on every env in a VecEnv. drive/drive.py: - Calls binding.vec_prepare_log right before binding.vec_log every report_interval steps. env_binding.h: - Relaxes the vec_log gate from `aggregate.n < num_agents` (wait until enough completed episodes) to `aggregate.n < 1` (emit if anyone has data). For Drive this means "emit if any env has at least one agent with a window-mean"; for other ocean envs sharing this header it means smaller batches emitted more often, which the Python-side mean_and_log in pufferl.py already amortizes via its 0.25s rate-limit. - num_agents arg kept on the C signature for caller-API compatibility, but is now unused. Validated locally: `python setup.py build_ext --inplace --force` clean, existing scenario-length and single-agent-yaml tests pass, and a 64-agent smoke run emits one log dict per scenario_length window with n=num_agents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR changes how Drive environment metrics are aggregated in vec_log to remove completion-rate bias by switching from “mean over completed episodes” to a two-stage “per-agent window mean, then population mean across agent slots” approach. It also adds a Drive-specific vec_prepare_log step to drain per-agent accumulators into env->log immediately before vec_log.

Changes:

Add per-agent log accumulators in Drive and a prepare_log() drain step to produce per-agent population means.
Add a new C binding vec_prepare_log and call it from drive.py before vec_log at each report_interval.
Relax vec_log’s emission gate in env_binding.h from aggregate.n < num_agents to aggregate.n < 1 (API arg retained but unused).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
pufferlib/ocean/env_binding.h	Changes `vec_log` emission gating to emit with any data (ignores `num_agents` arg).
pufferlib/ocean/drive/drive.py	Calls new `binding.vec_prepare_log()` prior to `binding.vec_log()`.
pufferlib/ocean/drive/drive.h	Adds per-agent log sum/count buffers; introduces `ensure_per_agent_log_capacity()` and `prepare_log()`; rewrites `add_log()` to accumulate per-agent.
pufferlib/ocean/drive/binding.c	Exposes `vec_prepare_log` Python binding to call `prepare_log()` across VecEnv envs.

Comments suppressed due to low confidence (1)

pufferlib/ocean/drive/binding.c:55

vec_prepare_log_py doesn’t validate its argument count. Most other vec_* entrypoints in env_binding.h explicitly check PyTuple_Size and raise a clear TypeError; without this, accidental extra args will be silently ignored and missing args will lead to confusing errors inside unpack_vecenv.

        prepare_log((Drive *) vec->envs[i]);
    }
    Py_RETURN_NONE;
}

static int my_put(Env *env, PyObject *args, PyObject *kwargs) {
    PyObject *obs = PyDict_GetItemString(kwargs, "observations");
    if (!PyObject_TypeCheck(obs, &PyArray_Type)) {
        PyErr_SetString(PyExc_TypeError, "Observations must be a NumPy array");
        return 1;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+static void ensure_per_agent_log_capacity(Drive *env, int needed) {
+    if (env->per_agent_log_capacity >= needed) {
+        return;
+    }
+    int old_cap = env->per_agent_log_capacity;
+    env->per_agent_log_sum = (Log *) realloc(env->per_agent_log_sum, needed * sizeof(Log));
+    env->per_agent_log_count = (int *) realloc(env->per_agent_log_count, needed * sizeof(int));
+    memset(&env->per_agent_log_sum[old_cap], 0, (needed - old_cap) * sizeof(Log));
+    memset(&env->per_agent_log_count[old_cap], 0, (needed - old_cap) * sizeof(int));
+    env->per_agent_log_capacity = needed;
+}


+        // Emit whenever any env has data. With Drive's per-agent prepare_log
+        // path, aggregate.n is the cross-env count of agents that contributed
+        // a window-mean (not completed-episode count), so the meaningful gate
+        // is "at least one contribution." Other ocean envs that don't run
+        // prepare_log retain completed-episode-count semantics and now emit
+        // smaller batches more often — the Python-side mean_and_log
+        // (pufferl.py) re-averages across emissions in its rate-limit window.
+        if (aggregate.n < 1) {
+            return dict;


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the window-mean per-agent state (sum + count, reset on every emit) with a per-agent EMA that persists across emissions. Each agent slot tracks slot = alpha*slot + (1 - alpha)*new_episode_log on every completion, with the first completion seeding the slot directly. prepare_log then emits the population mean across agents flagged has_data, never resetting the EMAs. Removes the residual completion-rate bias that the previous PR design still carried: fast-completers no longer dominate the multi-emission average that pufferl.py's mean_and_log produces, because every agent now contributes a single smoothed value to every emission rather than appearing in some windows and not others. New env config knob log_ema_alpha (default 0.95, half-life ~14 episodes per agent) controls smoothing. alpha=0 collapses to "most recent episode only", alpha->1 freezes the slot at its first observation. Files: - drive.h: rename per_agent_log_sum to per_agent_log_ema, add per_agent_has_data flag, add log_ema_alpha field. add_log builds a fresh episode_log snapshot, then seeds the slot or EMA-blends. prepare_log skips no-data slots and sums (no division) into env->log; vec_log divides by env->log.n = num_with_data downstream. - binding.c: unpack log_ema_alpha kwarg. - drive.py: __init__ takes log_ema_alpha=0.95, plumbs through env_kwargs. - drive.ini: log_ema_alpha = 0.95 under [env]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

T1: gate skips emissions until the first agent completes an episode; first log dict appears exactly when the scenario truncates. T2: once every agent has completed at least one episode, dict["n"] equals num_agents (population fully represented in the cross-agent mean). T3: prepare_log preserves per-agent EMA state across emissions -- two consecutive emissions with no intervening completions produce identical metric values, locking in that the EMA buffers are not reset on emit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

active_agent_count is set once by set_active_agents (called only from init() and never from c_reset), so the grow-only ensure_per_agent_log_capacity helper only ever did real work on its first call and the comment justifying it as defense against varying populations was inaccurate. Allocate the three per-agent arrays in init() right after set_active_agents and drop the helper, the per-step ensure call in add_log, and the now-unused per_agent_log_capacity field. prepare_log iterates active_agent_count directly. Also trim the struct-field, prepare_log, episode_log-snapshot, and env-composition comments that were either describing wiring history or referencing downstream consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The eval branch of vec_log only checked env[0]'s log.n before dividing every env's log fields by their own n, so envs other than env[0] with no data would divide by zero. The empty-data short-circuit also returned the unused outer dict instead of the half-built list, leaking it to the GC. Rewrite the eval branch to per-env-check inside the loop: skip the divide-and-emit for envs with n=0 and leave their list slot as an empty dict, so the list shape stays vec->num_envs regardless of which envs have contributed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test_drive_train called load_config which reads sys.argv; under pytest that contains pytest's own flags and the argparser bailed with SystemExit(2). Patch sys.argv to a single-element list inside the test so pytest discovery works (the existing train-ci.yml bare-script invocation is unaffected). Add tests/test_drive_per_agent_logging.py and tests/test_drive_train.py to the utest pytest pipeline. test_drive_train runs in its own pytest step because the test calls os._exit(0) on success to dodge worker-thread hangs, which would otherwise mask any earlier-test failures with exit 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Returning vec->num_envs dicts with empty entries for no-data envs poisons downstream eval aggregation: pufferlib/ocean/benchmark/evaluators/base.py treats a missing "n" key as 1, so each empty dict inflates total_n and the weighted-mean denominator without contributing any numerator -- per-metric means get diluted. Worse, multi_scenario._should_stop counts emissions against num_scenarios, so cold-start steps with all-empty lists could terminate the eval loop early with garbage data. Return a list of only the populated dicts (length <= vec->num_envs). When no env has data the list is empty, drive.py's `if log: info.append(log)` guard skips it, and downstream counters keep ticking on real emissions only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

log_ema_alpha = 0.707 (was 0.95) so the per-agent EMA's effective half-life is ~2 episodes -- responsive enough to track training progress within a few emissions per agent. Drive.__init__ default tracks drive.ini. Also join a split f-string in test_drive_per_agent_logging.py to make ruff-format happy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vcharraut

Do you have an example of the same seed with and without this PR ?

eugenevinitsky · 2026-06-01T18:01:31Z

great question, will launch two runs and shre

eugenevinitsky · 2026-06-01T18:01:48Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-06-01T18:12:36Z

@copilot resolve the merge conflicts in this pull request

Conflicts are resolved in commit 63093b1 by merging emerge/temp_training into this branch and resolving the two modify/delete conflicts (.github/workflows/utest.yml, tests/test_drive_train.py) in favor of the base branch file moves.

Copilot AI review requested due to automatic review settings May 30, 2026 17:47

Copilot started reviewing on behalf of eugenevinitsky May 30, 2026 17:47 View session

drive/binding: drop forward-decl wiring comments

dda8ccc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI reviewed May 30, 2026

View reviewed changes

Eugene Vinitsky and others added 9 commits May 30, 2026 13:54

env_binding: focus vec_log gate comment on the divide-by-zero invariant

ff62899

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

env_binding: drop unused num_agents extraction in vec_log

e952bcb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eugenevinitsky changed the title ~~WIP: vec_log per-agent population mean (remove completion-rate bias)~~ vec_log per-agent population mean - average over agents instead of over completed episodes May 30, 2026

vcharraut reviewed Jun 1, 2026

View reviewed changes

Copilot started work on behalf of eugenevinitsky June 1, 2026 18:02 View session

Merge branch 'emerge/temp_training' into ev/per-agent-logging

63093b1

Copilot finished work on behalf of eugenevinitsky June 1, 2026 18:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vec_log per-agent population mean - average over agents instead of over completed episodes#459

vec_log per-agent population mean - average over agents instead of over completed episodes#459
eugenevinitsky wants to merge 12 commits into
emerge/temp_trainingfrom
ev/per-agent-logging

eugenevinitsky commented May 30, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

vcharraut left a comment

Uh oh!

eugenevinitsky commented Jun 1, 2026

Uh oh!

eugenevinitsky commented Jun 1, 2026

Uh oh!

Copilot AI commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

eugenevinitsky commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why EMA?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

vcharraut left a comment

Choose a reason for hiding this comment

Uh oh!

eugenevinitsky commented Jun 1, 2026

Uh oh!

eugenevinitsky commented Jun 1, 2026

Uh oh!

Copilot AI commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eugenevinitsky commented May 30, 2026 •

edited

Loading