NHL94 self-play: asymmetric finetune modes, side-aware action processing, opponent snapshot rotation by Copilot · Pull Request #10 · MatPoliquin/stable-retro-scripts

Copilot · 2026-04-01T12:57:26Z

Implements Milestones 1–4 of the NHL94 self-play plan: curriculum bootstrapping into separate offense/defense finetune phases with a frozen team-2 opponent, terminal zero-sum rewards, and checkpoint-pool rotation. Team 1 is fixed as the learner throughout.

`nhl94_rf.py` — New finetune reward modes

Adds SelfPlayOffenseFinetune and SelfPlayDefenseFinetune to _reward_function_map. Both use purely terminal rewards (no dense shaping):

Offense: +1 on team-1 goal; -1 when puck exits attack zone or team-2 holds possession for SELFPLAY_CONTROL_FRAMES=30 consecutive frames; 0 on timeout.
Defense: +1 when puck clears defense zone or team-1 holds possession 30+ frames; -1 on team-2 goal; 0 on timeout.

Per-episode control tracking is stored on game_state using the same _sp_offense_ctx / _sp_defense_ctx transient-field pattern already used by rf_defensezone.
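The offense rule above can be sketched as a small per-frame function. This is only an illustration of the described terminal logic, not the code in `nhl94_rf.py`: `OffenseCtx`, `offense_terminal_reward`, and the boolean inputs are hypothetical names, and the order in which the conditions are checked is an assumption.

```python
from dataclasses import dataclass

SELFPLAY_CONTROL_FRAMES = 30  # value quoted in the PR description


@dataclass
class OffenseCtx:
    """Hypothetical per-episode tracker; stands in for _sp_offense_ctx."""
    opponent_control_frames: int = 0


def offense_terminal_reward(ctx, team1_scored, puck_in_attack_zone,
                            team2_has_puck, timed_out):
    """Return (reward, done) for one frame of the offense finetune mode."""
    if team1_scored:
        return 1.0, True
    if not puck_in_attack_zone:
        return -1.0, True
    # Count consecutive frames of team-2 possession; any other frame resets it.
    ctx.opponent_control_frames = (
        ctx.opponent_control_frames + 1 if team2_has_puck else 0
    )
    if ctx.opponent_control_frames >= SELFPLAY_CONTROL_FRAMES:
        return -1.0, True
    if timed_out:
        return 0.0, True
    # No dense shaping: every non-terminal frame yields zero reward.
    return 0.0, False
```

The defense mode would mirror this with the signs swapped: team-1 possession is counted, and a zone exit is rewarded instead of penalised.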

`nhl94_obs.py` — Side-aware action processing + self-play API

Replaces shared b_button_pressed / c_button_pressed / slapshot_frames_held with a per-side dict:

self.action_state = {
    "learner":  {"b_pressed": False, "c_pressed": False, "slapshot_frames": 0},
    "opponent": {"b_pressed": False, "c_pressed": False, "slapshot_frames": 0},
}

Button debounce and slapshot-hold logic is extracted into _process_action(ac, side_state) called independently for each side. The dead prev_state.Flip() call (prev_state was never read after assignment) is removed.
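A minimal sketch of why per-side state matters, assuming a simple edge-triggered debounce on B and a frame counter on C. The actual debounce and slapshot rules in `nhl94_obs.py` are not spelled out in this description, so `process_action`, `make_side_state`, and the button indices below are hypothetical.

```python
B_IDX, C_IDX = 0, 1  # hypothetical positions of B and C in the action vector


def make_side_state():
    """Fresh state for one side, matching the keys shown in action_state."""
    return {"b_pressed": False, "c_pressed": False, "slapshot_frames": 0}


def process_action(ac, side_state):
    """Edge-trigger B and count frames C is held; mutates side_state only."""
    out = list(ac)
    b_down = bool(ac[B_IDX])
    # Assumed debounce: B registers only on the frame it goes from up to down.
    out[B_IDX] = 1 if (b_down and not side_state["b_pressed"]) else 0
    side_state["b_pressed"] = b_down

    c_down = bool(ac[C_IDX])
    # Assumed slapshot tracking: consecutive frames with C held.
    side_state["slapshot_frames"] = (
        side_state["slapshot_frames"] + 1 if c_down else 0
    )
    side_state["c_pressed"] = c_down
    return out


action_state = {"learner": make_side_state(), "opponent": make_side_state()}
```

Because each call receives its own dict, a held button on the learner side cannot suppress or extend a press on the opponent side, which is the failure mode shared flags would allow.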

New self-play API on the wrapper:

set_opponent_model(path) — loads/clears a frozen PPO checkpoint via PPO.load.
compute_opponent_action(obs) — queries the frozen policy; returns zero-action when no model is loaded.
combine_selfplay_actions(learner_ac, opponent_ac) — concatenates both into a two-player action array.
selfplay_enabled property — True when an opponent model is loaded.

When selfplay_enabled, step() queries the frozen opponent using the current team-1-perspective observation and routes its output through the opponent's independent action state before concatenating with the learner action.
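The opponent query and action concatenation might look roughly like the following. This is a sketch only: it uses plain lists where the wrapper most likely uses NumPy arrays, `NUM_BUTTONS = 12` is an assumed per-player action width, and the only thing taken from Stable-Baselines3 is that `predict` returns an `(action, state)` pair.

```python
NUM_BUTTONS = 12  # assumption: per-player action width


def compute_opponent_action(model, obs):
    """Query the frozen policy, or fall back to a zero action when none is loaded."""
    if model is None:
        return [0] * NUM_BUTTONS
    action, _state = model.predict(obs, deterministic=True)
    return list(action)


def combine_selfplay_actions(learner_ac, opponent_ac):
    """Concatenate learner (team 1) and opponent (team 2) into one 2-player action."""
    return list(learner_ac) + list(opponent_ac)


class StubPolicy:
    """Stand-in for a loaded PPO checkpoint, used only to exercise the sketch."""

    def predict(self, obs, deterministic=True):
        return [1] * NUM_BUTTONS, None
```

With no model loaded the opponent half of the combined action is all zeros, so the learner effectively faces an idle second controller.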

`train_live.py` — Opponent snapshot callback + CLI args

Adds OpponentSnapshotCallback which snapshots the current learner every opponent_snapshot_freq steps, maintains a bounded pool of up to opponent_pool_size checkpoints, and pushes a sampled opponent to all worker envs via env.env_method("set_opponent_model", path) using a 40/40/20 mixture (latest / random historical / best-known).
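A rough sketch of the bounded pool and the 40/40/20 mixture follows. `OpponentPool` and its methods are hypothetical names, and two details are assumptions rather than facts from this PR: how the "best-known" checkpoint is chosen, and whether "random historical" may also return the latest snapshot.

```python
import random
from collections import deque


class OpponentPool:
    """Bounded pool of checkpoint paths with a latest/random/best mixture."""

    def __init__(self, max_size=5, rng=None):
        self.paths = deque(maxlen=max_size)  # oldest entry is evicted when full
        self.best = None                     # set externally; selection rule is assumed
        self.rng = rng or random.Random()

    def add(self, path):
        self.paths.append(path)

    def mark_best(self, path):
        self.best = path

    def sample(self):
        """Return a checkpoint path, or None if the pool is empty."""
        if not self.paths:
            return None
        r = self.rng.random()
        if r < 0.4:                          # 40%: latest snapshot
            return self.paths[-1]
        if r < 0.8:                          # 40%: uniform over the whole pool
            return self.rng.choice(list(self.paths))
        # 20%: best-known, falling back to latest when none is marked
        return self.best if self.best is not None else self.paths[-1]
```

In the callback, the sampled path would then be handed to every worker through the `env.env_method("set_opponent_model", path)` call mentioned above.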

New CLI args: --load_opponent_model, --opponent_snapshot_freq (default 50k), --opponent_pool_size (default 5).
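For reference, the three flags could be declared with `argparse` along these lines. The flag names and the two numeric defaults come from this description; the argument types, the `None` default for `--load_opponent_model`, and the `build_parser` helper are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only the self-play flags added in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--load_opponent_model", type=str, default=None,
                        help="checkpoint used as the initial frozen opponent")
    parser.add_argument("--opponent_snapshot_freq", type=int, default=50_000,
                        help="steps between learner snapshots")
    parser.add_argument("--opponent_pool_size", type=int, default=5,
                        help="maximum number of checkpoints kept in the pool")
    return parser
```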

LiveTrainer seeds the initial frozen opponent across all subproc workers at startup when --selfplay and --load_opponent_model are both set. build_callback wraps LiveTrainingCallback and OpponentSnapshotCallback in a CallbackList when self-play is active.

`train_curriculum.py` — Opponent seeding for self-play phases

"load_opponent_model" added to PATH_KEYS for relative-path resolution. merge_phase_config automatically sets load_opponent_model = previous_model_path for self-play phases that don't specify an explicit opponent — so the final curriculum checkpoint seeds both learner init and the first frozen opponent pool entry.
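The defaulting rule can be sketched as below. `seed_opponent` is a hypothetical helper that isolates just this behaviour; the real `merge_phase_config` does more than this and its signature is not shown here.

```python
def seed_opponent(phase, previous_model_path):
    """Default the frozen opponent to the previous phase's checkpoint.

    Only applies to self-play phases that do not name an explicit opponent.
    """
    merged = dict(phase)  # copy so the caller's phase config is not mutated
    if merged.get("selfplay") and not merged.get("load_opponent_model"):
        merged["load_opponent_model"] = previous_model_path
    return merged
```

A phase that already sets `load_opponent_model` keeps its own value, and phases without `"selfplay": true` are returned unchanged.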

`curriculum/nhl94.json` — Self-play phases appended

Two phases added after the existing Defense Zone phase:

{ "name": "Self-Play Offense Finetune", "rf": "SelfPlayOffenseFinetune",
  "state": "PenguinsVsSenators.AttackZone", "selfplay": true, "num_players": 2,
  "opponent_snapshot_freq": 50000, "opponent_pool_size": 5, "num_timesteps": 10000000 },
{ "name": "Self-Play Defense Finetune", "rf": "SelfPlayDefenseFinetune",
  "state": "PenguinsVsSenators.DefenseZone", "selfplay": true, "num_players": 2,
  "opponent_snapshot_freq": 50000, "opponent_pool_size": 5, "num_timesteps": 10000000 }

… opponent snapshot rotation, curriculum phases
Agent-Logs-Url: https://github.com/MatPoliquin/stable-retro-scripts/sessions/c2ccf5b5-8dfc-4051-879b-b79ad0786973
Co-authored-by: MatPoliquin <7024551+MatPoliquin@users.noreply.github.com>

Initial plan

b521ead

Copilot AI assigned Copilot and MatPoliquin Apr 1, 2026

Copilot started work on behalf of MatPoliquin April 1, 2026 12:57

Copilot AI changed the title ~~[WIP] Add self-play implementation for NHL94~~ NHL94 self-play: asymmetric finetune modes, side-aware action processing, opponent snapshot rotation Apr 1, 2026

Copilot AI requested a review from MatPoliquin April 1, 2026 13:16

Copilot finished work on behalf of MatPoliquin April 1, 2026 13:16

Copilot wants to merge 2 commits into main from copilot/add-self-play-to-nhl94
