NHL94 self-play: asymmetric finetune modes, side-aware action processing, opponent snapshot rotation #10

Draft
Copilot wants to merge 2 commits into main from copilot/add-self-play-to-nhl94

Copilot AI commented Apr 1, 2026

Implements Milestones 1–4 of the NHL94 self-play plan: curriculum bootstrapping into separate offense/defense finetune phases with a frozen team-2 opponent, terminal zero-sum rewards, and checkpoint-pool rotation. Team 1 is fixed as the learner throughout.

nhl94_rf.py — New finetune reward modes

Adds SelfPlayOffenseFinetune and SelfPlayDefenseFinetune to _reward_function_map. Both use purely terminal rewards (no dense shaping):

  • Offense: +1 on team-1 goal; -1 when puck exits attack zone or team-2 holds possession for SELFPLAY_CONTROL_FRAMES=30 consecutive frames; 0 on timeout.
  • Defense: +1 when puck clears defense zone or team-1 holds possession 30+ frames; -1 on team-2 goal; 0 on timeout.

Per-episode control tracking is stored on game_state in the transient fields _sp_offense_ctx and _sp_defense_ctx, following the transient-field pattern already used by rf_defensezone.
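
As a rough illustration of the offense rules above, here is a minimal sketch of a purely terminal reward with a consecutive-frame possession counter. It is not the code in nhl94_rf.py: OffenseCtx, offense_terminal_reward, the boolean inputs, and the order in which the conditions are checked are all assumptions made for the example. Only SELFPLAY_CONTROL_FRAMES = 30 and the +1 / -1 / 0 outcomes come from this description.

```python
from dataclasses import dataclass

SELFPLAY_CONTROL_FRAMES = 30  # value stated in the PR description


@dataclass
class OffenseCtx:
    """Hypothetical per-episode tracker (stand-in for _sp_offense_ctx)."""
    opp_control_frames: int = 0


def offense_terminal_reward(ctx, team1_scored, puck_in_attack_zone,
                            team2_has_puck, timed_out):
    """Return (reward, done) for one frame of the offense finetune mode.

    The check order (goal, zone exit, opponent control, timeout) is assumed.
    """
    if team1_scored:
        return 1.0, True
    if not puck_in_attack_zone:
        return -1.0, True
    # Count consecutive frames of team-2 possession; any break resets the count.
    if team2_has_puck:
        ctx.opp_control_frames += 1
    else:
        ctx.opp_control_frames = 0
    if ctx.opp_control_frames >= SELFPLAY_CONTROL_FRAMES:
        return -1.0, True
    if timed_out:
        return 0.0, True
    return 0.0, False  # no dense shaping: non-terminal frames give 0
```

The defense mode would mirror this with the signs and the tracked team swapped.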

nhl94_obs.py — Side-aware action processing + self-play API

Replaces shared b_button_pressed / c_button_pressed / slapshot_frames_held with a per-side dict:

self.action_state = {
    "learner":  {"b_pressed": False, "c_pressed": False, "slapshot_frames": 0},
    "opponent": {"b_pressed": False, "c_pressed": False, "slapshot_frames": 0},
}

Button debounce and slapshot-hold logic is extracted into _process_action(ac, side_state) called independently for each side. The dead prev_state.Flip() call (prev_state was never read after assignment) is removed.
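
To show why the per-side split matters, the sketch below runs a simplified debounce and hold counter against each side's own state dict. The real _process_action is not reproduced here: the dict-shaped action with boolean "b" and "c" entries, the edge-triggered debounce rule, and the hold counter semantics are assumptions for illustration. Only the action_state layout is taken from this description.

```python
def make_action_state():
    """Build the per-side state dict in the layout shown above."""
    return {
        side: {"b_pressed": False, "c_pressed": False, "slapshot_frames": 0}
        for side in ("learner", "opponent")
    }


def process_action(ac, side_state):
    """Hypothetical stand-in for _process_action(ac, side_state).

    ac is assumed to be a dict with boolean "b" and "c" entries.
    A press is reported only on the frame the button goes down (debounce),
    and slapshot_frames counts how long C has been held.
    """
    b_edge = ac["b"] and not side_state["b_pressed"]
    c_edge = ac["c"] and not side_state["c_pressed"]
    side_state["b_pressed"] = ac["b"]
    side_state["c_pressed"] = ac["c"]
    if ac["c"]:
        side_state["slapshot_frames"] += 1
    else:
        side_state["slapshot_frames"] = 0
    return {"b": b_edge, "c": c_edge,
            "slapshot_frames": side_state["slapshot_frames"]}


state = make_action_state()
# The learner holding C does not advance the opponent's hold counter.
process_action({"b": False, "c": True}, state["learner"])
process_action({"b": False, "c": False}, state["opponent"])
```

With a single shared set of flags, one side's held button would suppress or extend the other side's input, which is the failure the per-side dict avoids.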

New self-play API on the wrapper:

  • set_opponent_model(path) — loads/clears a frozen PPO checkpoint via PPO.load.
  • compute_opponent_action(obs) — queries the frozen policy; returns zero-action when no model is loaded.
  • combine_selfplay_actions(learner_ac, opponent_ac) — concatenates both into a two-player action array.
  • selfplay_enabled property — True when an opponent model is loaded.

When selfplay_enabled, step() queries the frozen opponent using the current team-1-perspective observation and routes its output through the opponent's independent action state before concatenating with the learner action.
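
The API contract above can be sketched without the emulator or Stable-Baselines3. In this simplified stand-in, the class name SelfPlayApiSketch, the buttons_per_player argument, the callable opponent_policy (in place of a loaded PPO model), and the learner-first ordering of the combined action are all assumptions.

```python
class SelfPlayApiSketch:
    """Hypothetical, simplified stand-in for the wrapper's self-play API."""

    def __init__(self, buttons_per_player):
        self.buttons_per_player = buttons_per_player
        self.opponent_policy = None  # the real wrapper would hold a frozen PPO model

    @property
    def selfplay_enabled(self):
        # True only while an opponent is loaded.
        return self.opponent_policy is not None

    def compute_opponent_action(self, obs):
        # With no opponent loaded, fall back to an all-zero (no-op) action.
        if self.opponent_policy is None:
            return [0] * self.buttons_per_player
        return list(self.opponent_policy(obs))

    def combine_selfplay_actions(self, learner_ac, opponent_ac):
        # Assumed ordering: learner (team 1) buttons first, then opponent.
        return list(learner_ac) + list(opponent_ac)


wrapper = SelfPlayApiSketch(buttons_per_player=3)
combined = wrapper.combine_selfplay_actions(
    [1, 0, 1], wrapper.compute_opponent_action(obs=None)
)
```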

train_live.py — Opponent snapshot callback + CLI args

Adds OpponentSnapshotCallback, which snapshots the current learner every opponent_snapshot_freq steps, maintains a bounded pool of up to opponent_pool_size checkpoints, and pushes a sampled opponent to all worker envs via env.env_method("set_opponent_model", path). Opponents are sampled from a 40/40/20 mixture (latest / random historical / best-known).
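
A minimal sketch of the bounded pool and the 40/40/20 draw, independent of the callback itself. The OpponentPool class, oldest-first eviction, sampling "random historical" from the whole pool (latest included), and falling back to the latest snapshot when no best-known opponent is recorded are all assumptions. The description only fixes the pool bound and the mixture weights.

```python
import random
from collections import deque


class OpponentPool:
    """Hypothetical bounded checkpoint pool with a 40/40/20 sampling mixture."""

    def __init__(self, max_size=5, rng=None):
        self.paths = deque(maxlen=max_size)  # assumed: oldest snapshot evicted first
        self.best = None  # assumed: set elsewhere, e.g. from evaluation results
        self.rng = rng or random.Random()

    def add(self, path):
        self.paths.append(path)

    def sample(self):
        if not self.paths:
            return None
        r = self.rng.random()
        if r < 0.4:  # 40%: latest snapshot
            return self.paths[-1]
        if r < 0.8:  # 40%: uniformly random snapshot from the pool
            return self.rng.choice(list(self.paths))
        # 20%: best-known opponent, or the latest if none is recorded
        return self.best if self.best is not None else self.paths[-1]


pool = OpponentPool(max_size=3, rng=random.Random(0))
for i in range(5):
    pool.add(f"snap_{i}.zip")  # snap_0 and snap_1 are evicted
```

The sampled path would then be handed to each worker through env.env_method("set_opponent_model", path), as described above.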

New CLI args: --load_opponent_model, --opponent_snapshot_freq (default 50k), --opponent_pool_size (default 5).
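
For reference, the new arguments could be declared roughly as follows. This is a sketch, not the parser in train_live.py: the build_parser helper, treating --selfplay as a boolean flag, and the None default for --load_opponent_model are assumptions. The two numeric defaults are the ones listed above.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the self-play options."""
    parser = argparse.ArgumentParser(description="Self-play options (sketch)")
    parser.add_argument("--selfplay", action="store_true")  # assumed to be a flag
    parser.add_argument("--load_opponent_model", type=str, default=None)
    parser.add_argument("--opponent_snapshot_freq", type=int, default=50_000)
    parser.add_argument("--opponent_pool_size", type=int, default=5)
    return parser


args = build_parser().parse_args(
    ["--selfplay", "--load_opponent_model", "ckpt.zip"]
)
```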

LiveTrainer seeds the initial frozen opponent across all subproc workers at startup when --selfplay and --load_opponent_model are both set. build_callback wraps LiveTrainingCallback and OpponentSnapshotCallback in a CallbackList when self-play is active.

train_curriculum.py — Opponent seeding for self-play phases

"load_opponent_model" is added to PATH_KEYS for relative-path resolution. merge_phase_config automatically sets load_opponent_model = previous_model_path for self-play phases that don't specify an explicit opponent, so the final curriculum checkpoint seeds both the learner initialization and the first entry in the frozen opponent pool.
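
The defaulting rule can be sketched in a few lines. The function signature and the plain dict merge here are assumptions and will differ from the real merge_phase_config in train_curriculum.py. The example only shows the "fill in the opponent when a self-play phase leaves it unset" behavior.

```python
def merge_phase_config(defaults, phase, previous_model_path):
    """Hypothetical simplification of the opponent-seeding rule."""
    cfg = {**defaults, **phase}  # phase values override defaults
    # Self-play phases without an explicit opponent inherit the previous
    # phase's checkpoint as their first frozen opponent.
    if cfg.get("selfplay") and not cfg.get("load_opponent_model"):
        cfg["load_opponent_model"] = previous_model_path
    return cfg


phase = {"name": "Self-Play Offense Finetune", "selfplay": True}
cfg = merge_phase_config({"num_players": 1}, phase, "models/defense_zone.zip")
```

The checkpoint path "models/defense_zone.zip" is a placeholder for whatever the preceding Defense Zone phase produced.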

curriculum/nhl94.json — Self-play phases appended

Two phases added after the existing Defense Zone phase:

{ "name": "Self-Play Offense Finetune", "rf": "SelfPlayOffenseFinetune",
  "state": "PenguinsVsSenators.AttackZone", "selfplay": true, "num_players": 2,
  "opponent_snapshot_freq": 50000, "opponent_pool_size": 5, "num_timesteps": 10000000 },
{ "name": "Self-Play Defense Finetune", "rf": "SelfPlayDefenseFinetune",
  "state": "PenguinsVsSenators.DefenseZone", "selfplay": true, "num_players": 2,
  "opponent_snapshot_freq": 50000, "opponent_pool_size": 5, "num_timesteps": 10000000 }

… opponent snapshot rotation, curriculum phases

Agent-Logs-Url: https://github.com/MatPoliquin/stable-retro-scripts/sessions/c2ccf5b5-8dfc-4051-879b-b79ad0786973

Co-authored-by: MatPoliquin <7024551+MatPoliquin@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add self-play implementation for NHL94" to "NHL94 self-play: asymmetric finetune modes, side-aware action processing, opponent snapshot rotation" Apr 1, 2026
Copilot AI requested a review from MatPoliquin April 1, 2026 13:16