NHL94 self-play: asymmetric finetune modes, side-aware action processing, opponent snapshot rotation #10
Draft
Copilot wants to merge 2 commits into
… opponent snapshot rotation, curriculum phases Agent-Logs-Url: https://github.com/MatPoliquin/stable-retro-scripts/sessions/c2ccf5b5-8dfc-4051-879b-b79ad0786973 Co-authored-by: MatPoliquin <7024551+MatPoliquin@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add self-play implementation for NHL94" to "NHL94 self-play: asymmetric finetune modes, side-aware action processing, opponent snapshot rotation" on Apr 1, 2026.
Implements Milestones 1–4 of the NHL94 self-play plan: curriculum bootstrapping into separate offense/defense finetune phases with a frozen team-2 opponent, terminal zero-sum rewards, and checkpoint-pool rotation. Team 1 is fixed as the learner throughout.
nhl94_rf.py: New finetune reward modes

Adds `SelfPlayOffenseFinetune` and `SelfPlayDefenseFinetune` to `_reward_function_map`. Both use purely terminal rewards (no dense shaping):

- `SelfPlayOffenseFinetune`: `+1` on team-1 goal; `-1` when puck exits attack zone or team-2 holds possession for `SELFPLAY_CONTROL_FRAMES=30` consecutive frames; `0` on timeout.
- `SelfPlayDefenseFinetune`: `+1` when puck clears defense zone or team-1 holds possession 30+ frames; `-1` on team-2 goal; `0` on timeout.

Per-episode control tracking is stored on `game_state` using the same `_sp_offense_ctx` / `_sp_defense_ctx` transient-field pattern already used by `rf_defensezone`.

nhl94_obs.py: Side-aware action processing + self-play API

Replaces shared `b_button_pressed / c_button_pressed / slapshot_frames_held` with a per-side dict. Button debounce and slapshot-hold logic is extracted into `_process_action(ac, side_state)`, called independently for each side. The dead `prev_state.Flip()` call (`prev_state` was never read after assignment) is removed.

New self-play API on the wrapper:

- `set_opponent_model(path)`: loads/clears a frozen `PPO` checkpoint via `PPO.load`.
- `compute_opponent_action(obs)`: queries the frozen policy; returns zero-action when no model is loaded.
- `combine_selfplay_actions(learner_ac, opponent_ac)`: concatenates both into a two-player action array.
- `selfplay_enabled` property: `True` when an opponent model is loaded.

When `selfplay_enabled` is `True`, `step()` queries the frozen opponent using the current team-1-perspective observation and routes its output through the opponent's independent action state before concatenating with the learner action.

train_live.py: Opponent snapshot callback + CLI args

Adds `OpponentSnapshotCallback`, which snapshots the current learner every `opponent_snapshot_freq` steps, maintains a bounded pool of up to `opponent_pool_size` checkpoints, and pushes a sampled opponent to all worker envs via `env.env_method("set_opponent_model", path)` using a 40/40/20 mixture (latest / random historical / best-known).

New CLI args: `--load_opponent_model`, `--opponent_snapshot_freq` (default 50k), `--opponent_pool_size` (default 5).

`LiveTrainer` seeds the initial frozen opponent across all subproc workers at startup when `--selfplay` and `--load_opponent_model` are both set. `build_callback` wraps `LiveTrainingCallback` and `OpponentSnapshotCallback` in a `CallbackList` when self-play is active.

train_curriculum.py: Opponent seeding for self-play phases

`"load_opponent_model"` added to `PATH_KEYS` for relative-path resolution. `merge_phase_config` automatically sets `load_opponent_model = previous_model_path` for self-play phases that don't specify an explicit opponent, so the final curriculum checkpoint seeds both learner init and the first frozen opponent pool entry.

curriculum/nhl94.json: Self-play phases appended

Two phases added after the existing `Defense Zone` phase:

    {
      "name": "Self-Play Offense Finetune",
      "rf": "SelfPlayOffenseFinetune",
      "state": "PenguinsVsSenators.AttackZone",
      "selfplay": true,
      "num_players": 2,
      "opponent_snapshot_freq": 50000,
      "opponent_pool_size": 5,
      "num_timesteps": 10000000
    },
    {
      "name": "Self-Play Defense Finetune",
      "rf": "SelfPlayDefenseFinetune",
      "state": "PenguinsVsSenators.DefenseZone",
      "selfplay": true,
      "num_players": 2,
      "opponent_snapshot_freq": 50000,
      "opponent_pool_size": 5,
      "num_timesteps": 10000000
    }
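For reviewers, the terminal reward rules listed for the offense finetune mode in nhl94_rf.py can be read as the sketch below. It is not the code in this PR: `OffenseCtx`, `offense_terminal_reward`, and the boolean inputs are hypothetical stand-ins for whatever `game_state` actually exposes, and only the `+1` / `-1` / `0` outcomes and `SELFPLAY_CONTROL_FRAMES = 30` come from the description.

```python
from dataclasses import dataclass

SELFPLAY_CONTROL_FRAMES = 30  # value quoted in the PR description


@dataclass
class OffenseCtx:
    """Hypothetical per-episode tracker (stands in for a transient game_state field)."""
    opp_control_frames: int = 0


def offense_terminal_reward(ctx, team1_scored, puck_in_attack_zone,
                            team2_has_puck, timed_out):
    """Return (reward, done) for one frame; reward is non-zero only on a terminal frame."""
    if team1_scored:
        return 1.0, True
    if not puck_in_attack_zone:
        return -1.0, True
    # Count consecutive frames of team-2 possession; any other frame resets the count.
    ctx.opp_control_frames = ctx.opp_control_frames + 1 if team2_has_puck else 0
    if ctx.opp_control_frames >= SELFPLAY_CONTROL_FRAMES:
        return -1.0, True
    if timed_out:
        return 0.0, True
    return 0.0, False
```

The defense mode would mirror this with the signs swapped: `+1` for a zone clear or sustained team-1 possession, `-1` for a team-2 goal.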
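The per-side action state described for nhl94_obs.py might look roughly like the following. This is an illustrative sketch only: the dict keys are taken from the description, but the button indices, the rising-edge debounce rule, and the name `process_action_sketch` are assumptions, not the PR's `_process_action`.

```python
B_IDX, C_IDX = 0, 1  # hypothetical positions of the B and C buttons in the action array


def new_side_state():
    """Fresh button/slapshot state for one side (keys taken from the PR description)."""
    return {"b_button_pressed": False, "c_button_pressed": False,
            "slapshot_frames_held": 0}


def process_action_sketch(ac, side_state):
    """Debounce B and track how long C is held, mutating only this side's state."""
    out = list(ac)
    # Assumed debounce rule: B only registers on the frame it goes from up to down.
    if ac[B_IDX] and side_state["b_button_pressed"]:
        out[B_IDX] = 0
    side_state["b_button_pressed"] = bool(ac[B_IDX])
    # Assumed slapshot rule: count consecutive frames with C held.
    if ac[C_IDX]:
        side_state["slapshot_frames_held"] += 1
    else:
        side_state["slapshot_frames_held"] = 0
    side_state["c_button_pressed"] = bool(ac[C_IDX])
    return out


# One independent state per side, so the opponent's presses never leak into the learner's.
action_state = {"learner": new_side_state(), "opponent": new_side_state()}
```

Keeping one dict per side is what allows the frozen opponent's output to be routed through its own debounce state before the two actions are concatenated.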
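The bounded pool and the 40/40/20 mixture used by `OpponentSnapshotCallback` can be sketched as below. The class `OpponentPool`, its methods, and the way "best-known" is recorded are hypothetical; the description does not say how the best checkpoint is chosen, so here the caller flags it explicitly.

```python
import random
from collections import deque


class OpponentPool:
    """Hypothetical bounded pool of checkpoint paths with 40/40/20 sampling."""

    def __init__(self, max_size=5, seed=None):
        self.paths = deque(maxlen=max_size)  # oldest snapshot is evicted first
        self.best = None                     # assumption: caller marks the best checkpoint
        self.rng = random.Random(seed)

    def add(self, path, is_best=False):
        self.paths.append(path)
        if is_best or self.best is None:
            self.best = path

    def sample(self):
        """Return a checkpoint path, or None while the pool is empty."""
        if not self.paths:
            return None
        r = self.rng.random()
        if r < 0.4:   # 40%: latest snapshot
            return self.paths[-1]
        if r < 0.8:   # 40%: uniformly random pool entry (may be the latest)
            return self.rng.choice(list(self.paths))
        # 20%: best-known; fall back to latest if it was evicted from the pool
        return self.best if self.best in self.paths else self.paths[-1]
```

In the callback, the sampled path would then be handed to the workers with `env.env_method("set_opponent_model", path)`, as the description states.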
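The opponent-seeding default in train_curriculum.py amounts to the rule sketched here. `seed_opponent` is a hypothetical helper written for illustration, not the PR's `merge_phase_config`; only the behaviour (default `load_opponent_model` to `previous_model_path` for self-play phases without an explicit opponent) is taken from the description.

```python
def seed_opponent(phase_cfg, previous_model_path):
    """Return a copy of the phase config with a default frozen opponent filled in."""
    merged = dict(phase_cfg)
    needs_default = merged.get("selfplay") and not merged.get("load_opponent_model")
    if needs_default and previous_model_path:
        # The previous phase's checkpoint becomes the first frozen opponent.
        merged["load_opponent_model"] = previous_model_path
    return merged


phase = {"name": "Self-Play Offense Finetune", "selfplay": True}
seeded = seed_opponent(phase, "models/defense_zone.zip")  # hypothetical path
```

An explicit `load_opponent_model` in the phase config is left untouched, and phases without `"selfplay": true` are returned unchanged.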