Add NHL94 self-play: asymmetric drill modes, opponent snapshot pool, curriculum integration #11
Draft
Copilot wants to merge 2 commits into
… pool rotation, curriculum integration)
Agent-Logs-Url: https://github.com/MatPoliquin/stable-retro-scripts/sessions/81585af8-7e24-4899-9d92-ce66beb56d32
Co-authored-by: MatPoliquin <7024551+MatPoliquin@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Add self-play implementation for NHL94 training" to "Add NHL94 self-play: asymmetric drill modes, opponent snapshot pool, curriculum integration" on Apr 1, 2026.
Implements Milestones 1–4 of the NHL94 self-play plan: side-aware action processing, frozen opponent execution inside the wrapper, terminal zero-sum drill rewards, and checkpoint-pool rotation — all without touching PPO's single-agent interface.
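The central trick in the summary above, running a frozen opponent inside the environment wrapper so PPO still sees an ordinary single-agent env, can be illustrated with a minimal pure-Python sketch. It assumes a hypothetical two-player env whose `step()` takes team 1's buttons followed by team 2's; `TwoPlayerEnvStub`, `SelfPlayWrapperSketch`, `set_opponent_policy`, and the `opponent_policy` callable are illustrative names, not the PR's code. A real wrapper would also mirror the observation so the opponent sees the rink from team 2's side.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Obs = List[float]
Policy = Callable[[Obs], Sequence[int]]


class TwoPlayerEnvStub:
    """Hypothetical env whose step() takes player 1's buttons followed by player 2's."""

    n_buttons = 12

    def __init__(self) -> None:
        self.last_action: List[int] = []

    def reset(self) -> Obs:
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action: Sequence[int]) -> Tuple[Obs, float, bool, dict]:
        self.last_action = list(action)  # recorded so the combined action can be inspected
        return [0.0, 0.0, 0.0, 0.0], 0.0, False, {}


class SelfPlayWrapperSketch:
    """Exposes a single-agent step() while a frozen opponent drives team 2."""

    def __init__(self, env: TwoPlayerEnvStub, opponent_policy: Optional[Policy] = None) -> None:
        self.env = env
        self.opponent_policy = opponent_policy
        self._last_obs: Obs = []

    @property
    def selfplay_enabled(self) -> bool:
        return self.opponent_policy is not None

    def set_opponent_policy(self, policy: Optional[Policy]) -> None:
        # Swapping the frozen opponent never touches the learner or its optimizer.
        self.opponent_policy = policy

    def reset(self) -> Obs:
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, learner_action: Sequence[int]) -> Tuple[Obs, float, bool, dict]:
        if self.selfplay_enabled:
            # The opponent acts on the latest observation; a real implementation
            # would mirror it so team 2 sees itself as the "home" side.
            opponent_action = list(self.opponent_policy(self._last_obs))
        else:
            opponent_action = [0] * self.env.n_buttons  # no buttons pressed for team 2
        combined = list(learner_action) + opponent_action  # team 1 first, team 2 second
        obs, reward, done, info = self.env.step(combined)
        self._last_obs = obs
        return obs, reward, done, info  # reward stays in team 1's (the learner's) frame
```

Because the learner only ever calls `step(learner_action)`, the PPO side needs no multi-agent changes; the opponent is just part of the environment dynamics.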
`nhl94_rf.py`: Dedicated finetune reward modes

- `SelfPlayOffenseFinetune`: attack-zone init; terminal reward of +1 when team 1 scores, -1 if team 2 secures ≥60 consecutive frames of possession or clears the puck, 0 on timeout
- `SelfPlayDefenseFinetune`: defense-zone init; terminal reward of +1 if team 1 clears past centre ice or holds possession for ≥60 frames, -1 on a team-2 goal, 0 on timeout
- `_selfplay_offense_ctrl` / `_selfplay_defense_ctrl` follow the existing `_defensezone_carry` pattern; both modes are registered in `_reward_function_map`

`nhl94_obs.py`: Side-aware action state + self-play interface

Replaces the three shared button-debounce fields with a per-side dict so the learner and opponent never corrupt each other's state:

- `_process_filtered_action(ac, side)` / `_process_multidiscrete_action(ac, side)` extracted as helpers
- `prev_state.Flip()` (was unreachable in the reward/obs path)
- Self-play interface: `set_opponent_model(path)`, `set_selfplay_role(role)`, `compute_opponent_action(obs)`, `combine_selfplay_actions(learner, opponent)`
- `step()` injects the frozen opponent's processed actions when `selfplay_enabled=True`; team 1 is fixed as the learner throughout Milestone 1

`train_live.py`: Opponent snapshot rotation

- New CLI flags: `--load_opponent_model`, `--selfplay_snapshot_freq`, `--selfplay_pool_size`
- `OpponentSnapshotCallback` saves a snapshot every N steps, maintains a bounded pool, and samples the next opponent with a 40% latest / 40% random historical / 20% oldest distribution (degrading gracefully for small pool sizes)
- `LiveTrainer` seeds the initial opponent at startup via `env_method("set_opponent_model", ...)` and attaches the callback when `--selfplay` is active

`train_curriculum.py`

- Adds `load_opponent_model` to `PATH_KEYS` so relative paths in the curriculum JSON are resolved correctly

`curriculum/nhl94.json`

- `selfplay: true` plus snapshot/pool config
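The terminal, zero-sum rules listed for `SelfPlayOffenseFinetune` and `SelfPlayDefenseFinetune` in `nhl94_rf.py` can be sketched as a small per-frame state machine. This is a hedged illustration, not the PR's implementation: `FrameEvents`, `TerminalDrillReward`, and the `max_frames` timeout of 900 are hypothetical, and the real code reads these conditions from game RAM rather than from a prepared struct.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FrameEvents:
    """Hypothetical per-frame summary of game state; field names are illustrative."""
    team1_goal: bool = False
    team2_goal: bool = False
    team1_possession: bool = False
    team2_possession: bool = False
    team1_cleared: bool = False  # team 1 moved the puck past centre ice
    team2_cleared: bool = False  # team 2 cleared the puck out of its own zone


class TerminalDrillReward:
    """Sparse zero-sum terminal reward from team 1's point of view.

    Team 2's reward is the negation, so only one value is computed.
    """

    def __init__(self, mode: str, possession_frames: int = 60, max_frames: int = 900) -> None:
        if mode not in ("offense", "defense"):
            raise ValueError(f"unknown drill mode: {mode!r}")
        self.mode = mode
        self.possession_frames = possession_frames
        self.max_frames = max_frames  # hypothetical timeout length
        self.reset()

    def reset(self) -> None:
        self.frames = 0
        self.streak = 0  # consecutive possession frames for the side that wins by holding

    def step(self, ev: FrameEvents) -> Tuple[float, bool]:
        """Return (reward, done); reward is nonzero only on the terminal frame."""
        self.frames += 1
        if self.mode == "offense":
            if ev.team1_goal:
                return 1.0, True
            self.streak = self.streak + 1 if ev.team2_possession else 0
            if ev.team2_cleared or self.streak >= self.possession_frames:
                return -1.0, True
        else:  # defense
            if ev.team2_goal:
                return -1.0, True
            self.streak = self.streak + 1 if ev.team1_possession else 0
            if ev.team1_cleared or self.streak >= self.possession_frames:
                return 1.0, True
        if self.frames >= self.max_frames:
            return 0.0, True  # timeout: neither side wins
        return 0.0, False
```

Keeping the reward terminal and symmetric is what lets the same scalar serve both learner and frozen opponent: whatever team 1 gains, team 2 loses.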
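Why the `nhl94_obs.py` change to a per-side dict matters can be shown with a small debouncer: if both teams shared one set of "previous press" and cooldown fields, the learner's press on a frame would suppress the opponent's (or the reverse). The sketch below assumes three hypothetical fields (`prev_shoot`, `prev_pass`, `cooldown`) and 0/1 side indexing; the PR's actual field names and button set are not shown in the description.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DebounceState:
    """Hypothetical stand-in for the three debounce fields (names are illustrative)."""
    prev_shoot: bool = False
    prev_pass: bool = False
    cooldown: int = 0


class SideAwareDebouncer:
    """Keeps one DebounceState per side so the two teams cannot clobber each other."""

    def __init__(self, cooldown_frames: int = 4) -> None:
        self.cooldown_frames = cooldown_frames
        # Assumed indexing: side 0 = team 1 (learner), side 1 = team 2 (frozen opponent).
        self.state: Dict[int, DebounceState] = {0: DebounceState(), 1: DebounceState()}

    def process(self, shoot: bool, pass_: bool, side: int) -> Tuple[bool, bool]:
        """Return which buttons actually fire this frame for the given side."""
        st = self.state[side]
        if st.cooldown > 0:
            st.cooldown -= 1
            fire_shoot = fire_pass = False
        else:
            # Fire only on the press edge, not while a button is held down.
            fire_shoot = shoot and not st.prev_shoot
            fire_pass = pass_ and not st.prev_pass
            if fire_shoot or fire_pass:
                st.cooldown = self.cooldown_frames
        st.prev_shoot, st.prev_pass = shoot, pass_
        return fire_shoot, fire_pass
```

In this sketch, team 2 can shoot on the same frame as team 1 because each side consults only its own entry in `self.state`.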
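The `OpponentSnapshotCallback` behaviour in `train_live.py` (snapshot every N steps, bounded pool, 40% latest / 40% random historical / 20% oldest) can be sketched independently of Stable-Baselines3. Assumptions, all hypothetical: `OpponentSnapshotPool` and `save_fn` are illustrative names; "random historical" here means any snapshot except the latest; "oldest" means the oldest snapshot still retained after eviction; and a pool of one simply returns that snapshot. The PR's exact small-pool handling may differ.

```python
import random
from collections import deque
from typing import Callable, Deque, Optional


class OpponentSnapshotPool:
    """Bounded snapshot pool with 40% latest / 40% random historical / 20% oldest sampling."""

    def __init__(self, snapshot_freq: int, pool_size: int, seed: Optional[int] = None) -> None:
        self.snapshot_freq = snapshot_freq
        self.pool: Deque[str] = deque(maxlen=pool_size)  # oldest on the left, evicted first
        self.rng = random.Random(seed)
        self._next_snapshot_at = snapshot_freq

    def on_step(self, num_timesteps: int, save_fn: Callable[[int], str]) -> Optional[str]:
        """Snapshot when due and return the next opponent path; otherwise return None."""
        if num_timesteps < self._next_snapshot_at:
            return None
        # Advance to the next multiple of snapshot_freq, even if several were skipped.
        self._next_snapshot_at = (num_timesteps // self.snapshot_freq + 1) * self.snapshot_freq
        self.pool.append(save_fn(num_timesteps))
        return self.sample()

    def sample(self) -> Optional[str]:
        n = len(self.pool)
        if n == 0:
            return None
        if n == 1:
            return self.pool[0]  # small pools degrade to the only snapshot available
        u = self.rng.random()
        if u < 0.4:
            return self.pool[-1]  # latest
        if u < 0.8:
            return self.pool[self.rng.randrange(n - 1)]  # any snapshot except the latest
        return self.pool[0]  # oldest still retained in the pool
```

With five snapshots, this scheme picks the latest about 40% of the time and the oldest about 30% (its own 20% plus a quarter of the historical draws). In an SB3 `BaseCallback._on_step`, `save_fn` could wrap `self.model.save(...)`, and a returned path could be pushed to the workers with `self.training_env.env_method("set_opponent_model", path)`.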
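Adding `load_opponent_model` to `PATH_KEYS` in `train_curriculum.py` matters because a relative opponent path in the curriculum JSON would otherwise be interpreted relative to whatever directory the trainer happens to run from. A minimal sketch of that kind of resolution follows. It assumes, hypothetically, that paths are anchored at the curriculum file's directory (the real script may use a different base) and that `resolve_stage_paths` is an illustrative helper. The real `PATH_KEYS` also holds keys this sketch omits.

```python
import os
from typing import Any, Dict

# Hypothetical subset: the real PATH_KEYS in train_curriculum.py contains other keys too.
PATH_KEYS = ("load_opponent_model",)


def resolve_stage_paths(stage: Dict[str, Any], curriculum_dir: str) -> Dict[str, Any]:
    """Return a copy of a stage config with relative path values anchored at curriculum_dir."""
    resolved = dict(stage)  # never mutate the parsed JSON in place
    for key in PATH_KEYS:
        value = resolved.get(key)
        if isinstance(value, str) and value and not os.path.isabs(value):
            resolved[key] = os.path.normpath(os.path.join(curriculum_dir, value))
    return resolved
```

Non-path keys such as `selfplay` pass through unchanged, and absolute paths are left alone.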