          Resets recurrent state after episode termination in RSL-RL play.py
          #3838
        
              Conversation
Greptile Overview
Greptile Summary
This PR fixes a critical bug in the RSL-RL evaluation pipeline where recurrent policy networks (e.g., LSTMs, GRUs) retained hidden states across episode boundaries during evaluation. The evaluation script (play.py) was missing the necessary logic to reset recurrent states when episodes terminated, causing residual memory from previous episodes to contaminate the policy's behavior at the start of new episodes. The fix captures the dones tensor from the environment step and passes it to policy_nn.reset(), which conditionally resets recurrent states for terminated environments. This brings the evaluation loop into parity with the training loop, which already handled this correctly. The changes include appropriate version bumping (patch increment to 0.47.2) and changelog documentation.
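For concreteness, the shape of the fix in the evaluation loop is roughly the following (a minimal sketch; how `policy_nn` is obtained and the `torch.inference_mode()` wrapper are assumptions based on the summary above, not copied from the diff):

```python
# Sketch of the relevant fragment of play.py after the fix (not the verbatim script).
policy = runner.get_inference_policy(device=env.device)
policy_nn = runner.alg.policy  # assumed attribute path to the actor-critic module

obs = env.get_observations()
while simulation_app.is_running():
    with torch.inference_mode():
        actions = policy(obs)
        # previously: obs, _, _, _ = env.step(actions)  (the dones were discarded)
        obs, _, dones, _ = env.step(actions)
        # clear LSTM/GRU hidden states for the environments that just terminated
        policy_nn.reset(dones)
```

Passing the per-environment `dones` mask, rather than resetting everything, matters in vectorized evaluation where only a subset of environments terminates on any given step.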
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added logic to capture the dones tensor and call policy_nn.reset(dones) to properly reset recurrent states between episodes during evaluation | 
| source/isaaclab/config/extension.toml | 5/5 | Incremented patch version from 0.47.1 to 0.47.2 to reflect the bug fix | 
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added release notes for version 0.47.2 documenting the recurrent policy state reset fix | 
Confidence score: 4/5
- This PR addresses a clear bug with a focused fix that aligns with established training behavior, making it safe to merge with minimal risk.
- Score is 4/5 because while the fix is correct and necessary, there's a minor concern: the reset() call may not be implemented or tested for all policy architectures, and the change assumes policy_nn always has a reset() method that accepts a dones tensor; this could potentially cause issues with non-recurrent policies or custom architectures if not handled properly.
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that the reset() method is available on all policy types used in evaluation, particularly ensuring non-recurrent policies handle this gracefully (a defensive variant is sketched after this list).
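If that compatibility is ever a concern, a cheap defensive guard is possible (a hypothetical variant for illustration, not what this PR does):

```python
# Hypothetical guard: only call reset() when the policy actually exposes it.
# Recurrent actor-critics clear their hidden states here; for plain MLP policies
# the method is typically a no-op or may be absent on custom architectures.
if hasattr(policy_nn, "reset"):
    policy_nn.reset(dones)
```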
Sequence Diagram
sequenceDiagram
    participant User
    participant main
    participant AppLauncher
    participant gym
    participant env
    participant policy
    participant policy_nn
    
    User->>main: Execute play.py
    main->>AppLauncher: Launch simulation app
    main->>gym: gym.make(task, cfg, render_mode)
    gym->>env: Create environment
    main->>env: wrap with RslRlVecEnvWrapper
    main->>main: Load runner and checkpoint
    main->>main: Get inference policy
    main->>main: Extract policy neural network
    main->>env: get_observations()
    env->>main: Return obs
    
    loop While simulation running
        main->>policy: policy(obs)
        policy->>main: Return actions
        main->>env: env.step(actions)
        env->>main: Return obs, rewards, dones, info
        Note over main,policy_nn: NEW: Reset recurrent states for terminated episodes
        main->>policy_nn: policy_nn.reset(dones)
        policy_nn->>main: Reset complete
    end
    
    main->>env: env.close()
    main->>User: Simulation complete
    3 files reviewed, no comments
@ClemensSchwarke can you take a look at this? I think the change looks good to me.
Thanks!
play.py
(branch updated: c09f7ee → 00ef36e)
Greptile Overview
Greptile Summary
This PR fixes a critical bug in the RSL-RL evaluation script (play.py) where recurrent neural network policies (LSTM/GRU) were not properly resetting their hidden states between episodes. The codebase uses the RSL-RL framework for reinforcement learning, where training and evaluation follow different code paths. During training, episode resets are handled automatically, but the evaluation script was missing this logic. The fix captures the dones signal from env.step() and calls policy_nn.reset(dones) to clear recurrent states when episodes terminate. This ensures each episode starts with a clean memory state, preventing information leakage across episode boundaries. Version metadata was updated from 0.47.2 to 0.47.3 to reflect this patch-level bug fix.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset logic after step, capturing dones and calling policy_nn.reset(dones) | 
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 for bug fix release | 
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry documenting the recurrent state reset fix for version 0.47.3 | 
Confidence score: 4/5
- This PR is safe to merge with minimal risk, addressing a specific bug in recurrent policy evaluation
- Score reflects a well-targeted fix with appropriate version bumping and documentation, though minor concerns exist about the interaction between policy_nn.reset() and the environment's reset mechanism, and about potential edge cases with partial resets in vectorized environments (illustrated in the sketch after this list)
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure the policy_nn.reset(dones) call is compatible with all policy network types and vectorized environment configurations
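To make the "partial reset in vectorized environments" concern concrete, a per-environment reset of recurrent state usually amounts to masking the hidden-state tensor with `dones`. The class below is an illustrative sketch under that assumption, not the RSL-RL implementation:

```python
import torch
import torch.nn as nn


class RecurrentStateHolder(nn.Module):
    """Illustrative container for GRU-style hidden states of num_envs parallel environments."""

    def __init__(self, num_envs: int, hidden_dim: int, num_layers: int = 1):
        super().__init__()
        # hidden state shaped (num_layers, num_envs, hidden_dim)
        self.hidden = torch.zeros(num_layers, num_envs, hidden_dim)

    def reset(self, dones=None):
        if dones is None:
            # full reset of every environment
            self.hidden.zero_()
        else:
            # zero only the slices belonging to environments that just terminated
            self.hidden[:, dones.bool(), :] = 0.0
```

With an all-False `dones` mask the masked assignment touches nothing, which is one reason the unconditional per-step call debated later in this thread stays cheap.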
Sequence Diagram
sequenceDiagram
    participant User
    participant play.py as play.py Script
    participant AppLauncher
    participant Gym as Gymnasium
    participant Env as Environment
    participant Wrapper as RslRlVecEnvWrapper
    participant Runner as OnPolicyRunner/DistillationRunner
    participant Policy as policy_nn (Actor-Critic)
    participant SimApp as Simulation App
    User->>play.py: Execute play.py with task config
    play.py->>AppLauncher: Initialize AppLauncher(args_cli)
    AppLauncher->>SimApp: Launch Omniverse app
    play.py->>play.py: Parse hydra config for env_cfg & agent_cfg
    play.py->>play.py: Determine checkpoint path (resume_path)
    play.py->>Gym: gym.make(task, cfg=env_cfg, render_mode)
    Gym->>Env: Create environment instance
    play.py->>Wrapper: RslRlVecEnvWrapper(env)
    play.py->>Runner: Create OnPolicyRunner/DistillationRunner
    play.py->>Runner: runner.load(resume_path)
    Runner->>Policy: Load policy checkpoint
    play.py->>Runner: policy = runner.get_inference_policy()
    play.py->>Env: obs = env.get_observations()
    
    loop Simulation Loop (while simulation_app.is_running())
        play.py->>Policy: actions = policy(obs)
        play.py->>Env: obs, _, dones, _ = env.step(actions)
        Note over play.py,Policy: Bug Fix: Reset recurrent states on termination
        play.py->>Policy: policy_nn.reset(dones)
        alt Video Recording Enabled
            play.py->>play.py: Increment timestep counter
            alt timestep == video_length
                play.py->>play.py: Break loop
            end
        end
    end
    
    play.py->>Env: env.close()
    play.py->>SimApp: simulation_app.close()
    3 files reviewed, 2 comments
(branch updated: 00ef36e → 26a7828)
Greptile Overview
Greptile Summary
This PR addresses a critical bug in the RSL-RL evaluation workflow where recurrent policy networks (LSTM/GRU) were not resetting hidden states after episode terminations during inference. The fix modifies play.py to capture the dones signal from env.step() and explicitly call policy_nn.reset(dones) to clear recurrent memory for terminated episodes. This ensures proper episode boundary handling during policy evaluation, preventing residual memory from previous episodes from affecting subsequent rollouts. The changes are minimal and focused: capturing the previously ignored fourth return value from env.step() and adding a single reset call. The fix aligns with standard RL evaluation practices where recurrent states should be cleared at episode boundaries, similar to how the training workflow in train.py likely handles resets through the OnPolicyRunner wrapper.
PR Description Notes:
- The PR checklist indicates tests were added, but no test files appear in the changed files list
- Documentation checkbox is unchecked despite this being a behavioral fix that may warrant documentation
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset after episode termination by capturing dones and calling policy_nn.reset(dones) | 
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning | 
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry for version 0.47.3 documenting the recurrent policy reset fix | 
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it fixes a clear bug in recurrent policy evaluation with a straightforward solution
- Score reflects that while the fix is correct, the unconditional call to reset() on every step may cause issues if feedforward policies don't implement a no-op reset method, and the PR claims tests were added but they're not visible in the changed files
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that all policy types in the codebase properly handle the reset() method call
Sequence Diagram
sequenceDiagram
    participant User
    participant PlayScript as play.py
    participant Env as Environment
    participant Policy as Policy Network
    participant Simulation as SimulationApp
    User->>PlayScript: Execute play.py
    PlayScript->>Env: gym.make(task)
    PlayScript->>Env: get_observations()
    Env-->>PlayScript: obs
    
    loop While simulation running
        PlayScript->>Policy: policy(obs)
        Policy-->>PlayScript: actions
        PlayScript->>Env: step(actions)
        Env-->>PlayScript: obs, _, dones, _
        PlayScript->>Policy: reset(dones)
        Note over Policy: Reset recurrent states<br/>for terminated episodes
        
        alt Video recording enabled
            PlayScript->>PlayScript: Check if video complete
            alt Video length reached
                PlayScript->>PlayScript: break
            end
        end
    end
    
    PlayScript->>Env: close()
    PlayScript->>Simulation: close()
    3 files reviewed, no comments
Looks good, thanks a lot!
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review. The developer has addressed previous feedback by updating the CHANGELOG category from "Changed" to "Fixed" (as appropriate for a bug fix), ensuring proper semantic versioning with the patch version increment to 0.47.3. The core fix remains targeted and minimal: capturing the dones signal from env.step() and passing it to policy_nn.reset(dones) to clear recurrent states selectively for terminated episodes. This ensures recurrent policies (LSTM/GRU) don't carry residual memory between episodes during evaluation in the RSL-RL play.py workflow.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added version 0.47.3 entry documenting the recurrent state reset bug fix under "Fixed" section | 
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Captures dones from env.step() and calls policy_nn.reset(dones) to reset recurrent states after episode termination | 
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning for bug fix | 
Confidence score: 4/5
- This PR is safe to merge with only minor considerations regarding non-recurrent policy performance
- Score reflects that the core bug fix is correct and properly documented, but one previous concern about calling reset unconditionally on feedforward policies remains unaddressed in the visible changes
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure policy_nn.reset() is either a no-op for feedforward policies or has negligible performance impact
3 files reviewed, no comments
(branch updated: 827b813 → 9b93780)
Greptile Overview
Greptile Summary
This PR fixes a bug in the RSL-RL evaluation workflow where recurrent policy hidden states were not being reset between episode terminations. The fix modifies play.py to capture the dones tensor from env.step() and pass it to policy_nn.reset(dones), ensuring that recurrent networks (LSTMs, GRUs) start each new episode with clean hidden states rather than carrying over residual memory from previous episodes. This aligns the evaluation behavior with the standard training loop pattern where policies are reset on episode boundaries. The change also includes a changelog entry documenting this fix as version 0.47.4.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added capture of dones from env.step() and call to policy_nn.reset(dones) to reset recurrent states on episode termination | 
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added version 0.47.4 changelog entry documenting the recurrent policy reset bug fix | 
Confidence score: 4/5
- This PR addresses a legitimate bug with a targeted fix that follows standard RL patterns for handling recurrent policies
- Score reflects that while the fix is correct, calling reset unconditionally on every step for all policy types may have minor performance implications, and the changelog entry uses "Fixed" correctly but has a minor date inconsistency (shows October instead of current date)
- The main file requiring attention is play.py, to verify that calling reset on every step is acceptable for all policy architectures in the codebase
Sequence Diagram
sequenceDiagram
    participant User
    participant play.py
    participant gym.make
    participant RslRlVecEnvWrapper
    participant OnPolicyRunner/DistillationRunner
    participant policy
    participant policy_nn
    participant env
    User->>play.py: Run play.py script
    play.py->>gym.make: Create environment
    gym.make-->>play.py: Return env
    play.py->>RslRlVecEnvWrapper: Wrap environment
    RslRlVecEnvWrapper-->>play.py: Return wrapped env
    play.py->>OnPolicyRunner/DistillationRunner: Initialize runner
    play.py->>OnPolicyRunner/DistillationRunner: load(resume_path)
    play.py->>OnPolicyRunner/DistillationRunner: get_inference_policy()
    OnPolicyRunner/DistillationRunner-->>play.py: Return policy
    play.py->>env: get_observations()
    env-->>play.py: Return obs
    
    loop While simulation running
        play.py->>policy: policy(obs)
        policy-->>play.py: Return actions
        play.py->>env: step(actions)
        env-->>play.py: Return obs, rewards, dones, info
        Note over play.py,policy_nn: Bug Fix: Reset recurrent states on termination
        play.py->>policy_nn: reset(dones)
        Note over policy_nn: Resets hidden states for<br/>terminated episodes
    end
    
    play.py->>env: close()
    2 files reviewed, 1 comment
- obs, _, _, _ = env.step(actions)
+ obs, _, dones, _ = env.step(actions)
+ # reset recurrent states for episodes that have terminated
+ policy_nn.reset(dones)
style: calling reset() on every step (even when dones is all False) may have overhead - consider only calling when dones.any() is True
@greptile-apps Does that make much difference? dones.any() adds overhead too, and for MLPs reset() is a no-op, so it might actually be slower.
You make a good point. Let me check what reset() actually does in the policy network. You're absolutely right. For MLPs, reset() is typically a no-op (just returns immediately), so calling dones.any() adds an extra operation that might not be worth it. The overhead of checking dones.any() could actually be comparable to or greater than just calling reset() unconditionally, especially since:
- dones.any() requires a reduction operation across all environments
- For non-recurrent policies, reset() is essentially free (just returns)
- Even for recurrent policies, reset()with an all-False mask should be fast
The current implementation is actually cleaner and the performance impact is likely negligible either way. I'll withdraw my comment - unconditionally calling reset() is the right approach here.
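For reference, the no-op case and the cost of the proposed guard look roughly like this (an illustrative sketch, not the actual RSL-RL classes):

```python
import torch.nn as nn


class FeedforwardActor(nn.Module):
    """Illustrative stateless MLP actor: reset() has nothing to clear."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ELU(), nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)

    def reset(self, dones=None):
        # no-op: an unconditional per-step call costs essentially nothing
        pass


# Guarding with `if dones.any(): policy_nn.reset(dones)` adds a reduction over all
# environments and, when `dones` lives on the GPU, a device-to-host synchronization
# to evaluate the Python `if`; that is often comparable to the no-op call it avoids.
```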
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest changes address prior review feedback by correcting the changelog category from "Changed" to "Fixed" to properly reflect that this is a bug fix rather than a feature change. The core fix remains unchanged: capturing the dones tensor from env.step() and calling policy_nn.reset(dones) after each step to clear recurrent hidden states when episodes terminate. This ensures recurrent policies (LSTM, GRU) don't carry over memory between episodes during evaluation in the RSL-RL play.py workflow.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Updated changelog category from "Changed" to "Fixed" for the recurrent policy reset bug fix entry | 
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset after env.step() by capturing dones tensor and calling policy_nn.reset(dones) | 
Confidence score: 4/5
- This PR is safe to merge with low risk as it addresses a critical bug in recurrent policy evaluation
- Score reflects that the fix correctly resolves the memory persistence issue and changelog now properly categorizes this as a bug fix, though minor performance optimization opportunities remain from previous reviews (calling reset unconditionally on every step even when no episodes terminate, and potential impacts on feedforward policies)
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify the reset mechanism works correctly across both recurrent and feedforward policy types in production scenarios
2 files reviewed, no comments
Description
This PR fixes an issue in recurrent policy evaluation where the recurrent state was not being reset after an episode termination. The missing reset caused residual memory to persist between episodes. The fix ensures that reset() is now called during evaluation in play.py for policy networks, including recurrent ones.
Fixes #3837
Type of change
Screenshots
N/A
Checklist
- I have run the pre-commit checks with ./isaaclab.sh --format
- I have updated the changelog and the corresponding version in the config/extension.toml file
- I have added my name to CONTRIBUTORS.md or my name already exists there