Conversation

Contributor

@bikcrum bikcrum commented Oct 26, 2025

Description

This PR fixes an issue in recurrent policy evaluation where the recurrent state was not being reset after an episode termination. The missing reset caused residual memory to persist between episodes. The fix ensures that reset() is called during evaluation in play.py for policy networks, including recurrent ones.
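For context, a minimal sketch of the patched evaluation loop (simplified: video recording, inference-mode handling, and app shutdown are omitted; policy and policy_nn are the handles obtained from the loaded runner, and reset() is assumed to be a no-op for non-recurrent actor-critics, as discussed later in this thread):

    # Sketch of the patched loop in play.py (assumptions noted above).
    obs = env.get_observations()
    while simulation_app.is_running():
        # run inference on the current observations
        actions = policy(obs)
        # step the environment; dones flags the environments whose episode just ended
        obs, _, dones, _ = env.step(actions)
        # reset recurrent states for terminated episodes (no-op for MLP policies)
        policy_nn.reset(dones)
    env.close()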

Fixes #3837

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

N/A

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation where necessary
  • My changes generate no new warnings
  • I have added tests verifying that recurrent states are correctly reset during evaluation
  • I have updated the changelog and corresponding version in the extension’s config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@bikcrum bikcrum requested a review from ooctipus as a code owner October 26, 2025 22:33
@github-actions github-actions bot added the bug (Something isn't working) and isaac-lab (Related to Isaac Lab team) labels on Oct 26, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes a critical bug in the RSL-RL evaluation pipeline where recurrent policy networks (e.g., LSTMs, GRUs) retained hidden states across episode boundaries during evaluation. The evaluation script (play.py) was missing the necessary logic to reset recurrent states when episodes terminated, causing residual memory from previous episodes to contaminate the policy's behavior at the start of new episodes. The fix captures the dones tensor from the environment step and passes it to policy_nn.reset(), which conditionally resets recurrent states for terminated environments. This brings the evaluation loop into parity with the training loop, which already handled this correctly. The changes include appropriate version bumping (patch increment to 0.47.2) and changelog documentation.

Important Files Changed

Filename Score Overview
scripts/reinforcement_learning/rsl_rl/play.py 4/5 Added logic to capture dones tensor and call policy_nn.reset(dones) to properly reset recurrent states between episodes during evaluation
source/isaaclab/config/extension.toml 5/5 Incremented patch version from 0.47.1 to 0.47.2 to reflect the bug fix
source/isaaclab/docs/CHANGELOG.rst 5/5 Added release notes for version 0.47.2 documenting the recurrent policy state reset fix

Confidence score: 4/5

  • This PR addresses a clear bug with a focused fix that aligns with established training behavior, making it safe to merge with minimal risk.
  • Score is 4/5 because while the fix is correct and necessary, there's a minor concern: the reset() call may not be implemented or tested for all policy architectures, and the change assumes policy_nn always has a reset() method that accepts a dones tensor - this could potentially cause issues with non-recurrent policies or custom architectures if not handled properly.
  • Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that the reset() method is available on all policy types used in evaluation, particularly ensuring non-recurrent policies handle this gracefully (a hypothetical guard is sketched below).
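If a custom policy class really did lack a reset() method, a defensive guard along the following lines would avoid an AttributeError. This is a hypothetical illustration of the concern, not part of the PR; as the later discussion notes, RSL-RL's feedforward actor-critic already implements reset() as a no-op, so play.py calls it unconditionally.

    # Hypothetical guard for policy classes that do not define reset();
    # not needed for RSL-RL's built-in actor-critic networks.
    if hasattr(policy_nn, "reset"):
        policy_nn.reset(dones)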

Sequence Diagram

sequenceDiagram
    participant User
    participant main
    participant AppLauncher
    participant gym
    participant env
    participant policy
    participant policy_nn
    
    User->>main: Execute play.py
    main->>AppLauncher: Launch simulation app
    main->>gym: gym.make(task, cfg, render_mode)
    gym->>env: Create environment
    main->>env: wrap with RslRlVecEnvWrapper
    main->>main: Load runner and checkpoint
    main->>main: Get inference policy
    main->>main: Extract policy neural network
    main->>env: get_observations()
    env->>main: Return obs
    
    loop While simulation running
        main->>policy: policy(obs)
        policy->>main: Return actions
        main->>env: env.step(actions)
        env->>main: Return obs, rewards, dones, info
        Note over main,policy_nn: NEW: Reset recurrent states for terminated episodes
        main->>policy_nn: policy_nn.reset(dones)
        policy_nn->>main: Reset complete
    end
    
    main->>env: env.close()
    main->>User: Simulation complete

3 files reviewed, no comments


@ooctipus
Collaborator

@ClemensSchwarke can you take a look at this? I think the change looks good to me.

Contributor

@Mayankm96 Mayankm96 left a comment


Thanks!

@Mayankm96 Mayankm96 changed the title from "Bugfix: Reset recurrent state after episode termination during evaluation in RSL-RL framework." to "Resets recurrent state after episode termination in RSL-RL play.py" on Oct 27, 2025
@Mayankm96 Mayankm96 moved this to In review in Isaac Lab Oct 27, 2025
@bikcrum bikcrum force-pushed the bugfix/recurrent-policy-reset-eval branch from c09f7ee to 00ef36e on October 27, 2025 21:16
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes a critical bug in the RSL-RL evaluation script (play.py) where recurrent neural network policies (LSTM/GRU) were not properly resetting their hidden states between episodes. The codebase uses the RSL-RL framework for reinforcement learning, where training and evaluation follow different code paths. During training, episode resets are handled automatically, but the evaluation script was missing this logic. The fix captures the dones signal from env.step() and calls policy_nn.reset(dones) to clear recurrent states when episodes terminate. This ensures each episode starts with a clean memory state, preventing information leakage across episode boundaries. Version metadata was updated from 0.47.2 to 0.47.3 to reflect this patch-level bug fix.

Important Files Changed

Filename Score Overview
scripts/reinforcement_learning/rsl_rl/play.py 4/5 Added recurrent state reset logic after step, capturing dones and calling policy_nn.reset(dones)
source/isaaclab/config/extension.toml 5/5 Bumped patch version from 0.47.2 to 0.47.3 for bug fix release
source/isaaclab/docs/CHANGELOG.rst 5/5 Added changelog entry documenting the recurrent state reset fix for version 0.47.3

Confidence score: 4/5

  • This PR is safe to merge with minimal risk, addressing a specific bug in recurrent policy evaluation
  • Score reflects a well-targeted fix with appropriate version bumping and documentation, though minor concerns exist about the interaction between policy_nn.reset() and the environment's reset mechanism, and potential edge cases with partial resets in vectorized environments
  • Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure the policy_nn.reset(dones) call is compatible with all policy network types and vectorized environment configurations

Sequence Diagram

sequenceDiagram
    participant User
    participant play.py as play.py Script
    participant AppLauncher
    participant Gym as Gymnasium
    participant Env as Environment
    participant Wrapper as RslRlVecEnvWrapper
    participant Runner as OnPolicyRunner/DistillationRunner
    participant Policy as policy_nn (Actor-Critic)
    participant SimApp as Simulation App

    User->>play.py: Execute play.py with task config
    play.py->>AppLauncher: Initialize AppLauncher(args_cli)
    AppLauncher->>SimApp: Launch Omniverse app
    play.py->>play.py: Parse hydra config for env_cfg & agent_cfg
    play.py->>play.py: Determine checkpoint path (resume_path)
    play.py->>Gym: gym.make(task, cfg=env_cfg, render_mode)
    Gym->>Env: Create environment instance
    play.py->>Wrapper: RslRlVecEnvWrapper(env)
    play.py->>Runner: Create OnPolicyRunner/DistillationRunner
    play.py->>Runner: runner.load(resume_path)
    Runner->>Policy: Load policy checkpoint
    play.py->>Runner: policy = runner.get_inference_policy()
    play.py->>Env: obs = env.get_observations()
    
    loop Simulation Loop (while simulation_app.is_running())
        play.py->>Policy: actions = policy(obs)
        play.py->>Env: obs, _, dones, _ = env.step(actions)
        Note over play.py,Policy: Bug Fix: Reset recurrent states on termination
        play.py->>Policy: policy_nn.reset(dones)
        alt Video Recording Enabled
            play.py->>play.py: Increment timestep counter
            alt timestep == video_length
                play.py->>play.py: Break loop
            end
        end
    end
    
    play.py->>Env: env.close()
    play.py->>SimApp: simulation_app.close()

3 files reviewed, 2 comments


@bikcrum bikcrum force-pushed the bugfix/recurrent-policy-reset-eval branch from 00ef36e to 26a7828 on October 27, 2025 21:19
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR addresses a critical bug in the RSL-RL evaluation workflow where recurrent policy networks (LSTM/GRU) were not resetting hidden states after episode terminations during inference. The fix modifies play.py to capture the dones signal from env.step() and explicitly call policy_nn.reset(dones) to clear recurrent memory for terminated episodes. This ensures proper episode boundary handling during policy evaluation, preventing residual memory from previous episodes from affecting subsequent rollouts. The changes are minimal and focused: capturing the previously ignored dones return value from env.step() and adding a single reset call. The fix aligns with standard RL evaluation practices where recurrent states should be cleared at episode boundaries, similar to how the training workflow in train.py likely handles resets through the OnPolicyRunner wrapper.

PR Description Notes:

  • The PR checklist indicates tests were added, but no test files appear in the changed files list
  • Documentation checkbox is unchecked despite this being a behavioral fix that may warrant documentation

Important Files Changed

Filename Score Overview
scripts/reinforcement_learning/rsl_rl/play.py 4/5 Added recurrent state reset after episode termination by capturing dones and calling policy_nn.reset(dones)
source/isaaclab/config/extension.toml 5/5 Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning
source/isaaclab/docs/CHANGELOG.rst 5/5 Added changelog entry for version 0.47.3 documenting the recurrent policy reset fix

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it fixes a clear bug in recurrent policy evaluation with a straightforward solution
  • Score reflects that while the fix is correct, the unconditional call to reset() on every step may cause issues if feedforward policies don't implement a no-op reset method, and the PR claims tests were added but they're not visible in the changed files
  • Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that all policy types in the codebase properly handle the reset() method call

Sequence Diagram

sequenceDiagram
    participant User
    participant PlayScript as play.py
    participant Env as Environment
    participant Policy as Policy Network
    participant Simulation as SimulationApp

    User->>PlayScript: Execute play.py
    PlayScript->>Env: gym.make(task)
    PlayScript->>Env: get_observations()
    Env-->>PlayScript: obs
    
    loop While simulation running
        PlayScript->>Policy: policy(obs)
        Policy-->>PlayScript: actions
        PlayScript->>Env: step(actions)
        Env-->>PlayScript: obs, _, dones, _
        PlayScript->>Policy: reset(dones)
        Note over Policy: Reset recurrent states<br/>for terminated episodes
        
        alt Video recording enabled
            PlayScript->>PlayScript: Check if video complete
            alt Video length reached
                PlayScript->>PlayScript: break
            end
        end
    end
    
    PlayScript->>Env: close()
    PlayScript->>Simulation: close()

3 files reviewed, no comments


Collaborator

@ClemensSchwarke ClemensSchwarke left a comment


Looks good, thanks a lot!

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review. The developer has addressed previous feedback by updating the CHANGELOG category from "Changed" to "Fixed" (as appropriate for a bug fix), ensuring proper semantic versioning with the patch version increment to 0.47.3. The core fix remains targeted and minimal: capturing the dones signal from env.step() and passing it to policy_nn.reset(dones) to clear recurrent states selectively for terminated episodes. This ensures recurrent policies (LSTM/GRU) don't carry residual memory between episodes during evaluation in the RSL-RL play.py workflow.

Important Files Changed

Filename Score Overview
source/isaaclab/docs/CHANGELOG.rst 5/5 Added version 0.47.3 entry documenting the recurrent state reset bug fix under "Fixed" section
scripts/reinforcement_learning/rsl_rl/play.py 4/5 Captures dones from env.step() and calls policy_nn.reset(dones) to reset recurrent states after episode termination
source/isaaclab/config/extension.toml 5/5 Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning for bug fix

Confidence score: 4/5

  • This PR is safe to merge with only minor considerations regarding non-recurrent policy performance
  • Score reflects that the core bug fix is correct and properly documented, but one previous concern about calling reset unconditionally on feedforward policies remains unaddressed in the visible changes
  • Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure policy_nn.reset() is either a no-op for feedforward policies or has negligible performance impact

3 files reviewed, no comments


@bikcrum bikcrum force-pushed the bugfix/recurrent-policy-reset-eval branch from 827b813 to 9b93780 on October 28, 2025 22:06
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes a bug in the RSL-RL evaluation workflow where recurrent policy hidden states were not being reset between episode terminations. The fix modifies play.py to capture the dones tensor from env.step() and pass it to policy_nn.reset(dones), ensuring that recurrent networks (LSTMs, GRUs) start each new episode with clean hidden states rather than carrying over residual memory from previous episodes. This aligns the evaluation behavior with the standard training loop pattern where policies are reset on episode boundaries. The change also includes a changelog entry documenting this fix as version 0.47.4.

Important Files Changed

Filename Score Overview
scripts/reinforcement_learning/rsl_rl/play.py 4/5 Added capture of dones from env.step() and call to policy_nn.reset(dones) to reset recurrent states on episode termination
source/isaaclab/docs/CHANGELOG.rst 5/5 Added version 0.47.4 changelog entry documenting the recurrent policy reset bug fix

Confidence score: 4/5

  • This PR addresses a legitimate bug with a targeted fix that follows standard RL patterns for handling recurrent policies
  • Score reflects that while the fix is correct, calling reset unconditionally on every step for all policy types may have minor performance implications, and the changelog entry uses "Fixed" correctly but has a minor date inconsistency (shows October instead of current date)
  • The main file requiring attention is play.py to verify that calling reset on every step is acceptable for all policy architectures in the codebase

Sequence Diagram

sequenceDiagram
    participant User
    participant play.py
    participant gym.make
    participant RslRlVecEnvWrapper
    participant OnPolicyRunner/DistillationRunner
    participant policy
    participant policy_nn
    participant env

    User->>play.py: Run play.py script
    play.py->>gym.make: Create environment
    gym.make-->>play.py: Return env
    play.py->>RslRlVecEnvWrapper: Wrap environment
    RslRlVecEnvWrapper-->>play.py: Return wrapped env
    play.py->>OnPolicyRunner/DistillationRunner: Initialize runner
    play.py->>OnPolicyRunner/DistillationRunner: load(resume_path)
    play.py->>OnPolicyRunner/DistillationRunner: get_inference_policy()
    OnPolicyRunner/DistillationRunner-->>play.py: Return policy
    play.py->>env: get_observations()
    env-->>play.py: Return obs
    
    loop While simulation running
        play.py->>policy: policy(obs)
        policy-->>play.py: Return actions
        play.py->>env: step(actions)
        env-->>play.py: Return obs, rewards, dones, info
        Note over play.py,policy_nn: Bug Fix: Reset recurrent states on termination
        play.py->>policy_nn: reset(dones)
        Note over policy_nn: Resets hidden states for<br/>terminated episodes
    end
    
    play.py->>env: close()

2 files reviewed, 1 comment


- obs, _, _, _ = env.step(actions)
+ obs, _, dones, _ = env.step(actions)
+ # reset recurrent states for episodes that have terminated
+ policy_nn.reset(dones)
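For readers unfamiliar with masked resets: a call like reset(dones) typically zeroes the recurrent hidden state only for the environments whose episode just ended, leaving the others untouched. A rough, hypothetical illustration of that pattern (not the actual RSL-RL implementation):

    import torch

    def reset_hidden(hidden: torch.Tensor, dones: torch.Tensor) -> None:
        # hidden: (num_layers, num_envs, hidden_size); dones: (num_envs,) bool or 0/1 flags
        # zero the hidden state only for environments flagged as done
        hidden[:, dones.bool(), :] = 0.0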
Contributor


style: calling reset() on every step (even when dones is all False) may have overhead - consider only calling when dones.any() is True

Contributor Author


@greptile-apps Does that make much difference? dones.any() adds overhead too, and for MLPs reset() is a no-op, so it might actually be slower.

Contributor


You make a good point. Let me check what reset() actually does in the policy network. You're absolutely right. For MLPs, reset() is typically a no-op (just returns immediately), so calling dones.any() adds an extra operation that might not be worth it. The overhead of checking dones.any() could actually be comparable to or greater than just calling reset() unconditionally, especially since:

  1. dones.any() requires a reduction operation across all environments
  2. For non-recurrent policies, reset() is essentially free (just returns)
  3. Even for recurrent policies, reset() with an all-False mask should be fast

The current implementation is actually cleaner and the performance impact is likely negligible either way. I'll withdraw my comment - unconditionally calling reset() is the right approach here.
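For reference, the two variants discussed above would look like this in the evaluation loop (the conditional form is the hypothetical alternative from the review comment, not what the PR merged):

    # Variant merged in this PR: call reset() every step; a no-op for feedforward policies.
    policy_nn.reset(dones)

    # Hypothetical alternative: skip the call when no episode ended. Note that dones.any()
    # is itself a reduction over all environments (and may force a host sync for GPU
    # tensors), so the saving is doubtful in practice.
    if dones.any():
        policy_nn.reset(dones)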

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest changes address prior review feedback by correcting the changelog category from "Changed" to "Fixed" to properly reflect that this is a bug fix rather than a feature change. The core fix remains unchanged: capturing the dones tensor from env.step() and calling policy_nn.reset(dones) after each step to clear recurrent hidden states when episodes terminate. This ensures recurrent policies (LSTM, GRU) don't carry over memory between episodes during evaluation in the RSL-RL play.py workflow.

Important Files Changed

Filename Score Overview
source/isaaclab/docs/CHANGELOG.rst 5/5 Updated changelog category from "Changed" to "Fixed" for the recurrent policy reset bug fix entry
scripts/reinforcement_learning/rsl_rl/play.py 4/5 Added recurrent state reset after env.step() by capturing dones tensor and calling policy_nn.reset(dones)

Confidence score: 4/5

  • This PR is safe to merge with low risk as it addresses a critical bug in recurrent policy evaluation
  • Score reflects that the fix correctly resolves the memory persistence issue and changelog now properly categorizes this as a bug fix, though minor performance optimization opportunities remain from previous reviews (calling reset unconditionally on every step even when no episodes terminate, and potential impacts on feedforward policies)
  • Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify the reset mechanism works correctly across both recurrent and feedforward policy types in production scenarios

2 files reviewed, no comments



Labels

bug (Something isn't working), isaac-lab (Related to Isaac Lab team)

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

[Bug Report] Recurrent Policy Memory Not Reset During Evaluation

4 participants