
Conversation

@ozhanozen
Contributor

Description

This PR introduces support for early stopping in Ray integration through the Stopper class. It enables trials to end sooner when they are unlikely to yield useful results, reducing wasted compute time and speeding up experimentation.

Previously, when running hyperparameter tuning with Ray integration, all trials would continue until the training configuration’s maximum iterations were reached, even if a trial was clearly underperforming. This wasn’t always efficient, since poor-performing trials could often be identified early on. With this PR, an optional early stopping mechanism is introduced, allowing Ray to terminate unpromising trials sooner and improve the overall efficiency of hyperparameter tuning.

The PR also includes a CartpoleEarlyStopper example in vision_cartpole_cfg.py. This serves as a reference implementation that halts a trial if the out_of_bounds metric doesn’t decrease after a set number of iterations. It’s meant as a usage example: users are encouraged to create their own custom stoppers tailored to their specific use cases.
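
For reference, a stopper of this kind is just a class implementing Ray Tune's Stopper interface. The minimal sketch below follows the spirit of the CartpoleEarlyStopper; the metric key, thresholds, and constructor arguments are illustrative assumptions rather than the exact shipped code.

from ray import tune

class CartpoleEarlyStopper(tune.Stopper):
    """Stop a trial that still goes out of bounds too often after a warm-up period."""

    def __init__(self, patience_iters: int = 20, out_of_bounds_threshold: float = 0.85):
        self._patience_iters = patience_iters
        self._threshold = out_of_bounds_threshold

    def __call__(self, trial_id: str, result: dict) -> bool:
        # Give the trial some iterations before judging it.
        if result["training_iteration"] < self._patience_iters:
            return False
        # Stop the trial if the out-of-bounds rate is still above the threshold.
        # NOTE: the metric key is an assumption; use whatever key the tensorboard logs expose.
        return result.get("out_of_bounds", 0.0) > self._threshold

    def stop_all(self) -> bool:
        # Only stop individual trials, never the whole experiment.
        return False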

Fixes #3270.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

Hi @garylvov, here is the PR as agreed.

I have noticed while re-testing that it sometimes works and sometimes doesn't. I think the issue is not the mechanism inside the PR, but rather that Isaac Sim does not always respond in time to the termination signal. When this happens, the next training does not start within process_response_timeout, which halts the main Ray process.

One idea could be to check whether a subprocess exits within a threshold and kill it if it doesn't, but I do not know how to do this with stoppers, since Ray normally handles this. Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted Isaac Sim subprocess, we kill it before starting a new subprocess no matter what?

@ozhanozen
Contributor Author

@garylvov, a small update:

I was wrong to say it sometimes works. By coincidence, the processes were ending anyway shortly after the early-stop signal, which is why I thought it sometimes worked.

I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might fail, e.g., due to a lack of GPU memory.

@garylvov
Collaborator

garylvov commented Aug 28, 2025

Hi, thanks for your further investigation.

Alternatively, would it be better to add a mechanism to execute_job such that if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?

I think this could work, but I would be a little worried about it being too "kill happy", and erroneously shutting down processes that were experiencing an ephemeral stalling period. Perhaps we can just wait for a few moments, and if it's still halted, then kill it.

However, I think it may be non-optimal design to have a new ray process try to do cleanup of other processes before starting, as opposed to a ray process doing clean-up on its own process after it finishes.

I have debugged it further and can confirm that even after a "trial" is marked as completed, the subprocess/training continues to the end. The following trial might fail, e.g., due to a lack of GPU memory.

I would assume that Ray could do this well enough out of the box to stop the rogue processes, but I guess that's wishful thinking ;)

I will do some testing of this too. I think you may be onto something with some mechanism related to the Ray stopper. Maybe we can override some sort of cleanup method to aggressively SIGKILL the PID recovered by execute_job.

@ozhanozen
Contributor Author

I think this could work, but I would be a little worried about it being too "kill happy", and erroneously shutting down processes that were experiencing an ephemeral stalling period. Perhaps we can just wait for a few moments, and if it's still halted, then kill it.

However, I think it may be non-optimal design to have a new ray process try to do cleanup of other processes before starting, as opposed to a ray process doing clean-up on its own process after it finishes.

I agree. Nevertheless, I still wanted to try to kill it the following way:

import subprocess

self.proc.terminate()  # ask the process to exit gracefully first (SIGTERM)
try:
    self.proc.wait(timeout=20)  # give it up to 20 s to shut down on its own
except subprocess.TimeoutExpired:
    self.proc.kill()  # escalate to SIGKILL if it is still running
    self.proc.wait()  # reap the process

but it did not work. Did you manage to get a lead on what might work in a robust way?

@garylvov
Collaborator

garylvov commented Sep 9, 2025

Hi @ozhanozen, I have yet another workaround that may just work ;)

I think we could add

parser.add_argument("--ray-proc-id", "-rid", type=int, default=None,
                    help="Automatically configured by Ray integration, otherwise None.")

to the training scripts (RSL RL, RL Games, SKRL, etc.).

Then, for each job submitted in util, we can assign an extra id. I haven't implemented it yet, but I've tested it manually with the following:

./IsaacLab/isaaclab.sh -p IsaacLab/scripts/reinforcement_learning/rl_games/train.py --task Isaac-Cartpole-v0 --headless -rid 31415

Then, the following has proven to be effective at shutting down all processes.

pkill -9 -f "rid 31415" && sleep 5

I think this will be effective; for example, a mismatched id like the one below doesn't kill the process:

pkill -9 -f "rid 31414"  # does nothing, proves that kill commands wouldn't affect unrelated running training

Now we just need to add some sort of exit hook to the trials to kill the assigned ray proc id.
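
To make that concrete, one possible wiring (the function name and RID range here are hypothetical, not the final implementation) would be to generate a random RID per trial and append it to the training command before execute_job launches it:

import random

def append_ray_proc_id(train_cmd: str) -> tuple[str, int]:
    """Attach a unique -rid flag so every subprocess of this run can later be matched by pkill."""
    rid = random.randint(0, 2**31 - 1)
    return f"{train_cmd} -rid {rid}", rid

cmd, rid = append_ray_proc_id(
    "./isaaclab.sh -p scripts/reinforcement_learning/rl_games/train.py"
    " --task Isaac-Cartpole-v0 --headless"
)
# On early stop, everything belonging to this run can then be killed with: pkill -9 -f "rid <rid>"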

@garylvov
Collaborator

garylvov commented Sep 9, 2025

We could maybe use https://docs.ray.io/en/latest/tune/api/callbacks.html to have each trial clean itself up.
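
Something along these lines might serve as that per-trial cleanup. This is only a sketch: it assumes the RID is stored in the trial config under a "ray_proc_id" key and reuses the pkill pattern from the manual test above.

import subprocess
import time

from ray.tune import Callback


class ProcessCleanupCallback(Callback):
    """Force-kill any leftover Isaac Sim processes belonging to a finished or errored trial."""

    def _cleanup_trial(self, trial):
        rid = trial.config.get("ray_proc_id")
        if rid is None:
            return
        # pkill -f matches against the full command line, so this only hits
        # processes that were launched with "-rid <rid>".
        subprocess.run(["pkill", "-9", "-f", f"rid {rid}"], check=False)
        time.sleep(5)  # give the OS a moment to reap the processes and free GPU memory

    def on_trial_complete(self, iteration, trials, trial, **info):
        self._cleanup_trial(trial)

    def on_trial_error(self, iteration, trials, trial, **info):
        self._cleanup_trial(trial)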

@garylvov
Collaborator

garylvov commented Sep 9, 2025

Actually, now that I've played with this more, I think it's important that we SIGKILL instead of SIGINT.

Maybe

os.kill(self.proc.pid, signal.SIGKILL)  # requires the os and signal modules to be imported

would work without the extra rid stuff

@garylvov
Collaborator

garylvov commented Sep 10, 2025

I looked into this even more, and I believe the RID is helpful as there are so many related processes for a single isaac training session:

[screenshot: process list showing the many processes spawned by a single Isaac training session]

I believe I have a minimal version that kind of works; pushing progress now.

@garylvov
Collaborator

garylvov commented Sep 10, 2025

I think it is better at killing old jobs, but it keeps on running into log extraction errors after terminating processes. Please let me know if you can figure out why!

@github-actions github-actions bot added the isaac-lab Related to Isaac Lab team label Sep 16, 2025
@ozhanozen
Contributor Author

Hi @garylvov, I just had the opportunity to check this; sorry for the delay.

I had encountered two different issues (one regarding rl_games at #3531, one about Ray initialization at #3533), which might be causing the log extraction errors you faced. If I combine the fixes from both of these PRs with the current PR, I can do early stopping without encountering any further errors. Could you also try again with these?

Note: You had also used self.proc.terminate(9), which I believe is an invalid call, so I reverted it. Nevertheless, the self.proc.kill() that follows self.proc.terminate() should already send the kill signal you wanted.

@garylvov
Collaborator

Awesome, thank you so much for figuring it out!

Sounds good about the terminate issue, thank you for spotting that.

I'm excited to try this soon!

@garylvov garylvov marked this pull request as ready for review September 23, 2025 18:20
@garylvov garylvov requested a review from ooctipus as a code owner September 23, 2025 18:20
Collaborator

@garylvov garylvov left a comment


Hi @kellyguo11,

This LGTM, just #3531 and #3533 need to be merged first

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR adds optional early stopping support to IsaacLab's Ray hyperparameter tuning integration. The implementation allows users to define custom Stopper classes (following Ray's tune.Stopper interface) to terminate underperforming trials early, reducing wasted compute time.

The core changes:

  • Adds a --ray-proc-id argument to all training scripts (rl_games, rsl_rl, sb3, skrl) for process identification
  • Creates a ProcessCleanupCallback in tuner.py that forcibly terminates trial processes using SIGKILL
  • Combines custom stoppers with the existing LogExtractionErrorStopper via CombinedStopper
  • Provides a reference implementation (CartpoleEarlyStopper) that stops trials when cart out-of-bounds exceeds 85% after 20 iterations

The feature integrates with IsaacLab's existing Ray workflow infrastructure (tuner.py, util.py) while maintaining backward compatibility: when no stopper is specified, behavior remains unchanged.

Important Files Changed

  • scripts/reinforcement_learning/ray/tuner.py (score 3/5): Implements core early stopping logic with ProcessCleanupCallback using SIGKILL, combines custom stoppers with the error stopper, adds RID-based process tracking, and increases MAX_LOG_EXTRACTION_ERRORS from 2 to 10
  • scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py (score 4/5): Adds CartpoleEarlyStopper reference implementation that stops trials when the out_of_bounds metric exceeds 0.85 after 20 iterations
  • scripts/reinforcement_learning/rl_games/train.py (score 5/5): Adds --ray-proc-id argument to enable Ray process identification for early stopping
  • scripts/reinforcement_learning/rsl_rl/train.py (score 4/5): Adds --ray-proc-id argument but doesn't use it within the script itself
  • scripts/reinforcement_learning/sb3/train.py (score 4/5): Adds --ray-proc-id argument but doesn't use it within the script itself
  • scripts/reinforcement_learning/skrl/train.py (score 5/5): Adds --ray-proc-id argument to enable Ray process identification for early stopping
  • scripts/reinforcement_learning/ray/util.py (score 5/5): Extends argument processing to support both single-dash and double-dash argument prefixes

Confidence score: 3/5

  • This PR introduces complex process management that requires careful testing, particularly around the aggressive SIGKILL cleanup mechanism and the interaction between multiple trials running concurrently.
  • Score reduced due to: (1) aggressive process cleanup using pkill -9 which may leave GPU resources or simulation state in inconsistent states; (2) reliance on string matching for process identification (matching "rid {RID}" in command line) which could be fragile if command formatting changes; (3) the --ray-proc-id argument is added to training scripts but never actually used within them, requiring verification that the RID is properly passed through to subprocesses for pkill to match; (4) increased MAX_LOG_EXTRACTION_ERRORS from 2 to 10 suggests potential cleanup side effects that warrant monitoring; (5) no tests added despite introducing critical process lifecycle management logic.
  • Pay close attention to scripts/reinforcement_learning/ray/tuner.py, particularly the ProcessCleanupCallback implementation and the RID-based process matching logic, as these are most likely to cause issues with orphaned processes or premature trial termination.

Sequence Diagram

sequenceDiagram
    participant User
    participant tuner.py
    participant CartpoleEarlyStopper
    participant CombinedStopper
    participant IsaacLabTuneTrainable
    participant util.execute_job
    participant train.py
    participant Tensorboard

    User->>tuner.py: Run tuner with --stopper CartpoleEarlyStopper
    tuner.py->>CartpoleEarlyStopper: Instantiate stopper
    tuner.py->>CombinedStopper: Create with LogExtractionErrorStopper + CartpoleEarlyStopper
    tuner.py->>IsaacLabTuneTrainable: Create trainable instances
    
    loop For each trial
        IsaacLabTuneTrainable->>util.execute_job: Start training process
        util.execute_job->>train.py: Launch training workflow
        train.py->>Tensorboard: Write metrics to logs
        
        loop Training steps
            IsaacLabTuneTrainable->>Tensorboard: Load tensorboard logs
            Tensorboard-->>IsaacLabTuneTrainable: Return metrics dict
            IsaacLabTuneTrainable->>tuner.py: Report result with metrics
            tuner.py->>CombinedStopper: Check stop conditions
            CombinedStopper->>CartpoleEarlyStopper: __call__(trial_id, result)
            CartpoleEarlyStopper->>CartpoleEarlyStopper: Check iter >= 20 and out_of_bounds > 0.85
            alt Should stop trial
                CartpoleEarlyStopper->>CartpoleEarlyStopper: Add trial_id to _bad_trials
                CartpoleEarlyStopper-->>CombinedStopper: Return True
                CombinedStopper-->>tuner.py: Stop trial
                tuner.py->>ProcessCleanupCallback: Cleanup trial process
                ProcessCleanupCallback->>util.execute_job: pkill training process
            else Continue trial
                CartpoleEarlyStopper-->>CombinedStopper: Return False
                CombinedStopper-->>tuner.py: Continue trial
            end
        end
    end
    
    tuner.py->>User: Training complete

7 files reviewed, 2 comments


Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR introduces optional early stopping for Ray hyperparameter tuning through custom Stopper classes, allowing trials to terminate early when they're unlikely to yield useful results.

Key changes:

  • Added --stopper CLI argument to tuner.py for loading custom stopper classes
  • Integrated stoppers with CombinedStopper alongside existing LogExtractionErrorStopper
  • Implemented ProcessCleanupCallback to kill trial processes using pkill with RID-based pattern matching
  • Added -rid (ray-proc-id) argument to all training scripts for process identification
  • Updated util.py to support single-dash argument format (-rid)
  • Included CartpoleEarlyStopper example that stops trials with high out_of_bounds rates
  • Increased MAX_LOG_EXTRACTION_ERRORS from 2 to 10

Critical issue found:

  • The pkill pattern in ProcessCleanupCallback uses "rid {value}" but the actual command line uses "-rid {value}", causing process cleanup to fail

Confidence Score: 2/5

  • This PR has a critical bug in process cleanup that will prevent early stopping from working correctly.
  • The pkill pattern bug in ProcessCleanupCallback._cleanup_trial (line 223) will prevent proper process termination when trials are stopped early. The pattern searches for "rid {value}" but processes will have "-rid {value}" in their command line. This means stopped trials won't be cleaned up properly, potentially leaving zombie processes and wasting resources - the opposite of what early stopping aims to achieve. Once this critical bug is fixed, the implementation is otherwise sound.
  • Pay close attention to scripts/reinforcement_learning/ray/tuner.py - the process cleanup logic must be fixed before merge.

Important Files Changed

File Analysis

  • scripts/reinforcement_learning/ray/tuner.py (score 3/5): Added early stopping support with custom stopper integration, process cleanup callback, and RID-based process tracking. Contains critical pkill pattern bug.
  • scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py (score 4/5): Added example CartpoleEarlyStopper demonstrating early stopping based on the out_of_bounds metric. Hardcoded thresholds should be configurable.
  • scripts/reinforcement_learning/ray/util.py (score 5/5): Updated argument parsing to support the -rid flag in addition to -- prefixed arguments.

Sequence Diagram

sequenceDiagram
    participant User
    participant Tuner as tuner.py
    participant Ray as Ray Tune
    participant Stopper as Custom Stopper
    participant Trainable as IsaacLabTuneTrainable
    participant Training as train.py
    participant Callback as ProcessCleanupCallback

    User->>Tuner: Start tuning with --stopper flag
    Tuner->>Tuner: Load custom stopper class
    Tuner->>Tuner: Instantiate stopper
    Tuner->>Tuner: Create CombinedStopper with LogExtractionErrorStopper
    Tuner->>Tuner: Generate random RID for trial
    Tuner->>Ray: Configure Tuner with stopper & callback
    
    Ray->>Trainable: Start trial with config (includes RID)
    Trainable->>Training: Execute train.py with -rid parameter
    Training-->>Trainable: Start training process
    
    loop Every step
        Trainable->>Training: Read tensorboard logs
        Training-->>Trainable: Return metrics
        Trainable->>Ray: Report metrics
        Ray->>Stopper: Check stopping criteria
        Stopper-->>Ray: Return stop decision
        
        alt Should stop trial
            Ray->>Trainable: Stop trial
            Ray->>Callback: on_trial_complete/on_trial_error
            Callback->>Callback: pkill -9 -f "rid {RID}"
            Callback-->>Ray: Cleanup complete
        end
    end
    
    Ray-->>User: Return tuning results

7 files reviewed, 5 comments


Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR introduces early stopping support for Ray-based hyperparameter tuning, allowing trials to terminate early when they show poor performance. The implementation includes a ProcessCleanupCallback that forcefully terminates trial processes using pkill, a CombinedStopper mechanism that merges the existing LogExtractionErrorStopper with optional custom stoppers, and an example CartpoleEarlyStopper that demonstrates stopping trials based on performance metrics.

Key changes:

  • Added --stopper CLI argument to optionally load custom stopper classes from config files
  • Implemented ProcessCleanupCallback to clean up processes when trials complete or error out
  • Integrated stoppers using Ray's CombinedStopper to combine multiple stop conditions
  • Generated unique random RID (ray process ID) for each trial to enable process-specific cleanup
  • Updated util.py to support both --arg and -arg command-line argument formats
  • Added -rid argument to all training scripts (rl_games, rsl_rl, sb3, skrl) for process identification
  • Increased MAX_LOG_EXTRACTION_ERRORS from 2 to 10 to be more tolerant of transient log extraction issues

The implementation addresses issue #3270 by enabling efficient hyperparameter tuning where poorly-performing trials can be terminated early rather than running to completion.

Confidence Score: 4/5

  • This PR is safe to merge with minor risks related to process cleanup reliability
  • The implementation is mostly sound with a well-designed architecture for early stopping. However, there's one confirmed bug in the process cleanup pattern that prevents processes from being killed properly (pkill pattern mismatch), and some style improvements could be made. The core functionality of the stopper integration is correct, and the changes to training scripts are minimal and safe.
  • scripts/reinforcement_learning/ray/tuner.py requires attention due to the pkill pattern bug that prevents proper process cleanup

Important Files Changed

File Analysis

  • scripts/reinforcement_learning/ray/tuner.py (score 4/5): Added early stopping support with stopper integration, process cleanup callback, and random RID generation for trial cleanup
  • scripts/reinforcement_learning/ray/util.py (score 5/5): Fixed command argument parsing to support both --arg and -arg formats for the RID parameter
  • scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py (score 4/5): Added CartpoleEarlyStopper example that stops trials when the out_of_bounds metric exceeds a threshold

Sequence Diagram

sequenceDiagram
    participant User
    participant Tuner
    participant Ray
    participant IsaacLabTuneTrainable
    participant TrainingProcess
    participant Stopper
    participant ProcessCleanupCallback

    User->>Tuner: Start tuning with --stopper arg
    Tuner->>Tuner: Load stopper class from config
    Tuner->>Tuner: Generate random RID for each trial
    Tuner->>Ray: Initialize with CombinedStopper
    Note over Ray: CombinedStopper includes LogExtractionErrorStopper + custom stopper
    
    loop For each trial
        Ray->>IsaacLabTuneTrainable: Start trial with config (includes RID)
        IsaacLabTuneTrainable->>TrainingProcess: Execute training command with -rid arg
        TrainingProcess-->>IsaacLabTuneTrainable: Stream metrics via tensorboard logs
        
        loop During training
            IsaacLabTuneTrainable->>IsaacLabTuneTrainable: Load tensorboard metrics
            IsaacLabTuneTrainable->>Ray: Report metrics
            Ray->>Stopper: Check if trial should stop
            Stopper-->>Ray: Return stop decision
            
            alt Trial should stop
                Ray->>IsaacLabTuneTrainable: Stop trial
                Ray->>ProcessCleanupCallback: on_trial_complete/on_trial_error
                ProcessCleanupCallback->>ProcessCleanupCallback: pkill -9 -f "rid {RID}"
                ProcessCleanupCallback->>TrainingProcess: Kill process
            end
        end
    end
    
    Ray-->>User: Return tuning results

7 files reviewed, no comments


Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR adds early stopping support to Ray hyperparameter tuning through the Stopper class integration. The implementation allows trials to terminate early when they're unlikely to yield useful results, improving computational efficiency.

Key Changes:

  • Adds ProcessCleanupCallback to properly terminate trial processes when stopped
  • Integrates custom stoppers with CombinedStopper alongside existing LogExtractionErrorStopper
  • Adds --ray-proc-id (-rid) argument to all training scripts for process identification
  • Updates util.py to handle short-form arguments starting with single dash
  • Includes CartpoleEarlyStopper as a reference implementation example
  • Increases MAX_LOG_EXTRACTION_ERRORS from 2 to 10

Issues Found:

  • The pkill pattern in ProcessCleanupCallback uses rid {value} instead of -rid {value}, which works but is less precise and could cause false matches
  • The stopper validation only accepts uninstantiated classes, which may be limiting if users want to pass pre-configured stopper instances

Confidence Score: 3/5

  • This PR is mostly safe to merge with one process cleanup issue that should be fixed
  • The early stopping implementation is well-structured and follows Ray's patterns. However, the pkill pattern bug could cause issues in production with false positive matches, and the 5-second blocking sleep may impact performance during high-throughput trial scheduling. The core logic is sound and the feature addresses a real need.
  • Pay close attention to scripts/reinforcement_learning/ray/tuner.py - the ProcessCleanupCallback needs the pkill pattern fixed

Important Files Changed

File Analysis

  • scripts/reinforcement_learning/ray/tuner.py (score 2/5): Adds early stopping support with ProcessCleanupCallback and stopper integration; contains critical pkill pattern bug and minor validation issue
  • scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py (score 4/5): Adds CartpoleEarlyStopper example implementation with hardcoded threshold values
  • scripts/reinforcement_learning/ray/util.py (score 5/5): Updates command parsing to handle short-form arguments starting with a single dash

Sequence Diagram

sequenceDiagram
    participant User
    participant Tuner
    participant Ray
    participant IsaacLabTuneTrainable
    participant TrainingProcess
    participant Stopper
    participant ProcessCleanupCallback

    User->>Tuner: Start tuning with optional stopper
    Tuner->>Tuner: Load config and stopper class
    Tuner->>Stopper: Instantiate stopper()
    Tuner->>Tuner: Create CombinedStopper with LogExtractionErrorStopper + custom stopper
    Tuner->>Ray: Configure tuner with stoppers and callbacks
    Ray->>IsaacLabTuneTrainable: Schedule trial with RID
    IsaacLabTuneTrainable->>TrainingProcess: Execute training command with -rid
    
    loop Every step
        IsaacLabTuneTrainable->>TrainingProcess: Read tensorboard logs
        TrainingProcess-->>IsaacLabTuneTrainable: Return metrics
        IsaacLabTuneTrainable->>Ray: Report metrics
        Ray->>Stopper: Check if trial should stop
        alt Should stop
            Stopper-->>Ray: Return True
            Ray->>ProcessCleanupCallback: on_trial_complete/on_trial_error
            ProcessCleanupCallback->>TrainingProcess: pkill -9 -f "rid {RID}"
            ProcessCleanupCallback->>ProcessCleanupCallback: sleep(5)
        else Continue
            Stopper-->>Ray: Return False
        end
    end
    
    Ray-->>Tuner: All trials complete
    Tuner-->>User: Results

7 files reviewed, no comments


)
parser.add_argument("--export_io_descriptors", action="store_true", default=False, help="Export IO descriptors.")
parser.add_argument(
"--ray-proc-id", "-rid", type=int, default=None, help="Automatically configured by Ray integration, otherwise None."
Contributor


What happens if a user sets this by accident? If it's meant to be configured directly by Ray, does it have to be exposed as a user-level command-line argument?

Collaborator

@garylvov garylvov Nov 7, 2025


If a user sets it by accident, nothing different will happen than if the user didn't set it. This is a no-op in terms of functionality outside of the Ray stuff.

This is added here because a single ./isaaclab.sh -p train.py starts many processes, all with different PIDs. For example, the following shows the many PIDs associated with one single Isaac run:
[screenshot: process list showing the many PIDs associated with a single Isaac run]

When using Python's subprocess to recover the PID, only one PID is returned out of the many subprocesses created by a single Isaac train command, so it's hard to completely kill a run programmatically. The RID is used to correlate which group of subprocesses belongs to a single Isaac Lab run, so that a single run can be pkilled by its unique RID without affecting unrelated runs on the same machine that have a different RID. This is critical to our early stopping functionality; without it, we couldn't find a good way to make sure the run was killed, which resulted in hanging processes blocking the GPU. For more context, see: #3276 (comment) and #3276 (comment)

I tried to configure this without the RID (using the Isaac command itself as a unique identifier, or starting the command with RAY_PROC_ID= in a bash-like way so that the RID isn't exposed to the user), but I couldn't get it to work. Although the RID is not as clean as it could be, I think using another method to ensure runs are killed would require a significant refactor of the training scripts (mentioned in #3546 (comment)), or significantly more investigation into how best to enable early stopping.

@kellyguo11 kellyguo11 moved this to In progress in Isaac Lab Nov 7, 2025
@garylvov garylvov requested a review from kellyguo11 November 7, 2025 20:19
@kellyguo11 kellyguo11 merged commit 5cb4728 into isaac-sim:main Nov 11, 2025
8 of 9 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in Isaac Lab Nov 11, 2025