[Bug Report] GPU resources are not freed correctly after stopping bad trials with ASHAScheduler #3332

@PierrePeng

Describe the bug

I'm using the Ray integration for hyperparameter optimization (HPO) with ASHAScheduler instead of the default FIFO scheduler. After the scheduler stops some bad trials and starts new ones, the processes of the old trials do not exit correctly, so the GPU resources they hold are never freed.
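To confirm the symptom, it can help to list which PIDs still hold GPU memory after a trial has been stopped. The following is a minimal diagnostic sketch, not part of Isaac Lab: the helper names `parse_compute_apps` and `gpu_compute_processes` are hypothetical, and it assumes `nvidia-smi` is on the PATH. It relies on the `--query-compute-apps` and `--format=csv,noheader,nounits` options of `nvidia-smi`.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' rows into (pid, MiB or None) tuples.

    Expects the output of nvidia-smi with --format=csv,noheader,nounits.
    Rows that do not look like 'pid, memory' are skipped.
    """
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid = int(parts[0])
        # Memory may be reported as a non-numeric value such as "[N/A]".
        mem = int(parts[1]) if parts[1].isdigit() else None
        rows.append((pid, mem))
    return rows


def gpu_compute_processes():
    """Return (pid, used MiB) for every compute process nvidia-smi reports."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling `gpu_compute_processes()` before and after ASHA stops a trial, and comparing the PID lists, should show whether a stopped trial's training process is still resident on the GPU.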

Steps to reproduce

./isaaclab.sh -p scripts/reinforcement_learning/ray/tuner.py \
  --cfg_file scripts/reinforcement_learning/ray/hyperparameter_tuning/vision_cartpole_cfg.py \
  --cfg_class CartpoleTheiaJobCfg \
  --run_mode local \
  --workflow scripts/reinforcement_learning/rl_games/train.py \
  --num_workers_per_node <NUMBER_OF_GPUS_IN_COMPUTER>

Tuner construction with the scheduler changed to ASHAScheduler:

tuner = tune.Tuner(
    IsaacLabTuneTrainable,
    param_space=cfg,
    tune_config=tune.TuneConfig(
        metric=args.metric,
        mode=args.mode,
        scheduler=ASHAScheduler(
            time_attr="time_this_iter_s",
            grace_period=3,
            max_t=5,
            reduction_factor=2,
        ),
        search_alg=repeat_search,
        num_samples=args.num_samples,
        reuse_actors=True,
    ),
    run_config=run_config,
)
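
If the training workflow runs as a child process of the trainable (an assumption; this report does not show how IsaacLabTuneTrainable launches it), one common way to make sure nothing is left holding the GPU is to start the child in its own process group and signal the whole group when the trial is stopped. The sketch below uses only the Python standard library on POSIX systems. `start_in_own_group` and `stop_process_group` are hypothetical helper names, not Isaac Lab or Ray APIs.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(cmd):
    """Launch cmd as the leader of a new session and process group (POSIX)."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_process_group(proc, timeout=5.0):
    """Terminate proc and all of its descendants in the same process group.

    Sends SIGTERM to the group first and escalates to SIGKILL if the group
    leader has not exited within `timeout` seconds. Returns the leader's
    exit code.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited
    try:
        # With start_new_session=True the child's PID is also its group ID.
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running training process.
    child = start_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process_group(child))
```

A hook such as a Ray Tune Trainable's `cleanup()` method would be a natural place to call something like `stop_process_group`. Whether that hook runs for every stopped trial when `reuse_actors=True` is worth verifying separately.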

System Info

Describe the characteristic of your environment:

  • Commit: [e.g. 2f9298d]
  • Isaac Sim Version: [5.0]
  • OS: [Ubuntu 22.04]
  • GPU: [RTX 4090]
  • CUDA: [12.8]
  • GPU Driver: [553.05]
