4/5 trial fails due to lack of memory #4010

Open
diegotxegp opened this issue May 28, 2024 · 2 comments

diegotxegp commented May 28, 2024

Describe the bug
4 of 5 trials fail due to lack of memory. I have 4 x RTX 2080 Super GPUs (8 GB each) and 64 GB of RAM, but AutoML doesn't seem to recognize all of my GPUs and make the most of them.

To Reproduce
Use the "Rotten Tomatoes" example from the Ludwig AI website. If you have more than one GPU, you should be able to reproduce this error.

from ludwig.automl import auto_train

auto_train_results = auto_train(
    dataset=self.df,
    target="recommended",
    time_limit_s=7200,
)

Expected behavior
All 5 trials should run and produce different results, rather than a single successful run and 4 errors due to lack of memory.

Screenshots

Trial trial_78e53127 completed after 11 iterations at 2024-05-28 13:24:32. Total running time: 21min 24s

Trial status: 4 ERROR | 1 TERMINATED
Current time: 2024-05-28 13:24:32. Total running time: 21min 24s
Logical resource usage: 0/20 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:G)
Current best trial: 78e53127 with metric_score=0.9420865774154663 and params={'trainer.learning_rate': 2.2103375806114728e-05, 'trainer.batch_size': 64, 'combiner.num_fc_layers': 1, 'combiner.output_size': 128, 'combiner.dropout': 0.012855425737772442}
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status ...ner.learning_rate trainer.batch_size ...ner.num_fc_layers combiner.output_size combiner.dropout iter total time (s) metric_score │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trial_78e53127 TERMINATED 2.21034e-05 64 1 128 0.0128554 11 1259.7 0.942087 │
│ trial_6a7803f9 ERROR 3.10601e-05 1024 3 256 0.0093055 │
│ trial_6950b4ec ERROR 0.000337902 1024 2 256 0.086701 │
│ trial_6225efbb ERROR 0.000705436 1024 1 128 0.0393212 │
│ trial_5d372a47 ERROR 0.000517778 1024 3 128 0.0782563 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Number of errored trials: 4
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name # failures error file │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trial_6a7803f9 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6a7803f9/error.txt │
│ trial_6950b4ec 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6950b4ec/error.txt │
│ trial_6225efbb 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6225efbb/error.txt │
│ trial_5d372a47 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_5d372a47/error.txt │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯

2024-05-28 13:24:32,620 ERROR tune.py:1144 -- Trials did not complete: [trial_6a7803f9, trial_6950b4ec, trial_6225efbb, trial_5d372a47]
2024-05-28 13:24:32,631 WARNING experiment_analysis.py:916 -- Failed to read the results for 4 trials:

  • /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6a7803f9
  • /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6950b4ec
  • /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6225efbb
  • /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_5d372a47
/home/diego/.local/lib/python3.10/site-packages/ludwig/automl/automl.py:286: UserWarning: There was an error running the experiment. A trial failed to start. Consider increasing the time budget for experiment.
  warnings.warn(
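
The summary above lists an error.txt file for each failed trial but does not show the underlying exception. A minimal sketch for collecting the tail of each file, so the out-of-memory diagnosis can be confirmed, might look like the following. The helper names are hypothetical, and the "out of memory" substring check is only a heuristic based on the usual wording of CUDA errors, not something shown in this log.

```python
from pathlib import Path


def collect_trial_errors(hyperopt_dir, tail_lines=5):
    """Return {trial_name: last lines of error.txt} for every trial that has one."""
    errors = {}
    for error_file in sorted(Path(hyperopt_dir).glob("trial_*/error.txt")):
        lines = error_file.read_text(errors="replace").splitlines()
        errors[error_file.parent.name] = "\n".join(lines[-tail_lines:])
    return errors


def looks_like_gpu_oom(text):
    """Heuristic only: assumes the error message contains 'out of memory'."""
    return "out of memory" in text.lower()


if __name__ == "__main__":
    # Directory taken from the error table above; adjust to your own run.
    root = "/home/diego/VSProjects/Rotten-Tomatoes/hyperopt"
    for trial, tail in collect_trial_errors(root).items():
        print(f"{trial}: possible OOM = {looks_like_gpu_oom(tail)}")
        print(tail)
```

If the tails show a different exception, the memory explanation would need to be revisited before tuning any resource settings.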

Environment (please complete the following information):

  • OS: Ubuntu 22.04.3 LTS
  • Python version: 3.10.12
  • Ludwig version: 0.10.3

Additional context
The idea is to use AutoML because it makes automatic configuration easy.

@arnavgarg1
Contributor

Hey @diegotxegp,

Are you able to try setting max_concurrent_trials to a value like 1 or 2? https://ludwig.ai/latest/configuration/hyperparameter_optimization/#executor

Regarding GPU usage - is your CUDA_VISIBLE_DEVICES environment variable set?
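
A minimal sketch of one way to check the environment variable, using only the standard library. The helper name visible_gpu_ids is hypothetical. The device list "0,1,2,3" assumes the four GPUs described in the report.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device IDs.

    Returns None when the variable is unset (CUDA then exposes all devices)
    and an empty list when it is set but empty (no devices exposed).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    # Set the variable before importing torch, ray, or ludwig: CUDA reads it
    # when it initializes, so changing it later in the process has no effect.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")
    print("Visible GPU IDs:", visible_gpu_ids())
```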

@diegotxegp
Author

Thank you for your quick response.

The point is that I am trying to do this automatically with AutoML. Since the error was raised, I added os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" as you suggested, and I set max_concurrent_trials as follows, but it made little difference:

Code:

from ludwig.automl import auto_train

auto_train_results = auto_train(
    dataset=self.df,
    target=selected_targets[0],
    time_limit_s=7200,
    num_samples=4,
    cpu_resources_per_trial=5,
    gpu_resources_per_trial=1,
    max_concurrent_trials=1,
)

AutoML config:

{ 'eval_split': 'validation',
  'executor': { 'cpu_resources_per_trial': 5,
                'gpu_resources_per_trial': 1,
                'kubernetes_namespace': None,
                'max_concurrent_trials': None,
                'num_samples': 5,
                'scheduler': { 'brackets': 1,
                               'grace_period': 72,
                               'max_t': 7200,
                               'metric': None,
                               'mode': None,
                               'reduction_factor': 5.0,
                               'stop_last_trials': True,
                               'time_attr': 'time_total_s',
                               'type': 'async_hyperband'},
                'time_budget_s': 7200,
                'trial_driver_resources': {'CPU': 1, 'GPU': 0},
                'type': 'ray'},
  'goal': 'maximize',
  'metric': 'roc_auc',
  'output_feature': 'recommended',
  'parameters': { 'combiner.dropout': { 'lower': 0.0,
                                        'space': 'uniform',
                                        'upper': 0.1},
                  'combiner.num_fc_layers': { 'lower': 1,
                                              'space': 'randint',
                                              'upper': 4},
                  'combiner.output_size': { 'categories': [128, 256],
                                            'space': 'choice'},
                  'trainer.batch_size': { 'categories': [ 64,
                                                          128,
                                                          256,
                                                          512,
                                                          1024],
                                          'space': 'choice'},
                  'trainer.learning_rate': { 'lower': 2e-05,
                                             'space': 'loguniform',
                                             'upper': 0.001}},
  'search_alg': {'type': 'hyperopt'},
  'split': 'validation'}
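
The printed config still reports 'max_concurrent_trials': None and 'num_samples': 5, although the call passed 1 and 4, which suggests those keyword arguments did not reach the executor section. In the trial table from the original report, all four errored trials used trainer.batch_size 1024, while the one that finished used 64. One possible workaround, sketched here as an assumption rather than a confirmed fix, is to patch the generated hyperopt dictionary directly. The helper constrain_hyperopt and the 256 batch-size cap are hypothetical. How the patched dictionary is handed back to Ludwig depends on the installed version and is not shown in this thread.

```python
import copy


def constrain_hyperopt(hyperopt_cfg, max_concurrent_trials=1, num_samples=4,
                       max_batch_size=256):
    """Return a copy of a hyperopt config with tighter executor and batch-size limits."""
    cfg = copy.deepcopy(hyperopt_cfg)

    # Force the executor settings that the printed config did not pick up.
    executor = cfg.setdefault("executor", {})
    executor["max_concurrent_trials"] = max_concurrent_trials
    executor["num_samples"] = num_samples

    # Drop batch sizes above the (hypothetical) cap from the search space.
    batch = cfg.get("parameters", {}).get("trainer.batch_size")
    if batch and "categories" in batch:
        batch["categories"] = [b for b in batch["categories"] if b <= max_batch_size]
    return cfg


# Trimmed copy of the printed config: only the keys this sketch touches.
printed = {
    "executor": {
        "max_concurrent_trials": None,
        "num_samples": 5,
        "gpu_resources_per_trial": 1,
    },
    "parameters": {
        "trainer.batch_size": {
            "categories": [64, 128, 256, 512, 1024],
            "space": "choice",
        },
    },
}

patched = constrain_hyperopt(printed)
print(patched["executor"])
print(patched["parameters"]["trainer.batch_size"]["categories"])
```

The function works on a deep copy, so the generated config stays available for comparison.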

@diegotxegp changed the title from "4/5 trial fails sdue to lack of memory" to "4/5 trial fails due to lack of memory" on May 29, 2024