4/5 trial fails due to lack of memory #4010

diegotxegp · 2024-05-28T12:04:48Z

Describe the bug
4 of 5 trials fail due to lack of memory. I have 4 x RTX 2080 Super GPUs (8 GB each) and 64 GB of RAM, but it seems that AutoML doesn't recognize all of my GPUs and so doesn't make the most of them.

To Reproduce
Run the "Rotten Tomatoes" example from the Ludwig AI website. If you have more than one GPU, you should be able to reproduce this error.

from ludwig.automl import auto_train

auto_train_results = auto_train(
    dataset=self.df,
    target="recommended",
    time_limit_s=7200,
)

Expected behavior
All 5 trials should run and produce different results, rather than a single completed trial and 4 errors due to lack of memory.

Screenshots

Trial trial_78e53127 completed after 11 iterations at 2024-05-28 13:24:32. Total running time: 21min 24s

Trial status: 4 ERROR | 1 TERMINATED
Current time: 2024-05-28 13:24:32. Total running time: 21min 24s
Logical resource usage: 0/20 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:G)
Current best trial: 78e53127 with metric_score=0.9420865774154663 and params={'trainer.learning_rate': 2.2103375806114728e-05, 'trainer.batch_size': 64, 'combiner.num_fc_layers': 1, 'combiner.output_size': 128, 'combiner.dropout': 0.012855425737772442}
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status ...ner.learning_rate trainer.batch_size ...ner.num_fc_layers combiner.output_size combiner.dropout iter total time (s) metric_score │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trial_78e53127 TERMINATED 2.21034e-05 64 1 128 0.0128554 11 1259.7 0.942087 │
│ trial_6a7803f9 ERROR 3.10601e-05 1024 3 256 0.0093055 │
│ trial_6950b4ec ERROR 0.000337902 1024 2 256 0.086701 │
│ trial_6225efbb ERROR 0.000705436 1024 1 128 0.0393212 │
│ trial_5d372a47 ERROR 0.000517778 1024 3 128 0.0782563 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Number of errored trials: 4
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name # failures error file │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trial_6a7803f9 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6a7803f9/error.txt │
│ trial_6950b4ec 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6950b4ec/error.txt │
│ trial_6225efbb 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6225efbb/error.txt │
│ trial_5d372a47 1 /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_5d372a47/error.txt │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯

2024-05-28 13:24:32,620 ERROR tune.py:1144 -- Trials did not complete: [trial_6a7803f9, trial_6950b4ec, trial_6225efbb, trial_5d372a47]
2024-05-28 13:24:32,631 WARNING experiment_analysis.py:916 -- Failed to read the results for 4 trials:

/home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6a7803f9
/home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6950b4ec
/home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6225efbb
/home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_5d372a47
/home/diego/.local/lib/python3.10/site-packages/ludwig/automl/automl.py:286: UserWarning: There was an error running the experiment. A trial failed to start. Consider increasing the time budget for experiment.
warnings.warn(
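
One pattern in the trial table above is that every errored trial sampled trainer.batch_size = 1024, while the only trial that finished used 64. The sketch below just tallies the table rows by status to make that visible. The values are copied by hand from the table; this is a correlation only, and the error.txt files listed above would be needed to confirm that these are out-of-memory failures.

```python
from collections import defaultdict

# (trial name, status, trainer.batch_size), copied by hand from the trial table.
trials = [
    ("trial_78e53127", "TERMINATED", 64),
    ("trial_6a7803f9", "ERROR", 1024),
    ("trial_6950b4ec", "ERROR", 1024),
    ("trial_6225efbb", "ERROR", 1024),
    ("trial_5d372a47", "ERROR", 1024),
]

# Collect the distinct batch sizes seen for each trial status.
batch_sizes_by_status = defaultdict(set)
for _name, status, batch_size in trials:
    batch_sizes_by_status[status].add(batch_size)

for status in sorted(batch_sizes_by_status):
    print(status, sorted(batch_sizes_by_status[status]))
```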

Environment (please complete the following information):

OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Ludwig version: 0.10.3

Additional context
The idea is to use AutoML because it makes automatic configuration easy.

arnavgarg1 · 2024-05-28T16:13:42Z

Hey @diegotxegp,

Are you able to try setting max_concurrent_trials to a value like 1 or 2? https://ludwig.ai/latest/configuration/hyperparameter_optimization/#executor

Regarding GPU usage - is your CUDA_VISIBLE_DEVICES environment variable set?

diegotxegp · 2024-05-29T09:02:01Z

Thank you for your quick response.

The point is that I am trying to do this automatically with AutoML. Since the error was raised, I added os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" as you suggested, and I set max_concurrent_trials as follows, but it did not make much difference:

Code:

from ludwig.automl import auto_train

auto_train_results = auto_train(
    dataset=self.df,
    target=selected_targets[0],
    time_limit_s=7200,
    num_samples=4,
    cpu_resources_per_trial=5,
    gpu_resources_per_trial=1,
    max_concurrent_trials=1,
)
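
For context on why a concurrency cap matters here: the log above reports 20 logical CPUs and 4 GPUs, and this call asks for 5 CPUs and 1 GPU per trial. A rough upper bound on how many trials fit at once, from logical resources alone, can be sketched as below. The helper name max_parallel_trials is hypothetical, and the estimate is an assumption-laden simplification: it ignores the trial driver's own resources and says nothing about GPU memory.

```python
def max_parallel_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """Upper bound on simultaneous trials, based on logical resources only."""
    limits = []
    if cpus_per_trial > 0:
        limits.append(int(total_cpus // cpus_per_trial))
    if gpus_per_trial > 0:
        limits.append(int(total_gpus // gpus_per_trial))
    if not limits:
        raise ValueError("at least one per-trial resource must be positive")
    return min(limits)


# 20 CPUs and 4 GPUs come from the "Logical resource usage" log line;
# 5 CPUs and 1 GPU per trial come from the auto_train call above.
estimate = max_parallel_trials(20, 4, cpus_per_trial=5, gpus_per_trial=1)
print(estimate)
```

Under these numbers the bound is 4 trials at a time unless a lower max_concurrent_trials actually takes effect.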

AutoML config:

{ 'eval_split': 'validation',
  'executor': { 'cpu_resources_per_trial': 5,
                'gpu_resources_per_trial': 1,
                'kubernetes_namespace': None,
                'max_concurrent_trials': None,
                'num_samples': 5,
                'scheduler': { 'brackets': 1,
                               'grace_period': 72,
                               'max_t': 7200,
                               'metric': None,
                               'mode': None,
                               'reduction_factor': 5.0,
                               'stop_last_trials': True,
                               'time_attr': 'time_total_s',
                               'type': 'async_hyperband'},
                'time_budget_s': 7200,
                'trial_driver_resources': {'CPU': 1, 'GPU': 0},
                'type': 'ray'},
  'goal': 'maximize',
  'metric': 'roc_auc',
  'output_feature': 'recommended',
  'parameters': { 'combiner.dropout': { 'lower': 0.0,
                                        'space': 'uniform',
                                        'upper': 0.1},
                  'combiner.num_fc_layers': { 'lower': 1,
                                              'space': 'randint',
                                              'upper': 4},
                  'combiner.output_size': { 'categories': [128, 256],
                                            'space': 'choice'},
                  'trainer.batch_size': { 'categories': [ 64,
                                                          128,
                                                          256,
                                                          512,
                                                          1024],
                                          'space': 'choice'},
                  'trainer.learning_rate': { 'lower': 2e-05,
                                             'space': 'loguniform',
                                             'upper': 0.001}},
  'search_alg': {'type': 'hyperopt'},
  'split': 'validation'}
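
Comparing the keyword arguments in the call with the executor section of the printed config suggests that not all of them took effect. The sketch below makes that comparison explicit. The helper name find_mismatches is hypothetical, both dictionaries are copied by hand from this comment, and the check only shows what the printed config contains; it does not explain why the values differ.

```python
def find_mismatches(requested, effective):
    """Return {key: (requested, effective)} for keys whose values differ."""
    return {
        key: (value, effective.get(key))
        for key, value in requested.items()
        if effective.get(key) != value
    }


# Keyword arguments passed to auto_train in the call above.
requested = {
    "num_samples": 4,
    "cpu_resources_per_trial": 5,
    "gpu_resources_per_trial": 1,
    "max_concurrent_trials": 1,
}

# Matching keys from the 'executor' section of the printed AutoML config.
effective = {
    "num_samples": 5,
    "cpu_resources_per_trial": 5,
    "gpu_resources_per_trial": 1,
    "max_concurrent_trials": None,
}

mismatches = find_mismatches(requested, effective)
print(mismatches)
```

Here num_samples (4 requested, 5 in the config) and max_concurrent_trials (1 requested, None in the config) differ, which is consistent with the run behaving much as it did before.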

diegotxegp changed the title ~~4/5 trial fails sdue to lack of memory~~ 4/5 trial fails due to lack of memory May 29, 2024
