I would like to finetune a model on a specific PDB dataset using dual GPUs, and I have already prepared a PDB list. I have two questions. First, how should I pass the test-set information to the program? Can it be provided the same way as the training set, via a PDB list? Second, how should I modify the code to enable finetuning with dual GPUs? I tried running a command, but it resulted in an error; the full command and error output are below.
@ilovesdu It's easy to expand train_demo.sh to multi-node or multi-GPU training. You can try this; I have run a similar command on 2 V100 GPUs and it works fine.
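A two-GPU launch along these lines might look like the following. This is a sketch using `torchrun`, the maintained replacement for the deprecated `torch.distributed.launch`; the port value is illustrative, and the remaining training flags from the original command would be appended unchanged — it is not a verified Protenix recipe:

```shell
#!/bin/bash
# Sketch: single-node, two-GPU launch with torchrun. torchrun spawns one
# worker process per GPU and sets RANK, LOCAL_RANK, and WORLD_SIZE in each
# worker's environment, which torch.distributed initialization reads.
torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    --master_port=29500 \
    ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42
    # ...append the remaining flags from the command below unchanged
```

For multi-node training, the same command would additionally set `--nnodes`, `--node_rank`, and `--master_addr` on each node.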
To recap my original post, the command I ran was:
```shell
python3 -m torch.distributed.launch --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output \
    --dtype bf16 \
    --project protenix \
    --use_wandb false \
    --diffusion_batch_size 48 \
    --eval_interval 400 \
    --log_interval 50 \
    --checkpoint_interval 400 \
    --ema_decay 0.999 \
    --train_crop_size 384 \
    --max_steps 100000 \
    --warmup_steps 2000 \
    --lr 0.001 \
    --sample_diffusion.N_step 20 \
    --load_checkpoint_path ${checkpoint_path} \
    --load_ema_checkpoint_path ${checkpoint_path} \
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
    --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925
```
The error message I received is as follows:
```
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./runner/train.py FAILED
Failures:
[1]:
  time      : 2025-03-12_22:38:01
  host      : DESKTOP-KH3KJRU.
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 79174)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2025-03-12_22:38:01
  host      : DESKTOP-KH3KJRU.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 79173)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
I’d greatly appreciate your guidance on how to address these issues. Thank you!
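A side note on debugging: the `ChildFailedError` summary above hides the real exception (`error_file: <N/A>`, no traceback). Per the PyTorch elastic error-propagation docs linked in that output, the worker's traceback can be surfaced by wrapping the script's entry point with the `@record` decorator. A minimal sketch, assuming `./runner/train.py` has a `main()` entry point (the `main` name here is hypothetical):

```python
# Sketch: record the worker's real traceback for torchelastic, so the
# launcher reports the actual exception instead of only "exitcode: 1".
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    ...  # existing training logic in ./runner/train.py goes here


if __name__ == "__main__":
    main()
```

Alternatively, running `python3 ./runner/train.py` directly as a single process (without the launcher) prints the underlying Python exception in full, which is usually the quickest way to find the root cause.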