I would like to finetune a model on a specific PDB dataset using dual GPUs, and I have already prepared a PDB list. I have two questions. First, how should I pass the test-set information to the program? Can it be provided the same way as the training set, via a PDB list? Second, how should I modify the code to enable finetuning with dual GPUs? I tried running a command, but it resulted in an error; the full command and error output are below.
@ilovesdu It's easy to expand train_demo.sh to multi-node or multi-GPU training. You can try this; I have run a similar command on 2 V100 GPUs and it works fine.
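A two-GPU launch along these lines might look like the following. This is a sketch using `torchrun`, the maintained replacement for the deprecated `torch.distributed.launch`; the port value is illustrative, and the remaining training flags from the original command would be appended unchanged — it is not a verified Protenix recipe:

```shell
#!/bin/bash
# Sketch: single-node, two-GPU launch with torchrun. torchrun spawns one
# worker process per GPU and sets RANK, LOCAL_RANK, and WORLD_SIZE in each
# worker's environment, which torch.distributed initialization reads.
torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    --master_port=29500 \
    ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42
    # ...append the remaining flags from the command below unchanged
```

For multi-node training, the same command would additionally set `--nnodes`, `--node_rank`, and `--master_addr` on each node.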
To recap my original post, the command I ran was:
```shell
python3 -m torch.distributed.launch --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output \
    --dtype bf16 \
    --project protenix \
    --use_wandb false \
    --diffusion_batch_size 48 \
    --eval_interval 400 \
    --log_interval 50 \
    --checkpoint_interval 400 \
    --ema_decay 0.999 \
    --train_crop_size 384 \
    --max_steps 100000 \
    --warmup_steps 2000 \
    --lr 0.001 \
    --sample_diffusion.N_step 20 \
    --load_checkpoint_path ${checkpoint_path} \
    --load_ema_checkpoint_path ${checkpoint_path} \
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
    --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925
```
The error message I received is as follows:
```
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./runner/train.py FAILED
Failures:
[1]:
  time      : 2025-03-12_22:38:01
  host      : DESKTOP-KH3KJRU.
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 79174)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2025-03-12_22:38:01
  host      : DESKTOP-KH3KJRU.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 79173)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
I’d greatly appreciate your guidance on how to address these issues. Thank you!
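A side note on debugging: the `ChildFailedError` summary above hides the real exception (`error_file: <N/A>`, no traceback). Per the PyTorch elastic error-propagation docs linked in that output, the worker's traceback can be surfaced by wrapping the script's entry point with the `@record` decorator. A minimal sketch, assuming `./runner/train.py` has a `main()` entry point (the `main` name here is hypothetical):

```python
# Sketch: record the worker's real traceback for torchelastic, so the
# launcher reports the actual exception instead of only "exitcode: 1".
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    ...  # existing training logic in ./runner/train.py goes here


if __name__ == "__main__":
    main()
```

Alternatively, running `python3 ./runner/train.py` directly as a single process (without the launcher) prints the underlying Python exception in full, which is usually the quickest way to find the root cause.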