
Finetune issue #82

Open
ilovesdu opened this issue Mar 13, 2025 · 1 comment

Comments

@ilovesdu

I would like to use dual GPUs to finetune a model on a specific PDB dataset, and I have already prepared a PDB list. I have two questions. First, how should I pass the test set information to the program? Is it possible to provide it the same way as the training set, using a PDB list? Second, how should I modify the code to enable finetuning with dual GPUs? I tried running the following command, but it resulted in an error:

python3 -m torch.distributed.launch --nproc_per_node=2 ./runner/train.py \
--run_name protenix_finetune \
--seed 42 \
--base_dir ./output \
--dtype bf16 \
--project protenix \
--use_wandb false \
--diffusion_batch_size 48 \
--eval_interval 400 \
--log_interval 50 \
--checkpoint_interval 400 \
--ema_decay 0.999 \
--train_crop_size 384 \
--max_steps 100000 \
--warmup_steps 2000 \
--lr 0.001 \
--sample_diffusion.N_step 20 \
--load_checkpoint_path ${checkpoint_path} \
--load_ema_checkpoint_path ${checkpoint_path} \
--data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
--data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
--data.test_sets recentPDB_1536_sample384_0925,posebusters_0925

The error message I received is as follows:

File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./runner/train.py FAILED

Failures:
[1]:
time : 2025-03-12_22:38:01
host : DESKTOP-KH3KJRU.
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 79174)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-03-12_22:38:01
host : DESKTOP-KH3KJRU.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 79173)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I’d greatly appreciate your guidance on how to address these issues. Thank you!

@zhangyuxuann
Collaborator

@ilovesdu It's easy to expand train_demo.sh to multi-node or multi-GPU training as follows; you can try this. I have tried a similar command on 2 V100 GPUs and it works fine.

torchrun \
    --nproc_per_node $NPROC \
    --master_addr $WORKER_0_HOST \
    --master_port $WORKER_0_PORT \
    --node_rank=$ID \
    --nnodes=$WORKER_NUM \
    ./runner/train.py \
    --run_name protenix_train \
    --seed 42 \
    --base_dir ./output \
    ...
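
For a single node with two GPUs, the placeholders above reduce to something like the sketch below (an illustration only: it assumes one node, the master address/port are arbitrary local defaults, and the remaining flags are copied from the command in the original post, so adjust them to your setup):

# single-node, 2-GPU finetuning sketch; flag values lifted from the command above
torchrun \
    --nproc_per_node 2 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 29500 \
    ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output \
    --load_checkpoint_path ${checkpoint_path} \
    --load_ema_checkpoint_path ${checkpoint_path} \
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
    ...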

You should provide an "indices_fpath" to construct the test set; a "pdb_list" can also be provided to filter the "indices.csv". Please refer to https://github.com/bytedance/Protenix/blob/main/docs/prepare_training_data.md#indices-csv
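
As a concrete illustration, the test set could be wired in with overrides shaped like the ones below (a sketch only: the set name "my_pdb_testset" is made up, and the "base_info.indices_fpath" / "base_info.pdb_list" keys are inferred from the training-set flags used earlier in this thread; check prepare_training_data.md for the exact config keys):

# hypothetical test-set overrides, mirroring the --data.<set>.base_info.* pattern above
--data.test_sets my_pdb_testset \
--data.my_pdb_testset.base_info.indices_fpath path/to/test_indices.csv \
--data.my_pdb_testset.base_info.pdb_list examples/test_subset.txt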
