Run run_grpo_rec_lora.sh #154

ai-kunkun opened this issue Mar 12, 2025 · 0 comments
Script file:
```bash
cd src/open-r1-multimodal
export DEBUG_MODE="true"

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

RUN_NAME="Qwen2-VL-2B-GRPO-REC-lora"
export DEBUG_MODE="true"
export HF_HOME=/projects/VLM-R1/huggingface_cache
export LOG_PATH="./debug_log_$RUN_NAME.txt"
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero2.json \
    --output_dir output/$RUN_NAME \
    --model_name_or_path /projects/VLM-R1/models/Qwen2-VL-2B \
    --dataset_name data_config/rec.yaml \
    --image_root /projects/VLM-R1/data/train2017 \
    --max_prompt_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 1 \
    --run_name $RUN_NAME \
    --save_steps 200 \
    --save_only_model true \
    --learning_rate 1e-5 \
    --use_peft true \
    --lora_r 64 \
    --lora_alpha 128 \
    --lora_dropout 0.05 \
    --lora_task_type CAUSAL_LM \
    --freeze_vision_modules true
```
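For context, a minimal sketch (not the repository's actual loading code) of what the `--model_name_or_path`, `--torch_dtype bfloat16`, and `--attn_implementation flash_attention_2` flags presumably boil down to when the model is loaded:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Load the same checkpoint the script points at, in bf16 with Flash Attention 2.
# transformers prints the "Flash Attention 2.0 with a model not initialized on GPU"
# warning at this point if the weights are still on CPU.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/projects/VLM-R1/models/Qwen2-VL-2B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Moving the model to a GPU silences the warning; when training through the
# trainer/DeepSpeed this move normally happens later anyway, so the warning by
# itself is usually harmless.
model = model.to("cuda")
```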

Error:
```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
W0312 20:59:45.668000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306400 closing signal SIGTERM
W0312 20:59:52.055000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306401 closing signal SIGTERM
W0312 20:59:52.056000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306402 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306403 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306404 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306406 closing signal SIGTERM
W0312 20:59:52.058000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306407 closing signal SIGTERM
W0312 21:00:22.597000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2306401 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0312 21:00:30.954000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2306402 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0312 21:00:33.524000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 5 (pid: 2306405) of binary: /home/anaconda3/envs/vlm-r1/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/vlm-r1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/open_r1/grpo_rec.py FAILED

Failures:
  <NO_OTHER_FAILURES>
```
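Note that `exitcode: -9` means the worker was terminated with SIGKILL; on a multi-GPU host that is commonly the kernel OOM killer rather than the Flash Attention message above, which is only a warning. A quick (hypothetical) way to check on the host would be:

```bash
# Look for recent OOM-killer activity that may have killed the training workers.
dmesg -T | grep -iE "out of memory|killed process" | tail -n 20
```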
