Run run_grpo_rec_lora.sh #154

ai-kunkun opened this issue Mar 12, 2025 · 0 comments
Script file:
```bash
cd src/open-r1-multimodal
export DEBUG_MODE="true"

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

RUN_NAME="Qwen2-VL-2B-GRPO-REC-lora"
export DEBUG_MODE="true"
export HF_HOME=/projects/VLM-R1/huggingface_cache
export LOG_PATH="./debug_log_$RUN_NAME.txt"
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero2.json \
    --output_dir output/$RUN_NAME \
    --model_name_or_path /projects/VLM-R1/models/Qwen2-VL-2B \
    --dataset_name data_config/rec.yaml \
    --image_root /projects/VLM-R1/data/train2017 \
    --max_prompt_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 1 \
    --run_name $RUN_NAME \
    --save_steps 200 \
    --save_only_model true \
    --learning_rate 1e-5 \
    --use_peft true \
    --lora_r 64 \
    --lora_alpha 128 \
    --lora_dropout 0.05 \
    --lora_task_type CAUSAL_LM \
    --freeze_vision_modules true
```
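For context, a minimal sketch (not the repository's actual loading code) of what the `--model_name_or_path`, `--torch_dtype bfloat16`, and `--attn_implementation flash_attention_2` flags presumably boil down to when the model is loaded:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Load the same checkpoint the script points at, in bf16 with Flash Attention 2.
# transformers prints the "Flash Attention 2.0 with a model not initialized on GPU"
# warning at this point if the weights are still on CPU.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/projects/VLM-R1/models/Qwen2-VL-2B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Moving the model to a GPU silences the warning; when training through the
# trainer/DeepSpeed this move normally happens later anyway, so the warning by
# itself is usually harmless.
model = model.to("cuda")
```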

Error:
```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
W0312 20:59:45.668000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306400 closing signal SIGTERM
W0312 20:59:52.055000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306401 closing signal SIGTERM
W0312 20:59:52.056000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306402 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306403 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306404 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306406 closing signal SIGTERM
W0312 20:59:52.058000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306407 closing signal SIGTERM
W0312 21:00:22.597000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2306401 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0312 21:00:30.954000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2306402 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0312 21:00:33.524000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 5 (pid: 2306405) of binary: /home/anaconda3/envs/vlm-r1/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/vlm-r1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/open_r1/grpo_rec.py FAILED

Failures:
  <NO_OTHER_FAILURES>
```
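Note that `exitcode: -9` means the worker was terminated with SIGKILL; on a multi-GPU host that is commonly the kernel OOM killer rather than the Flash Attention message above, which is only a warning. A quick (hypothetical) way to check on the host would be:

```bash
# Look for recent OOM-killer activity that may have killed the training workers.
dmesg -T | grep -iE "out of memory|killed process" | tail -n 20
```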
