Script file:
```bash
cd src/open-r1-multimodal

export DEBUG_MODE="true"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

RUN_NAME="Qwen2-VL-2B-GRPO-REC-lora"
export DEBUG_MODE="true"
export HF_HOME=/projects/VLM-R1/huggingface_cache
export LOG_PATH="./debug_log_$RUN_NAME.txt"
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero2.json \
    --output_dir output/$RUN_NAME \
    --model_name_or_path /projects/VLM-R1/models/Qwen2-VL-2B \
    --dataset_name data_config/rec.yaml \
    --image_root /projects/VLM-R1/data/train2017 \
    --max_prompt_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 1 \
    --run_name $RUN_NAME \
    --save_steps 200 \
    --save_only_model true \
    --learning_rate 1e-5 \
    --use_peft true \
    --lora_r 64 \
    --lora_alpha 128 \
    --lora_dropout 0.05 \
    --lora_task_type CAUSAL_LM \
    --freeze_vision_modules true
```
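For reference, the `--use_peft`/`--lora_*` flags at the end of the command correspond roughly to a PEFT `LoraConfig` like the sketch below. This is an illustration only; the target modules and any other defaults that `grpo_rec.py` sets internally are assumptions and are not visible in the command.

```python
# Rough PEFT equivalent of the --lora_* flags above; a sketch for
# illustration, not the configuration grpo_rec.py actually constructs.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                   # --lora_r 64
    lora_alpha=128,         # --lora_alpha 128
    lora_dropout=0.05,      # --lora_dropout 0.05
    task_type="CAUSAL_LM",  # --lora_task_type CAUSAL_LM
)
```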
Error:
```
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
W0312 20:59:45.668000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306400 closing signal SIGTERM
W0312 20:59:52.055000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306401 closing signal SIGTERM
W0312 20:59:52.056000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306402 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306403 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306404 closing signal SIGTERM
W0312 20:59:52.057000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306406 closing signal SIGTERM
W0312 20:59:52.058000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2306407 closing signal SIGTERM
W0312 21:00:22.597000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2306401 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0312 21:00:30.954000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 2306402 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0312 21:00:33.524000 2306316 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 5 (pid: 2306405) of binary: /home/anaconda3/envs/vlm-r1/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/vlm-r1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/vlm-r1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/open_r1/grpo_rec.py FAILED
Failures:
  <NO_OTHER_FAILURES>
```
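The first line of the output is the transformers warning emitted when a model is loaded with `attn_implementation="flash_attention_2"` while its weights are still on the CPU; it is not itself fatal. Below is a minimal sketch of loading Qwen2-VL on GPU so the warning does not fire. The class name and loading path are assumptions about what `grpo_rec.py` does internally, and a DeepSpeed/Trainer launch normally handles device placement itself.

```python
# Minimal sketch (not the VLM-R1 training code): load Qwen2-VL with
# FlashAttention-2 in bf16 and move it to the GPU before any forward pass,
# which avoids the "model not initialized on GPU" warning.
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/projects/VLM-R1/models/Qwen2-VL-2B",   # path from the script above
    torch_dtype=torch.bfloat16,              # FlashAttention-2 needs fp16/bf16
    attn_implementation="flash_attention_2",
)
model.to("cuda")
```

Note that the run itself died with `exitcode: -9`, i.e. the failed rank was killed with SIGKILL, which usually points to the host running out of memory rather than to the Flash Attention warning above.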