I'm running on two A6000 GPUs (48 GB each), but I'm encountering an out-of-memory error. Does anyone know how to optimize this? Here are the parameters:

```bash
cd src/open-r1-multimodal
export DEBUG_MODE="true"
export CUDA_VISIBLE_DEVICES=1,2
RUN_NAME="Qwen2.5-VL-3B-GRPO-REC"
export LOG_PATH="./debug_log_$RUN_NAME.txt"
torchrun --nproc_per_node="2" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero3.json \
    --output_dir quanzhong/$RUN_NAME \
    --model_name_or_path VLM-R1/Qwen2.5-VL-3B-Instruct \
    --dataset_name data_config/rec.yaml \
    --image_root VLM-R1/camotrain \
    --max_prompt_length 1024 \
    --num_generations 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 2 \
    --run_name $RUN_NAME \
    --save_steps 100 \
    --save_only_model true
```
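For reference, `local_scripts/zero3.json` is a DeepSpeed ZeRO-3 config (the file itself isn't shown in this thread, and the repo's actual contents may differ). A minimal sketch of what such a config typically looks like is below; the two `offload_*` blocks are the standard ZeRO-3 option for trading GPU memory against training speed when hitting OOM:

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The `"auto"` values are filled in by the Hugging Face Trainer integration from the command-line arguments above.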
Since you already set `per_device_train_batch_size = 1`, another thing to try is setting `--max_pixels` to a smaller value, like 401408.
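For context on that number: Qwen2.5-VL splits images into 14×14 patches that are merged 2×2, so each visual token covers a 28×28-pixel block, and `max_pixels` caps the image area the processor keeps after resizing. A rough sketch of the resulting token budget (illustrative only; the exact resize logic lives in the model's image processor, which also rounds dimensions to multiples of 28):

```python
# Back-of-the-envelope token budget for Qwen2.5-VL's image processor.
# Assumption: one visual token per 28x28-pixel block (14x14 patches, 2x2 merge).

def visual_token_budget(max_pixels: int, pixels_per_token: int = 28 * 28) -> int:
    """Upper bound on visual tokens for an image resized to fit max_pixels."""
    return max_pixels // pixels_per_token

print(visual_token_budget(401408))          # 512  -> the suggested cap
print(visual_token_budget(1280 * 28 * 28))  # 1280 -> a common default cap
```

Cutting the visual-token count per image shrinks the sequence length for every one of the `num_generations` rollouts, which directly reduces activation memory during GRPO training.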