
qwen2.5-vl-72b LoRA GRPO training on 32x A100 errors with OOM #152

Open
szww3427 opened this issue Mar 12, 2025 · 1 comment

Comments

@szww3427

[rank6]: Traceback (most recent call last):
[rank6]:   File "/compliance_nas/***/VLM-R1/src/open-r1-multimodal/src/open_r1/grpo_rec.py", line 336, in <module>
[rank6]:     main(script_args, training_args, model_args)
[rank6]:   File "/compliance_nas/***/VLM-R1/src/open-r1-multimodal/src/open_r1/grpo_rec.py", line 325, in main
[rank6]:     trainer.train()
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank6]:     return inner_training_loop(
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank6]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3698, in training_step
[rank6]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank6]:   File "/compliance_nas/177569/VLM-R1/src/open-r1-multimodal/src/open_r1/trainer/grpo_trainer.py", line 703, in compute_loss
[rank6]:     inputs = self._generate_and_score_completions(inputs, model)
[rank6]:   File "/compliance_nas/177569/VLM-R1/src/open-r1-multimodal/src/open_r1/trainer/grpo_trainer.py", line 560, in _generate_and_score_completions
[rank6]:     with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
[rank6]:   File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
[rank6]:     return next(self.gen)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/trl/models/utils.py", line 210, in unwrap_model_for_generation
[rank6]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2235, in __enter__
[rank6]:     self.params[0].all_gather(param_list=self.params)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1154, in all_gather
[rank6]:     return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank6]:     ret_val = func(*args, **kwargs)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1522, in _all_gather
[rank6]:     self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1810, in _allgather_params_coalesced
[rank6]:     flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 6 has a total capacity of 95.62 GiB of which 920.29 MiB is free. Process 47199 has 95.39 GiB memory in use. Of the allocated memory 92.90 GiB is allocated by PyTorch, and 845.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
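
As an aside, the last sentence of the message points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of trying that (purely a fragmentation mitigation, and only an assumption that it helps here; it has to take effect before the first CUDA allocation, e.g. at the very top of grpo_rec.py or exported in the launch environment):

```python
# Sketch: enable expandable segments before torch initializes the CUDA allocator.
# This only addresses fragmentation of reserved-but-unallocated memory; it does not
# shrink the amount of memory the parameter all-gather itself needs.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

assert torch.cuda.is_available()
# A 462 MiB allocation like the one that failed above, just to exercise the allocator.
x = torch.empty(462 * 2**20, dtype=torch.uint8, device="cuda")
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
```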

The qwen2.5-vl-7b model trains fine on 2x A100, so one would expect the 72b LoRA GRPO run to also work on 32x A100, but it OOMs right away.
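
What the traceback shows may explain why more GPUs do not help: unwrap_model_for_generation enters deepspeed.zero.GatheredParameters(model.parameters()), i.e. under ZeRO-3 each rank temporarily materializes the full, unpartitioned base weights for generation, so the per-GPU peak at that point does not shrink as the world size grows. A rough back-of-the-envelope check (assuming bf16 base weights at 2 bytes per parameter; the LoRA adapters are negligible next to the base model):

```python
# Rough estimate of the memory needed when the full parameter set is gathered on one
# rank for generation (assumption: bf16 base weights, 2 bytes per parameter).
params = 72e9                          # qwen2.5-vl-72b
full_gather_gib = params * 2 / 2**30
print(f"{full_gather_gib:.0f} GiB")    # ~134 GiB vs. the 95.62 GiB per GPU in the log
```

The same arithmetic for the 7b model gives roughly 13 GiB, which is consistent with the 2x A100 run fitting comfortably.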

@gmyFighting

same question
