
Conversation

@aurelion-source

I found three main sources of memory increase:

  1. With grad_accum > 1, activations from the current iteration aren't freed until two subsequent forward passes have completed, so two large activation buffers are kept in memory at all times.
    Fix: Free activations immediately after the backward pass, which brings memory usage for grad_accum=2 down to the grad_accum=1 level (see the first sketch after this list).
  2. On gradient overflow (caused by the large batch size), DeepSpeed skips the current iteration (no optimizer step) and updates the loss scale. This happens with grad_accum > 4, and the skipped step prevents some reduction buffers from being freed; they then persist for the rest of training.
    Fix: Explicitly free the IPG buffers when an overflow is detected (second sketch below).
  3. With grad_accum > 1, gradients accumulate across multiple loss.backward() calls before the optimizer step, with PyTorch handling the accumulation internally. Memory profiling shows that PyTorch allocates an additional copy of the FP16 gradients during the second backward pass instead of accumulating in place, adding a fixed overhead (third sketch below). Since this behavior is internal to PyTorch, no fix was applied.
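
A minimal sketch of the idea behind fix 1, assuming a toy model and a `retained_outputs` list that stand in for whatever per-micro-batch references the real GPT-NeoX/DeepSpeed engine keeps (this is not the actual patch): clearing those references right after each micro-batch's backward pass lets the allocator reuse the memory before the next forward pass, so only one micro-batch's activations are held at a time.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
grad_accum = 2
retained_outputs = []  # stand-in for per-micro-batch references the engine keeps

for step in range(3):
    optimizer.zero_grad(set_to_none=True)
    for micro in range(grad_accum):
        x = torch.randn(16, 1024, device=device)
        out = model(x)
        retained_outputs.append(out)  # reference that keeps the memory alive
        (out.pow(2).mean() / grad_accum).backward()
        retained_outputs.clear()      # drop it right after backward,
                                      # not two forward passes later
    optimizer.step()
```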
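
The overflow fix follows the same principle. Below is a hedged sketch (names such as `ipg_buffer` and the `Fp16Stepper` class are placeholders, not the DeepSpeed ZeRO API): when non-finite FP16 gradients force the step to be skipped, the staged reduction buffer is dropped explicitly instead of surviving into later iterations.

```python
import torch

class Fp16Stepper:
    """Toy loss-scaled stepper; shows releasing staged buffers on a skipped step."""

    def __init__(self, params, lr=1e-3, loss_scale=2.0 ** 16):
        self.params = list(params)
        self.optimizer = torch.optim.SGD(self.params, lr=lr)
        self.loss_scale = loss_scale
        self.ipg_buffer = None  # placeholder for the staged reduction buffer

    def stage_gradients(self):
        # Flatten grads into one buffer, standing in for the real IPG buffer.
        self.ipg_buffer = torch.cat(
            [p.grad.detach().reshape(-1) for p in self.params if p.grad is not None]
        )

    def step(self):
        # stage_gradients() must have been called after the backward passes.
        overflow = not torch.isfinite(self.ipg_buffer).all()
        if overflow:
            # Skipped iteration: lower the loss scale and explicitly drop the
            # staged buffer so it cannot persist across iterations.
            self.loss_scale /= 2
            self.ipg_buffer = None
            self.optimizer.zero_grad(set_to_none=True)
            return False
        for p in self.params:
            if p.grad is not None:
                p.grad.div_(self.loss_scale)  # unscale before the real update
        self.optimizer.step()
        self.optimizer.zero_grad(set_to_none=True)
        self.ipg_buffer = None                # buffer released on the normal path too
        return True
```

In use, `stage_gradients()` would be called after the accumulated backward passes and `step()` in place of the usual optimizer step; the important part is that both the skip path and the normal path null out the staged buffer.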
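
Finally, a rough sketch of how the point-3 overhead can be observed (assumes a CUDA device; the toy linear layer and sizes are arbitrary): compare the peak allocation over a window of one accumulating backward pass against a window of two. The gap is roughly the size of one FP16 gradient copy.

```python
import torch

assert torch.cuda.is_available(), "sketch assumes a CUDA device"
dev = "cuda"
model = torch.nn.Linear(4096, 4096).to(device=dev, dtype=torch.half)

def peak_over_backwards(n_backwards):
    """Peak CUDA memory while running n_backwards accumulating backward passes."""
    model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    for _ in range(n_backwards):
        x = torch.randn(64, 4096, device=dev, dtype=torch.half)
        (model(x).float().mean() / n_backwards).backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()

one = peak_over_backwards(1)
two = peak_over_backwards(2)
# The gap between the two peaks is roughly one extra copy of the FP16 grads,
# allocated while accumulating into .grad on the second backward pass.
print(f"peak, 1 backward:  {one / 2**20:.1f} MiB")
print(f"peak, 2 backwards: {two / 2**20:.1f} MiB")
```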

Tested on a 1-3B GPT-NeoX model configuration.
Loss values and iteration time are identical to the main branch.

@Quentin-Anthony
Member

This LGTM. Thanks!

@Quentin-Anthony merged commit 65d9f99 into main on Apr 1, 2025
6 of 14 checks passed
