
Conversation

@aurelion-source

I found three main sources of memory increase:

  1. With grad_accum > 1, activations from the current iteration aren't freed until two subsequent forward passes have completed, so two large activation buffers are kept in memory at all times.
    Fix: Free activations immediately after the backward pass, which brings memory usage for grad_accum=2 down to the grad_accum=1 level (see the first sketch after this list).
  2. On gradient overflow (caused by the large batch size), DeepSpeed skips the current iteration (no optimizer step) and updates the loss scale. This happens with grad_accum > 4, and the skipped step prevents some reduction buffers from being freed; they then persist for the rest of training.
    Fix: Explicitly free the IPG buffers when an overflow is detected (second sketch below).
  3. With grad_accum > 1, gradients accumulate across multiple loss.backward() calls before the optimizer step, with PyTorch handling the accumulation internally. Memory profiling shows that PyTorch allocates an additional copy of the FP16 gradients during the second backward pass instead of accumulating in place, adding a fixed overhead (third sketch below). Since this behavior is internal to PyTorch, no fix was applied.
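
A minimal sketch of the idea behind fix 1, assuming a toy model and a `retained_outputs` list that stand in for whatever per-micro-batch references the real GPT-NeoX/DeepSpeed engine keeps (this is not the actual patch): clearing those references right after each micro-batch's backward pass lets the allocator reuse the memory before the next forward pass, so only one micro-batch's activations are held at a time.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
grad_accum = 2
retained_outputs = []  # stand-in for per-micro-batch references the engine keeps

for step in range(3):
    optimizer.zero_grad(set_to_none=True)
    for micro in range(grad_accum):
        x = torch.randn(16, 1024, device=device)
        out = model(x)
        retained_outputs.append(out)  # reference that keeps the memory alive
        (out.pow(2).mean() / grad_accum).backward()
        retained_outputs.clear()      # drop it right after backward,
                                      # not two forward passes later
    optimizer.step()
```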
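
The overflow fix follows the same principle. Below is a hedged sketch (names such as `ipg_buffer` and the `Fp16Stepper` class are placeholders, not the DeepSpeed ZeRO API): when non-finite FP16 gradients force the step to be skipped, the staged reduction buffer is dropped explicitly instead of surviving into later iterations.

```python
import torch

class Fp16Stepper:
    """Toy loss-scaled stepper; shows releasing staged buffers on a skipped step."""

    def __init__(self, params, lr=1e-3, loss_scale=2.0 ** 16):
        self.params = list(params)
        self.optimizer = torch.optim.SGD(self.params, lr=lr)
        self.loss_scale = loss_scale
        self.ipg_buffer = None  # placeholder for the staged reduction buffer

    def stage_gradients(self):
        # Flatten grads into one buffer, standing in for the real IPG buffer.
        self.ipg_buffer = torch.cat(
            [p.grad.detach().reshape(-1) for p in self.params if p.grad is not None]
        )

    def step(self):
        # stage_gradients() must have been called after the backward passes.
        overflow = not torch.isfinite(self.ipg_buffer).all()
        if overflow:
            # Skipped iteration: lower the loss scale and explicitly drop the
            # staged buffer so it cannot persist across iterations.
            self.loss_scale /= 2
            self.ipg_buffer = None
            self.optimizer.zero_grad(set_to_none=True)
            return False
        for p in self.params:
            if p.grad is not None:
                p.grad.div_(self.loss_scale)  # unscale before the real update
        self.optimizer.step()
        self.optimizer.zero_grad(set_to_none=True)
        self.ipg_buffer = None                # buffer released on the normal path too
        return True
```

In use, `stage_gradients()` would be called after the accumulated backward passes and `step()` in place of the usual optimizer step; the important part is that both the skip path and the normal path null out the staged buffer.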
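
Finally, a rough sketch of how the point-3 overhead can be observed (assumes a CUDA device; the toy linear layer and sizes are arbitrary): compare the peak allocation over a window of one accumulating backward pass against a window of two. The gap is roughly the size of one FP16 gradient copy.

```python
import torch

assert torch.cuda.is_available(), "sketch assumes a CUDA device"
dev = "cuda"
model = torch.nn.Linear(4096, 4096).to(device=dev, dtype=torch.half)

def peak_over_backwards(n_backwards):
    """Peak CUDA memory while running n_backwards accumulating backward passes."""
    model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    for _ in range(n_backwards):
        x = torch.randn(64, 4096, device=dev, dtype=torch.half)
        (model(x).float().mean() / n_backwards).backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()

one = peak_over_backwards(1)
two = peak_over_backwards(2)
# The gap between the two peaks is roughly one extra copy of the FP16 grads,
# allocated while accumulating into .grad on the second backward pass.
print(f"peak, 1 backward:  {one / 2**20:.1f} MiB")
print(f"peak, 2 backwards: {two / 2**20:.1f} MiB")
```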

Tested on a 1-3B GPT-NeoX model configuration.
Loss values and iteration time are identical to the main branch.

@Quentin-Anthony
Member

This LGTM. Thanks!

@Quentin-Anthony merged commit 65d9f99 into main on Apr 1, 2025
6 of 14 checks passed
