Refine the gradient accumulation API #9078

Open · wants to merge 1 commit into master from rpsilva_grad_acc_v2

Conversation

@rpsilva-aws (Collaborator) commented May 1, 2025

In this PR, we refine the gradient accumulation API to:

  • Make the body function wrapper pure, with no side effects
  • Enforce the purity requirements on the train step
  • Simplify the mapping logic
  • Move all loop logic that is not train-step specific into the body wrapper
  • Initialize the local accumulated gradients on the device (avoids requiring a data transfer when they are not already present)
  • Change the API to return a tuple of carried tensors instead of unpacking them (see the sketch below)
  • Remove the explicit buffer donation, given that the function is pure and given #9054 (Extend device data node binding API to not clone specified input tensors)
  • Fix the RNG handling for all iterations
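
Below is a minimal, self-contained sketch of the calling pattern described above: a pure per-step body and a tuple of carried tensors returned to the caller. The helper and names here are illustrative stand-ins, not the exact torch_xla API, and a plain Python loop stands in for the traced accumulation helper.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 4)
params = tuple(model.parameters())


def step_fn(x, y, carried_loss, carried_grads):
  """Pure per-microbatch body: reads its inputs and carries, returns new carries."""
  loss = torch.nn.functional.cross_entropy(model(x), y)
  grads = torch.autograd.grad(loss, params)
  return (carried_loss + loss.detach(),
          tuple(g + dg for g, dg in zip(carried_grads, grads)))


def accumulate(step_fn, xs, ys, carried):
  # Stand-in for the accumulation helper: iterate over the leading
  # (num-steps) dimension and thread the carried tuple through each step,
  # returning it to the caller instead of unpacking/mutating external state.
  for x, y in zip(xs, ys):
    carried = step_fn(x, y, *carried)
  return carried


xs = torch.randn(4, 2, 8)  # 4 accumulation steps, microbatch size 2
ys = torch.randint(0, 4, (4, 2))
zero_grads = tuple(torch.zeros_like(p) for p in params)  # local accumulators
loss_sum, grad_sums = accumulate(step_fn, xs, ys, (torch.zeros(()), zero_grads))
for p, g in zip(params, grad_sums):
  p.grad = g / len(xs)  # hand the averaged gradients to the optimizer
print(loss_sum / len(xs))
```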

Testing:

  • Validated the existing MLP A/B test, with and without gradient checkpointing
  • Added a few basic sanity tests
  • Validated the API with Llama 3.1 8B

cc: @mcuiaws

@rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch 7 times, most recently from 9bfb1f7 to 930b208 on May 5, 2025 20:24
@rpsilva-aws assigned tengyifei and unassigned tengyifei on May 5, 2025
@rpsilva-aws requested a review from tengyifei on May 5, 2025 20:33
@rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch 4 times, most recently from cf46ce1 to e0df762 on May 6, 2025 02:31
@rpsilva-aws marked this pull request as ready for review on May 6, 2025 03:04
@rpsilva-aws (Collaborator, Author) commented:

Hmm, only PJRT_DEVICE=CUDA is having issues with the existing MLP A/B test: SIG11 on torch_xla::runtime::PjRtComputationClient::PjRtShardedData::GetHandle(). I'll look into it.

@rpsilva-aws (Collaborator, Author) commented May 8, 2025

I can't reproduce the SIG11 observed in https://github.com/pytorch/xla/actions/runs/14850406478/job/41698575596?pr=9078 with CUDA; it succeeds on an NVIDIA A100-SXM4-40GB:

| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |

with the same 2.7 container: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.7.0_3.10_cuda_12.6

The CI run doesn't even hit a single print line in the root test file (be it 1 or 2):

+ python3 /__w/xla/xla/pytorch/xla/test/spmd/test_train_spmd_linear_model.py --skip-gradient-checkpointing
/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:236: UserWarning: XLA_USE_SPMD is being deprecated. Use torch_xla.runtime.use_spmd() without setting XLA_USE_SPMD env-var.
  warnings.warn("XLA_USE_SPMD is being deprecated. "
./usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:242: UserWarning: Replicating tensors already initialized on non-virtual XLA device for SPMD to force SPMD mode. This is one-time overhead to setup, and to minimize such, please set SPMD mode before initializting tensors (i.e., call use_spmd() in the beginning of the program).
  warnings.warn(
*** Received signal 11 ***
...

@tengyifei - I assume these were running earlier, since we brought them into the CI. Is there anything I am missing? I wouldn't expect it to be specific to T4s (G4dn).

@tengyifei (Collaborator) commented:

@rpsilva-aws the GPU CI is not very stable. I would worry only about TPU CI and CPU CI for now, and make sure your tests are registered in those two environments!

@tengyifei (Collaborator) commented:

You're welcome to file a GPU-specific issue.

@rpsilva-aws (Collaborator, Author) commented May 9, 2025

@tengyifei Thanks, perfect - will do! TPU, CPU (and TRN) are all covered :) Do we have the means to disable a test for GPU?

@rpsilva-aws (Collaborator, Author) commented:

#9128

@tengyifei (Collaborator) commented:

@rpsilva-aws you can disable a GPU test by marking it as "skipped" using the unittest API. When you disable a test, you should attach the bug reference URL in the message. Use your judgement as to whether the failure is truly due to something else broken in GPU versus a bug in your PR.
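
A minimal sketch of that skip pattern, referencing the GPU issue filed above. The test class and method names are placeholders, and it assumes torch_xla.runtime.device_type() is available to detect the accelerator:

```python
import unittest

import torch_xla.runtime as xr


class GradientAccumulationTest(unittest.TestCase):

  # Skip only on GPU, with the bug reference in the message.
  @unittest.skipIf(xr.device_type() == "CUDA",
                   "SIG11 on the GPU CI; see #9128")
  def test_gradient_accumulation_matches_baseline(self):
    ...  # placeholder for the actual A/B comparison


if __name__ == "__main__":
  unittest.main()
```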

@rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch 3 times, most recently from 498d10c to 0c84b82 on May 9, 2025 03:17
@rpsilva-aws (Collaborator, Author) commented May 9, 2025

Absolutely - had I not tried to reproduce on an A100, it would be harder to judge, but given that it succeeded there (all other devices aside), I don't think it is a bug in the PR.

In any case, I will take the slow approach: flush a few logs, partially skip the test, and make a better decision after a couple of CI runs.

@rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch 2 times, most recently from 128fb42 to c6acb9c on May 9, 2025 05:57
@rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch from c6acb9c to 82c5864 on May 9, 2025 18:24