Improvements and fixes to gradient accumulation #993
base: main
Conversation
axlearn/experiments/text/gpt/fuji.py
Outdated
# Note: the batch axes are different here than in
# `cfg.batch_axis_names`,
# as we partition sequence dim over `seq`.
(None, 1): PartitionSpec(("data", "expert", "fsdp")),
I am wondering: if we have a default input partition with axis=0 on ("data", "expert", "fsdp") and axis=1 on "seq", do we still need this?
Thanks for the quick review.
(None, 1) is for target_num_bytes, and (None, 2) is for input_ids and target_labels, so we need both. Together they will work for most cases, but for the outliers where a specific sharding is required, the ability to change the sharding of the minibatches will be good to have.
Let me know if this answers your question.
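For reference, a minimal sketch of the rank-to-PartitionSpec mapping described above; the exact axlearn config keys may differ, this just mirrors the specs quoted in this thread:

```python
from jax.sharding import PartitionSpec

# Rank-1 tensors such as `target_num_bytes`: shard the batch dim over
# ("data", "expert", "fsdp"). Rank-2 tensors such as `input_ids` and
# `target_labels`: additionally shard the sequence dim over "seq".
path_rank_to_partition = {
    (None, 1): PartitionSpec(("data", "expert", "fsdp")),
    (None, 2): PartitionSpec(("data", "expert", "fsdp"), "seq"),
}
```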
),
input_partition_spec(),
To me, this seems more like a hack than a proper solution. That is, whenever we want a different input_partition_spec() than the default one, we need this?
Sorry, I missed the default case; added it.
I think the partition spec below is a good default, but the ability to change the PartitionSpec might be good to have. What do you think?
(None, 1): PartitionSpec(("data", "expert", "fsdp")),
(None, 2): PartitionSpec(("data", "expert", "fsdp"), "seq"),
Force-pushed from 9b0f9a3 to 32a78ea
- Fix the with_minibatch_steps decorator to generate correct primal output shapes.
- Improve with_minibatch_steps to take a minibatch_partitioner that constrains the input batch to the same PartitionSpec as the input partitioner.
Force-pushed from 32a78ea to 186e082
@@ -57,39 +59,38 @@ def _make_scan_minibatch_inputs(
    param_noise_key: Tensor,
    minibatch_size: int,
    minibatch_index: int,
    minibatch_partitioner: Optional[InputPartitionFn],
Echoing Kelvin's comment, could you explain concretely why we need this functionality? If it's just something that might be useful, maybe we can wait until we are certain that we will need it?
When gradient accumulation is not enabled, the inputs to the graph are sharded per the policy in input_partitioner. This ensures the batch dimension is sharded on the data, expert and fsdp axes while the sequence dimension is replicated on the model axis.
Gradient accumulation wraps the train step in a scan loop. The input_partitioner shards the input batch correctly at first, but inside the gradient accumulation wrapper the input batches are resharded/overridden by _make_scan_minibatch_inputs and sharded along all available axes, which is probably unexpected and inefficient. Minibatches should follow the same PartitionSpec as the input batches.
Adding the minibatch_partitioner allows the minibatches to use the same sharding/PartitionSpec that input_partitioner provides for the input batches when gradient accumulation is not used.
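A rough sketch of how such a minibatch_partitioner could be applied when slicing out each minibatch; the helper name and signature are hypothetical and the actual axlearn decorator differs:

```python
import jax

def make_minibatch(input_batch, minibatch_index, minibatch_size, minibatch_partitioner=None):
    # Slice the minibatch out of the global batch along axis 0.
    minibatch = jax.tree_util.tree_map(
        lambda x: jax.lax.dynamic_slice_in_dim(
            x, minibatch_index * minibatch_size, minibatch_size, axis=0
        ),
        input_batch,
    )
    # Re-apply the same partitioning policy the input_partitioner used,
    # instead of constraining along all available axes.
    return minibatch_partitioner(minibatch) if minibatch_partitioner else minibatch
```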
If we just preserve the sharding the input already has, would that also address the concern about the input sharding being changed?
Yeah, preserving the sharding of the input and not having a sharding_constraint for minibatches would address the concern as well.
# Default partitioner for minibatches.
if not minibatch_partitioner:
    minibatch_partitioner = partition_by_path_rank(
        path_rank_to_partition={
Can we default this to the same sharding the input is already using along all non-batch axes?
Just confirming I read this correctly: do we want to default to input_partition_specs from utils.py like before, and not what the input_partitioner sets? Or is the ask to use partition_by_path_rank to replicate what input_partition_specs was doing?
Not exactly. I was envisioning that for all axes other than axis 0, we default to whatever sharding the input already has. For axis 0, ideally we could also keep whatever sharding the input already has too, although I'm not sure that would work with logical batching.
For axis 0, ideally we could also keep whatever sharding the input already has too, although I'm not sure that would work with logical batching
I think preserving the sharding of the input would be perfect; logical batching already inserts the correct sharding constraint after squeezing out the padded batches.
Removed additional sharding constraints from the gradient accumulation decorator; minibatches should now use the sharding spec created by the input partitioner.
Force-pushed from 8c16718 to 306f089
),
input_partition_spec(),
inputs["input_batch"],
Suppose we have a global input batch of size 100 running on 10 chips (so a per chip size of 10) and we want to switch to doing 10 grad accumulation steps each with a global batch size of 10 (1 per chip per accumulation step).
Suppose that the input is originally sharded evenly across the chips (first 10 on first chip, second 10 on second chip, etc). Then when we get the first slice of 10 for the first grad accumulation step, won't all these examples be on the same chip? Will that cause a problem? (E.g., if we worry XLA might not automatically reshard the examples across chips?)
Maybe we should reshard the batch axis only?
+1 on the potential design problem here. Can you double check and ensure that axis=0 is confirmed to be batch size?
We can completely avoid the batch reshards using a reshape + transpose. I added it to the PR; let me know if it addresses your concerns.
Using the same example as @apghml:
Suppose we have a global input batch of size 100 running on 10 chips (so a per chip size of 10) and we want to switch to doing 10 grad accumulation steps each with a global batch size of 10 (1 per chip per accumulation step).
Suppose that the input is originally sharded evenly across the chips (first 10 on first chip, second 10 on second chip, etc). Then when we get the first slice of 10 for the first grad accumulation step, won't all these examples be on the same chip? Will that cause a problem? (E.g., if we worry XLA might not automatically reshard the examples across chips?)
Rather than using the first 10 examples available in the global batch array for the first iteration, we construct each minibatch from the first example on every device, i.e. minibatch 0 => [0, 10, 20, ...], minibatch 1 => [1, 11, 21, ...]. This is achieved using the reshape and transpose (see the sketch below).
Essentially the logic here is to ensure each device uses local examples, avoiding extra reshards.
This also scales well across multiple nodes, since each node only runs a local reshape + transpose, and higher per-device batch sizes are also supported.
This should address the concerns around input batch reshards; let me know if there are still more concerns.
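A minimal sketch of the reshape + transpose construction described above (toy helper name, not the actual PR code):

```python
import jax.numpy as jnp

def to_minibatches(x: jnp.ndarray, steps: int) -> jnp.ndarray:
    """Reshape [global_batch, ...] so that minibatch i takes one example from
    each device-local block: [GBS, ...] -> [MBS, steps, ...] -> [steps, MBS, ...]."""
    minibatch_size = x.shape[0] // steps
    x = x.reshape(minibatch_size, steps, *x.shape[1:])
    return jnp.swapaxes(x, 0, 1)

# With GBS=100 and steps=10: minibatch 0 = examples [0, 10, 20, ...],
# minibatch 1 = examples [1, 11, 21, ...], matching the description above.
batch = jnp.arange(100)
print(to_minibatches(batch, steps=10)[0])  # [ 0 10 20 30 40 50 60 70 80 90]
```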
+1 on the potential design problem here. Can you double check and ensure that axis=0 is confirmed to be batch size?
@kelvin-zou I can't think of a way to get the size of a specific axis at runtime, but I do believe JAX should give an informative error if batch size % batch axis size != 0.
Thanks for the explanation. Can you add a test that fails without this fix?
@@ -172,12 +167,26 @@ def fwd_helper(
        otherwise None.
    """
    minibatch_size = _compute_minibatch_size(inputs["input_batch"], steps=steps)

    # Create a sample minibatch for the carry buffer creation below
Could you explain in more detail why this is needed?
+1
I saw broadcasting errors coming from the scan body (example below); JAX complained that the carry buffer shape and the output of the minibatch step are incompatible.
The error below is from a run where acc=4 and the full batch size is 32:
TypeError: add got incompatible shapes for broadcasting: (32, 4096, 3072), (8, 4096, 3072).
The carry buffer initialization uses the full batch while creating the buffer, which does not match the output of the minibatch step, since that uses the shapes of the minibatch.
The simple fix is to create the carry buffer from a sample minibatch, ensuring its shapes are the same as the minibatch step's output (see the sketch below).
Let me know if I missed something.
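A minimal repro-style sketch of the shape mismatch and the fix, using toy shapes and a stand-in step function (the real error involved (32, 4096, 3072) vs. (8, 4096, 3072)):

```python
import jax
import jax.numpy as jnp

steps, full_bs = 4, 32

def minibatch_step(carry, minibatch):
    out = minibatch * 2.0     # stand-in for the real per-minibatch outputs
    return carry + out, None  # accumulate outputs into the carry

full_batch = jnp.ones((full_bs, 16, 8))
minibatches = full_batch.reshape(steps, full_bs // steps, 16, 8)

# Wrong: a carry built from the full batch has shape (32, 16, 8), while each
# minibatch step produces (8, 16, 8) -> "incompatible shapes for broadcasting".
# carry = jnp.zeros_like(full_batch)

# Fix: build the carry from a sample minibatch so it matches the step output.
carry = jnp.zeros_like(minibatches[0])
acc, _ = jax.lax.scan(minibatch_step, carry, minibatches)
```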
Do we know why this issue wasn't causing errors before?
The unit test uses a toy model which does not have any metric/output that relies on batch size, which is why it does not catch this issue. I dug a bit deeper and found that for fuji models, output_collection/module_outputs/decoder/transformer/layer3/output carries the batch dimension in its output - ref below.
path (GetAttrKey(name='output_collection'), GetAttrKey(name='module_outputs'), DictKey(key='decoder'), DictKey(key='transformer'), DictKey(key='layer3'), DictKey(key='output')) shape (32, 4096, 3072)
Overall looks good to me; will approve once @apghml's comments are addressed.
@@ -172,12 +167,26 @@ def fwd_helper(
        otherwise None.
    """
    minibatch_size = _compute_minibatch_size(inputs["input_batch"], steps=steps)

    # Create a sample minibatch for the carry buffer creation below
+1
),
input_partition_spec(),
inputs["input_batch"],
+1 on the potential design problem here. Can you double check and ensure that axis=0 is confirmed to be batch size?
"""Helper function that adds a minibatch dimension while evenly dividing | ||
batches across gradient accumulation iterations. | ||
|
||
Input dimension is [GBS, seq], this first reshaped to [MBS, steps, seq], |
Replace the acronyms with full names?
),
input_partition_spec(),
inputs["input_batch"],
Thanks for the explanation. Can you add a test that fails without this fix?
# Set up transpose to swap the first two dimensions.
dims = list(range(x.ndim))
dims[0], dims[1] = dims[1], dims[0]
return x.transpose(dims)
Could we replace these three lines with one line using jnp.moveaxis?
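For reference, the suggested one-liner, wrapped in a hypothetical helper name and assuming x has at least two dimensions:

```python
import jax.numpy as jnp

def swap_leading_dims(x):
    # jnp.moveaxis(x, 0, 1) moves axis 0 to position 1, i.e. it swaps the
    # first two dimensions and leaves the rest unchanged.
    return jnp.moveaxis(x, 0, 1)
```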
- Fix the with_minibatch_steps decorator to generate correct primal output shapes.
- Improve with_minibatch_steps to take a minibatch_partitioner that constrains the accumulation minibatch to the same PartitionSpec as input_partitioner.
Misc: