
Conversation

juyterman1000

… meta key (max_mem)

@juyterman1000 juyterman1000 force-pushed the fix/dc-zero3-allgather-uneven-shards branch from eac514f to 1f39153 Compare August 15, 2025 00:32
@sfc-gh-truwase
Collaborator

@juyterman1000 can you please address the formatting issue using https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

std::vector<int64_t> host_counts(world_size);
for (int i = 0; i < world_size; ++i) {
host_counts[i] = all_counts[i].to(torch::kCPU).item<int64_t>();
if (host_counts[i] > max_count) { max_count = host_counts[i]; }
Contributor

Could you elaborate more on when ds_tensor.numel() of the same parameter can differ on different ranks? I think padding is already taken into account when the parameter is partitioned among the ranks (ref: https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L1664)

In case partition sizes do vary across ranks, can we fix that in partition_parameters.py to avoid synchronous communication here? launchAllGather() is on the critical path, so synchronous allgather can hurt performance.

Author

Thanks for the sharp catch. I’ve removed the synchronous size-allgather from the hot path in launchAllGather() and now use a fixed-count NCCL allgather, trimming any end padding to the true param size. To keep the safety check without paying the runtime cost, I added a one-time registration-time assertion that shard sizes match across ranks; if there’s ever a mismatch, we’ll catch it at the source rather than synchronizing in the critical path. Changes are in the updated PR.

Contributor

I think we can further optimize the code by making allgatherParam() allocate a buffer with padding in the first place (today it allocates a buffer of ds_shape, which is the true size of the gathered parameter). With that we don't need any additional memcpy or GPU memory allocation/deallocation. Instead we can slice the gathered output_buf before returning it. My understanding is that torch can correctly track the refcount of the underlying buffers even when live tensors use only part of them, but correct me if I'm wrong.
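
A quick way to sanity-check the refcounting claim, as an illustrative libtorch sketch rather than anything from this PR:

#include <torch/torch.h>
#include <cassert>

// Returns a view of the first 1000 elements of a larger (padded) buffer.
at::Tensor make_slice() {
    at::Tensor padded = torch::empty({1024}, torch::kFloat);  // stand-in for a padded gather buffer
    return padded.narrow(0, 0, 1000);                         // view into the same storage
}

int main() {
    at::Tensor p = make_slice();  // the local `padded` tensor is gone; only the view survives
    // The view still keeps the full 1024-element storage alive.
    assert(p.storage().nbytes() == 1024 * p.element_size());
    return 0;
}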

Collaborator

@eternalNight thanks for the suggestion. @juyterman1000 if you agree with this, do you want to address it in a follow-up PR? A benefit of a follow-up PR is that it could document the perf benefit of the optimization separately from the functionality change.

Contributor

Hi @juyterman1000,

Thank you for the PR! As some changes are unclear to me, can you explain a bit more?
You added an assertion to ensure even sharding, which totally makes sense to me. Do we still need the changes in launchAllGather()? The additional memory allocation and copy might cause significant overhead in some cases.

Author

@eternalNight Yes, we can allocate a buffer sized to world_size * shard_elems up front and slice it to the true size on return. PyTorch views hold a reference to the underlying storage, so returning a sliced view does not break refcounting. We can cache the padded buffer per param to avoid repeat allocations.

@sfc-gh-truwase Agreed on the follow-up. I'll include micro-benchmarks showing the removal of one alloc + one memcpy per all-gather and any other gains.

@tohtana With the even-sharding assertion in place, we don't need the extra copy logic in launchAllGather(). We can issue a direct AllGather with a uniform shard element count into the padded buffer and return a view of the first true_numel elements reshaped to the original param shape. The symmetric-memory path stays as-is; this avoids additional copy overhead.
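
Roughly what I have in mind, as a sketch rather than the exact PR code (the dtype mapping, comm, and stream plumbing are assumed to come from the existing launchAllGather context):

#include <torch/torch.h>
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: gather uniform shards into a padded buffer, then return a true-sized
// view. `true_numel`, `ds_shape`, `comm`, and `stream` are assumed to be
// available from the surrounding launchAllGather/registry context.
at::Tensor gatherParamPadded(const at::Tensor& ds_tensor,
                             int64_t true_numel,
                             at::IntArrayRef ds_shape,
                             int world_size,
                             ncclComm_t comm,
                             cudaStream_t stream)
{
    const int64_t shard_elems = ds_tensor.numel();  // identical on every rank (checked at registration)
    at::Tensor send = ds_tensor.contiguous();       // keep alive across the async collective
    at::Tensor padded = torch::empty({world_size * shard_elems}, ds_tensor.options());

    // Fixed-count allgather: every rank contributes exactly shard_elems elements.
    ncclAllGather(send.data_ptr(),
                  padded.data_ptr(),
                  shard_elems,
                  ncclFloat,  // assumption: the real code maps ds_tensor.scalar_type() to an ncclDataType_t
                  comm,
                  stream);

    // Trim the trailing padding and reshape to the original parameter shape.
    // narrow() returns a view, so the padded storage stays alive and no copy is made.
    return padded.narrow(0, 0, true_numel).view(ds_shape);
}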

Contributor

@juyterman1000 The path using symmetric memory is experimental and not well optimized, so we need to keep the non-symmetric-memory path as the choice for the best performance.
If the allocation and copy are there for uneven partitioning, and the assertion blocks such uneven partitioning, why can't we remove them?

Author

@tohtana Yep. The non-symmetric memory path is indeed the performance path, and we can remove the extra copy logic since the uniform shard assertion prevents uneven partitioning.

@juyterman1000
Author

@juyterman1000 can you please address the formatting issue using https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

Thanks for the check. I followed the formatting prerequisites in the contributing guide and ran the full pre-commit suite. I’ve pushed the updates.

@sfc-gh-truwase sfc-gh-truwase requested a review from tohtana August 18, 2025 16:13
const int64_t shard_elems = ds_tensor.numel();

// Perform all-gather directly into the pre-allocated padded output buffer
ncclResult_t result = ncclAllGather(ds_tensor.flatten().data_ptr(),
Contributor

Why replace .contiguous() with .flatten()? .contiguous() makes sure that the underlying storage is contiguous, which is required by NCCL; .flatten() is a view change and does not guarantee that.

Note: I believe the sharded tensors are already contiguous as they are already defragmented by DeepSpeedZeroOptimizer_Stage3.defragment(), but adding a .contiguous() does not hurt anyway and may help later when the layout of sharded tensors is changed.
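
For reference, a minimal case showing the difference (illustrative only, not from the PR): a strided 1-D tensor stays non-contiguous through flatten(), while contiguous() guarantees dense storage for NCCL.

#include <torch/torch.h>
#include <iostream>

int main() {
    // A strided 1-D tensor: every other element of a length-10 buffer.
    at::Tensor t = torch::arange(10).slice(/*dim=*/0, /*start=*/0, /*end=*/10, /*step=*/2);

    std::cout << t.is_contiguous() << "\n";               // 0: strided, not dense
    std::cout << t.flatten().is_contiguous() << "\n";     // 0: already 1-D, flatten() returns it as-is
    std::cout << t.contiguous().is_contiguous() << "\n";  // 1: contiguous() materializes a dense copy
    return 0;
}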

Author
@juyterman1000 juyterman1000 Aug 22, 2025

Yep, it's already contiguous due to DeepSpeedZeroOptimizer_Stage3.defragment().

Contributor

The latest version looks good to me. Unfortunately resolving threads does not work ...

}

at::Tensor output_buf;
if (param_registry_->hasGatheredParam(ds_id)) {
Contributor

I'm not sure when isValid(ds_id) is false while hasGatheredParam(ds_id) is true. They are both set at the end of launchAllGather(), and releasing a gathered param will unset the valid flag in unregisterGatheredParam().

…iew; launchAllGather issues direct NCCL AllGather for uniform shards; add registration-time uniform-shard validation

Signed-off-by: Abhishek <[email protected]>
@juyterman1000 juyterman1000 force-pushed the fix/dc-zero3-allgather-uneven-shards branch from 34df823 to ffa2aba Compare August 22, 2025 03:12
@tohtana
Contributor

tohtana commented Aug 27, 2025

Hi @juyterman1000,

Thank you, the changes look almost good to me, but I encountered a compilation error:

/home/ray/default/DeepSpeed/deepspeed/ops/csrc/compile/z3.cpp:569:30: error: cannot convert ‘std::vector<at::Tensor>’ to ‘std::vector<std::vector<at::Tensor> >&’
  569 |     process_group->allgather(all_counts, local_count_tensor)->wait();
      |                              ^~~~~~~~~~
      |                              |
      |                              std::vector<at::Tensor>

I think just wrapping the arguments with std::vector should work. Can you also add a test in TestDeepCompile (tests/unit/runtime/compile/test_compile_zero.py)? You can reuse the existing test and set a different hidden size to make the shard size uneven.

juyterman1000 and others added 3 commits August 27, 2025 05:52
- Fix allgather call to use proper vector wrapping for process_group API
- Add test_uneven_shard_assertion to verify registration-time validation
- Add hidden_dim_override parameter to compare_loss utility function

Signed-off-by: Abhishek <[email protected]>
@tohtana
Contributor

tohtana commented Aug 29, 2025

Hi @juyterman1000
I still see a compilation error. The first argument is a buffer to store the result of allgather. So you can't pass an rvalue.

/home/ray/default//DeepSpeed.pr7489/deepspeed/ops/csrc/compile/z3.cpp:431:26: error: cannot bind non-const lvalue reference of type ‘std::vector<std::vector<at::Tensor> >&’ to an rvalue of type ‘std::vector<std::vector<at::Tensor> >’
  431 |         ->allgather(std::vector<std::vector<at::Tensor>>{all_counts},
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ray/default/DeepSpeed.pr7489/deepspeed/ops/csrc/includes/deepcompile.h:21,
                 from /home/ray/default/DeepSpeed.pr7489/deepspeed/ops/csrc/compile/z3.h:6,
                 from /home/ray/default/DeepSpeed.pr7489/deepspeed/ops/csrc/compile/z3.cpp:6:
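
In other words, something along these lines should compile; this is a fragment sketch using the names from the error output, with the surrounding declarations assumed rather than taken from the PR:

#include <torch/torch.h>
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

// Both allgather arguments must be named lvalues, since the c10d API takes
// them by non-const reference.
void allgatherShardCounts(c10::intrusive_ptr<c10d::ProcessGroup> process_group,
                          std::vector<at::Tensor>& all_counts,
                          at::Tensor local_count_tensor)
{
    std::vector<at::Tensor> inputs{local_count_tensor};
    std::vector<std::vector<at::Tensor>> outputs{all_counts};  // named lvalue, not a temporary
    // outputs[0] shares tensor storage with all_counts, so the gathered values
    // are visible through all_counts after wait().
    process_group->allgather(outputs, inputs)->wait();
}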

@juyterman1000
Author

All changes are implemented.

@tohtana
Contributor

tohtana commented Sep 8, 2025

@juyterman1000
The existing tests in test_compile_zero.py::TestDeepCompile::test don't pass. They raise the new error that this PR adds:

RuntimeError: ZeRO-3 registration error: non-uniform shard sizes detected across ranks. Please check parameter partitioning.

By the way, it seems #7509 broke z1/2 paths for DeepCompile, but you can still check it with Z3.

adalakoti90 and others added 5 commits September 11, 2025 22:52
… uniform shard assertion to check padded sizes instead of actual sizes

- DeepSpeed adds padding to ensure even division across ranks
- Change test from expecting failure to verifying correct handling of padded params
- This fixes existing DeepCompile tests that use parameters not evenly divisible by world_size

Signed-off-by: Abhishek <[email protected]>
…terman1000/DeepSpeed into fix/dc-zero3-allgather-uneven-shards

…not grad_buffer

- Fixes edge cases where grad_buffer is partition-sized
- Keeps uniform padded size invariant and avoids false positives

Signed-off-by: Abhishek <[email protected]>
- Ensures buffer reuse and allocations match padded layout

Signed-off-by: Abhishek <[email protected]>