
Conversation

juyterman1000

… meta key (max_mem)

@juyterman1000 juyterman1000 force-pushed the fix/dc-zero3-allgather-uneven-shards branch from eac514f to 1f39153 Compare August 15, 2025 00:32
@sfc-gh-truwase
Collaborator

@juyterman1000 can you please address the formatting issue using https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

std::vector<int64_t> host_counts(world_size);
for (int i = 0; i < world_size; ++i) {
host_counts[i] = all_counts[i].to(torch::kCPU).item<int64_t>();
if (host_counts[i] > max_count) { max_count = host_counts[i]; }
Contributor

Could you elaborate more on when ds_tensor.numel() of the same parameter can differ on different ranks? I think padding is already taken into account when the parameter is partitioned among the ranks (ref: https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/partition_parameters.py#L1664)

In case partition sizes do vary across ranks, can we fix that in partition_parameters.py to avoid synchronous communication here? launchAllGather() is on the critical path, so synchronous allgather can hurt performance.

Author

Thanks for the sharp catch. I’ve removed the synchronous size-allgather from the hot path in launchAllGather() and now use a fixed-count NCCL allgather, trimming any end padding to the true param size. To keep the safety check without paying the runtime cost, I added a one-time registration-time assertion that shard sizes match across ranks; if there’s ever a mismatch, we’ll catch it at the source rather than synchronizing in the critical path. Changes are in the updated PR.

Contributor

I think we can further optimize the code by making allgatherParam() allocate a buffer with padding in the first place (today it allocates a buffer of ds_shape, which is the true size of the gathered parameter). With that we don't need any additional memcpy or GPU memory allocation/deallocation. Instead we can slice the gathered output_buf before returning it. My understanding is that torch can correctly track the refcount of the underlying buffers even when live tensors use only part of them, but correct me if I'm wrong.
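
A quick way to sanity-check the refcounting claim, as an illustrative libtorch sketch rather than anything from this PR:

#include <torch/torch.h>
#include <cassert>

// Returns a view of the first 1000 elements of a larger (padded) buffer.
at::Tensor make_slice() {
    at::Tensor padded = torch::empty({1024}, torch::kFloat);  // stand-in for a padded gather buffer
    return padded.narrow(0, 0, 1000);                         // view into the same storage
}

int main() {
    at::Tensor p = make_slice();  // the local `padded` tensor is gone; only the view survives
    // The view still keeps the full 1024-element storage alive.
    assert(p.storage().nbytes() == 1024 * p.element_size());
    return 0;
}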

Collaborator

@eternalNight thanks for the suggestion. @juyterman1000 if you agree with this, do you want to address it in a follow-up PR? A benefit of a follow-up PR is that it could document the perf benefit of the optimization separately from the functionality change.

Contributor

Hi @juyterman1000,

Thank you for the PR! As some changes are unclear to me, can you explain a bit more?
You added an assertion to ensure even sharding, which totally makes sense to me. Do we still need the changes in launchAllGather()? The additional memory allocation and copy might cause significant overhead in some cases.

Author

@eternalNight Yes, we can allocate a buffer sized to world_size * shard_elems up front and slice it to the true size on return. PyTorch views hold a reference to the underlying storage, so returning a sliced view does not break refcounting. We can cache the padded buffer per param to avoid repeat allocations.

@sfc-gh-truwase Agreed on the follow-up. I'll include micro-benchmarks showing the removal of one alloc + one memcpy per all-gather and any other gains.

@tohtana With the even-sharding assertion in place, we don't need the extra copy logic in launchAllGather(). We can issue a direct AllGather with a uniform shard element count into the padded buffer and return a view of the first true_numel elements reshaped to the original param shape. The symmetric-memory path stays as-is; this avoids additional copy overhead.
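
Roughly what I have in mind, as a sketch rather than the exact PR code (the dtype mapping, comm, and stream plumbing are assumed to come from the existing launchAllGather context):

#include <torch/torch.h>
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: gather uniform shards into a padded buffer, then return a true-sized
// view. `true_numel`, `ds_shape`, `comm`, and `stream` are assumed to be
// available from the surrounding launchAllGather/registry context.
at::Tensor gatherParamPadded(const at::Tensor& ds_tensor,
                             int64_t true_numel,
                             at::IntArrayRef ds_shape,
                             int world_size,
                             ncclComm_t comm,
                             cudaStream_t stream)
{
    const int64_t shard_elems = ds_tensor.numel();  // identical on every rank (checked at registration)
    at::Tensor send = ds_tensor.contiguous();       // keep alive across the async collective
    at::Tensor padded = torch::empty({world_size * shard_elems}, ds_tensor.options());

    // Fixed-count allgather: every rank contributes exactly shard_elems elements.
    ncclAllGather(send.data_ptr(),
                  padded.data_ptr(),
                  shard_elems,
                  ncclFloat,  // assumption: the real code maps ds_tensor.scalar_type() to an ncclDataType_t
                  comm,
                  stream);

    // Trim the trailing padding and reshape to the original parameter shape.
    // narrow() returns a view, so the padded storage stays alive and no copy is made.
    return padded.narrow(0, 0, true_numel).view(ds_shape);
}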

Contributor

@juyterman1000 The path using symmetric memory is experimental and not well optimized, so we need to keep the non-symmetric-memory path as the choice for the best performance.
If the allocation and copy are there for uneven partitioning, and the assertion blocks such uneven partitioning, why can't we remove them?

Author

@tohtana Yep. The non-symmetric memory path is indeed the performance path, and we can remove the extra copy logic since the uniform shard assertion prevents uneven partitioning.

@juyterman1000
Author

@juyterman1000 can you please address the formatting issue using https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

Thanks for the check. I followed the formatting prerequisites in the contributing guide and ran the full pre-commit suite. I’ve pushed the updates.

@sfc-gh-truwase sfc-gh-truwase requested a review from tohtana August 18, 2025 16:13
const int64_t shard_elems = ds_tensor.numel();

// Perform all-gather directly into the pre-allocated padded output buffer
ncclResult_t result = ncclAllGather(ds_tensor.flatten().data_ptr(),
Contributor

Why replace .contiguous() with .flatten()? .contiguous() makes sure that the underlying storage is contiguous, which is required by NCCL; .flatten() is a view change and does not guarantee that.

Note: I believe the sharded tensors are already contiguous as they are already defragmented by DeepSpeedZeroOptimizer_Stage3.defragment(), but adding a .contiguous() does not hurt anyway and may help later when the layout of sharded tensors is changed.
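
For reference, a minimal case showing the difference (illustrative only, not from the PR): a strided 1-D tensor stays non-contiguous through flatten(), while contiguous() guarantees dense storage for NCCL.

#include <torch/torch.h>
#include <iostream>

int main() {
    // A strided 1-D tensor: every other element of a length-10 buffer.
    at::Tensor t = torch::arange(10).slice(/*dim=*/0, /*start=*/0, /*end=*/10, /*step=*/2);

    std::cout << t.is_contiguous() << "\n";               // 0: strided, not dense
    std::cout << t.flatten().is_contiguous() << "\n";     // 0: already 1-D, flatten() returns it as-is
    std::cout << t.contiguous().is_contiguous() << "\n";  // 1: contiguous() materializes a dense copy
    return 0;
}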

Author
@juyterman1000 juyterman1000 Aug 22, 2025

Yep, it's already contiguous due to DeepSpeedZeroOptimizer_Stage3.defragment().

Contributor

The latest version looks good to me. Unfortunately resolving threads does not work ...

}

at::Tensor output_buf;
if (param_registry_->hasGatheredParam(ds_id)) {
Contributor

I'm not sure when isValid(ds_id) is false while hasGatheredParam(ds_id) is true. They are both set at the end of launchAllGather(), and releasing a gathered param will unset the valid flag in unregisterGatheredParam().

…iew; launchAllGather issues direct NCCL AllGather for uniform shards; add registration-time uniform-shard validation

Signed-off-by: Abhishek <[email protected]>
@juyterman1000 juyterman1000 force-pushed the fix/dc-zero3-allgather-uneven-shards branch from 34df823 to ffa2aba Compare August 22, 2025 03:12
@tohtana
Contributor

tohtana commented Aug 27, 2025

Hi @juyterman1000,

Thank you, the changes look almost good to me, but I encountered a compilation error:

/home/ray/default/DeepSpeed/deepspeed/ops/csrc/compile/z3.cpp:569:30: error: cannot convert ‘std::vector<at::Tensor>’ to ‘std::vector<std::vector<at::Tensor> >&’
  569 |     process_group->allgather(all_counts, local_count_tensor)->wait();
      |                              ^~~~~~~~~~
      |                              |
      |                              std::vector<at::Tensor>

I think just wrapping the arguments with std::vector should work. Can you also add a test in TestDeepCompile (tests/unit/runtime/compile/test_compile_zero.py)? You can reuse the existing test and set a different hidden size to make the shard size uneven.

juyterman1000 and others added 3 commits August 27, 2025 05:52
- Fix allgather call to use proper vector wrapping for process_group API
- Add test_uneven_shard_assertion to verify registration-time validation
- Add hidden_dim_override parameter to compare_loss utility function

Signed-off-by: Abhishek <[email protected]>
@tohtana
Contributor

tohtana commented Aug 29, 2025

Hi @juyterman1000
I still see a compilation error. The first argument is a buffer to store the result of allgather. So you can't pass an rvalue.

/home/ray/default//DeepSpeed.pr7489/deepspeed/ops/csrc/compile/z3.cpp:431:26: error: cannot bind non-const lvalue reference of type ‘std::vector<std::vector<at::Tensor> >&’ to an rvalue of type ‘std::vector<std::vector<at::Tensor> >’
  431 |         ->allgather(std::vector<std::vector<at::Tensor>>{all_counts},
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ray/default/DeepSpeed.pr7489/deepspeed/ops/csrc/includes/deepcompile.h:21,
                 from /home/ray/default/DeepSpeed.pr7489/deepspeed/ops/csrc/compile/z3.h:6,
                 from /home/ray/default/DeepSpeed.pr7489/deepspeed/ops/csrc/compile/z3.cpp:6:
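
In other words, something along these lines should compile; this is a fragment sketch using the names from the error output, with the surrounding declarations assumed rather than taken from the PR:

#include <torch/torch.h>
#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>

// Both allgather arguments must be named lvalues, since the c10d API takes
// them by non-const reference.
void allgatherShardCounts(c10::intrusive_ptr<c10d::ProcessGroup> process_group,
                          std::vector<at::Tensor>& all_counts,
                          at::Tensor local_count_tensor)
{
    std::vector<at::Tensor> inputs{local_count_tensor};
    std::vector<std::vector<at::Tensor>> outputs{all_counts};  // named lvalue, not a temporary
    // outputs[0] shares tensor storage with all_counts, so the gathered values
    // are visible through all_counts after wait().
    process_group->allgather(outputs, inputs)->wait();
}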

@juyterman1000
Author

All changes are implemented.

@tohtana
Contributor

tohtana commented Sep 8, 2025

@juyterman1000
The existing tests in test_compile_zero.py::TestDeepCompile::test don't pass. They raise the new error that this PR adds:

RuntimeError: ZeRO-3 registration error: non-uniform shard sizes detected across ranks. Please check parameter partitioning.

By the way, it seems #7509 broke z1/2 paths for DeepCompile, but you can still check it with Z3.

adalakoti90 and others added 5 commits September 11, 2025 22:52
… uniform shard assertion to check padded sizes instead of actual sizes

- DeepSpeed adds padding to ensure even division across ranks
- Change test from expecting failure to verifying correct handling of padded params
- This fixes existing DeepCompile tests that use parameters not evenly divisible by world_size

Signed-off-by: Abhishek <[email protected]>
…terman1000/DeepSpeed into fix/dc-zero3-allgather-uneven-shards

…not grad_buffer

- Fixes edge cases where grad_buffer is partition-sized
- Keeps uniform padded size invariant and avoids false positives

Signed-off-by: Abhishek <[email protected]>
- Ensures buffer reuse and allocations match padded layout

Signed-off-by: Abhishek <[email protected]>