[JAX][Draft] Async issuing D2H memcpy for grouped_gemm group_sizes array #2213
Description
This is a draft PR to share some work in progress and start a discussion.

Recently we used TE/JAX's `grouped_gemm()` interface for a MoE model's inference. Nsys shows a GPU bubble while `grouped_gemm()` copies the `group_sizes` array from device to host. This was a known issue when we designed the `grouped_gemm()` interface. Its performance impact on training and on the inference prefill stage is relatively small, but it cannot be ignored in the inference decode stage. This draft aims to partially address the bubble.

Our target model uses MLP-MoE, i.e., each expert is an MLP layer. After fusing GEMMs, each MLP-MoE layer needs two `grouped_gemm()` calls with the same `group_sizes` array. This PR allows issuing an async D2H copy of the `group_sizes` array before entering `grouped_gemm()`, so that `grouped_gemm()` can reuse the already-downloaded `group_sizes`. We have validated the correctness of this implementation on our target model.

This PR does not solve the problem of `grouped_gemm()` breaking CUDA graph capture, since the async copy mode still has to call `cudaEventSynchronize()`. Furthermore, in our implementation for the target model, the D2H memcpy does not overlap with the other operations that copy and dispatch tokens to experts, because those JAX-native operations are captured and executed in a CUDA graph, while the async D2H copy does not support CUDA graph.

@phu0ngng @mingxu1067 Please let me know your comments and suggestions. Much appreciated!
Type of change
Changes
- Added `GroupedGemmCopySizesPrimitive` for async copying of `group_sizes` from GPU to host.
- Added a new `use_async_d2h_group_sizes` argument to `grouped_gemm()`; the default value is `False`, so the original code path is used unless the caller opts in (see the short example after this list).
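A short illustration of the new argument (the `grouped_gemm()` signature is abbreviated; only `use_async_d2h_group_sizes` is the addition from this PR):

```python
# Default path: grouped_gemm() performs its own blocking D2H copy of group_sizes.
out = grouped_gemm(x, w, group_sizes)

# Opt-in path: reuse the group_sizes already downloaded by the async copy
# issued earlier via GroupedGemmCopySizesPrimitive.
out = grouped_gemm(x, w, group_sizes, use_async_d2h_group_sizes=True)
```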
Checklist: