
Conversation

@rtmadduri (Collaborator) commented Oct 20, 2025

This PR adds chunking logic and enables the shared memory optimization feature for Decode on the CDNA3 architecture.

The major addition of the PR is a rewrite of the shared memory calculation and chunking logic to better suit the CDNA3 architecture, which allows only 64 KiB of shared memory (LDS) per CU.
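To make the constraint concrete, here is a minimal sketch (not the PR's actual code) of the invariant the new chunking logic has to maintain: halve the per-block tile until the dynamic shared memory request fits in CDNA3's 64 KiB LDS budget. The helper names and block-shape parameters below are invented for the example.

#include <cstdint>

constexpr uint32_t kCdna3LdsBytes = 64U * 1024U;  // 64 KiB of LDS per CU on CDNA3

// Dynamic shared memory request per block: K/V staging buffers for each
// pipeline stage plus two small float arrays for the softmax state. This
// mirrors the smem formula discussed later in this thread.
constexpr uint32_t smem_bytes(uint32_t num_stages, uint32_t bdy, uint32_t tile,
                              uint32_t bdz, uint32_t head_dim, uint32_t kv_bytes) {
  return 2U * num_stages * bdy * tile * bdz * head_dim * kv_bytes +
         2U * bdy * bdz * sizeof(float);
}

// Halve the per-bdx tile until the allocation fits in LDS.
constexpr uint32_t fit_tile(uint32_t tile, uint32_t num_stages, uint32_t bdy,
                            uint32_t bdz, uint32_t head_dim, uint32_t kv_bytes) {
  while (tile > 1U &&
         smem_bytes(num_stages, bdy, tile, bdz, head_dim, kv_bytes) > kCdna3LdsBytes)
    tile /= 2U;
  return tile;
}

// Example: half-precision KV (2 bytes), HEAD_DIM = 128, assumed block shape (1, 8, 4).
static_assert(smem_bytes(2U, 1U, fit_tile(8U, 2U, 1U, 4U, 128U, 2U), 4U, 128U, 2U)
                  <= kCdna3LdsBytes,
              "fits in CDNA3 LDS");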

The PR makes corresponding changes to test_batch_decode_kernels_hip.py and examples/test_batch_decode_example.py.

examples/test_batch_decode_example.py

JIT: Using prebuilt ops
PASS

test_batch_decode_kernels_hip.py

================================= 720 passed in 74.37s (0:01:14) ================================= 

Complete HIP PyTest suite

=================================  16388 passed, 18 skipped in 148.54s (0:02:28) ================================= 

C++ test suite

89% tests passed, 3 tests failed out of 27

Total Test time (real) = 259.46 sec

The following tests FAILED:
          3 - FlashInferCorrectnessTest.VariableLengthMergeKernelCorrectnessTestFP16 (Failed)
         20 - MfmaRowSumTest.CorrectResults (Failed)
         27 - test_rowsum_hip (Failed)

Note: See here for more info about the above known failures.

Improvement over the existing implementation:

num_qo_heads = 32
kv_len = 8196
num_kv_heads = 32
head_dim = 128
num_iter = 500

Average time per iteration in seconds:

Current Flashinfer + ROCm Decode MI325: 6.3011
Shared memory Optimization Decode MI325 (This PR): 0.11113595962524414
Upstream Flashinfer v0.2.5 Decode H100: 0.09272098541259766
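
For reference, numbers like these are usually collected by timing a warm loop with GPU events. The sketch below is a generic HIP timing harness, not the benchmark script used here; the kernel stub and all names are placeholders.

#include <hip/hip_runtime.h>
#include <cstdio>

// Placeholder for the decode launch being timed.
__global__ void decode_stub(float* out) { out[threadIdx.x] = 0.f; }

int main() {
  float* out;
  (void)hipMalloc(&out, 64 * sizeof(float));

  constexpr int num_iter = 500;
  hipEvent_t start, stop;
  (void)hipEventCreate(&start);
  (void)hipEventCreate(&stop);

  // Warm-up launch so one-time JIT/module-load cost is not measured.
  decode_stub<<<1, 64>>>(out);
  (void)hipDeviceSynchronize();

  (void)hipEventRecord(start, 0);
  for (int i = 0; i < num_iter; ++i) decode_stub<<<1, 64>>>(out);
  (void)hipEventRecord(stop, 0);
  (void)hipEventSynchronize(stop);

  float ms = 0.f;
  (void)hipEventElapsedTime(&ms, start, stop);
  printf("average time per iteration: %f s\n", (ms / 1000.0f) / num_iter);

  (void)hipFree(out);
  return 0;
}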

@rtmadduri rtmadduri self-assigned this Oct 20, 2025
@rtmadduri rtmadduri changed the title from "[Draft] | Decode feature chunking logic and shared mem optimization" to "Decode feature chunking logic and shared mem optimization" Oct 20, 2025

- constexpr uint32_t vec_size = std::max(16UL / sizeof(DTypeKV), HEAD_DIM / 32UL);
+ // AMD CDNA3 optimized vector size - prefer smaller vec_size for better occupancy
+ constexpr uint32_t vec_size = (HEAD_DIM < 256U) ? 4U : 8U;
Collaborator

So, we are going to use a vec size of 4 for HEAD_DIM < 256 and 8 for higher HEAD_DIM sizes. A vec size of 4 should translate into float4, i.e. 128-bit vector loads, and higher vec sizes should most likely translate into multiple 128-bit loads. Can you clarify the relation between vec_size and occupancy as noted in the comment?

Collaborator Author

The comment is incorrect and misleading. I have corrected it, and it now reads:

  // Optimizing vec_size for CDNA3 architecture.
  // This helps keep the dynamic shared memory allocation within hardware threshold for CDNA3

This solution took a little bit of experimenting to arrive at. The thread-block configuration (bdx, bdy, bdz) is influenced by the vec_size.

Let's take an example. For HEAD_DIM = 128, we have:

constexpr uint32_t vec_size = max(8/2, 128/64) = max(4, 2) = 4
constexpr uint32_t bdx = HEAD_DIM / vec_size = 128 / 4 = 32

This in turn affects how many threads we launch across the y and z dims, and it also influences how much dynamic shared memory we allocate using the smem formula:

const uint32_t smem_size =
        2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * sizeof(DTypeKV) +
        2U * bdy * bdz * sizeof(float);

Making the vec_size a function of HEAD_DIM helped me tune the register and dynamic shared memory allocation to cover more use cases.
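
Plugging assumed values into that formula shows how the allocation lands under the 64 KiB LDS cap. The block-shape constants below (bdy = 1, tile_size_per_bdx = 8, bdz = 4) are illustrative, not the PR's actual numbers.

constexpr uint32_t NUM_STAGES_SMEM = 2U, bdy = 1U, tile_size_per_bdx = 8U, bdz = 4U;
constexpr uint32_t HEAD_DIM = 128U, kDTypeKVBytes = 2U;  // sizeof(half)

constexpr uint32_t smem_size =
    2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * kDTypeKVBytes +
    2U * bdy * bdz * sizeof(float);
// = 2*2*1*8*4*128*2 + 2*1*4*4 = 32768 + 32 = 32800 bytes, under the 64 KiB cap.
static_assert(smem_size == 32800U && smem_size <= 64U * 1024U, "fits in CDNA3 LDS");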

2U * bdy * bdz * sizeof(float);
// This has been hard-coded to 2U. The previous implementation involved a macro redirection
// that always resulted in 2U for the H100 or CDNA3 architectures. Please take a look at
// gpu_iface/dispatch.cuh - DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM macro
Collaborator

I am not really challenging why we should set NUM_STAGES_SMEM to 2, but the heuristic for using 2 on CDNA3 is not clear to me here.

Collaborator Author

On NVIDIA hardware, this setting typically reflects the ability of newer GPUs (H100) to pipeline shared memory operations more efficiently using something like a multi-stage shared memory buffer.

One stage of smem buffering would involve loading data -> computing -> syncthreads -> repeat.

The H100 architecture allows asynchronous copy from global memory to shared memory using cp.async. These async copies allow multiple in-flight stages of shared memory data.

On CDNA3, I try to do something similar (though we have cp_async disabled for now). That is why this is set to 2U.
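
Schematically, the two-stage buffering looks like this (an illustrative kernel, not the PR's code): while the compute step consumes one buffer, the next KV tile is staged into the other, so each tile costs a single barrier.

__global__ void two_stage_smem_sketch(const float* __restrict__ g_kv, int num_tiles,
                                      int tile_elems) {
  extern __shared__ float smem[];  // sized to 2 * tile_elems floats at launch
  float* buf[2] = {smem, smem + tile_elems};

  // Prologue: stage tile 0 into buffer 0.
  for (int i = threadIdx.x; i < tile_elems; i += blockDim.x) buf[0][i] = g_kv[i];
  __syncthreads();

  for (int t = 0; t < num_tiles; ++t) {
    const int cur = t & 1;
    const int nxt = cur ^ 1;
    // Stage tile t+1 into the other buffer while tile t is consumed below.
    // With cp_async disabled these are ordinary loads/stores, but they can
    // still overlap with the compute on buf[cur].
    if (t + 1 < num_tiles) {
      const float* src = g_kv + (size_t)(t + 1) * tile_elems;
      for (int i = threadIdx.x; i < tile_elems; i += blockDim.x) buf[nxt][i] = src[i];
    }
    // ... consume buf[cur] here (e.g. partial attention scores) ...
    __syncthreads();  // one barrier per tile instead of load -> sync -> compute -> sync
  }
}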

On an implementation level, we have

#define DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM(compute_capacity, NUM_STAGES_SMEM, ...) \
  if (compute_capacity.first >= 8) {                                                        \
    constexpr uint32_t NUM_STAGES_SMEM = 2;                                                 \
    __VA_ARGS__                                                                             \
  } else {                                                                                  \
    constexpr uint32_t NUM_STAGES_SMEM = 1;                                                 \
    __VA_ARGS__                                                                             \
  }

where compute_capacity is determined by gpu_iface/utils.cuh.

For CDNA3, compute_capacity.first returns 9.
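
A call site then looks roughly like this (a sketch only; bdy, bdz, tile_size_per_bdx and the other names are assumed to be in scope as in the smem formula above):

DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM(compute_capacity, NUM_STAGES_SMEM, {
  // NUM_STAGES_SMEM is a compile-time constant inside this block.
  const uint32_t smem_size =
      2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * sizeof(DTypeKV) +
      2U * bdy * bdz * sizeof(float);
  // ... raise the kernel's dynamic smem limit if needed and launch with smem_size ...
});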

@diptorupd (Collaborator)

@rtmadduri good work. I left a few minor comments asking for some clarifications, but overall I think it is good to go.

@rtmadduri rtmadduri requested a review from diptorupd October 27, 2025 17:42
@diptorupd diptorupd force-pushed the feature/decode-chunking-logic branch from d9c1dd6 to 7beead2 on October 27, 2025 20:55
@diptorupd diptorupd merged commit 23e247b into ROCm:amd-integration Oct 27, 2025
1 check passed