
Conversation

@rtmadduri (Collaborator) commented Oct 20, 2025

This PR adds chunking logic and enables the shared memory optimization feature for Decode on the CDNA3 architecture.

The major addition of the PR is a rewrite of the shared memory calculation and chunking logic to better suit the CDNA3 architecture, which allows only 64 KiB of shared memory (LDS) per CU.
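To make the constraint concrete, here is a minimal sketch (not the PR's actual code) of the invariant the new chunking logic has to maintain: halve the per-block tile until the dynamic shared memory request fits in CDNA3's 64 KiB LDS budget. The helper names and block-shape parameters below are invented for the example.

#include <cstdint>

constexpr uint32_t kCdna3LdsBytes = 64U * 1024U;  // 64 KiB of LDS per CU on CDNA3

// Dynamic shared memory request per block: K/V staging buffers for each
// pipeline stage plus two small float arrays for the softmax state. This
// mirrors the smem formula discussed later in this thread.
constexpr uint32_t smem_bytes(uint32_t num_stages, uint32_t bdy, uint32_t tile,
                              uint32_t bdz, uint32_t head_dim, uint32_t kv_bytes) {
  return 2U * num_stages * bdy * tile * bdz * head_dim * kv_bytes +
         2U * bdy * bdz * sizeof(float);
}

// Halve the per-bdx tile until the allocation fits in LDS.
constexpr uint32_t fit_tile(uint32_t tile, uint32_t num_stages, uint32_t bdy,
                            uint32_t bdz, uint32_t head_dim, uint32_t kv_bytes) {
  while (tile > 1U &&
         smem_bytes(num_stages, bdy, tile, bdz, head_dim, kv_bytes) > kCdna3LdsBytes)
    tile /= 2U;
  return tile;
}

// Example: half-precision KV (2 bytes), HEAD_DIM = 128, assumed block shape (1, 8, 4).
static_assert(smem_bytes(2U, 1U, fit_tile(8U, 2U, 1U, 4U, 128U, 2U), 4U, 128U, 2U)
                  <= kCdna3LdsBytes,
              "fits in CDNA3 LDS");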

The PR makes corresponding changes to test_batch_decode_kernels_hip.py and examples/test_batch_decode_example.py.

examples/test_batch_decode_example.py

JIT: Using prebuilt ops
PASS

test_batch_decode_kernels_hip.py

================================= 720 passed in 74.37s (0:01:14) ================================= 

Complete HIP PyTest suite

=================================  16388 passed, 18 skipped in 148.54s (0:02:28) ================================= 

C++ test suite

89% tests passed, 3 tests failed out of 27

Total Test time (real) = 259.46 sec

The following tests FAILED:
          3 - FlashInferCorrectnessTest.VariableLengthMergeKernelCorrectnessTestFP16 (Failed)
         20 - MfmaRowSumTest.CorrectResults (Failed)
         27 - test_rowsum_hip (Failed)

Note: See here for more info about the above known failures.

Improvement over the existing implementation:

num_qo_heads = 32
kv_len = 8196
num_kv_heads = 32
head_dim = 128
num_iter = 500

Average time per iteration in seconds:

Current Flashinfer + ROCm Decode MI325: 6.3011
Shared memory Optimization Decode MI325 (This PR): 0.11113595962524414
Upstream Flashinfer v0.2.5 Decode H100: 0.09272098541259766
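
For reference, numbers like these are usually collected by timing a warm loop with GPU events. The sketch below is a generic HIP timing harness, not the benchmark script used here; the kernel stub and all names are placeholders.

#include <hip/hip_runtime.h>
#include <cstdio>

// Placeholder for the decode launch being timed.
__global__ void decode_stub(float* out) { out[threadIdx.x] = 0.f; }

int main() {
  float* out;
  (void)hipMalloc(&out, 64 * sizeof(float));

  constexpr int num_iter = 500;
  hipEvent_t start, stop;
  (void)hipEventCreate(&start);
  (void)hipEventCreate(&stop);

  // Warm-up launch so one-time JIT/module-load cost is not measured.
  decode_stub<<<1, 64>>>(out);
  (void)hipDeviceSynchronize();

  (void)hipEventRecord(start, 0);
  for (int i = 0; i < num_iter; ++i) decode_stub<<<1, 64>>>(out);
  (void)hipEventRecord(stop, 0);
  (void)hipEventSynchronize(stop);

  float ms = 0.f;
  (void)hipEventElapsedTime(&ms, start, stop);
  printf("average time per iteration: %f s\n", (ms / 1000.0f) / num_iter);

  (void)hipFree(out);
  return 0;
}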

@rtmadduri rtmadduri self-assigned this Oct 20, 2025
@rtmadduri rtmadduri changed the title from "[Draft] | Decode feature chunking logic and shared mem optimization" to "Decode feature chunking logic and shared mem optimization" Oct 20, 2025

- constexpr uint32_t vec_size = std::max(16UL / sizeof(DTypeKV), HEAD_DIM / 32UL);
+ // AMD CDNA3 optimized vector size - prefer smaller vec_size for better occupancy
+ constexpr uint32_t vec_size = (HEAD_DIM < 256U) ? 4U : 8U;
Collaborator

So, we are going to use a vec size of 4 for HEAD_DIM < 256 and 8 for higher HEAD_DIM sizes. A vec size of 4 should translate into float4, i.e. 128-bit vector loads, and higher vec sizes should most likely translate into multiple 128-bit loads. Can you clarify the relation between vec_size and occupancy as noted in the comment?

Collaborator Author

The comment is incorrect and misleading. I have corrected it, and it now reads:

  // Optimizing vec_size for CDNA3 architecture.
  // This helps keep the dynamic shared memory allocation within hardware threshold for CDNA3

This solution took a little bit of experimenting to arrive at. The thread-block configuration (bdx, bdy, bdz) is influenced by the vec_size.

Let's take an example. For HEAD_DIM = 128, we have:

constexpr uint32_t vec_size = max(8/2, 128/64) = max(4, 2) = 4
constexpr uint32_t bdx = HEAD_DIM / vec_size = 128 / 4 = 32

This in turn affects how many threads we launch across the y and z dims, and it also influences how much dynamic shared memory we allocate using the smem formula:

const uint32_t smem_size =
        2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * sizeof(DTypeKV) +
        2U * bdy * bdz * sizeof(float);

Making the vec_size a function of HEAD_DIM helped me tune the register and dynamic shared memory allocation to cover more use cases.
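
Plugging assumed values into that formula shows how the allocation lands under the 64 KiB LDS cap. The block-shape constants below (bdy = 1, tile_size_per_bdx = 8, bdz = 4) are illustrative, not the PR's actual numbers.

constexpr uint32_t NUM_STAGES_SMEM = 2U, bdy = 1U, tile_size_per_bdx = 8U, bdz = 4U;
constexpr uint32_t HEAD_DIM = 128U, kDTypeKVBytes = 2U;  // sizeof(half)

constexpr uint32_t smem_size =
    2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * kDTypeKVBytes +
    2U * bdy * bdz * sizeof(float);
// = 2*2*1*8*4*128*2 + 2*1*4*4 = 32768 + 32 = 32800 bytes, under the 64 KiB cap.
static_assert(smem_size == 32800U && smem_size <= 64U * 1024U, "fits in CDNA3 LDS");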

2U * bdy * bdz * sizeof(float);
// This has been hard-coded to 2U. The previous implementation involved a macro redirection
// that always resulted in 2U for the H100 or CDNA3 architectures. Please take a look at
// gpu_iface/dispatch.cuh - DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM macro
Collaborator

I am not really challenging why we should set NUM_STAGES_SMEM to 2, but the heuristic for using 2 on CDNA3 is not clear to me here.

Collaborator Author

On NVIDIA hardware, this setting typically reflects the ability of newer GPUs (H100) to pipeline shared memory operations more efficiently using something like a multi-stage shared memory buffer.

One stage of smem buffering would involve loading data -> computing -> syncthreads -> repeat.

The H100 architecture allows asynchronous copy from global memory to shared memory using cp.async. These async copies allow multiple in-flight stages of shared memory data.

On CDNA3, I try to do something similar (though we have cp_async disabled for now). That is why this is set to 2U.
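
Schematically, the two-stage buffering looks like this (an illustrative kernel, not the PR's code): while the compute step consumes one buffer, the next KV tile is staged into the other, so each tile costs a single barrier.

__global__ void two_stage_smem_sketch(const float* __restrict__ g_kv, int num_tiles,
                                      int tile_elems) {
  extern __shared__ float smem[];  // sized to 2 * tile_elems floats at launch
  float* buf[2] = {smem, smem + tile_elems};

  // Prologue: stage tile 0 into buffer 0.
  for (int i = threadIdx.x; i < tile_elems; i += blockDim.x) buf[0][i] = g_kv[i];
  __syncthreads();

  for (int t = 0; t < num_tiles; ++t) {
    const int cur = t & 1;
    const int nxt = cur ^ 1;
    // Stage tile t+1 into the other buffer while tile t is consumed below.
    // With cp_async disabled these are ordinary loads/stores, but they can
    // still overlap with the compute on buf[cur].
    if (t + 1 < num_tiles) {
      const float* src = g_kv + (size_t)(t + 1) * tile_elems;
      for (int i = threadIdx.x; i < tile_elems; i += blockDim.x) buf[nxt][i] = src[i];
    }
    // ... consume buf[cur] here (e.g. partial attention scores) ...
    __syncthreads();  // one barrier per tile instead of load -> sync -> compute -> sync
  }
}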

On an implementation level, we have

#define DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM(compute_capacity, NUM_STAGES_SMEM, ...) \
  if (compute_capacity.first >= 8) {                                                        \
    constexpr uint32_t NUM_STAGES_SMEM = 2;                                                 \
    __VA_ARGS__                                                                             \
  } else {                                                                                  \
    constexpr uint32_t NUM_STAGES_SMEM = 1;                                                 \
    __VA_ARGS__                                                                             \
  }

where compute_capacity is determined by gpu_iface/utils.cuh.

For CDNA3, compute_capacity.first returns 9.
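
A call site then looks roughly like this (a sketch only; bdy, bdz, tile_size_per_bdx and the other names are assumed to be in scope as in the smem formula above):

DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM(compute_capacity, NUM_STAGES_SMEM, {
  // NUM_STAGES_SMEM is a compile-time constant inside this block.
  const uint32_t smem_size =
      2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * sizeof(DTypeKV) +
      2U * bdy * bdz * sizeof(float);
  // ... raise the kernel's dynamic smem limit if needed and launch with smem_size ...
});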

@diptorupd (Collaborator)

@rtmadduri good work. I left a few minor comments asking for some clarifications, but overall I think it is good to go.

@rtmadduri rtmadduri requested a review from diptorupd October 27, 2025 17:42
@diptorupd diptorupd force-pushed the feature/decode-chunking-logic branch from d9c1dd6 to 7beead2 on October 27, 2025 20:55
@diptorupd diptorupd merged commit 23e247b into ROCm:amd-integration Oct 27, 2025
1 check passed