Decode feature chunking logic and shared mem optimization #25
Conversation
- constexpr uint32_t vec_size = std::max(16UL / sizeof(DTypeKV), HEAD_DIM / 32UL);
+ // AMD CDNA3 optimized vector size - prefer smaller vec_size for better occupancy
+ constexpr uint32_t vec_size = (HEAD_DIM < 256U)
So, we are going to use a vec size of 4 for HEAD_DIM < 256 and 8 for larger HEAD_DIM sizes. A vec size of 4 should translate into a float4, i.e. 128-bit vector loads, and higher vec sizes should most likely translate into multiple 128-bit loads. Can you clarify how vec_size relates to occupancy, as noted in the comment?
The comment was indeed incorrect and misleading. I have corrected it, and it now reads:
// Optimizing vec_size for CDNA3 architecture.
// This helps keep the dynamic shared memory allocation within hardware threshold for CDNA3
This solution took a bit of experimenting to arrive at. The thread block configuration (bdx, bdy, bdz) is influenced by the vec_size.
Let's take an example. For HEAD_DIM = 128, we have:
constexpr uint32_t vec_size = max(8/2, 128/64) = max(4, 2) = 4
constexpr uint32_t bdx = HEAD_DIM / vec_size = 128/4 = 32
This in turn impacts how many threads we launch across the y and z dims, and also influences how much dynamic shared memory we allocate via the smem formula:
const uint32_t smem_size =
2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * sizeof(DTypeKV) +
2U * bdy * bdz * sizeof(float);
Making vec_size a function of HEAD_DIM helped me tune the register and dynamic shared memory allocation to cover more use cases.
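To make the knock-on effect concrete, here is a minimal host-side sketch that evaluates the formula above for HEAD_DIM = 128. The vec_size value is taken from the discussion above (4 for HEAD_DIM < 256), and the values of bdy, bdz, tile_size_per_bdx and the fp16 KV element size are assumptions for illustration, not the PR's actual tuning:

#include <cstdint>
#include <cstdio>

int main() {
  constexpr uint32_t HEAD_DIM = 128;
  constexpr uint32_t vec_size = 4;               // per the discussion: 4 for HEAD_DIM < 256
  constexpr uint32_t bdx = HEAD_DIM / vec_size;  // 128 / 4 = 32
  // Assumed values, for illustration only -- not the PR's actual tuning:
  constexpr uint32_t NUM_STAGES_SMEM = 2, bdy = 4, bdz = 4, tile_size_per_bdx = 1;
  constexpr uint32_t kv_elem_bytes = 2;          // assuming DTypeKV = fp16
  constexpr uint32_t smem_size =
      2U * NUM_STAGES_SMEM * bdy * tile_size_per_bdx * bdz * HEAD_DIM * kv_elem_bytes +
      2U * bdy * bdz * sizeof(float);
  printf("bdx = %u, smem_size = %u bytes (CDNA3 limit: 65536 per CU)\n", bdx, smem_size);
  return 0;
}

With these assumed values the allocation comes to 16512 bytes, comfortably under the 64 KiB CDNA3 limit; a larger bdy/bdz or vec_size-driven bdx would push it up.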
2U * bdy * bdz * sizeof(float);
// This has been hard-coded to 2U. The previous implementation involved a macro redirection that
// always resulted in 2U for the H100 or CDNA3 architecture. Please take a look at
// gpu_iface/dispatch.cuh - DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM macro
I'm not really challenging why we should set NUM_STAGES_SMEM to 2, but the heuristic for using 2 on CDNA3 is not clear to me here.
On Nvidia hardware, this typically points to the ability of newer GPUs (H100) to pipeline shared memory operations more efficiently, i.e. a multi-stage shared memory buffer.
One stage of smem buffering would involve load data -> compute -> syncthreads -> repeat.
The H100 architecture allows asynchronous copies from global memory to shared memory using cp.async. These async copies allow multiple in-flight stages of shared memory data.
On CDNA3, I try to do something similar (though we have cp_async disabled for now). That is why this is set to 2U.
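For illustration, here is a minimal two-stage shared memory pipeline as a hypothetical HIP kernel (not the PR's decode kernel): with NUM_STAGES = 2 the load of tile t+1 is issued before the compute on tile t, whereas a single-stage buffer has to serialize load -> compute -> syncthreads on every tile:

#include <cstdint>
#include <hip/hip_runtime.h>

// Hypothetical double-buffered copy/compute kernel; launch with blockDim.x == TILE.
template <uint32_t NUM_STAGES, uint32_t TILE>
__global__ void two_stage_pipeline(const float* __restrict__ in, float* __restrict__ out,
                                   uint32_t num_tiles) {
  __shared__ float buf[NUM_STAGES][TILE];
  uint32_t stage = 0;
  buf[stage][threadIdx.x] = in[threadIdx.x];  // prefetch tile 0
  __syncthreads();
  for (uint32_t t = 0; t < num_tiles; ++t) {
    uint32_t next = (stage + 1) % NUM_STAGES;
    // Issue the load of the next tile before computing on the current one.
    if (t + 1 < num_tiles) buf[next][threadIdx.x] = in[(t + 1) * TILE + threadIdx.x];
    out[t * TILE + threadIdx.x] = buf[stage][threadIdx.x] * 2.0f;  // stand-in "compute"
    __syncthreads();  // tile t fully consumed; its buffer is free for tile t+2
    stage = next;
  }
}

With true async copies (cp.async on H100, or when cp_async is re-enabled here) the "issue load" step genuinely overlaps the compute; with synchronous loads the structure still bounds smem to NUM_STAGES buffers, which is what the smem formula accounts for.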
At the implementation level, we have:
#define DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM(compute_capacity, NUM_STAGES_SMEM, ...) \
  if (compute_capacity.first >= 8) {                                                       \
    constexpr uint32_t NUM_STAGES_SMEM = 2;                                                 \
    __VA_ARGS__                                                                             \
  } else {                                                                                  \
    constexpr uint32_t NUM_STAGES_SMEM = 1;                                                 \
    __VA_ARGS__                                                                             \
  }
where compute_capacity is determined by gpu_iface/utils.cuh. For CDNA3, compute_capacity.first returns 9, so the >= 8 branch is taken and NUM_STAGES_SMEM resolves to 2.
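Assuming the macro definition above, a hypothetical call site would look like this (the main() wrapper, the .second value, and the printf body are illustrative, not from the PR):

#include <cstdint>
#include <cstdio>
#include <utility>
// DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM as defined above

int main() {
  std::pair<int, int> compute_capacity{9, 4};  // CDNA3: .first == 9
  DISPATCH_COMPUTE_CAP_DECODE_NUM_STAGES_SMEM(compute_capacity, NUM_STAGES_SMEM,
      printf("NUM_STAGES_SMEM = %u\n", NUM_STAGES_SMEM);)  // prints 2 on this path
  return 0;
}

The macro pattern lets NUM_STAGES_SMEM stay a compile-time constant inside each branch, so it can parameterize kernel templates and static shared memory sizing.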
@rtmadduri good work. I left a few minor comments asking for some clarifications, but overall I think it is good to go.
rebase: d9c1dd6 to 7beead2
This PR adds chunking logic and enables the shared memory optimization feature for Decode for the CDNA3 architecture.
The major addition of the PR is rewriting the shared memory calculation and chunking logic to better suit the CDNA3 architecture, which only allows 64 KiB of shared memory per CU.
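For reference, the per-block shared memory limit can be confirmed at runtime via the HIP runtime API; this snippet is illustrative and not part of the PR:

#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
  int smem_per_block = 0;
  hipDeviceGetAttribute(&smem_per_block, hipDeviceAttributeMaxSharedMemoryPerBlock, 0);
  printf("max shared memory per block: %d bytes\n", smem_per_block);  // 65536 on CDNA3
  return 0;
}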
The PR makes corresponding changes to test_batch_decode_kernels_hip.py and examples/test_batch_decode_example.py.
Tested with:
- examples/test_batch_decode_example.py
- test_batch_decode_kernels_hip.py
- Complete HIP PyTest suite
- C++ test suite
Note: See here for more info about the above known failures
Improvement over the existing implementation: