HIP: add fattn-mma-f16 for RDNA4 #18481
base: master
Conversation
Since this PR is currently a draft with open TODOs: please ping me when you would like a review; otherwise I'll be focusing on other matters.
Hello @JohannesGaessler, do you have a good way to measure the performance of fattn-mma-f16? The current performance tests of FLASH_ATTN_EXT only use the vec and tile fattn kernels. Thank you. Best Regards
Use something like
OK, I've got bad news: fattn-mma is 25% slower than fattn-wmma. It looks like most of the added work is in ldmatrix_trans; I'm not sure how to make it faster on RDNA4.
Hello @JohannesGaessler, this should be done now; it passes test-backend-ops and llama-bench. Using an identity matrix and MMA to do the register transpose is faster than a native transposing load.
Details: DeepSeek-R1-Distill-Qwen-1.5B_f16.gguf
Best Regards
JohannesGaessler left a comment:
In terms of correctness and the way it's implemented I approve. Performance does not need to be optimal for a merge; it can be improved in follow-up PRs. I will probably want to do some refactors down the line, but that is a job for me as the maintainer.
Did you make a conscious decision when you copied the Turing configuration or did you just pick one of them? The context is that I first wrote the kernel for Ampere or newer with 99 kiB of SRAM/SM, an occupancy of 2, head sizes <= 256, and <= 64 Q columns. I later extended the kernel with support for Turing with 64 kiB of SRAM/SM and head sizes 576/512 for DeepSeek. Probably it would make sense to try more configurations that can potentially fit into the 128 kiB of SRAM/CU on RDNA4.
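For illustration, a rough back-of-the-envelope for that kind of configuration search. The shared-memory model below (a Q tile plus double-buffered K and V tiles, all fp16) and the example tile sizes are simplifying assumptions made for this sketch, not the kernel's actual layout:

```cpp
#include <cstdio>

// Very rough occupancy estimate against the 128 KiB of LDS per RDNA4 CU.
// Assumed model: Q tile of ncols x DK plus double-buffered K (nbatch_kv x DK)
// and V (nbatch_kv x DV) tiles, everything in fp16 (2 bytes per element).
constexpr int lds_per_cu = 128*1024;

constexpr int smem_bytes(int ncols, int DK, int DV, int nbatch_kv) {
    return (ncols*DK + 2*nbatch_kv*DK + 2*nbatch_kv*DV) * 2;
}

int main() {
    const int configs[][4] = { // {ncols, DK, DV, nbatch_kv}
        {64, 128, 128, 32}, // LLaMA-style head size 128
        {32, 576, 512, 16}, // DeepSeek MLA head sizes 576/512
    };
    for (const auto & c : configs) {
        const int bytes = smem_bytes(c[0], c[1], c[2], c[3]);
        // An occupancy of 0 would mean the configuration cannot fit at all.
        printf("ncols=%2d DK=%3d DV=%3d nbatch_kv=%2d -> %6d bytes -> occupancy <= %d\n",
               c[0], c[1], c[2], c[3], bytes, lds_per_cu/bytes);
    }
    return 0;
}
```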
The kernel selection logic you added in fattn.cu probably needs to be improved prior to a merge, particularly when it comes to quantized KV cache.
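As a purely hypothetical illustration of the kind of guard this is about, with the function name and structure made up rather than taken from fattn.cu, and assuming the usual ggml-cuda headers:

```cpp
// Hypothetical sketch: only dispatch to the new MMA kernel when the KV cache is f16,
// and keep quantized K/V on the vec/tile kernels until the MMA path supports them.
static bool can_use_fattn_mma_f16_rdna4(const int cc, const ggml_tensor * K, const ggml_tensor * V) {
    const bool kv_f16 = K->type == GGML_TYPE_F16 && V->type == GGML_TYPE_F16;
    return amd_wmma_available(cc) && kv_f16;
}
```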
Quick performance test at the default settings:
LLaMA 3 has a head size of 128, which is the one the code is generally most optimized for. With a GQA ratio of 4 you need a physical batch size of >= 4 to fully utilize the WMMA tiles with a width of 16; at that point the new implementation already seems to be faster than the combination of the tile and vector kernels.
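For concreteness, the arithmetic behind that batch-size threshold; the head counts are those of LLaMA 3 8B, and the assumption that the Q columns of all heads sharing one KV head are packed into the same tile is mine:

```cpp
// LLaMA 3 8B: 32 Q heads, 8 KV heads -> GQA ratio 4.
constexpr int n_head     = 32;
constexpr int n_head_kv  = 8;
constexpr int gqa_ratio  = n_head/n_head_kv;     // Q heads per KV head
constexpr int tile_width = 16;                   // WMMA tile width in Q columns
constexpr int min_batch  = tile_width/gqa_ratio; // physical batch size needed to fill a tile
static_assert(min_batch == 4, "matches the >= 4 physical batch size above");
```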
I forgot: I'm running ./build/bin/llama-bench --model models/opt/${mn}-${q}.gguf -r 1 -fa 1 -n 0 -d 32768 -ub "512,1-256*2" --progress -o sql|sqlite3 llama-bench.sqlite
What about compared to the wmma kernel?
I did the rocWMMA benchmark wrong so I had to redo it; these are the results:
On RDNA4 it seems to be faster than the rocWMMA kernel as it exists on master.
I just copied the config from Turing and haven't started serious tuning yet, as the transposed loading took too much of my time. Hopefully future AMD GPUs will have transposed loads from shared memory; RDNA4's global transposed loads don't help much here. I will try to come up with a better config in the next couple of days.
I think the issue of transposition can be fixed upon loading the V data from VRAM to SRAM, in combination with a permutation of VKQ.
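A minimal sketch of what such a transposing VRAM-to-SRAM copy could look like; the tile sizes, names, and padding below are assumptions for illustration, not the PR's actual loading code (which would also want vectorized, coalesced loads):

```cpp
#include <cuda_fp16.h>

// Read a KQ_per_iter x DV tile of V row-major from global memory and store it
// transposed into shared memory, padded by one element per row of the transposed
// layout to reduce bank conflicts.
template <int DV, int KQ_per_iter>
__device__ void load_V_transposed(const half * __restrict__ V, const int stride_V,
                                  half * __restrict__ sV /* [DV][KQ_per_iter + 1] */) {
    for (int i = threadIdx.x; i < KQ_per_iter*DV; i += blockDim.x) {
        const int row = i / DV; // KV position within the tile
        const int col = i % DV; // element of the head dimension
        sV[col*(KQ_per_iter + 1) + row] = V[row*stride_V + col];
    }
}
```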
This is also what I thought: transpose the data on the way from gmem to smem. But I couldn't get a clean transposing load written in plain CUDA (CuTe makes this much easier), so I spent some time checking whether an identity matrix and MMA can do the transpose, since that also helps another project of mine. I would suggest adding a TODO first. Besides, based on the gfx950 spec, I think the next generation of RDNA will have transposed loads from smem, which will make things much easier with no change needed to the loading logic.
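For reference, a minimal illustration of the identity-matrix trick in plain CUDA wmma terms; this is not the PR's HIP/RDNA4 code, and in the real kernel the tile is already in registers, so the global-memory round trip below exists only to make the sketch self-contained:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Mathematically D = I*B is just B, but because the col_major B operand and the
// accumulator use different register layouts, multiplying by a 16x16 identity on
// the matrix unit hands back the tile transposed relative to its row-major source.
// Requires wmma support (sm_70+); launch with a single warp.
__global__ void transpose_16x16_via_mma(const half * __restrict__ src,  // 16x16, row-major
                                        float      * __restrict__ dst)  // 16x16, row-major, receives src^T
{
    __shared__ half identity[16*16];
    for (int i = threadIdx.x; i < 16*16; i += blockDim.x) {
        identity[i] = (i/16 == i%16) ? __float2half(1.0f) : __float2half(0.0f);
    }
    __syncthreads();

    wmma::fragment<wmma::matrix_a,    16, 16, 16, half, wmma::row_major> frag_I;
    wmma::fragment<wmma::matrix_b,    16, 16, 16, half, wmma::col_major> frag_B;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float>                 frag_D;

    wmma::load_matrix_sync(frag_I, identity, 16);
    wmma::load_matrix_sync(frag_B, src, 16); // col_major view of row-major data == src^T
    wmma::fill_fragment(frag_D, 0.0f);
    wmma::mma_sync(frag_D, frag_I, frag_B, frag_D); // D = I * src^T
    wmma::store_matrix_sync(dst, frag_D, 16, wmma::mem_row_major);
}
```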
Can you give me a brief outline of what work you still want to do for this PR and what you intend to do in follow-up PRs in the near future? I think I know how to handle the transposition for RDNA GPUs as they already exist, but since I'm working on multiple things in parallel I would prefer to avoid concurrent work on the same code.
Comparison with #16827 where the WMMA kernel was tuned for RDNA3:
The RDNA3 tunings seem to have been detrimental for large-batch FA on RDNA4, and this PR seems to be the fastest to date. There are some intermediate batch sizes where this PR still seems to be suboptimal, but I think this is an issue with tuning.
I forgot: it's probably worthwhile to check the logic in
These are the things I still want to do in this PR:
TODO in the following PRs:
In the meantime, please give me your comments on this PR so that I can update the code at the same time.
Hello @JohannesGaessler, I just did some basic tuning for DeepSeek-R1-Distill-Qwen-1.5B; I could not get any performance change for Meta-Llama-3-8B-Instruct. Sorry, I'm still not familiar with the llama.cpp model parameters. I would suggest keeping this PR simple and making more changes in the future.
Best Regards
```diff
 const int nblocks_stream_k = max_blocks;

-const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || tiles_efficiency_percent < 75;
+const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || amd_wmma_available(cc) || tiles_efficiency_percent < 75;
```
Adding amd_wmma_available here is better from an intent standpoint, but this change doesn't do anything in practice, just FYI.
I just followed Johannes' suggestion; it looks like tiles_efficiency_percent < 75 is enough for RDNA4.
Add native fattn-mma-f16 for RDNA4; all tests passed, performance tuning still needed.
All tests have been executed more than 5 times to check for random data errors; no random data errors occur now.
resolves #18243
FLASH_ATTN_EXT.txt