
Conversation

@zhang-hui-yulo
Contributor

@zhang-hui-yulo commented Dec 30, 2025

Add native fattn-mma-f16 for RDNA4. All tests pass; performance tuning is still needed.

All tests have been executed more than 5 times to check for random data errors; no random data errors occur anymore.

resolves #18243

  • Pass FLASH_ATTN_EXT on RDNA4.
  • Disable fattn-mma-f16 for RDNA3; add RDNA3 support in the future.
  • Perf tuning for RDNA4.

FLASH_ATTN_EXT .txt

@zhang-hui-yulo changed the title from "HIP: add fattn-mma for RDNA4" to "HIP: add fattn-mma-f16 for RDNA4" on Dec 30, 2025
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 30, 2025
@JohannesGaessler
Collaborator

Since this PR is currently a draft with open TODOs: please ping me whenever you would like a review; otherwise I'll be focusing on other matters.

@zhang-hui-yulo
Contributor Author

Hello @JohannesGaessler ,

Do you have a good way to measure the performance of fattn-mma-f16? The current FLASH_ATTN_EXT performance tests only use the vec and tile fattn kernels. Thank you.

Best Regards
Hui

@JohannesGaessler
Collaborator

Use something like llama-bench -n 0 -d 32768 -p "512,1-256*2". My recommendation would be to always use a real model with llama-bench unless this is not viable for some reason.

@zhang-hui-yulo
Contributor Author

OK, I got the bad news: fattn-mma is 25% slower than fattn-wmma. It looks like most of the increased workload is in ldmatrix_trans; I'm not sure how to make it faster on RDNA4.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Jan 7, 2026

Hello @JohannesGaessler

It should be done now; it passes test-backend-ops and llama-bench. Using an identity matrix and MMA to do the register transpose is faster than native loading.
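To spell the trick out (a hedged reading of the comment above, not taken from the PR's code; the exact RDNA4 WMMA operand layouts are an assumption here): if the MMA unit writes its accumulator in a per-lane register layout that is the transpose of the layout it reads its B operand in, then multiplying by the identity moves the tile into the other layout as a side effect of the matrix product:

```latex
D = I_{16}\,B = B \quad \text{(same values, now in the accumulator's register layout, i.e. transposed relative to B's input layout)}
```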

cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1201 -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF -DGGML_HIP_ROCWMMA_FATTN=ON
CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-bench --model ../../models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B_f16.gguf -r 1 -fa 1 -n 0 -d 32768 -p "512,1-512*2" --progress -o sql | sqlite3 ../../models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B_f16.sqlite
DeepSeek-R1-Distill-Qwen-1.5B_f16.gguf
| Model | Test | t/s master | t/s fattn_for_rdna4 | Speedup |
| --- | --- | --- | --- | --- |
| qwen2 1.5B F16 | pp1@d32768 | 79.31 | 75.60 | 0.95 |
| qwen2 1.5B F16 | pp2@d32768 | 108.58 | 108.84 | 1.00 |
| qwen2 1.5B F16 | pp4@d32768 | 115.21 | 128.15 | 1.11 |
| qwen2 1.5B F16 | pp8@d32768 | 194.81 | 218.75 | 1.12 |
| qwen2 1.5B F16 | pp16@d32768 | 351.71 | 373.34 | 1.06 |
| qwen2 1.5B F16 | pp32@d32768 | 586.19 | 651.52 | 1.11 |
| qwen2 1.5B F16 | pp64@d32768 | 844.73 | 758.69 | 0.90 |
| qwen2 1.5B F16 | pp128@d32768 | 950.54 | 827.33 | 0.87 |
| qwen2 1.5B F16 | pp256@d32768 | 1017.40 | 1191.77 | 1.17 |
| qwen2 1.5B F16 | pp512@d32768 | 1047.45 | 1231.11 | 1.18 |

FLASH_ATTN_EXT_latest.txt

Best Regards
Hui

@zhang-hui-yulo marked this pull request as ready for review January 7, 2026 07:44
Collaborator

@JohannesGaessler left a comment

In terms of correctness and the way it's implemented I approve. Performance does not need to be optimal for a merge; this can be improved in follow-up PRs. I will probably want to do some refactors down the line, but that is a job for me as a maintainer.

Did you make a conscious decision when you copied the Turing configuration, or did you just pick one of them? The context is that I first wrote the kernel for Ampere or newer, with 99 kiB of SRAM/SM, an occupancy of 2, head sizes <= 256, and <= 64 Q columns. I later extended the kernel to support Turing, with 64 kiB of SRAM/SM and head sizes 576/512 for DeepSeek. It would probably make sense to try more configurations that can potentially fit into the 128 kiB of SRAM/CU on RDNA4.
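For illustration, the numbers above written out as a rough per-block shared-memory budget (a sketch with invented names, not the PR's configuration code; the Turing occupancy is an assumption):

```cpp
// Rough SRAM/LDS budget per block for the configurations mentioned above.
struct fattn_smem_budget {
    int sram_bytes; // SRAM (shared memory / LDS) per SM or CU in bytes
    int occupancy;  // blocks expected to be resident per SM/CU
    constexpr int per_block() const { return sram_bytes / occupancy; }
};

constexpr fattn_smem_budget ampere { 99  * 1024, 2 }; // ~49.5 kiB per block, head sizes <= 256, <= 64 Q columns
constexpr fattn_smem_budget turing { 64  * 1024, 1 }; // occupancy assumed; extended to head sizes 576/512
constexpr fattn_smem_budget rdna4  { 128 * 1024, 2 }; // 64 kiB per block at occupancy 2, 128 kiB at occupancy 1

static_assert(rdna4.per_block() > ampere.per_block(),
              "RDNA4 has room for larger tiles than the copied configuration");
```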

The kernel selection logic you added in fattn.cu probably needs to be improved prior to a merge, particularly when it comes to quantized KV cache.

@JohannesGaessler
Collaborator

Quick performance test at the default settings:

| GPU | Model | Microbatch size | Test | t/s b7653 | t/s 103141f | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RX 9060 XT | llama 8B Q4_0 | 1 | pp512@d32768 | 29.45 | 29.03 | 0.99 |
| RX 9060 XT | llama 8B Q4_0 | 2 | pp512@d32768 | 57.01 | 48.12 | 0.84 |
| RX 9060 XT | llama 8B Q4_0 | 4 | pp512@d32768 | 72.29 | 82.72 | 1.14 |
| RX 9060 XT | llama 8B Q4_0 | 8 | pp512@d32768 | 94.93 | 113.99 | 1.20 |
| RX 9060 XT | llama 8B Q4_0 | 16 | pp512@d32768 | 148.10 | 191.49 | 1.29 |
| RX 9060 XT | llama 8B Q4_0 | 32 | pp512@d32768 | 165.09 | 207.68 | 1.26 |
| RX 9060 XT | llama 8B Q4_0 | 64 | pp512@d32768 | 250.37 | 282.48 | 1.13 |
| RX 9060 XT | llama 8B Q4_0 | 128 | pp512@d32768 | 282.13 | 330.75 | 1.17 |
| RX 9060 XT | llama 8B Q4_0 | 256 | pp512@d32768 | 306.39 | 353.56 | 1.15 |
| RX 9060 XT | llama 8B Q4_0 | 512 | pp512@d32768 | 312.65 | 357.47 | 1.14 |

LLaMA 3 has a head size of 128, which is the one the code is generally most optimized for. With a GQA ratio of 4 you need a physical batch size of >= 4 to fully utilize the WMMA tiles with a width of 16; at that point the new implementation already seems to be faster than the combination of the tile and vector kernels.
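For concreteness, the arithmetic behind that threshold, using LLaMA 3 8B's head counts (a sketch of the reasoning, not code from the PR):

```cpp
// LLaMA 3 8B: 32 query heads share 8 KV heads, i.e. a GQA ratio of 4. Q rows from the
// query heads that share one KV head can be packed into the same tile, so a WMMA tile
// of width 16 is only fully occupied once gqa_ratio * n_batch >= 16.
constexpr int n_head     = 32;
constexpr int n_head_kv  = 8;
constexpr int gqa_ratio  = n_head / n_head_kv;      // 4
constexpr int tile_width = 16;                      // WMMA tile width
constexpr int min_batch  = tile_width / gqa_ratio;  // 4 tokens -- matches the crossover in the table
```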

@JohannesGaessler
Collaborator

I forgot: I'm running llama-bench like this:

./build/bin/llama-bench --model models/opt/${mn}-${q}.gguf -r 1 -fa 1 -n 0 -d 32768 -ub "512,1-256*2" --progress -o sql|sqlite3 llama-bench.sqlite

@IMbackK
Collaborator

IMbackK commented Jan 7, 2026

> LLaMA 3 has a head size of 128, which is the one the code is generally most optimized for. With a GQA ratio of 4 you need a physical batch size of >= 4 to fully utilize the WMMA tiles with a width of 16; at that point the new implementation already seems to be faster than the combination of the tile and vector kernels.

What about compared to the wmma kernel?

@JohannesGaessler
Collaborator

I did the rocWMMA benchmark wrong so I had to re-do it, these are the results:

| GPU | Model | Microbatch size | Test | t/s b7653 | t/s 103141f | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RX 9060 XT | llama 8B Q4_0 | 1 | pp512@d32768 | 29.11 | 29.04 | 1.00 |
| RX 9060 XT | llama 8B Q4_0 | 2 | pp512@d32768 | 48.60 | 48.19 | 0.99 |
| RX 9060 XT | llama 8B Q4_0 | 4 | pp512@d32768 | 38.85 | 82.86 | 2.13 |
| RX 9060 XT | llama 8B Q4_0 | 8 | pp512@d32768 | 65.95 | 114.03 | 1.73 |
| RX 9060 XT | llama 8B Q4_0 | 16 | pp512@d32768 | 138.99 | 192.13 | 1.38 |
| RX 9060 XT | llama 8B Q4_0 | 32 | pp512@d32768 | 181.99 | 208.31 | 1.14 |
| RX 9060 XT | llama 8B Q4_0 | 64 | pp512@d32768 | 210.27 | 282.55 | 1.34 |
| RX 9060 XT | llama 8B Q4_0 | 128 | pp512@d32768 | 259.25 | 330.98 | 1.28 |
| RX 9060 XT | llama 8B Q4_0 | 256 | pp512@d32768 | 266.39 | 354.36 | 1.33 |
| RX 9060 XT | llama 8B Q4_0 | 512 | pp512@d32768 | 271.81 | 357.62 | 1.32 |

On RDNA4 it seems to be faster than the rocWMMA kernel as it exists on master.

@zhang-hui-yulo
Contributor Author

> In terms of correctness and the way it's implemented I approve. Performance does not need to be optimal for a merge; this can be improved in follow-up PRs. I will probably want to do some refactors down the line, but that is a job for me as a maintainer.
>
> Did you make a conscious decision when you copied the Turing configuration, or did you just pick one of them? The context is that I first wrote the kernel for Ampere or newer, with 99 kiB of SRAM/SM, an occupancy of 2, head sizes <= 256, and <= 64 Q columns. I later extended the kernel to support Turing, with 64 kiB of SRAM/SM and head sizes 576/512 for DeepSeek. It would probably make sense to try more configurations that can potentially fit into the 128 kiB of SRAM/CU on RDNA4.
>
> The kernel selection logic you added in fattn.cu probably needs to be improved prior to a merge, particularly when it comes to quantized KV cache.

I just copied the config from Turing and haven't started serious tuning yet, as transpose loading took too much of my time. Hopefully future AMD GPUs will have transposed loads from shared memory; RDNA4's transposed global loads don't help much.

I will try to find a better config in the next couple of days.

@JohannesGaessler
Collaborator

I think the issue of transposition can be fixed upon loading the V data from VRAM to SRAM in combination with a permutation of VKQ.
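A minimal sketch of what that could look like (not the PR's loading code; names and tile parameters are invented for illustration): each thread reads V coalesced from global memory and writes it to the transposed position in shared memory, with the leading dimension padded by one element to avoid bank conflicts. The permutation of VKQ mentioned above would then account for the changed layout downstream.

```cpp
#include <cuda_fp16.h>

// V is stored row-major as [kq_stride][head_dim] in global memory; V_smem receives it
// transposed as [head_dim][kq_stride + 1], the +1 padding avoiding shared-memory bank
// conflicts on the strided writes.
template <int head_dim, int kq_stride>
__device__ void load_V_transposed(const half * __restrict__ V, half * __restrict__ V_smem) {
    for (int i = threadIdx.x; i < kq_stride * head_dim; i += blockDim.x) {
        const int row = i / head_dim; // position along the KV sequence
        const int col = i % head_dim; // position along the head dimension
        V_smem[col * (kq_stride + 1) + row] = V[row * head_dim + col];
    }
}
```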

@zhang-hui-yulo
Contributor Author

> I think the issue of transposition can be fixed upon loading the V data from VRAM to SRAM in combination with a permutation of VKQ.

This is also what I thought: transpose the data on the way from gmem to smem. But I really couldn't clean up the transpose loading in native CUDA (CuTe makes it much easier), so I spent some time trying whether an identity matrix and MMA can do the transpose, as it also helps another project of mine. I would suggest adding a TODO first.

Besides, based on the gfx950 spec, I think the next generation of RDNA should have transposed loads from smem; that would make things much easier and remove the need to change the loading logic.

@JohannesGaessler
Collaborator

Can you give me a brief outline of what work you still want to do for this PR and what you intend to do in follow-up PRs in the near future? I think I know how to handle the transposition for RDNA GPUs as they already exist but since I'm working on multiple things in parallel I would prefer to avoid concurrent work on the same code.

@JohannesGaessler
Collaborator

Comparison with #16827 where the WMMA kernel was tuned for RDNA3:

| GPU | Model | Microbatch size | Test | t/s b7653 | t/s 103141f | t/s #16827 |
| --- | --- | --- | --- | --- | --- | --- |
| RX 9060 XT | llama 8B Q4_0 | 1 | pp512@d32768 | 29.11 | 29.04 | 25.54 |
| RX 9060 XT | llama 8B Q4_0 | 2 | pp512@d32768 | 48.60 | 48.19 | 37.81 |
| RX 9060 XT | llama 8B Q4_0 | 4 | pp512@d32768 | 38.85 | 82.86 | 71.59 |
| RX 9060 XT | llama 8B Q4_0 | 8 | pp512@d32768 | 65.95 | 114.03 | 112.20 |
| RX 9060 XT | llama 8B Q4_0 | 16 | pp512@d32768 | 138.99 | 192.13 | 246.41 |
| RX 9060 XT | llama 8B Q4_0 | 32 | pp512@d32768 | 181.99 | 208.31 | 293.08 |
| RX 9060 XT | llama 8B Q4_0 | 64 | pp512@d32768 | 210.27 | 282.55 | 125.35 |
| RX 9060 XT | llama 8B Q4_0 | 128 | pp512@d32768 | 259.25 | 330.98 | 184.33 |
| RX 9060 XT | llama 8B Q4_0 | 256 | pp512@d32768 | 266.39 | 354.36 | 189.31 |
| RX 9060 XT | llama 8B Q4_0 | 512 | pp512@d32768 | 271.81 | 357.62 | 188.91 |

The RDNA3 tunings seem to have been detrimental for large-batch FA on RDNA4, and this PR seems to be the fastest to date. There are some intermediate batch sizes where this PR still seems to be suboptimal, but I think that is a matter of tuning.

@JohannesGaessler
Collaborator

I forgot: it's probably worthwhile to check the logic in fattn-common.cuh w.r.t. whether or not stream-k should be used. As of right now the logic should be treating AMD GPUs as "Ada Lovelace or newer".

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Jan 8, 2026

I think these are the things I still want to do in this PR:

  1. Do some basic tuning of the RDNA4 config in fattn and fattn-mma based on llama 8B.
  2. Try to enable stream-k in fattn-common; hopefully there is no coding bug, as I haven't tested stream-k.

TODOs for follow-up PRs:

  1. gmem-to-smem transposition. I would appreciate it if you could help finish this, as I haven't spent much effort on loading; a gmem-to-smem transpose might be faster than the MMA transpose because it can be hidden by the GEMM main loop.
  2. More perf tuning.
  3. RDNA3 support, as I might find a good way to handle the RDNA3 fused GEMM without smem; of course this also needs the gmem-to-smem transposition.

In the meantime, please give me your comments on this PR so I can update the code at the same time.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Jan 8, 2026

Hello @JohannesGaessler

I just did some basic tuning for DeepSeek-R1-Distill-Qwen-1.5B; I couldn't get any perf change for Meta-Llama-3-8B-Instruct. Sorry, I'm still not familiar with the llama.cpp model parameters.

I would suggest keeping this PR simple and making more changes in the future.

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-bench --model ../../models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B_f16.gguf -r 4 -fa 1 -n 0 -d 32768 -p "512,1-512*2" --progress -o sql | sqlite3 ../../models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B_f16.sqlite
| Model | Test | t/s master | t/s fattn_for_rdna4 | Speedup |
| --- | --- | --- | --- | --- |
| qwen2 1.5B F16 | pp1@d32768 | 93.47 | 43.18 | 0.46 |
| qwen2 1.5B F16 | pp2@d32768 | 133.78 | 76.90 | 0.57 |
| qwen2 1.5B F16 | pp4@d32768 | 132.69 | 140.32 | 1.06 |
| qwen2 1.5B F16 | pp8@d32768 | 197.53 | 220.01 | 1.11 |
| qwen2 1.5B F16 | pp16@d32768 | 352.13 | 374.62 | 1.06 |
| qwen2 1.5B F16 | pp32@d32768 | 588.38 | 657.14 | 1.12 |
| qwen2 1.5B F16 | pp64@d32768 | 865.42 | 766.73 | 0.89 |
| qwen2 1.5B F16 | pp128@d32768 | 978.27 | 864.89 | 0.88 |
| qwen2 1.5B F16 | pp256@d32768 | 1195.13 | 1352.40 | 1.13 |
| qwen2 1.5B F16 | pp512@d32768 | 1297.93 | 1394.73 | 1.07 |

Best Regards
Hui

fattn-common.cuh:

      const int nblocks_stream_k = max_blocks;

-     const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || tiles_efficiency_percent < 75;
+     const bool use_stream_k = cc >= GGML_CUDA_CC_ADA_LOVELACE || amd_wmma_available(cc) || tiles_efficiency_percent < 75;
Collaborator

Adding amd_wmma_available here is better from an intent standpoint, but this change doesn't do anything in practice; just FYI.

Contributor Author

I'm just following Johannes' suggestion; it looks like tiles_efficiency_percent < 75 is enough for RDNA4.
