Conversation

@jeffbolznv (Collaborator)

This change combines the existing rms_norm+mul and rope+view+set_rows fusions so that the whole sequence can be fused together. This pattern comes up in Qwen3, Bailing, and some other models.

It helps by a couple of percent on models where it applies.
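
For illustration, here is a minimal sketch of the kind of op-sequence matching such a fusion requires. The `Op` enum, `Node` struct, and `match_fused_sequence` helper are hypothetical stand-ins, not ggml's actual API; the real backend walks `ggml_tensor` nodes and their `src` pointers, and also checks shapes, strides, and use counts before fusing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op enum and graph node for illustration only.
enum class Op { RMS_NORM, MUL, ROPE, VIEW, SET_ROWS, OTHER };

struct Node {
    Op           op;
    const Node * src0 = nullptr; // the node's first operand
};

// Return true if the nodes starting at index i form the chain
//   rms_norm -> mul -> rope -> view -> set_rows
// with each op consuming the previous one as its first source.
// A real implementation must also verify that the intermediate
// results have no other consumers, so eliding them is safe.
static bool match_fused_sequence(const std::vector<Node> & graph, size_t i) {
    static const Op pattern[] = { Op::RMS_NORM, Op::MUL, Op::ROPE, Op::VIEW, Op::SET_ROWS };
    constexpr size_t n = sizeof(pattern) / sizeof(pattern[0]);

    if (i + n > graph.size()) {
        return false;
    }
    for (size_t k = 0; k < n; ++k) {
        if (graph[i + k].op != pattern[k]) {
            return false;
        }
        if (k > 0 && graph[i + k].src0 != &graph[i + k - 1]) {
            return false; // chain is broken; not a fusable sequence
        }
    }
    return true;
}
```

When the pattern matches and fusing is safe, the backend can presumably record one combined dispatch for the sequence instead of separate ones, which is where the speedup below comes from.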

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.45 ± 11.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        271.92 ± 3.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        273.37 ± 1.46 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.13 ± 1.33 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.37 ± 1.21 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        134.55 ± 3.63 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.55 ± 0.32 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.51 ± 0.51 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.49 ± 0.43 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.57 ± 0.33 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       472.71 ± 38.19 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       478.29 ± 10.31 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       468.20 ± 16.25 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        483.64 ± 2.21 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        484.80 ± 2.20 |

build: 1ae74882f (6913)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.94 ± 17.36 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        275.08 ± 2.82 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.42 ± 1.60 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.48 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        277.64 ± 0.90 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.12 ± 3.41 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.25 ± 3.01 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.57 ± 0.47 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.33 ± 0.56 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.56 ± 0.53 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       482.35 ± 47.67 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       488.51 ± 11.51 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        487.90 ± 6.93 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        494.89 ± 4.11 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        496.09 ± 5.04 |

build: b74de9b7b (6915)

github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 3, 2025.