Conversation

@jeffbolznv (Collaborator)

This change combines the existing rms_norm+mul and rope+view+set_rows fusions so that the whole sequence can be fused together. This pattern comes up in Qwen3, Bailing, and some other models.

It helps by a couple of percent on models where it applies.
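
For illustration, here is a minimal sketch of the kind of op-sequence matching such a fusion requires. The `Op` enum, `Node` struct, and `match_fused_sequence` helper are hypothetical stand-ins, not ggml's actual API; the real backend walks `ggml_tensor` nodes and their `src` pointers, and also checks shapes, strides, and use counts before fusing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical op enum and graph node for illustration only.
enum class Op { RMS_NORM, MUL, ROPE, VIEW, SET_ROWS, OTHER };

struct Node {
    Op           op;
    const Node * src0 = nullptr; // the node's first operand
};

// Return true if the nodes starting at index i form the chain
//   rms_norm -> mul -> rope -> view -> set_rows
// with each op consuming the previous one as its first source.
// A real implementation must also verify that the intermediate
// results have no other consumers, so eliding them is safe.
static bool match_fused_sequence(const std::vector<Node> & graph, size_t i) {
    static const Op pattern[] = { Op::RMS_NORM, Op::MUL, Op::ROPE, Op::VIEW, Op::SET_ROWS };
    constexpr size_t n = sizeof(pattern) / sizeof(pattern[0]);

    if (i + n > graph.size()) {
        return false;
    }
    for (size_t k = 0; k < n; ++k) {
        if (graph[i + k].op != pattern[k]) {
            return false;
        }
        if (k > 0 && graph[i + k].src0 != &graph[i + k - 1]) {
            return false; // chain is broken; not a fusable sequence
        }
    }
    return true;
}
```

When the pattern matches and fusing is safe, the backend can presumably record one combined dispatch for the sequence instead of separate ones, which is where the speedup below comes from.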

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.45 ± 11.28 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        271.92 ± 3.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        273.37 ± 1.46 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.13 ± 1.33 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        274.37 ± 1.21 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        134.55 ± 3.63 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.55 ± 0.32 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.51 ± 0.51 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.49 ± 0.43 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.57 ± 0.33 |

build: 1ae74882f (6913)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       472.71 ± 38.19 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       478.29 ± 10.31 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       468.20 ± 16.25 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        483.64 ± 2.21 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        484.80 ± 2.20 |

build: 1ae74882f (6913)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       269.94 ± 17.36 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        275.08 ± 2.82 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.42 ± 1.60 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        276.48 ± 1.57 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        277.64 ± 0.90 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.12 ± 3.41 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        137.25 ± 3.01 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.57 ± 0.47 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.33 ± 0.56 |
| qwen3 14B Q4_K - Medium        |   8.38 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        138.56 ± 0.53 |

build: b74de9b7b (6915)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 20 --prio 1 -m c:\models\Ring-mini-2.0-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       482.35 ± 47.67 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |       488.51 ± 11.51 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        487.90 ± 6.93 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        494.89 ± 4.11 |
| bailingmoe2 16B.A1B Q4_K - Medium |   9.22 GiB |    16.26 B | Vulkan     |  99 |  1 |           tg128 |        496.09 ± 5.04 |

build: b74de9b7b (6915)

github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 3, 2025.