
Conversation

ggerganov
Copy link
Member

Currently, the FA kernel implementations in some backends have padding requirements for the K and V tensors:

  • Metal: multiple of 32
  • CUDA: multiple of 256
  • Vulkan: not sure
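
For context, these requirements mean the caller currently has to round n_kv up to the backend's granularity before FA can be used at all. A minimal sketch of that rounding; the helper below mirrors the power-of-two rounding done by the GGML_PAD macro in ggml.h:

```cpp
#include <cstdint>
#include <cstdio>

// Round x up to a multiple of n (n must be a power of two); this mirrors the
// GGML_PAD macro in ggml.h.
static int64_t pad_to(int64_t x, int64_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    const int64_t n_kv = 4097; // arbitrary KV length, e.g. from vision tokens
    printf("Metal: %lld -> %lld\n", (long long) n_kv, (long long) pad_to(n_kv, 32));  // 4097 -> 4128
    printf("CUDA:  %lld -> %lld\n", (long long) n_kv, (long long) pad_to(n_kv, 256)); // 4097 -> 4352
    return 0;
}
```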

This requirement imposes some significant limitations for use cases where the KV size is arbitrary - such as the vision and embedding use cases discussed below.

I would like to address this issue and allow FA to be generally available regardless of the K and V sizes.

For the Metal backend, we already have the infrastructure to support this and can apply it relatively easily. I am thinking between two possible approaches:

  • Building a second set of FA kernels that include bounds checks
  • Passing a small secondary padding buffer in addition to the main K and V buffers, and updating the main loop of the FA kernel to perform one extra iteration over the padding buffer. This should avoid the overhead of extra bounds checks. (Both approaches are sketched below.)
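
To make the trade-off concrete, here is a minimal scalar sketch of the two approaches - all names are hypothetical and this is only a stand-in for the kernel's main loop, not actual Metal code:

```cpp
#include <cstddef>

// Scalar stand-in for the FA kernel's main loop over K - hypothetical names,
// not actual Metal code. In the real kernel the padded positions would be
// neutralized via the mask; here zeros in the padding are enough for a sum.

// Approach 1: a second kernel variant with a per-element bounds check.
float sum_checked(const float * k, size_t n_kv, size_t block) {
    float acc = 0.0f;
    for (size_t i0 = 0; i0 < n_kv; i0 += block) {
        for (size_t i = i0; i < i0 + block; ++i) {
            if (i < n_kv) { // bounds check in the hot loop
                acc += k[i];
            }
        }
    }
    return acc;
}

// Approach 2: the main loop runs check-free over the full blocks of K, then
// one extra iteration runs over a small side buffer that holds the tail
// elements padded out to a full block.
float sum_padded(const float * k, size_t n_kv, const float * pad_block, size_t block) {
    const size_t n_full = (n_kv / block) * block;
    float acc = 0.0f;
    for (size_t i = 0; i < n_full; ++i) {
        acc += k[i]; // no bounds check
    }
    for (size_t i = 0; i < block; ++i) {
        acc += pad_block[i]; // one extra iteration over the padding buffer
    }
    return acc;
}
```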

I am not sure what the situation is with the CUDA backend. @JohannesGaessler Is this possible to achieve?

Also not sure about the state of the Vulkan backend in this regard. @0cc4m, @jeffbolznv Any insights are appreciated.

@jeffbolznv
Collaborator

The Vulkan backend currently assumes some alignment - I'm not sure of the exact amount, and it probably varies by code path. We could add another spec constant that controls bounds checking. I recently refactored the flash attention shader management in a way that should make it easier to add more spec constant variants without having to iterate over or store all possible combinations, so it shouldn't be too much trouble.
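
For readers unfamiliar with spec constants: they behave like compile-time flags, so the bounds check can be compiled out of the aligned variant entirely. A hypothetical C++ analogue (not the actual shader code) of what such a variant pair could look like:

```cpp
#include <cstddef>

// Hypothetical C++ analogue of a shader specialization constant: the bounds
// check is resolved at compile time, so the aligned variant has no runtime
// branch at all.
template <bool bounds_check>
float fa_row_sum(const float * k, size_t n_kv, size_t n_kv_padded) {
    float acc = 0.0f;
    for (size_t i = 0; i < n_kv_padded; ++i) {
        if constexpr (bounds_check) {
            if (i >= n_kv) break; // compiled only into the unaligned variant
        }
        acc += k[i];
    }
    return acc;
}

// The host side picks the variant, much like binding a pipeline specialized
// with a different spec constant value.
float dispatch(const float * k, size_t n_kv, size_t align) {
    const size_t padded = (n_kv + align - 1) / align * align;
    return n_kv == padded ? fa_row_sum<false>(k, n_kv, padded)
                          : fa_row_sum<true>(k, n_kv, padded);
}
```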

@JohannesGaessler
Collaborator

For CUDA I will soon make a PR that refactors and deduplicates the vector kernels. That will fix the biggest constraint on tensor shapes, and it should be possible to reduce the granularity. What is the exact requirement? A reduced granularity of e.g. 32, or completely arbitrary tensor shapes?

@ggerganov
Member Author

ggerganov commented Sep 21, 2025

> What is the exact requirement?

In both of the use cases that I listed, the shapes of the K and V tensors can be arbitrary numbers - usually a function of the number of input vision tokens or the number of embeddings, respectively. So just reducing the granularity won't be enough.

Edit: this is all considering just the n_kv shape of the K, V and mask tensors, i.e. ne11, ne21 and ne30 (in ggml's neXY notation: dimension Y of the op's source tensor X).

@JohannesGaessler
Collaborator

It is definitely possible to make a version that works for arbitrary tensor shapes. What are typical values for ne10?

@jeffbolznv
Collaborator

I've implemented the Vulkan changes for this, will make a PR soon.

@ggerganov
Member Author

> It is definitely possible to make a version that works for arbitrary tensor shapes. What are typical values for ne10?

Use cases range across most head sizes. For example, the recent EmbeddingGemma model uses a head size of 256, while some older BERT models have a head size of 32. So I would say for ne10:

  • a multiple of 32 is a must
  • a multiple of 8 (e.g. 40, 80, 112) would be nice, though I don't know of prominent use cases

@JohannesGaessler
Collaborator

Sorry, I was on my phone and mistyped. I meant to ask what typical values for ne11 are. I'm asking because that is the primary dimension based on which the CUDA backend selects kernels.

@ggerganov
Member Author

For ne11, with embedding models it can typically be any value all the way up to 8192. Not sure what the requirements are for vision models, but I would guess that this might go even further beyond 8192 in some cases.

@JohannesGaessler
Collaborator

I'm very sorry, I also misremembered which source is which tensor. What I specifically want to know is the typical shape of the query tensor: depending on whether dimension 1 of Q is 1, larger than 1 but small, or much larger than 1, I'll prioritize a different kernel for a first version that works but maybe has suboptimal performance. (For pure text models, the equivalent dimension I'm interested in is the physical batch size.)

@ggerganov
Member Author

In these two use cases, dim 1 of Q (i.e. ne01 / n_batch) is effectively equal to the n_kv shape (i.e. ne11, ne21 and ne30), so my previous comment translates directly to the Q dim 1 shape as well. I think the case we should optimize for is large ne01 (i.e. more than 128), going up to several thousand. But note that ne01 < 128 should also work, even if it is not very optimal. The threshold of 128 here is somewhat arbitrary - I don't think its exact value matters significantly.
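
For illustration, the selection logic being discussed could look roughly like the following - the kernel names and the 128 threshold are just placeholders taken from the discussion above, not the actual CUDA backend code:

```cpp
#include <cstdint>

enum class fa_kernel { vec, tile, mma };

// Hypothetical dispatch keyed on dimension 1 of Q (ne01, i.e. the number of
// query rows / n_batch). The thresholds are illustrative only.
static fa_kernel select_fa_kernel(int64_t ne01) {
    if (ne01 == 1)  return fa_kernel::vec;  // single query row (decode-style)
    if (ne01 < 128) return fa_kernel::tile; // small batch: should work, may be suboptimal
    return fa_kernel::mma;                  // large batch (up to several thousand rows)
}
```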

@JohannesGaessler
Collaborator

Okay, thank you. Support for Turing and newer should be relatively simple; AMD and older NVIDIA GPUs will need some additional consideration for edge cases.

@Green-Sky
Collaborator

Green-Sky commented Sep 22, 2025

>   • a multiple of 8 (e.g. 40, 80, 112) would be nice, though I don't know of prominent use cases

Stable Diffusion 1.x uses 40, 80, 120 ...
