
Conversation

ggerganov
Copy link
Member

Currently, the FA kernel implementations in some backends have padding requirements for the K and V tensors:

  • Metal: multiple of 32
  • CUDA: multiple of 256
  • Vulkan: not sure
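
For context, these requirements mean the caller currently has to round n_kv up to the backend's granularity before FA can be used at all. A minimal sketch of that rounding; the helper below mirrors the power-of-two rounding done by the GGML_PAD macro in ggml.h:

```cpp
#include <cstdint>
#include <cstdio>

// Round x up to a multiple of n (n must be a power of two); this mirrors the
// GGML_PAD macro in ggml.h.
static int64_t pad_to(int64_t x, int64_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    const int64_t n_kv = 4097; // arbitrary KV length, e.g. from vision tokens
    printf("Metal: %lld -> %lld\n", (long long) n_kv, (long long) pad_to(n_kv, 32));  // 4097 -> 4128
    printf("CUDA:  %lld -> %lld\n", (long long) n_kv, (long long) pad_to(n_kv, 256)); // 4097 -> 4352
    return 0;
}
```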

This requirement imposes some significant limitations for use cases where the KV size is arbitrary - such as the vision and embedding use cases discussed below.

I would like to address this issue and allow FA to be generally available regardless of the K and V sizes.

For the Metal backend, we already have the infrastructure to support this and can apply it relatively easily. I am thinking between two possible approaches:

  • Building a second set of FA kernels that include bounds checks
  • Passing a small secondary padding buffer in addition to the main K and V buffers, and updating the main loop of the FA kernel to perform one extra iteration over the padding buffer. This should avoid the overhead of extra bounds checks. (Both approaches are sketched below.)
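
To make the trade-off concrete, here is a minimal scalar sketch of the two approaches - all names are hypothetical and this is only a stand-in for the kernel's main loop, not actual Metal code:

```cpp
#include <cstddef>

// Scalar stand-in for the FA kernel's main loop over K - hypothetical names,
// not actual Metal code. In the real kernel the padded positions would be
// neutralized via the mask; here zeros in the padding are enough for a sum.

// Approach 1: a second kernel variant with a per-element bounds check.
float sum_checked(const float * k, size_t n_kv, size_t block) {
    float acc = 0.0f;
    for (size_t i0 = 0; i0 < n_kv; i0 += block) {
        for (size_t i = i0; i < i0 + block; ++i) {
            if (i < n_kv) { // bounds check in the hot loop
                acc += k[i];
            }
        }
    }
    return acc;
}

// Approach 2: the main loop runs check-free over the full blocks of K, then
// one extra iteration runs over a small side buffer that holds the tail
// elements padded out to a full block.
float sum_padded(const float * k, size_t n_kv, const float * pad_block, size_t block) {
    const size_t n_full = (n_kv / block) * block;
    float acc = 0.0f;
    for (size_t i = 0; i < n_full; ++i) {
        acc += k[i]; // no bounds check
    }
    for (size_t i = 0; i < block; ++i) {
        acc += pad_block[i]; // one extra iteration over the padding buffer
    }
    return acc;
}
```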

I am not sure what the situation is with the CUDA backend. @JohannesGaessler Is this possible to achieve?

Also not sure about the state of the Vulkan backend in this regard. @0cc4m, @jeffbolznv Any insights are appreciated.

@jeffbolznv
Collaborator

The Vulkan backend currently assumes some alignment - I'm not sure of the exact amount, and it probably varies by code path. We could add another spec constant that controls bounds checking. I recently refactored the flash attention shader management in a way that should make it easier to add more spec constant variants without having to iterate over or store all possible combinations, so it shouldn't be too much trouble.
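
For readers unfamiliar with spec constants: they behave like compile-time flags, so the bounds check can be compiled out of the aligned variant entirely. A hypothetical C++ analogue (not the actual shader code) of what such a variant pair could look like:

```cpp
#include <cstddef>

// Hypothetical C++ analogue of a shader specialization constant: the bounds
// check is resolved at compile time, so the aligned variant has no runtime
// branch at all.
template <bool bounds_check>
float fa_row_sum(const float * k, size_t n_kv, size_t n_kv_padded) {
    float acc = 0.0f;
    for (size_t i = 0; i < n_kv_padded; ++i) {
        if constexpr (bounds_check) {
            if (i >= n_kv) break; // compiled only into the unaligned variant
        }
        acc += k[i];
    }
    return acc;
}

// The host side picks the variant, much like binding a pipeline specialized
// with a different spec constant value.
float dispatch(const float * k, size_t n_kv, size_t align) {
    const size_t padded = (n_kv + align - 1) / align * align;
    return n_kv == padded ? fa_row_sum<false>(k, n_kv, padded)
                          : fa_row_sum<true>(k, n_kv, padded);
}
```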

@JohannesGaessler
Collaborator

For CUDA I will soon make a PR that refactors and deduplicates the vector kernels. That will fix the biggest constraint on tensor shapes, and it should be possible to reduce the granularity. What is the exact requirement? A reduced granularity of e.g. 32, or completely arbitrary tensor shapes?

@ggerganov
Member Author

ggerganov commented Sep 21, 2025

> What is the exact requirement?

In both of the use cases that I listed, the shapes of the K and V tensors can be arbitrary numbers - usually a function of the number of input vision tokens or the number of embeddings, respectively. So just reducing the granularity won't be enough.

Edit: this is all considering just the n_kv shape of the K, V and mask tensors, i.e. ne11, ne21 and ne30 (in ggml's neXY notation: dimension Y of the op's source tensor X).

@JohannesGaessler
Collaborator

It is definitely possible to make a version that works for arbitrary tensor shapes. What are typical values for ne10?

@jeffbolznv
Collaborator

I've implemented the Vulkan changes for this, will make a PR soon.

@ggerganov
Member Author

> It is definitely possible to make a version that works for arbitrary tensor shapes. What are typical values for ne10?

Use cases range across most head sizes. For example, the recent EmbeddingGemma model uses a head size of 256, while some older BERT models have a head size of 32. So I would say for ne10:

  • a multiple of 32 is a must
  • a multiple of 8 (e.g. 40, 80, 112) would be nice, though I don't know of prominent use cases

@JohannesGaessler
Collaborator

Sorry, I was on my phone and mistyped. I meant to ask what typical values for ne11 are. I'm asking because that is the primary dimension based on which the CUDA backend selects kernels.

@ggerganov
Member Author

For ne11, with embedding models it can typically be any value all the way up to 8192. Not sure what the requirements are for vision models, but I would guess that this might go even further beyond 8192 in some cases.

@JohannesGaessler
Collaborator

I'm very sorry, I also misremembered which source is which tensor. What I specifically want to know is the typical shape of the query tensor: depending on whether dimension 1 of Q is 1, larger than 1 but small, or much larger than 1, I'll prioritize a different kernel for a first version that works but maybe has suboptimal performance. (For pure text models, the equivalent dimension I'm interested in is the physical batch size.)

@ggerganov
Member Author

In these two use cases, dim 1 of Q (i.e. ne01 / n_batch) is effectively equal to the n_kv shape (i.e. ne11, ne21 and ne30), so my previous comment translates directly to the Q dim 1 shape as well. I think the case we should optimize for is large ne01 (i.e. more than 128), going up to several thousand. But note that ne01 < 128 should also work, even if it is not very optimal. The threshold of 128 here is somewhat arbitrary - I don't think its exact value matters significantly.
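
For illustration, the selection logic being discussed could look roughly like the following - the kernel names and the 128 threshold are just placeholders taken from the discussion above, not the actual CUDA backend code:

```cpp
#include <cstdint>

enum class fa_kernel { vec, tile, mma };

// Hypothetical dispatch keyed on dimension 1 of Q (ne01, i.e. the number of
// query rows / n_batch). The thresholds are illustrative only.
static fa_kernel select_fa_kernel(int64_t ne01) {
    if (ne01 == 1)  return fa_kernel::vec;  // single query row (decode-style)
    if (ne01 < 128) return fa_kernel::tile; // small batch: should work, may be suboptimal
    return fa_kernel::mma;                  // large batch (up to several thousand rows)
}
```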

@JohannesGaessler
Collaborator

Okay, thank you. Support for Turing and newer should be relatively simple; AMD and older NVIDIA GPUs will need some additional consideration for edge cases.

@Green-Sky
Collaborator

Green-Sky commented Sep 22, 2025

>   • a multiple of 8 (e.g. 40, 80, 112) would be nice, though I don't know of prominent use cases

Stable Diffusion 1.x uses 40, 80, 120 ...
