ggml : add support for non-padded FA KV #16148
Conversation
The Vulkan backend currently assumes some alignment - I'm not sure of the exact amount, and it probably varies by code path. We could add another spec constant that controls bounds checking. I recently refactored the flash attention shader management in a way that should make it easier to add more spec constant variants without having to iterate over or store all possible combinations. So it shouldn't be too much trouble.
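For readers unfamiliar with specialization constants, here is a rough host-side sketch of what such a variant could look like. This is only an illustration - the `constant_id`, the `KV_BOUNDS_CHECK` name, and the pipeline plumbing are assumptions, not the actual shader interface:

```cpp
// Hypothetical sketch: build the specialization info for a flash attention
// pipeline variant with an extra spec constant that enables KV bounds checking.
#include <vulkan/vulkan.h>

VkSpecializationInfo make_fa_spec_info(const uint32_t * kv_bounds_check) {
    static const VkSpecializationMapEntry entry = {
        /*.constantID =*/ 0,              // layout(constant_id = 0) in the shader (assumed)
        /*.offset     =*/ 0,
        /*.size       =*/ sizeof(uint32_t),
    };
    VkSpecializationInfo info = {};
    info.mapEntryCount = 1;
    info.pMapEntries   = &entry;
    info.dataSize      = sizeof(uint32_t);
    info.pData         = kv_bounds_check; // 0 = padded fast path, 1 = bounds-checked
    // passed via VkPipelineShaderStageCreateInfo::pSpecializationInfo when
    // compiling this pipeline variant
    return info;
}
```

The padded fast path would keep its current performance; the bounds-checked variant would only be selected when the KV size is not aligned.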
For CUDA I will soon make a PR that refactors and deduplicates the vector kernels. That will fix the biggest constraint on tensor shapes and it should be possible to reduce the granularity. What is the exact requirement? A reduced granularity of e.g. 32, or completely arbitrary tensor shapes?
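To illustrate what "completely arbitrary" implies for a kernel: the KV loop has to predicate its final tile instead of assuming padded input. A minimal sketch, where `KV_TILE` is a made-up stand-in for the kernel's KQ stride:

```cpp
// Sketch of tail handling for a non-padded KV tensor: full tiles take the
// unchecked fast path; only the final partial tile pays for the bound.
constexpr int KV_TILE = 32; // hypothetical tile width

void process_kv(int n_kv) {
    for (int j0 = 0; j0 < n_kv; j0 += KV_TILE) {
        const int jn = (n_kv - j0 < KV_TILE) ? (n_kv - j0) : KV_TILE;
        for (int jt = 0; jt < jn; ++jt) {
            const int j = j0 + jt;
            // ... load K/V column j, accumulate the online softmax state ...
            (void) j;
        }
    }
}
```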
In both of the use cases that I listed, the shapes of the K and V tensors can be arbitrary numbers - usually a function of the number of input vision tokens or the number of embeddings respectively. So just reducing the granularity won't be enough. Edit: this is all considering just the […]
It is definitely possible to make a version that works for arbitrary tensor shapes. What are typical values for ne10?
I've implemented the Vulkan changes for this, will make a PR soon.
Use cases range across most head shapes. For example, the recent EmbeddingGemma model uses a head size of 256, while some older BERT models have a head size of 32. So I would say for […]
Sorry, I was on my phone and mistyped. I meant to ask what typical values for […] are.
For […]
I'm very sorry, I also misremembered which source is which tensor. What I specifically want to know is the typical shape of the query tensor: depending on whether dimension 1 of Q is 1, larger than 1 but small, or much larger than 1, I'll prioritize a different kernel for a first version that works but maybe has suboptimal performance. (For pure text models the equivalent dimension I'm interested in is the physical batch size.)
In these two use cases, the dim 1 of Q (i.e. […])
Okay, thank you. Support for Turing and newer should be relatively simple; AMD and old NVIDIA will need some additional consideration for edge cases.
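For reference, the shape convention of ggml_flash_attn_ext, paraphrased from the comments in ggml.h (names approximate; ne[0] is the fastest-varying dimension, which is why "dim 1 of Q" above is the number of query tokens):

```cpp
// Approximate shape convention for ggml_flash_attn_ext (paraphrased):
//
//   q:    [head_dim_k, n_tokens, n_head,    n_batch]  // dim 1 = number of query tokens
//   k:    [head_dim_k, n_kv,     n_head_kv, n_batch]  // dim 1 = KV length (padded today)
//   v:    [head_dim_v, n_kv,     n_head_kv, n_batch]
//   mask: [n_kv,       n_tokens_padded, ...]
```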
Stable Diffusion 1.x uses 40, 80, 120 ...
Currently, the FA kernel implementations in some backends have padding requirements for the K and V tensors, i.e. the KV length must be rounded up to a backend-specific granularity.
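Roughly, the padding works along these lines (a sketch: the 256 granularity matches the CUDA FATTN_KQ_STRIDE convention, but it varies per backend, and pad_up is just a stand-in for ggml's GGML_PAD macro):

```cpp
#include <cstdio>

// stand-in for GGML_PAD from ggml.h
static int pad_up(int x, int n) {
    return (x + n - 1) / n * n;
}

int main() {
    const int n_kv     = 1337;              // e.g. an arbitrary number of vision tokens
    const int n_kv_pad = pad_up(n_kv, 256); // -> 1536; the extra 199 positions are masked out
    std::printf("n_kv = %d -> padded to %d\n", n_kv, n_kv_pad);
}
```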
This requirement is imposing some significant limitations (see llama.cpp/src/llama-graph.cpp, line 1287 at commit f6c4c4c).
I would like to address this issue and allow FA to be generally available regardless of the K and V sizes.
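Note that nothing in the attention math itself requires padding - a scalar reference over an arbitrary n_kv works as-is (a sketch below, single head and single query); the requirement comes purely from how the kernels tile the KV dimension:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar reference for one attention head and one query vector.
// n_kv can be any value - no alignment or padding is needed.
void attn_ref(const float * q, const float * k, const float * v,
              float * out, int head_dim, int n_kv, float scale) {
    std::vector<float> s(n_kv);
    float smax = -INFINITY;
    for (int j = 0; j < n_kv; ++j) {           // scores s = scale * K^T q
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d) {
            dot += q[d]*k[j*head_dim + d];
        }
        s[j] = dot*scale;
        smax = std::max(smax, s[j]);
    }
    float sum = 0.0f;
    for (int j = 0; j < n_kv; ++j) {           // softmax over the KV dimension
        s[j] = std::exp(s[j] - smax);
        sum += s[j];
    }
    for (int d = 0; d < head_dim; ++d) {       // out = V softmax(s)
        float acc = 0.0f;
        for (int j = 0; j < n_kv; ++j) {
            acc += s[j]*v[j*head_dim + d];
        }
        out[d] = acc/sum;
    }
}
```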
For the Metal backend, we already have the infrastructure to support this and can apply it relatively easily. I am thinking between two possible approaches: […]
I am not sure what the situation is with the CUDA backend. @JohannesGaessler Is this possible to achieve?
Also not sure about the state of the Vulkan backend in this regard. @0cc4m, @jeffbolznv Any insights are appreciated.