graph : support cacheless embeddings with FA and iSWA #16528
Conversation
ggerganov commented on Oct 12, 2025
- Support cacheless iSWA models such as EmbeddingGemma (a sketch of the masking idea follows below)
- Enable FA for all cacheless models and batch sizes
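For context, iSWA (interleaved sliding-window attention) models alternate full-attention layers with sliding-window layers. The sketch below shows the general shape of a cacheless, bidirectional mask predicate; the layer pattern, the symmetric window, and all names are illustrative assumptions, not the actual llama.cpp implementation:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical iSWA mask predicate for a cacheless (non-causal) embedding
// model: may token i attend to token j at the given layer? The 6-layer
// pattern and the symmetric window are illustrative assumptions only.
static bool attn_allowed(int layer, int i, int j, int n_swa) {
    const bool is_swa_layer = (layer % 6) != 5; // e.g. 5 SWA layers per full-attention layer

    if (!is_swa_layer) {
        return true; // full bidirectional attention
    }

    return std::abs(i - j) < n_swa; // symmetric sliding window around token i
}

int main() {
    // print the mask row for token 8 at a SWA layer with a window of 4
    for (int j = 0; j < 16; ++j) {
        std::printf("%c", attn_allowed(0, 8, j, 4) ? 'x' : '.');
    }
    std::printf("\n"); // prints: .....xxxxxxx....
}
```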
The main change in this PR is to allow using FA kernels when there is no KV cache and hence the […]

I am running the following tests and I get NaNs when the prompt is longer than 256 tokens:

```sh
# prompt of 256 tokens - OK
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..255})"

# prompt of 257 tokens - FAIL (nans)
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..256})"
```

I tried to reproduce this by adding a test to […]

Edit: both the Vulkan and Metal backends produce good results in this case, so it is something related to the CUDA backend.
@ggerganov can you provide me with the model and command you're using to reproduce the issue?
@JohannesGaessler This is the model: https://huggingface.co/ggml-org/bge-m3-Q8_0-GGUF

This is the command:

```sh
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..256})"
```

On my end it outputs NaNs.
Ah sorry, I didn't see that the model is in the command.
Should be fixed by #16540.
Sorry, I accidentally pushed the wrong branch; use whichever one is more convenient for you.
Unfortunately I still get NaNs with the command above using this branch.

Just to clarify: if you are using the branch in #16540, it won't produce NaNs, but that is because that branch does not have the cacheless modifications introduced here. So in order to reproduce correctly, you have to check out the […]
The way I tested it was on top of […]
Please note that using […]

The problematic case is when the number of tokens is not a power of 2. For example, using […]

You can also reproduce this with other non-power-of-2 prompts:

```sh
# this results in 202 input tokens and triggers NaNs
-p "$(printf 'hello %.0s' {1..100})"

# this results in 64 input tokens and it works OK
-p "$(printf 'hello %.0s' {1..31})"

# this results in 66 input tokens and it produces NaNs
-p "$(printf 'hello %.0s' {1..32})"
```

You can determine the number of tokens for a given prompt by looking for the following log message: […]
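This dependence on the token count is consistent with the FA kernels processing K/V in fixed-size tiles: any token count that is not a multiple of the tile size leaves a partially out-of-bounds tile whose zero-filled entries can expose the issue. A minimal sketch of the arithmetic (the tile size of 64 is an assumption for illustration, not the kernel's actual constant):

```cpp
#include <cstdio>

int main() {
    const int tile = 64; // illustrative tile size, not taken from the actual kernel

    for (int n_tokens : {64, 66, 202, 256, 257}) {
        const int padded = ((n_tokens + tile - 1) / tile) * tile; // round up to a tile multiple
        const int n_oob  = padded - n_tokens;                     // zero-filled out-of-bounds K/V entries
        printf("n_tokens = %3d -> padded = %3d, out-of-bounds = %2d%s\n",
               n_tokens, padded, n_oob, n_oob > 0 ? " (can trigger the issue)" : "");
    }
}
```

With this assumed tile size, the token counts reported as OK (64, 256) are exactly the ones with no out-of-bounds entries, while 66, 202, and 257 all leave zero-filled padding in the last tile.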
Force-pushed from d8d99d1 to c308925
If the K/V data is out-of-bounds, the SRAM buffer is zeroed out, so the result of the KQ matrix multiplication is 0. However, there was no check to prevent the use of this result in the determination of the KQ maximum. As a consequence, if the KQ maximum of the real data is e.g. -100, you get numerical issues for the real data when you calculate the softmax.
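To make the failure mode concrete, here is a minimal CPU-side sketch of a numerically stable softmax (not the actual CUDA kernel; the scores are chosen so that the underflow is visible even in fp32, while in the half-precision kernels it sets in much earlier):

```cpp
#include <cstdio>
#include <cmath>
#include <vector>

// Softmax over the first n_real entries of kq; the remaining entries are
// zero-filled out-of-bounds padding. If pad_in_max is true, the padding
// wrongly participates in the running maximum, as in the bug described above.
static void softmax(const std::vector<float> & kq, int n_real, bool pad_in_max) {
    const int n_max = pad_in_max ? (int) kq.size() : n_real;

    float m = -INFINITY;
    for (int i = 0; i < n_max; ++i) {
        m = std::fmax(m, kq[i]);
    }

    float sum = 0.0f;
    for (int i = 0; i < n_real; ++i) {
        sum += std::exp(kq[i] - m); // underflows to 0 when m is wrongly 0
    }

    std::printf("max = %6.1f, denom = %g, p[0] = %g\n", m, sum, std::exp(kq[0] - m) / sum);
}

int main() {
    // real KQ values are strongly negative; the last two entries are padding
    const std::vector<float> kq = {-150.0f, -151.0f, -152.0f, 0.0f, 0.0f};

    softmax(kq, 3, false); // correct: max over real data only -> p[0] ~= 0.665
    softmax(kq, 3, true);  // buggy: padded zeros force max = 0 -> 0/0 = NaN
}
```

In the buggy variant the zero-filled padding forces the maximum to 0, every real exponential underflows to 0, and the denominator becomes 0, giving 0/0 = NaN, which matches the NaNs observed above.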
I hope all numerical issues are now fixed; I am no longer able to provoke NaNs.
Great, thank you. I confirm it works correctly now. I will first merge your #16540 PR with the fix and then proceed to rebase this one and merge it next.
Force-pushed from c308925 to 5734546