
Conversation

ggerganov (Member)

  • Support cacheless iSWA (interleaved sliding-window attention) models such as EmbeddingGemma
  • Enable FA (flash attention) for all cacheless models and batch sizes

@ggerganov (Member Author) commented Oct 12, 2025

The main change in this PR is to allow using FA kernels when there is no KV cache, in which case the n_kv dimension of the tensors is not necessarily padded to 256. My understanding is that #16492 should enable support for n_kv % 256 != 0.

I am running the following tests and I get NaNs when n_kv % 256 != 0:

# 255 repetitions -> 512 tokens - OK
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..255})"

# 256 repetitions -> 514 tokens - FAIL (NaNs)
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..256})"

I tried to reproduce this by adding a test to test-backend-ops with the same shapes, but I wasn't able to. @JohannesGaessler Any ideas what's going wrong here? I am running this on RTX 5090.

Edit: both the Vulkan and Metal backends produce correct results in this case, so the problem is specific to the CUDA backend.
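
For reference, the failing case corresponds to a flash-attention op whose n_kv is not a multiple of 256. Below is a minimal sketch (not the PR's code, and not a working test-backend-ops entry) of how such an op could be constructed through the public ggml API; the head dimension, head count, and mask-padding convention here are assumptions on my part:

#include <math.h>

#include "ggml.h"

// build a non-causal flash-attention op with n_kv = 514 (n_kv % 256 != 0),
// roughly matching the shapes of the failing bge-m3 case
static struct ggml_tensor * build_fa_514(struct ggml_context * ctx) {
    const int64_t d_head = 64;  // assumed head dimension for bge-m3
    const int64_t n_head = 16;  // assumed head count
    const int64_t n_q    = 514; // cacheless: n_q == n_kv == number of prompt tokens
    const int64_t n_kv   = 514;

    struct ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, d_head, n_q,  n_head, 1);
    struct ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, d_head, n_kv, n_head, 1);
    struct ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, d_head, n_kv, n_head, 1);

    // F16 mask with the row padding that the FA kernels expect
    struct ggml_tensor * m = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_kv, GGML_PAD(n_q, GGML_KQ_MASK_PAD));

    return ggml_flash_attn_ext(ctx, q, k, v, m, 1.0f/sqrtf((float) d_head), 0.0f, 0.0f);
}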

@JohannesGaessler (Collaborator)

@ggerganov can you provide me with the model and command you're using to reproduce the issue?

@ggerganov (Member Author)

@JohannesGaessler This is the model: https://huggingface.co/ggml-org/bge-m3-Q8_0-GGUF

This is the command:

make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..256})"

On my end it outputs NaNs:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:        CUDA0 model buffer size =   307.22 MiB
load_tensors:   CPU_Mapped model buffer size =   291.41 MiB
......................................................
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 4096
llama_context: causal_attn   = 0
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.96 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =   168.02 MiB
llama_context:  CUDA_Host compute buffer size =    96.05 MiB
llama_context: graph nodes  = 780
llama_context: graph splits = 4 (with bs=4096), 2 (with bs=1)
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 750,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
batch_decode: n_tokens = 514, n_seq = 1
embedding 0:       nan       nan       nan       nan       nan       nan ... (every embedding value is NaN; full output omitted)
llama_perf_context_print:        load time =     308.88 ms
llama_perf_context_print: prompt eval time =      74.70 ms /   514 tokens (    0.15 ms per token,  6881.13 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =      75.45 ms /   515 tokens
llama_perf_context_print:    graphs reused =          0

@JohannesGaessler (Collaborator)

Ah sorry, I didn't see that the model is in the command.

@JohannesGaessler (Collaborator)

Should be fixed by #16540.

JohannesGaessler self-requested a review as a code owner on October 12, 2025 at 19:43.
@JohannesGaessler (Collaborator)

Sorry, I accidentally pushed the wrong branch; use whichever one is more convenient for you.

The github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Oct 12, 2025.
@ggerganov (Member Author)

Unfortunately, I still get NaNs with the command above using this branch.

Just to clarify: the branch in #16540 by itself won't produce NaNs, but that is because it does not include the cacheless modifications introduced here. To reproduce correctly, you have to check out the gg/cacheless-embd branch and run the command from the previous comment.

@JohannesGaessler (Collaborator)

The way I tested it was on top of gg/cacheless-embd; I then cherry-picked the commit onto a different branch since the issue applies more generally. I get finite results with the short test prompt I was using, as well as with -p "$(printf 'hello %.0s' {1..255})", but I get NaNs with -f LICENSE. So presumably there is at least one other point in the kernel with numerical issues.

@ggerganov (Member Author)

Please note that using -p "$(printf 'hello %.0s' {1..255})" results in 512 tokens for this model: each repetition tokenizes to 2 tokens, plus the extra BOS and EOS tokens. This case is not problematic - i.e. it outputs the correct numbers.

The problematic case is when the number of tokens is not a power of 2. For example, -p "$(printf 'hello %.0s' {1..256})" produces 514 tokens, which results in NaNs.

You can also reproduce this with other non-power-of-2 prompts:

# this results in 202 input tokens and triggers NaNs
-p "$(printf 'hello %.0s' {1..100})"

# this results in 64 input tokens and it works OK
-p "$(printf 'hello %.0s' {1..31})"

# this results in 66 input tokens and it produces NaNs
-p "$(printf 'hello %.0s' {1..32})"

You can determine the number of tokens for a given prompt by looking for the following log message:

batch_decode: n_tokens = 202, n_seq = 1
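
In other words, the token count for these prompts appears to follow a simple rule (inferred from the logged counts above; hypothetical helper, for illustration only):

// 2 tokens per 'hello ' repetition, plus BOS and EOS
static int expected_tokens(int n_repeats) {
    return 2*n_repeats + 2;
}
// e.g. expected_tokens(100) == 202, expected_tokens(256) == 514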

@JohannesGaessler (Collaborator)

If the K/V data is out of bounds, the SRAM buffer is zeroed out. The result of the KQ matrix multiplication is then 0. However, there was no check to prevent this result from being used in the determination of the KQ maximum. As a consequence, if the KQ maximum of the real data is e.g. -100, you get numerical issues for the real data when you calculate the softmax.
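
A minimal numeric sketch of that failure mode (not the kernel code; it assumes fp16-like underflow in the softmax accumulation, which is approximated here by flushing tiny values to zero):

#include <algorithm>
#include <cmath>
#include <cstdio>

// crude stand-in for fp16 underflow: flush magnitudes below ~6e-8 to zero
static float to_half(float x) { return std::fabs(x) < 6e-8f ? 0.0f : x; }

int main() {
    const float kq_real = -100.0f; // KQ value of the real data
    const float kq_pad  =    0.0f; // KQ result of a zeroed out-of-bounds column

    // buggy: the zeroed column participates in the KQ maximum
    const float bad_max = std::max(kq_real, kq_pad);            // = 0
    const float bad_p   = to_half(std::exp(kq_real - bad_max)); // exp(-100) underflows to 0
    std::printf("buggy: %f\n", bad_p / bad_p);                  // 0/0 = nan

    // fixed: the maximum is taken over the real data only
    const float good_p  = to_half(std::exp(kq_real - kq_real)); // exp(0) = 1
    std::printf("fixed: %f\n", good_p / good_p);                // = 1
    return 0;
}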

@JohannesGaessler (Collaborator)

I hope all numerical issues are now fixed; I am no longer able to provoke NaNs.

@ggerganov (Member Author)

Great, thank you. I confirm it works correctly now.

I will first merge your #16540 PR with the fix, then rebase this one and merge it next.
