graph : support cacheless embeddings with FA and iSWA #16528
Conversation
ggerganov commented on Oct 12, 2025
- Support cacheless iSWA models such as EmbeddingGemma (a sketch of the masking idea follows below)
- Enable FA for all cacheless models and batch sizes
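For context, iSWA (interleaved sliding-window attention) models alternate full-attention layers with sliding-window layers. The sketch below shows the general shape of a cacheless, bidirectional mask predicate; the layer pattern, the symmetric window, and all names are illustrative assumptions, not the actual llama.cpp implementation:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical iSWA mask predicate for a cacheless (non-causal) embedding
// model: may token i attend to token j at the given layer? The 6-layer
// pattern and the symmetric window are illustrative assumptions only.
static bool attn_allowed(int layer, int i, int j, int n_swa) {
    const bool is_swa_layer = (layer % 6) != 5; // e.g. 5 SWA layers per full-attention layer

    if (!is_swa_layer) {
        return true; // full bidirectional attention
    }

    return std::abs(i - j) < n_swa; // symmetric sliding window around token i
}

int main() {
    // print the mask row for token 8 at a SWA layer with a window of 4
    for (int j = 0; j < 16; ++j) {
        std::printf("%c", attn_allowed(0, 8, j, 4) ? 'x' : '.');
    }
    std::printf("\n"); // prints: .....xxxxxxx....
}
```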
The main change in this PR is to allow using FA kernels when there is no KV cache and hence the […]

I am running the following tests and I get NaNs when the prompt is longer than 256 tokens:

```sh
# prompt of 256 tokens - OK
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..255})"

# prompt of 257 tokens - FAIL (nans)
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..256})"
```

I tried to reproduce this by adding a test to […]

Edit: both the Vulkan and Metal backends produce good results in this case, so it is something related to the CUDA backend.
@ggerganov can you provide me with the model and command you're using to reproduce the issue?
@JohannesGaessler This is the model: https://huggingface.co/ggml-org/bge-m3-Q8_0-GGUF

This is the command:

```sh
make -j && CUDA_VISIBLE_DEVICES=0 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "$(printf 'hello %.0s' {1..256})"
```

On my end it outputs NaNs.
Ah sorry, I didn't see that the model is in the command.
Should be fixed by #16540.
Sorry, I accidentally pushed the wrong branch; use whichever one is more convenient for you.
Unfortunately I still get NaNs with the command above using this branch.

Just to clarify: if you are using the branch in #16540, it won't produce NaNs, but that is because that branch does not have the cacheless modifications introduced here. So in order to reproduce correctly, you have to check out the […]
The way I tested it was on top of […]
Please note that using […]

The problematic case is when the number of tokens is not a power of 2. For example, using […]

You can also reproduce this with other non-power-of-2 prompts:

```sh
# this results in 202 input tokens and triggers NaNs
-p "$(printf 'hello %.0s' {1..100})"

# this results in 64 input tokens and it works OK
-p "$(printf 'hello %.0s' {1..31})"

# this results in 66 input tokens and it produces NaNs
-p "$(printf 'hello %.0s' {1..32})"
```

You can determine the number of tokens for a given prompt by looking for the following log message: […]
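This dependence on the token count is consistent with the FA kernels processing K/V in fixed-size tiles: any token count that is not a multiple of the tile size leaves a partially out-of-bounds tile whose zero-filled entries can expose the issue. A minimal sketch of the arithmetic (the tile size of 64 is an assumption for illustration, not the kernel's actual constant):

```cpp
#include <cstdio>

int main() {
    const int tile = 64; // illustrative tile size, not taken from the actual kernel

    for (int n_tokens : {64, 66, 202, 256, 257}) {
        const int padded = ((n_tokens + tile - 1) / tile) * tile; // round up to a tile multiple
        const int n_oob  = padded - n_tokens;                     // zero-filled out-of-bounds K/V entries
        printf("n_tokens = %3d -> padded = %3d, out-of-bounds = %2d%s\n",
               n_tokens, padded, n_oob, n_oob > 0 ? " (can trigger the issue)" : "");
    }
}
```

With this assumed tile size, the token counts reported as OK (64, 256) are exactly the ones with no out-of-bounds entries, while 66, 202, and 257 all leave zero-filled padding in the last tile.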
Force-pushed from d8d99d1 to c308925
If the K/V data is out-of-bounds, the SRAM buffer is zeroed out, so the result of the KQ matrix multiplication is 0. However, there was no check to prevent the use of this result in the determination of the KQ maximum. As a consequence, if the KQ maximum of the real data is e.g. -100, you get numerical issues for the real data when you calculate the softmax.
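To make the failure mode concrete, here is a minimal CPU-side sketch of a numerically stable softmax (not the actual CUDA kernel; the scores are chosen so that the underflow is visible even in fp32, while in the half-precision kernels it sets in much earlier):

```cpp
#include <cstdio>
#include <cmath>
#include <vector>

// Softmax over the first n_real entries of kq; the remaining entries are
// zero-filled out-of-bounds padding. If pad_in_max is true, the padding
// wrongly participates in the running maximum, as in the bug described above.
static void softmax(const std::vector<float> & kq, int n_real, bool pad_in_max) {
    const int n_max = pad_in_max ? (int) kq.size() : n_real;

    float m = -INFINITY;
    for (int i = 0; i < n_max; ++i) {
        m = std::fmax(m, kq[i]);
    }

    float sum = 0.0f;
    for (int i = 0; i < n_real; ++i) {
        sum += std::exp(kq[i] - m); // underflows to 0 when m is wrongly 0
    }

    std::printf("max = %6.1f, denom = %g, p[0] = %g\n", m, sum, std::exp(kq[0] - m) / sum);
}

int main() {
    // real KQ values are strongly negative; the last two entries are padding
    const std::vector<float> kq = {-150.0f, -151.0f, -152.0f, 0.0f, 0.0f};

    softmax(kq, 3, false); // correct: max over real data only -> p[0] ~= 0.665
    softmax(kq, 3, true);  // buggy: padded zeros force max = 0 -> 0/0 = NaN
}
```

In the buggy variant the zero-filled padding forces the maximum to 0, every real exponential underflows to 0, and the denominator becomes 0, giving 0/0 = NaN, which matches the NaNs observed above.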
I hope all numerical issues are now fixed; I am no longer able to provoke NaNs.
Great, thank you. I confirm it works correctly now. I will first merge your #16540 PR with the fix and then proceed to rebase this one and merge it next.
Force-pushed from c308925 to 5734546