CUDA: fix 0.0f/0.0f for FA fixup #18472
base: master
Conversation
I forgot: I'm not 100% sure whether this is the correct fix. That both of these values are 0.0f could be indicative of the loop bounds being slightly wrong. I'll check this tomorrow when I'm less tired.
One piece of info I didn't mention in the bug report: this was the only model which failed; other Qwen3-based VL models worked fine. So it's like something was numerically on a razor's edge and just happened to be pushed over on this particular CLIP model. Thanks for your fast response and for digging into the root cause. I'm not sure turning on arch 80 at build time is the right solution for the problem, as it might just be pushing something slightly to the working side and may still be intermittent.
@steampunque Does this patch fix it for you? For me it still does not work on GTX 1660. I build with: cmake -DGGML_CUDA=ON ..
...
-- Found CUDAToolkit: /usr/local/cuda-13.0/targets/x86_64-linux/include (found version "13.0.48")
-- CUDA Toolkit found
-- The CUDA compiler identification is NVIDIA 13.0.48
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda-13.0/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Replacing 120-real in CMAKE_CUDA_ARCHITECTURES_NATIVE with 120a-real
-- Using CMAKE_CUDA_ARCHITECTURES=120a-real;75-real CMAKE_CUDA_ARCHITECTURES_NATIVE=120a-real;75-real
-- CUDA host compiler is GNU 13.3.0
Here is the system info:
nvidia-smi
Tue Dec 30 09:03:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:01:00.0 Off | N/A |
| 0% 37C P8 10W / 575W | 15MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce GTX 1660 On | 00000000:09:00.0 Off | N/A |
| 0% 48C P8 15W / 130W | 6MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4529 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 4529 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
I'm forcing usage of the GTX 1660 with
This patch did fix the issue; see the main issue thread where I show the different patches and build configs. I think this patch is an umbrella covering problems generated elsewhere, since the other patch which was identified as the "root cause" did not fix the issue. If you turn on CUDA arch 80 (Ampere) at build time it should work with or without the patches, but I believe this also does not address the root cause; it sidesteps some real, tangible problem elsewhere that just doesn't happen to manifest on either debug builds or Ampere arch builds.
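For reference, the arch-80 workaround discussed above can be tried by pinning the CUDA architectures at configure time instead of relying on native detection (a sketch; `GGML_CUDA` appears in the build log above, `CMAKE_CUDA_ARCHITECTURES` is the standard CMake variable, and 75 is included so a Turing card still gets a native binary):

```shell
# Hypothetical configure line: compile the CUDA kernels for both
# Ampere (CC 8.0) and Turing (CC 7.5) explicitly.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80;75"
cmake --build build --config Release -j
```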
So just to be clear: you did re-create the build directory, made a build that wasn't suspiciously fast for CC 7.5, and that build was still causing issues? |
Yeah, I went through all the hoops. There is a table on the issue thread summarizing all my results. This is the only patch that made it work when compiling without 80 (Ampere) specified and then running on a 4070. I think compiling for 80 is covering up a problem, though. Debug builds fix the problem too, so my guess is it's related to optimizations that get turned off for debug builds.
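The debug-build observation can be reproduced the same way, only switching the build type (assuming the usual single-config CMake workflow; `CMAKE_BUILD_TYPE=Debug` disables the optimizations suspected above):

```shell
# Hypothetical comparison build: optimizations off, which reportedly
# makes the failure disappear.
cmake -B build-debug -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Debug
cmake --build build-debug -j
```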
Fixes #18444.
The problem seems to be in the fixup kernel where results from multiple CUDA blocks are combined. As it turns out, in this kernel the operation `0.0f/0.0f` can occur. According to IEEE 754 the result of this operation is NaN. However, because we are compiling with `-ffast_math`, the behavior is actually undefined. If the code is compiled for Turing the result is NaN; for Ada Lovelace it randomly just happens to work out.