CUDA: fix 0.0f/0.0f for FA fixup #18472
base: master
Conversation
I forgot: I'm not 100% sure whether this is the correct fix. That both of these values are 0.0f could be indicative of the loop bounds being slightly wrong. I'll check this tomorrow when I'm less tired.
One piece of info I didn't mention in the bug report: this was the only model which failed; other Qwen3-based VL models worked fine. So it's like something was numerically on a razor's edge and just happened to be pushed over on this particular CLIP model. Thanks for your fast response and for digging into the root cause. I'm not sure turning on arch 80 at build time is the right solution for the problem, as it might just be pushing something slightly to the working side and may still be intermittent.
@steampunque Does this patch fix it for you? For me it still does not work on GTX 1660. I build with: cmake -DGGML_CUDA=ON ..
...
-- Found CUDAToolkit: /usr/local/cuda-13.0/targets/x86_64-linux/include (found version "13.0.48")
-- CUDA Toolkit found
-- The CUDA compiler identification is NVIDIA 13.0.48
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda-13.0/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Replacing 120-real in CMAKE_CUDA_ARCHITECTURES_NATIVE with 120a-real
-- Using CMAKE_CUDA_ARCHITECTURES=120a-real;75-real CMAKE_CUDA_ARCHITECTURES_NATIVE=120a-real;75-real
-- CUDA host compiler is GNU 13.3.0
Here is the system info:
nvidia-smi
Tue Dec 30 09:03:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:01:00.0 Off | N/A |
| 0% 37C P8 10W / 575W | 15MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce GTX 1660 On | 00000000:09:00.0 Off | N/A |
| 0% 48C P8 15W / 130W | 6MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4529 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 4529 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
I'm forcing usage of the GTX 1660 with
This patch did fix the issue; see the main issue thread where I show the different patches and build configs. I think this patch is an umbrella covering problems generated elsewhere, since the other patch which was identified as the "root cause" did not fix the issue. If you turn on CUDA arch 80 (Ampere) at build time it should work with or without the patches, but I believe this also does not address the root cause; it sidesteps some real, tangible problem elsewhere that just doesn't happen to manifest on either debug builds or Ampere arch builds.
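For reference, the arch-80 workaround discussed above can be tried by pinning the CUDA architectures at configure time instead of relying on native detection (a sketch; `GGML_CUDA` appears in the build log above, `CMAKE_CUDA_ARCHITECTURES` is the standard CMake variable, and 75 is included so a Turing card still gets a native binary):

```shell
# Hypothetical configure line: compile the CUDA kernels for both
# Ampere (CC 8.0) and Turing (CC 7.5) explicitly.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80;75"
cmake --build build --config Release -j
```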
So just to be clear: you did re-create the build directory, made a build that wasn't suspiciously fast for CC 7.5, and that build was still causing issues? |
Yeah, I went through all the hoops. There is a table on the issue thread summarizing all my results. This is the only patch that made it work when compiling without 80 (Ampere) specified and then running on a 4070. I think compiling for 80 is covering up a problem, though. Debug builds fix the problem too, so my guess is it's related to optimizations that get turned off for debug builds.
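The debug-build observation can be reproduced the same way, only switching the build type (assuming the usual single-config CMake workflow; `CMAKE_BUILD_TYPE=Debug` disables the optimizations suspected above):

```shell
# Hypothetical comparison build: optimizations off, which reportedly
# makes the failure disappear.
cmake -B build-debug -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Debug
cmake --build build-debug -j
```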
Fixes #18444.
The problem seems to be in the fixup kernel where results from multiple CUDA blocks are combined. As it turns out, in this kernel the operation `0.0f/0.0f` can occur. According to IEEE 754 the result of this operation is NaN. However, because we are compiling with `-ffast_math`, the behavior is actually undefined. If the code is compiled for Turing the result is NaN; for Ada Lovelace it randomly just happens to work out.