
Conversation

@JohannesGaessler
Collaborator

Fixes #18444.

The problem seems to be the fixup kernel where results from multiple CUDA blocks are combined. As it turns out, the operation 0.0f/0.0f can occur in this kernel. According to IEEE 754 the result of this operation is NaN. However, because we are compiling with fast math (-ffast-math), the behavior is actually undefined. If the code is compiled for Turing the result is NaN; for Ada Lovelace it randomly just happens to work out.
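To illustrate the failure mode, here is a hypothetical sketch, not the actual llama.cpp fixup kernel; the names fixup_combine, partial, and weight are made up. When every contributing block produced a zero partial result and zero weight, a naive weighted combine evaluates 0.0f/0.0f: NaN under IEEE 754, but undefined once fast math lets the compiler assume the divisor is nonzero, so the outcome can differ between architectures. The fix sketched here guards the division explicitly.

// Hypothetical sketch, not the actual llama.cpp kernel.
// Combines nblocks partial results per output column; each partial value
// comes with a weight, and the combined value is the weighted average.
__global__ void fixup_combine(const float * partial, const float * weight,
                              float * dst, const int ncols, const int nblocks) {
    const int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col >= ncols) {
        return;
    }

    float sum   = 0.0f;
    float w_sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        sum   += partial[i*ncols + col] * weight[i*ncols + col];
        w_sum += weight[i*ncols + col];
    }

    // Guard the all-zero case explicitly instead of relying on 0.0f/0.0f,
    // which is NaN under IEEE 754 and undefined under fast math.
    dst[col] = w_sum == 0.0f ? 0.0f : sum/w_sum;
}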

@JohannesGaessler
Collaborator Author

I forgot: I'm not 100% sure whether this is the correct fix. That both of these values are 0.0f could be indicative of the loop bounds being slightly wrong. I'll check this tomorrow when I'm less tired.

@steampunque

I forgot: I'm not 100% sure whether this is the correct fix. That both of these values are 0.0f could be indicative of the loop bounds being slightly wrong. I'll check this tomorrow when I'm less tired.

One piece of info I didn't mention in the bug report is that this was the only model which failed; other Qwen3-based VL models worked fine. So it's like something was numerically on a razor's edge and just happened to tip over on this particular CLIP model. Thanks for your fast response and also for digging into the root cause; I am not sure whether turning on arch 80 at build time is the right solution for the problem, as it might just be pushing something slightly to the right side of working and could still be intermittent.

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 29, 2025
@ggerganov
Member

@steampunque Does this patch fix it for you? For me it still does not work on GTX 1660.

I build with:

cmake -DGGML_CUDA=ON ..

...

-- Found CUDAToolkit: /usr/local/cuda-13.0/targets/x86_64-linux/include (found version "13.0.48") 
-- CUDA Toolkit found
-- The CUDA compiler identification is NVIDIA 13.0.48
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda-13.0/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Replacing 120-real in CMAKE_CUDA_ARCHITECTURES_NATIVE with 120a-real
-- Using CMAKE_CUDA_ARCHITECTURES=120a-real;75-real CMAKE_CUDA_ARCHITECTURES_NATIVE=120a-real;75-real
-- CUDA host compiler is GNU 13.3.0

Here is system info:

nvidia-smi 
Tue Dec 30 09:03:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   37C    P8             10W /  575W |      15MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1660        On  |   00000000:09:00.0 Off |                  N/A |
|  0%   48C    P8             15W /  130W |       6MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            4529      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A            4529      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

I'm forcing usage of the GTX 1660 with CUDA_VISIBLE_DEVICES=1
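As an aside, a standalone check rather than anything in llama.cpp: since CUDA_VISIBLE_DEVICES remaps device indices, the runtime should report the GTX 1660 as device 0 in this setup. A small CUDA program to confirm which devices are visible:

// Standalone sketch, not part of llama.cpp: list the devices the CUDA
// runtime sees after CUDA_VISIBLE_DEVICES has been applied.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}

With CUDA_VISIBLE_DEVICES=1 this should print only the GTX 1660 as device 0, with compute capability 7.5.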

@steampunque

@steampunque Does this patch fix it for you? For me it still does not work on GTX 1660.

This patch did fix the issue; see the main issue thread where I show the different patches and build configs. I think this patch is an umbrella covering problems generated elsewhere; the other patch, which was identified as the "root cause", did not fix the issue. If you turn on CUDA arch 80 (Ampere) at build time it should work with or without the patches; however, I believe this also does not address the root cause but sidesteps some real, tangible problem elsewhere that just doesn't happen to manifest on either debug builds or Ampere arch builds.

@JohannesGaessler
Collaborator Author

So just to be clear: you did re-create the build directory, made a build that wasn't suspiciously fast for CC 7.5, and that build was still causing issues?

@steampunque

So just to be clear: you did re-create the build directory, made a build that wasn't suspiciously fast for CC 7.5, and that build was still causing issues?

Yeah, I went through all the hoops. There is a table on the issue thread summarizing all my results. This is the only patch that made it work when compiling without 80 (Ampere) specified and then running on a 4070. I think compiling for 80 is covering up a problem though. Debug builds fix the problem too, so my guess is it's something related to optimizations that get turned off for debug builds.


Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval bug: b7256 breaks MiniCPM-V-4_5

3 participants