Conversation

@am17an (Collaborator) commented Nov 2, 2025

Fix #16799. When fusing just a mul-mat + bias, we don't check whether the buffer is split; we do check this when fusing gate + up. Tested on 3x 4090 with gpt-oss-120b.
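For reference, the change essentially applies the same split-buffer guard that the fused gate + up path already uses to the plain mul-mat + bias path. A minimal sketch of the idea (function and variable names are illustrative, not the exact diff; ggml_backend_buft_is_cuda_split is the CUDA backend's split-buffer check):

// Sketch: refuse the mul-mat + bias fusion when the weight is stored in a
// split buffer, because split buffers distribute the rows of the matrix
// across GPUs and the fused kernel assumes the whole row is on one device.
static bool can_fuse_mul_mat_bias(const ggml_tensor * mul_mat, const ggml_tensor * bias_add) {
    const ggml_tensor * weight = mul_mat->src[0];
    if (weight->buffer && ggml_backend_buft_is_cuda_split(ggml_backend_buffer_get_type(weight->buffer))) {
        return false;
    }
    GGML_UNUSED(bias_add);
    return true; // the remaining op/shape checks are omitted here
}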

@am17an am17an requested a review from slaren as a code owner November 2, 2025 11:51
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 2, 2025
@am17an am17an changed the title from "CUDA: avoid mul + bias fusion when doing fusion" to "CUDA: avoid mul + bias fusion when buffers are split" Nov 2, 2025
@IMbackK (Collaborator) left a comment

Yes, unsurprisingly since this is just disabling fusion in this case, this fixes the issue.

@am17an (Collaborator, Author) commented Nov 2, 2025

> Yes, unsurprisingly since this is just disabling fusion in this case, this fixes the issue.

Yes, it's unlikely fusion would help in this case anyway.

@IMbackK (Collaborator) commented Nov 2, 2025

We should probably have some multi-GPU unit tests to catch this sort of thing.

@JohannesGaessler (Collaborator) commented

@am17an should we merge this?

@am17an (Collaborator, Author) commented Nov 3, 2025

I'm wondering whether we should just disable fusion outright if we detect that any buffer is split or --ot is used.

@am17an (Collaborator, Author) commented Nov 3, 2025

At least for --sm row it doesn't make sense to support it; with --ot we should still support it, similar to CUDA graphs. What do you guys think?

@JohannesGaessler (Collaborator) commented

If there are issues with -ot, they could be related to padding not being cleared correctly. To avoid having to do out-of-bounds checks, the matrix multiplication code can read a bit further than ne00, in essence reading a bit of data from the next row. To avoid changing the result, the activation columns are padded with 0 at the end. However, the last row of src0 also needs to be padded, because the data there can randomly encode e.g. NaN, which would cause the final result to become NaN. See:

// If src0 is a temporary compute buffer, clear any potential padding.
if (ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE) {
    const size_t size_data  = ggml_nbytes(src0);
    const size_t size_alloc = ggml_backend_buffer_get_alloc_size(src0->buffer, src0);
    if (size_alloc > size_data) {
        GGML_ASSERT(ggml_is_contiguously_allocated(src0));
        GGML_ASSERT(!src0->view_src);
        CUDA_CHECK(cudaMemsetAsync((char *) src0->data + size_data, 0, size_alloc - size_data, stream));
    }
}

@am17an (Collaborator, Author) commented Nov 3, 2025

I think we already check this with bad_padding_clear? Or is this something else?

@slaren (Member) commented Nov 3, 2025

ggml_backend_sched may replace some tensors in the graph when necessary to make a copy on a different backend, and I wonder if that may be causing the fusion check to fail, because ggml_node_get_use_count may not work properly for these nodes.
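For context, the fusion checks rely on a use-count condition along the lines of the sketch below (paraphrased, not the actual ggml helpers; count_uses here is a hypothetical stand-in). If ggml_backend_sched swaps a source for a copy on another backend, the use count recorded for the original intermediate may no longer describe the graph that is actually executed:

// Fusing node i (mul-mat) with node i+1 (add) is only safe when the
// intermediate result has exactly one consumer, and that consumer is node i+1.
static bool can_fuse_pair(const struct ggml_cgraph * graph, int i) {
    const struct ggml_tensor * a = graph->nodes[i];
    const struct ggml_tensor * b = graph->nodes[i + 1];
    if (count_uses(graph, a) != 1) {          // hypothetical helper
        return false;
    }
    if (b->src[0] != a && b->src[1] != a) {   // the add must actually consume a
        return false;
    }
    return true;
}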

@am17an (Collaborator, Author) commented Nov 3, 2025

To be clear, there have been no crashes reported with --ot; the related issue I'm referring to is #16912, which seems to be a mix of fusion and tensor re-ordering. The illegal memory accesses are all reported with --sm row, which this PR was supposed to fix, but another user reported a further crash which I've not been able to reproduce.

@JohannesGaessler (Collaborator) commented

> I think we already check this with bad_padding_clear? Or is this something else?

What I mean is that the padding is being cleared for src0 but from what I can tell not for gate.

More generally, -sm row has never worked properly for AMD in the first place, and an illegal memory access is very unspecific, so I think there are two separate bugs here. For the longest time I didn't even have the hardware to debug this, and now I'm almost at the point where I intend to replace the current -sm row implementation anyway. So I'm not convinced that getting the current version of -sm row to work on AMD is worth the opportunity cost.
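If the fused path is kept for compute buffers, the gate tensor would presumably need the same treatment as src0 in the snippet above; a hedged sketch, simply mirroring that code with an illustrative gate variable:

// Hypothetical mirror of the src0 padding clear for the gate operand.
if (ggml_backend_buffer_get_usage(gate->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE) {
    const size_t size_data  = ggml_nbytes(gate);
    const size_t size_alloc = ggml_backend_buffer_get_alloc_size(gate->buffer, gate);
    if (size_alloc > size_data) {
        GGML_ASSERT(ggml_is_contiguously_allocated(gate));
        GGML_ASSERT(!gate->view_src);
        CUDA_CHECK(cudaMemsetAsync((char *) gate->data + size_data, 0, size_alloc - size_data, stream));
    }
}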

@IMbackK (Collaborator) commented Nov 3, 2025

> More generally, -sm row has never worked properly for AMD

It works fine here (present bug aside). I think the perception that it doesn't work comes from ROCr having had multiple bugs relating to handling various P2P scenarios.

@am17an (Collaborator, Author) commented Nov 4, 2025

I am merging this as it solves quite a few --sm row issues; I still need to investigate one more crash which I can't reproduce.

@am17an am17an merged commit 2759ccd into ggml-org:master Nov 4, 2025
121 of 130 checks passed
@am17an am17an deleted the cuda-fix-sm-row branch November 4, 2025 03:01

Development

Successfully merging this pull request may close these issues.

Eval bug: ROCm illegal memory access with -sm row

4 participants