Conversation

@am17an (Collaborator) commented Nov 2, 2025

Fix #16799. When fusing just a mul-mat + bias, we don't check whether the buffer is split; we do check this when fusing gate + up. Tested on 3x 4090 with gpt-oss-120b.
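For reference, the change essentially applies the same split-buffer guard that the fused gate + up path already uses to the plain mul-mat + bias path. A minimal sketch of the idea (function and variable names are illustrative, not the exact diff; ggml_backend_buft_is_cuda_split is the CUDA backend's split-buffer check):

// Sketch: refuse the mul-mat + bias fusion when the weight is stored in a
// split buffer, because split buffers distribute the rows of the matrix
// across GPUs and the fused kernel assumes the whole row is on one device.
static bool can_fuse_mul_mat_bias(const ggml_tensor * mul_mat, const ggml_tensor * bias_add) {
    const ggml_tensor * weight = mul_mat->src[0];
    if (weight->buffer && ggml_backend_buft_is_cuda_split(ggml_backend_buffer_get_type(weight->buffer))) {
        return false;
    }
    GGML_UNUSED(bias_add);
    return true; // the remaining op/shape checks are omitted here
}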

@am17an am17an requested a review from slaren as a code owner November 2, 2025 11:51
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 2, 2025
@am17an am17an changed the title from "CUDA: avoid mul + bias fusion when doing fusion" to "CUDA: avoid mul + bias fusion when buffers are split" Nov 2, 2025
@IMbackK (Collaborator) left a comment

Yes, unsurprisingly since this is just disabling fusion in this case, this fixes the issue.

@am17an (Collaborator, Author) commented Nov 2, 2025

> Yes, unsurprisingly since this is just disabling fusion in this case, this fixes the issue.

Yes, it's unlikely fusion would help in this case anyway.

@IMbackK (Collaborator) commented Nov 2, 2025

We should probably have some multi-GPU unit tests to catch this sort of thing.

@JohannesGaessler (Collaborator) commented

@am17an should we merge this?

@am17an (Collaborator, Author) commented Nov 3, 2025

I'm wondering whether we should just disable fusion outright if we detect that any buffer is split or --ot is used.

@am17an (Collaborator, Author) commented Nov 3, 2025

At least for --sm row it doesn't make sense to support it; with --ot we should still support it, similar to CUDA graphs. What do you guys think?

@JohannesGaessler (Collaborator) commented

If there are issues with -ot, they could be related to padding not being cleared correctly. To avoid having to do out-of-bounds checks, the matrix multiplication code can read a bit further than ne00, in essence reading a bit of data from the next row. To avoid changing the result, the activation columns are padded with 0 at the end. However, the last row of src0 also needs to be padded, because the data there can randomly encode e.g. NaN, which would cause the final result to become NaN. See:

// If src0 is a temporary compute buffer, clear any potential padding.
if (ggml_backend_buffer_get_usage(src0->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE) {
    const size_t size_data  = ggml_nbytes(src0);
    const size_t size_alloc = ggml_backend_buffer_get_alloc_size(src0->buffer, src0);
    if (size_alloc > size_data) {
        GGML_ASSERT(ggml_is_contiguously_allocated(src0));
        GGML_ASSERT(!src0->view_src);
        CUDA_CHECK(cudaMemsetAsync((char *) src0->data + size_data, 0, size_alloc - size_data, stream));
    }
}

@am17an (Collaborator, Author) commented Nov 3, 2025

I think we already check this with bad_padding_clear? Or is this something else?

@slaren (Member) commented Nov 3, 2025

ggml_backend_sched may replace some tensors in the graph when necessary to make a copy on a different backend, and I wonder if that may be causing the fusion check to fail, because ggml_node_get_use_count may not work properly for these nodes.
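For context, the fusion checks rely on a use-count condition along the lines of the sketch below (paraphrased, not the actual ggml helpers; count_uses here is a hypothetical stand-in). If ggml_backend_sched swaps a source for a copy on another backend, the use count recorded for the original intermediate may no longer describe the graph that is actually executed:

// Fusing node i (mul-mat) with node i+1 (add) is only safe when the
// intermediate result has exactly one consumer, and that consumer is node i+1.
static bool can_fuse_pair(const struct ggml_cgraph * graph, int i) {
    const struct ggml_tensor * a = graph->nodes[i];
    const struct ggml_tensor * b = graph->nodes[i + 1];
    if (count_uses(graph, a) != 1) {          // hypothetical helper
        return false;
    }
    if (b->src[0] != a && b->src[1] != a) {   // the add must actually consume a
        return false;
    }
    return true;
}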

@am17an (Collaborator, Author) commented Nov 3, 2025

To be clear, there have been no crashes reported with --ot; the related issue I'm referring to is #16912, which seems to be a mix of fusion and tensor re-ordering. The illegal memory accesses are all reported with --sm row, which this PR was supposed to fix, but another user reported a further crash which I've not been able to reproduce.

@JohannesGaessler (Collaborator) commented

> I think we already check this with bad_padding_clear? Or is this something else?

What I mean is that the padding is being cleared for src0 but from what I can tell not for gate.

More generally, -sm row has never worked properly for AMD in the first place, and an illegal memory access is very unspecific, so I think there are two separate bugs here. For the longest time I didn't even have the hardware to debug this, and now I'm almost at the point where I intend to replace the current -sm row implementation anyway. So I'm not convinced that getting the current version of -sm row to work on AMD is worth the opportunity cost.
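If the fused path is kept for compute buffers, the gate tensor would presumably need the same treatment as src0 in the snippet above; a hedged sketch, simply mirroring that code with an illustrative gate variable:

// Hypothetical mirror of the src0 padding clear for the gate operand.
if (ggml_backend_buffer_get_usage(gate->buffer) == GGML_BACKEND_BUFFER_USAGE_COMPUTE) {
    const size_t size_data  = ggml_nbytes(gate);
    const size_t size_alloc = ggml_backend_buffer_get_alloc_size(gate->buffer, gate);
    if (size_alloc > size_data) {
        GGML_ASSERT(ggml_is_contiguously_allocated(gate));
        GGML_ASSERT(!gate->view_src);
        CUDA_CHECK(cudaMemsetAsync((char *) gate->data + size_data, 0, size_alloc - size_data, stream));
    }
}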

@IMbackK (Collaborator) commented Nov 3, 2025

> More generally, -sm row has never worked properly for AMD

It works fine here (present bug aside). I think the perception that it doesn't work comes from ROCr having had multiple bugs relating to handling various P2P scenarios.

@am17an (Collaborator, Author) commented Nov 4, 2025

I am merging this as it solves quite a few --sm row issues; I still need to investigate one more crash which I can't reproduce.

@am17an am17an merged commit 2759ccd into ggml-org:master Nov 4, 2025
121 of 130 checks passed
@am17an am17an deleted the cuda-fix-sm-row branch November 4, 2025 03:01

Development

Successfully merging this pull request may close these issues.

Eval bug: ROCm illegal memory access with -sm row

4 participants