
Add CUDA non-contiguous Unary Ops support #14639


Open
YavorGIvanov wants to merge 1 commit into master from feature/cuda-non-cont-unary-support

Conversation

YavorGIvanov (Contributor)

No description provided.

github-actions bot added the documentation, build, Nvidia GPU, and ggml labels on Jul 11, 2025
YavorGIvanov force-pushed the feature/cuda-non-cont-unary-support branch from c44bfde to 919ce38 on July 11, 2025 23:34
am17an requested a review from JohannesGaessler on July 12, 2025 10:08
Comment on lines +131 to +133
if (ggml_is_contiguous(src) && ggml_is_contiguous(dst_tensor)) {
unary_op_kernel<op><<<num_blocks, CUDA_NEG_BLOCK_SIZE, 0, stream>>>(x, dst, k);
} else {
Collaborator

Remove the contiguous path; it's no longer needed.

YavorGIvanov (Contributor, Author)

I kept it because the performance of the simple contiguous kernel is clearly better, and I thought you might prefer to keep the optimal path for that case. I know that in the big scheme of things these unary operations are a very small part of inference time, but I think it is a good idea not to degrade contiguous performance here.

  ABS(type=f32,ne_a=[256,256,3,1],v=0):               532415 runs -     1.88 us/run -     1536 kB/run -  778.95 GB/s
  ABS(type=f32,ne_a=[256,256,3,1],v=1):               311220 runs -     3.24 us/run -     3070 kB/run -  903.14 GB/s

Here is an example performance test using test-backend-ops on an H100 SXM5; v=0 means contiguous and v=1 means non-contiguous.

Let me know whether you still want the contiguous path removed, or whether you agree that I should keep it for now.
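
(For context on where the gap between the two rows above comes from: the generic non-contiguous path has to do extra per-thread index arithmetic and strided memory accesses. Below is a minimal sketch of what such a kernel typically looks like in the ggml CUDA backend; the kernel name and the nb*/dnb* stride parameters are illustrative assumptions, not the PR's actual code.)

```cuda
#include <cstdint>

// Illustrative sketch only (not the PR's code): the usual shape of a
// non-contiguous elementwise kernel in the ggml CUDA backend. Each thread
// decomposes its flat index into 4D coordinates, then applies per-dimension
// byte strides (nb* for src, dnb* for dst). The extra div/mod arithmetic and
// strided loads are what make this path slower than the contiguous one.
template <float (*op)(float)>
static __global__ void unary_op_kernel_noncont(
        const char * x, char * dst, const int64_t k,
        const int64_t ne0, const int64_t ne1, const int64_t ne2,
        const int64_t nb0,  const int64_t nb1,  const int64_t nb2,  const int64_t nb3,
        const int64_t dnb0, const int64_t dnb1, const int64_t dnb2, const int64_t dnb3) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }

    // decompose the flat element index into 4D coordinates ...
    const int64_t i0 =  i % ne0;
    const int64_t i1 = (i / ne0) % ne1;
    const int64_t i2 = (i / (ne0*ne1)) % ne2;
    const int64_t i3 =  i / (ne0*ne1*ne2);

    // ... then resolve the actual addresses via byte strides on both tensors
    const float * src = (const float *)(x   + i0*nb0  + i1*nb1  + i2*nb2  + i3*nb3);
    float       * out = (float       *)(dst + i0*dnb0 + i1*dnb1 + i2*dnb2 + i3*dnb3);
    *out = op(*src);
}
```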

Collaborator

Sorry for the late reply. If you want to keep the contiguous path, add a template parameter to the non-contiguous kernel where you return early.

More generally, if you're concerned about performance, one thing you can try is to replace the byte offsets with logical offsets (calculate these in host code and pass them to the kernel). But I expect the impact on end-to-end performance to be negligible.
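
(A hedged sketch of both suggestions combined; the `contiguous` template flag and the element-stride parameters s0..s3 are assumptions for illustration, not the PR's actual code.)

```cuda
#include <cstdint>

// Sketch of the reviewer's two suggestions (names are illustrative, not the
// PR's code): a single kernel with a compile-time `contiguous` flag that
// returns early through the fast path, plus "logical" element strides
// s0..s3, computed on the host as nb[i]/sizeof(float), so the kernel indexes
// a typed pointer directly instead of doing byte arithmetic.
// dst is assumed contiguous here to keep the sketch short.
template <float (*op)(float), bool contiguous>
static __global__ void unary_op_kernel(
        const float * x, float * dst, const int64_t k,
        const int64_t ne0, const int64_t ne1, const int64_t ne2,
        const int64_t s0, const int64_t s1, const int64_t s2, const int64_t s3) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    if (contiguous) { // compile-time constant: the strided math below is compiled out
        dst[i] = op(x[i]);
        return;
    }
    const int64_t i0 =  i % ne0;
    const int64_t i1 = (i / ne0) % ne1;
    const int64_t i2 = (i / (ne0*ne1)) % ne2;
    const int64_t i3 =  i / (ne0*ne1*ne2);
    dst[i] = op(x[i0*s0 + i1*s1 + i2*s2 + i3*s3]);
}
```

The existing `ggml_is_contiguous` check on the host side would then only choose between the `<op, true>` and `<op, false>` instantiations, so the contiguous fast path survives without a separate kernel.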

github-actions bot added the testing label on Jul 12, 2025
YavorGIvanov force-pushed the feature/cuda-non-cont-unary-support branch from 1174a95 to 1752873 on July 12, 2025 23:43
YavorGIvanov force-pushed the feature/cuda-non-cont-unary-support branch from 1752873 to 64be8c5 on July 12, 2025 23:44
YavorGIvanov (Contributor, Author)
@JohannesGaessler @am17an Tried to address all comments.

CISC (Collaborator) commented Jul 31, 2025

@YavorGIvanov gentle ping
