
Add CUDA non-contiguous Unary Ops support #14639


Open
YavorGIvanov wants to merge 1 commit into master from feature/cuda-non-cont-unary-support

Conversation

YavorGIvanov (Contributor)

No description provided.

github-actions bot added the documentation, build, Nvidia GPU, and ggml labels on Jul 11, 2025
YavorGIvanov force-pushed the feature/cuda-non-cont-unary-support branch from c44bfde to 919ce38 on July 11, 2025 23:34
am17an requested a review from JohannesGaessler on July 12, 2025 10:08
Comment on lines +131 to +133
if (ggml_is_contiguous(src) && ggml_is_contiguous(dst_tensor)) {
unary_op_kernel<op><<<num_blocks, CUDA_NEG_BLOCK_SIZE, 0, stream>>>(x, dst, k);
} else {
Collaborator

Remove the contiguous path; it's no longer needed.

YavorGIvanov (Contributor, Author)

I kept it because the performance of the simple contiguous kernel is clearly better, and I thought you might prefer to keep the optimal path for that case. I know that in the big scheme of things these unary operations are a very small part of inference time, but I think it is a good idea not to degrade contiguous performance here.

  ABS(type=f32,ne_a=[256,256,3,1],v=0):               532415 runs -     1.88 us/run -     1536 kB/run -  778.95 GB/s
  ABS(type=f32,ne_a=[256,256,3,1],v=1):               311220 runs -     3.24 us/run -     3070 kB/run -  903.14 GB/s

Here is an example performance test using test-backend-ops on an H100 SXM5; v=0 means contiguous and v=1 means non-contiguous.

Let me know whether you still want the contiguous path removed, or whether you agree that I should keep it for now.
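
(For context on where the gap between the two rows above comes from: the generic non-contiguous path has to do extra per-thread index arithmetic and strided memory accesses. Below is a minimal sketch of what such a kernel typically looks like in the ggml CUDA backend; the kernel name and the nb*/dnb* stride parameters are illustrative assumptions, not the PR's actual code.)

```cuda
#include <cstdint>

// Illustrative sketch only (not the PR's code): the usual shape of a
// non-contiguous elementwise kernel in the ggml CUDA backend. Each thread
// decomposes its flat index into 4D coordinates, then applies per-dimension
// byte strides (nb* for src, dnb* for dst). The extra div/mod arithmetic and
// strided loads are what make this path slower than the contiguous one.
template <float (*op)(float)>
static __global__ void unary_op_kernel_noncont(
        const char * x, char * dst, const int64_t k,
        const int64_t ne0, const int64_t ne1, const int64_t ne2,
        const int64_t nb0,  const int64_t nb1,  const int64_t nb2,  const int64_t nb3,
        const int64_t dnb0, const int64_t dnb1, const int64_t dnb2, const int64_t dnb3) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }

    // decompose the flat element index into 4D coordinates ...
    const int64_t i0 =  i % ne0;
    const int64_t i1 = (i / ne0) % ne1;
    const int64_t i2 = (i / (ne0*ne1)) % ne2;
    const int64_t i3 =  i / (ne0*ne1*ne2);

    // ... then resolve the actual addresses via byte strides on both tensors
    const float * src = (const float *)(x   + i0*nb0  + i1*nb1  + i2*nb2  + i3*nb3);
    float       * out = (float       *)(dst + i0*dnb0 + i1*dnb1 + i2*dnb2 + i3*dnb3);
    *out = op(*src);
}
```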

Collaborator

Sorry for the late reply. If you want to keep the contiguous path, add a template parameter to the non-contiguous kernel where you return early.

More generally, if you're concerned about performance, one thing you can try is to replace the byte offsets with logical offsets (calculate these in host code and pass them to the kernel). But I expect the impact on end-to-end performance to be negligible.
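
(A hedged sketch of both suggestions combined; the `contiguous` template flag and the element-stride parameters s0..s3 are assumptions for illustration, not the PR's actual code.)

```cuda
#include <cstdint>

// Sketch of the reviewer's two suggestions (names are illustrative, not the
// PR's code): a single kernel with a compile-time `contiguous` flag that
// returns early through the fast path, plus "logical" element strides
// s0..s3, computed on the host as nb[i]/sizeof(float), so the kernel indexes
// a typed pointer directly instead of doing byte arithmetic.
// dst is assumed contiguous here to keep the sketch short.
template <float (*op)(float), bool contiguous>
static __global__ void unary_op_kernel(
        const float * x, float * dst, const int64_t k,
        const int64_t ne0, const int64_t ne1, const int64_t ne2,
        const int64_t s0, const int64_t s1, const int64_t s2, const int64_t s3) {
    const int64_t i = (int64_t) blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    if (contiguous) { // compile-time constant: the strided math below is compiled out
        dst[i] = op(x[i]);
        return;
    }
    const int64_t i0 =  i % ne0;
    const int64_t i1 = (i / ne0) % ne1;
    const int64_t i2 = (i / (ne0*ne1)) % ne2;
    const int64_t i3 =  i / (ne0*ne1*ne2);
    dst[i] = op(x[i0*s0 + i1*s1 + i2*s2 + i3*s3]);
}
```

The existing `ggml_is_contiguous` check on the host side would then only choose between the `<op, true>` and `<op, false>` instantiations, so the contiguous fast path survives without a separate kernel.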

github-actions bot added the testing label on Jul 12, 2025
YavorGIvanov force-pushed the feature/cuda-non-cont-unary-support branch from 1174a95 to 1752873 on July 12, 2025 23:43
YavorGIvanov force-pushed the feature/cuda-non-cont-unary-support branch from 1752873 to 64be8c5 on July 12, 2025 23:44
YavorGIvanov (Contributor, Author)
@JohannesGaessler @am17an Tried to address all comments.

CISC (Collaborator) commented Jul 31, 2025

@YavorGIvanov gentle ping
