
Conversation

@bssrdf (Contributor) commented Nov 2, 2025

This PR adds an implicit conv3d op to the CUDA backend, complementing the IM2COL_3D+GEMM path currently used in SD.cpp for video models. It largely follows conv2d_implicit.

With tensor cores, this PR outperforms IM2COL_3D+GEMM, and the memory savings are significant, as expected.
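For intuition: the implicit path computes the convolution as a GEMM C[OC × N·OD·OH·OW] = A[OC × K] · B[K × N·OD·OH·OW] with K = IC·KD·KH·KW, but never materializes the im2col matrix B; each element of B is decoded from the input tensor on the fly. A minimal scalar sketch of that index mapping (the layout, the packing order, and the single stride/padding/dilation value per axis are simplifying assumptions for illustration, not the kernel's actual code):

```cpp
// Scalar reference of the implicit-GEMM view of conv3d. Layout and names
// are illustrative assumptions, not the PR's CUDA kernel.
// B(k, col) of the virtual im2col matrix, decoded on the fly:
float implicit_b(const float * input,            // assumed contiguous [N, IC, ID, IH, IW]
                 int k, int col,
                 int IC, int ID, int IH, int IW, // input dims
                 int OD, int OH, int OW,         // output dims
                 int KD, int KH, int KW,         // kernel dims
                 int s, int p, int d) {          // stride/padding/dilation (one value for all axes, for brevity)
    // decode the GEMM row index k into (ic, kd, kh, kw) (assumed packing order)
    const int kw =  k % KW;
    const int kh = (k / KW) % KH;
    const int kd = (k / (KW*KH)) % KD;
    const int ic =  k / (KW*KH*KD);
    // decode the GEMM column index into (n, od, oh, ow)
    const int ow =  col % OW;
    const int oh = (col / OW) % OH;
    const int od = (col / (OW*OH)) % OD;
    const int n  =  col / (OW*OH*OD);
    // map back to input coordinates; out-of-bounds reads are the zero padding
    const int iw = ow*s + kw*d - p;
    const int ih = oh*s + kh*d - p;
    const int id = od*s + kd*d - p;
    if (iw < 0 || iw >= IW || ih < 0 || ih >= IH || id < 0 || id >= ID) {
        return 0.0f;
    }
    return input[(((n*IC + ic)*ID + id)*IH + ih)*IW + iw];
}
```

Because only the output tensor is allocated, the implicit path's VRAM in the tables below is essentially the output itself (e.g. 1280·26·38·8 F32 values ≈ 38.59 MiB), while the im2col path additionally materializes the K × N·OD·OH·OW buffer.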

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col_3D+GEMM time | im2col_3D+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM | PyTorch/cuDNN time |
| --- | --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 1.81 ms | 168.85 MB | 1.53 ms | 38.59 MB | 1.42 ms |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 6.71 ms | 559.61 MB | 6.01 ms | 38.59 MB | 5.68 ms |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 7.39 ms | 675.39 MB | 3.86 ms | 154.38 MB | 5.64 ms |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 26.40 ms | 2238.44 MB | 15.25 ms | 154.38 MB | 21.62 ms |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 29.36 ms | 2701.56 MB | 15.42 ms | 617.50 MB | 22.02 ms |
| (1280, 1280, 104, 152, 8, 3, 3, 3) | 105.70 ms | 8953.75 MB | 60.83 ms | 617.50 MB | 82.19 ms |
| (320, 1280, 208, 304, 4, 3, 3, 3) | 59.02 ms | 5403.12 MB | 30.06 ms | 1235.00 MB | 43.78 ms |
| (640, 1280, 208, 304, 4, 3, 3, 3) | 109.66 ms | 9571.25 MB | 60.04 ms | 1235.00 MB | 83.66 ms |

@bssrdf marked this pull request as draft November 2, 2025 17:38
@Green-Sky (Collaborator) commented:

Quick n dirty test perf run:

$ result/bin/test-conv3d
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
| --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 10.30 ms | 168.85 MB | 34.95 ms | 38.59 MB |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 37.30 ms | 559.61 MB | 141.30 ms | 38.59 MB |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 37.85 ms | 675.39 MB | 137.66 ms | 154.38 MB |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 156.93 ms | 2238.44 MB | 558.15 ms | 154.38 MB |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 157.53 ms | 2701.56 MB | 569.28 ms | 617.50 MB |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
Segmentation fault (core dumped)

@github-actions bot added the testing, Nvidia GPU, and ggml labels Nov 2, 2025
@bssrdf (Contributor, Author) commented Nov 2, 2025

> Quick n dirty test perf run:
>
> $ result/bin/test-conv3d
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
>   Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
>
> | (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
> | --- | --- | --- | --- | --- |
> | (320, 1280, 26, 38, 8, 3, 3, 3) | 10.30 ms | 168.85 MB | 34.95 ms | 38.59 MB |
> | (1280, 1280, 26, 38, 8, 3, 3, 3) | 37.30 ms | 559.61 MB | 141.30 ms | 38.59 MB |
> | (320, 1280, 52, 76, 8, 3, 3, 3) | 37.85 ms | 675.39 MB | 137.66 ms | 154.38 MB |
> | (1280, 1280, 52, 76, 8, 3, 3, 3) | 156.93 ms | 2238.44 MB | 558.15 ms | 154.38 MB |
> | (320, 1280, 104, 152, 8, 3, 3, 3) | 157.53 ms | 2701.56 MB | 569.28 ms | 617.50 MB |
>
> ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
> ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
> Segmentation fault (core dumped)

Need to get tensor cores working :)

@Green-Sky (Collaborator) commented:

With that last commit I am catching an illegal memory access.

[DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 715.16 MB(VRAM)
[ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: IM2COL failed
[ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
[ERROR] ggml_extend.hpp:75   -   err
/build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:89: CUDA error

@bssrdf (Contributor, Author) commented Nov 2, 2025

> With that last commit I am catching an illegal memory access.
>
> [DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 715.16 MB(VRAM)
> [ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: IM2COL failed
> [ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
> [ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
> [ERROR] ggml_extend.hpp:75   -   err
> /build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:89: CUDA error

Please give it another try. The tensor core path should now be on and working.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
| --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 1.88 ms | 168.85 MB | 1.54 ms | 38.59 MB |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 6.76 ms | 559.61 MB | 6.18 ms | 38.59 MB |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 7.58 ms | 675.39 MB | 3.99 ms | 154.38 MB |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 26.99 ms | 2238.44 MB | 15.51 ms | 154.38 MB |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 30.06 ms | 2701.56 MB | 15.86 ms | 617.50 MB |
| (1280, 1280, 104, 152, 8, 3, 3, 3) | 108.33 ms | 8953.75 MB | 62.23 ms | 617.50 MB |
| (320, 1280, 208, 304, 4, 3, 3, 3) | 60.33 ms | 5403.12 MB | 30.79 ms | 1235.00 MB |
| (640, 1280, 208, 304, 4, 3, 3, 3) | 112.34 ms | 9571.25 MB | 61.49 ms | 1235.00 MB |

@Green-Sky (Collaborator) commented:

Looks good!

$ result/bin/test-conv3d
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
| --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 11.35 ms | 168.85 MB | 6.15 ms | 38.59 MB |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 39.40 ms | 559.61 MB | 24.59 ms | 38.59 MB |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 40.52 ms | 675.39 MB | 23.23 ms | 154.38 MB |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 166.10 ms | 2238.44 MB | 91.93 ms | 154.38 MB |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 164.08 ms | 2701.56 MB | 91.41 ms | 617.50 MB |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
Segmentation fault (core dumped)

@jeffbolznv (Collaborator) commented:

Please add the perf tests to make_test_cases_perf in test-backend-ops.cpp.
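In test-backend-ops.cpp that would look roughly like the sketch below, mirroring the shapes benchmarked above. This is a hedged sketch only: the test_conv_3d case class and its parameter order are assumptions here and must be matched against the real code before use.

```cpp
// Hedged sketch: test_conv_3d and its constructor arguments are assumed,
// not copied from the repository; align them with the actual case class.
static void add_conv3d_perf_cases(std::vector<std::unique_ptr<test_case>> & test_cases) {
    struct shape { int64_t IC, OC, IW, IH, ID; };
    for (const shape & s : {
            shape{ 320, 1280,  26,  38, 8},
            shape{1280, 1280,  26,  38, 8},
            shape{ 320, 1280,  52,  76, 8},
            shape{1280, 1280,  52,  76, 8},
            shape{ 320, 1280, 104, 152, 8} }) {
        // 3x3x3 kernel, stride 1, padding 1, dilation 1 (assumed parameter order)
        test_cases.emplace_back(new test_conv_3d(
            s.IC, s.OC, s.IW, s.IH, s.ID,
            /*KW*/ 3, /*KH*/ 3, /*KD*/ 3,
            /*stride*/ 1, /*padding*/ 1, /*dilation*/ 1,
            GGML_TYPE_F16));
    }
}
```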

@bssrdf (Contributor, Author) commented Nov 2, 2025

> Looks good!
>
> $ result/bin/test-conv3d
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
>   Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
>
> | (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
> | --- | --- | --- | --- | --- |
> | (320, 1280, 26, 38, 8, 3, 3, 3) | 11.35 ms | 168.85 MB | 6.15 ms | 38.59 MB |
> | (1280, 1280, 26, 38, 8, 3, 3, 3) | 39.40 ms | 559.61 MB | 24.59 ms | 38.59 MB |
> | (320, 1280, 52, 76, 8, 3, 3, 3) | 40.52 ms | 675.39 MB | 23.23 ms | 154.38 MB |
> | (1280, 1280, 52, 76, 8, 3, 3, 3) | 166.10 ms | 2238.44 MB | 91.93 ms | 154.38 MB |
> | (320, 1280, 104, 152, 8, 3, 3, 3) | 164.08 ms | 2701.56 MB | 91.41 ms | 617.50 MB |
>
> ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
> ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
> Segmentation fault (core dumped)

> Please add the perf tests to make_test_cases_perf in test-backend-ops.cpp.

Will do. Right now, test-backend-ops -o CONV_3D fails to pass.

@Green-Sky, does the WAN video generation work ok?

@Green-Sky (Collaborator) commented:

> @Green-Sky, does the WAN video generation work ok?

As far as I can tell, yes. I have only tested Wan2.2-TI2V-5B so far (the Wan2.2 VAE).

Single "frame" (aka image gen using wan):
output

I don't have the im2col+GEMM image ready, but it looked identical. Same for 5 frames.

I was unable to run higher frame counts due to my 8 GiB VRAM limit. It should technically just fit if I restart and don't open extra programs. Not gonna do that today.

@Green-Sky (Collaborator) commented:

> @Green-Sky, does the WAN video generation work ok?

I tried a smaller resolution and I did get another illegal memory access:

[DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 5393.97 MB(VRAM)
[ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/jj5n0s93mbksd9wmv6vrs8dxmhcvhnik-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
[ERROR] ggml_extend.hpp:75   -   err

@bssrdf (Contributor, Author) commented Nov 3, 2025

> > @Green-Sky, does the WAN video generation work ok?
>
> I tried a smaller resolution and I did get another illegal memory access:
>
> [DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 5393.97 MB(VRAM)
> [ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: CONCAT failed
> [ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
> [ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/jj5n0s93mbksd9wmv6vrs8dxmhcvhnik-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
> [ERROR] ggml_extend.hpp:75   -   err

@Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.

@bssrdf marked this pull request as ready for review November 3, 2025 14:00
@Green-Sky (Collaborator) commented:

> @Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.

Same issue on the last commit and on latest ggml.

Keep in mind that sd.cpp uses ggml, not llama.cpp, so I have to adjust the patch manually and apply it to ggml. :)

@bssrdf (Contributor, Author) commented Nov 3, 2025

> > @Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.
>
> Same issue on the last commit and on latest ggml.
>
> Keep in mind that sd.cpp uses ggml, not llama.cpp, so I have to adjust the patch manually and apply it to ggml. :)

So it is a ggml issue?

@Green-Sky (Collaborator) commented:

> > > @Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.
> >
> > Same issue on the last commit and on latest ggml.
> >
> > Keep in mind that sd.cpp uses ggml, not llama.cpp, so I have to adjust the patch manually and apply it to ggml. :)
>
> So it is a ggml issue?

I can't say that definitively; it might be sd.cpp too. But AFAIK the CONCAT is operating on the outputs of conv3d operations. Maybe we are missing a sync somewhere.

@Green-Sky (Collaborator) commented Nov 3, 2025

@leejet I suspect sd.cpp is doing something wrong here, because if width and height are multiples of 32, decoding works. But if they are only multiples of 16, I get that illegal memory access error.

But it is still possible that it is an error in this PR.
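To illustrate what such an alignment bug typically looks like (a hypothetical CUDA sketch of the suspected failure mode, not code from this PR or sd.cpp): a kernel tiled in units of 32 that skips the tail bounds check runs fine when the extent is a multiple of 32 and reads out of bounds when it is only a multiple of 16.

```cpp
// Hypothetical illustration only, not code from this PR or sd.cpp:
// a grid tiled in units of 32 with no tail check.
__global__ void copy_tiled(const float * src, float * dst, int n) {
    // launched with ceil(n/32) blocks of 32 threads
    const int i = blockIdx.x * 32 + threadIdx.x;
    // BUG: missing `if (i >= n) return;` — harmless when n % 32 == 0,
    // an illegal memory access when n is only a multiple of 16.
    dst[i] = src[i];
}
```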

@Green-Sky (Collaborator) commented:

$ result/bin/sd -M vid_gen --diffusion-model models/wan/Wan2.2-TI2V-5B-Q8_0.gguf --vae models/wan/Wan2.2_VAE.safetensors --t5xxl models/wan/umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat strolling down a wooden plank" --cfg-scale 6 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 576 -H 352 --diffusion-fa --offload-to-cpu --video-frames 33 --flow-shift 3.0 --steps 48
[video: wan2.2-TI2V-5B-Q8_0-cat-strolling_funkysize.mp4]
