
Conversation

@bssrdf (Contributor) commented Nov 2, 2025

This PR adds an implicit conv3d op to the CUDA backend, complementing the IM2COL_3D+GEMM path currently used in SD.cpp for video models. It largely follows conv2d_implicit.

With tensor cores, this PR outperforms IM2COL_3D+GEMM, and the memory savings are significant, as expected.
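For intuition: the implicit path computes the convolution as a GEMM C[OC × N·OD·OH·OW] = A[OC × K] · B[K × N·OD·OH·OW] with K = IC·KD·KH·KW, but never materializes the im2col matrix B; each element of B is decoded from the input tensor on the fly. A minimal scalar sketch of that index mapping (the layout, the packing order, and the single stride/padding/dilation value per axis are simplifying assumptions for illustration, not the kernel's actual code):

```cpp
// Scalar reference of the implicit-GEMM view of conv3d. Layout and names
// are illustrative assumptions, not the PR's CUDA kernel.
// B(k, col) of the virtual im2col matrix, decoded on the fly:
float implicit_b(const float * input,            // assumed contiguous [N, IC, ID, IH, IW]
                 int k, int col,
                 int IC, int ID, int IH, int IW, // input dims
                 int OD, int OH, int OW,         // output dims
                 int KD, int KH, int KW,         // kernel dims
                 int s, int p, int d) {          // stride/padding/dilation (one value for all axes, for brevity)
    // decode the GEMM row index k into (ic, kd, kh, kw) (assumed packing order)
    const int kw =  k % KW;
    const int kh = (k / KW) % KH;
    const int kd = (k / (KW*KH)) % KD;
    const int ic =  k / (KW*KH*KD);
    // decode the GEMM column index into (n, od, oh, ow)
    const int ow =  col % OW;
    const int oh = (col / OW) % OH;
    const int od = (col / (OW*OH)) % OD;
    const int n  =  col / (OW*OH*OD);
    // map back to input coordinates; out-of-bounds reads are the zero padding
    const int iw = ow*s + kw*d - p;
    const int ih = oh*s + kh*d - p;
    const int id = od*s + kd*d - p;
    if (iw < 0 || iw >= IW || ih < 0 || ih >= IH || id < 0 || id >= ID) {
        return 0.0f;
    }
    return input[(((n*IC + ic)*ID + id)*IH + ih)*IW + iw];
}
```

Because only the output tensor is allocated, the implicit path's VRAM in the tables below is essentially the output itself (e.g. 1280·26·38·8 F32 values ≈ 38.59 MiB), while the im2col path additionally materializes the K × N·OD·OH·OW buffer.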

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col_3D+GEMM time | im2col_3D+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM | PyTorch/cuDNN time |
| --- | --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 1.81 ms | 168.85 MB | 1.53 ms | 38.59 MB | 1.42 ms |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 6.71 ms | 559.61 MB | 6.01 ms | 38.59 MB | 5.68 ms |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 7.39 ms | 675.39 MB | 3.86 ms | 154.38 MB | 5.64 ms |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 26.40 ms | 2238.44 MB | 15.25 ms | 154.38 MB | 21.62 ms |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 29.36 ms | 2701.56 MB | 15.42 ms | 617.50 MB | 22.02 ms |
| (1280, 1280, 104, 152, 8, 3, 3, 3) | 105.70 ms | 8953.75 MB | 60.83 ms | 617.50 MB | 82.19 ms |
| (320, 1280, 208, 304, 4, 3, 3, 3) | 59.02 ms | 5403.12 MB | 30.06 ms | 1235.00 MB | 43.78 ms |
| (640, 1280, 208, 304, 4, 3, 3, 3) | 109.66 ms | 9571.25 MB | 60.04 ms | 1235.00 MB | 83.66 ms |

@bssrdf marked this pull request as draft November 2, 2025 17:38
@Green-Sky (Collaborator) commented:

Quick n dirty test perf run:

$ result/bin/test-conv3d
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
| --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 10.30 ms | 168.85 MB | 34.95 ms | 38.59 MB |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 37.30 ms | 559.61 MB | 141.30 ms | 38.59 MB |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 37.85 ms | 675.39 MB | 137.66 ms | 154.38 MB |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 156.93 ms | 2238.44 MB | 558.15 ms | 154.38 MB |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 157.53 ms | 2701.56 MB | 569.28 ms | 617.50 MB |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
Segmentation fault (core dumped)

@github-actions bot added the testing, Nvidia GPU, and ggml labels Nov 2, 2025
@bssrdf (Contributor, Author) commented Nov 2, 2025

> Quick n dirty test perf run:
>
> $ result/bin/test-conv3d
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
>   Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
>
> | (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
> | --- | --- | --- | --- | --- |
> | (320, 1280, 26, 38, 8, 3, 3, 3) | 10.30 ms | 168.85 MB | 34.95 ms | 38.59 MB |
> | (1280, 1280, 26, 38, 8, 3, 3, 3) | 37.30 ms | 559.61 MB | 141.30 ms | 38.59 MB |
> | (320, 1280, 52, 76, 8, 3, 3, 3) | 37.85 ms | 675.39 MB | 137.66 ms | 154.38 MB |
> | (1280, 1280, 52, 76, 8, 3, 3, 3) | 156.93 ms | 2238.44 MB | 558.15 ms | 154.38 MB |
> | (320, 1280, 104, 152, 8, 3, 3, 3) | 157.53 ms | 2701.56 MB | 569.28 ms | 617.50 MB |
>
> ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
> ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
> Segmentation fault (core dumped)

Need to get tensor cores working :)

@Green-Sky (Collaborator) commented:

With that last commit I am catching an illegal memory access.

[DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 715.16 MB(VRAM)
[ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: IM2COL failed
[ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
[ERROR] ggml_extend.hpp:75   -   err
/build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:89: CUDA error

@bssrdf (Contributor, Author) commented Nov 2, 2025

> With that last commit I am catching an illegal memory access.
>
> [DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 715.16 MB(VRAM)
> [ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: IM2COL failed
> [ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
> [ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
> [ERROR] ggml_extend.hpp:75   -   err
> /build/fr4p6mskfb7pv7d6n87kfpipj5dr1bql-source/ggml/src/ggml-cuda/ggml-cuda.cu:89: CUDA error

Please give it another try. The tensor core path should now be on and working.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
| --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 1.88 ms | 168.85 MB | 1.54 ms | 38.59 MB |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 6.76 ms | 559.61 MB | 6.18 ms | 38.59 MB |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 7.58 ms | 675.39 MB | 3.99 ms | 154.38 MB |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 26.99 ms | 2238.44 MB | 15.51 ms | 154.38 MB |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 30.06 ms | 2701.56 MB | 15.86 ms | 617.50 MB |
| (1280, 1280, 104, 152, 8, 3, 3, 3) | 108.33 ms | 8953.75 MB | 62.23 ms | 617.50 MB |
| (320, 1280, 208, 304, 4, 3, 3, 3) | 60.33 ms | 5403.12 MB | 30.79 ms | 1235.00 MB |
| (640, 1280, 208, 304, 4, 3, 3, 3) | 112.34 ms | 9571.25 MB | 61.49 ms | 1235.00 MB |

@Green-Sky (Collaborator) commented:

Looks good!

$ result/bin/test-conv3d
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
| (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
| --- | --- | --- | --- | --- |
| (320, 1280, 26, 38, 8, 3, 3, 3) | 11.35 ms | 168.85 MB | 6.15 ms | 38.59 MB |
| (1280, 1280, 26, 38, 8, 3, 3, 3) | 39.40 ms | 559.61 MB | 24.59 ms | 38.59 MB |
| (320, 1280, 52, 76, 8, 3, 3, 3) | 40.52 ms | 675.39 MB | 23.23 ms | 154.38 MB |
| (1280, 1280, 52, 76, 8, 3, 3, 3) | 166.10 ms | 2238.44 MB | 91.93 ms | 154.38 MB |
| (320, 1280, 104, 152, 8, 3, 3, 3) | 164.08 ms | 2701.56 MB | 91.41 ms | 617.50 MB |
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
Segmentation fault (core dumped)

@jeffbolznv (Collaborator) commented:

Please add the perf tests to make_test_cases_perf in test-backend-ops.cpp.
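In test-backend-ops.cpp that would look roughly like the sketch below, mirroring the shapes benchmarked above. This is a hedged sketch only: the test_conv_3d case class and its parameter order are assumptions here and must be matched against the real code before use.

```cpp
// Hedged sketch: test_conv_3d and its constructor arguments are assumed,
// not copied from the repository; align them with the actual case class.
static void add_conv3d_perf_cases(std::vector<std::unique_ptr<test_case>> & test_cases) {
    struct shape { int64_t IC, OC, IW, IH, ID; };
    for (const shape & s : {
            shape{ 320, 1280,  26,  38, 8},
            shape{1280, 1280,  26,  38, 8},
            shape{ 320, 1280,  52,  76, 8},
            shape{1280, 1280,  52,  76, 8},
            shape{ 320, 1280, 104, 152, 8} }) {
        // 3x3x3 kernel, stride 1, padding 1, dilation 1 (assumed parameter order)
        test_cases.emplace_back(new test_conv_3d(
            s.IC, s.OC, s.IW, s.IH, s.ID,
            /*KW*/ 3, /*KH*/ 3, /*KD*/ 3,
            /*stride*/ 1, /*padding*/ 1, /*dilation*/ 1,
            GGML_TYPE_F16));
    }
}
```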

@bssrdf (Contributor, Author) commented Nov 2, 2025

> Looks good!
>
> $ result/bin/test-conv3d
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
>   Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
>
> | (IC, OC, IW, IH, ID, KW, KH, KD) | im2col+GEMM time | im2col+GEMM VRAM | implicit GEMM time | implicit GEMM VRAM |
> | --- | --- | --- | --- | --- |
> | (320, 1280, 26, 38, 8, 3, 3, 3) | 11.35 ms | 168.85 MB | 6.15 ms | 38.59 MB |
> | (1280, 1280, 26, 38, 8, 3, 3, 3) | 39.40 ms | 559.61 MB | 24.59 ms | 38.59 MB |
> | (320, 1280, 52, 76, 8, 3, 3, 3) | 40.52 ms | 675.39 MB | 23.23 ms | 154.38 MB |
> | (1280, 1280, 52, 76, 8, 3, 3, 3) | 166.10 ms | 2238.44 MB | 91.93 ms | 154.38 MB |
> | (320, 1280, 104, 152, 8, 3, 3, 3) | 164.08 ms | 2701.56 MB | 91.41 ms | 617.50 MB |
>
> ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8953.75 MiB on device 0: cudaMalloc failed: out of memory
> ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9388687360
> Segmentation fault (core dumped)

> Please add the perf tests to make_test_cases_perf in test-backend-ops.cpp.

Will do. Right now, test-backend-ops -o CONV_3D fails to pass.

@Green-Sky, does the WAN video generation work ok?

@Green-Sky (Collaborator) commented:

> @Green-Sky, does the WAN video generation work ok?

As far as I can tell, yes. I have only tested Wan2.2-TI2V-5B so far (the Wan2.2 VAE).

Single "frame" (aka image gen using wan):
output

I don't have the im2col+GEMM image ready, but it looked identical. Same for 5 frames.

I was unable to run higher frame counts due to my 8 GiB VRAM limit. It should technically just fit if I restart and don't open extra programs. Not gonna do that today.

@Green-Sky (Collaborator) commented:

> @Green-Sky, does the WAN video generation work ok?

I tried a smaller resolution and I did get another illegal memory access:

[DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 5393.97 MB(VRAM)
[ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/jj5n0s93mbksd9wmv6vrs8dxmhcvhnik-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
[ERROR] ggml_extend.hpp:75   -   err

@bssrdf (Contributor, Author) commented Nov 3, 2025

> > @Green-Sky, does the WAN video generation work ok?
>
> I tried a smaller resolution and I did get another illegal memory access:
>
> [DEBUG] ggml_extend.hpp:1588 - wan_vae compute buffer size: 5393.97 MB(VRAM)
> [ERROR] ggml_extend.hpp:75   - ggml_cuda_compute_forward: CONCAT failed
> [ERROR] ggml_extend.hpp:75   - CUDA error: an illegal memory access was encountered
> [ERROR] ggml_extend.hpp:75   -   current device: 0, in function ggml_cuda_compute_forward at /build/jj5n0s93mbksd9wmv6vrs8dxmhcvhnik-source/ggml/src/ggml-cuda/ggml-cuda.cu:2544
> [ERROR] ggml_extend.hpp:75   -   err

@Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.

@bssrdf marked this pull request as ready for review November 3, 2025 14:00
@Green-Sky (Collaborator) commented:

> @Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.

Same issue on the last commit and on latest ggml.

Keep in mind that sd.cpp uses ggml, not llama.cpp, so I have to adjust the patch manually and apply it to ggml. :)

@bssrdf (Contributor, Author) commented Nov 3, 2025

> > @Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.
>
> Same issue on the last commit and on latest ggml.
>
> Keep in mind that sd.cpp uses ggml, not llama.cpp, so I have to adjust the patch manually and apply it to ggml. :)

So it is a ggml issue?

@Green-Sky (Collaborator) commented:

> > > @Green-Sky, can you retry my latest commit? I see your test failed at the CONCAT op, which is odd. Could you also run this case on the current main branch? Thanks.
> >
> > Same issue on the last commit and on latest ggml.
> >
> > Keep in mind that sd.cpp uses ggml, not llama.cpp, so I have to adjust the patch manually and apply it to ggml. :)
>
> So it is a ggml issue?

I can't say that definitively; it might be sd.cpp too. But AFAIK the CONCAT is operating on the outputs of conv3d operations. Maybe we are missing a sync somewhere.

@Green-Sky (Collaborator) commented Nov 3, 2025

@leejet I suspect sd.cpp is doing something wrong here, because if width and height are multiples of 32, decoding works. But if they are only multiples of 16, I get that illegal memory access error.

But it is still possible that it is an error in this PR.
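To illustrate what such an alignment bug typically looks like (a hypothetical CUDA sketch of the suspected failure mode, not code from this PR or sd.cpp): a kernel tiled in units of 32 that skips the tail bounds check runs fine when the extent is a multiple of 32 and reads out of bounds when it is only a multiple of 16.

```cpp
// Hypothetical illustration only, not code from this PR or sd.cpp:
// a grid tiled in units of 32 with no tail check.
__global__ void copy_tiled(const float * src, float * dst, int n) {
    // launched with ceil(n/32) blocks of 32 threads
    const int i = blockIdx.x * 32 + threadIdx.x;
    // BUG: missing `if (i >= n) return;` — harmless when n % 32 == 0,
    // an illegal memory access when n is only a multiple of 16.
    dst[i] = src[i];
}
```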

@Green-Sky (Collaborator) commented:

$ result/bin/sd -M vid_gen --diffusion-model models/wan/Wan2.2-TI2V-5B-Q8_0.gguf --vae models/wan/Wan2.2_VAE.safetensors --t5xxl models/wan/umt5-xxl-encoder-Q8_0.gguf -p "a lovely cat strolling down a wooden plank" --cfg-scale 6 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 576 -H 352 --diffusion-fa --offload-to-cpu --video-frames 33 --flow-shift 3.0 --steps 48
[video: wan2.2-TI2V-5B-Q8_0-cat-strolling_funkysize.mp4]
