-
Notifications
You must be signed in to change notification settings - Fork 13.6k
CUDA: add implicit conv3d #16948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
CUDA: add implicit conv3d #16948
Conversation
… 3x slower than im2col
|
Quick n dirty test perf run:
|
Need to get tensors working:) |
|
With that last commit I am catching an illegal memory access. |
Give it another try. Tensor core path should be on and working.
|
|
Looks good!
|
|
Please add the perf tests to make_test_cases_perf in test-backend-ops.cpp. |
Will do. Right now, @Green-Sky , does the WAN video generation work ok? |
As far as I can tell yes. I have only tested Wan2.2-TI2V-5B yet (wan2.2 vae). Single "frame" (aka image gen using wan): I don't have the im2col gemm image ready, but it looked identical. Same for 5 frames. I was unable to run higher frame counts due to my 8gig vram limit. It should technically just fit if I restart and don't open extra programs. Not gonna do that today. |
I tried a smaller resolution and I did get another illegal memory access: |
@Green-Sky, can you retry my latest commit? I see your test failed at CONCAT op. That's weird. Could you also run this case with current main branch? Thanks. |
Same issue on last commit and on latest ggml. Keep in mind that sd.cpp uses ggml and not llama.cpp, so I have to change the patch manually and apply it onto ggml. :) |
So it is a ggml issue? |
I can not definitively say that, might be sd.cpp too. But afaik the concat is operating on the outputs of conv3d operations. Maybe we are missing a sync somewhere. |
|
@leejet I suspect sd.cpp is doing something wrong here, because if width and height are multiples of 32, decoding works. But if they are just multiples of 16 I get that illegal memory access error. But it is still possible that it is an error with this pr. |
wan2.2-TI2V-5B-Q8_0-cat-strolling_funkysize.mp4 |

This PR adds an implicit conv3d op in CUDA backend, as a complement to IM2COL_3D+GEMM kernel currently used in SD.cpp for video models. It pretty much follows conv2d_implicit.
Using tensor cores, this PR outperforms IM2COL_3D+GEMM. The memory saving is significant, as expected.