
Conversation

jeffbolznv
Collaborator

Add variants of the im2col shaders that use buffer_device_address/buffer_reference and do their address calculations in 64 bits. This is needed for the large convolutions used in stable-diffusion.cpp.

I've been working on getting leejet/stable-diffusion.cpp#778 to work in Vulkan. The main missing piece is support for 2D and 3D convolutions whose intermediate im2col buffers are larger than 4GB. This change fixes the im2col part; I'll make a separate change for the matmul part.

Memory allocations larger than maxMemoryAllocationSize are not technically forbidden, and at least NVIDIA's Windows driver will allocate more than 4GB.
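To make the 4GB overflow concrete, here is a back-of-envelope calculation with hypothetical convolution dimensions (the shapes used in the PR's actual tests may differ). im2col materializes one row of IC*KH*KW elements per output pixel, so the intermediate buffer quickly grows past what 32-bit address arithmetic can index:

```python
# Hypothetical sizes loosely in the range of a high-resolution
# stable-diffusion convolution; not the PR's exact test shapes.
N, IC, KH, KW = 1, 1280, 3, 3
OH = OW = 1024

# im2col materializes IC*KH*KW elements per output pixel.
elements = N * OH * OW * IC * KH * KW
bytes_f16 = elements * 2  # stored as f16

print(bytes_f16)              # 24159191040 bytes, ~22.5 GiB
print(bytes_f16 > (1 << 32))  # True: byte offsets no longer fit in 32 bits
```

This is why the new shader variants need 64-bit address calculations rather than the usual 32-bit descriptor offsets.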

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner September 20, 2025 19:30
@github-actions github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 20, 2025
@0cc4m
Collaborator

0cc4m commented Sep 21, 2025

Memory allocations larger than maxMemoryAllocationSize are not technically forbidden, and at least NVIDIA's Windows driver will allocate more than 4GB.

This is true for allocations, but not for buffers. If I disable the allocation size check in ggml_vk_create_buffer your new test kinda runs on all my devices, but I'm not sure if it runs correctly. Validation layers complain about the buffer size and the descriptor range, of course.
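For context, the two limits being distinguished here are separate Vulkan properties: maxMemoryAllocationSize (VkPhysicalDeviceMaintenance3Properties) is a soft cap on a single VkDeviceMemory allocation, and larger allocations may still succeed, while maxBufferSize (VkPhysicalDeviceMaintenance4Properties) bounds a single VkBuffer. A minimal sketch of the kind of size check involved, using hypothetical 4 GiB limits and a simplified stand-in (not the actual ggml_vk_create_buffer logic):

```python
GIB = 1024**3

# Hypothetical device limits; many drivers report ~4 GiB for both.
MAX_MEMORY_ALLOCATION_SIZE = 4 * GIB  # maintenance3: soft limit, larger may succeed
MAX_BUFFER_SIZE = 4 * GIB             # maintenance4: limit on a single VkBuffer

def classify_request(size: int) -> str:
    # Simplified stand-in for a backend buffer-size guard.
    if size > MAX_BUFFER_SIZE:
        return "buffer-too-large"     # exceeds the VkBuffer limit
    if size > MAX_MEMORY_ALLOCATION_SIZE:
        return "allocation-may-fail"  # beyond the guaranteed allocation size
    return "ok"

print(classify_request(3 * GIB))   # "ok"
print(classify_request(24 * GIB))  # "buffer-too-large"
```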

I tried running your new im2col and im2col_3d tests:

- On AMD (RADV) it does the allocation, but fails the test runs.
- On Intel (ANV) it gets the correct result for im2col; im2col_3d fails because 16GB of VRAM wasn't enough.
- On Nvidia (proprietary Linux driver) it runs correctly.

On all three it takes a very long time to finish the tests. The tests also used huge amounts of RAM (>80GB); I'm not sure if that's the CPU backend or something else.

@jeffbolznv
Collaborator Author

I've pushed a fix for the descriptor range validation failure. I'm not aware of one related to the buffer size.

The large memory usage and slowness is expected. The test framework ends up with multiple copies of the huge tensor, converted to f32. I don't intend to enable these tests by default.
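A rough estimate of where a figure like >80GB can come from, using a hypothetical element count for one of the large im2col tensors (the tests' real shapes may differ): once the tensor is held as f32 by both the backend-result path and the CPU reference, the working set passes 80GB with just two full copies.

```python
# Hypothetical element count for a single large im2col result tensor.
elements = 12_079_595_520

f32_bytes = elements * 4  # one full f32 copy of the tensor
copies = 2                # backend result + CPU reference, both as f32 (at minimum)

total_gib = copies * f32_bytes / 1024**3
print(round(total_gib))   # 90 GiB for two copies alone
```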

I'm surprised the AMD driver is failing. It may have been related to the validation failure, though that would be odd since that descriptor isn't actually used.

@0cc4m
Collaborator

0cc4m commented Sep 21, 2025

Mesa driver development seems to work by building stuff, optimizing it and fixing issues when they come up. If things don't come up, they often don't work, so this is probably another case of "nobody tried to do this yet". We'll have to open an issue about it, most likely.
