feat(vulkan): Add native Vulkan support for TurboQuant KV Cache (turbo3/turbo4) by Yvi71 · Pull Request #158 · TheTom/llama-cpp-turboquant

Yvi71 · 2026-05-28T07:50:35Z

This PR adds native Vulkan support for the TurboQuant KV Cache quantization formats (turbo3 and turbo4) on AMD RDNA 2 and other Vulkan-compatible hardware.
TurboQuant is already fully implemented for the Metal (macOS) and HIP/CUDA backends, but Vulkan support is missing, so running on Vulkan currently fails with assertion failures and missing shader pipelines.

Changes:

ggml-backend.cpp: Relaxed the GGML_ASSERT checks for async get/set operations on tensor views. KV Cache copy and shift operations often work on views (which have no direct tensor->data), so this prevents hard assertion crashes when moving the quantized cache.
ggml-vulkan.cpp: Registered the GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 types in the Vulkan device capability checks and loaded their set-row pipelines.
vulkan-shaders-gen.cpp: Added turbo3_0 and turbo4_0 to the offline shader compilation loop, allowing copy_to_quant.comp to correctly generate SPIR-V shaders (set_rows_turbo3_0 and set_rows_turbo4_0).
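
The registration step described above can be illustrated with a small, self-contained sketch. It is not the code from ggml-vulkan.cpp: `KvType`, `SetRowsRegistry`, and `make_registry` are hypothetical names invented for this example, and the non-turbo shader names are placeholders. Only the shader names set_rows_turbo3_0 and set_rows_turbo4_0 come from the PR description. The point is simply that a capability check backed by a pipeline table rejects a type until a pipeline is registered for it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the ggml type enum; only the turbo names come from the PR.
enum class KvType { F16, Q8_0, TURBO3_0, TURBO4_0 };

// Hypothetical registry mapping a destination type to its set-rows shader name.
class SetRowsRegistry {
public:
    // Register a pipeline (represented here only by its shader name) for a type.
    void add(KvType type, const std::string & shader_name) { pipelines_[type] = shader_name; }

    // Capability check: a type is supported only if a pipeline was registered for it.
    bool supports(KvType type) const { return pipelines_.find(type) != pipelines_.end(); }

    // Look up the shader name; throws std::out_of_range if the type is unregistered.
    const std::string & shader_for(KvType type) const { return pipelines_.at(type); }

private:
    std::map<KvType, std::string> pipelines_;
};

// Build a registry with or without the turbo formats, to compare the state
// before and after a change like the one this PR describes.
SetRowsRegistry make_registry(bool with_turbo) {
    SetRowsRegistry reg;
    reg.add(KvType::F16,  "set_rows_f16");   // placeholder name
    reg.add(KvType::Q8_0, "set_rows_q8_0");  // placeholder name
    if (with_turbo) {
        reg.add(KvType::TURBO3_0, "set_rows_turbo3_0");
        reg.add(KvType::TURBO4_0, "set_rows_turbo4_0");
    }
    return reg;
}
```

With `make_registry(false)` the check `supports(KvType::TURBO3_0)` is false, which mirrors the "missing shader pipelines" symptom; with `make_registry(true)` both turbo types resolve to their set-rows shader.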

Testing:

Extensively tested on a dual AMD Radeon RX 6800 (RDNA 2) setup using Mesa RADV Vulkan under Linux.
Verified stable prompt cache saves/loads and offloading.
Running in day-to-day "production" use with good speed scaling and no observed memory leaks or validation layer errors.

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers (K_f16, V_f16) stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context lengths on HIP/ROCm where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory. On HIP without VMM (all consumer RDNA 3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate f16 temp buffers with cudaMalloc and free with cudaFree (via RAII wrapper) instead of the pool. Memory is released after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).

Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)

Fixes: ggml-org#22107
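
The difference between pool retention and direct allocation can be shown with a host-only sketch. This is an illustration under stated assumptions, not the code from the commit: `FakeDevice` stands in for cudaMalloc/cudaFree so the example runs without a GPU, and `RetainingPool`, `DirectBuffer`, and the two helper functions are hypothetical names. A real implementation would also synchronize the stream before releasing the buffer, as the commit message notes.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Stand-in for device memory accounting. A real backend would call
// cudaMalloc/cudaFree here; this fake only tracks bytes in use.
struct FakeDevice {
    std::size_t in_use = 0;
    void * alloc(std::size_t n) { in_use += n; return ::operator new(n); }
    void release(void * p, std::size_t n) { in_use -= n; ::operator delete(p); }
};

// Legacy-style pool: free() parks the buffer for reuse and does not
// return it to the device until the pool itself is destroyed.
class RetainingPool {
    struct Buf { void * ptr; std::size_t size; };
public:
    explicit RetainingPool(FakeDevice & dev) : dev_(dev) {}
    ~RetainingPool() { for (const Buf & b : cached_) dev_.release(b.ptr, b.size); }

    // Reuse the first cached buffer that is large enough, else allocate a new one.
    void * alloc(std::size_t n, std::size_t * actual) {
        for (std::size_t i = 0; i < cached_.size(); ++i) {
            if (cached_[i].size >= n) {
                Buf b = cached_[i];
                cached_.erase(cached_.begin() + static_cast<std::ptrdiff_t>(i));
                *actual = b.size;
                return b.ptr;
            }
        }
        *actual = n;
        return dev_.alloc(n);
    }
    void free(void * p, std::size_t size) { cached_.push_back({p, size}); }

private:
    FakeDevice & dev_;
    std::vector<Buf> cached_;
};

// RAII wrapper: memory goes back to the device when the object leaves scope.
class DirectBuffer {
public:
    DirectBuffer(FakeDevice & dev, std::size_t n) : dev_(dev), ptr_(dev.alloc(n)), size_(n) {}
    ~DirectBuffer() { dev_.release(ptr_, size_); }
    DirectBuffer(const DirectBuffer &) = delete;
    DirectBuffer & operator=(const DirectBuffer &) = delete;
    void * get() const { return ptr_; }
private:
    FakeDevice & dev_;
    void * ptr_;
    std::size_t size_;
};

// Device bytes still held after one "attention call" that used a pooled temp buffer.
std::size_t bytes_held_after_pooled_call(FakeDevice & dev, RetainingPool & pool, std::size_t n) {
    std::size_t actual = 0;
    void * k_f16 = pool.alloc(n, &actual);
    pool.free(k_f16, actual);
    return dev.in_use;
}

// Device bytes still held after one call that used a scoped, directly allocated buffer.
std::size_t bytes_held_after_direct_call(FakeDevice & dev, std::size_t n) {
    {
        DirectBuffer k_f16(dev, n);
        (void) k_f16.get();
        // A real implementation would synchronize the stream here, before release.
    }
    return dev.in_use;
}
```

In this sketch the pooled path keeps the peak-sized buffer resident after the call returns (a later, smaller request reuses it without shrinking it), while the scoped path returns to zero bytes held. The cost of the scoped path in the real fix is the extra synchronization per call.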

hip: bypass memory pool for FA f16 temp buffers

TheTom and others added 4 commits April 20, 2026 09:14

Merge pull request TheTom#92 from TheTom/fix/hip-fa-pool-retention

57f6b93

hip: bypass memory pool for FA f16 temp buffers

Merge origin/master into feature/turboquant-kv-cache

bd0d153

feat(vulkan): add native Vulkan support for TurboQuant KV Cache (turbo3/turbo4)

19b1864

github-actions bot added the Nvidia GPU, ggml, and Vulkan labels on May 28, 2026

Yvi71 wants to merge 4 commits into TheTom:feature/turboquant-kv-cache from Yvi71:feature/vulkan-turboquant-kv-cache

2 participants
