feat(vulkan): Add native Vulkan support for TurboQuant KV Cache (turbo3/turbo4) #158

Open
Yvi71 wants to merge 4 commits into TheTom:feature/turboquant-kv-cache from Yvi71:feature/vulkan-turboquant-kv-cache

Yvi71 commented May 28, 2026

This PR adds native Vulkan support for the TurboQuant KV cache quantization formats (turbo3 and turbo4) on AMD RDNA 2 and other Vulkan-compatible hardware.
TurboQuant is currently fully implemented for the Metal (macOS) and HIP/CUDA backends, but Vulkan support is missing, so running on Vulkan hits assertion failures and missing shader pipelines.

Changes:

  • ggml-backend.cpp: Relaxed the GGML_ASSERT checks for async get/set operations on tensor views. KV cache copy and shift operations often use views (without a direct tensor->data), so this prevents hard assertion failures when moving the quantized cache.
  • ggml-vulkan.cpp: Registered the GGML_TYPE_TURBO3_0 and GGML_TYPE_TURBO4_0 types in the Vulkan device capability checks and loaded their set-row pipelines.
  • vulkan-shaders-gen.cpp: Added turbo3_0 and turbo4_0 to the offline shader compilation loop, allowing copy_to_quant.comp to correctly generate SPIR-V shaders (set_rows_turbo3_0 and set_rows_turbo4_0).
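
The diff for the ggml-backend.cpp change is not shown here, so the following is only a rough sketch of what relaxing such a check could look like. `sketch_tensor`, `can_access_strict`, `can_access_relaxed`, and `range_in_bounds` are hypothetical names invented for illustration; they are not the real ggml types or the code in this PR. The assumption is that a view may carry no data pointer of its own but still refer to a source tensor that does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a tensor. This is NOT the real ggml_tensor layout.
struct sketch_tensor {
    void *                data;      // direct storage; may be null for a view
    const sketch_tensor * view_src;  // tensor this one is a view of, or null
    size_t                nbytes;    // size of the region this tensor covers
};

// Strict check: any tensor without its own data pointer is rejected.
inline bool can_access_strict(const sketch_tensor & t) {
    return t.data != nullptr;
}

// Relaxed check (assumed behaviour): a view is accepted as long as the
// tensor it views has storage.
inline bool can_access_relaxed(const sketch_tensor & t) {
    if (t.data != nullptr) {
        return true;
    }
    return t.view_src != nullptr && t.view_src->data != nullptr;
}

// Bounds check that stays in place either way; written to avoid overflow
// in offset + size.
inline bool range_in_bounds(const sketch_tensor & t, size_t offset, size_t size) {
    return offset <= t.nbytes && size <= t.nbytes - offset;
}
```

The point of the sketch is that only the "has a data pointer" condition is loosened; the range check on the requested offset and size is kept, so an out-of-bounds copy would still be caught.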

Testing:

  • Extensively tested on a dual AMD Radeon RX 6800 (RDNA 2) setup using Mesa RADV Vulkan under Linux.
  • Verified stable prompt cache saves/loads and offloading.
  • In regular ("production") use on this setup, with speed scaling well and no memory leaks or validation-layer errors observed.

TheTom and others added 4 commits April 20, 2026 09:14
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized
allocations permanently. For quantized KV flash attention, the f16
dequant temp buffers (K_f16, V_f16) stay allocated in the pool after
use, consuming more VRAM than the KV compression saves. This causes
quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context
lengths on HIP/ROCm where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[]
for reuse and never calls cudaFree. On CUDA with VMM the OS can
reclaim unused virtual memory. On HIP without VMM (all consumer RDNA
3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate f16 temp buffers with cudaMalloc and free with
cudaFree (via RAII wrapper) instead of the pool. Memory is released
after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).
Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)
Fixes: ggml-org#22107
hip: bypass memory pool for FA f16 temp buffers
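
The RAII approach described in the commit message above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: `scoped_temp_buffer`, `host_allocator`, and `counting_allocator` are hypothetical names, and the allocator is a template parameter so the example runs without a GPU. In a HIP build the hooks would presumably wrap `cudaMalloc` and `cudaFree`, and the stream would have to be synchronized before the buffer goes out of scope; the sketch does not model that step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator hooks. On HIP/CUDA these would wrap cudaMalloc and
// cudaFree; plain malloc/free are used here so the sketch runs anywhere.
struct host_allocator {
    static void * alloc(size_t n)    { return std::malloc(n); }
    static void   release(void * p)  { std::free(p); }
};

// Same hooks, but counting live allocations so the release can be observed.
struct counting_allocator {
    static int & live() { static int n = 0; return n; }
    static void * alloc(size_t n) {
        void * p = std::malloc(n);
        if (p != nullptr) { ++live(); }
        return p;
    }
    static void release(void * p) {
        std::free(p);
        --live();
    }
};

// RAII wrapper: allocates in the constructor and frees in the destructor,
// so the memory goes back to the allocator instead of staying in a pool.
template <typename Alloc>
class scoped_temp_buffer {
public:
    explicit scoped_temp_buffer(size_t n)
        : ptr_(Alloc::alloc(n)), size_(ptr_ != nullptr ? n : 0) {}

    ~scoped_temp_buffer() {
        if (ptr_ != nullptr) { Alloc::release(ptr_); }
    }

    // Non-copyable: exactly one owner frees the buffer.
    scoped_temp_buffer(const scoped_temp_buffer &) = delete;
    scoped_temp_buffer & operator=(const scoped_temp_buffer &) = delete;

    void * get()  const { return ptr_; }
    size_t size() const { return size_; }

private:
    void * ptr_;
    size_t size_;
};
```

Declaring two such buffers (for example for K_f16 and V_f16) inside the scope of one flash-attention call means both are released when that scope ends, which is the behaviour the legacy pool does not provide. The cost is an allocation and a free on every call, in addition to the synchronization overhead the commit message mentions.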