From 3416ff13f7a3eb187b7ab4abab576a39c09b9e9a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gast=C3=B3n=20Parravicini?= Date: Sat, 2 May 2026 23:33:15 +0000 Subject: [PATCH 1/2] ggml-cuda: fix cublasSgemm crash with zero-dim matrices in speculative decoding Problem ------- During speculative decoding (DFlash/copyspec), draft verification produces non-consecutive token positions which cause slot management to generate sub-matrix batches where row_diff evaluates to 0. cuBLAS requires ldc >= max(1, m) where m = row_diff. When row_diff == 0, ldc is also 0 (ldc = row_diff when id != ctx.device), violating the cuBLAS precondition and triggering CUBLAS_STATUS_INVALID_VALUE. Root cause ---------- No guard exists before the cublasSgemm/cublasGemmEx calls in ggml_cuda_op_mul_mat_cublas() to handle zero-dimension sub-matrices. Unlike OpenBLAS and MKL which define zero-size GEMMs as no-ops, cuBLAS treats them as invalid parameters and aborts. Fix --- Add an early-return guard after row_diff and ldc are computed. When any GEMM dimension (m, n, k) is zero, return immediately. This is safe because dst_dd_i is untouched and the caller does not read partial results. Tested ------ Qwen3.6-27B Q4_K_M + DFlash draft, 55k context, RTX 3090 24GB - Before: crash after ~27k tokens prefill during draft verification - After: stable generation, multiple sequential requests without crash Related upstream issues ----------------------- - ggml-org/llama.cpp#22105 (DFlash PR: hybrid models cannot discard rejected suffixes via seq_rm due to non-decomposable recurrent state) - ggml-org/llama.cpp#19929 (non-consecutive token position with Qwen3.5 vision models on multi-GPU setups) - ggml-org/llama.cpp#21569 (DFlash discussion: MoE hybrid limitations) Authored-by: Gaston Parravicini --- ggml/src/ggml-cuda/ggml-cuda.cu | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu index 02743e89d5f..1b544e05d0b 100644 --- a/ggml/src/ggml-cuda/ggml-cuda.cu +++ b/ggml/src/ggml-cuda/ggml-cuda.cu @@ -1493,6 +1493,16 @@ static void ggml_cuda_op_mul_mat_cublas( // ldc == nrows of the matrix that cuBLAS writes into int64_t ldc = id == ctx.device ? ne0 : row_diff; + // Guard: cuBLAS requires m >= 1, n >= 1, k >= 1 for Sgemm/GemmEx. + // During speculative decoding (DFlash/copyspec), draft verification can + // produce non-consecutive token positions which result in zero-size + // sub-matrices. cuBLAS treats these as invalid parameters and aborts + // with CUBLAS_STATUS_INVALID_VALUE. Zero-size GEMMs are defined as + // no-ops (no output written), matching OpenBLAS and MKL behavior. + if (row_diff == 0 || src1_ncols == 0 || ne10 == 0) { + return; + } + const int cc = ggml_cuda_info().devices[id].cc; const bool supports_bf16 = GGML_CUDA_CC_IS_NVIDIA(cc) || GGML_CUDA_CC_IS_AMD(cc) || From a12fd999c2f38d134f5ee2489b17e9a17ef3cabc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gast=C3=B3n=20Parravicini?= Date: Sun, 3 May 2026 02:05:48 +0000 Subject: [PATCH 2/2] ggml-cuda: extend zero-dim guard to cover ne00 and ldc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previous guard checked row_diff, src1_ncols, ne10 but missed ne00 (lda) and ldc. When ne00 == 0 or ldc == 0, cuBLAS also aborts with CUBLAS_STATUS_INVALID_VALUE even if m, n, k > 0. Extends the early-return guard to cover all invalid GEMM dimensions. Authored-by: Gastón Parravicini --- ggml/src/ggml-cuda/ggml-cuda.cu | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu index 1b544e05d0b..6adb40d5d3d 100644 --- a/ggml/src/ggml-cuda/ggml-cuda.cu +++ b/ggml/src/ggml-cuda/ggml-cuda.cu @@ -1499,7 +1499,7 @@ static void ggml_cuda_op_mul_mat_cublas( // sub-matrices. cuBLAS treats these as invalid parameters and aborts // with CUBLAS_STATUS_INVALID_VALUE. Zero-size GEMMs are defined as // no-ops (no output written), matching OpenBLAS and MKL behavior. - if (row_diff == 0 || src1_ncols == 0 || ne10 == 0) { + if (row_diff == 0 || src1_ncols == 0 || ne10 == 0 || ne00 == 0 || ldc == 0) { return; }