From 3416ff13f7a3eb187b7ab4abab576a39c09b9e9a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gast=C3=B3n=20Parravicini?= <athos.tgz@email.com>
Date: Sat, 2 May 2026 23:33:15 +0000
Subject: [PATCH 1/2] ggml-cuda: fix cublasSgemm crash with zero-dim matrices
 in speculative decoding

Problem
-------
During speculative decoding (DFlash/copyspec), draft verification produces
non-consecutive token positions which cause slot management to generate
sub-matrix batches where row_diff evaluates to 0.

cuBLAS requires ldc >= max(1, m) where m = row_diff. When row_diff == 0,
ldc is also 0 (ldc = row_diff when id != ctx.device), violating the
cuBLAS precondition and triggering CUBLAS_STATUS_INVALID_VALUE.

Root cause
----------
No guard exists before the cublasSgemm/cublasGemmEx calls in
ggml_cuda_op_mul_mat_cublas() to handle zero-dimension sub-matrices.
Unlike OpenBLAS and MKL which define zero-size GEMMs as no-ops, cuBLAS
treats them as invalid parameters and aborts.

Fix
---
Add an early-return guard after row_diff and ldc are computed. When any
GEMM dimension (m, n, k) is zero, return immediately. This is safe because
dst_dd_i is untouched and the caller does not read partial results.

Tested
------
Qwen3.6-27B Q4_K_M + DFlash draft, 55k context, RTX 3090 24GB
- Before: crash after ~27k tokens prefill during draft verification
- After:  stable generation, multiple sequential requests without crash

Related upstream issues
-----------------------
- ggml-org/llama.cpp#22105 (DFlash PR: hybrid models cannot discard
  rejected suffixes via seq_rm due to non-decomposable recurrent state)
- ggml-org/llama.cpp#19929 (non-consecutive token position with Qwen3.5
  vision models on multi-GPU setups)
- ggml-org/llama.cpp#21569 (DFlash discussion: MoE hybrid limitations)

Authored-by: Gaston Parravicini
---
 ggml/src/ggml-cuda/ggml-cuda.cu | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 02743e89d5f..1b544e05d0b 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1493,6 +1493,16 @@ static void ggml_cuda_op_mul_mat_cublas(
     // ldc == nrows of the matrix that cuBLAS writes into
     int64_t ldc = id == ctx.device ? ne0 : row_diff;
 
+    // Guard: cuBLAS requires m >= 1, n >= 1, k >= 1 for Sgemm/GemmEx.
+    // During speculative decoding (DFlash/copyspec), draft verification can
+    // produce non-consecutive token positions which result in zero-size
+    // sub-matrices. cuBLAS treats these as invalid parameters and aborts
+    // with CUBLAS_STATUS_INVALID_VALUE. Zero-size GEMMs are defined as
+    // no-ops (no output written), matching OpenBLAS and MKL behavior.
+    if (row_diff == 0 || src1_ncols == 0 || ne10 == 0) {
+        return;
+    }
+
     const int cc = ggml_cuda_info().devices[id].cc;
 
     const bool supports_bf16 = GGML_CUDA_CC_IS_NVIDIA(cc) || GGML_CUDA_CC_IS_AMD(cc) ||

From a12fd999c2f38d134f5ee2489b17e9a17ef3cabc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gast=C3=B3n=20Parravicini?= <athos.tgz@email.com>
Date: Sun, 3 May 2026 02:05:48 +0000
Subject: [PATCH 2/2] ggml-cuda: extend zero-dim guard to cover ne00 and ldc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previous guard checked row_diff, src1_ncols, ne10 but missed ne00 (lda)
and ldc. When ne00 == 0 or ldc == 0, cuBLAS also aborts with
CUBLAS_STATUS_INVALID_VALUE even if m, n, k > 0.

Extends the early-return guard to cover all invalid GEMM dimensions.

Authored-by: Gastón Parravicini
---
 ggml/src/ggml-cuda/ggml-cuda.cu | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 1b544e05d0b..6adb40d5d3d 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1499,7 +1499,7 @@ static void ggml_cuda_op_mul_mat_cublas(
     // sub-matrices. cuBLAS treats these as invalid parameters and aborts
     // with CUBLAS_STATUS_INVALID_VALUE. Zero-size GEMMs are defined as
     // no-ops (no output written), matching OpenBLAS and MKL behavior.
-    if (row_diff == 0 || src1_ncols == 0 || ne10 == 0) {
+    if (row_diff == 0 || src1_ncols == 0 || ne10 == 0 || ne00 == 0 || ldc == 0) {
         return;
     }