Speculative decoding: ~130× decode regression on CUDA + turbo3 KV (RTX 5090, Qwen3.6-27B-Q6_K) despite 100% draft acceptance

## Summary

The speculative-decoding cherry-pick (991301feb, "Cherry-pick upstream speculative decoding for hybrid models") activates correctly on CUDA + turbo3 KV — log confirms `speculative decoding context initialized` — but the generate step then runs at **~0.36–0.42 tokens/sec**, roughly 130× slower than the same binary without `-md`. This is despite **100% draft acceptance** across many requests (e.g. 143/143, 91/91, 47/47 tokens accepted), so the spec-decode logic itself is mechanically correct; the regression appears to be in the CUDA generate path under the new context-checkpoint machinery.

Tested on `feature/turboquant-kv-cache` HEAD `5aeb2fdbe` (2026-05-09). The commit message notes "Smoke tested on M5 Max with turbo4 KV — zero regression," so this seems to be a CUDA-path issue the Apple Silicon smoke-test couldn't surface.

## Environment

- GPU: NVIDIA RTX 5090 (sm_120, 32 GB VRAM)
- OS: Windows Server 2025
- CUDA: 12.8
- Build: CMake/Ninja Release, `-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_TURBOQUANT=ON`
- Target model: `Qwen3.6-27B-Q6_K.gguf` (Unsloth GGUF)
- Draft model: `qwen3-1.7b-q4_k_m.gguf` (vocab differs from target; server logs `the target and draft vocabs are not compatible - tokens will be translated between the two` — included only as a one-variable comparison vs. baseline run that used the same target model without `-md`)

## Launch command

```
llama-server.exe ^
  --model     C:\models\Qwen3.6-27B-Q6_K.gguf ^
  --model-draft C:\models\qwen3-1.7b-q4_k_m.gguf ^
  --draft-max 4 --draft-min 1 ^
  --host 127.0.0.1 --port 8080 ^
  --ctx-size 32768 ^
  -ctk turbo3 -ctv turbo3 ^
  --flash-attn on --gpu-layers 99 -ngld 99 ^
  -b 4096 -ub 1024 ^
  --no-warmup
```

## Observed behavior

Several requests pulled from a 10-prompt fixture (mix of short tool-call, medium QA, long codegen). Representative server timings:

```
prompt eval time =     735.13 ms /    69 tokens (   10.65 ms per token,    93.86 tokens per second)
       eval time =  165139.77 ms /    64 tokens ( 2580.31 ms per token,     0.39 tokens per second)
      total time =  165874.91 ms /  133 tokens
draft acceptance rate = 1.00000 (   47 accepted /    47 generated)
 statistics draft: #calls(b,g,a) = 1 16 16, #gen drafts = 16, #acc drafts = 16, #gen tokens = 63, #acc tokens = 47, dur(b,g,a) = 0.000, 154924.900, 0.009 ms
```

```
prompt eval time =    1662.41 ms /   110 tokens (   15.11 ms per token,    66.17 tokens per second)
       eval time =  557763.44 ms /   200 tokens ( 2788.82 ms per token,     0.36 tokens per second)
      total time =  559425.85 ms /   310 tokens
draft acceptance rate = 1.00000 (  143 accepted /   143 generated)
```

Note `dur(b,g,a) = 0.000, 525188.554, 0.024 ms` — virtually all wall time is in the generate (`g`) phase, not in draft proposal or acceptance.

## Expected behavior

Reference baseline using the same binary, same target model, same target hardware, **without `-md`**: 50 tok/s server-side decode (`predicted_per_second` ≈ 50). So eval time at the same generation lengths should be roughly:

- 64 tokens: ~1.3 s (observed: 165 s)
- 200 tokens: ~4.0 s (observed: 558 s)

Per-prompt deltas under speculative decoding should be in the +0%..+1.5× range based on prior ik_llama.cpp + same draft model bench (overall +2.2% median across the 10-prompt mix).

## Notes

- Draft acceptance at 1.0 across many prompts strongly suggests the spec-decode logic itself is functioning. The bottleneck appears to be in the per-token generate path that runs *under* the new context-checkpoint scaffolding (PRs #19493 + #22114 + #22168 + #22223 from the cherry-pick).
- We see `created context checkpoint 1 of 32 (pos_min = 105, pos_max = 105, n_tokens = 106, size = 149.626 MiB)` and matching `restored context checkpoint` lines per request. Possibly the checkpoint save/restore is going through a slow path for turbo3 KV cache?
- Already rolled back to `pr/tq4-weight-compression` for local use, so this isn't blocking us — happy to run further repro variants on this hardware if it helps.

## Reproducer

Any client posting to `/v1/chat/completions` (e.g. `curl`) reproduces this — there's nothing special about the harness. We hit it with a 10-prompt fixture mixing short tool-call / classification, medium summary / QA, and long codegen / refactor / explanation / agent-plan / multistep-reasoning, with `max_tokens` ranging from 64 to 800. Happy to share any specific prompt or run further variants on this hardware if it helps narrow it down.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speculative decoding: ~130× decode regression on CUDA + turbo3 KV (RTX 5090, Qwen3.6-27B-Q6_K) despite 100% draft acceptance #143

Summary

Environment

Launch command

Observed behavior

Expected behavior

Notes

Reproducer

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Speculative decoding: ~130× decode regression on CUDA + turbo3 KV (RTX 5090, Qwen3.6-27B-Q6_K) despite 100% draft acceptance #143

Description

Summary

Environment

Launch command

Observed behavior

Expected behavior

Notes

Reproducer

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions