Summary
The speculative-decoding cherry-pick (991301f, "Cherry-pick upstream speculative decoding for hybrid models") activates correctly on CUDA + turbo3 KV — log confirms speculative decoding context initialized — but the generate step then runs at ~0.36–0.42 tokens/sec, roughly 130× slower than the same binary without -md. This is despite 100% draft acceptance across many requests (e.g. 143/143, 91/91, 47/47 tokens accepted), so the spec-decode logic itself is mechanically correct; the regression appears to be in the CUDA generate path under the new context-checkpoint machinery.
Tested on feature/turboquant-kv-cache HEAD 5aeb2fdbe (2026-05-09). The commit message notes "Smoke tested on M5 Max with turbo4 KV — zero regression," so this seems to be a CUDA-path issue the Apple Silicon smoke-test couldn't surface.
Environment
- GPU: NVIDIA RTX 5090 (sm_120, 32 GB VRAM)
- OS: Windows Server 2025
- CUDA: 12.8
- Build: CMake/Ninja Release,
-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_TURBOQUANT=ON
- Target model:
Qwen3.6-27B-Q6_K.gguf (Unsloth GGUF)
- Draft model:
qwen3-1.7b-q4_k_m.gguf (vocab differs from target; server logs the target and draft vocabs are not compatible - tokens will be translated between the two — included only as a one-variable comparison vs. baseline run that used the same target model without -md)
Launch command
llama-server.exe ^
--model C:\models\Qwen3.6-27B-Q6_K.gguf ^
--model-draft C:\models\qwen3-1.7b-q4_k_m.gguf ^
--draft-max 4 --draft-min 1 ^
--host 127.0.0.1 --port 8080 ^
--ctx-size 32768 ^
-ctk turbo3 -ctv turbo3 ^
--flash-attn on --gpu-layers 99 -ngld 99 ^
-b 4096 -ub 1024 ^
--no-warmup
Observed behavior
Several requests pulled from a 10-prompt fixture (mix of short tool-call, medium QA, long codegen). Representative server timings:
prompt eval time = 735.13 ms / 69 tokens ( 10.65 ms per token, 93.86 tokens per second)
eval time = 165139.77 ms / 64 tokens ( 2580.31 ms per token, 0.39 tokens per second)
total time = 165874.91 ms / 133 tokens
draft acceptance rate = 1.00000 ( 47 accepted / 47 generated)
statistics draft: #calls(b,g,a) = 1 16 16, #gen drafts = 16, #acc drafts = 16, #gen tokens = 63, #acc tokens = 47, dur(b,g,a) = 0.000, 154924.900, 0.009 ms
prompt eval time = 1662.41 ms / 110 tokens ( 15.11 ms per token, 66.17 tokens per second)
eval time = 557763.44 ms / 200 tokens ( 2788.82 ms per token, 0.36 tokens per second)
total time = 559425.85 ms / 310 tokens
draft acceptance rate = 1.00000 ( 143 accepted / 143 generated)
Note dur(b,g,a) = 0.000, 525188.554, 0.024 ms — virtually all wall time is in the generate (g) phase, not in draft proposal or acceptance.
Expected behavior
Reference baseline using the same binary, same target model, same target hardware, without -md: 50 tok/s server-side decode (predicted_per_second ≈ 50). So eval time at the same generation lengths should be roughly:
- 64 tokens: ~1.3 s (observed: 165 s)
- 200 tokens: ~4.0 s (observed: 558 s)
Per-prompt deltas under speculative decoding should be in the +0%..+1.5× range based on prior ik_llama.cpp + same draft model bench (overall +2.2% median across the 10-prompt mix).
Notes
Reproducer
Any client posting to /v1/chat/completions (e.g. curl) reproduces this — there's nothing special about the harness. We hit it with a 10-prompt fixture mixing short tool-call / classification, medium summary / QA, and long codegen / refactor / explanation / agent-plan / multistep-reasoning, with max_tokens ranging from 64 to 800. Happy to share any specific prompt or run further variants on this hardware if it helps narrow it down.
Summary
The speculative-decoding cherry-pick (991301f, "Cherry-pick upstream speculative decoding for hybrid models") activates correctly on CUDA + turbo3 KV — log confirms
speculative decoding context initialized— but the generate step then runs at ~0.36–0.42 tokens/sec, roughly 130× slower than the same binary without-md. This is despite 100% draft acceptance across many requests (e.g. 143/143, 91/91, 47/47 tokens accepted), so the spec-decode logic itself is mechanically correct; the regression appears to be in the CUDA generate path under the new context-checkpoint machinery.Tested on
feature/turboquant-kv-cacheHEAD5aeb2fdbe(2026-05-09). The commit message notes "Smoke tested on M5 Max with turbo4 KV — zero regression," so this seems to be a CUDA-path issue the Apple Silicon smoke-test couldn't surface.Environment
-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_TURBOQUANT=ONQwen3.6-27B-Q6_K.gguf(Unsloth GGUF)qwen3-1.7b-q4_k_m.gguf(vocab differs from target; server logsthe target and draft vocabs are not compatible - tokens will be translated between the two— included only as a one-variable comparison vs. baseline run that used the same target model without-md)Launch command
Observed behavior
Several requests pulled from a 10-prompt fixture (mix of short tool-call, medium QA, long codegen). Representative server timings:
Note
dur(b,g,a) = 0.000, 525188.554, 0.024 ms— virtually all wall time is in the generate (g) phase, not in draft proposal or acceptance.Expected behavior
Reference baseline using the same binary, same target model, same target hardware, without
-md: 50 tok/s server-side decode (predicted_per_second≈ 50). So eval time at the same generation lengths should be roughly:Per-prompt deltas under speculative decoding should be in the +0%..+1.5× range based on prior ik_llama.cpp + same draft model bench (overall +2.2% median across the 10-prompt mix).
Notes
created context checkpoint 1 of 32 (pos_min = 105, pos_max = 105, n_tokens = 106, size = 149.626 MiB)and matchingrestored context checkpointlines per request. Possibly the checkpoint save/restore is going through a slow path for turbo3 KV cache?pr/tq4-weight-compressionfor local use, so this isn't blocking us — happy to run further repro variants on this hardware if it helps.Reproducer
Any client posting to
/v1/chat/completions(e.g.curl) reproduces this — there's nothing special about the harness. We hit it with a 10-prompt fixture mixing short tool-call / classification, medium summary / QA, and long codegen / refactor / explanation / agent-plan / multistep-reasoning, withmax_tokensranging from 64 to 800. Happy to share any specific prompt or run further variants on this hardware if it helps narrow it down.