Consolidating the speculative-decoding threads into one place so we coordinate instead of colliding. Grew out of #325. cc @wjinxu @scatyf3 @kitty-eu-org @CAICAIIs: please comment on or claim the lanes you want.
Where we are
DFlash speculative decoding shipped on main (#436 + #442): the full loop — draft model loading → batched draft forward → target "verify" forward (piecewise CUDA graph) → acceptance → KV rollback → OpenAI serving — behind --dflash-draft-model-path. Greedy-lossless (lm-eval gsm8k strict-match identical spec on/off), validated on 5070 Ti / 5090, matches or beats vLLM dflash across c1/c8/c16. Design notes: docs/models/qwen3/dflash-speculative-decoding.md.
So speculation is a real subsystem now, not an experiment. This roadmap is about growing it into a first-class, multi-method, multi-model one — without each method/model reimplementing the controller.
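For readers new to the loop, here is a minimal sketch of what the greedy acceptance step typically looks like for one sequence. It is not the code on main: `greedy_accept` and its argument layout are hypothetical, and it assumes the verify forward yields one target argmax per draft position plus one extra.

```rust
/// Hypothetical helper, not the shipped implementation.
/// `draft` holds the k proposed tokens. `target_argmax` holds k + 1 tokens from
/// the target verify forward: entry i is the target's greedy choice given the
/// prefix followed by draft[..i].
/// Returns the tokens to commit this step. KV entries written for draft
/// positions beyond the accepted prefix are the ones a rollback would discard.
pub fn greedy_accept(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    // Length of the longest draft prefix that agrees with the target.
    let n_accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // Commit the agreed prefix plus the target's own token at the first
    // mismatch (or the extra "bonus" token when every draft token matched).
    target_argmax[..=n_accepted].to_vec()
}
```

Because every committed token is the target's own greedy choice, the output matches plain greedy decoding token for token, which is the property the gsm8k on/off check exercises.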
Workstreams
Brackets = current lead / interested. Shout if you want to take or share one.
Methods
Performance
Capability & correctness
- Speculative sampling. Today speculation is greedy-only: a sampling (temperature / top_k / top_p) request correctly falls back to plain decode. That never produces a wrong distribution, but it gives no speedup, and one sampling request currently disables speculation for its whole batch. Lossless rejection-sampling acceptance (accept with probability min(1, p_target/q_draft); on reject, resample from the normalized residual (p_target − q_draft)₊) would unlock the speedup for non-greedy requests. Lives in the verify controller.
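The acceptance rule above can be sketched for a single drafted token as follows. This is an illustration under stated assumptions, not a proposal for the controller's API: `accept_or_resample` is a hypothetical name, both distributions are assumed to be already normalized (after temperature / top_k / top_p), and the uniform random numbers are passed in by the caller so the function stays deterministic and testable.

```rust
/// Hypothetical helper: lossless speculative-sampling acceptance for one token.
/// `p_target` and `q_draft` are probability vectors over the same vocabulary.
/// `u_accept` and `u_resample` are uniform samples in [0, 1) from the caller.
/// Returns (accepted, token): the draft token if accepted, otherwise a token
/// drawn from the normalized residual max(p_target - q_draft, 0).
pub fn accept_or_resample(
    draft_token: usize,
    p_target: &[f32],
    q_draft: &[f32],
    u_accept: f32,
    u_resample: f32,
) -> (bool, usize) {
    assert_eq!(p_target.len(), q_draft.len());
    let p = p_target[draft_token];
    let q = q_draft[draft_token];
    // Accept with probability min(1, p / q). Testing u * q < p is the same
    // condition for q > 0 and avoids the division.
    if u_accept * q < p {
        return (true, draft_token);
    }
    // Rejected: build the unnormalized residual distribution.
    let residual: Vec<f32> = p_target
        .iter()
        .zip(q_draft)
        .map(|(pt, qd)| (pt - qd).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    // Inverse-CDF sampling; scaling the uniform by `total` normalizes implicitly.
    let mut remaining = u_resample * total;
    // Falls back to the draft token only if the residual is empty (degenerate input).
    let mut chosen = draft_token;
    for (i, &r) in residual.iter().enumerate() {
        if r <= 0.0 {
            continue;
        }
        chosen = i; // last positive entry also guards against rounding drift
        if remaining < r {
            break;
        }
        remaining -= r;
    }
    (false, chosen)
}
```

In a multi-token draft the rule would be applied position by position, stopping at the first rejection; how that interacts with batching is left to the lane owner.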
Usability & metrics
- Acceptance-rate / draft-hit / speculative-token-throughput metrics, serving semantics, config ergonomics.
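As a starting point for discussion, here is one possible shape for the counters behind these metrics. The struct name, field names, and metric definitions are all assumptions; in particular it assumes each verify step commits the accepted draft tokens plus one token from the target, which may not match how every method counts.

```rust
/// Hypothetical per-engine counters for speculative decoding.
#[derive(Default, Debug, Clone, Copy)]
pub struct SpecStats {
    pub verify_steps: u64,
    pub drafted_tokens: u64,
    pub accepted_tokens: u64,
}

impl SpecStats {
    /// Record one verify step for one sequence.
    pub fn record_step(&mut self, drafted: u64, accepted: u64) {
        debug_assert!(accepted <= drafted);
        self.verify_steps += 1;
        self.drafted_tokens += drafted;
        self.accepted_tokens += accepted;
    }

    /// Fraction of drafted tokens the target accepted (0.0 before any drafts).
    pub fn acceptance_rate(&self) -> f64 {
        if self.drafted_tokens == 0 {
            return 0.0;
        }
        self.accepted_tokens as f64 / self.drafted_tokens as f64
    }

    /// Mean tokens committed per verify step, assuming each step emits the
    /// accepted draft tokens plus one token chosen by the target.
    pub fn mean_tokens_per_step(&self) -> f64 {
        if self.verify_steps == 0 {
            return 0.0;
        }
        (self.accepted_tokens + self.verify_steps) as f64 / self.verify_steps as f64
    }
}
```

Dividing committed tokens by wall-clock time instead of by step count would give the speculative-token-throughput figure.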
Shared primitives
- Factor the reusable pieces — proposer trait, verify + accept, draft cache, KV rollback — into a shared module (the way sampling lives in a shared sampler), so EAGLE3 / n-gram / DFlash / per-model engines don't each reimplement the controller.
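To make the proposer-trait idea concrete, here is a rough sketch with a toy n-gram proposer behind it. Everything here is hypothetical: the trait name, the `propose` signature, and the lookup strategy are illustrations only, and the real interface would need to handle batching, draft caches, and device tensors.

```rust
/// Hypothetical shared interface: anything that can guess upcoming tokens.
pub trait Proposer {
    /// Propose up to `max_tokens` draft tokens that might follow `context`.
    /// Returning an empty vector means "no proposal, fall back to plain decode".
    fn propose(&mut self, context: &[u32], max_tokens: usize) -> Vec<u32>;
}

/// Toy n-gram proposer: find the most recent earlier occurrence of the last
/// `n` tokens and propose whatever followed it.
pub struct NgramProposer {
    pub n: usize,
}

impl Proposer for NgramProposer {
    fn propose(&mut self, context: &[u32], max_tokens: usize) -> Vec<u32> {
        let n = self.n;
        if n == 0 || context.len() <= n {
            return Vec::new();
        }
        let suffix = &context[context.len() - n..];
        // Scan earlier windows from newest to oldest, excluding the suffix itself.
        for start in (0..context.len() - n).rev() {
            if &context[start..start + n] == suffix {
                let from = start + n;
                let to = (from + max_tokens).min(context.len());
                return context[from..to].to_vec();
            }
        }
        Vec::new()
    }
}
```

A verify controller written against `&mut dyn Proposer` (or a generic `P: Proposer`) would then not need to know whether the draft came from DFlash, EAGLE3, or an n-gram lookup.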
How to plug in
Comment with the lane you want and we'll coordinate ownership here. The #439 timing overlap — a parallel draft crate that landed after the integration shipped — is exactly what this thread is meant to prevent: let's claim lanes before writing thousands of lines. 🙂
Related: #325 (origin), #349 (n-gram), #439 (draft crate), #434 (qwen3.5 DFlash).