[Roadmap] Speculative decoding (qwen3 first, shared primitives) #443

Consolidating the speculative-decoding threads into one place so we coordinate instead of colliding. Grew out of #325. cc @wjinxu @scatyf3 @kitty-eu-org @CAICAIIs: please holler about, or claim, the lanes you want.

## Where we are

DFlash speculative decoding shipped on `main` (#436 + #442). The full loop is in place behind `--dflash-draft-model-path`: draft model loading → batched draft forward → target "verify" forward (piecewise CUDA graph) → acceptance → KV rollback → OpenAI serving. It is greedy-lossless (lm-eval gsm8k strict-match is identical with speculation on and off), validated on 5070 Ti / 5090, and matches or beats vLLM dflash across c1/c8/c16. Design notes: `docs/models/qwen3/dflash-speculative-decoding.md`.
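
For readers new to the loop, here is a minimal sketch of the greedy acceptance step, assuming the verify forward yields the target's argmax token at every draft position plus one extra. The function name and the `u32` token type are illustrative, not taken from the code on `main`.

```rust
/// Greedy verification (illustrative): keep the longest prefix of `draft`
/// that matches the target's argmax, then append one token from the target.
///
/// `target_argmax` is assumed to have `draft.len() + 1` entries: entry `i`
/// is the target's greedy token given the context plus `draft[..i]`.
/// Returns (number of accepted draft tokens, tokens to emit).
fn greedy_accept(draft: &[u32], target_argmax: &[u32]) -> (usize, Vec<u32>) {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // Accepted draft tokens, plus the target's own token at the first
    // mismatch (or the bonus token when every draft token matched).
    let mut out = draft[..accepted].to_vec();
    out.push(target_argmax[accepted]);
    (accepted, out)
}

fn main() {
    let draft = [5, 9, 2];
    let (n, out) = greedy_accept(&draft, &[5, 9, 7, 1]);
    // KV entries written for the rejected draft positions would be rolled back.
    let rollback = draft.len() - n;
    println!("accepted {n}, emit {out:?}, roll back {rollback} KV slot(s)");
}
```

Because every emitted token equals what the target would have produced on its own, this acceptance rule is what makes greedy speculation lossless.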

So speculation is a real subsystem now, not an experiment. This roadmap is about growing it into a first-class, multi-method, multi-model one — without each method/model reimplementing the controller.

## Workstreams

Brackets = current lead / interested. Shout if you want to take or share one.

### Methods
- **DFlash** [@xiaguan] — shipped. Next: draft-side piecewise CUDA graph (the remaining ~16% of the decode launch gap).
- **EAGLE / EAGLE3** [@scatyf3, #325] — orthogonal draft method. We deliberately deferred the proposer-trait abstraction to land EAGLE cleanly, so this is the natural driver for the shared proposer interface below.
- **n-gram / prompt-lookup** [@wjinxu, #349] — no draft weights; great for surfacing exactly what the scheduler / paged KV / CUDA graph / sampling actually need from a speculator.
- **Qwen3.5 DFlash** [@CAICAIIs, #434] — extend DFlash to the qwen3.5 line.
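
To make the n-gram / prompt-lookup lane concrete, here is a rough sketch of the basic idea: match the last `n` context tokens against earlier context and propose whatever followed. The function name and parameters are hypothetical and not taken from #349.

```rust
/// Prompt-lookup proposal (illustrative): find the most recent earlier
/// occurrence of the last `n` tokens of `context` and propose up to `k`
/// tokens that followed it. Returns an empty draft when nothing matches.
fn ngram_propose(context: &[u32], n: usize, k: usize) -> Vec<u32> {
    if n == 0 || context.len() <= n {
        return Vec::new();
    }
    let suffix_start = context.len() - n;
    let suffix = &context[suffix_start..];
    // Scan candidate start positions from newest to oldest; the range
    // excludes the suffix's own position, so a match always has a successor.
    for start in (0..suffix_start).rev() {
        if &context[start..start + n] == suffix {
            let follow = start + n;
            let end = (follow + k).min(context.len());
            return context[follow..end].to_vec();
        }
    }
    Vec::new()
}

fn main() {
    let context = [1, 2, 3, 4, 1, 2];
    let draft = ngram_propose(&context, 2, 3);
    println!("draft = {draft:?}");
}
```

A linear scan like this is only for illustration; a real proposer would likely keep an index over the context rather than rescanning it every step.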

### Performance
- Draft-side piecewise CUDA graph. @kitty-eu-org's #439 has draft-forward kernels (non-causal dense/ragged-batch prefill attention, K-only norm+RoPE) that look directly reusable here — if they beat what's on `main`, worth landing as standalone kernel PRs.

### Capability & correctness
- **Speculative sampling.** Today speculation is greedy-only: a sampling (`temperature` / `top_k` / `top_p`) request correctly falls back to plain decode. That keeps the output distribution correct but gives no speedup, and one sampling request currently disables speculation for its whole batch. Lossless rejection-sampling acceptance (accept the draft token with probability `min(1, p_target/q_draft)`; on reject, resample from the normalized residual `(p_target − q_draft)₊`) would unlock the speedup for non-greedy requests. Lives in the verify controller.
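
A minimal single-position sketch of that acceptance rule, assuming dense probability vectors over the vocabulary and caller-supplied uniform random numbers in `[0, 1)`. All names here are illustrative; nothing below is the verify controller's actual API.

```rust
/// Accept draft token `x` with probability `min(1, p[x] / q[x])`.
/// Written as a product so there is no division by `q[x]`.
fn accept(p: &[f32], q: &[f32], x: usize, u: f32) -> bool {
    u * q[x] < p[x]
}

/// Normalized residual distribution `(p - q)+`.
fn residual(p: &[f32], q: &[f32]) -> Vec<f32> {
    let r: Vec<f32> = p.iter().zip(q).map(|(pi, qi)| (pi - qi).max(0.0)).collect();
    let z: f32 = r.iter().sum();
    // z == 0 only if p == q, where rejection cannot happen; fall back to p.
    if z > 0.0 {
        r.iter().map(|v| v / z).collect()
    } else {
        p.to_vec()
    }
}

/// Inverse-CDF sampling from `dist` using a uniform `u` in [0, 1).
fn sample(dist: &[f32], u: f32) -> usize {
    let mut acc = 0.0f32;
    for (i, w) in dist.iter().enumerate() {
        acc += *w;
        if u < acc {
            return i;
        }
    }
    dist.len() - 1 // guard against rounding in the running sum
}

/// Verify one draft token: returns (accepted?, token to emit).
fn verify_token(p: &[f32], q: &[f32], x: usize, u_accept: f32, u_resample: f32) -> (bool, usize) {
    if accept(p, q, x, u_accept) {
        (true, x)
    } else {
        (false, sample(&residual(p, q), u_resample))
    }
}

fn main() {
    let p = [0.5f32, 0.25, 0.25]; // target distribution
    let q = [0.25f32, 0.25, 0.5]; // draft distribution
    // Draft proposed token 2, which the target likes half as much as the draft.
    println!("{:?}", verify_token(&p, &q, 2, 0.25, 0.9));
    println!("{:?}", verify_token(&p, &q, 2, 0.75, 0.9));
}
```

Verification stops at the first rejected position, so the resampled token is the last one emitted for that step. Per-request acceptance along these lines is also what would let a sampling request stay in a speculative batch instead of disabling speculation for it.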

### Usability & metrics
- Acceptance-rate / draft-hit / speculative-token-throughput metrics, serving semantics, config ergonomics.

### Shared primitives
- Factor the reusable pieces — proposer trait, verify + accept, draft cache, KV rollback — into a shared module (the way sampling lives in a shared sampler), so EAGLE3 / n-gram / DFlash / per-model engines don't each reimplement the controller.
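
As a conversation starter for the shared module, here is one hypothetical shape for the proposer trait and a method-agnostic step built on it. The trait, its methods, and the toy `RepeatLast` proposer are all invented for illustration. The target is modeled as a closure returning a greedy token, and verification is done token by token, whereas the real verify is a single batched forward.

```rust
/// Hypothetical shared interface; names are illustrative only.
trait Proposer {
    /// Propose up to `max_tokens` draft tokens to follow `context`.
    fn propose(&mut self, context: &[u32], max_tokens: usize) -> Vec<u32>;
    /// Feedback hook: how many proposed tokens the target accepted.
    fn on_verified(&mut self, _accepted: usize) {}
}

/// Toy proposer that guesses the last token repeats.
struct RepeatLast;

impl Proposer for RepeatLast {
    fn propose(&mut self, context: &[u32], max_tokens: usize) -> Vec<u32> {
        match context.last() {
            Some(&t) => vec![t; max_tokens],
            None => Vec::new(),
        }
    }
}

/// One speculation step (greedy): propose, verify, accept.
/// `target_greedy` stands in for the target model's verify forward.
/// Appends the emitted tokens to `context` and returns the accepted count.
fn spec_step<P: Proposer>(
    proposer: &mut P,
    target_greedy: impl Fn(&[u32]) -> u32,
    context: &mut Vec<u32>,
    k: usize,
) -> usize {
    let draft = proposer.propose(context.as_slice(), k);
    let mut accepted = 0;
    for &d in &draft {
        let t = target_greedy(context.as_slice());
        context.push(t);
        if t != d {
            // First mismatch: the target's token was emitted instead; stop.
            proposer.on_verified(accepted);
            return accepted;
        }
        accepted += 1;
    }
    // Every draft token matched: emit one bonus token from the target.
    let bonus = target_greedy(context.as_slice());
    context.push(bonus);
    proposer.on_verified(accepted);
    accepted
}

fn main() {
    let mut context = vec![7];
    let accepted = spec_step(&mut RepeatLast, |_: &[u32]| 7, &mut context, 3);
    println!("accepted {accepted}, context {context:?}");
}
```

The point of the split is that `spec_step` never needs to know which method produced the draft, so EAGLE3, n-gram, and DFlash would differ only in their `Proposer` implementation.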

## How to plug in

Comment with the lane you want and we'll coordinate ownership here. The #439 timing overlap — a parallel draft crate that landed after the integration shipped — is exactly what this thread is meant to prevent: let's claim lanes before writing thousands of lines. 🙂

Related: #325 (origin), #349 (n-gram), #439 (draft crate), #434 (qwen3.5 DFlash).
