[Roadmap] Speculative decoding (qwen3 first, shared primitives) #443

Description

@xiaguan

Consolidating the speculative-decoding threads into one place so we coordinate instead of colliding. Grew out of #325. cc @wjinxu @scatyf3 @kitty-eu-org @CAICAIIs — please holler on / claim the lanes you want.

Where we are

DFlash speculative decoding has shipped on main (#436 + #442). The full loop is in place behind --dflash-draft-model-path: draft model loading → batched draft forward → target "verify" forward (piecewise CUDA graph) → acceptance → KV rollback → OpenAI serving. It is greedy-lossless (lm-eval gsm8k strict-match is identical with speculation on and off), validated on 5070 Ti / 5090, and matches or beats vLLM dflash across c1/c8/c16. Design notes: docs/models/qwen3/dflash-speculative-decoding.md.
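For anyone new to the thread, here is a minimal sketch of why the greedy path can be lossless. It is not the code on main: `greedy_accept` and `rollback_kv` are hypothetical names, and the KV cache is modelled as a plain `Vec` with one entry per cached position. The assumption is that the verify forward returns the target's greedy token for every draft position plus one extra position after the draft.

```rust
/// Greedy acceptance: keep the longest prefix of `draft` that matches the
/// target's own greedy choices, then append one token from the target.
///
/// `target_argmax[i]` is assumed to be the target's greedy token at the
/// position where `draft[i]` was proposed. The final entry is the position
/// after the whole draft, so the slice has `draft.len() + 1` entries.
fn greedy_accept(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    // Count how many leading draft tokens the target agrees with.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // Emit the accepted prefix plus the target's token at the first
    // mismatch (or the extra token if every draft token matched).
    let mut out = draft[..accepted].to_vec();
    out.push(target_argmax[accepted]);
    out
}

/// Hypothetical KV rollback: drop the cache entries that were written for
/// rejected draft tokens, keeping only the first `committed_len` positions.
fn rollback_kv<T>(kv: &mut Vec<T>, committed_len: usize) {
    kv.truncate(committed_len);
}
```

Every emitted token is one the target would have picked greedily given the same prefix, so the output matches plain greedy decode and only the number of target forwards changes.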

So speculation is a real subsystem now, not an experiment. This roadmap is about growing it into a first-class, multi-method, multi-model one — without each method/model reimplementing the controller.

Workstreams

Brackets = current lead / interested. Shout if you want to take or share one.

Methods

Performance

Capability & correctness

  • Speculative sampling. Today speculation is greedy-only. A sampling request (temperature / top_k / top_p) correctly falls back to plain decode, so the output distribution is never wrong, but there is no speedup, and a single sampling request currently disables speculation for its whole batch. Lossless rejection-sampling acceptance would unlock the speedup for non-greedy requests: accept a drafted token x with probability min(1, p_target(x)/q_draft(x)), and on reject resample from the normalized residual (p_target − q_draft)₊. Lives in the verify controller.
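A rough sketch of the per-token acceptance rule, to make the proposal concrete. Everything here is illustrative: `verify_token` and `Verdict` are hypothetical names, probabilities are plain `f64` slices over a tiny vocabulary, and the two uniform draws are passed in by the caller so the function stays deterministic and free of RNG dependencies.

```rust
/// Outcome of verifying one drafted token.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// The drafted token is kept.
    Accept(usize),
    /// The drafted token is rejected and replaced by this token.
    Resample(usize),
}

/// Rejection-sampling acceptance for a single position.
///
/// `p` = target probabilities, `q` = draft probabilities (same vocabulary),
/// `x` = the drafted token, assumed to have been sampled from `q`.
/// `u_accept` and `u_resample` are uniform draws in [0, 1).
fn verify_token(p: &[f64], q: &[f64], x: usize, u_accept: f64, u_resample: f64) -> Verdict {
    assert_eq!(p.len(), q.len());
    assert!(q[x] > 0.0, "x must be a token the draft could have sampled");

    // Accept with probability min(1, p(x) / q(x)).
    let ratio = (p[x] / q[x]).min(1.0);
    if u_accept < ratio {
        return Verdict::Accept(x);
    }

    // Rejected: sample from the residual (p - q)+ by inverse CDF.
    // A rejection implies p[x] < q[x], so some other token has p > q
    // and the residual mass is strictly positive.
    let residual: Vec<f64> = p.iter().zip(q).map(|(pi, qi)| (pi - qi).max(0.0)).collect();
    let total: f64 = residual.iter().sum();
    let mut threshold = u_resample * total;
    for (i, r) in residual.iter().enumerate() {
        if threshold < *r {
            return Verdict::Resample(i);
        }
        threshold -= *r;
    }
    // Floating-point fallthrough: use the last token with residual mass.
    let last = residual
        .iter()
        .rposition(|r| *r > 0.0)
        .expect("residual has no mass");
    Verdict::Resample(last)
}
```

In a full verify pass this would run left to right over the draft and stop at the first `Resample`, with the KV entries after that position rolled back. Temperature / top_k / top_p would have to be applied to both distributions before the comparison for the result to stay lossless.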

Usability & metrics

  • Acceptance-rate / draft-hit / speculative-token-throughput metrics, serving semantics, config ergonomics.
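As a starting point for discussion, the counters could look roughly like the sketch below. `SpecMetrics` and its fields are hypothetical names, not an existing type in the repo, and it assumes each verify step emits the accepted draft tokens plus exactly one token from the target.

```rust
/// Hypothetical per-request (or per-engine) speculation counters.
#[derive(Debug, Default)]
struct SpecMetrics {
    /// Number of verify forwards run.
    steps: u64,
    /// Draft tokens proposed.
    drafted: u64,
    /// Draft tokens the target accepted.
    accepted: u64,
    /// Tokens actually emitted (accepted drafts + one target token per step).
    emitted: u64,
}

impl SpecMetrics {
    /// Record the outcome of one verify step.
    fn record_step(&mut self, drafted: u64, accepted: u64) {
        assert!(accepted <= drafted);
        self.steps += 1;
        self.drafted += drafted;
        self.accepted += accepted;
        self.emitted += accepted + 1;
    }

    /// Fraction of drafted tokens that were accepted, if any were drafted.
    fn acceptance_rate(&self) -> Option<f64> {
        if self.drafted == 0 {
            None
        } else {
            Some(self.accepted as f64 / self.drafted as f64)
        }
    }

    /// Mean tokens emitted per verify forward, if any step ran.
    fn tokens_per_step(&self) -> Option<f64> {
        if self.steps == 0 {
            None
        } else {
            Some(self.emitted as f64 / self.steps as f64)
        }
    }
}
```

Tokens per step is the number that maps most directly to speedup, since plain decode sits at exactly 1.0 on this scale.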

Shared primitives

  • Factor the reusable pieces — proposer trait, verify + accept, draft cache, KV rollback — into a shared module (the way sampling lives in a shared sampler), so EAGLE3 / n-gram / DFlash / per-model engines don't each reimplement the controller.
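To make the shape of that split concrete, here is a toy sketch, not a proposed API. `Proposer`, `NGramProposer`, `NoDraft`, and `spec_step` are all hypothetical names, and the target's verify forward is stood in for by a closure that returns one greedy token per draft position plus one more.

```rust
/// Hypothetical shared interface: anything that can guess the next `k` tokens.
trait Proposer {
    fn propose(&mut self, context: &[u32], k: usize) -> Vec<u32>;
}

/// Toy n-gram proposer: find the most recent earlier occurrence of the last
/// `n` tokens and copy up to `k` tokens that followed it.
struct NGramProposer {
    n: usize,
}

impl Proposer for NGramProposer {
    fn propose(&mut self, context: &[u32], k: usize) -> Vec<u32> {
        let n = self.n;
        if n == 0 || context.len() <= n {
            return Vec::new();
        }
        let suffix = &context[context.len() - n..];
        // Scan start positions from newest to oldest, excluding the suffix itself.
        for start in (0..context.len() - n).rev() {
            if &context[start..start + n] == suffix {
                let from = start + n;
                let to = (from + k).min(context.len());
                return context[from..to].to_vec();
            }
        }
        Vec::new()
    }
}

/// Proposer that never drafts, which degrades the step to plain decode.
struct NoDraft;

impl Proposer for NoDraft {
    fn propose(&mut self, _context: &[u32], _k: usize) -> Vec<u32> {
        Vec::new()
    }
}

/// One method-agnostic step: propose, verify, accept greedily, commit.
/// `target_greedy(context, draft)` must return `draft.len() + 1` tokens.
/// Returns the number of accepted draft tokens.
fn spec_step<P, F>(proposer: &mut P, context: &mut Vec<u32>, k: usize, target_greedy: F) -> usize
where
    P: Proposer,
    F: Fn(&[u32], &[u32]) -> Vec<u32>,
{
    let draft = proposer.propose(context.as_slice(), k);
    let target = target_greedy(context.as_slice(), draft.as_slice());
    assert_eq!(target.len(), draft.len() + 1);
    // Longest prefix of the draft that the target agrees with.
    let accepted = draft.iter().zip(&target).take_while(|(d, t)| d == t).count();
    context.extend_from_slice(&draft[..accepted]);
    context.push(target[accepted]);
    accepted
}
```

The point of the sketch is that `spec_step` never needs to know which method produced the draft, so EAGLE3, n-gram, and DFlash would differ only in their `Proposer` implementation.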

How to plug in

Comment with the lane you want and we'll coordinate ownership here. The #439 timing overlap — a parallel draft crate that landed after the integration shipped — is exactly what this thread is meant to prevent: let's claim lanes before writing thousands of lines. 🙂

Related: #325 (origin), #349 (n-gram), #439 (draft crate), #434 (qwen3.5 DFlash).

Metadata

Labels: qwen3 (Qwen3-4B model crate, pegainfer-qwen3-4b), roadmap
