pflash + dflash optimization on top of qwen35moe (PR #262) #280

@davide221

Context

PR #262 (howard0su) lands hybrid CPU/CUDA expert placement for the qwen35moe arch. It works end-to-end, greedy decode is correct, and perf beats the author's table on RTX 3090 (2.5-3× across the budget sweep). Validated on lucebox2 with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf.

Bench (essay prompt, 400 gen tokens, t=0, RTX 3090 + Strix Halo CPU):

| Budget | Hot/Cold | VRAM | Decode tok/s |
|---|---|---|---|
| 1000 MB | 549/9691 | 3.4 GB | 29.5 |
| 5000 MB | 2756/7484 | 7.4 GB | 38.6 |
| 9000 MB | 4963/5277 | 11.4 GB | 40.5 |
| 12000 MB | 6618/3622 | 14.4 GB | 44.5 |
| 15000 MB | 8274/1966 | 17.4 GB | 48.2 |
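
As a quick sanity check on the table, the sketch below derives a few quantities from its rows. It assumes the Hot/Cold column counts expert slots placed on GPU vs CPU and that the budget is in decimal MB (1000 MB = 1.0 GB); both readings are assumptions, and `summarize` is a hypothetical helper, not part of PR #262.

```python
# Rows copied from the bench table:
# (budget_mb, hot, cold, vram_gb, decode_tok_s)
ROWS = [
    (1000, 549, 9691, 3.4, 29.5),
    (5000, 2756, 7484, 7.4, 38.6),
    (9000, 4963, 5277, 11.4, 40.5),
    (12000, 6618, 3622, 14.4, 44.5),
    (15000, 8274, 1966, 17.4, 48.2),
]


def summarize(rows):
    """Derive per-row quantities under the assumptions stated above."""
    out = []
    for budget_mb, hot, cold, vram_gb, tok_s in rows:
        total = hot + cold
        out.append({
            "budget_mb": budget_mb,
            "total_experts": total,                 # hot + cold slots
            "hot_fraction": hot / total,            # share resident on GPU
            "mb_per_hot_expert": budget_mb / hot,   # budget spent per hot slot
            "vram_overhead_gb": vram_gb - budget_mb / 1000,  # VRAM beyond budget
            "tok_s": tok_s,
        })
    return out


if __name__ == "__main__":
    for r in summarize(ROWS):
        print(
            f"{r['budget_mb']:>6} MB  hot={r['hot_fraction']:.1%}  "
            f"{r['mb_per_hot_expert']:.2f} MB/expert  "
            f"overhead={r['vram_overhead_gb']:.1f} GB  {r['tok_s']} tok/s"
        )
```

Under that reading, every row sums to 10240 slots and the VRAM column sits about 2.4 GB above the budget on each row, which suggests a fixed non-expert footprint (weights, KV cache, buffers) independent of the budget.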

But two perf paths are not yet wired into qwen35moe:

  1. DFlash spec-decode with MoE target. PR #262 (Split MoE weights between CPU & CUDA, support qwen35moe models) includes commit 7965190 (feat: hybrid MoE spec-decode with DFlash draft model), but when tested against dflash-draft-3.6-q8_0.gguf (dense 27B draft) the daemon hangs at chat CACHE; a draft/target arch mismatch is suspected. Needs a draft model matched to qwen35moe (or a verifier path that tolerates the mismatch).
  2. PFlash prefill compression on the MoE forward path. PR #262 does not touch pflash. Hybrid prefill compute is already partly CPU-bound on cold experts; PFlash compression could cut prompt-side cost further, but it needs validation that it composes with the hybrid storage layer.

Goals

  • DFlash spec-decode end-to-end on qwen35moe (no hang, ≥1.5× decode on a long-form prompt)
  • PFlash composition with hybrid expert placement (correctness + perf vs no-PFlash baseline)
  • Sweep budget × spec-decode on/off × pflash on/off; identify the dominant configuration for lucebox shipping
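
The sweep in the last goal can be sketched as a plain configuration grid. This is a minimal sketch: the budget values are taken from the bench table above, while the dictionary keys, `sweep_configs`, and `dominant` are hypothetical names, and ranking by decode tok/s alone is an assumption about what "dominant" means here.

```python
from itertools import product

# Budgets (MB) reused from the bench table; extend as needed.
BUDGETS_MB = [1000, 5000, 9000, 12000, 15000]


def sweep_configs(budgets=BUDGETS_MB):
    """Enumerate budget x spec-decode on/off x pflash on/off."""
    return [
        {"budget_mb": b, "spec_decode": s, "pflash": p}
        for b, s, p in product(budgets, (False, True), (False, True))
    ]


def dominant(results):
    """Pick the config with the highest decode tok/s.

    `results` is a list of (config, decode_tok_s) pairs measured elsewhere.
    """
    return max(results, key=lambda r: r[1])[0]


if __name__ == "__main__":
    configs = sweep_configs()
    print(len(configs), "configurations")
    for cfg in configs[:4]:
        print(cfg)
```

With five budgets this yields 20 runs; the measured tok/s for each run would come from the actual bench harness, which this sketch does not attempt to invoke.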

Acceptance

  • test_dflash passes with arch=qwen35moe + --draft <matched_draft> (no chat CACHE hang)
  • bench_he or luce-bench --area ds4-eval ≥ baseline on the matched config
  • Document the recommended (budget, draft, pflash) combo for lucebox appliance shipping
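
One way to make the numeric criteria mechanical is a small check like the one below. It is a sketch under assumptions: it combines the ≥1.5× decode goal with the "≥ baseline" bench criterion, treats the bench result as a single higher-is-better number, and `meets_acceptance` is a hypothetical helper rather than anything in the repo.

```python
def meets_acceptance(baseline_tok_s, candidate_tok_s,
                     baseline_score, candidate_score,
                     min_speedup=1.5):
    """Return (ok, speedup) for a candidate config against its baseline.

    ok is True when decode speedup reaches min_speedup and the bench
    score does not regress below the baseline score.
    """
    if baseline_tok_s <= 0:
        raise ValueError("baseline_tok_s must be positive")
    speedup = candidate_tok_s / baseline_tok_s
    ok = speedup >= min_speedup and candidate_score >= baseline_score
    return ok, speedup


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    ok, speedup = meets_acceptance(40.0, 64.0, 0.80, 0.80)
    print(f"speedup={speedup:.2f}x ok={ok}")
```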

Refs
