pflash + dflash optimization on top of qwen35moe (PR #262) #280

## Context

PR #262 (howard0su) lands hybrid CPU/CUDA expert placement for the `qwen35moe` arch. It works end to end, greedy decode is correct, and perf beats the author's table on RTX 3090 by 2.5-3× across the budget sweep. Validated on lucebox2 with `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`.

Bench (essay prompt, 400 gen tokens, t=0, RTX 3090 + Strix Halo CPU):

| Budget | Hot/Cold | VRAM | Decode tok/s |
|---|---|---|---|
| 1000 MB | 549/9691 | 3.4 GB | 29.5 |
| 5000 MB | 2756/7484 | 7.4 GB | 38.6 |
| 9000 MB | 4963/5277 | 11.4 GB | 40.5 |
| 12000 MB | 6618/3622 | 14.4 GB | 44.5 |
| 15000 MB | 8274/1966 | 17.4 GB | 48.2 |
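
As a quick sanity check on the table, the sketch below recomputes two invariants from the rows above: the hot + cold expert count per row, and the gap between the VRAM column and the budget. The row data is copied from the table. The helper name `summarize`, the decimal MB-to-GB conversion, and the reading of the constant gap as fixed non-expert overhead are assumptions for illustration, not something PR #262 states.

```python
# Rows copied from the bench table:
# (budget_mb, hot_experts, cold_experts, vram_gb, decode_tok_s)
ROWS = [
    (1000, 549, 9691, 3.4, 29.5),
    (5000, 2756, 7484, 7.4, 38.6),
    (9000, 4963, 5277, 11.4, 40.5),
    (12000, 6618, 3622, 14.4, 44.5),
    (15000, 8274, 1966, 17.4, 48.2),
]


def summarize(rows):
    """Return (distinct hot+cold totals, distinct VRAM-minus-budget gaps, speedups).

    Assumes 1 GB = 1000 MB when comparing the budget to the VRAM column.
    Speedups are relative to the first (smallest-budget) row.
    """
    totals = {hot + cold for _, hot, cold, _, _ in rows}
    overheads = {round(vram - budget / 1000, 1) for budget, _, _, vram, _ in rows}
    base = rows[0][4]
    speedups = [round(tok / base, 2) for *_, tok in rows]
    return totals, overheads, speedups


totals, overheads, speedups = summarize(ROWS)
print("expert totals:", totals)
print("VRAM - budget (GB):", overheads)
print("decode speedup vs 1000 MB:", speedups)
```

Every row sums to 10240 experts, and the VRAM column is the budget plus 2.4 GB in each row, so the table is internally consistent. Going from the 1000 MB to the 15000 MB budget improves decode throughput by about 1.63×.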

Two perf paths are not yet wired into `qwen35moe`:

1. **DFlash spec-decode** with MoE target. PR #262 includes commit `7965190 feat: hybrid MoE spec-decode with DFlash draft model`, but when tested against `dflash-draft-3.6-q8_0.gguf` (dense 27B draft) the daemon hangs at `chat CACHE`; a draft/target arch mismatch is suspected. Needs a draft model matched to `qwen35moe` (or a verifier path that tolerates the mismatch).
2. **PFlash prefill compression** on the MoE forward path. PR #262 does not touch pflash. Hybrid prefill compute is already partly CPU-bound on cold experts; PFlash compression could cut prompt-side cost further, but it needs validation that it composes with the hybrid storage layer.

## Goals

- [ ] DFlash spec-decode end-to-end on `qwen35moe` (no hang, ≥1.5× decode on a long-form prompt)
- [ ] PFlash composition with hybrid expert placement (correctness + perf vs no-PFlash baseline)
- [ ] Sweep budget × spec-decode on/off × pflash on/off; identify the dominant configuration for lucebox shipping
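
The sweep in the last goal can be sketched as a plain config matrix. This is a minimal sketch, not the project's bench harness: the budgets are taken from the bench table, while the dict keys, the helper names (`sweep_configs`, `meets_speedup`, `best_config`), and the choice of decode tok/s as the ranking metric are hypothetical.

```python
from itertools import product

# Budgets (MB) taken from the bench table in Context.
BUDGETS_MB = [1000, 5000, 9000, 12000, 15000]


def sweep_configs(budgets=BUDGETS_MB):
    """Yield one dict per (budget, spec-decode, pflash) combination."""
    for budget, spec, pflash in product(budgets, (False, True), (False, True)):
        yield {"budget_mb": budget, "spec_decode": spec, "pflash": pflash}


def meets_speedup(baseline_tok_s, candidate_tok_s, target=1.5):
    """True if the candidate reaches at least `target` times the baseline."""
    return candidate_tok_s >= target * baseline_tok_s


def best_config(results):
    """Pick the key with the highest decode tok/s.

    `results` maps (budget_mb, spec_decode, pflash) tuples to measured tok/s.
    Ranking by decode throughput alone is an assumption.
    """
    return max(results, key=results.get)


configs = list(sweep_configs())
print(len(configs), "configurations to run")
```

With five budgets and two on/off toggles the matrix has 20 cells. The measured numbers still have to come from real runs; `best_config` only ranks whatever results are fed into it.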

## Acceptance

- `test_dflash` passes with `arch=qwen35moe + --draft <matched_draft>` (no `chat CACHE` hang)
- `bench_he` or `luce-bench --area ds4-eval` ≥ baseline on the matched config
- Document the recommended `(budget, draft, pflash)` combo for lucebox appliance shipping

## Refs

- PR #262 (qwen35moe + hybrid placement)
- Memory: `project_megaqwen3_27b_dflash`, `project_dflash_pflash_inproc`, `project_lucebox_product`
