feat(qwen35moe): pipelined hybrid MoE decode with GPU/CPU overlap by howard0su · Pull Request #289 · Luce-Org/lucebox-hub

howard0su · 2026-05-28T00:02:08Z

Implement pipelined decode path that caches DeltaNet pre-FFN graphs and enables true GPU/CPU overlap for hot/cold expert computation:

- Cache 30/40 DeltaNet layer graphs (position-independent recurrent state)
- Move ffn_post readback before hot graph launch to avoid serialization
- Integrate pipelined path into both run_ar_decode_path and generate() AR fallback
- Add persistent PipelinedDecodeState to avoid per-request alloc/free
- Remove dead process_one_token code from generate()

Benchmark results (RTX 2080 Ti, Qwen3.6-35B-A3B Q4_K_M, 60% hot):

- Realistic placement: 46.6 ms/tok (vs 43.0 all-GPU, only +8%)
- Worst-case (all cold): 81.4 ms/tok (vs 90.7 old hybrid, -10%)
- Saves ~8 GiB VRAM vs all-GPU while maintaining near-parity speed
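As a rough illustration of the hot/cold split the description refers to, the sketch below partitions one token's routed expert ids into a list assumed to be GPU-resident ("hot") and a list assumed to live in host memory ("cold"). The names `ExpertSplit`, `split_by_residency`, and `is_hot` are hypothetical and do not come from the PR. The actual pipelined path works on ggml graphs and overlaps the two halves, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: routed expert ids split by where their weights live.
struct ExpertSplit {
    std::vector<int> hot;   // ids assumed resident in VRAM (GPU side)
    std::vector<int> cold;  // ids assumed kept in host memory (CPU side)
};

// Partitions the selected expert ids using a per-expert residency table.
// `is_hot[id]` is an assumed lookup; the PR does not describe its real form.
inline ExpertSplit split_by_residency(const std::vector<int>& selected,
                                      const std::vector<bool>& is_hot) {
    ExpertSplit out;
    for (int id : selected) {
        assert(id >= 0 && static_cast<std::size_t>(id) < is_hot.size());
        (is_hot[id] ? out.hot : out.cold).push_back(id);
    }
    return out;
}
```

With a split like this, the hot list could be submitted to the GPU while the CPU works through the cold list, and the two partial outputs summed afterwards.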

Record the 2026-05-28 01:36 unattended refresh: upstream/main remains unchanged, new PR Luce-Org#289 is draft-only, and fresh direct-merge probes for all non-ancestor non-draft PRs still conflict in an isolated worktree.

Integrate howard0su/pipeline_moe over the current auto-integration stack. Resolve the Qwen35 MoE AR decode signature conflict by preserving the current thinking-budget hook API while routing hybrid MoE generation through the new pipelined decode path. Keep the existing accessible llama.cpp submodule commit because the PR submodule pointer is not fetchable from the configured submodule remote.

Document PR Luce-Org#289 integration, the inaccessible submodule pointer decision, refreshed direct-merge probes, and validation outcomes for the unattended auto-integration run.

cubic-dev-ai

3 issues found across 9 files

Integrates the latest non-draft contributor update from PR Luce-Org#289 on top of the maintained auto-integration stack. Normalizes selected expert weights, guards the expert weight scale default, and removes a noisy pipelined decode init printf.

Record the PR Luce-Org#289 refresh, current PR classification, fresh direct conflict probes, delegated PR Luce-Org#221 feasibility attempt, validation, and retained worktree paths for this unattended run.

- P1: Delete the implicit copy operations on resource-owning structs (CachedPrefnGraph, ResidualCombineGraph, GpuResidentState, PipelinedDecodeState) to prevent accidental double-free of ggml/GPU resources. Add explicit move ops that null out the source.
- P3: Hoist the bf16_buf allocation and the ggml_fp32_to_bf16_row conversion out of the n_capture_layers loop, since all iterations convert the same act_cur data.
- P2 (expert_weights_scale): No change needed. Our condition (!=0.0f && !=1.0f) already matches llama.cpp llama-graph.cpp:1413 exactly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
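The P1 item follows a common C++ ownership pattern. The sketch below shows it on a hypothetical `OwnedBuffer` that owns a `malloc` allocation. It stands in for the ggml/GPU handles held by the real structs, whose members are not shown in this PR page.

```cpp
#include <cassert>
#include <cstdlib>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a struct that owns a ggml/GPU resource.
struct OwnedBuffer {
    void* data = nullptr;

    OwnedBuffer() = default;
    explicit OwnedBuffer(std::size_t n) : data(std::malloc(n)) {}
    ~OwnedBuffer() { std::free(data); }  // free(nullptr) is a no-op

    // Copying would leave two owners of one allocation (double-free).
    OwnedBuffer(const OwnedBuffer&) = delete;
    OwnedBuffer& operator=(const OwnedBuffer&) = delete;

    // Moving transfers ownership and nulls out the source.
    OwnedBuffer(OwnedBuffer&& other) noexcept
        : data(std::exchange(other.data, nullptr)) {}
    OwnedBuffer& operator=(OwnedBuffer&& other) noexcept {
        if (this != &other) {
            std::free(data);
            data = std::exchange(other.data, nullptr);
        }
        return *this;
    }
};

static_assert(!std::is_copy_constructible<OwnedBuffer>::value,
              "resource-owning struct must not be copyable");
```

Because the moved-from object holds a null pointer, its destructor releases nothing, so each allocation is freed exactly once.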

Record PR Luce-Org#289 refresh to 0ffab8a, fresh direct conflict probes for the remaining non-ancestor contributor PRs, and tmux-driven PR Luce-Org#237 delegation outcomes.

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

When the hybrid MoE pipeline fills unused routing slots with dummy entries, all dummies previously pointed to expert 0. This created a pathological imbalance (e.g. 69/72 rows for one expert) that triggered an out-of-bounds access in the CUDA MMQ stream-k kernel path during down-projection.

Distribute dummy slot IDs evenly across all experts in the hot/cold stacks (i % n_experts) so no single expert accumulates excessive dummy rows. The dummy weight remains 0.0 so these rows contribute nothing to output.

Also adds:
- pipe_state_.reset() between requests to avoid stale DeltaNet graph pointers
- RAII destructors for ResidualCombineGraph, GpuResidentState, CachedPrefnGraph, and PipelinedDecodeState to prevent resource leaks

Tested: 10/10 requests pass on RTX 2080 Ti at ~15.5 tok/s decode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
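The round-robin fill described above can be sketched as follows. `RoutingSlots` and `pad_with_dummies` are hypothetical names chosen for illustration. Only the `i % n_experts` rule and the 0.0 dummy weight come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat view of one stack's routing slots.
struct RoutingSlots {
    std::vector<int>   ids;      // expert id per slot
    std::vector<float> weights;  // gate weight per slot
};

// Pads `slots` up to `n_slots` entries. The i-th dummy is assigned to expert
// (i % n_experts) instead of always expert 0, and carries weight 0.0 so it
// adds nothing to the output.
inline void pad_with_dummies(RoutingSlots& slots, std::size_t n_slots, int n_experts) {
    assert(n_experts > 0);
    assert(slots.ids.size() == slots.weights.size());
    int i = 0;
    while (slots.ids.size() < n_slots) {
        slots.ids.push_back(i % n_experts);
        slots.weights.push_back(0.0f);
        ++i;
    }
}
```

With 8 dummy slots and 4 experts, each expert receives 2 dummy rows rather than one expert receiving all 8.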

…rruption

The per-layer CachedFfnGraph (hot_graph/cold_graph) allocated by the pipelined decode path persisted across requests. After multiple requests with thousands of decode tokens, accumulated GPU allocations caused memory corruption when prefill tried to allocate its own graph buffers.

Freeing these cached graphs between requests releases the GPU memory before prefill runs. The graphs are rebuilt cheaply on first decode of each new request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When DFLASH_QWEN35MOE_TELEMETRY=1, the pipelined decode loop now collects and prints a per-token breakdown:
- prefn_build: DeltaNet graph setup time
- prefn_compute: GPU pre-FFN compute time
- routing_readback: GPU→CPU routing decision transfer
- ffn: hybrid MoE FFN (split into allhot/mixed)

Also removes leftover debug CUDA sync checks from the prefill path and includes the mmq.cu ids_dst padding fix (submodule update).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
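A minimal sketch of how an opt-in per-phase breakdown like this could be gathered is shown below. `telemetry_enabled` and `PhaseTimer` are hypothetical helpers, not the PR's code. Only the environment variable name and the phase names come from the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>
#include <map>
#include <string>

// Returns true only when the variable's value is exactly "1".
// The caller passes the raw value, e.g. std::getenv("DFLASH_QWEN35MOE_TELEMETRY").
inline bool telemetry_enabled(const char* env_value) {
    return env_value != nullptr && std::strcmp(env_value, "1") == 0;
}

// Accumulates wall-clock milliseconds per named phase (e.g. "prefn_build").
struct PhaseTimer {
    std::map<std::string, double> ms;

    template <typename Fn>
    void time(const std::string& phase, Fn&& fn) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        ms[phase] += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
};
```

Note that a host-side timer around an asynchronous GPU launch measures only submission time unless the phase ends with a synchronization, so GPU phases need a sync point or device-side events to be meaningful.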

The MoE router was not normalizing top-k expert weights after selection. With softmax gating over 256 experts but only top-8 selected, the weights summed to ~0.03-0.05 instead of 1.0, causing systematically underscaled FFN output across all 40 layers. This produced accumulating errors that made even simple arithmetic wrong (e.g. 7+8=11 instead of 15).

Fix: always normalize selected weights by their sum with a clamp to avoid division by zero, matching llama.cpp's norm_w=true behavior for qwen35moe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
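The normalization step can be sketched on the CPU as below. The function name `normalize_selected_weights` and the clamp value `1e-6f` are illustrative assumptions. The commit message states only that the sum is clamped before dividing.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Rescales the selected top-k gate weights in place so they sum to 1.
// `min_sum` is an assumed lower clamp that keeps the division well defined
// when every selected weight is (near) zero.
inline void normalize_selected_weights(std::vector<float>& w, float min_sum = 1e-6f) {
    assert(min_sum > 0.0f);
    float sum = std::accumulate(w.begin(), w.end(), 0.0f);
    sum = std::max(sum, min_sum);
    for (float& x : w) {
        x /= sum;
    }
}
```

For example, eight selected weights that sum to 0.04 are each scaled by 25x, which restores a total of 1.0 while preserving their relative proportions.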

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The routed-expert mul_mat_id MMQ kernel writes out of bounds on Ampere when the per-call token count exceeds ~8: the expert token distribution overshoots the destination tiles on the need_check=false write path. This silently corrupts neighbouring GPU allocations during prefill and crashes with a CUDA illegal memory access at a later decode synchronize (~4th request under the server, in the forced hot/cold split path).

Sub-batch the hybrid FFN to 8 tokens per eval_qwen35moe_hybrid_ffn_batched call so the attention prefill can stay at the full chunk size.

Verified on an RTX 3090 (24 GiB) forcing a 60/40 hot/cold split via DFLASH_EXPERT_BUDGET_MB=11000: all 10 HumanEval prompts complete and the server stays up (previously crashed at request 4). compute-sanitizer memcheck confirms the OOB write originates in the routed mul_mat_id (mul_mat_q<Q5_K, ..., need_check=false>) inside eval_qwen35moe_hybrid_ffn_batched.

Co-Authored-By: WOZCODE <contact@withwoz.com>
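The sub-batching workaround amounts to a chunked loop over the prefill tokens. In the sketch below, `eval_ffn` is a hypothetical callback standing in for eval_qwen35moe_hybrid_ffn_batched, whose real signature is not shown in this PR page. The limit of 8 tokens per call is the value given in the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

constexpr int kFfnSubBatch = 8;  // tokens per FFN call, per the commit message

// Calls `eval_ffn(start, count)` over [0, n_tokens) in chunks of at most
// `sub_batch` tokens and returns the number of calls made.
inline int run_ffn_in_sub_batches(int n_tokens,
                                  const std::function<void(int, int)>& eval_ffn,
                                  int sub_batch = kFfnSubBatch) {
    assert(sub_batch > 0);
    int calls = 0;
    for (int start = 0; start < n_tokens; start += sub_batch) {
        const int count = std::min(sub_batch, n_tokens - start);
        eval_ffn(start, count);
        ++calls;
    }
    return calls;
}
```

A 20-token chunk would be evaluated as three FFN calls of 8, 8, and 4 tokens, while attention can still process all 20 tokens at once.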

howard0su marked this pull request as ready for review May 28, 2026 09:52

howard0su force-pushed the pipeline_moe branch from 85de8d3 to 593266a Compare May 28, 2026 09:52

howard0su force-pushed the pipeline_moe branch from 593266a to 4933ce7 Compare May 28, 2026 11:19

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026

docs: refresh auto-integration manifest

8033b09

Record the clean Luce-Org#289 head update, fresh direct merge probes for the remaining non-ancestor PRs, and this run's validation results.

cubic-dev-ai Bot reviewed May 28, 2026

Comment thread server/src/qwen35moe/qwen35moe_pipelined_decode.h

Comment thread server/src/qwen35moe/qwen35moe_ffn.cpp

Comment thread server/src/qwen35moe/qwen35moe_backend.cpp Outdated

cubic-dev-ai Bot reviewed May 29, 2026

Comment thread server/src/qwen35moe/qwen35moe_backend.cpp

howard0su and others added 8 commits May 30, 2026 21:29

chore: remove noisy pipelined init log

c45dd59

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

howard0su force-pushed the pipeline_moe branch from 27bad6d to caf2b11 Compare May 30, 2026 13:33

howard0su wants to merge 8 commits into Luce-Org:main from howard0su:pipeline_moe
