feat(qwen35moe): pipelined hybrid MoE decode with GPU/CPU overlap #289
Open
howard0su wants to merge 8 commits into
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Record the 2026-05-28 01:36 unattended refresh: upstream/main remains unchanged, new PR Luce-Org#289 is draft-only, and fresh direct-merge probes for all non-ancestor non-draft PRs still conflict in an isolated worktree.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Integrate howard0su/pipeline_moe over the current auto-integration stack. Resolve the Qwen35 MoE AR decode signature conflict by preserving the current thinking-budget hook API while routing hybrid MoE generation through the new pipelined decode path. Keep the existing accessible llama.cpp submodule commit because the PR submodule pointer is not fetchable from the configured submodule remote.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Document PR Luce-Org#289 integration, the inaccessible submodule pointer decision, refreshed direct-merge probes, and validation outcomes for the unattended auto-integration run.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Record the clean Luce-Org#289 head update, fresh direct merge probes for the remaining non-ancestor PRs, and this run's validation results.
Contributor
3 issues found across 9 files
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Integrates the latest non-draft contributor update from PR Luce-Org#289 on top of the maintained auto-integration stack. Normalizes selected expert weights, guards the expert weight scale default, and removes a noisy pipelined decode init printf.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Record the PR Luce-Org#289 refresh, current PR classification, fresh direct conflict probes, delegated PR Luce-Org#221 feasibility attempt, validation, and retained worktree paths for this unattended run.
howard0su added a commit to howard0su/lucebox-hub that referenced this pull request on May 28, 2026
- P1: Delete implicit copy on resource-owning structs (CachedPrefnGraph, ResidualCombineGraph, GpuResidentState, PipelinedDecodeState) to prevent accidental double-free of ggml/GPU resources. Add explicit move ops that null out the source.
- P3: Hoist bf16_buf allocation and ggml_fp32_to_bf16_row conversion outside the n_capture_layers loop, since all iterations convert the same act_cur data.
- P2 (expert_weights_scale): No change needed. Our condition (!=0.0f && !=1.0f) already matches llama.cpp llama-graph.cpp:1413 exactly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
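The P1 item (deleted copy, explicit move that nulls the source) can be illustrated with a minimal sketch. `OwnedBuffer` is a hypothetical stand-in, not one of the PR's structs: it owns a `malloc`'d pointer where the real structs own ggml/GPU handles, and the member names are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a resource-owning struct; the real structs hold
// ggml/GPU handles rather than a malloc'd pointer.
struct OwnedBuffer {
    void *      data = nullptr;
    std::size_t size = 0;

    OwnedBuffer() = default;
    explicit OwnedBuffer(std::size_t n) : data(std::malloc(n)), size(n) {}
    ~OwnedBuffer() { reset(); }

    // Copying would leave two owners freeing the same pointer, so forbid it.
    OwnedBuffer(const OwnedBuffer &) = delete;
    OwnedBuffer & operator=(const OwnedBuffer &) = delete;

    // Moving transfers ownership and nulls out the source.
    OwnedBuffer(OwnedBuffer && other) noexcept
        : data(std::exchange(other.data, nullptr)),
          size(std::exchange(other.size, 0)) {}

    OwnedBuffer & operator=(OwnedBuffer && other) noexcept {
        if (this != &other) {
            reset();  // release what we currently own first
            data = std::exchange(other.data, nullptr);
            size = std::exchange(other.size, 0);
        }
        return *this;
    }

    void reset() {
        std::free(data);  // free(nullptr) is a no-op
        data = nullptr;
        size = 0;
    }
};

static_assert(!std::is_copy_constructible<OwnedBuffer>::value, "copy must be deleted");
static_assert(std::is_nothrow_move_constructible<OwnedBuffer>::value, "move must be available");

// Small self-check: after a move, only the destination owns the allocation.
inline void check_move_transfers_ownership() {
    OwnedBuffer a(64);
    void * p = a.data;
    OwnedBuffer b(std::move(a));
    assert(a.data == nullptr && a.size == 0);
    assert(b.data == p && b.size == 64);
}
```

Because the moved-from object holds a null pointer, its destructor has nothing to free, which is what rules out the double-free.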
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 28, 2026
Record PR Luce-Org#289 refresh to 0ffab8a, fresh direct conflict probes for the remaining non-ancestor contributor PRs, and tmux-driven PR Luce-Org#237 delegation outcomes.
Contributor
1 issue found across 1 file (changes from recent commits).
Implement pipelined decode path that caches DeltaNet pre-FFN graphs and enables true GPU/CPU overlap for hot/cold expert computation:

- Cache 30/40 DeltaNet layer graphs (position-independent recurrent state)
- Move ffn_post readback before hot graph launch to avoid serialization
- Integrate pipelined path into both run_ar_decode_path and generate() AR fallback
- Add persistent PipelinedDecodeState to avoid per-request alloc/free
- Remove dead process_one_token code from generate()

Benchmark results (RTX 2080 Ti, Qwen3.6-35B-A3B Q4_K_M, 60% hot):

- Realistic placement: 46.6 ms/tok (vs 43.0 all-GPU, only +8%)
- Worst-case (all cold): 81.4 ms/tok (vs 90.7 old hybrid, -10%)
- Saves ~8 GiB VRAM vs all-GPU while maintaining near-parity speed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the hybrid MoE pipeline fills unused routing slots with dummy entries, all dummies previously pointed to expert 0. This created a pathological imbalance (e.g. 69/72 rows for one expert) that triggered an out-of-bounds access in the CUDA MMQ stream-k kernel path during down-projection.

Distribute dummy slot IDs evenly across all experts in the hot/cold stacks (i % n_experts) so no single expert accumulates excessive dummy rows. The dummy weight remains 0.0 so these rows contribute nothing to output.

Also adds:

- pipe_state_.reset() between requests to avoid stale DeltaNet graph pointers
- RAII destructors for ResidualCombineGraph, GpuResidentState, CachedPrefnGraph, and PipelinedDecodeState to prevent resource leaks

Tested: 10/10 requests pass on RTX 2080 Ti at ~15.5 tok/s decode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
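A minimal sketch of the round-robin dummy fill described in this commit, under stated assumptions: `RoutingSlots`, `pad_with_dummies`, and `max_rows_per_expert` are hypothetical names, and taking `i` to be the slot index (rather than a separate dummy counter) is a guess, since the commit only gives the expression `i % n_experts`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: one expert id and one weight per routing slot.
struct RoutingSlots {
    std::vector<int>   ids;
    std::vector<float> weights;
};

// Pads the slot list up to n_slots. Dummy slot i is assigned expert
// (i % n_experts) with weight 0.0, so it adds nothing to the output.
inline void pad_with_dummies(RoutingSlots & s, std::size_t n_slots, int n_experts) {
    assert(n_experts > 0);
    for (std::size_t i = s.ids.size(); i < n_slots; ++i) {
        s.ids.push_back(static_cast<int>(i % static_cast<std::size_t>(n_experts)));
        s.weights.push_back(0.0f);
    }
}

// Largest number of rows routed to any single expert.
inline int max_rows_per_expert(const RoutingSlots & s, int n_experts) {
    assert(n_experts > 0);
    std::vector<int> counts(static_cast<std::size_t>(n_experts), 0);
    for (int id : s.ids) {
        counts[static_cast<std::size_t>(id)]++;
    }
    return *std::max_element(counts.begin(), counts.end());
}
```

With 3 real entries in 72 slots, sending every dummy to expert 0 gives that expert 69 rows, matching the imbalance quoted above. In this sketch, with 8 experts and the real entries on experts 5, 6, and 7, the round-robin fill caps the busiest expert at 10 rows.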
…rruption

The per-layer CachedFfnGraph (hot_graph/cold_graph) allocated by the pipelined decode path persisted across requests. After multiple requests with thousands of decode tokens, accumulated GPU allocations caused memory corruption when prefill tried to allocate its own graph buffers.

Freeing these cached graphs between requests releases the GPU memory before prefill runs. The graphs are rebuilt cheaply on first decode of each new request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When DFLASH_QWEN35MOE_TELEMETRY=1, the pipelined decode loop now collects and prints a per-token breakdown:

- prefn_build: DeltaNet graph setup time
- prefn_compute: GPU pre-FFN compute time
- routing_readback: GPU→CPU routing decision transfer
- ffn: hybrid MoE FFN (split into allhot/mixed)

Also removes leftover debug CUDA sync checks from the prefill path and includes the mmq.cu ids_dst padding fix (submodule update).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
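One way an environment-gated per-phase timer could look is sketched below. Only the variable name `DFLASH_QWEN35MOE_TELEMETRY` and the four phase names come from the commit; `flag_enabled`, `TokenTelemetry`, and `timed` are hypothetical, and treating only the exact value `1` as enabled is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Assumption: the flag counts as enabled only when the value is exactly "1".
inline bool flag_enabled(const char * value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

inline bool telemetry_enabled() {
    return flag_enabled(std::getenv("DFLASH_QWEN35MOE_TELEMETRY"));
}

// Hypothetical accumulator; field names mirror the phases listed in the commit.
struct TokenTelemetry {
    double prefn_build_ms      = 0.0;
    double prefn_compute_ms    = 0.0;
    double routing_readback_ms = 0.0;
    double ffn_ms              = 0.0;

    double total_ms() const {
        return prefn_build_ms + prefn_compute_ms + routing_readback_ms + ffn_ms;
    }

    void print() const {
        std::printf("prefn_build=%.3f prefn_compute=%.3f routing_readback=%.3f ffn=%.3f ms\n",
                    prefn_build_ms, prefn_compute_ms, routing_readback_ms, ffn_ms);
    }
};

// Runs fn and adds its wall-clock duration (milliseconds) to acc_ms.
template <typename Fn>
inline void timed(double & acc_ms, Fn && fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    assert(t1 >= t0);  // steady_clock is monotonic
    acc_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Wall-clock timing of this kind only reflects GPU work if the timed callable waits for the device to finish; otherwise it measures launch overhead.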
The MoE router was not normalizing top-k expert weights after selection. With softmax gating over 256 experts but only top-8 selected, the weights summed to ~0.03-0.05 instead of 1.0, causing systematically underscaled FFN output across all 40 layers. This produced accumulating errors that made even simple arithmetic wrong (e.g. 7+8=11 instead of 15).

Fix: always normalize selected weights by their sum with a clamp to avoid division by zero, matching llama.cpp's norm_w=true behavior for qwen35moe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
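The normalization step can be sketched as follows. `normalize_selected` is a hypothetical helper and the clamp floor `min_sum` is a placeholder value, not the constant used by the PR or by llama.cpp. As a sanity check on the quoted range, a perfectly uniform softmax over 256 experts gives each expert 1/256, so the top 8 would sum to 8/256 ≈ 0.031.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Divides the selected top-k weights by their sum so they add up to 1.
// The sum is clamped from below so an all-zero selection cannot divide by zero.
// Returns the (unclamped) sum before normalization. min_sum is an assumed floor.
inline float normalize_selected(std::vector<float> & w, float min_sum = 1e-20f) {
    assert(min_sum > 0.0f);
    const float raw_sum = std::accumulate(w.begin(), w.end(), 0.0f);
    const float denom   = std::max(raw_sum, min_sum);
    for (float & x : w) {
        x /= denom;
    }
    return raw_sum;
}
```

Without this step, each layer's FFN output is scaled by the raw sum of the selected weights, which is the underscaling the commit describes.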
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- P1: Delete implicit copy on resource-owning structs (CachedPrefnGraph, ResidualCombineGraph, GpuResidentState, PipelinedDecodeState) to prevent accidental double-free of ggml/GPU resources. Add explicit move ops that null out the source.
- P3: Hoist bf16_buf allocation and ggml_fp32_to_bf16_row conversion outside the n_capture_layers loop, since all iterations convert the same act_cur data.
- P2 (expert_weights_scale): No change needed. Our condition (!=0.0f && !=1.0f) already matches llama.cpp llama-graph.cpp:1413 exactly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The routed-expert mul_mat_id MMQ kernel writes out of bounds on Ampere when the per-call token count exceeds ~8: the expert token distribution overshoots the destination tiles on the need_check=false write path. This silently corrupts neighbouring GPU allocations during prefill and crashes with a CUDA illegal memory access at a later decode synchronize (~4th request under the server, in the forced hot/cold split path).

Sub-batch the hybrid FFN to 8 tokens per eval_qwen35moe_hybrid_ffn_batched call so the attention prefill can stay at the full chunk size.

Verified on an RTX 3090 (24 GiB) forcing a 60/40 hot/cold split via DFLASH_EXPERT_BUDGET_MB=11000: all 10 HumanEval prompts complete and the server stays up (previously crashed at request 4). compute-sanitizer memcheck confirms the OOB write originates in the routed mul_mat_id (mul_mat_q<Q5_K, ..., need_check=false>) inside eval_qwen35moe_hybrid_ffn_batched.

Co-Authored-By: WOZCODE <contact@withwoz.com>
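The sub-batching described here amounts to a chunked loop over the token range. The sketch below shows that shape only: `for_each_sub_batch` is a hypothetical helper, and the callback stands in for a call such as eval_qwen35moe_hybrid_ffn_batched, whose real signature is not shown in this PR text.

```cpp
#include <algorithm>
#include <cassert>

// Splits [0, n_tokens) into consecutive chunks of at most max_sub_batch tokens
// and calls fn(start, count) for each chunk. Returns the number of calls made.
template <typename Fn>
inline int for_each_sub_batch(int n_tokens, int max_sub_batch, Fn && fn) {
    assert(max_sub_batch > 0);
    int calls = 0;
    for (int start = 0; start < n_tokens; start += max_sub_batch) {
        const int count = std::min(max_sub_batch, n_tokens - start);
        fn(start, count);  // placeholder for the per-chunk hybrid FFN evaluation
        ++calls;
    }
    return calls;
}
```

With `max_sub_batch` set to 8, a 20-token chunk is processed as three calls of 8, 8, and 4 tokens, so no single call exceeds the size at which the out-of-bounds write was observed.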