feat(qwen35moe): pipelined hybrid MoE decode with GPU/CPU overlap#289

Open
howard0su wants to merge 8 commits into
Luce-Org:mainfrom
howard0su:pipeline_moe

@howard0su

Implement pipelined decode path that caches DeltaNet pre-FFN graphs and enables true GPU/CPU overlap for hot/cold expert computation:

  • Cache DeltaNet layer graphs for 30 of the 40 layers (their recurrent state is position-independent)
  • Move ffn_post readback before hot graph launch to avoid serialization
  • Integrate pipelined path into both run_ar_decode_path and generate() AR fallback
  • Add persistent PipelinedDecodeState to avoid per-request alloc/free
  • Remove dead process_one_token code from generate()

Benchmark results (RTX 2080 Ti, Qwen3.6-35B-A3B Q4_K_M, 60% hot):

  • Realistic placement: 46.6 ms/tok (vs 43.0 all-GPU, only +8%)
  • Worst-case (all cold): 81.4 ms/tok (vs 90.7 old hybrid, -10%)
  • Saves ~8 GiB VRAM vs all-GPU while maintaining near-parity speed

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record the 2026-05-28 01:36 unattended refresh: upstream/main remains unchanged, new PR Luce-Org#289 is draft-only, and fresh direct-merge probes for all non-ancestor non-draft PRs still conflict in an isolated worktree.
@howard0su howard0su marked this pull request as ready for review May 28, 2026 09:52
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Integrate howard0su/pipeline_moe over the current auto-integration stack. Resolve the Qwen35 MoE AR decode signature conflict by preserving the current thinking-budget hook API while routing hybrid MoE generation through the new pipelined decode path. Keep the existing accessible llama.cpp submodule commit because the PR submodule pointer is not fetchable from the configured submodule remote.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Document PR Luce-Org#289 integration, the inaccessible submodule pointer decision, refreshed direct-merge probes, and validation outcomes for the unattended auto-integration run.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record the clean Luce-Org#289 head update, fresh direct merge probes for the remaining non-ancestor PRs, and this run's validation results.
@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 9 files

Comment thread server/src/qwen35moe/qwen35moe_pipelined_decode.h
Comment thread server/src/qwen35moe/qwen35moe_ffn.cpp
Comment thread server/src/qwen35moe/qwen35moe_backend.cpp Outdated
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Integrates the latest non-draft contributor update from PR Luce-Org#289 on top of the maintained auto-integration stack. Normalizes selected expert weights, guards the expert weight scale default, and removes a noisy pipelined decode init printf.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record the PR Luce-Org#289 refresh, current PR classification, fresh direct conflict probes, delegated PR Luce-Org#221 feasibility attempt, validation, and retained worktree paths for this unattended run.
howard0su added a commit to howard0su/lucebox-hub that referenced this pull request May 28, 2026
- P1: Delete implicit copy on resource-owning structs (CachedPrefnGraph,
  ResidualCombineGraph, GpuResidentState, PipelinedDecodeState) to prevent
  accidental double-free of ggml/GPU resources. Add explicit move ops that
  null out the source.

- P3: Hoist bf16_buf allocation and ggml_fp32_to_bf16_row conversion outside
  the n_capture_layers loop — all iterations convert the same act_cur data.

- P2 (expert_weights_scale): No change needed — our condition
  (!=0.0f && !=1.0f) already matches llama.cpp llama-graph.cpp:1413 exactly.
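
A minimal sketch of the P1 pattern above (deleted copy, explicit move that nulls out the source). `OwnedBuffer` is a hypothetical stand-in that owns a `malloc` buffer; the structs named in the commit own ggml/GPU resources, and their actual members are not shown in this thread.

```cpp
#include <cstdlib>
#include <utility>

// Hypothetical stand-in for a resource-owning struct. The real structs in
// this PR own ggml/GPU resources rather than a malloc'd buffer.
struct OwnedBuffer {
    void* data = nullptr;
    std::size_t size = 0;

    OwnedBuffer() = default;
    explicit OwnedBuffer(std::size_t n) : data(std::malloc(n)), size(n) {}

    // An implicit copy would leave two owners that both free `data`.
    OwnedBuffer(const OwnedBuffer&) = delete;
    OwnedBuffer& operator=(const OwnedBuffer&) = delete;

    // Moving transfers ownership and nulls out the source, so the source's
    // destructor has nothing left to release.
    OwnedBuffer(OwnedBuffer&& other) noexcept
        : data(std::exchange(other.data, nullptr)),
          size(std::exchange(other.size, 0)) {}

    OwnedBuffer& operator=(OwnedBuffer&& other) noexcept {
        if (this != &other) {
            std::free(data);  // release what we currently own
            data = std::exchange(other.data, nullptr);
            size = std::exchange(other.size, 0);
        }
        return *this;
    }

    ~OwnedBuffer() { std::free(data); }  // free(nullptr) is a no-op
};
```

With this shape, `OwnedBuffer b = a;` fails to compile, while `OwnedBuffer b = std::move(a);` leaves `a` empty and safe to destroy.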

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record PR Luce-Org#289 refresh to 0ffab8a, fresh direct conflict probes for the remaining non-ancestor contributor PRs, and tmux-driven PR Luce-Org#237 delegation outcomes.
@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file (changes from recent commits).

Comment thread server/src/qwen35moe/qwen35moe_backend.cpp
howard0su and others added 8 commits May 30, 2026 21:29
When the hybrid MoE pipeline fills unused routing slots with dummy entries,
all dummies previously pointed to expert 0. This created a pathological
imbalance (e.g. 69/72 rows for one expert) that triggered an out-of-bounds
access in the CUDA MMQ stream-k kernel path during down-projection.

Distribute dummy slot IDs evenly across all experts in the hot/cold stacks
(i % n_experts) so no single expert accumulates excessive dummy rows.
The dummy weight remains 0.0 so these rows contribute nothing to output.
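
The padding scheme described above can be sketched as follows. `RoutingSlot`, `pad_with_dummies`, and `rows_per_expert` are hypothetical names for illustration; only the `i % n_experts` assignment and the 0.0 dummy weight come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical routing entry: which expert handles the row, and its weight.
struct RoutingSlot {
    int32_t expert_id;
    float weight;
};

// Pads `slots` up to `n_slots` entries. Each dummy at slot index i is
// assigned expert (i % n_experts) with weight 0.0, so dummies are spread
// evenly and contribute nothing to the output.
void pad_with_dummies(std::vector<RoutingSlot>& slots,
                      std::size_t n_slots, int32_t n_experts) {
    const std::size_t n = static_cast<std::size_t>(n_experts);
    for (std::size_t i = slots.size(); i < n_slots; ++i) {
        slots.push_back({static_cast<int32_t>(i % n), 0.0f});
    }
}

// Counts how many rows (real or dummy) each expert receives.
std::vector<int> rows_per_expert(const std::vector<RoutingSlot>& slots,
                                 int32_t n_experts) {
    std::vector<int> counts(static_cast<std::size_t>(n_experts), 0);
    for (const RoutingSlot& s : slots) {
        counts[static_cast<std::size_t>(s.expert_id)] += 1;
    }
    return counts;
}
```

With 3 real rows padded to 72 slots across 8 experts, no expert ends up with more than 10 rows, in contrast to the 69/72 imbalance that results from pointing every dummy at expert 0.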

Also adds:
- pipe_state_.reset() between requests to avoid stale DeltaNet graph pointers
- RAII destructors for ResidualCombineGraph, GpuResidentState,
  CachedPrefnGraph, and PipelinedDecodeState to prevent resource leaks

Tested: 10/10 requests pass on RTX 2080 Ti at ~15.5 tok/s decode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rruption

The per-layer CachedFfnGraph (hot_graph/cold_graph) allocated by the
pipelined decode path persisted across requests. After multiple requests
with thousands of decode tokens, accumulated GPU allocations caused
memory corruption when prefill tried to allocate its own graph buffers.

Freeing these cached graphs between requests releases the GPU memory
before prefill runs. The graphs are rebuilt cheaply on first decode
of each new request.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When DFLASH_QWEN35MOE_TELEMETRY=1, the pipelined decode loop now
collects and prints per-token breakdown:
- prefn_build: DeltaNet graph setup time
- prefn_compute: GPU pre-FFN compute time
- routing_readback: GPU→CPU routing decision transfer
- ffn: hybrid MoE FFN (split into allhot/mixed)
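
As a rough illustration of this kind of per-stage breakdown, here is a sketch using `std::chrono`. `StageTimes`, `telemetry_enabled`, and `timed` are hypothetical helpers, and treating only the exact value `1` as "enabled" is an assumption; the PR's actual parsing and print format are not shown in this thread.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical accumulator mirroring the stages listed above (milliseconds).
struct StageTimes {
    double prefn_build_ms = 0.0;
    double prefn_compute_ms = 0.0;
    double routing_readback_ms = 0.0;
    double ffn_allhot_ms = 0.0;
    double ffn_mixed_ms = 0.0;
};

// Assumption: telemetry is on only when the variable is exactly "1".
inline bool telemetry_enabled() {
    const char* v = std::getenv("DFLASH_QWEN35MOE_TELEMETRY");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Runs fn(); when enabled, adds its wall-clock duration to acc_ms.
template <typename Fn>
void timed(double& acc_ms, bool enabled, Fn&& fn) {
    if (!enabled) {
        fn();
        return;
    }
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    acc_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Prints one token's breakdown to stderr.
inline void print_breakdown(const StageTimes& t) {
    std::fprintf(stderr,
                 "prefn_build=%.3f prefn_compute=%.3f routing_readback=%.3f "
                 "ffn_allhot=%.3f ffn_mixed=%.3f (ms)\n",
                 t.prefn_build_ms, t.prefn_compute_ms, t.routing_readback_ms,
                 t.ffn_allhot_ms, t.ffn_mixed_ms);
}
```

Note that timing GPU stages this way only measures real compute time if the timed callable synchronizes before returning; otherwise it measures launch overhead.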

Also removes leftover debug CUDA sync checks from the prefill path
and includes the mmq.cu ids_dst padding fix (submodule update).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The MoE router was not normalizing top-k expert weights after selection.
With softmax gating over 256 experts but only top-8 selected, the weights
summed to ~0.03-0.05 instead of 1.0, causing systematically underscaled
FFN output across all 40 layers. This produced accumulating errors that
made even simple arithmetic wrong (e.g. 7+8=11 instead of 15).

Fix: always normalize selected weights by their sum with a clamp to avoid
division by zero, matching llama.cpp's norm_w=true behavior for qwen35moe.
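
A minimal sketch of the normalization step described above. `normalize_topk_weights` and its `min_sum` clamp value are hypothetical; the commit only states that the selected weights are divided by their sum, with a clamp to avoid division by zero.

```cpp
#include <algorithm>
#include <vector>

// Rescales the selected top-k expert weights in place so they sum to 1.
// `min_sum` is an illustrative clamp, not the value used in the PR.
void normalize_topk_weights(std::vector<float>& weights,
                            float min_sum = 1e-6f) {
    float sum = 0.0f;
    for (float w : weights) {
        sum += w;
    }
    sum = std::max(sum, min_sum);  // guard against division by zero
    for (float& w : weights) {
        w /= sum;
    }
}
```

For example, with a uniform softmax over 256 experts each selected weight is 1/256, so the top-8 sum is 8/256 = 0.03125, which falls in the ~0.03-0.05 range reported above. After normalization each of the 8 weights becomes 0.125 and they sum to 1.0.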

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The routed-expert mul_mat_id MMQ kernel writes out of bounds on Ampere
when the per-call token count exceeds ~8: the expert token distribution
overshoots the destination tiles on the need_check=false write path. The
overrun silently corrupts neighbouring GPU allocations during prefill, and
the process later crashes with a CUDA illegal memory access at a decode
synchronize (around the 4th request under the server, in the forced
hot/cold split path).

Sub-batch the hybrid FFN to 8 tokens per eval_qwen35moe_hybrid_ffn_batched
call so the attention prefill can stay at the full chunk size.
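
The sub-batching above amounts to walking the chunk in ranges of at most 8 tokens. A sketch of that range split follows; `sub_batches` is a hypothetical helper, and the real call site and arguments of eval_qwen35moe_hybrid_ffn_batched are not shown in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, n_tokens) into half-open [begin, end) ranges of at most
// `max_sub` tokens each. The default of 8 follows the commit message.
std::vector<std::pair<std::size_t, std::size_t>>
sub_batches(std::size_t n_tokens, std::size_t max_sub = 8) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (max_sub == 0) {
        return ranges;  // nothing sensible to do with a zero-sized sub-batch
    }
    for (std::size_t begin = 0; begin < n_tokens; begin += max_sub) {
        ranges.emplace_back(begin, std::min(begin + max_sub, n_tokens));
    }
    return ranges;
}
```

A caller would run the hybrid FFN once per returned range, while the attention part of prefill still processes the full chunk in one pass.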

Verified on an RTX 3090 (24 GiB) forcing a 60/40 hot/cold split via
DFLASH_EXPERT_BUDGET_MB=11000: all 10 HumanEval prompts complete and the
server stays up (previously crashed at request 4). compute-sanitizer
memcheck confirms the OOB write originates in the routed mul_mat_id
(mul_mat_q<Q5_K, ..., need_check=false>) inside eval_qwen35moe_hybrid_ffn_batched.

Co-Authored-By: WOZCODE <contact@withwoz.com>