refactor: extract MoE hybrid mode into common layer for qwen and laguna#305
Open
howard0su wants to merge 17 commits into
241bf3e to d45b8fd
1 issue found across 102 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="harness/clients/run_opencode.sh">
<violation number="1" location="harness/clients/run_opencode.sh:14">
P2: `require_client_binary` uses a filesystem `-x` check only and does not fall back to PATH resolution (`command -v`), so valid PATH-based `OPENCODE_BIN` values that previously worked now fail.</violation>
</file>
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
<file context>
@@ -11,6 +11,7 @@ source "$SCRIPT_DIR/common.sh"
CLIENT_OUT="$LOG_DIR/opencode.out"
EXPORT_OUT="$LOG_DIR/opencode-export.json"
OPENCODE_BIN="${OPENCODE_BIN:-$CLIENT_WORK_DIR/clients/opencode/npm/bin/opencode}"
+require_client_binary "OpenCode" "$OPENCODE_BIN" "opencode" "OPENCODE_BIN"
HOME_DIR="$LOG_DIR/opencode-home"
PROJECT_DIR="$LOG_DIR/opencode-project"
</file context>
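The review comment above describes a check that accepts only an executable file path and rejects command names that would resolve through PATH. As a rough illustration of the fallback order the reviewer suggests, here is a minimal Python sketch. `resolve_client_binary` is a hypothetical name; the real `require_client_binary` is a shell function whose body is not shown in this PR excerpt, so this is not its implementation.

```python
import os
import shutil
from typing import Optional


def resolve_client_binary(value: str) -> Optional[str]:
    """Return a usable path for `value`, or None if nothing executable is found."""
    # Step 1: plain filesystem check, the analogue of the shell test [ -x "$value" ].
    if os.path.isfile(value) and os.access(value, os.X_OK):
        return value
    # Step 2: PATH fallback, the analogue of `command -v "$value"`.
    return shutil.which(value)
```

With this order, an explicit path such as the default `$CLIENT_WORK_DIR/clients/opencode/npm/bin/opencode` is accepted directly, while a bare name such as `opencode` is still accepted when it is on PATH.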
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Record the 2026-05-30 11:16 EDT unattended integration pass: refreshed PR-head containment, fresh conflicted probe worktrees, and Codex feasibility reports for the Luce-Org#305 and Luce-Org#237 selective-port candidates. No product-code stack changes were made.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Record the 2026-05-30 13:38 EDT unattended reconciliation run, including refreshed PR containment, direct merge probes, and the Codex Luce-Org#305 selective-port recommendation.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Record the 2026-05-30 14:00 unattended run, fresh conflict probes, and the Claude/Codex Luce-Org#305 selective-port attempts. No product-code changes were retained.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Record the latest unattended PR containment check, direct conflict probes, and tmux/Codex feasibility audit for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Refresh open PR classification after the 2026-05-30 18:10 EDT unattended run. Record current PR-head containment, repeated conflicted probe results, and the tmux-driven Luce-Org#305 Claude/Codex feasibility audit.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Record the 2026-05-30 19:02 cron run, refreshed direct-merge probes, and the tmux/Codex feasibility audit for PR Luce-Org#305. No product-code changes were added.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 30, 2026
Record 2026-05-30 19:40 ET auto-integration refresh. Reclassify PR Luce-Org#305 as non-draft, capture direct merge probe results for still-pending PRs, and summarize tmux-driven Claude/Codex audit outcomes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Rerun direct merge probes for remaining non-integrated PRs and record the tmux-driven Codex audit for Luce-Org#237/Luce-Org#305. No product-code changes were integrated.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Record the 2026-05-31 unattended reconciliation run, fresh conflict probes for the remaining non-ancestor PRs, and the unusable Luce-Org#305 Claude/Codex delegation attempts.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Record 2026-05-31 04:10 cron reconciliation: no new PR heads, fresh conflict probes for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and failed read-only delegation attempts for Luce-Org#237.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Merge latest origin/main into the integration stack and refresh the manifest with current open PR classification, fresh conflict probes, and Codex feasibility output for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Record the 2026-05-31 06:16 cron refresh, fresh conflict probes for the six remaining non-ancestor PRs, and the tmux-driven Codex feasibility review for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Port the small PR Luce-Org#305 control-plane salvage slice into the current Qwen35MoE dynamic expert placement path.
Adds DFLASH_EXPERT_BUDGET_PCT alongside the existing MB cap so profiling/testing can bound hot expert residency by a percentage of total expert bytes. Refreshes the auto-integration manifest with current PR classification, probes, delegation, and validation notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026
Port a narrow current-layout slice from PR Luce-Org#305. The batched FFN evaluator already carries the zero-weight dummy-slot balancing needed to avoid the old MMQ stream-k imbalance, so the prefill caller can run full chunks directly while reusing hot/cold gallocr handles across calls.
Also refresh the auto-integration manifest with current PR classification and validation notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Port a narrow PR Luce-Org#305 slice by feeding the GPU-resident pipelined decode activation directly into the persistent logits graph with a backend-to-backend copy, avoiding the previous GPU-to-CPU readback before logits. Keep host uploads synchronous after review found async upload ordering could race with existing internal tensor copies. Refresh auto-integration metadata.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Record the 2026-06-01 06:29 UTC-4 unattended refresh, fresh direct-merge probes for remaining selective-port candidates, and the tmux-driven Codex feasibility review for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Record the 2026-06-01 09:10 unattended reconciliation pass, including current open PR containment, direct-merge probe counts, and the Luce-Org#305 Codex feasibility result.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Refresh the unattended auto-integration manifest after the 2026-06-01 10:34 run. No contributor PR head advanced; direct probes still leave Luce-Org#305, Luce-Org#237, Luce-Org#221, Luce-Org#154, Luce-Org#153, and Luce-Org#135 as selective-port/runtime-validation candidates.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Record the 2026-06-01 12:08 cron pass, including Luce-Org#329's move back to draft status, fresh containment counts, direct conflict probes for the remaining selective-port candidates, and the tmux-driven Codex Luce-Org#305 no-safe-slice report.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Promote a narrow PR Luce-Org#305 slice into the integration stack: reusable common MoE routing statistics, expert placement, and swap-planning helpers plus a focused unit harness. Broader Laguna/Qwen hybrid runtime commonization remains deferred for current-layout and CUDA validation.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026
Port the narrow Luce-Org#305 dense/no-expert-layer guard into the current Qwen35MoE placement and hybrid-storage paths. Dense layers now receive no byte-budget hot experts and storage builders leave no-expert layers empty. Refresh the auto-integration manifest with the latest probe/delegation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 2, 2026
Record the 2026-06-01 20:42 unattended refresh, repeated direct-merge probes for the remaining selective-port candidates, and the Luce-Org#305 Claude/Codex feasibility results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 2, 2026
Port the standalone HumanEval and LLM benchmark harnesses from PR Luce-Org#305 as an isolated salvage slice. Current docs already reference these utilities; the conflicted C++/runtime portions remain excluded. Also keep bench_he.py --help usable without local model artifacts so basic script validation can run in CI-like checkouts.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 2, 2026
Record the latest Luce-Org#285 debug-thinking-logits merge, refreshed PR containment, direct conflict probes, and the Codex Luce-Org#305 no-safe-slice review.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 2, 2026
Record the 2026-06-02 01:07 unattended run: no new non-draft PR heads advanced, direct-merge probes still conflict for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and a tmux Codex Luce-Org#221 pass found only the already-represented gguf_metadata header slice.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 2, 2026
Promote the standalone server/scripts/bench_agent.py utility from PR Luce-Org#305 after fresh worktree conflict probes and tmux-driven feasibility review. Leave the remaining conflicted runtime/docs/CMake hunks for further selective-port work.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 2, 2026
Record the 2026-06-02 03:58 unattended refresh, unchanged open PR containment, fresh conflict probes for the six remaining selective-port candidates, and the tmux-driven Codex no-safe-slice verdict for PR Luce-Org#305.
…aguna
Move all hybrid infrastructure (placement, routing stats, storage, FFN eval, swap manager) from qwen35moe-specific files into server/src/common/moe_hybrid_* and add laguna hybrid mode with layer-by-layer decode.
Common abstractions:
- moe_hybrid_types.h: MoeHybridConfig, MoeLayerDesc (plain data, no vtable)
- moe_hybrid_placement: greedy knapsack expert placement
- moe_hybrid_routing_stats: runtime frequency tracking
- moe_hybrid_storage: hot/cold expert buffer management
- moe_hybrid_ffn_eval: GPU-resident, batched, single, prefill evaluation
- moe_hybrid_swap_manager: promote/demote at request boundaries
Laguna hybrid mode:
- init_hybrid_mode(): VRAM budget, placement, partial GGUF load
- hybrid_forward_one_token(): per-layer attn+router graph + hybrid FFN
- generate_hybrid(): prefill via laguna_step, decode via hybrid path
- load_target_gguf_laguna_partial(): skips expert tensor GPU upload
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
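To make the routing-stats and swap-manager roles listed above more concrete, here is a small Python sketch of frequency tracking feeding a promote/demote plan. It is an illustration only: `RoutingStats`, `plan_swaps`, and the same-layer pairing rule are assumptions made for this example and are not taken from the moe_hybrid_* sources. Pairing within a layer is chosen here because it keeps the hot byte total unchanged if experts of one layer have equal size.

```python
from collections import Counter
from typing import List, Set, Tuple

Expert = Tuple[int, int]  # (layer index, expert index)


class RoutingStats:
    """Counts how often the router selects each expert."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, layer: int, expert_ids: List[int]) -> None:
        for e in expert_ids:
            self.counts[(layer, e)] += 1


def plan_swaps(stats: RoutingStats, hot: Set[Expert],
               max_swaps: int) -> List[Tuple[Expert, Expert]]:
    """Return (promote, demote) pairs: busy cold experts replace idle hot ones."""
    # Cold experts that were actually routed to, busiest first.
    cold_by_use = sorted((e for e in stats.counts if e not in hot),
                         key=lambda e: (-stats.counts[e], e))
    swaps: List[Tuple[Expert, Expert]] = []
    demoted: Set[Expert] = set()
    for cold in cold_by_use:
        if len(swaps) >= max_swaps:
            break
        # Only consider hot experts of the same layer that are not already demoted.
        candidates = [h for h in hot if h[0] == cold[0] and h not in demoted]
        if not candidates:
            continue
        victim = min(candidates, key=lambda h: (stats.counts[h], h))
        # Swap only when the cold expert is strictly busier than the victim.
        if stats.counts[victim] < stats.counts[cold]:
            swaps.append((cold, victim))
            demoted.add(victim)
    return swaps
```

In a server, a plan like this would presumably be applied between requests, matching the "promote/demote at request boundaries" note, so that no decode step observes a half-moved expert.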
Build per-layer CachedFfnGraph at init with n_expert_used slots for the routed FFN pipeline. The decode loop uses a StreamMoE-inspired async pattern: prefn(async) → sync → CPU remap → upload → rffn(async) → combine(async), eliminating per-layer graph rebuilds.
Cold experts get weight=0, contributing nothing to output while keeping the graph structure fixed. This reuses build_cached_hot_graph from the common module.
Also adds detailed telemetry fields (gpu_idle, tensor_io, cold_cpu, hot_graph_build, etc.) for GPU utilization diagnosis.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- compute_logits accepts optional gpu_src for GPU→GPU copy (no host bounce)
- Use tensor_set_async for embedding upload on compute stream
- Remove act_cur D2H readback after pipelined decode (stays on GPU)
- Accumulate new telemetry fields in decode breakdown output
- Pass MoeHybridStorage& to init_pipelined_decode_state
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Print VRAM usage summary and expert placement info on startup for both full GPU mode and hybrid mode paths, so users can always see how memory is allocated.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dense layers (layer 0 in Laguna with n_layer_dense_lead=1) have no expert tensors. The placement algorithm was assigning per_layer_floor hot experts to these layers, and the storage builder was adding 256 'cold' entries that can't be allocated.
Fix:
- Placement: only set floor for layers where layer_expert_bytes > 0
- Storage: skip layers with no expert tensors (gate/up/down all null) in both build_moe_hybrid_storage and build_moe_hybrid_storage_from_file
- Add layer index to hot buffer allocation error message for debugging
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
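The placement fix above can be sketched as a two-pass greedy selection in Python. This is a simplified illustration under assumptions: the function name `place_experts`, its parameters, and the hotness-per-byte ranking are invented for the example. Only the ideas of a per-layer floor, a byte budget, and skipping layers whose `layer_expert_bytes` is 0 come from the commit message.

```python
from typing import Dict, List, Set, Tuple

Expert = Tuple[int, int]  # (layer index, expert index)


def place_experts(layer_expert_bytes: List[int], n_expert: int,
                  hotness: Dict[Expert, float], budget_bytes: int,
                  per_layer_floor: int) -> Set[Expert]:
    """Pick hot experts under a byte budget; a size of 0 marks a dense layer."""

    def ranked(layer: int) -> List[int]:
        # Experts of one layer, hottest first, ties broken by index.
        return sorted(range(n_expert),
                      key=lambda e: (-hotness.get((layer, e), 0.0), e))

    hot: Set[Expert] = set()
    remaining = budget_bytes

    # Pass 1: per-layer floor, applied only to layers that have experts.
    for layer, size in enumerate(layer_expert_bytes):
        if size <= 0:
            continue  # dense layer: no floor, no hot experts
        for e in ranked(layer)[:per_layer_floor]:
            if size <= remaining:
                hot.add((layer, e))
                remaining -= size

    # Pass 2: spend what is left greedily by hotness per byte.
    rest = [(layer, e)
            for layer, size in enumerate(layer_expert_bytes) if size > 0
            for e in range(n_expert) if (layer, e) not in hot]
    rest.sort(key=lambda k: (-hotness.get(k, 0.0) / layer_expert_bytes[k[0]], k))
    for k in rest:
        size = layer_expert_bytes[k[0]]
        if size <= remaining:
            hot.add(k)
            remaining -= size
    return hot
```

A greedy pass is not an optimal knapsack solution, but when experts are similar in size it is cheap and usually close, which is presumably why a greedy variant is acceptable at startup.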
Implement layer-by-layer hybrid prefill for Laguna MoE, enabling the 60/40 hot/cold expert split during prompt processing:
- Rewrite generate_hybrid() prefill to use per-layer pre-FFN graph + eval_moe_hybrid_ffn_batched() instead of monolithic laguna_step()
- Fix KV cache write: permute K/V before copying to cache (matches the monolithic graph's layout)
- Fix 'all experts fit' path: always set hybrid_mode_=true in partial-load path so hybrid forward is used consistently
- Add DFLASH_EXPERT_BUDGET_PCT env var to trigger hybrid mode with a percentage budget (e.g. 60 = keep 60% of expert bytes on GPU)
- Generate uniform hotness when no routing stats file is provided
Tested with 60% budget on Laguna XS.2 (Q4_K_M):
- 6261 hot experts (10.61 GiB VRAM), 3723 cold experts (7.07 GiB RAM)
- Prefill: 1047ms, Decode: 12.9 tok/s
- Output quality verified (coherent, correct answers)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
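As a sanity check on the figures above, a percentage budget can be sketched in Python as follows. Only the variable name `DFLASH_EXPERT_BUDGET_PCT` and the meaning "60 = keep 60% of expert bytes on GPU" come from the commit message. The function name, the clamping to 0..100, and treating an unparsable value as "unset" are assumptions for illustration.

```python
import math
import os
from typing import Mapping, Optional

GIB = 1024 ** 3


def expert_budget_from_pct(total_expert_bytes: int,
                           env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the hot-expert byte budget, or None when no percentage is set."""
    raw = env.get("DFLASH_EXPERT_BUDGET_PCT")
    if raw is None:
        return None
    try:
        pct = float(raw)
    except ValueError:
        return None  # assumption: ignore values that do not parse
    if math.isnan(pct):
        return None
    pct = min(max(pct, 0.0), 100.0)  # assumption: clamp to a valid percentage
    return int(total_expert_bytes * pct / 100.0)
```

With the 17.68 GiB expert total reported in the later placement logs, a value of 60 gives about 10.61 GiB, which agrees with the hot-expert VRAM figure in the test notes; the hot and cold counts (6261 + 3723) also sum to the 9984 experts reported elsewhere in this PR.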
Add warm/safety reserve sizes and placement source to the hybrid
placement diagnostic output, matching qwen35moe's format:
[laguna-hybrid] dynamic placement: gpu_total=22.00 GiB, core=2.28 GiB,
kv_cache=1.25 GiB (ctx=8192), warm=200 MB, safety=512 MB,
expert_budget=10.61 GiB (of 17.68 GiB total experts)
[laguna-hybrid] storage ready: ... source=uniform
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Always compute dynamic placement on startup instead of requiring
DFLASH_LAGUNA_HOTNESS or DFLASH_EXPERT_BUDGET_PCT env vars. The model
now automatically:
1. Partial-loads core tensors (non-expert) to GPU
2. Computes VRAM budget accounting for KV cache, warm, and safety reserves
3. If all experts fit → reloads full model to GPU (non-hybrid path)
4. Otherwise → uses hybrid hot/cold split automatically
This matches qwen35moe behavior: the budget breakdown and placement
decision are always printed regardless of whether hybrid mode is needed.
Example output (ctx=12000, not all fit):
[laguna] dynamic placement: gpu_total=22.00 GiB, core=2.28 GiB,
kv_cache=1.83 GiB (ctx=12000), warm=200 MB, safety=512 MB,
expert_budget=17.19 GiB (of 17.68 GiB total experts)
[laguna] dynamic placement result: 9726 hot experts, 258 cold experts
Example output (ctx=8192, all fit):
[laguna] dynamic placement result: 9984 hot experts, 0 cold experts
[laguna] all experts fit in VRAM, loading fully to GPU
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
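The budget arithmetic behind the example output can be reconstructed with a short Python sketch. The subtraction formula is inferred from the logged numbers rather than read from the source, and it reproduces them only if "MB" in the log is taken as MiB; both points are assumptions, as are the function names.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def expert_budget_bytes(gpu_total: int, core: int, kv_cache: int,
                        warm: int = 200 * MIB, safety: int = 512 * MIB) -> int:
    """VRAM left for experts after core tensors, KV cache, and reserves."""
    return max(0, gpu_total - core - kv_cache - warm - safety)


def needs_hybrid(budget: int, total_expert_bytes: int) -> bool:
    """Hybrid hot/cold mode is needed only when the experts do not all fit."""
    return budget < total_expert_bytes


if __name__ == "__main__":
    total_experts = int(17.68 * GIB)
    for ctx, kv_gib in ((12000, 1.83), (8192, 1.25)):
        budget = expert_budget_bytes(22 * GIB, int(2.28 * GIB), int(kv_gib * GIB))
        print(f"ctx={ctx}: expert_budget={budget / GIB:.2f} GiB, "
              f"hybrid={needs_hybrid(budget, total_experts)}")
```

For ctx=12000 this yields a 17.19 GiB budget against 17.68 GiB of experts, so the hybrid split is used; for ctx=8192 the smaller KV cache leaves roughly 17.77 GiB, so all experts fit, consistent with the two example outputs.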
…yers
- Remove kFfnSafeBatch=8 sub-batching that caused ~23× more graph builds/computes per MoE layer (the dummy-slot routing fix already prevents the mul_mat_id OOB issue)
- Keep StepGraph persistent across layers to reuse GPU gallocr buffer
- Add p_hot_alloc/p_cold_alloc optional params to eval_moe_hybrid_ffn_batched for persistent allocator reuse
- Applied to both Laguna and qwen35moe backends
Result on Laguna (95 tokens, 258 cold experts): prefill: 84 tok/s → 266 tok/s (3.2× speedup)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add fine-grained timing to diagnose why AR decode uses ~50% GPU time at 60% hot expert occupancy:
- PipelinedDecodeTelemetry new fields: routed_sync_us, routed_readback_us, routed_cpu_remap_us, routed_ffn_dispatch_us, routed_final_sync_us, routed_cold_expert_hits/total_expert_slots
- Per-layer timing in the routed FFN fast path separates: prefn dispatch vs GPU sync stall vs D2H readback vs CPU remap vs FFN dispatch
- run_pipelined_decode_path now measures full per-token budget: embed / layers / logits / sample with percentage breakdown
- Cold expert hit rate tracking reveals effective coverage gaps
Enable with DFLASH_QWEN35MOE_TELEMETRY=1
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
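The percentage breakdown and cold hit rate described above amount to simple ratios. A minimal Python sketch follows; the function names and the dictionary layout are assumptions for illustration and do not mirror the actual PipelinedDecodeTelemetry struct.

```python
from typing import Dict


def decode_breakdown(times_us: Dict[str, float]) -> Dict[str, float]:
    """Convert per-stage times (microseconds) into percentages of the token budget."""
    total = sum(times_us.values())
    if total <= 0:
        return {name: 0.0 for name in times_us}
    return {name: 100.0 * t / total for name, t in times_us.items()}


def cold_hit_rate(cold_hits: int, total_slots: int) -> float:
    """Fraction of routed expert slots that landed on a cold (non-GPU) expert."""
    return cold_hits / total_slots if total_slots else 0.0
```

A cold hit rate well above the cold share of experts would suggest that the placement is missing frequently routed experts, which is the kind of coverage gap the commit says the counter is meant to reveal.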
The CUDA mul_mat_id kernel triggers GGML_ASSERT (process abort) on sm_75 when cold experts exist in a layer. Since the assert kills the process before eval_moe_hybrid_ffn_batched can return false, the try/catch fallback approach doesn't work.
Fix: check storage.cold_expert_ids.empty() upfront and skip the batched path entirely for layers with cold experts, going straight to per-token eval_moe_hybrid_ffn_single.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, attention layers (full_attention_interval=4) used the expensive split path: full GPU sync → host readback → hot/cold partition → separate GPU + CPU compute → combine. This cost ~1.6ms per mixed layer.
Now all 40 layers use the routed FFN fast path with cold-masking:
- Cold experts get weight=0, local_id=0 (mapped to expert 0, result zeroed)
- All 8 expert slots still computed on GPU (async, no PCIe roundtrip)
- Eliminates CPU cold compute and D2H/H2D transfers entirely
Results at 60% hot experts in VRAM (RTX 2080 Ti):
- Decode: 12.1 → 15.5 tok/s (+28%)
- GPU utilization: 79.8% → 89.8%
- FFN/token: 9.56ms → 2.70ms
- Eliminated: tensor_io (2.6ms), cold_cpu (6.2ms), combine (0.7ms)
- Sync stall/layer: 384µs → 212µs
100% hot baseline unchanged (14.2 → 14.9 tok/s, slight improvement). 60% hot now matches or exceeds old 100% hot performance.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
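The cold-masking remap described above can be illustrated with a few lines of Python. The function name and the `hot_local_id` mapping are hypothetical; the rule itself (cold slots get local_id=0 and weight=0) is taken from the commit message.

```python
from typing import Dict, List, Sequence, Tuple


def remap_with_cold_mask(expert_ids: Sequence[int], weights: Sequence[float],
                         hot_local_id: Dict[int, int]) -> Tuple[List[int], List[float]]:
    """Map router-selected experts to GPU-local slots, zeroing cold ones."""
    local_ids: List[int] = []
    masked: List[float] = []
    for e, w in zip(expert_ids, weights):
        if e in hot_local_id:
            local_ids.append(hot_local_id[e])  # hot: real slot, real weight
            masked.append(w)
        else:
            local_ids.append(0)                # cold: any valid slot will do
            masked.append(0.0)                 # zero weight cancels its output
    return local_ids, masked
```

Because the number of slots never changes, the cached graph keeps a fixed shape. Note that the remaining weights are not renormalized in this sketch, so the cold experts' contribution is simply dropped, which matches the later description of this mode as one that "drops cold experts".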
When DFLASH_COLD_COMPUTE=1 is set, layers with cold experts use the split path (hot GPU + cold CPU/Halo) instead of cold-masking (weight=0). This produces exact results at the cost of speed.
Behavior:
- All-hot layers: still use fast routed FFN path (no quality difference)
- Mixed layers (have cold experts): use split path with actual cold computation on the cold backend (CPU today, AMD Halo in future)
Results at 60% hot (RTX 2080 Ti, cold on CPU):
- cold_compute=off (default): 15.5 tok/s, drops cold experts
- cold_compute=on: 10.5 tok/s, exact output quality
The cold_compute time (15ms/token on CPU) is the target for acceleration with a fast-memory device like AMD Halo.
Prefill already computes cold experts correctly via per-token fallback (eval_moe_hybrid_ffn_single), no change needed there.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace ggml graph dispatch (15 ops + thread pool barriers per layer) with a direct fused kernel calling ggml vec_dot primitives. Uses OpenMP to parallelize row-level matmuls and saturate DDR4 memory bandwidth.
Key design:
- ColdFfnCompute interface (cold_ffn_compute.h), scales to Halo/GPU
- CpuColdFfnCompute implementation using ggml vec_dot type traits
- Per-tensor type support (Q4_K gate, Q5_K down in Q4_K_M quants)
- Configurable threads via DFLASH_COLD_THREADS env (default: 8)
Results at 60% hot (6155 hot, 4085 cold, RTX 2080 Ti):
- Before: 10.4 tok/s decode (38ms cold single-threaded, 15.7ms ggml graph)
- After: 14.7-17 tok/s decode (14.5ms cold with 8 threads)
- GPU utilization: 50% → 64-70%
- Correctness verified (coherent multi-paragraph output)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the cold_compute guard that forced mixed layers onto the slower split path. Both the DeltaNet routed fast path and the split-path routed FFN sub-path now handle cold experts inline:
- Partition routing into hot (GPU rffn) + cold (CPU fused kernel)
- D2H ffn_post only when cold experts are selected
- Cold compute runs on CPU in parallel with GPU rffn dispatch
- Upload cold result to combine graph (or keep zeroed if all-hot)
When all experts are hot at runtime, the cold branch is skipped entirely with zero overhead; the path is identical to the previous routed fast path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
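The inline hot/cold handling above follows a partition, overlap, and combine pattern, sketched here in Python. Everything in it is a stand-in: `run_hot` and `run_cold` are hypothetical callables representing the GPU rffn dispatch and the CPU fused kernel, and a thread pool plays the role of the CPU/GPU overlap.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Set, Tuple

Route = Tuple[int, float]  # (expert id, router weight)
Ffn = Callable[[Sequence[float], List[Route]], List[float]]


def partition_routes(routes: Sequence[Route],
                     hot_ids: Set[int]) -> Tuple[List[Route], List[Route]]:
    """Split the router's selection into hot (GPU) and cold (CPU) routes."""
    hot = [r for r in routes if r[0] in hot_ids]
    cold = [r for r in routes if r[0] not in hot_ids]
    return hot, cold


def hybrid_ffn(x: Sequence[float], routes: Sequence[Route], hot_ids: Set[int],
               run_hot: Ffn, run_cold: Ffn, pool: ThreadPoolExecutor) -> List[float]:
    hot, cold = partition_routes(routes, hot_ids)
    if not cold:
        # All-hot at runtime: the cold branch is skipped entirely.
        return run_hot(x, hot)
    cold_future = pool.submit(run_cold, x, cold)  # cold work starts first...
    hot_out = run_hot(x, hot)                     # ...and overlaps with hot work
    cold_out = cold_future.result()
    # Combine: element-wise sum of the two partial outputs.
    return [h + c for h, c in zip(hot_out, cold_out)]
```

Since a mixture-of-experts output is a weighted sum over the selected experts, summing the hot and cold partial results gives the same value as evaluating every selected expert in one place, which is why this path can be exact where cold-masking is not.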
1de45e4 to 32e0675