refactor: extract MoE hybrid mode into common layer for qwen and laguna by howard0su · Pull Request #305 · Luce-Org/lucebox-hub

howard0su · 2026-05-29T06:23:33Z

Move all hybrid infrastructure (placement, routing stats, storage, FFN eval, swap manager) from qwen35moe-specific files into server/src/common/moe_hybrid_* and add laguna hybrid mode with layer-by-layer decode.

Common abstractions:

moe_hybrid_types.h: MoeHybridConfig, MoeLayerDesc (plain data, no vtable)
moe_hybrid_placement: greedy knapsack expert placement
moe_hybrid_routing_stats: runtime frequency tracking
moe_hybrid_storage: hot/cold expert buffer management
moe_hybrid_ffn_eval: GPU-resident, batched, single, prefill evaluation
moe_hybrid_swap_manager: promote/demote at request boundaries
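
As a rough illustration of what "greedy knapsack expert placement" can mean, here is a minimal sketch. It is not the code in moe_hybrid_placement: the `ExpertCandidate` struct and `place_hot_experts` function are hypothetical names, and ranking by hotness per byte is an assumption about the ordering criterion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one routed expert (not the PR's actual type).
struct ExpertCandidate {
    int      layer;
    int      expert_id;
    uint64_t bytes;    // combined size of this expert's gate/up/down weights
    double   hotness;  // routing frequency; higher means selected more often
};

// Greedy knapsack: visit experts in descending hotness-per-byte order and
// mark each one hot if it still fits in the remaining VRAM budget.
// Returns indices into `cands` for the experts chosen to stay on the GPU.
std::vector<std::size_t> place_hot_experts(const std::vector<ExpertCandidate>& cands,
                                           uint64_t budget_bytes) {
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].bytes > 0) order.push_back(i);  // skip entries with no expert tensors
    }
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        // Compare hotness/bytes ratios by cross-multiplying (bytes are > 0 here).
        return cands[a].hotness * static_cast<double>(cands[b].bytes) >
               cands[b].hotness * static_cast<double>(cands[a].bytes);
    });

    std::vector<std::size_t> hot;
    uint64_t used = 0;
    for (std::size_t i : order) {
        if (used + cands[i].bytes <= budget_bytes) {
            used += cands[i].bytes;
            hot.push_back(i);
        }
    }
    assert(used <= budget_bytes);
    return hot;
}
```

Everything not returned would be treated as cold. A real placement pass would likely also enforce a per-layer floor, which this sketch leaves out.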

Laguna hybrid mode:

init_hybrid_mode(): VRAM budget, placement, partial GGUF load
hybrid_forward_one_token(): per-layer attn+router graph + hybrid FFN
generate_hybrid(): prefill via laguna_step, decode via hybrid path
load_target_gguf_laguna_partial(): skips expert tensor GPU upload

cubic-dev-ai

1 issue found across 102 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="harness/clients/run_opencode.sh">

<violation number="1" location="harness/clients/run_opencode.sh:14">
P2: `require_client_binary` uses a filesystem `-x` check only and does not fall back to PATH resolution (`command -v`), so valid PATH-based `OPENCODE_BIN` values that previously worked now fail.</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

cubic-dev-ai · 2026-05-30T13:37:58Z

 CLIENT_OUT="$LOG_DIR/opencode.out"
 EXPORT_OUT="$LOG_DIR/opencode-export.json"
 OPENCODE_BIN="${OPENCODE_BIN:-$CLIENT_WORK_DIR/clients/opencode/npm/bin/opencode}"
+require_client_binary "OpenCode" "$OPENCODE_BIN" "opencode" "OPENCODE_BIN"


P2: `require_client_binary` uses a filesystem `-x` check only and does not fall back to PATH resolution (`command -v`), so valid PATH-based `OPENCODE_BIN` values that previously worked now fail.


Record the 2026-05-30 11:16 EDT unattended integration pass: refreshed PR-head containment, fresh conflicted probe worktrees, and Codex feasibility reports for the Luce-Org#305 and Luce-Org#237 selective-port candidates. No product-code stack changes were made.

Record the 2026-05-30 13:38 EDT unattended reconciliation run, including refreshed PR containment, direct merge probes, and the Codex Luce-Org#305 selective-port recommendation.

Record the 2026-05-30 14:00 unattended run, fresh conflict probes, and the Claude/Codex Luce-Org#305 selective-port attempts. No product-code changes were retained.

Record the latest unattended PR containment check, direct conflict probes, and tmux/Codex feasibility audit for PR Luce-Org#305.

Refresh open PR classification after the 2026-05-30 18:10 EDT unattended run. Record current PR-head containment, repeated conflicted probe results, and the tmux-driven Luce-Org#305 Claude/Codex feasibility audit.

Record the 2026-05-30 19:02 cron run, refreshed direct-merge probes, and the tmux/Codex feasibility audit for PR Luce-Org#305. No product-code changes were added.

Record 2026-05-30 19:40 ET auto-integration refresh. Reclassify PR Luce-Org#305 as non-draft, capture direct merge probe results for still-pending PRs, and summarize tmux-driven Claude/Codex audit outcomes.

Rerun direct merge probes for remaining non-integrated PRs and record the tmux-driven Codex audit for Luce-Org#237/Luce-Org#305. No product-code changes were integrated.

Record the 2026-05-31 unattended reconciliation run, fresh conflict probes for the remaining non-ancestor PRs, and the unusable Luce-Org#305 Claude/Codex delegation attempts.

Record 2026-05-31 04:10 cron reconciliation: no new PR heads, fresh conflict probes for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and failed read-only delegation attempts for Luce-Org#237.

Merge latest origin/main into the integration stack and refresh the manifest with current open PR classification, fresh conflict probes, and Codex feasibility output for PR Luce-Org#305.

Record the 2026-05-31 06:16 cron refresh, fresh conflict probes for the six remaining non-ancestor PRs, and the tmux-driven Codex feasibility review for PR Luce-Org#305.

Port the small PR Luce-Org#305 control-plane salvage slice into the current Qwen35MoE dynamic expert placement path.

Adds DFLASH_EXPERT_BUDGET_PCT alongside the existing MB cap so profiling/testing can bound hot expert residency by a percentage of total expert bytes. Refreshes the auto-integration manifest with current PR classification, probes, delegation, and validation notes.

Port a narrow current-layout slice from PR Luce-Org#305. The batched FFN evaluator already carries the zero-weight dummy-slot balancing needed to avoid the old MMQ stream-k imbalance, so the prefill caller can run full chunks directly while reusing hot/cold gallocr handles across calls.

Also refresh the auto-integration manifest with current PR classification and validation notes.

Port a narrow PR Luce-Org#305 slice by feeding the GPU-resident pipelined decode activation directly into the persistent logits graph with a backend-to-backend copy, avoiding the previous GPU-to-CPU readback before logits. Keep host uploads synchronous after review found async upload ordering could race with existing internal tensor copies. Refresh auto-integration metadata.

Record the 2026-06-01 06:29 UTC-4 unattended refresh, fresh direct-merge probes for remaining selective-port candidates, and the tmux-driven Codex feasibility review for PR Luce-Org#305.

Record the 2026-06-01 09:10 unattended reconciliation pass, including current open PR containment, direct-merge probe counts, and the Luce-Org#305 Codex feasibility result.

Refresh the unattended auto-integration manifest after the 2026-06-01 10:34 run. No contributor PR head advanced; direct probes still leave Luce-Org#305, Luce-Org#237, Luce-Org#221, Luce-Org#154, Luce-Org#153, and Luce-Org#135 as selective-port/runtime-validation candidates.

Record the 2026-06-01 12:08 cron pass, including Luce-Org#329's move back to draft status, fresh containment counts, direct conflict probes for the remaining selective-port candidates, and the tmux-driven Codex Luce-Org#305 no-safe-slice report.

Promote a narrow PR Luce-Org#305 slice into the integration stack: reusable common MoE routing statistics, expert placement, and swap-planning helpers plus a focused unit harness. Broader Laguna/Qwen hybrid runtime commonization remains deferred for current-layout and CUDA validation.

Port the narrow Luce-Org#305 dense/no-expert-layer guard into the current Qwen35MoE placement and hybrid-storage paths. Dense layers now receive no byte-budget hot experts and storage builders leave no-expert layers empty. Refresh the auto-integration manifest with the latest probe/delegation results.

Record the 2026-06-01 20:42 unattended refresh, repeated direct-merge probes for the remaining selective-port candidates, and the Luce-Org#305 Claude/Codex feasibility results.

Port the standalone HumanEval and LLM benchmark harnesses from PR Luce-Org#305 as an isolated salvage slice. Current docs already reference these utilities; the conflicted C++/runtime portions remain excluded. Also keep bench_he.py --help usable without local model artifacts so basic script validation can run in CI-like checkouts.

Record the latest Luce-Org#285 debug-thinking-logits merge, refreshed PR containment, direct conflict probes, and the Codex Luce-Org#305 no-safe-slice review.

Record the 2026-06-02 01:07 unattended run: no new non-draft PR heads advanced, direct-merge probes still conflict for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and a tmux Codex Luce-Org#221 pass found only the already-represented gguf_metadata header slice.

Promote the standalone server/scripts/bench_agent.py utility from PR Luce-Org#305 after fresh worktree conflict probes and tmux-driven feasibility review. Leave the remaining conflicted runtime/docs/CMake hunks for further selective-port work.

Record the 2026-06-02 03:58 unattended refresh, unchanged open PR containment, fresh conflict probes for the six remaining selective-port candidates, and the tmux-driven Codex no-safe-slice verdict for PR Luce-Org#305.

Build per-layer CachedFfnGraph at init with n_expert_used slots for the routed FFN pipeline. The decode loop uses a StreamMoE-inspired async pattern: prefn(async) → sync → CPU remap → upload → rffn(async) → combine(async), eliminating per-layer graph rebuilds.

Cold experts get weight=0, contributing nothing to output while keeping the graph structure fixed. This reuses build_cached_hot_graph from the common module.

Also adds detailed telemetry fields (gpu_idle, tensor_io, cold_cpu, hot_graph_build, etc.) for GPU utilization diagnosis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- compute_logits accepts optional gpu_src for GPU→GPU copy (no host bounce)
- Use tensor_set_async for embedding upload on compute stream
- Remove act_cur D2H readback after pipelined decode (stays on GPU)
- Accumulate new telemetry fields in decode breakdown output
- Pass MoeHybridStorage& to init_pipelined_decode_state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Print VRAM usage summary and expert placement info on startup for both full GPU mode and hybrid mode paths, so users can always see how memory is allocated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Dense layers (layer 0 in Laguna with n_layer_dense_lead=1) have no expert tensors. The placement algorithm was assigning per_layer_floor hot experts to these layers, and the storage builder was adding 256 'cold' entries that can't be allocated.

Fix:
- Placement: only set floor for layers where layer_expert_bytes > 0
- Storage: skip layers with no expert tensors (gate/up/down all null) in both build_moe_hybrid_storage and build_moe_hybrid_storage_from_file
- Add layer index to hot buffer allocation error message for debugging

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implement layer-by-layer hybrid prefill for Laguna MoE, enabling the 60/40 hot/cold expert split during prompt processing:
- Rewrite generate_hybrid() prefill to use per-layer pre-FFN graph + eval_moe_hybrid_ffn_batched() instead of monolithic laguna_step()
- Fix KV cache write: permute K/V before copying to cache (matches the monolithic graph's layout)
- Fix 'all experts fit' path: always set hybrid_mode_=true in partial-load path so hybrid forward is used consistently
- Add DFLASH_EXPERT_BUDGET_PCT env var to trigger hybrid mode with a percentage budget (e.g. 60 = keep 60% of expert bytes on GPU)
- Generate uniform hotness when no routing stats file is provided

Tested with 60% budget on Laguna XS.2 (Q4_K_M):
- 6261 hot experts (10.61 GiB VRAM), 3723 cold experts (7.07 GiB RAM)
- Prefill: 1047ms, Decode: 12.9 tok/s
- Output quality verified (coherent, correct answers)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add warm/safety reserve sizes and placement source to the hybrid placement diagnostic output, matching qwen35moe's format:

  [laguna-hybrid] dynamic placement: gpu_total=22.00 GiB, core=2.28 GiB, kv_cache=1.25 GiB (ctx=8192), warm=200 MB, safety=512 MB, expert_budget=10.61 GiB (of 17.68 GiB total experts)
  [laguna-hybrid] storage ready: ... source=uniform

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Always compute dynamic placement on startup instead of requiring DFLASH_LAGUNA_HOTNESS or DFLASH_EXPERT_BUDGET_PCT env vars. The model now automatically:
1. Partial-loads core tensors (non-expert) to GPU
2. Computes VRAM budget accounting for KV cache, warm, and safety reserves
3. If all experts fit → reloads full model to GPU (non-hybrid path)
4. Otherwise → uses hybrid hot/cold split automatically

This matches qwen35moe behavior: the budget breakdown and placement decision are always printed regardless of whether hybrid mode is needed.

Example output (ctx=12000, not all fit):
  [laguna] dynamic placement: gpu_total=22.00 GiB, core=2.28 GiB, kv_cache=1.83 GiB (ctx=12000), warm=200 MB, safety=512 MB, expert_budget=17.19 GiB (of 17.68 GiB total experts)
  [laguna] dynamic placement result: 9726 hot experts, 258 cold experts

Example output (ctx=8192, all fit):
  [laguna] dynamic placement result: 9984 hot experts, 0 cold experts
  [laguna] all experts fit in VRAM, loading fully to GPU

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
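
The budget arithmetic behind that diagnostic line can be sketched as below. This is an inference from the logged figures, not the PR's code: subtracting core, KV cache, warm, and safety reserves from gpu_total reproduces the ctx=12000 value (22.00 − 2.28 − 1.83 − 0.195 − 0.5 ≈ 17.19 GiB) if "MB" is read as MiB. The `VramBudgetInputs` struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inputs, all in bytes; field names mirror the log output.
struct VramBudgetInputs {
    uint64_t gpu_total;
    uint64_t core;            // non-expert tensors already loaded to the GPU
    uint64_t kv_cache;        // depends on the context length
    uint64_t warm_reserve;
    uint64_t safety_reserve;
};

// Bytes left for hot experts after the fixed reservations (clamped at 0).
uint64_t expert_budget_bytes(const VramBudgetInputs& in) {
    const uint64_t reserved =
        in.core + in.kv_cache + in.warm_reserve + in.safety_reserve;
    return in.gpu_total > reserved ? in.gpu_total - reserved : 0;
}

// True when every expert can stay on the GPU, i.e. the non-hybrid path applies.
bool all_experts_fit(uint64_t budget_bytes, uint64_t total_expert_bytes) {
    assert(total_expert_bytes > 0);
    return budget_bytes >= total_expert_bytes;
}
```

With the logged inputs, the same subtraction at ctx=8192 (kv_cache=1.25 GiB) gives about 17.77 GiB, which exceeds the 17.68 GiB of experts and is consistent with the "all experts fit" message.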

…yers

- Remove kFfnSafeBatch=8 sub-batching that caused ~23× more graph builds/computes per MoE layer (the dummy-slot routing fix already prevents the mul_mat_id OOB issue)
- Keep StepGraph persistent across layers to reuse GPU gallocr buffer
- Add p_hot_alloc/p_cold_alloc optional params to eval_moe_hybrid_ffn_batched for persistent allocator reuse
- Applied to both Laguna and qwen35moe backends

Result on Laguna (95 tokens, 258 cold experts): prefill: 84 tok/s → 266 tok/s (3.2× speedup)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add fine-grained timing to diagnose why AR decode uses ~50% GPU time at 60% hot expert occupancy:
- PipelinedDecodeTelemetry new fields: routed_sync_us, routed_readback_us, routed_cpu_remap_us, routed_ffn_dispatch_us, routed_final_sync_us, routed_cold_expert_hits/total_expert_slots
- Per-layer timing in the routed FFN fast path separates: prefn dispatch vs GPU sync stall vs D2H readback vs CPU remap vs FFN dispatch
- run_pipelined_decode_path now measures full per-token budget: embed / layers / logits / sample with percentage breakdown
- Cold expert hit rate tracking reveals effective coverage gaps

Enable with DFLASH_QWEN35MOE_TELEMETRY=1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The CUDA mul_mat_id kernel triggers GGML_ASSERT (process abort) on sm_75 when cold experts exist in a layer. Since the assert kills the process before eval_moe_hybrid_ffn_batched can return false, the try/catch fallback approach doesn't work.

Fix: check storage.cold_expert_ids.empty() upfront and skip the batched path entirely for layers with cold experts, going straight to per-token eval_moe_hybrid_ffn_single.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Previously, attention layers (full_attention_interval=4) used the expensive split path: full GPU sync → host readback → hot/cold partition → separate GPU + CPU compute → combine. This cost ~1.6ms per mixed layer.

Now all 40 layers use the routed FFN fast path with cold-masking:
- Cold experts get weight=0, local_id=0 (mapped to expert 0, result zeroed)
- All 8 expert slots still computed on GPU (async, no PCIe roundtrip)
- Eliminates CPU cold compute and D2H/H2D transfers entirely

Results at 60% hot experts in VRAM (RTX 2080 Ti):
- Decode: 12.1 → 15.5 tok/s (+28%)
- GPU utilization: 79.8% → 89.8%
- FFN/token: 9.56ms → 2.70ms
- Eliminated: tensor_io (2.6ms), cold_cpu (6.2ms), combine (0.7ms)
- Sync stall/layer: 384µs → 212µs

100% hot baseline unchanged (14.2 → 14.9 tok/s, slight improvement). 60% hot now matches or exceeds old 100% hot performance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
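
The cold-masking remap described above (weight=0, local_id=0 for cold experts) might look roughly like this on the CPU side. It is a sketch under assumptions: `RoutedSlots`, `remap_with_cold_mask`, and the `hot_slot_of` lookup table are hypothetical names, and the surviving weights are left unnormalized because the commit message does not say whether they are rescaled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size routing result fed to the cached FFN graph (hypothetical type).
struct RoutedSlots {
    std::vector<int32_t> local_ids;  // index into the layer's hot expert buffer
    std::vector<float>   weights;    // router weight per slot (0 for cold experts)
    int                  cold_hits = 0;
};

// hot_slot_of[g] is the hot-buffer slot of global expert g, or -1 if g is cold.
RoutedSlots remap_with_cold_mask(const std::vector<int32_t>& selected,
                                 const std::vector<float>& weights,
                                 const std::vector<int32_t>& hot_slot_of) {
    assert(selected.size() == weights.size());
    RoutedSlots out;
    out.local_ids.reserve(selected.size());
    out.weights.reserve(selected.size());
    for (std::size_t i = 0; i < selected.size(); ++i) {
        const int32_t g = selected[i];
        assert(g >= 0 && static_cast<std::size_t>(g) < hot_slot_of.size());
        const int32_t slot = hot_slot_of[g];
        if (slot >= 0) {
            out.local_ids.push_back(slot);
            out.weights.push_back(weights[i]);
        } else {
            // Cold expert: keep the slot so the graph shape stays fixed, but
            // point it at hot slot 0 and zero its contribution.
            out.local_ids.push_back(0);
            out.weights.push_back(0.0f);
            ++out.cold_hits;
        }
    }
    return out;
}
```

Because the slot count never changes, the same cached graph can be reused for every token. The cost is that a cold expert's contribution is dropped rather than computed.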

When DFLASH_COLD_COMPUTE=1 is set, layers with cold experts use the split path (hot GPU + cold CPU/Halo) instead of cold-masking (weight=0). This produces exact results at the cost of speed.

Behavior:
- All-hot layers: still use fast routed FFN path (no quality difference)
- Mixed layers (have cold experts): use split path with actual cold computation on the cold backend (CPU today, AMD Halo in future)

Results at 60% hot (RTX 2080 Ti, cold on CPU):
- cold_compute=off (default): 15.5 tok/s, drops cold experts
- cold_compute=on: 10.5 tok/s, exact output quality

The cold_compute time (15ms/token on CPU) is the target for acceleration with a fast-memory device like AMD Halo.

Prefill already computes cold experts correctly via per-token fallback (eval_moe_hybrid_ffn_single), no change needed there.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace ggml graph dispatch (15 ops + thread pool barriers per layer) with a direct fused kernel calling ggml vec_dot primitives. Uses OpenMP to parallelize row-level matmuls and saturate DDR4 memory bandwidth.

Key design:
- ColdFfnCompute interface (cold_ffn_compute.h): scales to Halo/GPU
- CpuColdFfnCompute implementation using ggml vec_dot type traits
- Per-tensor type support (Q4_K gate, Q5_K down in Q4_K_M quants)
- Configurable threads via DFLASH_COLD_THREADS env (default: 8)

Results at 60% hot (6155 hot, 4085 cold, RTX 2080 Ti):
- Before: 10.4 tok/s decode (38ms cold single-threaded, 15.7ms ggml graph)
- After: 14.7-17 tok/s decode (14.5ms cold with 8 threads)
- GPU utilization: 50% → 64-70%
- Correctness verified (coherent multi-paragraph output)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the cold_compute guard that forced mixed layers onto the slower split path. Both the DeltaNet routed fast path and the split-path routed FFN sub-path now handle cold experts inline:
- Partition routing into hot (GPU rffn) + cold (CPU fused kernel)
- D2H ffn_post only when cold experts are selected
- Cold compute runs on CPU in parallel with GPU rffn dispatch
- Upload cold result to combine graph (or keep zeroed if all-hot)

When all experts are hot at runtime, the cold branch is skipped entirely with zero overhead; the path is identical to the previous routed fast path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
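
The hot/cold partition step listed above could be sketched as follows. This is illustrative only: `RoutingPartition`, `partition_routing`, and the `hot_slot_of` lookup table are hypothetical names rather than identifiers from the PR, and the actual GPU dispatch and CPU fused kernel are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One token's routing, split by where each selected expert lives (hypothetical type).
struct RoutingPartition {
    std::vector<int32_t> hot_local_ids;    // slots in the GPU hot buffer
    std::vector<float>   hot_weights;
    std::vector<int32_t> cold_global_ids;  // experts to run on the cold backend
    std::vector<float>   cold_weights;

    // The D2H copy and CPU kernel are only needed when this returns true.
    bool needs_cold_compute() const { return !cold_global_ids.empty(); }
};

// hot_slot_of[g] is the hot-buffer slot of global expert g, or -1 if g is cold.
RoutingPartition partition_routing(const std::vector<int32_t>& selected,
                                   const std::vector<float>& weights,
                                   const std::vector<int32_t>& hot_slot_of) {
    assert(selected.size() == weights.size());
    RoutingPartition p;
    for (std::size_t i = 0; i < selected.size(); ++i) {
        const int32_t g = selected[i];
        assert(g >= 0 && static_cast<std::size_t>(g) < hot_slot_of.size());
        const int32_t slot = hot_slot_of[g];
        if (slot >= 0) {
            p.hot_local_ids.push_back(slot);
            p.hot_weights.push_back(weights[i]);
        } else {
            p.cold_global_ids.push_back(g);
            p.cold_weights.push_back(weights[i]);
        }
    }
    return p;
}
```

A caller would check `needs_cold_compute()` once per layer and skip the readback and CPU work when it is false, which corresponds to the all-hot case in the commit message.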

howard0su force-pushed the layersplit_refactor branch from 241bf3e to d45b8fd Compare May 29, 2026 13:05

howard0su marked this pull request as ready for review May 30, 2026 13:34

cubic-dev-ai Bot reviewed May 30, 2026

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026

docs: refresh auto-integration manifest

681122e


howard0su and others added 16 commits June 3, 2026 22:06

fix(test): pass MoeHybridStorage to init_pipelined_decode_state

2138e84

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

howard0su force-pushed the layersplit_refactor branch from 1de45e4 to 32e0675 Compare June 3, 2026 14:07

Remove agents.md

4727d20

howard0su wants to merge 17 commits into Luce-Org:main from howard0su:layersplit_refactor
