refactor: extract MoE hybrid mode into common layer for qwen and laguna #305

Open
howard0su wants to merge 17 commits into Luce-Org:main from howard0su:layersplit_refactor

@howard0su
Contributor

Move all hybrid infrastructure (placement, routing stats, storage, FFN eval, swap manager) from qwen35moe-specific files into server/src/common/moe_hybrid_* and add laguna hybrid mode with layer-by-layer decode.

Common abstractions:

  • moe_hybrid_types.h: MoeHybridConfig, MoeLayerDesc (plain data, no vtable)
  • moe_hybrid_placement: greedy knapsack expert placement
  • moe_hybrid_routing_stats: runtime frequency tracking
  • moe_hybrid_storage: hot/cold expert buffer management
  • moe_hybrid_ffn_eval: GPU-resident, batched, single, prefill evaluation
  • moe_hybrid_swap_manager: promote/demote at request boundaries

Laguna hybrid mode:

  • init_hybrid_mode(): VRAM budget, placement, partial GGUF load
  • hybrid_forward_one_token(): per-layer attn+router graph + hybrid FFN
  • generate_hybrid(): prefill via laguna_step, decode via hybrid path
  • load_target_gguf_laguna_partial(): skips expert tensor GPU upload
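
For illustration, the "greedy knapsack expert placement" bullet above could look roughly like the sketch below. It is not the PR's code: `ExpertCandidate` and `place_hot_experts` are hypothetical names, and ordering by hotness per byte is an assumption (with equally sized experts it reduces to ordering by hotness).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record for one expert; not the PR's MoeLayerDesc.
struct ExpertCandidate {
    int layer;
    int expert;
    std::uint64_t bytes;  // size of this expert's tensors
    double hotness;       // observed routing frequency
};

// Greedy knapsack: visit experts in descending hotness-per-byte order and
// keep each one that still fits in the remaining byte budget.
// Returns the hot set; every expert not returned is treated as cold.
inline std::vector<ExpertCandidate> place_hot_experts(
        std::vector<ExpertCandidate> experts, std::uint64_t budget_bytes) {
    std::stable_sort(experts.begin(), experts.end(),
        [](const ExpertCandidate& a, const ExpertCandidate& b) {
            // a.hotness / a.bytes > b.hotness / b.bytes, written without division
            return a.hotness * static_cast<double>(b.bytes) >
                   b.hotness * static_cast<double>(a.bytes);
        });
    std::vector<ExpertCandidate> hot;
    std::uint64_t used = 0;
    for (const ExpertCandidate& e : experts) {
        assert(used <= budget_bytes);
        if (e.bytes <= budget_bytes - used) {
            hot.push_back(e);
            used += e.bytes;
        }
    }
    return hot;
}
```

With three 100-byte experts and a 200-byte budget, the two most frequently routed experts end up hot and the third stays cold.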

@howard0su howard0su force-pushed the layersplit_refactor branch from 241bf3e to d45b8fd Compare May 29, 2026 13:05
@howard0su howard0su marked this pull request as ready for review May 30, 2026 13:34
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 102 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="harness/clients/run_opencode.sh">

<violation number="1" location="harness/clients/run_opencode.sh:14">
P2: `require_client_binary` uses a filesystem `-x` check only and does not fall back to PATH resolution (`command -v`), so valid PATH-based `OPENCODE_BIN` values that previously worked now fail.</violation>
</file>

Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

CLIENT_OUT="$LOG_DIR/opencode.out"
EXPORT_OUT="$LOG_DIR/opencode-export.json"
OPENCODE_BIN="${OPENCODE_BIN:-$CLIENT_WORK_DIR/clients/opencode/npm/bin/opencode}"
require_client_binary "OpenCode" "$OPENCODE_BIN" "opencode" "OPENCODE_BIN"
Contributor

P2: require_client_binary uses a filesystem -x check only and does not fall back to PATH resolution (command -v), so valid PATH-based OPENCODE_BIN values that previously worked now fail.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/run_opencode.sh, line 14:

<comment>`require_client_binary` uses a filesystem `-x` check only and does not fall back to PATH resolution (`command -v`), so valid PATH-based `OPENCODE_BIN` values that previously worked now fail.</comment>

<file context>
@@ -11,6 +11,7 @@ source "$SCRIPT_DIR/common.sh"
 CLIENT_OUT="$LOG_DIR/opencode.out"
 EXPORT_OUT="$LOG_DIR/opencode-export.json"
 OPENCODE_BIN="${OPENCODE_BIN:-$CLIENT_WORK_DIR/clients/opencode/npm/bin/opencode}"
+require_client_binary "OpenCode" "$OPENCODE_BIN" "opencode" "OPENCODE_BIN"
 HOME_DIR="$LOG_DIR/opencode-home"
 PROJECT_DIR="$LOG_DIR/opencode-project"
</file context>

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 11:16 EDT unattended integration pass: refreshed PR-head containment, fresh conflicted probe worktrees, and Codex feasibility reports for the Luce-Org#305 and Luce-Org#237 selective-port candidates. No product-code stack changes were made.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 13:38 EDT unattended reconciliation run, including refreshed PR containment, direct merge probes, and the Codex Luce-Org#305 selective-port recommendation.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 14:00 unattended run, fresh conflict probes, and the Claude/Codex Luce-Org#305 selective-port attempts. No product-code changes were retained.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the latest unattended PR containment check, direct conflict probes, and tmux/Codex feasibility audit for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Refresh open PR classification after the 2026-05-30 18:10 EDT unattended run. Record current PR-head containment, repeated conflicted probe results, and the tmux-driven Luce-Org#305 Claude/Codex feasibility audit.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record the 2026-05-30 19:02 cron run, refreshed direct-merge probes, and the tmux/Codex feasibility audit for PR Luce-Org#305. No product-code changes were added.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026
Record 2026-05-30 19:40 ET auto-integration refresh. Reclassify PR Luce-Org#305 as non-draft, capture direct merge probe results for still-pending PRs, and summarize tmux-driven Claude/Codex audit outcomes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Rerun direct merge probes for remaining non-integrated PRs and record the tmux-driven Codex audit for Luce-Org#237/Luce-Org#305. No product-code changes were integrated.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 unattended reconciliation run, fresh conflict probes for the remaining non-ancestor PRs, and the unusable Luce-Org#305 Claude/Codex delegation attempts.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record 2026-05-31 04:10 cron reconciliation: no new PR heads, fresh conflict probes for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and failed read-only delegation attempts for Luce-Org#237.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Merge latest origin/main into the integration stack and refresh the manifest with current open PR classification, fresh conflict probes, and Codex feasibility output for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 06:16 cron refresh, fresh conflict probes for the six remaining non-ancestor PRs, and the tmux-driven Codex feasibility review for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port the small PR Luce-Org#305 control-plane salvage slice into the current Qwen35MoE dynamic expert placement path.

Adds DFLASH_EXPERT_BUDGET_PCT alongside the existing MB cap so profiling/testing can bound hot expert residency by a percentage of total expert bytes. Refreshes the auto-integration manifest with current PR classification, probes, delegation, and validation notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port a narrow current-layout slice from PR Luce-Org#305. The batched FFN evaluator already carries the zero-weight dummy-slot balancing needed to avoid the old MMQ stream-k imbalance, so the prefill caller can run full chunks directly while reusing hot/cold gallocr handles across calls.

Also refresh the auto-integration manifest with current PR classification and validation notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Port a narrow PR Luce-Org#305 slice by feeding the GPU-resident pipelined decode activation directly into the persistent logits graph with a backend-to-backend copy, avoiding the previous GPU-to-CPU readback before logits. Keep host uploads synchronous after review found async upload ordering could race with existing internal tensor copies. Refresh auto-integration metadata.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 06:29 UTC-4 unattended refresh, fresh direct-merge probes for remaining selective-port candidates, and the tmux-driven Codex feasibility review for PR Luce-Org#305.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 09:10 unattended reconciliation pass, including current open PR containment, direct-merge probe counts, and the Luce-Org#305 Codex feasibility result.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Refresh the unattended auto-integration manifest after the 2026-06-01 10:34 run. No contributor PR head advanced; direct probes still leave Luce-Org#305, Luce-Org#237, Luce-Org#221, Luce-Org#154, Luce-Org#153, and Luce-Org#135 as selective-port/runtime-validation candidates.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 12:08 cron pass, including Luce-Org#329's move back to draft status, fresh containment counts, direct conflict probes for the remaining selective-port candidates, and the tmux-driven Codex Luce-Org#305 no-safe-slice report.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Promote a narrow PR Luce-Org#305 slice into the integration stack: reusable common MoE routing statistics, expert placement, and swap-planning helpers plus a focused unit harness. Broader Laguna/Qwen hybrid runtime commonization remains deferred for current-layout and CUDA validation.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Port the narrow Luce-Org#305 dense/no-expert-layer guard into the current Qwen35MoE placement and hybrid-storage paths. Dense layers now receive no byte-budget hot experts and storage builders leave no-expert layers empty. Refresh the auto-integration manifest with the latest probe/delegation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Record the 2026-06-01 20:42 unattended refresh, repeated direct-merge probes for the remaining selective-port candidates, and the Luce-Org#305 Claude/Codex feasibility results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Port the standalone HumanEval and LLM benchmark harnesses from PR Luce-Org#305 as an isolated salvage slice. Current docs already reference these utilities; the conflicted C++/runtime portions remain excluded.

Also keep bench_he.py --help usable without local model artifacts so basic script validation can run in CI-like checkouts.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Record the latest Luce-Org#285 debug-thinking-logits merge, refreshed PR containment, direct conflict probes, and the Codex Luce-Org#305 no-safe-slice review.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Record the 2026-06-02 01:07 unattended run: no new non-draft PR heads advanced, direct-merge probes still conflict for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and a tmux Codex Luce-Org#221 pass found only the already-represented gguf_metadata header slice.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Promote the standalone server/scripts/bench_agent.py utility from PR Luce-Org#305 after fresh worktree conflict probes and tmux-driven feasibility review. Leave the remaining conflicted runtime/docs/CMake hunks for further selective-port work.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 2, 2026
Record the 2026-06-02 03:58 unattended refresh, unchanged open PR containment, fresh conflict probes for the six remaining selective-port candidates, and the tmux-driven Codex no-safe-slice verdict for PR Luce-Org#305.
howard0su and others added 16 commits June 3, 2026 22:06
…aguna

Move all hybrid infrastructure (placement, routing stats, storage, FFN eval,
swap manager) from qwen35moe-specific files into server/src/common/moe_hybrid_*
and add laguna hybrid mode with layer-by-layer decode.

Common abstractions:
- moe_hybrid_types.h: MoeHybridConfig, MoeLayerDesc (plain data, no vtable)
- moe_hybrid_placement: greedy knapsack expert placement
- moe_hybrid_routing_stats: runtime frequency tracking
- moe_hybrid_storage: hot/cold expert buffer management
- moe_hybrid_ffn_eval: GPU-resident, batched, single, prefill evaluation
- moe_hybrid_swap_manager: promote/demote at request boundaries

Laguna hybrid mode:
- init_hybrid_mode(): VRAM budget, placement, partial GGUF load
- hybrid_forward_one_token(): per-layer attn+router graph + hybrid FFN
- generate_hybrid(): prefill via laguna_step, decode via hybrid path
- load_target_gguf_laguna_partial(): skips expert tensor GPU upload

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build per-layer CachedFfnGraph at init with n_expert_used slots for
the routed FFN pipeline. The decode loop uses a StreamMoE-inspired
async pattern: prefn(async) → sync → CPU remap → upload → rffn(async)
→ combine(async), eliminating per-layer graph rebuilds.

Cold experts get weight=0, contributing nothing to output while keeping
the graph structure fixed. This reuses build_cached_hot_graph from
the common module.

Also adds detailed telemetry fields (gpu_idle, tensor_io, cold_cpu,
hot_graph_build, etc.) for GPU utilization diagnosis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- compute_logits accepts optional gpu_src for GPU→GPU copy (no host bounce)
- Use tensor_set_async for embedding upload on compute stream
- Remove act_cur D2H readback after pipelined decode (stays on GPU)
- Accumulate new telemetry fields in decode breakdown output
- Pass MoeHybridStorage& to init_pipelined_decode_state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Print VRAM usage summary and expert placement info on startup for both
full GPU mode and hybrid mode paths, so users can always see how memory
is allocated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dense layers (layer 0 in Laguna with n_layer_dense_lead=1) have no expert
tensors. The placement algorithm was assigning per_layer_floor hot experts
to these layers, and the storage builder was adding 256 'cold' entries that
can't be allocated.

Fix:
- Placement: only set floor for layers where layer_expert_bytes > 0
- Storage: skip layers with no expert tensors (gate/up/down all null)
  in both build_moe_hybrid_storage and build_moe_hybrid_storage_from_file
- Add layer index to hot buffer allocation error message for debugging

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
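
A minimal sketch of the placement half of the dense-layer fix above, assuming the floor is a per-layer count of hot experts. `per_layer_hot_floor` is a hypothetical helper, not a function from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the minimum number of hot experts to reserve for each layer.
// Layers with no expert bytes (dense layers) get 0, so they never claim
// any of the expert budget.
inline std::vector<int> per_layer_hot_floor(
        const std::vector<std::uint64_t>& layer_expert_bytes, int floor) {
    assert(floor >= 0);
    std::vector<int> out(layer_expert_bytes.size(), 0);
    for (std::size_t i = 0; i < layer_expert_bytes.size(); ++i) {
        if (layer_expert_bytes[i] > 0) out[i] = floor;
    }
    return out;
}
```

For a model whose layer 0 is dense, the first entry of the result is 0 and every MoE layer receives the requested floor.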
Implement layer-by-layer hybrid prefill for Laguna MoE, enabling the
60/40 hot/cold expert split during prompt processing:

- Rewrite generate_hybrid() prefill to use per-layer pre-FFN graph +
  eval_moe_hybrid_ffn_batched() instead of monolithic laguna_step()
- Fix KV cache write: permute K/V before copying to cache (matches
  the monolithic graph's layout)
- Fix 'all experts fit' path: always set hybrid_mode_=true in partial-
  load path so hybrid forward is used consistently
- Add DFLASH_EXPERT_BUDGET_PCT env var to trigger hybrid mode with a
  percentage budget (e.g. 60 = keep 60% of expert bytes on GPU)
- Generate uniform hotness when no routing stats file is provided

Tested with 60% budget on Laguna XS.2 (Q4_K_M):
  - 6261 hot experts (10.61 GiB VRAM), 3723 cold experts (7.07 GiB RAM)
  - Prefill: 1047ms, Decode: 12.9 tok/s
  - Output quality verified (coherent, correct answers)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
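
As a rough illustration of the percentage budget described above, the sketch below parses a value such as `60` and turns it into a byte cap. Only the variable name DFLASH_EXPERT_BUDGET_PCT comes from the commit message; the helper names and the accepted range (0, 100] are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Parses a percentage string such as "60" into *out.
// Returns false for a missing, non-numeric, or out-of-range value.
// The accepted range (0, 100] is an assumption of this sketch.
inline bool parse_budget_pct(const char* value, double* out) {
    if (value == nullptr || *value == '\0') return false;
    char* end = nullptr;
    const double pct = std::strtod(value, &end);
    if (end == value || *end != '\0') return false;
    if (!(pct > 0.0 && pct <= 100.0)) return false;  // also rejects NaN
    *out = pct;
    return true;
}

// Byte cap for hot experts: pct percent of all expert bytes.
inline std::uint64_t expert_budget_bytes(std::uint64_t total_expert_bytes, double pct) {
    assert(pct > 0.0 && pct <= 100.0);
    return static_cast<std::uint64_t>(
        static_cast<double>(total_expert_bytes) * pct / 100.0);
}

// Reads the environment variable named in the commit message.
inline bool budget_pct_from_env(double* out) {
    return parse_budget_pct(std::getenv("DFLASH_EXPERT_BUDGET_PCT"), out);
}
```

This matches the reported numbers: 60% of 17.68 GiB of expert tensors is about 10.61 GiB.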
Add warm/safety reserve sizes and placement source to the hybrid
placement diagnostic output, matching qwen35moe's format:

  [laguna-hybrid] dynamic placement: gpu_total=22.00 GiB, core=2.28 GiB,
    kv_cache=1.25 GiB (ctx=8192), warm=200 MB, safety=512 MB,
    expert_budget=10.61 GiB (of 17.68 GiB total experts)
  [laguna-hybrid] storage ready: ... source=uniform

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Always compute dynamic placement on startup instead of requiring
DFLASH_LAGUNA_HOTNESS or DFLASH_EXPERT_BUDGET_PCT env vars. The model
now automatically:

1. Partial-loads core tensors (non-expert) to GPU
2. Computes VRAM budget accounting for KV cache, warm, and safety reserves
3. If all experts fit → reloads full model to GPU (non-hybrid path)
4. Otherwise → uses hybrid hot/cold split automatically

This matches qwen35moe behavior: the budget breakdown and placement
decision are always printed regardless of whether hybrid mode is needed.

Example output (ctx=12000, not all fit):
  [laguna] dynamic placement: gpu_total=22.00 GiB, core=2.28 GiB,
    kv_cache=1.83 GiB (ctx=12000), warm=200 MB, safety=512 MB,
    expert_budget=17.19 GiB (of 17.68 GiB total experts)
  [laguna] dynamic placement result: 9726 hot experts, 258 cold experts

Example output (ctx=8192, all fit):
  [laguna] dynamic placement result: 9984 hot experts, 0 cold experts
  [laguna] all experts fit in VRAM, loading fully to GPU

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
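
The budget lines in the example output above are consistent with a plain subtraction, sketched below. The formula is inferred from the logged numbers rather than read from the PR's code, the struct and function names are hypothetical, and "MB" is assumed to mean MiB.

```cpp
#include <cassert>

// All quantities in GiB. Field names are illustrative only.
struct VramBudgetGiB {
    double gpu_total;
    double core;      // non-expert tensors already on the GPU
    double kv_cache;
    double warm;      // warm reserve
    double safety;    // safety reserve
};

// Whatever is left after the fixed costs is available for hot experts.
inline double expert_budget_gib(const VramBudgetGiB& b) {
    const double left = b.gpu_total - b.core - b.kv_cache - b.warm - b.safety;
    return left > 0.0 ? left : 0.0;
}

// True when every expert can stay on the GPU (non-hybrid path).
inline bool all_experts_fit(const VramBudgetGiB& b, double total_expert_gib) {
    assert(total_expert_gib >= 0.0);
    return expert_budget_gib(b) >= total_expert_gib;
}
```

With the ctx=12000 values (22.00, 2.28, 1.83, 200 MB, 512 MB) this gives roughly 17.19 GiB, below the 17.68 GiB of experts, so the hybrid split is used. With the ctx=8192 KV cache of 1.25 GiB it gives roughly 17.77 GiB, so all experts fit.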
…yers

- Remove kFfnSafeBatch=8 sub-batching that caused ~23× more graph
  builds/computes per MoE layer (the dummy-slot routing fix already
  prevents the mul_mat_id OOB issue)
- Keep StepGraph persistent across layers to reuse GPU gallocr buffer
- Add p_hot_alloc/p_cold_alloc optional params to
  eval_moe_hybrid_ffn_batched for persistent allocator reuse
- Applied to both Laguna and qwen35moe backends

Result on Laguna (95 tokens, 258 cold experts):
  prefill: 84 tok/s → 266 tok/s (3.2× speedup)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add fine-grained timing to diagnose why AR decode uses ~50% GPU time
at 60% hot expert occupancy:

- PipelinedDecodeTelemetry new fields: routed_sync_us, routed_readback_us,
  routed_cpu_remap_us, routed_ffn_dispatch_us, routed_final_sync_us,
  routed_cold_expert_hits/total_expert_slots
- Per-layer timing in the routed FFN fast path separates:
  prefn dispatch vs GPU sync stall vs D2H readback vs CPU remap vs FFN dispatch
- run_pipelined_decode_path now measures full per-token budget:
  embed / layers / logits / sample with percentage breakdown
- Cold expert hit rate tracking reveals effective coverage gaps

Enable with DFLASH_QWEN35MOE_TELEMETRY=1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CUDA mul_mat_id kernel triggers GGML_ASSERT (process abort) on sm_75
when cold experts exist in a layer. Since the assert kills the process
before eval_moe_hybrid_ffn_batched can return false, the try/catch
fallback approach doesn't work.

Fix: check storage.cold_expert_ids.empty() upfront and skip the batched
path entirely for layers with cold experts, going straight to per-token
eval_moe_hybrid_ffn_single.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, attention layers (full_attention_interval=4) used the
expensive split path: full GPU sync → host readback → hot/cold partition
→ separate GPU + CPU compute → combine. This cost ~1.6ms per mixed layer.

Now all 40 layers use the routed FFN fast path with cold-masking:
- Cold experts get weight=0, local_id=0 (mapped to expert 0, result zeroed)
- All 8 expert slots still computed on GPU (async, no PCIe roundtrip)
- Eliminates CPU cold compute and D2H/H2D transfers entirely

Results at 60% hot experts in VRAM (RTX 2080 Ti):
- Decode: 12.1 → 15.5 tok/s (+28%)
- GPU utilization: 79.8% → 89.8%
- FFN/token: 9.56ms → 2.70ms
- Eliminated: tensor_io (2.6ms), cold_cpu (6.2ms), combine (0.7ms)
- Sync stall/layer: 384µs → 212µs

The 100% hot baseline is essentially unchanged (14.2 → 14.9 tok/s, a slight improvement).
60% hot now matches or exceeds old 100% hot performance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
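
The cold-masking remap described above can be sketched on the CPU side as follows. This is an illustration under assumptions, not the PR's implementation: `RoutedSlots`, `mask_cold_experts`, and the `hot_slot_of` lookup table are hypothetical, and the remaining weights are not renormalized here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RoutedSlots {
    std::vector<int>   local_ids;  // index into the layer's hot expert buffer
    std::vector<float> weights;    // router weight, zeroed for cold experts
};

// hot_slot_of[e] is the hot-buffer slot of global expert e, or -1 if cold.
// The slot count stays fixed, so the cached graph shape never changes.
inline RoutedSlots mask_cold_experts(const std::vector<int>& selected,
                                     const std::vector<float>& weights,
                                     const std::vector<int>& hot_slot_of) {
    assert(selected.size() == weights.size());
    RoutedSlots out;
    out.local_ids.reserve(selected.size());
    out.weights.reserve(selected.size());
    for (std::size_t i = 0; i < selected.size(); ++i) {
        const int slot = hot_slot_of[static_cast<std::size_t>(selected[i])];
        if (slot >= 0) {
            out.local_ids.push_back(slot);
            out.weights.push_back(weights[i]);
        } else {
            out.local_ids.push_back(0);   // mapped to slot 0; result is multiplied by 0
            out.weights.push_back(0.0f);
        }
    }
    return out;
}
```

Every selected expert still occupies a slot, which is what lets the GPU graph run without a host roundtrip; the cost is that the cold expert's contribution is dropped.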
When DFLASH_COLD_COMPUTE=1 is set, layers with cold experts use the
split path (hot GPU + cold CPU/Halo) instead of cold-masking (weight=0).
This produces exact results at the cost of speed.

Behavior:
- All-hot layers: still use fast routed FFN path (no quality difference)
- Mixed layers (have cold experts): use split path with actual cold
  computation on the cold backend (CPU today, AMD Halo in future)

Results at 60% hot (RTX 2080 Ti, cold on CPU):
- cold_compute=off (default): 15.5 tok/s, drops cold experts
- cold_compute=on:            10.5 tok/s, exact output quality

The cold_compute time (15ms/token on CPU) is the target for
acceleration with a fast-memory device like AMD Halo.

Prefill already computes cold experts correctly via per-token
fallback (eval_moe_hybrid_ffn_single), no change needed there.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace ggml graph dispatch (15 ops + thread pool barriers per layer)
with a direct fused kernel calling ggml vec_dot primitives. Uses OpenMP
to parallelize row-level matmuls and saturate DDR4 memory bandwidth.

Key design:
- ColdFfnCompute interface (cold_ffn_compute.h) — scales to Halo/GPU
- CpuColdFfnCompute implementation using ggml vec_dot type traits
- Per-tensor type support (Q4_K gate, Q5_K down in Q4_K_M quants)
- Configurable threads via DFLASH_COLD_THREADS env (default: 8)

Results at 60% hot (6155 hot, 4085 cold, RTX 2080 Ti):
- Before: 10.4 tok/s decode (38ms cold single-threaded, 15.7ms ggml graph)
- After:  14.7-17 tok/s decode (14.5ms cold with 8 threads)
- GPU utilization: 50% → 64-70%
- Correctness verified (coherent multi-paragraph output)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
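
To make the fused cold kernel above concrete, here is a float-only sketch of one expert FFN with row-level OpenMP parallelism. It is a simplification under stated assumptions: the real kernel works on quantized rows through ggml vec_dot, the SwiGLU form (SiLU on the gate branch) is assumed rather than confirmed by the commit, and `Matrix`, `dot`, and `cold_expert_ffn` are hypothetical names that do not reflect the ColdFfnCompute interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense float matrix, used only to keep the sketch standalone.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    const float* row(std::size_t r) const { return data.data() + r * cols; }
};

inline float dot(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// out += weight * down( silu(gate * x) .* (up * x) ), assuming SwiGLU.
inline void cold_expert_ffn(const Matrix& gate, const Matrix& up, const Matrix& down,
                            const std::vector<float>& x, float weight,
                            std::vector<float>& out) {
    assert(gate.cols == x.size() && up.cols == x.size());
    assert(gate.rows == up.rows && down.cols == gate.rows);
    assert(down.rows == out.size());
    std::vector<float> hidden(gate.rows);
    const long n_hidden = static_cast<long>(gate.rows);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long r = 0; r < n_hidden; ++r) {
        const std::size_t i = static_cast<std::size_t>(r);
        const float g = dot(gate.row(i), x.data(), x.size());
        const float u = dot(up.row(i), x.data(), x.size());
        hidden[i] = (g / (1.0f + std::exp(-g))) * u;  // silu(g) * u
    }
    const long n_out = static_cast<long>(down.rows);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long r = 0; r < n_out; ++r) {
        const std::size_t i = static_cast<std::size_t>(r);
        out[i] += weight * dot(down.row(i), hidden.data(), hidden.size());
    }
}
```

Each output row is independent, so the loops parallelize without locks. Without OpenMP enabled at compile time the pragmas are skipped and the code runs serially.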
Remove the cold_compute guard that forced mixed layers onto the slower
split path. Both the DeltaNet routed fast path and the split-path routed
FFN sub-path now handle cold experts inline:

- Partition routing into hot (GPU rffn) + cold (CPU fused kernel)
- D2H ffn_post only when cold experts are selected
- Cold compute runs on CPU in parallel with GPU rffn dispatch
- Upload cold result to combine graph (or keep zeroed if all-hot)

When all experts are hot at runtime, the cold branch is skipped entirely
with zero overhead — path is identical to the previous routed fast path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
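
The hot/cold partition step listed above might be sketched like this. `RoutingSplit`, `split_routing`, and `hot_slot_of` are hypothetical names, and how the PR actually represents the split is not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RoutingSplit {
    std::vector<int>   hot_slots;     // slots in the GPU hot buffer
    std::vector<float> hot_weights;
    std::vector<int>   cold_experts;  // global expert ids to run on the CPU
    std::vector<float> cold_weights;
    // When false, the cold branch (readback, CPU kernel, upload) can be skipped.
    bool has_cold() const { return !cold_experts.empty(); }
};

// hot_slot_of[e] is the hot-buffer slot of global expert e, or -1 if cold.
inline RoutingSplit split_routing(const std::vector<int>& selected,
                                  const std::vector<float>& weights,
                                  const std::vector<int>& hot_slot_of) {
    assert(selected.size() == weights.size());
    RoutingSplit out;
    for (std::size_t i = 0; i < selected.size(); ++i) {
        const int slot = hot_slot_of[static_cast<std::size_t>(selected[i])];
        if (slot >= 0) {
            out.hot_slots.push_back(slot);
            out.hot_weights.push_back(weights[i]);
        } else {
            out.cold_experts.push_back(selected[i]);
            out.cold_weights.push_back(weights[i]);
        }
    }
    return out;
}
```

A caller would check `has_cold()` once per layer and only pay for the device-to-host copy and CPU kernel when it returns true.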
@howard0su howard0su force-pushed the layersplit_refactor branch from 1de45e4 to 32e0675 Compare June 3, 2026 14:07